happily stolen from:
https://wiki.rocksclusters.org/wiki/index.php/Sun_GridEngine
cause I keep forgetting it...
Add Frontend as an SGE Execution Host in Rocks
To set up the frontend node as an SGE execution host that can run queued jobs (like the compute nodes), do the following:
Quick Setup
# cd /opt/gridengine
# ./install_execd (accept all of the default answers)
# qconf -mq all.q (if needed, adjust the number of slots for [frontend.local=4] and other parameters)
# /etc/init.d/sgemaster.frontend stop
# /etc/init.d/sgemaster.frontend start
# /etc/init.d/sgeexecd.frontend stop
# /etc/init.d/sgeexecd.frontend start
Detailed Setup
1. As root, make sure $SGE_ROOT, etc. are set up correctly on the frontend:
# env | grep SGE
It should return something like:
SGE_CELL=default
SGE_ARCH=lx26-amd64
SGE_EXECD_PORT=537
SGE_QMASTER_PORT=536
SGE_ROOT=/opt/gridengine
If not, source the file /etc/profile.d/sge-binaries.[c]sh or check if the SGE Roll is properly installed and enabled:
# rocks list roll
NAME VERSION ARCH   ENABLED
sge: 5.2     x86_64 yes
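The check in step 1 can also be scripted. The following is a minimal sketch, not part of the original instructions: the function name missing_sge_vars is made up for illustration, and the list of variables is simply taken from the sample output above.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the expected SGE variables
# that are unset or empty in the current shell (nothing if all are set).
missing_sge_vars() {
    missing=""
    for name in SGE_ROOT SGE_CELL SGE_ARCH SGE_QMASTER_PORT SGE_EXECD_PORT; do
        # Indirect lookup of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    # Unquoted on purpose: collapses to a single-space-separated list
    echo $missing
}

# Usage: source the profile script only when something is missing.
#   if [ -n "$(missing_sge_vars)" ]; then . /etc/profile.d/sge-binaries.sh; fi
```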
2. Run the install_execd script to set up the frontend as an SGE execution host:
# cd $SGE_ROOT
# ./install_execd
Accept all of the default answers as suggested by the script.
- NOTE: In the examples below, the text <frontend> should be replaced with the actual "short hostname" of your frontend (as reported by the command hostname -s).
For example, if running the command hostname on your frontend returns the "FQDN long hostname" of:
# hostname
mycluster.mydomain.org
then hostname -s should return just:
# hostname -s
mycluster
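If you only have the long name at hand, the short name can be derived with plain shell parameter expansion. This is a small sketch; short_hostname is a made-up helper name, and it assumes the short name is everything before the first dot.

```shell
#!/bin/sh
# Hypothetical helper: strip everything from the first dot onward,
# which mirrors what "hostname -s" reports for an FQDN.
short_hostname() {
    printf '%s\n' "${1%%.*}"
}

# Usage:
#   short_hostname "$(hostname)"
```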
3. Verify that the number of job slots for the frontend is equal to the number of physical processors/cores on your frontend that you wish to make available for queued jobs by checking the value of the slots parameter of the queue configuration for all.q:
# qconf -sq all.q | grep slots
slots 1,[compute-0-0.local=4],[<frontend>.local=4]
The [<frontend>.local=4] means that SGE can run up to 4 jobs on the frontend. Since the frontend is normally used for other tasks besides running compute jobs, it is recommended that you do not make all of its physical processors/cores available to SGE, to avoid overloading the frontend.
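To read the slot count for one host out of that slots line without eyeballing it, something like the following could be used. It is only a sketch: slots_for_host is a made-up name, and it assumes the whole slots value sits on a single line in the form shown above (long values that qconf wraps onto continuation lines are not handled).

```shell
#!/bin/sh
# Hypothetical helper: print the slot count for host $2 found in the
# slots line $1. If the host has no [host=N] entry, print the
# queue-wide default (the first, unbracketed value).
slots_for_host() {
    line=$1
    host=$2
    # Drop the "slots" keyword and all blanks, then put one entry per line
    printf '%s\n' "${line#slots}" | tr -d ' \t' | tr ',' '\n' | awk -v host="$host" '
        /^\[/ {
            gsub(/[][]/, "")          # remove the surrounding brackets
            split($0, kv, "=")        # kv[1] = hostname, kv[2] = slots
            if (kv[1] == host) { found = kv[2] }
            next
        }
        { dflt = $0 }                 # unbracketed entry is the default
        END { print (found != "" ? found : dflt) }
    '
}

# Usage:
#   slots_for_host "$(qconf -sq all.q | grep slots)" "$(hostname -s).local"
```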
For example, on a 4-core frontend, to configure SGE to use at most 3 of the 4 cores, you can change the slots for <frontend>.local from 4 to 3 by typing:
# qconf -mattr queue slots '[<frontend>.local=3]' all.q
If there are additional queues besides the default all.q one, repeat the above for each queue.
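Repeating this for several queues can be sketched as below. The helper names frontend_slots and slot_commands are made up, and keeping one core in reserve is just an example policy. The sketch prints the qconf commands instead of running them, so you can review them first.

```shell
#!/bin/sh
# Hypothetical helper: slots to offer = total cores minus a reserve,
# but never fewer than 1.
frontend_slots() {
    total=$1
    reserved=$2
    slots=$((total - reserved))
    if [ "$slots" -lt 1 ]; then
        slots=1
    fi
    echo "$slots"
}

# Hypothetical helper: read queue names from stdin (one per line) and
# print, without executing, the qconf command for each queue.
slot_commands() {
    host=$1
    slots=$2
    while IFS= read -r queue; do
        [ -n "$queue" ] || continue
        printf "qconf -mattr queue slots '[%s=%s]' %s\n" "$host" "$slots" "$queue"
    done
}

# Usage (qconf -sql lists the queue names; the grep counts logical
# processors, which may include hyperthreads):
#   cores=$(grep -c '^processor' /proc/cpuinfo)
#   qconf -sql | slot_commands "$(hostname -s).local" "$(frontend_slots "$cores" 1)"
```

Pipe the printed lines to sh once they look right.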
Read "man queue_conf" for a list of resource limit parameters such as s_cpu, h_cpu, s_vmem, and h_vmem that can be adjusted to prevent jobs from overloading the frontend.
- NOTE: For Rocks 5.2 or older, the frontend may have been configured by default during installation with only 1 job slot ([<frontend>.local=1]) in the default all.q queue, which allows only 1 queued job to run on the frontend at a time. To check the value of the slots parameter of the queue configuration for all.q, type:
# qconf -sq all.q | grep slots
slots 1,[compute-0-0.local=4],[<frontend>.local=1]
If needed, change the slots for <frontend>.local from 1 to 4 (or up to the maximum number of physical processors/cores on your frontend that you wish to use) by typing:
# qconf -mattr queue slots '[<frontend>.local=4]' all.q
- NOTE: For Rocks 5.3 or older, create the file /opt/gridengine/default/common/host_aliases to contain both the .local hostname and the FQDN long hostname of your frontend:
# vi $SGE_ROOT/default/common/host_aliases
<frontend>.local <frontend>.mydomain.org
- NOTE: For Rocks 5.3 or older, edit the file /opt/gridengine/default/common/act_qmaster to contain the .local hostname of your frontend:
# vi $SGE_ROOT/default/common/act_qmaster
<frontend>.local
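The contents of the two files from the notes above (host_aliases and act_qmaster) can be generated from the FQDN instead of typed by hand. This is a rough sketch only: write_sge_host_files is a made-up name, the short name is assumed to be everything before the first dot, and the target directory is passed in so the function can be tried on a scratch directory first.

```shell
#!/bin/sh
# Hypothetical helper: write host_aliases and act_qmaster into
# directory $2, using the FQDN given in $1.
write_sge_host_files() {
    fqdn=$1
    common_dir=$2
    short=${fqdn%%.*}
    # host_aliases: the .local name followed by the FQDN, on one line
    printf '%s.local %s\n' "$short" "$fqdn" > "$common_dir/host_aliases"
    # act_qmaster: just the .local name
    printf '%s.local\n' "$short" > "$common_dir/act_qmaster"
}

# Usage (as root, after checking the result in a scratch directory):
#   write_sge_host_files "$(hostname --fqdn)" "$SGE_ROOT/default/common"
```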
- NOTE: For Rocks 5.3 or older, edit the file /etc/init.d/sgemaster.<frontend>:
# vi /etc/init.d/sgemaster.<frontend>
and comment out the line:
/bin/hostname --fqdn > $SGE_ROOT/default/common/act_qmaster
by inserting a # character at the beginning, so it becomes:
#/bin/hostname --fqdn > $SGE_ROOT/default/common/act_qmaster
in order to prevent the file /opt/gridengine/default/common/act_qmaster from being overwritten with incorrect data every time sgemaster.<frontend> runs during bootup.
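The same edit can be made non-interactively with sed instead of vi. This is a sketch under the assumption that the line in your init script looks exactly like the one quoted above (optionally indented); comment_out_fqdn_line is a made-up name. A .bak copy of the script is kept next to it.

```shell
#!/bin/sh
# Hypothetical helper: comment out the "hostname --fqdn" line in the
# init script given as $1, keeping a backup copy as "$1.bak".
comment_out_fqdn_line() {
    script=$1
    # \1 is the original indentation, \2 is the command itself.
    # An already-commented line does not match, so re-running is harmless.
    sed -i.bak 's|^\([[:space:]]*\)\(/bin/hostname --fqdn > \$SGE_ROOT/default/common/act_qmaster\)|\1#\2|' "$script"
}

# Usage (as root):
#   comment_out_fqdn_line "/etc/init.d/sgemaster.$(hostname -s)"
```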
4. Restart both qmaster and execd for SGE on the frontend:
# /etc/init.d/sgemaster.<frontend> stop
# /etc/init.d/sgemaster.<frontend> start
# /etc/init.d/sgeexecd.<frontend> stop
# /etc/init.d/sgeexecd.<frontend> start
And everything will start working. :)
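To double-check after the restart, you can look for the frontend in the execution host list that qconf -sel prints. This is a small sketch, not from the original page: is_exec_host is a made-up name, and it assumes the qmaster knows the frontend under its .local name (it may be listed under the FQDN instead).

```shell
#!/bin/sh
# Hypothetical helper: succeed if host $1 appears as a whole line in
# the list of execution hosts read from stdin.
is_exec_host() {
    grep -Fxq -- "$1"
}

# Usage:
#   qconf -sel | is_exec_host "$(hostname -s).local" && echo "frontend is an execution host"
```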