Friday, February 7, 2014

adding the head node of rocks as compute node

happily stolen from:


https://wiki.rocksclusters.org/wiki/index.php/Sun_GridEngine

cause I keep forgetting it...


 Add Frontend as a SGE Execution Host in Rocks

To set up the frontend node as an SGE execution host on which queued jobs can run (like the compute nodes), do the following:

Quick Setup

# cd /opt/gridengine
# ./install_execd    (accept all of the default answers)
# qconf -mq all.q    (if needed, adjust the number of slots for [frontend.local=4] and other parameters)
# /etc/init.d/sgemaster.frontend stop
# /etc/init.d/sgemaster.frontend start
# /etc/init.d/sgeexecd.frontend stop
# /etc/init.d/sgeexecd.frontend start

Detailed Setup

1. As root, make sure $SGE_ROOT and the other SGE environment variables are set up correctly on the frontend:
# env | grep SGE
It should return something like:
SGE_CELL=default
SGE_ARCH=lx26-amd64
SGE_EXECD_PORT=537
SGE_QMASTER_PORT=536
SGE_ROOT=/opt/gridengine
If not, source the file /etc/profile.d/sge-binaries.[c]sh or check if the SGE Roll is properly installed and enabled:
# rocks list roll
NAME          VERSION ARCH   ENABLED
sge:          5.2     x86_64 yes
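
The check in step 1 can also be scripted. A minimal sketch, assuming a POSIX shell; the function name check_sge_env is made up for illustration and is not part of SGE or Rocks. It only reports which of the variables listed above are missing, it does not validate their values.

```shell
#!/bin/sh
# check_sge_env is a hypothetical helper, not part of SGE or Rocks.
# It returns 0 when every variable listed in step 1 is set and non-empty,
# and 1 otherwise, naming each missing variable on stderr.
check_sge_env() {
    missing=0
    for name in SGE_ROOT SGE_CELL SGE_ARCH SGE_QMASTER_PORT SGE_EXECD_PORT; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One possible use as root is "check_sge_env || . /etc/profile.d/sge-binaries.sh", which sources the profile script only when something is missing.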

2. Run the install_execd script to set up the frontend as an SGE execution host:
# cd $SGE_ROOT
# ./install_execd 
Accept all of the default answers as suggested by the script.


  • NOTE: For the examples below, substitute the actual "short hostname" of your frontend (as reported by the command hostname -s) wherever the hostname is left blank, that is, directly before .local and directly after sgemaster. or sgeexecd.
For example, if running the command hostname on your frontend returns the "FQDN long hostname" of:
# hostname
mycluster.mydomain.org
then hostname -s should return just:
# hostname -s
mycluster
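
To make the relationship between the two names concrete, here is a small sketch assuming a POSIX shell. The helper names short_name and local_name are hypothetical; they mimic what hostname -s reports (everything before the first dot) and build the .local name used in the notes below.

```shell
#!/bin/sh
# short_name is a hypothetical helper: it keeps everything before the
# first dot, which is what "hostname -s" reports for an FQDN.
short_name() {
    printf '%s\n' "${1%%.*}"
}

# local_name (also hypothetical) appends the .local suffix to the short name.
local_name() {
    printf '%s.local\n' "$(short_name "$1")"
}

short_name mycluster.mydomain.org   # prints: mycluster
local_name mycluster.mydomain.org   # prints: mycluster.local
```

On the frontend itself, fe=$(hostname -s) gives the same short name directly.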

3. Verify that the number of job slots for the frontend equals the number of physical processors/cores on your frontend that you wish to make available for queued jobs. To do so, check the slots parameter of the queue configuration for all.q:
# qconf -sq all.q | grep slots
slots                 1,[compute-0-0.local=4],[.local=4]
The [.local=4] means that SGE can run up to 4 jobs on the frontend. Be aware that since the frontend is normally used for other tasks besides running compute jobs, it is recommended that not all the installed physical processors/cores on the frontend be available to be scheduled by SGE to avoid overloading the frontend.
For example, on a 4-core frontend, to configure SGE to use only up to 3 of the 4 cores, you can modify the slots for .local from 4 to 3 by typing:
# qconf -mattr queue slots '[.local=3]' all.q
If there are additional queues besides the default all.q one, repeat the above for each queue.
Read "man queue_conf" for a list of resource limit parameters such as s_cpu, h_cpu, s_vmem, and h_vmem that can be adjusted to prevent jobs from overloading the frontend.
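
When there are several queues, repeating the qconf -mattr command by hand gets tedious. The following is a dry-run sketch, assuming a POSIX shell; slot_cmds is a hypothetical helper and mycluster.local is the example frontend name from the note above. It only prints the commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# slot_cmds is a hypothetical dry-run helper. It reads queue names from
# stdin, one per line, and prints the qconf command for each queue
# instead of running it.
#   $1 = frontend .local hostname, $2 = slots to allow on the frontend
slot_cmds() {
    while IFS= read -r queue; do
        [ -n "$queue" ] || continue
        printf "qconf -mattr queue slots '[%s=%s]' %s\n" "$1" "$2" "$queue"
    done
}

printf 'all.q\n' | slot_cmds mycluster.local 3
```

On a live system the queue names could come from qconf -sql, and the printed commands can be piped to sh once they look right.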


  • NOTE: For Rocks 5.2 or older, the frontend may have been configured by default during installation with only 1 job slot ([.local=1]) in the default all.q queue, which allows only 1 queued job at a time to run on the frontend. To check the value of the slots parameter of the queue configuration for all.q, type:
# qconf -sq all.q | grep slots
slots                 1,[compute-0-0.local=4],[.local=1] 
If needed, modify the slots for .local from 1 to 4 (or up to the maximum number of physical processors/cores on your frontend that you wish to use) by typing:
# qconf -mattr queue slots '[.local=4]' all.q
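
Reading the slots line by eye is error-prone once many hosts are listed. A minimal sketch, assuming a POSIX shell with tr and sed; slots_for_host is a hypothetical helper and mycluster.local stands in for your frontend's .local name. It extracts the per-host value from one slots line as printed by qconf -sq.

```shell
#!/bin/sh
# slots_for_host is a hypothetical helper. Given one "slots" line from
# "qconf -sq" and a hostname, it prints that host's slot override,
# or nothing when the host has no [host=N] entry.
slots_for_host() {
    host_re=$(printf '%s' "$2" | sed 's/\./\\./g')   # match dots literally
    printf '%s\n' "$1" | tr ',' '\n' |
        sed -n "s/^\[${host_re}=\([0-9][0-9]*\)][[:space:]]*\$/\1/p"
}

line='slots                 1,[compute-0-0.local=4],[mycluster.local=1]'
slots_for_host "$line" mycluster.local   # prints: 1
```

An empty result means the host falls back to the queue-wide default, which is the first value on the slots line.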


  • NOTE: For Rocks 5.3 or older, create the file /opt/gridengine/default/common/host_aliases containing both the .local hostname and the FQDN long hostname of your frontend on a single line:
# vi $SGE_ROOT/default/common/host_aliases
.local .mydomain.org


  • NOTE: For Rocks 5.3 or older, edit the file /opt/gridengine/default/common/act_qmaster to contain the .local hostname of your frontend:
# vi $SGE_ROOT/default/common/act_qmaster
.local


  • NOTE: For Rocks 5.3 or older, edit the file /etc/init.d/sgemaster.:
# vi /etc/init.d/sgemaster.
and comment out the line:
/bin/hostname --fqdn > $SGE_ROOT/default/common/act_qmaster
by inserting a # character at the beginning, so it becomes:
#/bin/hostname --fqdn > $SGE_ROOT/default/common/act_qmaster
in order to prevent the file /opt/gridengine/default/common/act_qmaster from getting overwritten with incorrect data every time sgemaster. is run during bootup.
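
If you would rather not edit the init script by hand, the same change can be made with sed. This is a sketch under assumptions: a POSIX shell with mktemp available, and a hypothetical helper named comment_out_act_qmaster that takes the path of the init script as its only argument. Try it on a backup copy first.

```shell
#!/bin/sh
# comment_out_act_qmaster is a hypothetical helper. It prefixes the
# "/bin/hostname --fqdn > ...act_qmaster" line of the given file with "#".
# Lines that are already commented do not match, so running it twice
# is harmless.
comment_out_act_qmaster() {
    tmp=$(mktemp) || return 1
    if sed 's|^\([[:space:]]*\)\(/bin/hostname --fqdn > .*act_qmaster\)|\1#\2|' "$1" > "$tmp"; then
        cat "$tmp" > "$1"    # overwrite in place, keeping the file's permissions
        rc=$?
    else
        rc=1
    fi
    rm -f "$tmp"
    return "$rc"
}
```

Writing back with cat instead of mv keeps the script's executable bit and ownership intact.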

4. Restart both qmaster and execd for SGE on the frontend:
# /etc/init.d/sgemaster. stop
# /etc/init.d/sgemaster. start
# /etc/init.d/sgeexecd. stop
# /etc/init.d/sgeexecd. start
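
The restart order above (qmaster first, then execd) can be written as a loop. A dry-run sketch, assuming a POSIX shell and assuming the init scripts are named after the frontend's short hostname as in the Quick Setup; restart_cmds is a hypothetical helper that only prints the commands.

```shell
#!/bin/sh
# restart_cmds is a hypothetical dry-run helper: it prints the stop/start
# commands from step 4 for the given short hostname, qmaster first.
restart_cmds() {
    for svc in sgemaster sgeexecd; do
        for action in stop start; do
            printf '/etc/init.d/%s.%s %s\n' "$svc" "$1" "$action"
        done
    done
}

restart_cmds mycluster
```

Piping the output to sh as root would actually run the four commands.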


And everything will start working. :)