Writing Job Scripts

From URCFwiki
Jump to: navigation, search

Overview

Univa Grid Engine documentation is online: https://proteusmaster.urcf.drexel.edu/UGE/user_guide.html Man pages are also available for all Grid Engine commands.

Please see also the following presentations from BioTeam (HPC consulting firm):

In short: everything depends, and everything is complex

Performance of codes depends on the code, the hardware, the specific computations being done with the code, etc. There are very few general statements that can be made about getting the best performance out of any single computation. Two different computations using the same program (say, NAMD) may have very different requirements for best performance. You will have to become familiar with the requirements of your workflow; this will take some experimentation.

Important

Do not try to use some other job control system within job scripts, e.g. a for loop repeatedly executing a program, a standalone process which monitors for "idle" nodes and then launches tasks remotely. To perform its function properly, Grid Engine expects each job to be "discrete". If your workflow involves running multiple jobs, either with or without interdependency between the jobs, please see below, and the following documentation:

Everyone has different workflows and attendant requirements. It is not possible to make general examples and recommendations which will work in every case.

Disclaimer: this article gives fairly basic information about job scripts. The full documentation should be consulted for precise details. On the login nodes, the command "man qsub" will give full information. See: File:UnivaGridEngineUserGuide.pdf. An outdated version of the UGE User Guide as a web page is available here: https://proteusmaster.urcf.drexel.edu/UGE/user_guide.html

File Format

If you created/edited your job script on a Windows machine, you will first need to convert it to Linux format to run. This is due to the difference in the handling of End Of Line.[1] Convert a Windows-generated file to Linux format using the dos2unix command:

    [juser@proteusa01 ~]$ dos2unix myjob.sh
    dos2unix: converting file myjob.sh to UNIX format ...

Basics

A (batch) job[2] script is a shell script[3] with some special comment lines that can be interpreted by the scheduler, Grid Engine (SGE, "S" = "Sun"). Comment lines which are intepreted by the scheduler are #$, starting at the first character of a line. Here is a simple one:

#!/bin/bash
#
### tell SGE to use bash for this script
#$ -S /bin/bash
### execute the job from the current working directory, i.e. the directory in which the qsub command is given
#$ -cwd
### join both stdout and stderr into the same file
#$ -j y
### set email address for sending job status
#$ -M fixme@drexel.edu
### project - basically, your research group name with "Grp" replaced by "Prj"
# -P fixmePrj
### select parallel environment, and number of job slots
#$ -pe openmpi_ib 256
### request 15 min of wall clock time "h_rt" = "hard real time" (format is HH:MM:SS, or integer seconds)
#$ -l h_rt=00:15:00
### a hard limit 8 GB of memory per slot - if the job grows beyond this, the job is killed
#$ -l h_vmem=8G
### want nodes with at least 6 GB of free memory per slot
#$ -l m_mem_free=6G
### want nodes with Intel CPUs
#$ -l vendor=intel
### select the queue all.q
#$ -q all.q

. /etc/profile.d/modules.sh

### These four modules must ALWAYS be loaded
module load shared
module load proteus
module load sge/univa
module load gcc

### Whatever modules you used, in addition to the 4 above, 
### when compiling your code (e.g. proteus-openmpi/gcc)
### must be loaded to run your code.
### Add them below this line.
# module load FIXME
 
echo "hello, world"

Note that these special comment lines are actually options to qsub. I.e., you may remove them from your job script, and pass them in the commandline to qsub,[4] instead:

    [juser@proteusi01]$ qsub -S bash -cwd -M fixme@drexel.edu -pe openmpi_ib 256 -l h_rt=00:15:00 -q all.q@@intelhosts

shebang

Shebang (from hash-bang, #!) defines the shell under which the job script is to be interpreted. We recommend using bash. This must be the very first line of the file, starting with the very first character.

    #!/bin/bash

Shell

Specify the interpreting shell to Grid Engine:

    #$ -S /bin/bash

Queue

There is only one active queue. You may select Intel nodes or AMD nodes by specifying a host group:

    #$ -q all.q@@intelhosts

or

    #$ -q all.q@@amdhosts

You may also, instead of using host groups, request a specific vendor resource:

    #$ -l vendor=intel
    #$ -q all.q

or

    #$ -l vendor=amd
    #$ -q all.q

Project

The "project" is an abstract grouping used in the share tree scheduler to determine job scheduling priority based on historical use. See the Job Scheduling Policy article.

    #$ -P myresearchPrj

There is a one-to-one correspondence between projects and research groups. If you are a member of myresearchGrp, your jobs are run under the project myresearchPrj. Access to the project is based on the effective group at the time the qsub command is issued. You may switch effective group in a session:

    [username@proteusi01 ~]$ newgrp myresearchGrp

Or you may put the command "newgrp myresearchGrp" in the file ~/.bash_profile so that the effective group is changed at login.

Serial Jobs (single-threaded, single CPU core)

There is no special command for a serial job. Jobs are serial by default unless a parallel environment (PE) is requested. (See next section.)

Parallel Environment (PE)

If your job is a parallel job, request an appropriate parallel environment, e.g.

    #$ -pe fixed64 256

That says: use the fixed64 environment with 256 slots. In Proteus, each "slot" corresponds with one CPU core. The fixed64 PE specifies exactly 64 slots per node will be assigned to the job.

For multithreaded jobs, use the "shm" PE. Multithreaded programs cannot communicate between nodes. Requesting more slots than are available on a single node will do nothing except tie up the extra slots without actually using them.

For MPI implementations compiled by URCF staff, i.e. modules named "proteus-mvapich2", "proteus-openmpi", etc., integration with Grid Engine has been compiled in, so the jobs will be able to read the environment to determine the hosts assigned. The same is true for Intel MPI. So, there is no functional difference between the various PEs.

For MATLAB jobs, the "matlab" PE does some preprocessing required for MATLAB's parallel library to work.

You can see all available parallel environments for a given queue by doing "qconf -sq", and look for the "pe_list" setting:

    [juser@proteusi01 ~]$ qconf -sq all.q
    qname                 all.q
    ...
    pe_list               shm,mvapich,mvapich_rr,openmpi_ib,openmpi_ib_rr, \
                          openmpi_rankfile,intelmpi,intelmpi_rr,vasp_rondo_test1, \
                          vasp_rondo_test2,fixed2,fixed3,fixed4,fixed5,fixed6, \
                          fixed7,fixed8,fixed9,fixed10,fixed11,fixed12,fixed13, \
                          fixed14,fixed15,fixed16,fixed64,matlab,testpe,abaqus, \
                          namd16,namd32,namd64,abqtest,spark.amd,spark.intel

The PEs which end with "_rr" assign slots in a round-robin order, i.e. the scheduler selects slot 1 from the first assigned node, slot 2 from the second assigned node, etc, and wraps around back to the first node when the last assigned node has been assigned a slot. The round-robin method also tries to maximize the number of nodes used in a job. So, if the job requests 32 slots, and there are 32 compute nodes available, the scheduler will use all 32 nodes with one slot per node.

The PEs which do not end with "_rr" assign slots by filling up one node before moving to the next node.

The PEs named fixedN, where N ranges from 2--16 inclusive and 64, are PEs which specify a fixed number N of slots.

Currently, except for the difference between the "round-robin" and the "fill-up" slot assignment methods, there is no effective difference between them, except for the intelmpi and intelmpi_rr PEs. The PEs for Intel MPI have an additional pre-processing step which informs MPI about the distribution of slots over nodes for the job.

Specific Number of Nodes

If you would like a multi-node parallel job to occupy all slots of each node, you will need to request one of the fixedNN PEs. You need to know the number of CPU cores per node of the type of node you wish to run on: @intelhosts have 16, and @amdhosts have 64. To use these PEs, request the PE with the total number of slots required by the job.

Examples:

  • If you want to run a 64-slot job on 4 Intel nodes (16 slots per node):
    #$ -pe fixed16 64
  • If you want to run a 128-slot job on 2 AMD nodes (64 slots per node):
    #$ -pe fixed64 128

NOTE: doing so means the job will have to wait for full nodes to become free, which may mean a longer wait time than requesting other PEs.

If you want your job to have exclusive use of all the nodes for the job, while not using all the slots in a node, request the "exclusive" complex:

    #$ -l exclusive 

See Proteus Hardware and Software for details of compute nodes.

PE for Multithreaded or Shared Memory Parallel

If you have a multithreaded or shared-memory parallel job, i.e. it must run on a single node, you must use the "shm" parallel environment:

    #$ -pe shm 16

If your program requires a number of slots which may fit on a single node, use the "shm" PE.

If your program uses OpenMP, you can set the OMP_NUM_THREADS environment variable to match the number of slots requested:

    export OMP_NUM_THREADS=$NSLOTS

However, OpenMP may be able to infer the number of assigned slots from the environment. You may have to run some tests to see if your application does so.

For other mechanisms of implementing multithreading, you must make sure your program uses the same number of threads as there are slots. WARNING If this is not done, and Grid Engine thinks that your job uses fewer processor cores than it actually does, GE will schedule other jobs onto the same compute node, resulting in an oversubscription, and slowing all jobs on that node down.

PE for Hybrid OpenMP + MPI Parallel

TBA

Core-binding and NUMA

GE also supports core binding (via the -binding option of qsub) and NUMA memory binding (via the -mbind option of qsub). For Open MPI programs using the proteus-openmpi/* modules, mpirun will read the GE environment to obtain the binding determined by the scheduler.

Resource Requests in General

The resource requests made by your job script, with the exception of h_rt and h_vmem, are not limits on your program enforced by the scheduler. Requests such as "-pe shm 16" or "-l m_mem_free=2G" tell the job scheduler to find nodes which have resources available which can fulfill those requests. What your program itself does once it starts running is completely up to the program itself, or you, as you specify commandline options to the program.

For instance, suppose we have a program named "compute_stuff" that is multithreaded. It also has a commandline option to set the number of threads:

   compute_stuff --nthreads 16

will tell compute_stuff to use 16 threads.

You can perfectly well write the following job script, which is not a good thing:

### XXX DO NOT DO THIS
#$ -pe shm 4

compute_stuff --nthreads 512

There is no control mechanism to prevent this job from actually running. (Technically, there is a mechanism, but this mechanism prevents Matlab from running at all, so it is not being used.)

What happens then is that compute_stuff now tries to run 512 threads, which will overwhelm the node. The exception to this are parallel jobs which use OpenMPI. OpenMPI has "tight integration" with Grid Engine, and will automatically read the environment to get slot allocations. However, memory usage limits are not read by OpenMPI.

A similar thing applies to memory. Suppose compute_stuff can be set to use a certain amount of memory, e.g. "compute_stuff --mem 4G" One can write a job script:

### XXX DO NOT DO THIS
#$ -l m_mem_free=4G

compute_stuff --nthreads 1 --mem 2048G

This will crash the node as the program tries to use 2 TB of memory, which isn't there.

The "h_vmem" value is meant to prevent this from happening. It sets a hard limit on your job's memory usage, and the scheduler will kill your job if its memory usage goes above h_vmem. However, one must be aware of the hardware limitation. The AMD nodes have 256 GB of installed RAM, and the Intel nodes have 64 GB of installed RAM. Setting a h_vmem value higher than this only means that rather than the scheduler killing the job, the job, and every other process on the node, terminates because it causes the node to crash.

Resource Requests

Resources are requested[5] with the "-l" option. At the least, you should specify the amount of wall clock time:

    #$ -l h_rt=hh:mm:ss

h_rt is actually a limit on your job: if your job exceeds the limit, GE will kill the job.

There are default limit requests in place; if your job does not specifically request h_rt, this value will be used:

    h_rt = 2:00:00

Specifying a reasonably accurate h_rt will help with scheduling: try not to ask for too much more than the actual amount of time your job will need.

If you have an idea of how much memory your job will need, request m_mem_free per slot. Say your job requires 8 GiB per slot, then your script should specify:

    #$ -l m_mem_free=8G

This specifies the minimum amount of free memory (per requested slot) nodes need to have in order for the job to be assigned to those nodes. NB every node consumes about 1 GiB of memory for system usage, so no nodes will have all of its RAM available. If you request the full amount of installed RAM, your job will not run.

There is also a default virtual memory limit in place (interpreted as per-slot for parallel jobs):

    h_vmem = 2G

NOTE For memory-related requests the suffix "g" (lower case) means 1000, and the suffix "G" (upper case) means 1024.

This places an upper limit on virtual memory usage by your job. If your job exceeds this limit, the job will be terminated by Grid Engine. This, however, does not prevent jobs from starting up on nodes where the available free memory per slot is less than h_vmem.

Resource Parameter name Default value Consumable
Hard wall clock time limit (hh:mm:ss) h_rt 2:00:00 No
Hard virtual memory limit per slot h_vmem 2G No
Minimum memory per slot m_mem_free 256M Yes

Recommendation for memory requests

NOTE 2017-Jul-04 h_vmem is back to its old meaning of per-slot value.

NOTE 2017-Mar-22 With the recent update to Univa Grid Engine 8.4.4, h_vmem seems to have changed meaning. Please treat h_vmem as a PER-JOB value (not per-slot), and we will update this article when more specific information becomes available.

  • m_mem_free and h_vmem should be close in value. You should always have some sense of how much memory your job will take: this may take a bit of trial and error, or it requires understanding the work your code does.
  • do not set h_vmem higher than the actual installed RAM on a host
  • because nodes may be busy, and occupied by other jobs (yours or other users'), setting h_vmem too high may mean all jobs on the node hang due to excessive paging. For example, job A requests m_mem_free=10g (1 slot) and h_vmem=40g. It starts running on an Intel node which has no other jobs. The next job (B) in the pending list also requests m_mem_free=10g and h_vmem=40g using 1 slot. It also starts running on the same node. Now, both jobs A & B grow in memory usage until they reach 35 GB. That total is 70 GB, which is greater than the 64 GB installed on Intel nodes. As a result, memory starts being paged out to disk, slowing both jobs down by a large factor (> an order of magnitude). This means the h_rt that both requested is now too short, and both jobs will terminate before completion.

Hardware Limitations

It is easy to request an amount of resources which renders a job impossible to run. Please see the article on Proteus Hardware and Software for what actual resources are available.

In addition, on each compute node, the system itself takes up some memory, typically about 1 GiB. If a job requests a (m_mem_free, nslots) pair which works out to the total installed RAM, the job will not be able to run. Not all the memory is free because some is being used by the system.

Resource Limits

There some limits in place to ensure fair access for all users. Please see: Job Scheduling Algorithm#Current_Limits.

Resource Reservation

Resource reservation is not the same as advance reservation. Resource reservation tells GE to block lower priority jobs with small resource requests in order to accumulate enough resources for a higher priority job with larger resource requests. This prevents many small serial jobs from starving a large parallel job.

In contrast, advance reservation submits a job to be run at a specific time in the future.

Resource reservation is a Boolean flag (the default is "no"):

    #$ -R y

NOTE: Resource reservation has a very large adverse effect on the performance of the scheduler. As such, Proteus is configured to handle only one job using reservation at a time.

Environment Variables

SGE sets some environment variables in every job, many start with "SGE_". These include:

Env. Variable Meaning
SGE_O_WORKDIR directory in which the qsub command was given
SGE_STDOUT_PATH path to file where stdout is redirected
SGE_STDERR_PATH path to file where stderr is redirected; if option "-j y" was given, SGE_STDOUT_PATH and SGE_STDERR_PATH will be the same
JOB_ID current job ID
JOB_NAME current job name (see the "-N" option of qsub)
SGE_TASK_ID index number of the current array job task
NHOSTS number of hosts used in job
NSLOTS number of slots used in job (can be used as argument to mpirun, e.g. mpirun -n $NSLOTS ...)
PE_HOSTFILE full pathname of file listing hosts included in the parallel environment of the job (usually handled automatically by MPI)

Output

Say you have a job script named my_job.sh. The outputs from your job will be in files named

     my_job.sh.oNNNNNN -- standard output
     my_job.sh.eNNNNNN -- error output

where "NNNNNN" is the job ID.

If you specify in your job script

     #$ -j y

then the two files will be joined into one, i.e. the error output file is combined into the standard output.

If you specify in your job script

     #$ -cwd

then the output files will appear in the same directory as the job script. If that option is not specified, the output files will appear in your home directory.

/tmp Directory or Temporary Disk Space

DO NOT use /tmp

Many programs will default to using /tmp for files created and used during execution. This is usually configurable by the user. Please use local scratch instead of /tmp: see the next section for the appropriate environment variables you can use. If the program you use follows convention and uses the environment variable $TMP, then it should work fine since that points to a job-specific local scratch directory.

Additionally, you may be able to configure the program to delete tmp files after completing.

Staging Work to Local Scratch

Your home directory, and your research group directory (/mnt/HA/groups/myresearchGrp), are NFS shared filesystems, mounted over the network.

If your job does a lot of file I/O (reads a lot, writes a lot, or both; where "a lot" means anything over a few hundred megabytes), and your job resides only on a single execution host, i.e. it is not a multi-node parallel computation, it may be faster to stage your work to the local scratch directory.

Each execution host has a local directory of about 400 GiB. Since this is a local drive, connected to the PCI bus of the execution host, i/o speeds are orders of magnitude faster than i/o into your home directory or your group directory.

In the environment of the job, there are two environment variables which give the name of a job-specific local scratch directory:

    $TMPDIR
    $TMP

The directory name includes the job ID, and this directory is automatically deleted at the end of the job.

In your job script, you move all necessary files to $TMPDIR, and then at the end of your job script, you move everything back to $SGE_O_WORKDIR.

Please see the next section for a link to an example.

IMPORTANT This local scratch directory is automatically deleted and all its contents removed at the end of the job to which it is allocated.

Staging Work to Fast Shared Scratch (Lustre)

For best performance for multi-node parallel jobs, it is better to run the job from a temporary directory in /scratch, and then at the end of the job, copy all the output back to your research directory in /mnt/HA/groups/mygroupDir/ The /scratch directory is shared to all nodes, and is hosted on the fast Terascala Lustre filesystem. Unlike local scratch, shared scratch directories are not automatically cleaned at the end of the job. However, files which have not been accessed in 45 days are deleted. If the filesystem is filled up, data will be deleted to make space.

There may be more than one way to do this.

Please see the example in /mnt/HA/examples/MPI_Simple/StagingExample/

No-setup Way

  • Create a directory for yourself in /lustre/scratch:
    [juser@proteusi01]$ mkdir /lustre/scratch/myname
    [juser@proteusi01]$ cd /lustre/scratch/myname
  • Manually cp your files to that directory.

Setup

  • All your job files (program, input files) are in ~/MyJob This can be considered a sort of "template" job directory.
  • Your job script will have to copy the appropriate files to /scratch/$USER/$JOB_ID, and then copy the outputs back once the job is complete

Script Snippet

NB This snippet will probably not work at all for your particular case. It is meant to show one way of staging job files to /scratch. The details will vary depending on your own particular workflow.

#!/bin/bash -l
#$ -S /bin/bash
#$ -P myresearchPrj
#$ -pe shm 16
...

### NOTE: if you use "#!/bin/bash -l" at the top of this script, the next 4 lines are unnecessary
#. /etc/profile.d/modules.sh
#module load shared
#module load sge/univa
#module load gcc/4.8.1

module load proteus-openmpi/gcc

export STAGINGDIR=/scratch/$USER/$JOB_ID
mkdir -p $STAGINGDIR
     
export PROGRAM=~/bin/my_program

# copy files to staging directory, and run from there
cp -R ./* $STAGINGDIR
cd $STAGINGDIR
     
# This script uses OpenMPI. "--mca ..." is for the OpenMPI implementation
# of MPI only.
# "--mca btl ..." specifies preference order for network fabric
#     not strictly necessary as OpenMPI selects the fastest appropriate one first
echo "Starting $PROGRAM in $( pwd ) ..."
$MPI_RUN --mca btl openib,self,sm $PROGRAM
    
# create job-specific subdirectory and copy all output to it
mkdir -p $SGE_O_WORKDIR/$JOB_ID
cp outfile.txt $SGE_O_WORKDIR/$JOB_ID
     
# the SGE output files are in $SGE_O_WORKDIR
cd $SGE_O_WORKDIR
mv *.o${JOB_ID}  *.e${JOB_ID} *.po${JOB_ID} *.pe${JOB_ID} $SGE_O_WORKDIR/$JOB_ID
     
# delete staging directory
/bin/rm -rf $STAGINGDIR

All the known output files will be in a subdirectory named $JOB_ID (some integer) in the directory where the qsub command was issued. So, if your job was given job ID 637, all the output would be in ~/MyJob/637/.

qsub Options

Various qsub options can control job parameters, such as which file output is redirected to, wall clock limit, shell, etc. Please see Grid Engine Quick Start Guide for some common parameters. More detail is in the Univa Grid Engine User Guide section on batch jobs.[2]

mpirun Options

There are 3 available MPI implementations available:

  • Intel MPI
  • MVAPICH2
  • OpenMPI

All have the mpirun command which is used to run MPI programs. The locally-compiled versions of MVAPICH2 and OpenMPI have Grid Engine integration, which allows for a simpler invocation -- mpirun reads the environment (NSLOTS, PE_HOSTFILE); Intel MPI also has Grid Engine integration:

MVAPICH2 - proteus-mvapich2
mpirun myprog (you may explicitly specify the resource manager by using "-rmk sge")
OpenMPI - proteus-openmpi
mpirun myprog
Intel MPI - intel-mpi/64
mpirun myprog

Modules

Environment Modules set up the environment for the job to run in. At the very least, you should have these modules loaded:

    shared
    proteus
    sge/univa
    gcc

If you used GCC with OpenMPI, you would also need:

    proteus-openmpi/gcc/1.8.1-mlnx-ofed

So, your job script would need to have:

    module load proteus-openmpi/gcc/1.8.1-mlnx-ofed

Module-loading is order-specific. Modules which are depended on by others must come earlier.

Submitting Multiple Jobs with a Single qsub

There are occasions when you may want to launch multiple jobs with a single qsub command. These jobs may all be similar, with small changes in input parameters. Or these jobs may be interdependent, the later jobs using the outputs of earlier jobs as their inputs.

Array Jobs

When you want to launch multiple similar jobs, with no dependencies between jobs, i.e. the jobs may or may not run simultaneously, you can use "array jobs"[6]

If you want to run many jobs, say, more than 50 or so up to some millions, you must use an array job. This is because using an array job drastically reduces the load on the scheduler. The scheduler can reuse information over all the tasks in the array job, rather than keeping a complete record of every single separate job submitted. The current limit on number of tasks in an array job is 1.75 million. It has been tested with 1.6 million jobs.

Each "subjob" is called an "array task". Use the "-t" option to qsub to specify an array job:

    [user@proteusi01 ~]$ qsub -t n[-m[:s]] my_job.sh

where:

  • n -- start ID
  • m -- end ID
  • s -- step size

As with all qsub options, this may be placed within the job script header:

    #$ -t n[-m[:s]]

In addition to the usual environment variables, the task ID is available:

Env. Variable Meaning
SGE_TASK_ID ID of the array task
SGE_TASK_FIRST ID of the first array task (n in the above)
SGE_TASK_LAST ID of the last array task (m in the above)
SGE_TASK_STEPSIZE step size (s in the above)

Array jobs should also enable "automatic restart". This enables individual tasks to restart if they end unexpectedly, e.g if a node crashes. To enable restart, set "-r y"

You can also throttle the number of simultaneous tasks to run by using the "-tc" option. So,

    ### run no more than 128 tasks simultaneously
    #$ -tc 128

IMPORTANT

  • All tasks in the array are identical, i.e. every single command that is in the job script will be executed by every single task. The tasks in the array are differentiated by two things:
    • the environment variable SGE_TASK_ID,
    • its own TMP environment variable (which is the full path to the task-specific local scratch directory).
  • You can think of the line "#$ -t 1-100:1" as the head of the for loop, i.e. the equivalent of "for i = 1; i <= 100; i = i+1". The body of the job script is then the body of the "for-loop"
  • If all the tasks in the job are reading the same file(s), make sure your program/script opens the file(s) as read-only (as opposed to read-write). This can have a significant impact on the job, as files open with write access involve coördinating potential parallel writes.

Here is a simple example of 100 tasks, which prints the SGE_TASK_ID and value of TMP of each task:

#!/bin/bash -l
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -M fixme@drexel.edu
#$ -P fixmePrj
#$ -l h_rt=0:05:00
#$ -l h_vmem=256M
#$ -l m_mem_free=196M
#$ -q all.q
#$ -r y
#$ -t 10-1000:10

### name this file testarray.sh

echo $SGE_TASK_ID
echo $TMP

The result is 100 files named testarray.sh.oNNNNN.MMM, where NNNNN is the Job ID, and MMM is the Task ID, from the range {10, 20, ..., 1000}. Each file contains a single integer, from {10, 20, ..., 1000}.

N.B. SGE_TASK_ID is not "zero-padded", i.e. they are in the sequence 1, 2, 3, ..., 10, 11, ... instead of 01, 02, 03, ...

There is a relatively complex example on the login nodes:

     /cm/shared/apps/sge/univa/examples/jobs/array_submitter.sh
                                             step_A_array_submitter.sh
                                             step_B_array_submitter.sh

The array_submitter.sh example is really a meta script, which takes as input the number of tasks wanted in the array, and then calls qsub with the appropriate options. It also prevents all "step B" jobs from executing until the appropriate "step A" task has completed.

See also:

Job Dependencies

You may have a complex set of jobs, where some later jobs depend on the output of some earlier jobs. Please see the following example on either login node:

    /cm/shared/apps/sge/univa/examples/jobs/pascal.sh

It is a meta script which creates the network of dependent job scripts, and then submits them.

Basically, you provide the job which needs to wait with the job ID of the job which goes first. Say the job ID of the first job is 123456. Then,

   [juser@proteusa01 ~]$ qsub -hold_jid 123456 next_job.sh

So, next_job.sh will only start running once job 123456 completes.

A simpler example is available in the UGE User Guide. Below is a simple shell script which runs three jobs in sequence: the second one waits for the first to complete, and the third waits for the second to complete:

#!/bin/bash
. /etc/profile.d/modules.sh

module load shared
module load proteus
module load sge/univa

waitfirst=$( qsub -terse -b y -j y -S /bin/bash -cwd -P myresearchPrj /bin/sleep 60 )
waitsecond=$( qsub -terse -b y -j y -S /bin/bash -cwd -P myresearchPrj -hold_jid $waitfirst /bin/sleep 15 )
qsub -b y -j y -S /bin/bash -cwd -P myresearchPrj -hold_jid $waitsecond /bin/sleep 15

Once this script is created, make it executable, and then run it to submit the jobs:

    [juser@proteusi01 ~]$ chmod +x this_script.sh
    [juser@proteusi01 ~]$ ./this_script.sh

Job Dependencies with Job Arrays

Suppose you have a job array to produce a bunch of outputs, followed by a single job to do some post-procesing. You can a dependent job to handle this. However, you have to handle the terse qsub output, which is no longer just the job ID. For a job array, qsub -terse returns the job ID plus the array specification:

    [juser@proteusi01 ~]$ qsub -terse -b y -j y -cwd -t 1:5:1 /bin/sleep 60
    307768.1-5:1

To extract just the job ID:

### NOTE you can put the "-t 1:5:1" in the job script computation.sh rather than using it on the command line
waitid=$( qsub -terse -t 1:5:1 computation.sh | cut -f1 -d. )
qsub -b y -j y -S /bin/bash -cwd -P myresearchPrj -hold_jid  $waitid  postprocess.sh

You can also do it manually by examining the output of the first qsub. The job ID is a 6-digit number (it may be 7 digits once we have passed 1 million), that comes before the dot ".".

Very Large Number of Jobs/Tasks

If your workflow involves a large number (millions) of jobs or tasks, you should look into using Distributed Resource Management Application API (DRMAA), which is a programmatic interface to the job scheduler.[7] Using DRMAA can lower job submission times by a factor of 10.[8]

Exclusive Use of Nodes

Sometimes, parallel jobs need exclusive access to nodes while not consuming all slots on the nodes. This is typically due to the amount of memory required. Remember that all Intel nodes have 64 GB RAM and 16 cores, while AMD nodes have 256 GB RAM and 64 cores.

To do so, you would request the "exclusive" complex, and a round-robin PE:

    #$ -l exclusive
    #$ -l m_mem_free=29G
    #$ -pe openmpi_ib_rr 18
    #$ -q all.q

m_mem_free is a per slot value, which means the total memory required for the job is 522 GB. This would give you 9 Intel nodes, or 2 AMD nodes.

If you require Intel hosts:

    #$ -l exclusive
    #$ -l m_mem_free=7G
    #$ -pe intelmpi_rr 16
    #$ -q all.q@@intelhosts

This should give you up to 16 nodes, 1 slot per node. Requesting m_mem_free=7G means your job wants 126 GB memory. Since each Intel node has 64 GB, this can be provided by at least 2 nodes. The PE requests 16 slots, which could be provided by a single Intel node, except for the fact that the h_vmem request needs at least 2 nodes. Then, since a round robin PE is requested, the slots for the job are assigned first to one node, then the other, then back again until all slots are assigned. This gives 8 slots per node.

Examples

Please see Category:Examples, and also the directory $SGE_ROOT/examples/jobs

See Also

References

  1. Wikipedia article on Newline
  2. 2.0 2.1 Univa Grid Engine User Guide -- Submitting Batch Jobs
  3. To see why we default to using bash, please read Csh Programming Considered Harmful
  4. See the man page for qsub ("man qsub" at the command line).
  5. Univa Grid Engine User Guide -- Requestable Resources
  6. Univa Grid Engine User Guide -- Array Jobs
  7. Univa Grid Engine User Guide: Submission, Monitoring and Control via an API
  8. Grid Engine Blog: Client Side JSV Performance Comparison