Advanced job control
This page contains examples of using the PBS/Torque or Slurm resource managers to manage computing jobs. Many more options are described in the manual or help pages of the individual commands.
- Interactive jobs
To get an interactive job with an interactive shell on a compute node, you can call "qsub" with the "-I" (capital i) flag (PBS/Torque) or "srun" with the "--pty" flag (Slurm).
Example (PBS/Torque):
$> qsub -I -l nodes=1:ppn=40:compute20,walltime=01:00:00
Example (Slurm) - Runs 1 task on a node with 32 cores:
$> srun -N 1 --cpus-per-task=32 --pty bash
- Batch jobs
The usual way to start a job is to use a submit script that defines (i) the requested resources and (ii) the instructions telling the compute nodes what to do (cf. material for examples). These files are submitted using "qsub" (PBS/Torque) or "sbatch" (Slurm). Each job is assigned a job id for identification. Usually, a job is queued directly and waits for execution. However, jobs can also be submitted with a dependency option. In that case the job is queued in a "hold" state and released from it as soon as the dependency is resolved, e.g., when another job has finished without errors.
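Dependent submissions can also be scripted. The following is a minimal sketch for Slurm, assuming that "sbatch --parsable" prints only the job id (optionally followed by ";" and a cluster name). The function name "submit_chain" and the script names in the usage note are hypothetical.

```shell
#!/bin/bash
# Submit job scripts as a chain: each job is released only after the
# previous one has finished without errors (afterok dependency).
submit_chain() {
    local prev_id="" script out
    for script in "$@"; do
        if [ -z "$prev_id" ]; then
            # First job: no dependency.
            out=$(sbatch --parsable "$script") || return 1
        else
            out=$(sbatch --parsable --dependency=afterok:"$prev_id" "$script") || return 1
        fi
        # --parsable prints "JOB_ID" or "JOB_ID;CLUSTER"; keep only the id.
        prev_id=${out%%;*}
        echo "$script -> $prev_id"
    done
}
```

Called as "submit_chain prepare.sub compute.sub collect.sub", this would print one line per script with the job id it was assigned.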
Example (PBS/Torque):
i) Default submission
$> qsub jobscript.sub
ii) Dependency: jobscript.sub can only be executed when the job with id JOB_DEPEND_ID has finished without errors
$> qsub -W depend=afterok:JOB_DEPEND_ID jobscript.sub
iii) Submit job in a hold state and release the hold state at a later time
$> qsub -h jobscript.sub
$> qrls JOB_ID
Example (Slurm):
i) Default submission
$> sbatch jobscript.sub
ii) Dependency: jobscript.sub can only be executed when the job with id JOB_DEPEND_ID has finished without errors
$> sbatch --dependency=afterok:JOB_DEPEND_ID jobscript.sub
iii) Submit job in a hold state and release the hold state at a later time
$> sbatch -H jobscript.sub
$> scontrol release JOB_ID
- Checking job status
After submission, the job is attached to a queue and waits until the requested resources are available. A job's status can be checked by "qstat" (PBS/Torque) or "squeue" (Slurm).
Examples (PBS/Torque):
i) Check all jobs
$> qstat
ii) Check the jobs of user USER
$> qstat -u USER
Examples (Slurm):
i) Check all jobs
$> squeue
ii) Check the jobs of user USER
$> squeue -u USER
iii) Get information about the queue and/or nodes
$> sinfo [-l] [-N]
Moreover, Slurm provides detailed history and accounting information about your own jobs via "sacct".
$> sacct
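Status checks can be automated, for example to wait until a job has left the queue. The following is a minimal sketch for a single (non-array) Slurm job, assuming that "squeue -h -j JOB_ID -o %T" prints only the job state. The function name "wait_for_job" and the list of states treated as "still active" are choices made for this illustration.

```shell
#!/bin/bash
# Poll squeue until the job is no longer listed as waiting or running.
# Arguments: JOB_ID [poll interval in seconds, default 30]
wait_for_job() {
    local job_id=$1 interval=${2:-30} state
    while :; do
        # -h: no header, -j: select the job, -o "%T": print only the job state
        state=$(squeue -h -j "$job_id" -o "%T" 2>/dev/null)
        case "$state" in
            PENDING|CONFIGURING|RUNNING|COMPLETING|SUSPENDED)
                echo "job $job_id: $state"
                sleep "$interval"
                ;;
            *)
                # Empty output or any other state: treat the job as finished.
                echo "job $job_id is no longer queued"
                return 0
                ;;
        esac
    done
}
```

After the function returns, "sacct -j JOB_ID" can be used to look up how the job ended.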
- Editing a job
One can edit a job's resource request using "qalter" (PBS/Torque) or "scontrol" (Slurm).
Examples (PBS/Torque):
$> qalter -l walltime=24:00:00 JOB_ID
(Under construction)
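For Slurm, a minimal sketch of the same kind of change, assuming the "scontrol update JobId=... TimeLimit=..." syntax. Whether a regular user may raise the time limit, rather than only lower it, depends on the site configuration. The function name "set_time_limit" is hypothetical.

```shell
#!/bin/bash
# Change the time limit of a queued or running Slurm job.
# Arguments: JOB_ID TIME (e.g. 24:00:00)
set_time_limit() {
    if [ "$#" -ne 2 ]; then
        echo "usage: set_time_limit JOB_ID TIME" >&2
        return 2
    fi
    scontrol update JobId="$1" TimeLimit="$2"
}
```

For example, "set_time_limit JOB_ID 24:00:00" corresponds to the PBS/Torque call shown above.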
- Job abortion
Computing jobs can be canceled (whether running or still queued) using "qdel" (PBS/Torque) or "scancel" (Slurm).
Examples (PBS/Torque):
$> qdel JOB_ID
Examples (Slurm):
$> scancel JOB_ID
- Advanced resource reservation
It is possible to request special resources based on an assigned property or even to request or exclude specific nodes.
Examples (PBS/Torque):
i) Request 2 nodes (40 threads each) with the compute20 property:
$> qsub -l nodes=2:ppn=40:compute20
ii) Request nodes r02n01 and r02n02 of btrzx2 (Note: btrzx2 and btrzx4 nodes must be requested with an additional "-0" appended to the node names.)
$> qsub -l nodes=r02n01-0:ppn=40+r02n02-0:ppn=40
Examples (Slurm, shown for interactive jobs; the same flags can be used in a submit file as well):
i) Request 2 nodes with a constraint list CONSTR
$> srun -N 2 --constraint=CONSTR --pty tcsh
ii) Request the nodes r01n01 and r01n02 on btrzx1
$> srun -N 2 --nodelist=r01n01,r01n02 --pty tcsh
iii) Exclude r01n01 on btrzx1 from the following job request
$> srun -N 1 --exclude=r01n01 --pty tcsh
- Array jobs
Job arrays are useful for running many jobs that differ in only a few parameters.
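The core idea is that each array task uses its index to pick its own parameter set. The following minimal sketch can be tried without a scheduler. Inside an array task, Slurm sets SLURM_ARRAY_TASK_ID; the fallback value 2, the helper name "nth_params", and the three sample parameter lines are assumptions for illustration.

```shell
#!/bin/bash
# Sample parameter list (hypothetical values); lines starting with "#" are comments.
params='#1st 2nd 3rd 4th
-s 8 100 100
-s 8 100 200
-s 8 100 300'

# Print the n-th non-comment line read from standard input.
nth_params() {
    grep -v '^#' | sed -n "${1}p"
}

# Inside an array job Slurm sets SLURM_ARRAY_TASK_ID; fall back to 2 for a local dry run.
task_id=${SLURM_ARRAY_TASK_ID:-2}
selected=$(printf '%s\n' "$params" | nth_params "$task_id")
echo "task $task_id would run: ./example.sh $selected"
```

Because comment lines are filtered out before counting, index 1 refers to the first real parameter set and not to the header line.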
Example (Slurm):
First, assume we have a parameter list in a text file named "paramlist", like:
#Generated parameterlist to demonstrate job arrays.
#1st 2nd 3rd 4th
-s 8 100 100
-s 8 100 200
-s 8 100 300
-s 8 100 400
-s 8 200 100
-s 8 200 200
-s 8 200 300
-s 8 200 400
...
-s 256 300 400
-s 256 400 100
-s 256 400 200
-s 256 400 300
-s 256 400 400
#EOF
To iterate over this list, for example over the first 10 parameter sets, one can use an array job with the following script:
#!/bin/bash
#SBATCH --array=1-10 # Describe your array first
#SBATCH --job-name=job_array
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=ex_%A_%a.out # err/out file with <JOBID>_<JOBARRAY_INDEX>
#SBATCH --error=ex_%A_%a.err
#SBATCH --time=00:05:00
# Start the example.sh script with the <n>th parameter set from the list.
# grep -v "^#" skips the comment lines; sed prints only line number <n>.
line=${SLURM_ARRAY_TASK_ID}
./example.sh $( grep -v "^#" paramlist | sed -n "${line}p" )
#EOF
This starts "./example.sh" with the <n>th parameter set of paramlist as follows:
$ ./example.sh -s 8 100 100
...
$ ./example.sh -s 8 300 200
- Sending signals to jobs
Jobs can be sent a signal using "qsig" (PBS/Torque) or "scancel" (Slurm) with the "-s" or "--signal" option.
Examples (PBS/Torque):
$> qsig -s USR1 JOB_ID
Examples (Slurm):
$> scancel -s USR1 JOB_ID
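A signal is only useful if the job reacts to it. The following minimal local sketch assumes a bash job that traps USR1. The stand-in workload and the function name "run_signal_demo" are hypothetical, and plain "kill" takes the place of the scheduler command. On a cluster, whether the signal reaches the batch shell itself or only the processes it started depends on the resource manager and the options used.

```shell
#!/bin/bash
# Stand-in "job": installs a USR1 handler, then idles in a loop.
job_body='trap "echo USR1 received; exit 0" USR1
while :; do sleep 1; done'

run_signal_demo() {
    bash -c "$job_body" &
    local pid=$!
    sleep 2               # give the stand-in job time to install its trap
    kill -USR1 "$pid"     # on a cluster, the scheduler command is used instead
    wait "$pid"           # returns the exit status of the stand-in job
}

run_signal_demo
```

The handler runs once the current "sleep" has finished, because bash defers traps until the foreground command returns. A real job script would typically write a checkpoint in the handler instead of just printing a message.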