Forschungszentrum für wissenschaftliches Rechnen an der Universität Bayreuth

Advanced job control

This page contains examples of using the PBS/Torque or Slurm resource managers to manage computing jobs. Many more options are described in the manual or help pages of the specific commands.

Interactive jobs

To get an interactive job with an interactive shell on a compute node, you can call "qsub" with the "-I" (capital i) flag (PBS/Torque) or "srun" with the "--pty" flag (Slurm).

Example (PBS/Torque):
$> qsub -I -l nodes=1:ppn=40:compute20,walltime=01:00:00

Example (Slurm) - Runs 32 tasks on a node with 1 core each:
$> srun -N 1 --ntasks-per-node=32 --pty tcsh

Example (Slurm) - Runs 1 task on the node with 32 cores:
$> srun -N 1 -c 32 --pty tcsh

Batch jobs

The usual way to start a job is to use a submit script that defines (i) the requested resources and (ii) the instructions telling the compute nodes what to do (cf. material for examples). Such a file is submitted with "qsub" (PBS/Torque) or "sbatch" (Slurm). Each job is assigned a job ID for identification. Usually, a job is queued directly and waits for execution. However, a job can also be submitted with a dependency option. In that case the job is queued in a "hold" state and released from it as soon as the dependency is resolved, e.g., when another job has finished without errors.

Example (PBS/Torque):
i) Default submission
$> qsub jobscript.sub
ii) Dependency: jobscript.sub is only executed after job JOB_DEPEND_ID has finished without errors
$> qsub -W depend=afterok:JOB_DEPEND_ID jobscript.sub
iii) Submit job in a hold state and release the hold state at a later time
$> qsub -h jobscript.sub
$> qrls JOB_ID

Example (Slurm):
i) Default submission
$> sbatch jobscript.sub
ii) Dependency: jobscript.sub is only executed after job JOB_DEPEND_ID has finished without errors
$> sbatch --dependency=afterok:JOB_DEPEND_ID jobscript.sub
iii) Submit job in a hold state and release the hold state at a later time
$> sbatch -H jobscript.sub
$> scontrol release JOB_ID
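
The Slurm dependency example can be scripted so that the job ID does not have to be copied by hand. The following is a minimal sketch, assuming that sbatch prints its default message "Submitted batch job <ID>". The helper extract_job_id and the file names first.sub and second.sub are hypothetical and not part of this site's material.

```shell
#!/bin/bash
# Sketch: chain two Slurm jobs so the second starts only after the first succeeds.
# "first.sub" and "second.sub" are hypothetical submit scripts.

# Read sbatch's default message ("Submitted batch job 12345") from stdin
# and print its fourth field, which is the job ID.
extract_job_id() {
    awk '{ print $4 }'
}

# Submit only when sbatch and both submit scripts are actually present.
if command -v sbatch >/dev/null 2>&1 && [ -f first.sub ] && [ -f second.sub ]; then
    first_id=$(sbatch first.sub | extract_job_id)
    sbatch --dependency=afterok:"${first_id}" second.sub
fi
```

On a machine without Slurm the script submits nothing, so the parsing helper can be tried out safely.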

Checking job status

After submission, the job is placed in a queue and waits until the requested resources are available. A job's status can be checked with "qstat" (PBS/Torque) or "squeue" (Slurm).

Examples (PBS/Torque):
i) Check all jobs
$> qstat
ii) Check the jobs of user USER
$> qstat -u USER

Examples (Slurm):
i) Check all jobs
$> squeue
ii) Check the jobs of user USER
$> squeue -u USER
iii) Get information about the queue and/or nodes
$> sinfo [-l] [-N]

Moreover, Slurm provides detailed history and accounting information about one's own jobs through "sacct".

$> sacct
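
When many jobs are queued, a per-state summary is often easier to read than the full squeue listing. The following is a minimal sketch, assuming that squeue is called with "-h" (no header) and the format "%T" (job state, one per line). The helper count_states is hypothetical.

```shell
#!/bin/bash
# Sketch: summarize job states, one "STATE=count" line per state.

# Read one job state per line from stdin and count how often each occurs.
count_states() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Query the own jobs only when squeue is available on this machine.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "${USER}" -o "%T" | count_states
fi
```

The counting is kept separate from the squeue call so that it can be checked with any list of states.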

Editing a job

One can edit a job's resource request using "qalter" (PBS/Torque) or "scontrol" (Slurm).

Examples (PBS/Torque):
$> qalter JOB_ID -l walltime=24:00:00

(Under construction)
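
For Slurm, a job's time limit is changed with "scontrol update". The following is a minimal sketch that only prints the command instead of running it. The job ID 12345, the new limit and the helper build_update_cmd are hypothetical, and whether a limit may be raised by a regular user depends on the site configuration.

```shell
#!/bin/bash
# Sketch: assemble the Slurm command that changes a job's time limit.
# The job ID 12345 and the new limit are hypothetical example values.

# build_update_cmd JOB_ID TIME: print the scontrol call without running it.
build_update_cmd() {
    printf 'scontrol update JobId=%s TimeLimit=%s\n' "$1" "$2"
}

# Dry run: show the command so it can be checked before it is executed by hand.
build_update_cmd 12345 12:00:00
```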

Job abortion

Computing jobs can be canceled (whether running or still queued) using "qdel" (PBS/Torque) or "scancel" (Slurm).

Examples (PBS/Torque):
$> qdel JOB_ID

Examples (Slurm):
$> scancel JOB_ID

Advanced resource reservation

It is possible to request special resources based on an assigned property, or to request or exclude specific nodes.

Examples (PBS/Torque):
i) Request 2 nodes (40 threads each) with the compute20 property:
$> qsub -l nodes=2:ppn=40:compute20
ii) Request nodes r02n01 and r02n02 of btrzx2 (Note: btrzx2 and btrzx4 nodes must be requested with an additional "-0" appended to the node names):
$> qsub -l nodes=r02n01-0:ppn=40+r02n02-0:ppn=40

Examples (Slurm, shown for interactive jobs; the same flags can be used in a submit file. All srun options must precede the command to run, here tcsh):
i) Request 2 nodes with a constraint list CONSTR
$> srun -N 2 --constraint=CONSTR --pty tcsh
ii) Request the nodes r01n01 and r01n02 on btrzx1
$> srun -N 2 --nodelist=r01n01,r01n02 --pty tcsh
iii) Exclude r01n01 on btrzx1 from the following job request
$> srun -N 1 --exclude=r01n01 --pty tcsh
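
In a submit file, the same Slurm options are written as "#SBATCH" lines. The following is a minimal sketch, assuming the placeholder constraint CONSTR and the node name r01n01 from the examples above. The five-minute time limit is an arbitrary example value.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --constraint=CONSTR
#SBATCH --exclude=r01n01
#SBATCH --time=00:05:00

# Slurm sets SLURM_JOB_NODELIST inside a job. Outside Slurm the variable is
# unset, so a note is printed instead.
nodes=${SLURM_JOB_NODELIST:-"(not running under Slurm)"}
echo "allocated nodes: ${nodes}"
```

Printing the node list at the start of a job makes it easy to confirm from the output file that the request was honored.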

Array jobs

Job arrays are useful for running many jobs that differ only in a few parameters.

Example (Slurm):
Assume a parameter list stored in a text file named "paramlist":

#Generated parameterlist to demonstrate job arrays.
#1st    2nd     3rd     4th
-s      8       100     100
-s      8       100     200
-s      8       100     300
-s      8       100     400
-s      8       200     100
-s      8       200     200
-s      8       200     300
-s      8       200     400
...
-s      256     300     400
-s      256     400     100
-s      256     400     200
-s      256     400     300
-s      256     400     400
#EOF

To iterate over this list, for example over the first 10 parameter sets, one can use an array job with the following script:

#!/bin/bash
#SBATCH --array=1-10  # Describe your array first
#SBATCH --job-name=job_array
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=ex_%A_%a.out   # err/out file with <JOBID>_<JOBARRAY_INDEX>
#SBATCH --error=ex_%A_%a.err
#SBATCH --time=00:05:00

# Start the example.sh script with the <n>th parameter set from the list.
# grep -v "^#" skips comment lines; sed -n "${line}p" prints only line <n>.
# The command substitution is left unquoted on purpose so that the
# parameters are passed to example.sh as separate arguments.
line=${SLURM_ARRAY_TASK_ID}
./example.sh $(grep -v "^#" paramlist | sed -n "${line}p")
#EOF

This starts "./example.sh" with the <n>th parameter set of paramlist, as follows:

$ ./example.sh -s      8       100     100
...
$ ./example.sh -s      8       300     200
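
The line selection can be tried out without submitting anything. The following is a minimal sketch, assuming a shortened demonstration list. The helper get_params and the fallback index 2 are hypothetical and only serve to run the snippet outside Slurm.

```shell
#!/bin/bash
# Sketch: select the n-th parameter set from a list, skipping comment lines.

# get_params FILE N: print the N-th line of FILE that does not start with "#".
get_params() {
    grep -v '^#' "$1" | sed -n "${2}p"
}

# Build a small demonstration list in a temporary file.
demo_list=$(mktemp)
cat > "$demo_list" <<'EOF'
#1st    2nd     3rd     4th
-s      8       100     100
-s      8       100     200
-s      8       100     300
EOF

# Inside an array job, SLURM_ARRAY_TASK_ID holds the index. It defaults to 2
# here so that the sketch also runs outside Slurm.
index=${SLURM_ARRAY_TASK_ID:-2}
params=$(get_params "$demo_list" "$index")
echo "task ${index} would run: ./example.sh ${params}"
rm -f "$demo_list"
```

Checking the selected line this way before submitting helps to catch off-by-one errors caused by comment lines in the list.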


Sending signals to jobs

Jobs can be sent a signal using "qsig" (PBS/Torque) or "scancel" (Slurm) with the "-s" or "--signal" option.

Examples (PBS/Torque):
$> qsig -s USR1 JOB_ID

Examples (Slurm):
$> scancel -s USR1 JOB_ID
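
What a job does with a signal is up to the job script. The following is a minimal sketch of a script that reacts to USR1, assuming the signal reaches the shell that runs the script; whether it does depends on the resource manager and its options. The handler on_usr1 and the variable checkpoint_requested are hypothetical, and the script signals itself to simulate the delivery.

```shell
#!/bin/bash
# Sketch: a job script body that reacts to SIGUSR1, e.g. to write a checkpoint.

checkpoint_requested=0

# Handler: only record the request; the main program decides when to act on it.
on_usr1() {
    checkpoint_requested=1
    echo "USR1 received: checkpoint requested"
}
trap on_usr1 USR1

# Simulate the resource manager: a child process sends USR1 to its parent,
# which is the shell running this script.
sh -c 'kill -USR1 "$PPID"'

echo "checkpoint_requested=${checkpoint_requested}"
```

Keeping the handler short and setting only a flag avoids interrupting the actual computation at an arbitrary point.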


Responsible for content: Dr. rer. nat. Ingo Schelter
