Research Center for Scientific Computing at the University of Bayreuth

Getting started

General remarks

When getting started with supercomputing on one of the University of Bayreuth's computer clusters, new users typically face several basic questions: Which clusters are available? What are their characteristics? How does one get access? This page covers topics of this kind and answers some frequently asked questions in order to help new users.

Basic cluster organization and use principle

First of all, there are several computer clusters at the University of Bayreuth, which are regularly replaced by newer systems in a rolling fashion. The currently available clusters and their characteristics are listed from newest to oldest on the clusters page.

A computer cluster consists of so-called nodes, which are fully operational computers with CPUs, main memory, a mainboard, a hard drive or SSD, network interfaces, and their own operating system. A cluster usually comprises

  • a management node for administration purposes, which is not accessible to users,
  • one or more login nodes which form the interface between the cluster and the rest of the world,
  • several compute nodes which perform the computations,
  • a high-performance network which connects the nodes,
  • several further hardware components such as switches or power distributors, and
  • software such as compilers, libraries, or application software which is available on the login and compute nodes.

The only components visible to a regular user are the login nodes. These allow users to log in, compile their own source code, and prepare and submit computation jobs. A computation job usually consists of a job script which defines the required resources and how to perform the job. By submitting this script, one queues the job: it receives a Job-ID and ends up in a job queue together with the other computation jobs waiting for execution. As a handy feature, one can configure the job script to, e.g., send an email to a given address when the job starts and finishes, or define dependencies between different jobs. We explain these steps in the paragraphs below.

Login

Users can access the login nodes of a cluster (e.g., btrzx2-1.rz.uni-bayreuth.de) via ssh:

$> ssh user@btrzx2-1.rz.uni-bayreuth.de

The ITS home directories as well as different public and group-specific file systems are mounted on the login and compute nodes of a cluster (e.g., /scratch/ on btrzx2 and btrzx4 or /panasas/scratch/ on btrzx3). After login, the user ends up in a Linux shell in their regular ITS home directory (the one which is also accessible, e.g., from the PC pools) or in a cluster-specific home directory (on btrzx1). One can set the default shell (bash or tcsh) in the ITS self-service portal.

When logging into a cluster, a shell-specific configuration file (.tcshrc or .bashrc, respectively) is automatically loaded. One can use this file to set up a default environment and configuration. Here, we provide an example script for the tcsh for the clusters btrzx2, btrzx4, btrzx3, and btrzx5 which uses the CLUSTER environment variable to identify the cluster and loads environment modules that provide the respective Intel compilers and libraries.
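
A minimal .tcshrc sketch along these lines (the module name for btrzx2 is taken from the module listing in the next section; the other branch is a placeholder that must be adapted to the respective cluster):

if ( $?CLUSTER ) then
    switch ( $CLUSTER )
        case btrzx2:
            module load intel_parallel_studio_xe_2018_update2
            breaksw
        default:
            # adapt here for btrzx3, btrzx4, and btrzx5
            breaksw
    endsw
endif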

Environment setup

Users can control the available software via environment variables, which are managed by environment modules. The command module avail shows all available modules, e.g., on btrzx2:


----------------------------- /cluster/modulefiles -----------------------------
apps/abaqus/2016.hf5                  intel/mpi/5.1.3.258
apps/ansys/18.2                       intel/python2/2.7.14
apps/BerkeleyGW/1.2.0                 intel/python3/3.6.3
apps/BerkeleyGW/7189                  intel/tbb/2016.4.258
apps/gams/24.7                        intel/tbb/2018.2.199
apps/HipGISAX/03_2018                 intel_parallel_studio_xe_2016_update4
apps/likwid/4.2.0                     intel_parallel_studio_xe_2018_update2
apps/likwid/4.3.1                     java/jdk/10.0.1
...

One can load or unload a module via module load and module unload, respectively. Furthermore, one can list all currently loaded modules via module list and show the effect of a module using module show. As an example, module show apps/likwid/4.2.0 returns


-------------------------------------------------------------------
/cluster/modulefiles/apps/likwid/4.2.0:

module-whatis    likwid performance tools 4.2.0
append-path      LD_LIBRARY_PATH /cluster/bayreuth/likwid/4.2.0/lib
append-path      PATH /cluster/bayreuth/likwid/4.2.0/bin
append-path      MANPATH /cluster/bayreuth/likwid/4.2.0/share/man
-------------------------------------------------------------------

In this case, loading the likwid-4.2.0 module above appends the Likwid library, binary, and manual paths to the respective environment variables and, thus, makes the Likwid software available. An example is likwid-topology, which shows detailed hardware information about the node. By default, several modules are available that provide compilers, numerical libraries, and application software. If you need software that is not already installed, please contact the HPC administrators.
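
For example, one could make the Likwid tools available and inspect the login node as follows (module name taken from the listing above):

$> module load apps/likwid/4.2.0
$> likwid-topology
$> module list

likwid-topology then prints the node's CPU, cache, and NUMA topology, and module list confirms that the module has been loaded.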

Basic job control

On the clusters, a resource manager (e.g., Torque or Slurm) and possibly a back-end scheduler (e.g., Maui) run on the management node and are responsible for job control: they accept and enqueue new jobs and start them on the appropriate compute nodes. As a user, one typically starts a job on a login node by submitting a job script using

$> qsub jobscript
with Torque on btrzx2, btrzx4, btrzx3, and btrzx5, or
$> sbatch jobscript
with Slurm on btrzx1.

A job script is a simple shell script with two sections: a header, which informs the resource manager about the required resources, and a body, which tells the compute nodes how to perform the job. The header lines begin with #PBS (Torque) or #SBATCH (Slurm). One can request special resources (e.g., a certain architecture or software license) via property attributes such as compute20, phi, cuda, or qchem. Moreover, some clusters provide different queues. As an example, the default queue of a cluster may only accept jobs with a maximum wall time of 50 hours but is not restricted otherwise, while a long queue could allow jobs with a higher wall time limit but be restricted to half of the available nodes. Consult the cluster page for a list of available properties and queues.

Example for the submit file header on btrzx2 using Torque:
#!/bin/tcsh
#PBS -l nodes=4:ppn=40:compute20,walltime=10:00:00
#PBS -q default
#PBS -j oe
#PBS -m abe
#PBS -M john.doe@uni-bayreuth.de
#PBS -N JOBNAME

Example for the submit file header on btrzx1 using Slurm:
#!/bin/bash
#SBATCH -J testjob # (job name)
#SBATCH -o job.%j.out # (name of the job output file, %j expands to the job ID)
#SBATCH -N 2 # (Number of requested nodes)
#SBATCH --ntasks-per-node 32 # (Number of requested cores per node)
#SBATCH -t 01:30:00 # (Requested wall time)

In the Torque example, the first line is called a shebang and defines the interpreting shell for the script. In the lines beginning with #PBS, one requests (-l) 4 nodes with the compute20 property and 40 slots per node (logical cores, see hyperthreading) for a wall time of 10 hours in the (-q) default queue. The job name (-N) is JOBNAME (or the name of the job script if the -N option is absent), the output and error streams are joined into a single file (-j oe) called JOBNAME.oJob-ID, and the PBS (Torque) sends an email to the address given after the -M option when the job begins, ends, or aborts (-m abe). This list of options is not complete. For example, the PBS can handle job dependencies such as "start job Y only when job X finished properly" or different queues (if available). We describe these options in the advanced topics. You can find all available options for qsub at the command reference website.
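
As a sketch of such a dependency with Torque (job IDs are illustrative), one could submit job Y so that it only starts after job X has finished successfully:

$> qsub jobX.sub
100914.mgmt01
$> qsub -W depend=afterok:100914.mgmt01 jobY.sub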

After the user has submitted the job via qsub (Torque) or sbatch (Slurm), it is checked and hopefully accepted.

$> qsub JOBSCRIPT
100918.mgmt01

Thereafter, the job gets a job ID (here: 100918) and a priority and waits for its execution in the job queue which is managed by the scheduler. To check a job's status, one can use the qstat (Torque) or squeue (Slurm) command, which returns the status of all active jobs:

Job ID           Name        User       Time   S Queue
------------- ----------- ----------- -------- - -----
...       
100914.mgmt01    CC18        user1    04:08:29 R default          
100917.mgmt01    trit34      user2    56:06:43 R default        
100918.mgmt01    JOBNAME     USER            0 Q default

The status R means 'running' while Q means 'queued', i.e., waiting for execution. Moreover, one gets an overview of the requested resources and the current wall time of running jobs. More helpful in many situations are qstat and squeue with the -u USER option, which display only the user's own jobs:

                                                       Req'd   Req'd      Elap
Job ID        Username Queue  Jobname SessID NDS  TSK  Memory  Time    S  Time
------------- -------- ----- -------- ------ ---- ---- ------ -------- - ---------
100918.mgmt01   USER  default JOBNAME 148789  4   160    --   10:00:00 R 00:00:11

Finally, qdel Job-ID (Torque) or scancel Job-ID (Slurm) can be used to abort a job (running or not).
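
For the job from the examples above this would read (the Slurm line is analogous; its job ID is illustrative):

$> qdel 100918.mgmt01
$> scancel 100918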

Writing a job script

While the previous section explained how to set up the header of a job script and how to submit, check, and abort jobs, this section is dedicated to the body of the job script, which is executed by the compute nodes. To this end, we provide example job scripts for the different clusters here and explain their content in the following by means of a simplified version of the x2.compute20.sub example. The latter is designed for a purely MPI-parallel application.

#!/bin/tcsh
#PBS -l nodes=2:ppn=40:compute20,walltime=100:00:00
#PBS -j oe
#PBS -m abe
#PBS -M [My_Mail_Address]

The header of the job script looks similar to the one in the section above. In detail, we request two compute20 nodes on btrzx2 for a wall time of 100 hours. For later use, let us assume that nodes r02n60 and r02n59 are the compute nodes assigned to this job, which has ID 123456.

module load intel_parallel_studio_xe_2016_update4

unlimit
limit coredumpsize 0

echo 'Job id: '$PBS_JOBID
echo 'Directory: '$PBS_O_WORKDIR
cd $PBS_O_WORKDIR

In these lines, the script loads the Intel module which provides the Intel compiler, libraries, and Intel MPI. Next, resource limitations, e.g., for the stack size, are lifted, and the script prints the job ID and working directory (stored in the environment variables PBS_JOBID and PBS_O_WORKDIR) to the standard output (redirected into the file x2.compute20.sub.o123456). Finally, the script changes to the working directory, which is the directory from which the user submitted the job.

awk '{if(NR %2 ==0) print $1 ;}' $PBS_NODEFILE > mpd.hosts
set hostlist = (`cat mpd.hosts`)
set nodes = $#hostlist

When the job starts, the PBS provides a host file whose path is stored in the environment variable PBS_NODEFILE. The host file contains one entry for each slot on each node assigned to the job (here: 40 entries for r02n60 followed by 40 entries for r02n59). Since hyperthreading is enabled on btrzx2 but not wanted here, the awk command writes only every second entry of the host file into a new host file called mpd.hosts. In the following two lines, the script sets the local variable hostlist to the contents of the new host file and the variable nodes, which becomes the number of MPI processes below, to the number of entries in it.

/cluster/bayreuth/iws/bin/memsweeper mpd.hosts
/cluster/bayreuth/iws/bin/drop_buffers mpd.hosts

The two lines above call two scripts that are available on the clusters. The memsweeper script calls likwid-memsweeper on all nodes in the host file and drop_buffers empties the main memory caches of those nodes in order to clean up their main memories. This is useful if those memories are still occupied with data from previous jobs. However, on btrzx2 those scripts are automatically called after each job within an epilogue script.

$MPI_RUN -genv I_MPI_DEBUG=4 -prepend-rank -machinefile mpd.hosts -n $nodes [My_Program] >& output.out

Finally, the script calls mpirun through the environment variable MPI_RUN (set by the Intel environment module) to start the MPI program on the hosts in mpd.hosts with as many MPI processes as there are entries in that file (-n $nodes). While the command above is rather minimalistic, there are many useful options to mpirun. For example, if different MPI processes print messages, the -prepend-rank option prepends each line of output with the rank of the writing MPI process. Furthermore, mpirun can print the mapping of MPI processes to cores when the environment variable I_MPI_DEBUG=4 is set. In general, mpirun allows defining environment variables through the -genv option. Finally, one can redirect the standard and error output of the MPI program into a file by appending, e.g., >& output.out.

Further remarks / FAQ

Exceeding a job's wall time:
It happens that a job exceeds its wall time. In this case, the PBS sends a SIGTERM signal followed by a delayed SIGKILL. Usually, the SIGTERM already terminates the job, which often means that the job is simply killed without returning valuable data. In principle, a program can handle the SIGTERM signal (see signal handling) and, e.g., write out enough information to restart from the point where it was interrupted. The SIGKILL, however, cannot be caught and terminates the job in any case.
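
As a minimal sketch of this idea at the level of a bash job script (my_program and the state files are illustrative names; a compiled program would instead install its own signal handler, e.g., via sigaction):

#!/bin/bash
# Sketch only: catch the scheduler's SIGTERM and save restart information
# before the delayed SIGKILL arrives.
save_restart() {
    echo "SIGTERM caught, writing restart information"
    cp state.dat state.restart
    exit 1
}
trap save_restart TERM

./my_program &     # start the long-running program in the background ...
wait $!            # ... so that the trap can fire while it is still running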

Requested resources:

  • In the case of a sequential program (which is not what the clusters are primarily intended for), one should request a single node with ppn=1 (or ppn=2 if hyperthreading is enabled); see the sketches after this list.
  • If the code is able to run in parallel using OpenMP, one is restricted to a shared-memory domain (usually a single node). Moreover, be aware that even if the main memory of a node appears as a single logical memory, it usually consists of more than one physical memory block connected by an intra-node network. Therefore, running an OpenMP job (which runs well on a single-CPU machine) on a node with multiple CPUs and physical memory interfaces without taking special care usually results in a dramatic loss in performance (see NUMA).
  • Finally, for MPI jobs the optimal resources depend on the parallelizability of the code and the interplay between the node-level performance and the speed of the high-performance network in between. In this case, there is often a tradeoff between the width of a job (i.e., the number of nodes) and the required wall time. A wide job requires less time for the computation but usually has a long waiting time in the queue and could lose efficiency if the code does not scale sufficiently.
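
Two sketched Torque header fragments for the first two cases (wall times are illustrative; the compute20 property and ppn=40 refer to btrzx2):

# Sequential job: a single slot on one node (use ppn=2 if hyperthreading is enabled)
#PBS -l nodes=1:ppn=1,walltime=24:00:00

# OpenMP job: one compute20 node, one thread per physical core
#PBS -l nodes=1:ppn=40:compute20,walltime=24:00:00
# ... and in the job body (tcsh), limit the number of OpenMP threads:
setenv OMP_NUM_THREADS 20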

Requested walltime:
The requested wall time also depends strongly on the job requirements and the requested resources. One should always request a wall time that is sufficient for the job to finish, with a reasonable buffer, which requires some experience. However, note that there is an upper limit for the wall time (usually 200 hours). In any case, simply requesting the wall time limit for every job is strongly discouraged since there are mechanisms (e.g., backfilling) which allow short jobs to bypass wider jobs that are waiting for all requested nodes to become free. Using the wall time limit as the default for all jobs prevents the PBS and the scheduler from planning ahead properly and using techniques like backfilling, which ultimately results in a suboptimal utilization of resources.

Reservations:
Note that some resources are reserved for special users or groups, or within a certain time slot for, e.g., a course. Nevertheless, a user can in principle request those resources for a job and, as long as those resources exist, the PBS accepts the job. However, the job is kept in the queue until the requested resources become available, which never happens if the reservations are permanent.

Reasons for jobs not starting:
There are several natural reasons that prevent a job from running even if the desired resources are (or seem to be) unoccupied.

  • The first possible reason is that the job requires resources that exist but are not available (e.g., due to one of the reservations mentioned above).
  • The second possibility is that there is a wide job in the queue with a higher priority than your own job. In this case, the PBS waits for the remaining resources in order to start the wider job. If the wall time of your own job is longer than the remaining waiting time for those resources, your job cannot be backfilled and must wait.
  • A third reason is that nodes still appear as free in the visual representation of the cluster status on this website as long as not all slots on the node are occupied, so they may only seem to be idle. As an example, if a 1-core job is running on a node with 8 cores, the node appears as free, but an 8-core job cannot run on it. Consult the respective, more detailed table to identify this situation.

Hints for high cluster utilization and smaller queueing times:
If your job does not necessarily require a specific kind of resource, it can be a good practice to look for free resources on different clusters or on different node types on the same cluster (e.g., compute8 or compute20 on btrzx2). Sometimes, there is a long queue for compute20 nodes while compute8 nodes are idle.

How to treat unwanted hyperthreading:
On some of the clusters, such as btrzx2, hyperthreading is enabled. This means that a physical core appears as two (or even more) logical cores. If one requests, e.g., compute20 nodes on btrzx2 with ppn=40 (there are 20 physical cores with 2 hyperthreads each), the available slots (40 entries per node) are listed in a host file whose path is stored in the environment variable PBS_NODEFILE. This host file can be given to mpirun such that 40 MPI processes are started per node, i.e., two MPI processes are mapped onto a single physical core. However, it is often a better idea to run only 20 MPI processes per node, i.e., one per physical core. One can accomplish this, e.g., by removing every second line from the host file (see hyperthreading and the awk command in the examples above).


Responsible for editorial content: Dr. rer. nat. Ingo Schelter
