6. Working on remote clusters

If you intend to run CONCEPT on your local machine only, you should skip this section.

If you are running CONCEPT on a remote machine, i.e. logged in to a server/cluster via ssh, you’ve so far had to supply the additional --local option (or set CONCEPT_local=True) to the concept script. This is because CONCEPT has built-in support for submission of jobs to a job scheduler / queueing system / resource manager (specifically Slurm, TORQUE and PBS). If you are working remotely but do not intend to use a job scheduler, keep using --local and skip the rest of this section.
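
For reference, a purely local run on the remote machine (no scheduler involved) looks just like a run on your own machine; the command below simply reuses the parameter file from the examples further down:

./concept \
    -p param/tutorial-6 \
    -n 8 \
    --local

Setting CONCEPT_local=True in the environment is equivalent to supplying --local.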

Submitting jobs

If you try to run a simulation without the --local option while logged into a remote server, CONCEPT will exit immediately, letting you know that it has created a job script named job/.jobscript_<date>, where <date> is a string of numbers labelling the creation time. This job script is a great starting point if you want to control the job submission yourself. It can be used as is, though you might want to edit or add directives at its top.
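
What these directives look like depends on your scheduler, and the ones in your generated job script will differ. As a purely hypothetical illustration, Slurm directives resemble

#SBATCH --partition=<queue>
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=8G
#SBATCH --time=01:00:00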

To automatically submit a given CONCEPT run, simply supply --submit to the concept script. Submitting a simulation that uses the parameter file param/tutorial-6 and 8 cores then looks like

./concept \
    -p param/tutorial-6 \
    -n 8 \
    --submit

Tip

If remote CONCEPT jobs mysteriously fail, check out the ‘Problems when running remotely’ troubleshooting entry.

Typically you will need to specify a queue (called partition in Slurm) for which you wish to submit the job. To do this, use the -q option:

./concept \
    -p param/tutorial-6 \
    -n 8 \
    --submit \
    -q <queue>  # replace <queue> with queue name

As queue specification is only meaningful when submitting the job (local runs do not run within a queue), specifying -q in fact implies --submit. That is, the above can be shortened to just

./concept \
    -p param/tutorial-6 \
    -n 8 \
    -q <queue>  # replace <queue> with queue name

The 8 cores may be distributed over several (compute) nodes of the cluster. If you wish to control the number of nodes and number of cores per node, use e.g. -n 1:8 to request 1 node with 8 cores, or -n 2:4 to request 2 nodes each with 4 cores.
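
For example, requesting 2 nodes with 4 cores each looks like

./concept \
    -p param/tutorial-6 \
    -n 2:4 \
    -q <queue>  # replace <queue> with queue name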

Note

As with the queue, specifying the number of nodes to use is only meaningful when running the job remotely, so --submit is implied whenever a number of nodes is specified. Similarly, --memory and -w (see below) also imply --submit. Should you wish not to submit the job but just generate the job script, with the information from e.g. -q contained within it, use --submit False.
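
For example, to only generate a job script for 2 nodes with 4 cores each in a given queue:

./concept \
    -p param/tutorial-6 \
    -n 2:4 \
    -q <queue> \
    --submit False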

To specify a memory requirement, further supply --memory <memory>, where <memory> is the total memory required collectively by all cores on all nodes. Examples of legal memory specifications include --memory 8192MB, --memory 8192M, --memory 8G and --memory "2*4G", all of which specify 8 gigabytes, i.e. 1 gigabyte per core if running with a total of 8 cores.

To specify a wall time limit, i.e. a maximum time within which the simulation is expected to be completed, further supply the -w <wall-time> option. Examples of legal wall time specifications include -w 60min, -w 60m, -w 1hr, -w 1h, which all request one hour of wall time.

A complete CONCEPT job submission could then look like

./concept \
    -p param/tutorial-6 \
    -n 8 \
    -q <queue> \
    --mem 8G \
    -w 1h

Tip

Note that in the above, --memory is shortened to --mem. Generally, as long as no conflict occurs with other options, you may shorten any option to concept in this manner. Also, the order in which the options are supplied does not matter.
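
For instance, the submission above could equally well be written with the options shuffled around:

./concept \
    -w 1h \
    --mem 8G \
    -q <queue> \
    -n 8 \
    -p param/tutorial-6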

A copy of the generated and submitted job script will be placed in the job/<ID> directory.

The watch utility

Once a job is submitted, CONCEPT will notify you that you may now kill (Ctrl+C) the running process. If you don’t, the submitted job is continually monitored, and its output will be printed to the screen once it starts running, as if you were running the simulation locally. This is handled by the watch utility, which is automatically called after job submission. It works by continually printing out new content appended to the log file, in close to real time.

If you don’t want to watch the job after submission, you may supply the --watch False option to concept instead of having to kill the process after submission.
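
For example:

./concept \
    -p param/tutorial-6 \
    -n 8 \
    -q <queue> \
    --watch False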

You may manually run the watch utility at any later time, like so:

./concept -u watch <ID>  # replace <ID> with job ID of remote running job

The job ID — and hence log file name — of submitted jobs is determined by the job scheduler, and is printed as soon as the job is submitted. You may also leave out any job ID when running the watch utility, in which case the latest submitted, running job will be watched. Again, to exit, simply press Ctrl+C.
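
That is, to watch the most recently submitted, running job:

./concept -u watch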

Using a pre-installed MPI library

The installation procedure described in this tutorial installed CONCEPT along with every dependency, with no regard for possibly pre-installed libraries. Though this self-contained installation is generally recommended, for running serious, multi-node simulations one should make use of an MPI library native to the cluster, in order to ensure optimal network performance.
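
Exactly how such a native MPI library is made available is cluster specific. As a hypothetical sketch, on clusters using environment modules you would typically locate and load it like so (module names vary from cluster to cluster):

module avail         # list available software modules, MPI libraries among them
module load openmpi  # load a cluster-provided MPI library (hypothetical name)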