6. Working on remote clusters
If you intend to run CONCEPT on your local machine only, you should skip this section.
If you are running CONCEPT on a remote machine, i.e. logged in to a server/cluster via ssh, you’ve so far had to supply the additional --local option (or set CONCEPT_local=True) to the concept script. This is because CONCEPT has built-in support for submission of jobs to a job scheduler / queueing system / resource manager (specifically Slurm, TORQUE and PBS). If you are working remotely but do not intend to use a job scheduler, keep using --local and skip the rest of this section.
Submitting jobs
If you try to run a simulation without the --local option while logged into a remote server, CONCEPT will exit immediately, letting you know that it has created a job script named job/.jobscript_<date>, where <date> is a string of numbers labelling the creation time. This job script is a great starting point if you want to control the job submission yourself. It can be used as is, though you might want to edit/add directives at the top of the job script.
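For example, assuming your cluster runs Slurm, you could (after possibly editing the directives) submit the generated job script yourself using Slurm's sbatch command; on TORQUE/PBS the corresponding command is qsub. A minimal sketch:

sbatch job/.jobscript_<date> # Slurm; replace <date> with the actual creation time label
qsub job/.jobscript_<date> # TORQUE/PBS equivalent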
To automatically submit a given CONCEPT run, simply supply --submit to the concept script. Submitting a simulation using the parameter file param/tutorial-6 on 8 cores then looks like
./concept \
-p param/tutorial-6 \
-n 8 \
--submit
Tip
If remote CONCEPT jobs mysteriously fail, check out the ‘Problems when running remotely’ troubleshooting entry.
Typically you will need to specify a queue (called a partition in Slurm) to which you wish to submit the job. To do this, use the -q option:
./concept \
-p param/tutorial-6 \
-n 8 \
--submit \
-q <queue> # replace <queue> with queue name
As queue specification is only meaningful when submitting the job (local runs do not run within a queue), specifying -q in fact implies --submit. That is, the above can be shortened to just
./concept \
-p param/tutorial-6 \
-n 8 \
-q <queue> # replace <queue> with queue name
The 8 cores may be distributed over several (compute) nodes of the cluster. If you wish to control the number of nodes and number of cores per node, use e.g. -n 1:8 to request 1 node with 8 cores, or -n 2:4 to request 2 nodes each with 4 cores.
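For example, a submission requesting 2 nodes with 4 cores each could look like

./concept \
-p param/tutorial-6 \
-n 2:4 \
-q <queue> # replace <queue> with queue name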
Note
As for queue specification, specification of the number of nodes to use is only meaningful when running the job remotely, so --submit is implied whenever a number of nodes is specified. Similarly, --memory and -w (see below) also imply --submit. Should you wish to not submit the job but just generate the job script, with the information from e.g. -q contained within it, use --submit False.
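For example, the following merely generates a job script, with the queue information contained within it, without submitting it:

./concept \
-p param/tutorial-6 \
-n 8 \
-q <queue> \
--submit False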
To specify a memory requirement, further supply --memory <memory>, where <memory> is the total memory required collectively by all cores on all nodes. Examples of legal memory specifications include --memory 8192MB, --memory 8192M, --memory 8G and --memory "2*4G", all of which specify 8 gigabytes, i.e. 1 gigabyte per core if running with a total of 8 cores.
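For example, requesting 8 gigabytes in total for an 8-core run could look like

./concept \
-p param/tutorial-6 \
-n 8 \
-q <queue> \
--memory 8G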
To specify a wall time limit, i.e. a maximum time within which the simulation is expected to be completed, further supply the -w <wall-time> option. Examples of legal wall time specifications include -w 60min, -w 60m, -w 1hr and -w 1h, which all request one hour of wall time.
A complete CONCEPT job submission could then look like
./concept \
-p param/tutorial-6 \
-n 8 \
-q <queue> \
--mem 8G \
-w 1h
Tip
Note that in the above, --memory is shortened to --mem. Generally, as long as no conflict occurs with other options, you may shorten any option to concept in this manner. Also, the order in which the options are supplied does not matter.
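For example, the above submission could equally well be written with the options in a different order:

./concept \
-q <queue> \
-w 1h \
--mem 8G \
-n 8 \
-p param/tutorial-6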
A copy of the generated and submitted job script will be placed in the job/<ID> directory.
The watch utility
Once a job is submitted, CONCEPT will notify you that you may now kill (Ctrl+C) the running process. If you don’t, the submitted job is continually monitored, and its output will be printed to the screen once it starts running, as if you were running the simulation locally. This is handled by the watch utility, which is automatically called after job submission. It works by continually printing out updates to the log file in close to real time.
If you don’t want to watch the job after submission, you may supply the --watch False option to concept instead of having to kill the process after submission.
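A submission that returns immediately after submitting, without watching the job, could then look like

./concept \
-p param/tutorial-6 \
-n 8 \
-q <queue> \
--watch False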
You may manually run the watch utility at any later time, like so:
./concept -u watch <ID> # replace <ID> with job ID of remote running job
The job ID, and hence the log file name, of a submitted job is determined by the job scheduler, and is printed as soon as the job is submitted. You may also leave out the job ID when running the watch utility, in which case the latest submitted, running job will be watched. Again, to exit, simply press Ctrl+C.
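For example, to watch the latest submitted, running job:

./concept -u watch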
Using a pre-installed MPI library
The installation procedure described in this tutorial installed CONCEPT along with every dependency, with no regard for possibly pre-installed libraries. Though this is generally the recommended approach, for running serious, multi-node simulations you should make use of an MPI library native to the cluster, in order to ensure optimal network performance.