Troubleshooting

This page contains solutions to and helpful information about possible issues encountered when using CONCEPT. If help from this page is insufficient to solve your problem, do not hesitate to open an issue or contact the author at dakin@phys.au.dk.

Installation failed

We strive for a trivial installation process on as many Linux systems as possible. If the standard installation process (with every dependency allowed to be installed from scratch) keeps failing for some inexplicable reason, you may try looking for a clue in the logged installation output (of which there is a lot), stored in the .tmp/install_log and .tmp/install_log_err files, with the .tmp directory located within the installation directory.

One possible source of trouble is corrupted downloads. The install script downloads the source code of every primary dependency into the .tmp directory. If you suspect a corrupt download, you can try deleting subdirectories within this directory, which will trigger re-downloads.
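
For example, to trigger a re-download of a single dependency (the subdirectory name below is only an illustration; list .tmp to see the actual names on your system), you may do

ls .tmp           # see which dependency subdirectories are present
rm -rf .tmp/fftw  # example: remove a suspect download

from within the installation directory, after which the next installation attempt will download it anew.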

It may happen that some dependency program fails to install due to some other dependency not working correctly. You may try adding the --tests option when invoking the install script, which tests the vast majority of the dependency programs after/during their individual installation. Carefully looking through the installation log files for failed tests may then reveal something.
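
As a sketch (with paths being illustrative and the install script assumed to be available as ./install, as elsewhere on this page), this could look like

./install --tests "${HOME}/concept"
grep -in 'fail' "${HOME}/concept/.tmp/install_log"

with the grep afterwards being one crude way of locating failed tests in the log.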

If several compilers are found on the system, different compilers may be used for building different dependencies, which can cause issues. For the best chance of a successful installation, try setting GNU (GCC) as the preferred compiler,

export compiler_precedence="gnu"

and then redo the CONCEPT installation from scratch.
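
A sketch of the full sequence, with the version and installation directory below being illustrative:

export compiler_precedence="gnu"
rm -rf "${HOME}/concept"  # discard the failed installation attempt (double-check the path!)
install_url="https://raw.githubusercontent.com/jmd-dk/concept/v1.0.1/install"
bash <(wget -O- --no-check-certificate "${install_url}") "${HOME}/concept"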

Compilation failed

Compilation of the CONCEPT code takes place as the last essential step during standard installation, as well as when invoking concept after changes have been made to the source files. This process may fail for several reasons; solutions to some of these are described in this entry.

To check whether the problem is confined to the compilation process, run the code in pure Python mode. From within the CONCEPT installation directory, do

./concept --pure-python --local

If this simple invocation of concept fails, the problem is not with the compilation process itself, in which case this troubleshooting entry cannot help you.

Insufficient memory

The minimum memory needed in order for compilation of the code to succeed is about 3 GB, though the exact number depends on the system. If you suspect the cause of compilation errors might be insufficient memory, try out the below steps from within the CONCEPT installation directory.

  • Standard (parallel) compilation:

    ./concept --rebuild --local
    
  • Compile each module in serial:

    make_jobs="-j 1" ./concept --rebuild --local
    
  • Add extra swap memory: If you have root privileges on the system, you can temporarily increase the available memory by adding a swap file:

    n=8
    sudo dd if=/dev/zero of=swapfile bs=1024 count=$((n*2**20))
    sudo chmod 600 swapfile
    sudo mkswap swapfile
    sudo swapon swapfile
    

    This will add an extra 8 GB of swap memory (taken from available disk space), which is plenty. If you do not have that much free disk space, you may try a lower value of n. With this increased amount of memory, try compiling the code again. If it still fails even when compiling serially, insufficient memory is probably not the problem. To clean up the swap file, do

    sudo swapoff swapfile
    sudo rm -f swapfile
    

After successful compilation, CONCEPT will run with the same performance as if the compilation had taken place without trouble.

Dangerous optimizations

If the compilation errors were not due to insufficient memory, it may be that one or more of the applied optimizations cause trouble. First, try compiling without link time optimizations:

./concept --rebuild --linktime-optimizations False --local

Note

If this solves the problem, it may simply be because compilation without LTO requires significantly less memory. You are encouraged to check if you simply have insufficient memory for a fully optimized build.

If disabling link time optimizations makes the code compile, you may consider this a working solution, as the performance improvements obtained through link time optimizations are not crucial.

A much more drastic thing to try is to compile without any optimizations:

./concept --rebuild --optimizations False --local

If this works, the problem is definitely with some of the optimization flags. You should however not run CONCEPT simulations with the compiled code in a completely unoptimized state, as this reduces performance drastically. Instead, experiment with removing individual optimization flags added to the optimizations and optimizations_linker variables within src/Makefile. E.g. get rid of -ffast-math and/or -funroll-loops, and/or substitute the -O3 flag with -O2, then -O1, then -O0, before removing it completely. For each attempt, recompile CONCEPT without --optimizations False.
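
As an illustration (the exact set of flags found in src/Makefile may differ between versions), such an edit could look like

# within src/Makefile, before
optimizations = -O3 -ffast-math -funroll-loops
# within src/Makefile, after: -ffast-math dropped and -O3 lowered to -O2
optimizations = -O2 -funroll-loops

followed by a recompilation:

./concept --rebuild --local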

Terminal colour output looks weird

CONCEPT includes a lot of colour and other formatting in its terminal output. While most modern terminal emulators on Linux (GNOME Terminal, Terminator, xterm, etc.) fully support this, the story is different on other platforms.

  • Windows: If you are running CONCEPT through the Windows Subsystem for Linux and the terminal formatting appears suboptimal, you can install a modern Linux terminal within the Linux subsystem. Note that this requires a running X server on the Windows side.

    If you are running CONCEPT from Windows locally through Docker or remotely via SSH (through e.g. PuTTY), no solution is known.

  • macOS: If you are running CONCEPT from macOS (locally through Docker or remotely via SSH) and the terminal formatting appears suboptimal, try using the superior iTerm2 terminal emulator.

If you want to disable colour and other formatted output altogether, set

enable_terminal_formatting = False

in your CONCEPT parameter files. Note that though this eliminates most formatting, a few elements are still formatted.
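
For example, with a (hypothetical) parameter file param/my_params containing

enable_terminal_formatting = False

a run using it could be started via the -p option:

./concept -p param/my_params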

Error messages containing ‘Read -1’

If you see error messages of the form

Read -1, expected <int>, errno = <int>

whenever you run CONCEPT using more than a single process, it is likely a problem with OpenMPI, more specifically vader/CMA. If CONCEPT otherwise produces correct results, you can silence these messages by placing

export OMPI_MCA_btl_vader_single_copy_mechanism=none

in the .env file of your CONCEPT installation.
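
For example, from within the CONCEPT installation directory:

echo 'export OMPI_MCA_btl_vader_single_copy_mechanism=none' >> .env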

Clock skew and/or modification times in the future

If you receive warnings about clock skew and/or file modification times in the future, it is likely a problem either with the system clock or the time stamps of the CONCEPT source files. Regardless, the problem — as far as it pertains to CONCEPT — is likely to go away if you update the time stamps. To do so, run

touch * src/*

from within the CONCEPT installation directory.

Mixed compiled and pure Python mode

While technically possible, one never wants to run CONCEPT in such a way that some processes run in compiled mode (the default) while others run in pure Python mode. Such a mixed runtime state will be detected and a warning will be emitted.

This can happen if only some processes have access to a CONCEPT build while others do not, causing the latter to fall back to running directly off the Python source. This in turn may happen when running multi-node CONCEPT jobs on a cluster. If this is the case, make sure to specify a build directory which is accessible from all nodes, e.g. a directory within the CONCEPT installation directory.
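
As a sketch (the parameter file below is hypothetical and any further submission options are left out), such a shared build directory can be specified using the -b option:

./concept -p param/my_params -b "${HOME}/concept/build"  # plus your usual remote submission options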

The simulation hangs when calling CLASS

If the simulation hangs right at the beginning of the simulation, at the

Calling CLASS in order to set the cosmic clock …

step, it is probably because you have specified a cosmology that CLASS cannot handle. When running CONCEPT in compiled mode, CLASS may hang rather than exiting with an error message. To see the CLASS error message, run CONCEPT in pure Python mode using the --pure-python command-line option.

Crashes or other bad behaviour

This entry is concerned with problems encountered when using CONCEPT locally. If your problem occurs only for remote jobs, please see the ‘Problems when running remotely’ entry instead.

If you are unable to even compile CONCEPT, see the ‘Compilation failed’ entry.

If you are able to start CONCEPT runs, but they crash, hang, yield obviously wrong results, or exhibit other bad behaviour, it may be due to improper installation or a code bug. To inspect the extent of the erroneous behaviour, try running the full CONCEPT test suite via

./concept -t all

If any tests are unsuccessful and you are running an official version of CONCEPT (i.e. any release version or ‘master’), there is most probably a problem with your installation. You can try reinstalling CONCEPT along with all of its dependencies, perhaps using compilers different from the ones used the first time around.

If all tests pass despite the observed (and reproducible) bad behaviour, you may have found a bug in a code path not covered by the test suite. Please report this.

Problems when running remotely

This entry is concerned with problems encountered specifically with remote CONCEPT jobs. If you have not tried out CONCEPT locally, please do this first. If you encounter problems here as well, please see the ‘Crashes or other bad behaviour’ entry.

Even if CONCEPT runs fine on the front-end of a cluster (i.e. when supplying the --local option to the concept script), you may experience weird behaviour or crashes when running remote jobs. Typically, this is either due to an improper choice of the MPI executor, or the remote nodes having different hardware architecture from the front-end. Possible solutions to both of these problems are provided below.

Choosing an MPI executor

It may help to manually choose a different remote MPI executor. In CONCEPT, this is the term used for the executable with which MPI programs are launched, e.g. mpiexec/mpirun.

To see which MPI executor is used when running remotely, check out the mpi_executor variable in the produced job/<ID>/jobscript file. To manually set the MPI executor, overwrite the dedicated mpi_executor variable in the .env file. Helpful suggestions for the choice of MPI executor depend on the job scheduler in use (Slurm or TORQUE/PBS).
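
For example (with <ID> replaced by the actual job ID), the current choice can be inspected via

grep 'mpi_executor' job/<ID>/jobscript

while examples of setting mpi_executor in the .env file are given below.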

Note

Even if you are using Slurm, it may be that your MPI library is not configured appropriately for srun to be able to correctly launch MPI jobs. This can happen e.g. if you are using an MPI library that was installed by the CONCEPT install script, as opposed to an MPI library configured and installed by a system administrator of the cluster. If the below does not work, try setting the MPI executor as though you were using TORQUE/PBS.

If Slurm is used as the job scheduler and the MPI library used was not installed by the install script as part of the CONCEPT installation, the MPI executor will be set to srun --cpu-bind=none in job scripts by default (or possibly srun --cpu-bind=none --mpi=openmpi if OpenMPI is used). The first thing to try is to leave out --cpu-bind=none, i.e. setting

mpi_executor="srun"

in the .env file. Submit a new job, and you should see the manually chosen MPI executor being respected by job/<ID>/jobscript.

If that did not fix the issue, try specifying the MPI implementation in use, using the --mpi option to srun. E.g. for OpenMPI, set

mpi_executor="srun --mpi=openmpi"

in the .env file. To see which MPI implementations srun supports, run

srun --mpi=list

directly on the front-end. You may wish to try your luck with all of the supported MPI implementations. If you find one that works, do remember to test whether it also works with the added --cpu-bind=none option, as this is preferred.

Note

On some systems, the --cpu-bind=none option is written as --cpu_bind=none, i.e. with an underscore. Try both.

Different hardware architecture on front-end and remote node

If CONCEPT and its dependencies have been installed from the front-end, these have been somewhat tailored to the architecture of the front-end. If the remote node to which you are submitting the CONCEPT job has a different architecture, things might go wrong. A trivial solution is then of course to switch to using a different remote queue/partition with nodes that have similar architecture to that of the front-end.

If you have installed CONCEPT using the standard installation process, CONCEPT itself and all of its dependencies have been built in a somewhat portable manner, meaning that CONCEPT should run fine on architectures different from that of the front-end, as long as they are not too different.

You may try rebuilding the CONCEPT code from the remote node as part of the submitted job, either by passing the --rebuild option to concept or supplying a new build directory with the -b option.

Note that the supposed portability is severely limited if you build CONCEPT with the --native-optimizations option. To rebuild the code without additional non-portable optimizations (default build), use the --rebuild option.

If rebuilding the code with only portable optimizations did not fix the problem, it is worth submitting a remote CONCEPT job without any optimizations via the --optimizations False option to the concept script, just to see what happens. Remember to also supply --rebuild to force recompilation. If this works, you should experiment with src/Makefile as described here, as running in a completely unoptimized state is far from ideal.

To fully ensure compatibility with the architecture of a given node, you may reinstall CONCEPT, including all of its dependencies, from that node. You may either do this by ssh'ing into the node and running the installation manually, or you may submit the installation as a remote job. Below you will find examples of Slurm and TORQUE/PBS job scripts for installing CONCEPT. In both cases you may wish to change concept_version and install_dir, load modules or perform other environment changes, and/or make use of a pre-installed MPI library as described here.

An example Slurm job script for installing CONCEPT is shown below.

#!/usr/bin/env bash
#SBATCH --job-name=install_concept
#SBATCH --partition=<queue>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=2000M
#SBATCH --time=12:00:00
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null

concept_version=v1.0.1
install_dir="${HOME}/concept"

install_url="https://raw.githubusercontent.com/jmd-dk/concept/${concept_version}/install"
make_jobs="-j 8" bash <(wget -O- --no-check-certificate "${install_url}") "${install_dir}"

To use this installation job script, save its content to e.g. jobscript_install (replacing <queue> with the queue/partition in question) and submit it using

sbatch jobscript_install
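
A roughly equivalent TORQUE/PBS installation job script is sketched below. Treat it as a template: directive names and the resource request syntax (memory in particular) vary between clusters.

#!/usr/bin/env bash
#PBS -N install_concept
#PBS -q <queue>
#PBS -l nodes=1:ppn=8
#PBS -l mem=16gb
#PBS -l walltime=12:00:00
#PBS -o /dev/null
#PBS -e /dev/null

concept_version=v1.0.1
install_dir="${HOME}/concept"

install_url="https://raw.githubusercontent.com/jmd-dk/concept/${concept_version}/install"
make_jobs="-j 8" bash <(wget -O- --no-check-certificate "${install_url}") "${install_dir}"

As with the Slurm version, save it to e.g. jobscript_install (replacing <queue>) and submit it using

qsub jobscript_install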

Once a CONCEPT installation job has begun, you can follow the installation process by executing

tail -f <install_dir>/.tmp/install_log

It still does not work!

If you are still struggling, in particular if CONCEPT does launch but the MPI process binding/affinity is wrong, try removing some of the added environment variables that get set in job/<ID>/jobscript (under the ‘Environment variables’ heading). After altering the job script, submit it manually using

sbatch job/<ID>/jobscript  # Slurm

or

qsub job/<ID>/jobscript  # TORQUE/PBS

Note

When manually submitting an auto-generated job script, a subdirectory within the job directory will be created for the new job, just as when a job is auto-submitted via the concept script. This subdirectory can take a minute to appear though.

It is also possible that the cluster configuration just does not play nicely with the MPI implementation in use. If you installed CONCEPT using one of the MPI implementations present on the cluster, try again using another pre-installed MPI library. If you instead let CONCEPT install its own MPI, try switching from MPICH to OpenMPI or vice versa, as described here.

When installing CONCEPT, try having as few environment modules loaded as possible, in order to minimize the possibility of wrong MPI identification and linking. In particular, beware of environment modules loaded and variables set automatically in files like ~/.bashrc and ~/.bash_profile.

Bad performance when using multiple processes/nodes

If you are running CONCEPT on a cluster and experience a significant drop in performance as you increase the number of processes from e.g. 1 to 2 or 2 to 4, or when using 2 nodes instead of 1 with the same total number of processes, the problem is likely that the MPI library used is not configured to handle the network optimally.

Be sure to install CONCEPT with optimal network performance on clusters. If you are observing bad network behaviour even so, you should try changing the MPI executor, as described here.

Problems when using multiple nodes

If you observe a wrong process binding (i.e. it appears as though several copies of CONCEPT are running on top of each other, rather than all of the MPI processes working together as a collective) when running CONCEPT across multiple nodes, you should try changing the MPI executor.

If you are able to run single-node CONCEPT jobs remotely, but encounter problems as soon as you request multiple nodes, it may be a permission problem. For example, OpenMPI uses SSH to establish the connection between the nodes, and so your local ~/.ssh directory needs to be configured properly. Note that when using an MPI implementation pre-installed on the cluster, such additional configuration by the user ought not to be necessary.

CONCEPT comes with the ability to set up the ~/.ssh directory as needed for multi-node communication. Currently, this feature resides as part of the install script. To apply it, execute

./install --fix-ssh

from the CONCEPT installation directory.

Note that this will move all existing content of ~/.ssh to ~/.ssh_backup. Also, any configuration you might have done will not be reflected in the new content of ~/.ssh. If this indeed fixes the multi-node problem and you want to preserve your original SSH configuration, you must properly merge the original content of ~/.ssh_backup back in with the new content of ~/.ssh.
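
A sketch of how such a merge might be approached (review the differences manually rather than copying anything blindly):

diff -r ~/.ssh_backup ~/.ssh                # inspect what has changed
cat ~/.ssh_backup/config >> ~/.ssh/config   # example: re-append personal host entries
chmod 600 ~/.ssh/config                     # keep strict permissions on the config file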