Running long jobs on Hamilton
Most of the queues on Hamilton have a time limit of 3 days. If you have a job that will need to run for longer, the long queue is an option, but this queue has very limited capacity so wait times can be long. Another possibility is to checkpoint the job and restart it.
A long-running job has more to lose if it is interrupted, for example because it has reached its time limit or because of a system issue. Checkpointing is a technique in which a program saves a copy of its state at intervals as it progresses, with the intention that it can be restarted from that point if execution is interrupted. The process of checkpointing and restarting can be repeated until the program completes. This reduces the risk for long jobs and also allows the execution of jobs needing more time than the queues permit.
The best checkpoint and restore option is one built into the program itself, and many popular applications have this capability, sometimes described as a 'restart' capability. If your application supports it, use that. The rest of this page covers a feature installed on Hamilton that assists with applications that do not have built-in checkpointing.
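To illustrate the general technique, the following self-contained sketch saves its progress to a state file after each unit of work and, on start-up, resumes from that file if one exists. The file names and loop are purely illustrative and are not part of any Hamilton tooling:

```shell
#!/bin/sh
# Minimal sketch of application-level checkpointing (illustrative only).
STATE=counter.chk

# Restore: resume from the last checkpoint if present, else start fresh.
if [ -f "$STATE" ]; then
    i=$(cat "$STATE")
else
    i=0
fi

while [ "$i" -lt 10 ]; do
    i=$((i + 1))
    # ... one unit of real work would happen here ...
    # Checkpoint: write the new state via a temporary file and rename, so
    # an interruption never leaves a half-written checkpoint behind.
    echo "$i" > "$STATE.tmp" && mv "$STATE.tmp" "$STATE"
done
echo "done at $i"
```

If the script is interrupted part-way through and run again, it continues from the last value written to the state file rather than starting from zero.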
Hamilton's Checkpoint/Restore feature
Hamilton has a feature called Checkpoint and Restore Jobs, which attempts to automatically save and restart a job when it reaches the queue's time limit, without the application knowing about it. The technique can be used for serial and multi-threaded jobs, but not ones that use multiple compute nodes or Slurm tasks, e.g. MPI applications.
Note: a job can restore from a checkpoint only if its files are in the same state as at the time of the checkpoint. For example, if a program modifies an output file after a checkpoint (e.g. because it updates the file very frequently) and then fails, it cannot be restored from that checkpoint.
Before you start using Checkpoint/Restore:
- If you would like to use the Checkpoint/Restore feature, please let us know so that we can give you access to it. You will not be able to use it without this access. You do not need to tell us if you are using your application's own checkpointing facility, only if you want to use the Hamilton feature.
- Check that your job is suitable. It should not use multiple nodes or multiple Slurm tasks (e.g. via MPI).
- Check the locations of files used by the job, including temporary files. Advice for different storage areas is outlined below.
Home directory
Files used by the job cannot be held in your home directory and should be copied to /nobackup first. We provide a command migrate-to-nobackup to help with this. Usage is:
migrate-to-nobackup <file or directory>
Advice for some common cases:
R libraries
If you have installed R libraries in your account, you can migrate your R library storage location to /nobackup, such that it can still be used by R:
migrate-to-nobackup ~/R
WARNING - your R library directory will no longer be backed up.
Python libraries
If you have installed python libraries in your account using pip, you can migrate your pip library storage location to /nobackup, such that it can still be used by python:
migrate-to-nobackup ~/.local/lib/python*
WARNING - your pip python library directory will no longer be backed up.
Other files in your home directory
We provide a tool to help identify any files in your home directory that are used by your jobs, including hidden files created automatically by your application:
- Run a test job (without attempting to checkpoint it)
- While the job is running, type the following command on a login node:
chkptproblems <JOBID>
If the command reports a list of programs (under the "COMMAND" column) and files they have open (under the "NAME" column), then these files are stored in a location that will prevent your job from being checkpointed successfully.
- Modify the job to use a different location and/or use the migrate-to-nobackup tool as above to move the files to /nobackup.
See Manual checkpoint/restart cycles below for further information on testing your application.
Temporary files/TMPDIR
All files used by a checkpoint/restore job must be available from all compute nodes, so a node's local storage should not be used. Because of this, Hamilton's Checkpoint and Restore solution sets the TMPDIR environment variable to /nobackup/$USER/tmp, so that programs use this location for their temporary files. Check that your /nobackup quota can accommodate any files you place in TMPDIR.
Run the command chkptproblems <JOBID> on a login node while job <JOBID> is running, to report on any files used by the job that are located in a node’s local storage.
A future development may allow you to stage data in/out of a job so that you can take advantage of the local disk space on a compute node. Please let us know if this is important to you.
Running a job
To submit a Checkpoint and Restore job (the chmod command is only needed the first time):
module load chkpt
chmod u+x my_job_script.sh
sbatch $CHKPT_HOME/chkpt_job ./my_job_script.sh
Important: the contents of any #SBATCH lines in your job script will be ignored, so these need to be provided as a flag to sbatch instead. For example, if your job script included the line #SBATCH -c 8 to request 8 CPU cores, you would need to submit using:
sbatch -c 8 $CHKPT_HOME/chkpt_job ./my_job_script.sh
Important: the files in use by the job when it is checkpointed need to be in /nobackup, otherwise checkpointing will fail. This includes job output files. The job script may need to cd to somewhere in /nobackup to avoid running in your home directory.
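Putting these points together, a job script for this feature might look like the sketch below. The project directory and program name are hypothetical examples; note that there are no #SBATCH lines, because resource requests must go on the sbatch command line instead:

```shell
#!/bin/bash
# my_job_script.sh -- illustrative example only.
# Run from /nobackup so that all files the job touches, including its
# output, are in a location that can be checkpointed.
cd /nobackup/$USER/my_project

# Hypothetical serial or multi-threaded program; output also goes to /nobackup.
./my_long_running_program > results.out
```

This script would then be submitted with, for example, sbatch -c 8 $CHKPT_HOME/chkpt_job ./my_job_script.sh if it needs 8 CPU cores.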
By default, the job will have a time limit of 3 days, but will be checkpointed and requeued 1 hour before it reaches that limit. Other restrictions, such as the number of CPU cores and memory available to the job, will behave in the same way as they do for non-checkpoint/restart jobs.
Jobs using the checkpoint and restore feature run a maximum of 10 times by default, although this can be increased using the -r <maxruns> flag to chkpt_job.
Changes to output files
A new file, chkpt-<jobid>.out, will contain the output from your job script.
The usual Slurm output file, by default called slurm-<jobid>.out, will now contain the checkpoint/restore messages for the job, which we will find useful if you ask us to help you with a job that is not checkpointing or restarting correctly.
Retrieving Slurm accounting information about a Checkpoint and Restore job
A checkpointed job keeps its original jobID when it restarts. As the sacct command by default shows you only the most recent instance of a job, use the -D ("duplicates") flag to see the entire history of the job, including any restarts. For example, the following command shows some useful information:
sacct -j <jobid> -D -o jobid,state,totalcpu,cputime,reqmem,maxrss --units=G
Manual checkpoint/restart cycles
When using a new application with checkpoint and restart, we recommend that you test it. To force a job to checkpoint and restart, rather than waiting for it to run for three days, type:
scancel --signal=USR1 --batch <jobid>
To cancel a Checkpoint and Restore job completely:
scancel <jobid>
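A complete manual test cycle might look like the following. The job ID 123456 is a placeholder for whatever ID sbatch reports, and squeue is the standard Slurm command for checking job state:

```shell
module load chkpt
chmod u+x my_job_script.sh
sbatch $CHKPT_HOME/chkpt_job ./my_job_script.sh   # note the job ID reported, e.g. 123456

squeue -u $USER                          # wait until the job is running (state R)
scancel --signal=USR1 --batch 123456     # force a checkpoint; the job should requeue
sacct -j 123456 -D -o jobid,state        # confirm that a new instance has started
```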
Restarting a previous Checkpoint and Restore job
As long as jobs fulfill the requirements above, including that files must not have been changed since the last checkpoint was taken, checkpoint/restore jobs can be restarted manually if necessary. If you find that a checkpoint and restore job has stopped restarting, e.g. because of a system failure or because the job has reached its maximum number of restarts, it can be restarted manually using:
module load chkpt
sbatch <sbatch_options> $CHKPT_HOME/chkpt_job -c <chkptid>
where <sbatch_options> are the flags you originally supplied to sbatch, and <chkptid> is the ID of the checkpoint to resume. <chkptid> takes the form of <jobid>__<runid>, where <runid> is incremented each time the job is restarted.
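For example, if a job originally submitted with 8 cores has job ID 123456 and its most recent checkpoint is from its third restart (all values hypothetical), the restart command would be:

```shell
module load chkpt
# -c 8 is the original sbatch core request; the -c after chkpt_job
# selects the checkpoint ID to resume from.
sbatch -c 8 $CHKPT_HOME/chkpt_job -c 123456__3
```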
Available checkpoint IDs can be listed using the command:
ls /nobackup/chkpt/$USER
Checkpoint retention
Note that old checkpoints will be automatically deleted if unused for 30 days.