
4.6 Restarting

Since QE 5.1, restarting from an arbitrary point of the code is no longer supported.

The code must terminate properly in order for a restart to be possible. A clean stop can be triggered by one of the following three conditions:

  1. The amount of time specified by the input variable max_seconds is reached
  2. The user creates a file named "$prefix.EXIT" either in the working directory or in the output directory "$outdir" (variables $outdir and $prefix as specified in the control namelist)
  3. (experimental) The code is compiled with signal-trapping support and one of the trapped signals is received (see the next section for details).
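
Condition 2 above can be sketched in shell as follows. The values of prefix and outdir are hypothetical placeholders, and the helper name request_stop is introduced here only for illustration; use the values from your own control namelist.

```shell
#!/bin/sh
# Hypothetical values: replace with the prefix and outdir of your own run.
prefix="silicon"
outdir="./out"

# request_stop PREFIX DIR: create an empty "PREFIX.EXIT" file in DIR.
# The mkdir is only there so the sketch runs standalone; in a real run
# the output directory already exists.
request_stop() {
    mkdir -p "$2" && : > "$2/$1.EXIT"
}

request_stop "$prefix" "$outdir"
```

The file only needs to exist; its content is not used in this sketch.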

After the condition is met, the code will try to stop cleanly as soon as possible, which can take a while for a large calculation. Writing the files to disk can also be a long process. To be safe, reserve sufficient time for the stop process to complete.
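
One way to reserve that time is to set max_seconds somewhat below the wall-clock limit of the job. The arithmetic can be sketched as below; the 5-hour allocation and the 30-minute margin are hypothetical numbers, and a suitable margin depends on the size of the calculation and the speed of the file system.

```shell
#!/bin/sh
# Hypothetical numbers: a 5-hour allocation with 30 minutes kept in reserve.
walltime_s=$((5 * 3600))   # wall-clock limit granted by the queue, in seconds
margin_s=$((30 * 60))      # time reserved for the clean stop and for writing files

# Value to use for max_seconds in the control namelist.
max_seconds=$((walltime_s - margin_s))
echo "max_seconds = $max_seconds"   # prints "max_seconds = 16200"
```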

If the previous execution of the code stopped properly, it can be restarted by setting restart_mode='restart' in the control namelist.
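
As a minimal sketch, the control namelist of the restarted run might be written from a job script as below. Only the control namelist is shown; the rest of the input is assumed to be unchanged from the interrupted run. The file name control_restart.in and the values of prefix, outdir and max_seconds are hypothetical, and prefix and outdir are assumed to match the previous run so that its files can be found.

```shell
#!/bin/sh
# Hypothetical values: keep prefix and outdir identical to the interrupted run.
prefix="silicon"
outdir="./out"
max_seconds=16200

# Write only the control namelist; the other namelists and cards
# of the input file are not part of this sketch.
cat > control_restart.in <<EOF
&CONTROL
  restart_mode = 'restart'
  prefix       = '$prefix'
  outdir       = '$outdir'
  max_seconds  = $max_seconds
/
EOF
```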

4.6.1 Signal trapping (experimental!)

To compile with signal-trapping support, add "-D__TERMINATE_GRACEFULLY" to MANUAL_DFLAGS in the make.inc file (make.sys in versions before 6.0). Currently the code intercepts SIGINT, SIGTERM, SIGUSR1, SIGUSR2 and SIGXCPU; signals can be added or removed by editing the file clib/custom_signals.c.
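
The edit to MANUAL_DFLAGS can be done by hand or scripted. The sketch below works on a stand-in file, make.inc.demo, which is a hypothetical name used so that the example does not touch a real build configuration; it assumes that the file contains a line starting with "MANUAL_DFLAGS" followed by "=".

```shell
#!/bin/sh
# Stand-in for the real build configuration file (hypothetical name).
cfg="make.inc.demo"
printf 'MANUAL_DFLAGS  =\n' > "$cfg"

# Append the flag to the existing MANUAL_DFLAGS line.
# A temporary file is used instead of "sed -i" for portability.
sed 's/^\(MANUAL_DFLAGS[[:space:]]*=.*\)$/\1 -D__TERMINATE_GRACEFULLY/' "$cfg" > "$cfg.new" \
    && mv "$cfg.new" "$cfg"

cat "$cfg"
```

The code has to be recompiled after this change for the flag to take effect.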

Common queue systems send a signal some time before killing a job. The exact behaviour depends on the queue system and can be configured. Some examples:

With PBS:

With LoadLeveler (untested): the SIGXCPU signal is sent when the wall-clock soft limit is reached, and the job is stopped when the hard limit is reached. You can specify both limits as:
# @ wall_clock_limit = hardlimit,softlimit
e.g. you can give pw.x thirty minutes to stop using:
# @ wall_clock_limit = 5:00:00,4:30:00

