Enabling Job Restart with Checkpoints
Enabling Grid Engine checkpoints allows rescheduling of jobs after a machine failure.
With checkpoints enabled, the state of each job is periodically saved to disk, and a job that fails to finish for some reason (e.g., a system crash) can be restarted from its last checkpoint. This reduces the potential loss of processing for long-running jobs to a few minutes, rather than hours (or even days).
If you'd like to enable the checkpoint restart feature:
- Configure your checkpointing environment using the qconf -mckpt command (use qconf -ackpt to add a new environment), and make sure that the environment’s when parameter includes r (for reschedule).
- Use qconf -mconf to edit the global cluster configuration and set the reschedule_unknown parameter to a non-zero time. This parameter determines whether jobs on a host in an unknown state are rescheduled and thus sent to other hosts. The default value of 00:00:00 means jobs are not rescheduled and stay on the host on which they were originally running.
- Rescheduling is only initiated for jobs with the rerun flag. Make certain checkpointed jobs are submitted with qsub -r y, in addition to the -ckpt <ckpt_env_name> option.
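The settings from the steps above can be checked from the shell before a job is submitted. The following is a minimal sketch, not a definitive procedure: the helper function names, the environment name my_ckpt, and the script name long_job.sh are hypothetical, and the parsing assumes that qconf -sckpt and qconf -sconf global print one "name value" pair per line.

```shell
#!/bin/sh
# Sketch: verify the checkpoint-restart prerequisites, then submit a job.
# Helper names, "my_ckpt", and "long_job.sh" are hypothetical placeholders.

# Succeeds if a checkpoint environment's "when" value includes r (reschedule).
when_includes_r() {
  case "$1" in
    *r*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Succeeds if a reschedule_unknown value is anything other than zero time.
reschedule_enabled() {
  # Delete every "0" and ":"; a non-zero time leaves at least one digit behind.
  rest=$(printf '%s' "$1" | tr -d '0:')
  [ -n "$rest" ]
}

# Reads the live settings with qconf and submits only if both checks pass.
# Assumes qconf prints one "name value" pair per line.
submit_checkpointed() {
  ckpt_env=$1
  job_script=$2
  when=$(qconf -sckpt "$ckpt_env" | awk '$1 == "when" { print $2 }')
  resched=$(qconf -sconf global | awk '$1 == "reschedule_unknown" { print $2 }')
  when_includes_r "$when" || { echo "when lacks r: $when" >&2; return 1; }
  reschedule_enabled "$resched" || { echo "reschedule_unknown is zero or unset" >&2; return 1; }
  qsub -r y -ckpt "$ckpt_env" "$job_script"
}

# Example call (requires a Grid Engine installation):
#   submit_checkpointed my_ckpt long_job.sh
```

Keeping the two checks as separate functions means they can be tried against sample values without a running cluster; only submit_checkpointed calls qconf and qsub.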
Note that jobs not using checkpointing are rescheduled only if they run in queues that have the rerun option set to true and are also submitted with the -r y option. Parallel jobs are rescheduled only if the host on which their master task executes enters an unknown state.
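For jobs without checkpointing, the queue's rerun setting can be inspected in a similar way. This sketch assumes that qconf -sq prints a rerun line with a boolean value; the function name, the queue name all.q, and the script name job.sh in the example call are illustrative only.

```shell
# Succeeds if the text on stdin (qconf -sq output) has rerun set to true.
# Assumes a line of the form "rerun TRUE"; the comparison ignores case.
queue_rerun_enabled() {
  awk '$1 == "rerun" { v = toupper($2) } END { exit (v == "TRUE" ? 0 : 1) }'
}

# Example call (requires a Grid Engine installation; all.q is a placeholder):
#   qconf -sq all.q | queue_rerun_enabled && qsub -r y job.sh
```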
