Scheduler Control Commands
Scontrol
Overview
Alongside scancel, scontrol is the primary command for modifying and showing information
about running, pending, or held jobs. The format of the command is:
scontrol [options] [command]
Where [options] are typically:
| Option | Short | Long | Description |
| Help | -h | --help | Show scontrol help. |
| One-Line | -o | --oneliner | Use one line per record for show command. |
| All | -a | --all | List all jobs on all partitions for show command. |
| Details | -d | --details | List all available information about jobs for show command. |
| Verbose | -v | --verbose | Show debugging information when running any command. |
And [command] is usually:
| Command | Input | Example | Description |
| update | specification | scontrol update jobid=12345 timelimit=4-00:00:00 | Update a job specification. When timelimit, numcpus, numtasks, numnodes, minmemorycpu, or minmemorynode is updated for a pending job, the new value must be less than what was initially requested. This command doesn't usually work for running jobs. |
| hold | jobList | scontrol hold 12345 | Hold a pending job to prevent it from running. |
| suspend | jobList | scontrol suspend 12345 | Suspend a running job to allow other jobs to run. |
| requeue | jobList | scontrol requeue 12345 | Requeue a running, suspended, or completed job. |
| resume | jobList | scontrol resume 12345 | Resume a suspended job. |
| release | jobList | scontrol release 12345 | Release a held job. |
| show | | scontrol show job 12345 | Show details about an entity. |
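As a rough illustration of the jobList input used by hold, suspend, requeue, resume, and release, the sketch below joins individual job IDs with commas and prints the resulting command instead of running it. The helper names join_job_list and build_scontrol_cmd are hypothetical and not part of Slurm.

```shell
# Join job IDs given as separate arguments into a comma-separated jobList.
# The function body runs in a subshell so the IFS change does not leak out.
join_job_list() (
    IFS=,
    printf '%s\n' "$*"   # "$*" joins the arguments with the first character of IFS
)

# Build (but do not run) an scontrol command for a list of jobs.
# Usage: build_scontrol_cmd <command> <jobid>...
build_scontrol_cmd() {
    action=$1
    shift
    printf 'scontrol %s %s\n' "$action" "$(join_job_list "$@")"
}

# Prints: scontrol hold 12345,12346,12347
build_scontrol_cmd hold 12345 12346 12347
```

Printing the command first makes it easy to check the list before passing it to scontrol on a login node.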
A job list (jobList for a given [command]) is a single job ID or a comma-separated list such as 12345,12346,12347, to which the [command] is applied. A job name, given as jobname=[some job name] instead of jobList, applies the [command] to all jobs with that name. Entities, specified with entity ID, can be:
| | | |
| | | Execute the sacctmgr command to see available accounts: sacctmgr show assoc user=$USER format=account -P |
| | | Can be almost anything, but it is best to include information about the job, its dependencies, etc. |
| | | Job is held until the dependency given in ID is satisfied: "after:jobid_1:jobid_2:..." (wait until the jobs have started); "afterany:jobid_1:jobid_2:..." (wait until the jobs have terminated); "afternotok:jobid_1:jobid_2:..." (wait until the jobs have terminated with a non-zero exit code, usually failed); "afterok:jobid_1:jobid_2:..." (wait until the jobs have terminated with a zero exit code, usually succeeded); "singleton" (wait until jobs with the same name have terminated). |
| | | Cannot be modified, but can be used with show to see details about a job. |
| | | Exclude a comma-separated list of nodes. Usually useful if a job has a problem with a particular node. |
| | | Execute with at least the number of CPUs per node given in ID. |
| | | Execute with at least the memory per CPU given in ID, in megabytes. Applies to pending jobs. |
| | | Execute with at least the memory per node given in ID, in megabytes. Applies to pending jobs. |
| | | Name of the job(s) to be shown or modified. |
| | | Shrink a job to the nodes listed in ID. Must be a subset of the nodes currently allocated to the job. |
| | | Set the minimum-maximum number of CPUs for the job. The maximum is optional, and if the job is running, it must be smaller than what is currently allocated for the job. |
| | | Set the number of tasks required for a job. |
| | | |
| | | Set the job's quality of service. |
| | | Require a comma-separated list of nodes. Usually useful if a job has a problem with a particular node. |
| | | Set the job to requeue (resubmit) after a node failure. |
| | | Set the job to initiate on or after a particular time/date. |
| | | Set the job to terminate after running for an allotted amount of time. |
| | | Identify a job by user name. |
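A dependency value can be assembled from individual job IDs. The sketch below assumes Slurm's documented dependency syntax, in which the job IDs that belong to one dependency type are joined with colons (commas separate different dependency types). The helper name build_dependency and the job IDs are hypothetical, and the command is only printed, not executed.

```shell
# Build a dependency value: one dependency type followed by colon-separated job IDs.
# Usage: build_dependency <type> <jobid>...
# The function body runs in a subshell so the IFS change stays local.
build_dependency() (
    dep_type=$1
    shift
    IFS=:
    printf '%s:%s\n' "$dep_type" "$*"   # "$*" joins the job IDs with ':'
)

# Make (hypothetical) job 12347 wait until jobs 12345 and 12346 have succeeded.
dep=$(build_dependency afterok 12345 12346)

# Prints: scontrol update jobid=12347 dependency=afterok:12345:12346
printf 'scontrol update jobid=%s dependency=%s\n' 12347 "$dep"
```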
Examples
Showing a currently running job with scontrol show:
[user@log001 ~] scontrol show job 185995 -d
JobId=185995 JobName=lipo2Nano_-0.2_17500_1.0472.sh
UserId=jspngler(10123) GroupId=users(100) MCS_label=N/A
Priority=4294771864 Nice=0 Account=mlaradjilab QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=3-22:27:11 TimeLimit=5-00:00:00 TimeMin=N/A
SubmitTime=2019-04-11T17:44:03 EligibleTime=2019-04-11T17:44:03
StartTime=2019-04-18T10:06:03 EndTime=2019-04-23T10:06:03 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-04-18T10:06:03
Partition=computeq AllocNode:Sid=log002:91002
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c063
BatchHost=c063
NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=3200M,node=1,billing=8
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Nodes=c063 CPU_IDs=10-13,20-23 Mem=3200 GRES_IDX=
MinCPUsNode=8 MinMemoryCPU=400M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.2/lipo2Nano_-0.2_17500_1.0472.sh --job-name=5NM_-0.2_17500_1.0472 -D /home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.2
WorkDir=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.2
StdErr=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.2/MD-185995-4294967294.err
StdIn=/dev/null
StdOut=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.2/MD-185995-4294967294.out
Power=
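Because the output is a series of KEY=VALUE pairs, single fields can be pulled out with standard tools. The following is a minimal sketch, not a Slurm feature: get_field is a hypothetical helper, and the sample text is a few lines copied from the output above. It splits on whitespace, so values that contain spaces (such as Command) are not handled.

```shell
# Print the value of one KEY from "scontrol show job" output read on stdin.
# Usage: scontrol show job <jobid> | get_field <KEY>
get_field() {
    # Put every whitespace-separated KEY=VALUE pair on its own line,
    # then print whatever follows "KEY=" for the first matching key.
    tr -s ' \t' '\n\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# A few lines taken from the listing above, used here in place of a live scontrol call.
sample='JobId=185995 JobName=lipo2Nano_-0.2_17500_1.0472.sh
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=3-22:27:11 TimeLimit=5-00:00:00 TimeMin=N/A'

printf '%s\n' "$sample" | get_field JobState    # prints RUNNING
printf '%s\n' "$sample" | get_field TimeLimit   # prints 5-00:00:00
```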
Updating a pending job with less memory per CPU:
[jspngler@log001 ~] scontrol show job 186120
JobId=186120 JobName=lipo2Nano_-0.7_21500_2.0944.sh
UserId=jspngler(10123) GroupId=users(100) MCS_label=N/A
Priority=4294771739 Nice=0 Account=mlaradjilab QOS=normal
JobState=PENDING Reason=AssocMaxJobsLimit Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
SubmitTime=2019-04-11T17:49:49 EligibleTime=2019-04-11T17:49:49
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-04-22T08:46:10
Partition=computeq AllocNode:Sid=log002:91002
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=3200M,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=8 MinMemoryCPU=400M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.7/lipo2Nano_-0.7_21500_2.0944.sh --job-name=5NM_-0.7_21500_2.0944 -D /home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.7
WorkDir=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.7
StdErr=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.7/MD-186120-4294967294.err
StdIn=/dev/null
StdOut=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.7/MD-186120-4294967294.out
Power=
[jspngler@log001 ~] scontrol update job 186120 minmemorycpu=300
[jspngler@log001 ~] scontrol show job 186120
JobId=186120 JobName=lipo2Nano_-0.7_21500_2.0944.sh
UserId=jspngler(10123) GroupId=users(100) MCS_label=N/A
Priority=4294771739 Nice=0 Account=mlaradjilab QOS=normal
JobState=PENDING Reason=AssocMaxJobsLimit Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=5-00:00:00 TimeMin=N/A
SubmitTime=2019-04-11T17:49:49 EligibleTime=2019-04-11T17:49:49
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-04-22T08:46:32
Partition=computeq AllocNode:Sid=log002:91002
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=2400M,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=8 MinMemoryCPU=300M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.7/lipo2Nano_-0.7_21500_2.0944.sh --job-name=5NM_-0.7_21500_2.0944 -D /home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.7
WorkDir=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.7
StdErr=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.7/MD-186120-4294967294.err
StdIn=/dev/null
StdOut=/home/jspngler/nanoPatch/test/2N_D10_vesicle/U_-0.7/MD-186120-4294967294.out
Power=
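The two listings are consistent with the total memory in TRES being NumCPUs times MinMemoryCPU: 8 x 400M = 3200M before the update and 8 x 300M = 2400M after it. The sketch below reproduces that arithmetic and, assuming the rule that a pending job's request may only be reduced, prints the update command only when the new value is lower. The helper names total_mem_mb and propose_mem_update are hypothetical, and the printed command uses the jobid= form from the command table.

```shell
# Total memory in MB implied by a per-CPU memory request.
# Usage: total_mem_mb <numcpus> <minmemorycpu_mb>
total_mem_mb() {
    echo $(( $1 * $2 ))
}

# Print the update command only if the new per-CPU memory is a reduction.
# Usage: propose_mem_update <jobid> <current_mb> <new_mb>
propose_mem_update() {
    if [ "$3" -lt "$2" ]; then
        printf 'scontrol update jobid=%s minmemorycpu=%s\n' "$1" "$3"
    else
        printf 'refusing: %s MB is not less than the current %s MB\n' "$3" "$2" >&2
        return 1
    fi
}

total_mem_mb 8 400                  # prints 3200
total_mem_mb 8 300                  # prints 2400
propose_mem_update 186120 400 300   # prints scontrol update jobid=186120 minmemorycpu=300
```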