SLURM job monitoring commands
1. Partition and node information
- Show a summary of the available partitions and nodes, including their state (available, busy, etc.):
sinfo
sinfo -Nel
scontrol show partition
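The node listing can be condensed into a count of nodes per state. The following is a minimal sketch, not a site-provided tool: `nodes_by_state` is a hypothetical helper name, and the node names in the sample input are illustrative only. It assumes input lines of the form `<node> <state>`, which `sinfo -N -h -o "%N %T"` produces.

```shell
# Count nodes per state from "<node> <state>" lines read on stdin.
# Hypothetical helper name; not part of SLURM.
nodes_by_state() {
  # sort -u drops the repeated lines that appear when a node
  # belongs to several partitions.
  sort -u |
    awk '{ count[$2]++ } END { for (s in count) print s, count[s] }' |
    sort
}

# On the cluster (assumes sinfo is in PATH):
#   sinfo -N -h -o "%N %T" | nodes_by_state

# Illustrative input, not real cluster data:
printf '%s\n' "cnh01 allocated" "cnh02 idle" "cnh03 idle" "cnh02 idle" |
  nodes_by_state
```

The helper only reads text from standard input, so it can be tried on a saved copy of the `sinfo` output before being used on the live cluster.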
2. GPU resource details
- Show GPU resource details for the nodes of the gpu partition:
sinfo -Ne -p gpu --format "%.15N %.4c %.7m %G"
3. Usage over time
- To view usage over time:
sreport cluster UserUtilizationByAccount user=$USER start=2024-12-01 -t hours
4. Job history
- To retrieve job history:
sacct -u $USER --format=JobID,JobName,partition,node,alloccpus,state,elapsed,maxrss,totalcpu,start,end -S 2024-12-01
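The start dates in the commands above are hard-coded. A small sketch of one way to compute the first day of the current month instead, assuming only the standard `date` command; `month_start` and the `start` variable are hypothetical names introduced for illustration.

```shell
# Print the first day of the current month as YYYY-MM-DD.
# Hypothetical helper name; uses only the standard date command.
month_start() {
  date +%Y-%m-01
}

start=$(month_start)
echo "Reporting since $start"

# On the cluster (assumes sacct and sreport are in PATH):
#   sacct -u "$USER" -S "$start" --format=JobID,JobName,state,elapsed
#   sreport cluster UserUtilizationByAccount user="$USER" start="$start" -t hours
```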
5. Information on a specific job
- For a running or finished job:
sacct -j <job-id>
- Show detailed information about a specific running or pending job:
scontrol show job <job-id>
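The output of `scontrol show job` is long. When only one value is needed (for example the node to connect to), it can be filtered. The sketch below assumes the usual space-separated `Key=Value` layout of that output; `job_field` is a hypothetical helper, and the sample text is illustrative, not real cluster output. Values that themselves contain spaces would be cut at the first space.

```shell
# Print the value of one key from `scontrol show job` output read on stdin.
# Hypothetical helper; assumes space-separated Key=Value tokens.
job_field() {
  key=$1
  # Put every token on its own line, then keep the token that
  # starts with "<key>=" (so NodeList does not match ReqNodeList).
  tr -s ' \t' '\n\n' |
    awk -v k="$key" 'index($0, k "=") == 1 { print substr($0, length(k) + 2); exit }'
}

# On the cluster (assumes scontrol is in PATH):
#   scontrol show job <job-id> | job_field NodeList

# Illustrative input, not real cluster data:
job_field JobState <<'EOF'
JobId=12345 JobName=demo
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:10:00 TimeLimit=01:00:00 TimeMin=N/A
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cnh01
EOF
```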
6. Queue monitoring
- To monitor the current job queue (PD: Pending, R: Running, ...):
All users:
squeue -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"
Your own jobs:
squeue -u $USER -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"
7. Create an alias in .bashrc
- To create an alias in .bashrc:
alias squeue='squeue -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"'
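As an alternative to an alias, `squeue` reads its default output format from the `SQUEUE_FORMAT` environment variable when no `-o` option is given. A possible sketch, to be placed in .bashrc like the alias; whether this suits your workflow better than the alias is a matter of preference.

```shell
# Alternative to the alias: set the default squeue output format.
# An explicit -o on the command line still takes precedence.
export SQUEUE_FORMAT="%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"
```

Unlike an alias, the variable is also seen by scripts and non-interactive shells that inherit the environment.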
8. Job efficiency analysis
- To get JobIDs (adjust the date as needed):
sacct -u $USER --format=JobID,state --starttime=2025-01-01 --noheader | grep COMPLETED | grep -Ev '\.([0-9]+|bat|ext)' | awk '{print $1}' > job_$USER.out
The grep -Ev filter drops job steps (such as 12345.0, 12345.batch and 12345.extern) so that only whole jobs remain.
- To analyze job efficiency once you have the list in job_$USER.out:
for i in $(cat job_$USER.out); do seff "$i"; done | grep -E "Job ID|CPU Efficiency|Memory Efficiency" > seff_$USER.out
Note: $USER is predefined and corresponds to your eXplor user ID.
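Once seff_$USER.out exists, the jobs with poor CPU efficiency can be picked out automatically. The sketch below is one possible approach: `low_cpu_jobs` is a hypothetical helper, the 50% default threshold is an arbitrary choice, and the sample lines are illustrative rather than real results. It assumes the `Job ID:` and `CPU Efficiency:` line layout that `seff` prints.

```shell
# List jobs whose CPU efficiency is below a threshold (default: 50 percent).
# Reads the filtered seff output on stdin. Hypothetical helper name.
low_cpu_jobs() {
  threshold=${1:-50}
  awk -v t="$threshold" '
    /^Job ID:/ { job = $3 }            # remember the current job ID
    /^CPU Efficiency:/ {
      pct = $3
      sub(/%$/, "", pct)               # "12.50%" -> "12.50"
      if (pct + 0 < t + 0) print job, pct "%"
    }'
}

# On the cluster:
#   low_cpu_jobs 50 < seff_$USER.out

# Illustrative input, not real results:
low_cpu_jobs 50 <<'EOF'
Job ID: 1001
CPU Efficiency: 95.10% of 02:00:00 core-walltime
Memory Efficiency: 40.00% of 8.00 GB
Job ID: 1002
CPU Efficiency: 12.50% of 04:00:00 core-walltime
Memory Efficiency: 5.00% of 16.00 GB
EOF
```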
Monitoring resource usage (CPU and memory) on nodes where the job is running
To use these commands, first connect to the node where your job is running with:
ssh <node-identifier>
9. Monitor cluster usage and node status
- To monitor the cluster, use `sinfo`, `sinfo -Nel` or `qstat -q`.
- To check how a job uses its allocated resources, and adjust future requests accordingly, use tools like `top`, `htop` or `nvidia-smi`.
10. Command to report CPU usage
mpstat -P ALL
Example output
[login@vm-projet ~]$ ssh cnh01
[login@cnh01 ~]$ mpstat -P ALL
Linux 3.10.0-327.el7.x86_64 (cnh01.prod.explor) 03/10/2025 _x86_64_ (8 CPU)
09:45:00 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
09:45:00 PM all 93.47 0.00 2.21 0.00 0.00 1.37 0.00 0.00 0.00 2.95
09:45:00 PM 0 88.91 0.00 3.31 0.00 0.00 2.71 0.00 0.00 0.00 5.07
09:45:00 PM 1 97.05 0.00 1.23 0.00 0.00 0.01 0.00 0.00 0.00 1.71
09:45:00 PM 2 89.50 0.00 3.24 0.00 0.00 2.99 0.00 0.00 0.00 4.27
09:45:00 PM 3 97.32 0.00 1.21 0.00 0.00 0.00 0.00 0.00 0.00 1.47
09:45:00 PM 4 90.01 0.00 3.24 0.00 0.00 2.51 0.00 0.00 0.00 4.23
09:45:00 PM 5 97.46 0.00 1.15 0.00 0.00 0.00 0.00 0.00 0.00 1.38
09:45:00 PM 6 90.03 0.00 3.10 0.00 0.00 2.75 0.00 0.00 0.00 4.12
09:45:00 PM 7 97.45 0.00 1.18 0.00 0.00 0.00 0.00 0.00 0.00 1.36
When you run `mpstat -P ALL` you'll typically see these columns:
- CPU: The CPU number (or "all" for the average across all CPUs).
- %usr: Percentage of CPU utilization that occurred while executing at the user level (applications).
- %nice: Percentage of CPU utilization that occurred while executing at the user level with "nice" priority.
- %sys: Percentage of CPU utilization that occurred while executing at the system level (kernel).
- %iowait: Percentage of time the CPU(s) were idle while the system had outstanding disk I/O requests.
- %irq: Percentage of time spent handling hardware interrupts.
- %soft: Percentage of time spent handling software interrupts.
- %steal: Percentage of time a virtual CPU waits for the real CPU while the hypervisor services another virtual processor.
- %guest: Percentage of time spent running a virtual CPU for guest operating systems.
- %gnice: Percentage of time spent running a guest with "nice" priority.
- %idle: Percentage of time the CPU was idle and the system had no outstanding disk I/O requests.
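Since %idle is the last column, the busy share of each CPU is simply 100 minus that value. A minimal sketch of this calculation follows; `cpu_busy` is a hypothetical helper, and it assumes the CPU column is followed by ten metric columns ending in %idle, as in the example output above (other mpstat versions may print a different set of columns).

```shell
# Print "cpu <n> busy <pct>%" for every per-CPU line of mpstat output on stdin.
# Hypothetical helper; assumes CPU + ten metric columns, %idle last.
cpu_busy() {
  # $(NF-10) is the CPU column whether or not the timestamp has an AM/PM field.
  # The header, the "all" line and the banner line do not match and are skipped.
  awk 'NF >= 12 && $(NF-10) ~ /^[0-9]+$/ {
         printf "cpu %s busy %.2f%%\n", $(NF-10), 100 - $NF
       }'
}

# On the node (assumes mpstat is installed):
#   mpstat -P ALL | cpu_busy

# Lines taken from the example output above:
cpu_busy <<'EOF'
09:45:00 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
09:45:00 PM all 93.47 0.00 2.21 0.00 0.00 1.37 0.00 0.00 0.00 2.95
09:45:00 PM 0 88.91 0.00 3.31 0.00 0.00 2.71 0.00 0.00 0.00 5.07
09:45:00 PM 1 97.05 0.00 1.23 0.00 0.00 0.01 0.00 0.00 0.00 1.71
EOF
```

A job that requested 8 CPUs but keeps only one or two of them busy is a candidate for a smaller request next time.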
Reminder
You can connect to a node only while you have a job running on it.