SLURM Job Monitoring Commands
1. Information on Partitions and Nodes
- Displays a summary of available partitions and nodes, including their status (available, busy, etc.):
sinfo
sinfo -Nel
scontrol show partition
2. GPU Resource Details
- Lists the CPU count, memory, and GPU (GRES) resources of each node in the gpu partition:
sinfo -Ne -p gpu --format "%.15N %.4c %.7m %G"
3. Usage Over Time
- To view usage over time:
sreport cluster UserUtilizationByAccount user=$USER start=2024-12-01 -t hours
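Instead of hard-coding the start date, it can be computed relative to today. This is a minimal sketch that assumes GNU date (the usual case on Linux clusters); the helper name days_ago is illustrative and not part of SLURM.

```shell
# Print the date N days before today as YYYY-MM-DD (requires GNU date).
# Defaults to 30 days when no argument is given.
days_ago() {
  date -d "${1:-30} days ago" +%F
}

# Usage on the cluster (not run here):
#   sreport cluster UserUtilizationByAccount user=$USER start=$(days_ago 30) -t hours
```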
4. Job History
- To retrieve job history:
sacct -u $USER --format=JobID,JobName,partition,node,alloccpus,state,elapsed,maxrss,totalcpu,start,end -S 2024-12-01
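The job history can also be condensed into a single figure, for instance the total wall-clock time consumed. The sketch below assumes pipe-separated input with the job ID in the first field and the Elapsed value ([DD-]HH:MM:SS) in the second, as produced by sacct with the -P (parsable) option; the function name total_elapsed_hours is hypothetical.

```shell
# Sum the Elapsed column ([DD-]HH:MM:SS) of "JobID|Elapsed" lines read
# on stdin and print the total in hours with two decimals.
total_elapsed_hours() {
  awk -F'|' '
    NF >= 2 {
      t = $2; days = 0
      # An optional "DD-" prefix carries the number of days.
      if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
      split(t, p, ":")
      total += days * 86400 + p[1] * 3600 + p[2] * 60 + p[3]
    }
    END { printf "%.2f\n", total / 3600 }
  '
}

# Usage on the cluster (not run here); -X keeps one line per job,
# -n drops the header, -P separates fields with "|":
#   sacct -X -n -P -u "$USER" -S 2024-12-01 --format=JobID,Elapsed | total_elapsed_hours
```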
5. Information on a Specific Job
- For a running or completed job:
sacct -j <job-id>
- Display detailed information about a specific job that is running or pending:
scontrol show job <job-id>
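The output of scontrol show job is a long list of Key=Value pairs. To pull out a single value, a small filter can help. This is a sketch only: job_field is a hypothetical helper, and it assumes the value contains no spaces (true for fields such as JobState or TimeLimit, not for free-text fields such as Command).

```shell
# Read "Key=Value" pairs (space- or newline-separated) on stdin and
# print the value of the first pair whose key equals $1.
job_field() {
  tr -s ' ' '\n' |
    awk -F= -v key="$1" '
      !found && $1 == key { print substr($0, length(key) + 2); found = 1 }
    '
}

# Usage on the cluster (not run here):
#   scontrol show job <job-id> | job_field JobState
```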
6. Job Queue Monitoring
- To monitor the job queue (state codes: PD = pending, R = running, ...):
All users:
squeue -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"
Your own jobs:
squeue -u $USER -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"
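For a quick overview rather than the full table, the state codes alone can be tallied. The sketch below reads one state code per line; tally_states is a hypothetical name, and the squeue call in the comment uses the -h (no header) option and the %t (compact state) format field.

```shell
# Count how many times each state code (one per line on stdin) occurs
# and print "STATE COUNT" pairs sorted by state.
tally_states() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on the cluster (not run here):
#   squeue -u "$USER" -h -o "%t" | tally_states
```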
7. Creating an Alias in .bashrc
- To create an alias in the .bashrc file:
alias squeue='squeue -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"'
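Redefining squeue itself changes its default output in every interactive shell. Two alternatives are sketched below as suggestions, not site policy: a separately named alias (the name sq and the variable SQ_FMT are arbitrary choices), or the SQUEUE_FORMAT environment variable, which squeue reads as its default output format.

```shell
# Format string taken from the squeue commands above.
SQ_FMT="%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"

# Option 1: a separately named alias; plain squeue keeps its default output.
alias sq="squeue -o \"$SQ_FMT\""

# Option 2: make the format squeue's default through the environment.
export SQUEUE_FORMAT="$SQ_FMT"
```

Either line can be placed in .bashrc; an explicit -o on the command line still takes precedence over SQUEUE_FORMAT.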
8. Analyzing Job Efficiency
- To retrieve JobIDs, use the following command (adjust the date as needed):
sacct -u $USER --format=JobID,state --starttime=2025-01-01 --noheader | grep COMPLETED | grep -Ev "\.([0-9]+|bat|ext)" | awk '{print $1}' > job_$USER.out
- To analyze job efficiency, once the list of jobs is retrieved in job_$USER.out, run the following command:
for i in $(cat job_$USER.out); do seff $i; done | grep -E "Job ID|CPU Efficiency|Memory Efficiency" > seff_$USER.out
Note: $USER is predefined and corresponds to your eXplor identifier.
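The efficiency loop can be wrapped in a function so that it is easier to reuse. This is a sketch under the same assumptions as the commands above (a file with one job ID per line, and seff available on the cluster); seff_report is a hypothetical name.

```shell
# Run seff on every job ID listed in the file $1 (one ID per line) and
# keep only the job ID and efficiency lines. Blank lines are skipped.
seff_report() {
  while IFS= read -r jobid; do
    [ -n "$jobid" ] || continue
    # Detach stdin so seff cannot consume the rest of the ID list.
    seff "$jobid" < /dev/null
  done < "$1" | grep -E "Job ID|CPU Efficiency|Memory Efficiency"
}

# Usage on the cluster (not run here):
#   seff_report "job_$USER.out" > "seff_$USER.out"
```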
Monitoring Resource Usage (CPU and Memory) on Nodes Where Calculations Are Ongoing
To use these commands, first connect to the node where your calculation is running using the following command:
ssh <node identifier>
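If you do not know which node to connect to, squeue can list the nodes allocated to your running jobs. A minimal sketch follows; my_job_nodes is a hypothetical helper name, while the options used (-t to filter by state, -h to drop the header, %i for the job ID, %N for the node list) are standard squeue options.

```shell
# Print "jobid nodelist" for each of your running jobs.
my_job_nodes() {
  squeue -u "${USER:-$(id -un)}" -t RUNNING -h -o "%i %N"
}

# Usage on the cluster (not run here):
#   my_job_nodes
# A compressed node list such as cnh[01-03] can be expanded with:
#   scontrol show hostnames "cnh[01-03]"
```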
9. Monitor the Cluster Usage and Node Status
- To monitor the cluster, use the following commands: sinfo, sinfo -Nel, qstat -q.
- To adjust the resources to allocate, use tools like top, htop, or nvidia-smi.
10. Command to Report CPU Usage
mpstat -P ALL
Example Output
[login@vm-projet ~]$ ssh cnh01
[login@cnh01 ~]$ mpstat -P ALL
Linux 3.10.0-327.el7.x86_64 (cnh01.prod.explor) 03/10/2025 _x86_64_ (8 CPU)
09:45:00 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
09:45:00 PM all 93.47 0.00 2.21 0.00 0.00 1.37 0.00 0.00 0.00 2.95
09:45:00 PM 0 88.91 0.00 3.31 0.00 0.00 2.71 0.00 0.00 0.00 5.07
09:45:00 PM 1 97.05 0.00 1.23 0.00 0.00 0.01 0.00 0.00 0.00 1.71
09:45:00 PM 2 89.50 0.00 3.24 0.00 0.00 2.99 0.00 0.00 0.00 4.27
09:45:00 PM 3 97.32 0.00 1.21 0.00 0.00 0.00 0.00 0.00 0.00 1.47
09:45:00 PM 4 90.01 0.00 3.24 0.00 0.00 2.51 0.00 0.00 0.00 4.23
09:45:00 PM 5 97.46 0.00 1.15 0.00 0.00 0.00 0.00 0.00 0.00 1.38
09:45:00 PM 6 90.03 0.00 3.10 0.00 0.00 2.75 0.00 0.00 0.00 4.12
09:45:00 PM 7 97.45 0.00 1.18 0.00 0.00 0.00 0.00 0.00 0.00 1.36
When you run mpstat -P ALL, you will typically see output that includes the following columns:
- **CPU**: The CPU number (or "all" for the average across all CPUs).
- **%usr**: Percentage of CPU utilization that occurred while executing at the user level (application).
- **%nice**: Percentage of CPU utilization that occurred while executing at the user level with a "nice" priority.
- **%sys**: Percentage of CPU utilization that occurred while executing at the system level (kernel).
- **%iowait**: Percentage of time that the CPU or CPUs were idle while the system had a pending disk I/O request.
- **%irq**: Percentage of time spent by the CPU handling hardware interrupts.
- **%soft**: Percentage of time spent by the CPU handling software interrupts.
- **%steal**: Percentage of time that the virtual CPU waits for the real CPU while the hypervisor serves another virtual processor.
- **%guest**: Percentage of time spent running a virtual CPU for guest operating systems.
- **%gnice**: Percentage of time spent running a guest with a "nice" priority.
- **%idle**: Percentage of time that the CPU was idle and the system had no pending disk I/O requests.
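Note that without an interval argument, mpstat reports averages since the system booted, which may not reflect your current job; passing an interval and a count (for example mpstat -P ALL 2 5) samples live activity instead. The sketch below filters mpstat output for CPUs whose %idle exceeds a threshold, which can reveal cores a job has requested but is not using. The name idle_cpus is hypothetical, and the filter assumes %idle is the last column, as in the example output.

```shell
# Read mpstat -P ALL output on stdin and print "CPU %idle" for every
# individual CPU whose idle percentage is above the limit $1 (default 50).
idle_cpus() {
  awk -v limit="${1:-50}" '
    # Header line: remember which column holds the CPU number.
    /%idle/ { for (i = 1; i <= NF; i++) if ($i == "CPU") cpu_col = i; next }
    # Data line for a single CPU (the "all" row is skipped).
    cpu_col && $cpu_col ~ /^[0-9]+$/ && $NF + 0 > limit + 0 { print $cpu_col, $NF }
  '
}

# Usage on a compute node (not run here):
#   mpstat -P ALL 2 5 | idle_cpus 50
```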
Reminder
You can only connect to a node while you have a job running on it.