SLURM Job Monitoring Commands

1. Information on Partitions and Nodes

The following commands display a summary of the available partitions and nodes, including their state (idle, allocated, down, etc.):

sinfo
sinfo -Nel
scontrol show partition

2. GPU Resource Details

To list the nodes of the gpu partition with their CPU count, memory, and generic resources (GPUs):

sinfo -Ne -p gpu --format "%.15N %.4c %.7m %G"

3. Usage Over Time

To report your usage per account, in hours, since a given start date:

sreport cluster UserUtilizationByAccount user=$USER start=2024-12-01 -t hours

4. Job History

To retrieve job history:

sacct -u $USER --format=JobID,JobName,partition,node,alloccpus,state,elapsed,maxrss,totalcpu,start,end -S 2024-12-01
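To see at a glance how past jobs ended, the history can also be requested in machine-readable form and tallied per state. The following is a minimal sketch rather than a site-provided tool: the function name count_job_states is hypothetical, and it assumes pipe-separated "JobID|State" input as produced by sacct with the -X (allocations only), -n (no header), and -P (parsable) options.

```shell
# Hypothetical helper: count jobs per final state.
# Input (stdin): pipe-separated "JobID|State" lines, for example from
#   sacct -u "$USER" -X -n -P --format=JobID,State -S 2024-12-01
count_job_states() {
    awk -F'|' '
        NF >= 2 {
            # Keep only the first word, so "CANCELLED by 1234" counts as CANCELLED.
            split($2, s, " ")
            count[s[1]]++
        }
        END {
            for (state in count) print state, count[state]
        }
    ' | sort
}
```

Possible usage: sacct -u "$USER" -X -n -P --format=JobID,State -S 2024-12-01 | count_job_states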

5. Information on a Specific Job

To display accounting information for a running or completed job:

sacct -j <job-id>

Display detailed information about a specific job that is running or pending:

scontrol show job <job-id>

6. Job Queue Monitoring

To monitor the job queue, which lists both pending and running jobs (state codes: PD = pending, R = running, ...):

All users:

squeue -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"

Your own jobs:

squeue -u $USER -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"

7. Creating an Alias in .bashrc

To avoid retyping the format string, add the following alias to your ~/.bashrc file:

alias squeue='squeue -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"'

8. Analyzing Job Efficiency

To retrieve JobIDs, use the following command (adjust the date as needed):

sacct -u $USER --format=JobID,state --starttime=2025-01-01 --noheader | grep COMPLETED | grep -Ev '\.[0-9]+|bat|ext' | awk '{print $1}' > job_$USER.out

Once the list of jobs has been written to job_$USER.out, run the following command to collect the efficiency figures of each job in seff_$USER.out:

for i in $(cat job_$USER.out); do seff $i; done | grep -E "Job ID|CPU Efficiency|Memory Efficiency" > seff_$USER.out

Note: $USER is predefined and corresponds to your eXplor identifier.
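The three lines kept per job in seff_$USER.out can be condensed into one line per job, which makes inefficient jobs easier to spot. The sketch below is an illustration only: the function name summarize_seff is hypothetical, and it assumes that each filtered line has the form "Label: value% of ..." with the "Memory Efficiency" line coming last for each job.

```shell
# Hypothetical helper: print "JobID CPU-efficiency Memory-efficiency" per job.
# Input (stdin): the filtered seff lines, e.g. the content of seff_$USER.out.
summarize_seff() {
    awk -F': *' '
        /^Job ID/            { id = $2 }
        /^CPU Efficiency/    { split($2, a, " "); cpu = a[1] }
        # The memory line closes the record for the current job.
        /^Memory Efficiency/ { split($2, a, " "); print id, cpu, a[1] }
    '
}
```

Possible usage: summarize_seff < seff_$USER.out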

Monitoring Resource Usage (CPU and Memory) on Nodes Where Calculations Are Ongoing

To use these commands, first connect to the node where your calculation is running using the following command:

ssh <node identifier>

9. Monitor the Cluster Usage and Node Status

To monitor the cluster, use the following commands: sinfo, sinfo -Nel, qstat -q.
To check how much CPU, memory, or GPU your job actually uses on a node, and adjust your resource requests accordingly, use tools such as top, htop, or nvidia-smi.

10. Command to Report CPU Usage

mpstat -P ALL

Example Output

[login@vm-projet ~]$ ssh cnh01
[login@cnh01 ~]$ mpstat -P ALL
Linux 3.10.0-327.el7.x86_64 (cnh01.prod.explor)         03/10/2025      _x86_64_        (8 CPU)

09:45:00 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
09:45:00 PM  all   93.47    0.00    2.21    0.00    0.00    1.37    0.00    0.00    0.00    2.95
09:45:00 PM    0   88.91    0.00    3.31    0.00    0.00    2.71    0.00    0.00    0.00    5.07
09:45:00 PM    1   97.05    0.00    1.23    0.00    0.00    0.01    0.00    0.00    0.00    1.71
09:45:00 PM    2   89.50    0.00    3.24    0.00    0.00    2.99    0.00    0.00    0.00    4.27
09:45:00 PM    3   97.32    0.00    1.21    0.00    0.00    0.00    0.00    0.00    0.00    1.47
09:45:00 PM    4   90.01    0.00    3.24    0.00    0.00    2.51    0.00    0.00    0.00    4.23
09:45:00 PM    5   97.46    0.00    1.15    0.00    0.00    0.00    0.00    0.00    0.00    1.38
09:45:00 PM    6   90.03    0.00    3.10    0.00    0.00    2.75    0.00    0.00    0.00    4.12
09:45:00 PM    7   97.45    0.00    1.18    0.00    0.00    0.00    0.00    0.00    0.00    1.36

When you run mpstat -P ALL, you will typically see output that includes the following columns:

- **CPU**: The CPU number (or "all" for the average across all CPUs).
- **%usr**: Percentage of CPU utilization that occurred while executing at the user level (application).
- **%nice**: Percentage of CPU utilization that occurred while executing at the user level with a "nice" priority.
- **%sys**: Percentage of CPU utilization that occurred while executing at the system level (kernel).
- **%iowait**: Percentage of time that the CPU or CPUs were idle while the system had a pending disk I/O request.
- **%irq**: Percentage of time spent by the CPU handling hardware interrupts.
- **%soft**: Percentage of time spent by the CPU handling software interrupts.
- **%steal**: Percentage of time that the virtual CPU waits for the real CPU while the hypervisor serves another virtual processor.
- **%guest**: Percentage of time spent running a virtual CPU for guest operating systems.
- **%gnice**: Percentage of time spent running a guest with a "nice" priority.
- **%idle**: Percentage of time that the CPU was idle and the system had no pending disk I/O requests.
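On nodes with many cores, reading the %idle column by eye becomes tedious. As a rough sketch, the mpstat output can be filtered to list only the cores that are mostly idle, which may indicate that a job requested more cores than it uses. The function name idle_cores and the default threshold of 50% are assumptions made for this example.

```shell
# Hypothetical helper: list cores whose %idle exceeds a threshold.
# Input (stdin): output of "mpstat -P ALL"; $1: threshold in percent (default 50).
idle_cores() {
    awk -v limit="${1:-50}" '
        # Ignore the trailing "Average:" block printed when an interval is given.
        /^Average:/ { next }
        # Header line: record which columns hold the CPU label and %idle.
        /%idle/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU")   cpu_col  = i
                if ($i == "%idle") idle_col = i
            }
            next
        }
        # Per-core rows only (the "all" row is skipped by the numeric test).
        cpu_col && $cpu_col ~ /^[0-9]+$/ && $idle_col + 0 > limit + 0 {
            print "CPU " $cpu_col ": " $idle_col "% idle"
        }
    '
}
```

Possible usage: mpstat -P ALL 1 1 | idle_cores 50. Called without an interval, mpstat reports averages since the node was booted, so a one-second sample (1 1) reflects the current load more closely.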

Reminder

You can only connect to a node while you have a job running on it.