Job monitoring commands

SLURM job monitoring commands

1. Partition and node information

These commands summarize the available partitions and nodes, including their state (idle, allocated, down, etc.):
```
sinfo
sinfo -Nel
scontrol show partition
```

2. GPU resource details

To list the GPU resources (GRES) of each node in the gpu partition:

sinfo -Ne -p gpu --format "%.15N %.4c %.7m %G"

3. Usage over time

To view usage over time:

sreport cluster UserUtilizationByAccount user=$USER start=2024-12-01 -t hours

4. Job history

To retrieve job history:

sacct -u $USER --format=JobID,JobName,partition,node,alloccpus,state,elapsed,maxrss,totalcpu,start,end -S 2024-12-01

5. Information on a specific job

For a running or finished job:
```
sacct -j <job-id>
```
Show detailed information about a specific running or pending job:
```
scontrol show job <job-id>
```

6. Queue monitoring

To monitor the current job queue (PD: Pending, R: Running, ...):

All users:

squeue -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"

Your own jobs:

squeue -u $USER -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"
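
For a quick overview of how many of your jobs are in each state, the state codes can be tallied. The following is a minimal sketch: `count_job_states` is a hypothetical helper name, and it assumes its input is one state code per line, as produced by `squeue -h -o "%t"`.

```shell
# Hypothetical helper: count jobs per state code.
# Input (stdin): one state code per line, e.g. PD, R, CG.
# Output: "<state> <count>", one line per state.
count_job_states() {
    sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# On the cluster (not executed here):
#   squeue -u "$USER" -h -o "%t" | count_job_states
```

The `-h` option suppresses the header so that only state codes reach the helper.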

7. Create an alias in .bashrc

To make this output format the default for squeue, add the following alias to your ~/.bashrc:

alias squeue='squeue -o "%.8i %.9P %.8j %.5u %.5a %.5t %.16V %.16S %.16M %.16L %.16e %.2D %.4C %.13q %R %f"'

8. Job efficiency analysis

To get JobIDs (adjust the date as needed):

sacct -u $USER --format=JobID,state --starttime=2025-01-01 --noheader | grep COMPLETED | grep -Ev '\.[0-9]+|bat|ext' | awk '{print $1}' > job_$USER.out
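
The filter above keeps parent job IDs and drops job steps, whose IDs carry a suffix after a dot (for example `.batch`, `.extern` or `.0`). The same selection can be sketched with a single awk test on sample input; `completed_parent_jobs` is a hypothetical name, and the input is assumed to be `JobID State` pairs as printed by the sacct command.

```shell
# Hypothetical helper: keep only parent job IDs of COMPLETED jobs.
# Input (stdin): "JobID State" lines, as printed by
#   sacct --format=JobID,state --noheader
# Job steps (1001.batch, 1001.extern, 1001.0, ...) contain a dot and are dropped.
completed_parent_jobs() {
    awk '$2 == "COMPLETED" && $1 !~ /\./ { print $1 }'
}

# On the cluster (not executed here):
#   sacct -u "$USER" --format=JobID,state --starttime=2025-01-01 --noheader \
#       | completed_parent_jobs > "job_$USER.out"
```

Matching a literal dot matters here: an unescaped `.0` pattern would also discard any job ID that merely contains a zero, such as 1030.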

To analyze job efficiency once you have the list in job_$USER.out:

for i in $(cat job_$USER.out); do seff "$i"; done | grep -E "Job ID|CPU Efficiency|Memory Efficiency" > seff_$USER.out
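
Once seff_$USER.out exists, it can be scanned for jobs that used their CPUs poorly. The sketch below is one possible way to do that, not a site-provided tool: `low_cpu_efficiency` is a hypothetical name, the 50% default threshold is arbitrary, and the parsing assumes seff lines of the form `Job ID: <id>`, `CPU Efficiency: <n>% of ...` and `Memory Efficiency: <n>% of ...`.

```shell
# Hypothetical helper: list jobs whose CPU efficiency is below a threshold.
# Input (stdin): the filtered seff lines (Job ID / CPU Efficiency / Memory Efficiency).
# Argument $1: threshold in percent (default 50, an arbitrary choice).
# Output: "<jobid> <cpu%> <mem%>" for each job below the threshold.
low_cpu_efficiency() {
    awk -v limit="${1:-50}" '
        /^Job ID:/            { id = $3 }
        /^CPU Efficiency:/    { cpu = $3; sub(/%/, "", cpu) }
        /^Memory Efficiency:/ {
            mem = $3; sub(/%/, "", mem)
            if (cpu + 0 < limit + 0) print id, cpu, mem
        }
    '
}

# On the cluster (not executed here):
#   low_cpu_efficiency 50 < "seff_$USER.out"
```

Jobs reported this way are candidates for requesting fewer cores or less memory in the next submission.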

Note: $USER is predefined and corresponds to your eXplor user ID.

Monitoring resource usage (CPU and memory) on nodes where the job is running

To use these commands, first connect to the node where your job is running with:

ssh <node-identifier>
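
If you do not know which node a job is running on, squeue can report it. A minimal sketch, assuming `JobID NodeList` pairs as printed by `squeue -h -t R -o "%i %N"`; `node_of_job` is a hypothetical helper name.

```shell
# Hypothetical helper: print the node list of one job.
# Input (stdin): "JobID NodeList" lines, as printed by
#   squeue -u "$USER" -h -t R -o "%i %N"
# Argument $1: the job ID to look up.
node_of_job() {
    awk -v id="$1" '$1 == id { print $2 }'
}

# On the cluster (not executed here):
#   squeue -u "$USER" -h -t R -o "%i %N" | node_of_job <job-id>
```

For multi-node jobs the node list is printed in compressed form (for example `cn[01-04]`); `scontrol show hostnames` expands such a list into individual node names.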

9. Monitor cluster usage and node status

To monitor the cluster, use sinfo, sinfo -Nel or qstat -q.
To check how your job actually uses its allocated resources, use tools like top, htop or nvidia-smi.

10. Command to report CPU usage

mpstat -P ALL

Example output

[login@vm-projet ~]$ ssh cnh01
[login@cnh01 ~]$ mpstat -P ALL
Linux 3.10.0-327.el7.x86_64 (cnh01.prod.explor)         03/10/2025      _x86_64_        (8 CPU)

45:00 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
45:00 PM  all   93.47    0.00    2.21    0.00    0.00    1.37    0.00    0.00    0.00    2.95
45:00 PM    0   88.91    0.00    3.31    0.00    0.00    2.71    0.00    0.00    0.00    5.07
45:00 PM    1   97.05    0.00    1.23    0.00    0.00    0.01    0.00    0.00    0.00    1.71
45:00 PM    2   89.50    0.00    3.24    0.00    0.00    2.99    0.00    0.00    0.00    4.27
45:00 PM    3   97.32    0.00    1.21    0.00    0.00    0.00    0.00    0.00    0.00    1.47
45:00 PM    4   90.01    0.00    3.24    0.00    0.00    2.51    0.00    0.00    0.00    4.23
45:00 PM    5   97.46    0.00    1.15    0.00    0.00    0.00    0.00    0.00    0.00    1.38
45:00 PM    6   90.03    0.00    3.10    0.00    0.00    2.75    0.00    0.00    0.00    4.12
45:00 PM    7   97.45    0.00    1.18    0.00    0.00    0.00    0.00    0.00    0.00    1.36

When you run `mpstat -P ALL` you'll typically see these columns:
- CPU: The CPU number (or "all" for the average across all CPUs).
- %usr: Percentage of CPU utilization that occurred while executing at the user level (applications).
- %nice: Percentage of CPU utilization that occurred while executing at the user level with "nice" priority.
- %sys: Percentage of CPU utilization that occurred while executing at the system level (kernel).
- %iowait: Percentage of time the CPU(s) were idle while the system had outstanding disk I/O requests.
- %irq: Percentage of time spent handling hardware interrupts.
- %soft: Percentage of time spent handling software interrupts.
- %steal: Percentage of time a virtual CPU waits for the real CPU while the hypervisor services another virtual processor.
- %guest: Percentage of time spent running a virtual CPU for guest operating systems.
- %gnice: Percentage of time spent running a guest with "nice" priority.
- %idle: Percentage of time the CPU was idle and the system had no outstanding disk I/O requests.
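
Since %idle is the last column, the busy share of each CPU is simply 100 minus that value. The following sketch condenses mpstat output to one busy figure per CPU; `cpu_busy` is a hypothetical helper name, and it locates the CPU column from the header row so that it works whether or not the timestamp includes an AM/PM field.

```shell
# Hypothetical helper: print "<cpu> <busy%>" per row of mpstat output,
# where busy% = 100 - %idle (the last column).
# Input (stdin): output of "mpstat -P ALL".
cpu_busy() {
    awk '
        # Header row: remember which field holds the CPU label.
        $NF == "%idle" { for (i = 1; i <= NF; i++) if ($i == "CPU") col = i; next }
        # Data rows after a header: report CPU label and busy percentage.
        col && NF >= col { printf "%s %.2f\n", $col, 100 - $NF }
    '
}

# On the compute node (not executed here):
#   mpstat -P ALL | cpu_busy
```

Applied to the example output above, the `all` row (2.95% idle) would be reported as 97.05% busy.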

Reminder

You can connect to a node only while you have a job running on it.
