Introduction to SLURM

What is SLURM?

SLURM is an acronym for Simple Linux Utility for Resource Management.

SLURM is a free and versatile tool that streamlines job scheduling and resource allocation on Linux-based clusters. Imagine having dozens or even hundreds of compute nodes – SLURM ensures efficient utilization by:

  • Resource Allocation: Granting users exclusive or non-exclusive access to compute nodes for specific durations.
  • Job Scheduling & Execution: Providing a framework for launching and monitoring jobs, often involving parallel processing techniques like MPI (Message Passing Interface).
  • Queue Management: Maintaining a queue of submitted jobs, prioritizing them based on predefined rules, and ensuring fair access to resources, which mitigates resource contention.
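As a first taste of these ideas, a user can request an allocation and run a command across it in a single step. The sketch below is illustrative; the partition name is an assumption and is cluster-specific:

```shell
# Ask SLURM for 2 nodes with 4 tasks per node and run `hostname`
# on every allocated task. "--partition=compute" is a hypothetical
# queue name; use a partition that exists on your cluster (see sinfo).
srun --nodes=2 --ntasks-per-node=4 --partition=compute hostname
```

Each of the 8 tasks prints the hostname of the node it landed on, which makes it easy to see how SLURM spread the work.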

Why use SLURM?

  • SLURM is an open-source job scheduler for small and large Linux clusters and other Unix-like systems
  • SLURM is a fault-tolerant and highly scalable cluster management system
  • SLURM is relatively self-contained, shipping with the components it needs to run

What are Important components of SLURM?

  • slurmctld, the central controller that monitors all the compute nodes registered as part of the cluster. Typically this runs on a dedicated management node
  • slurmd, the daemon that runs on each compute node
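On a typical systemd-based installation, both daemons can be checked like this (service names as packaged by most distributions; your setup may differ):

```shell
# On the management node: check the central controller
systemctl status slurmctld

# On each compute node: check the node daemon
systemctl status slurmd

# From any node with the client tools: verify the controller
# can see the compute nodes and their states
sinfo -N -l
```

If `sinfo` reports nodes in a `down` or `unknown` state, the usual first step is to check that `slurmd` is running on those nodes and can reach the controller.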

Picture Source: https://slurm.schedmd.com/arch.gif

What are different entities in SLURM ?

  • Nodes - Actual Compute nodes
  • Partitions - Logical groupings of nodes, a.k.a. job queues, each with a mix of constraints (job size limit / job time limit) and a priority
  • Jobs - Allocations of resources assigned to a user for a specified amount of time; logical groups of job steps
  • Job Steps - Sets of tasks within a job, typically parallel tasks
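These entities map directly onto a batch script. The sketch below assumes a partition named `debug` exists on your cluster; each `srun` invocation inside the script becomes a job step within the job:

```shell
#!/bin/bash
# Illustrative batch script; the partition name is an assumption.
#SBATCH --job-name=demo         # job name shown in squeue
#SBATCH --partition=debug       # partition (queue) to submit to
#SBATCH --nodes=2               # nodes in the allocation
#SBATCH --ntasks=4              # total tasks across the allocation
#SBATCH --time=00:05:00         # time limit for the job

# Each srun below launches a job step; the tasks within a
# step run in parallel across the allocated nodes.
srun hostname                   # step 0: 4 parallel tasks
srun sleep 10                   # step 1
```

Submitting this with `sbatch` creates one job; `squeue -s` would then show the individual steps as they execute.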

Picture Source: https://slurm.schedmd.com/entities.gif

Finally, the list of important commands

Ordered roughly by frequency of use and relevance

  • sbatch - submit a batch script for eventual (queued) execution
  • srun - submit a job for execution in real time; it can also be used to launch job steps within an existing allocation
  • scancel - cancel a pending or running job step, or an entire job (all job steps)
  • squeue - show the state of jobs and job steps
  • sinfo - report the state of partitions and nodes
  • sview - GUI to display information about jobs, partitions, and nodes
  • sstat - show resource utilization of a running job or job step
  • sshare - show fair-share usage; only available in conjunction with the priority/multifactor plugin
  • sprio - show the factors/components/constraints affecting a job's priority
  • scontrol - admin tool to view or modify the state of SLURM
  • sbcast - transfer a file to local storage on the nodes within a job allocation. It works in a peer-to-peer fashion, enabling diskless nodes and improving performance relative to a shared filesystem
  • sattach - attach stdin/stdout/stderr to, and send signals to, an already running job step
  • salloc - allocate resources to a job in real time. This spawns a shell with the allocated resources; srun commands within that shell are then used to launch parallel tasks
  • sacct - report accounting information about active or completed jobs and job steps
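Tying the most common commands together, a typical job lifecycle looks roughly like this (the script name and job ID are illustrative):

```shell
sbatch job.sh          # submit the script; prints "Submitted batch job <id>"
squeue -u $USER        # watch its state (PD = pending, R = running)
sstat -j 1234          # resource usage while job 1234 is running
scancel 1234           # cancel it if needed
sacct -j 1234          # accounting record after it finishes
```

The job ID printed by `sbatch` is the handle used by every subsequent command.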

Recommended exercise: go through the SLURM demo on an AWS Ubuntu 22.04 EC2 instance.
