SLURM Demo on AWS Ubuntu EC2 instance

It is recommended to read Introduction to SLURM before going through this post

Steps for Spinning up an EC2 instance

  • Either do it interactively via the AWS Console in a browser, or
  • Use the AWS CLI to spin up the instance

    • Use AWS IAM to create an Access Key and configure the AWS CLI using aws configure
    • Create and tag an EC2 instance using the following commands

      aws ec2 run-instances --image-id ami-0cf2b4e024cdb6960 --count 1 --instance-type t3.small --key-name XXXXX --security-group-ids sg-XXXXXXXXXX --subnet-id subnet-XXXXXXXX
      aws ec2 create-tags --resources i-aaabb3234dd --tags Key=Name,Value=slurm-01
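
Once the instance is running, its public IP (for SSH) can be pulled back out with describe-instances (a sketch reusing the instance ID returned by run-instances above):

aws ec2 describe-instances --instance-ids i-aaabb3234dd \
    --query 'Reservations[0].Instances[0].PublicIpAddress' --output text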
      

As a final check, inspect the OS on the instance

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Steps for Setting up Slurm on the EC2 instance

Update GRUB to disable SELinux, then reboot

sudo vim /etc/default/grub
sudo update-grub
sudo reboot
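
The usual edit here is appending selinux=0 to the kernel command line in /etc/default/grub (a minimal sketch; on stock Ubuntu this is largely a precaution, since it ships AppArmor rather than SELinux):

# /etc/default/grub: add selinux=0 to the existing kernel parameters
GRUB_CMDLINE_LINUX="selinux=0"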

Install the slurm packages

sudo apt update -y
sudo apt install slurmd slurmctld -y
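
On Ubuntu, installing slurmd and slurmctld also pulls in munge, the authentication service the daemons use (we start it later). A quick sanity check that the client tools landed:

$ sinfo --version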

Commands to figure out the system's hardware: RAM and CPU cores

Command to help with finding the number of cores

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
.........
.........

Another command that lists per-processor/core information

cat /proc/cpuinfo
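
Quick one-liners for just the count (both should report the 2 vCPUs seen above):

$ nproc
2
$ grep -c ^processor /proc/cpuinfo
2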

Command to help with finding the available memory (RAM)

$ sudo dmidecode --type memory
# dmidecode 3.5
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0008, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Unknown
        Maximum Capacity: 2 GB
        Error Information Handle: Not Provided
        Number Of Devices: 1

Handle 0x0009, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x0008
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 2 GB
        Form Factor: DIMM
        Set: None
        Locator: Not Specified
        Bank Locator: Not Specified
        Type: DDR4
        Type Detail: Static Column Pseudo-static Synchronous Window DRAM
        Speed: 2933 MT/s
        Manufacturer: Not Specified
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Rank: Unknown
        Configured Memory Speed: Unknown
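
dmidecode reports the installed 2 GB; the RealMemory=1910 used in slurm.conf below is lower because the kernel reserves part of it. Two easier ways to get a usable figure (the second assumes the slurm packages are already installed):

# Usable memory in MiB, in the "total" column
free -m

# Prints a ready-made NodeName= line with the CPUs and RealMemory detected on this node
sudo slurmd -C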

Create the slurm config file at /etc/slurm/slurm.conf

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=localcluster
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
# COMPUTE NODES
NodeName=localhost CPUs=2 RealMemory=1910 State=UNKNOWN
PartitionName=localhost Nodes=ALL Default=YES MaxTime=INFINITE State=UP
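
Depending on the packaging, the spool, state and log directories referenced in the config may need to be created and handed to the slurm user before the daemons will start (a hedged precaution; recent Ubuntu packages create some of these for you):

sudo mkdir -p /var/lib/slurm/slurmd /var/lib/slurm/slurmctld /var/log/slurm
sudo chown -R slurm:slurm /var/lib/slurm /var/log/slurm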

Start munge, slurmctld and slurmd (munge first, since the Slurm daemons authenticate through it), then mark the node as idle

sudo systemctl start munge && sudo systemctl start slurmctld && sudo systemctl start slurmd
sudo scontrol update nodename=localhost state=idle
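
Optionally enable the units so the cluster survives a reboot; systemctl status (or journalctl -u slurmctld) is the first place to look if the node refuses to go idle:

sudo systemctl enable munge slurmctld slurmd
systemctl status slurmctld slurmd --no-pager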

Checking the resources using sinfo --partition localhost

$ sinfo --partition localhost
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
localhost*    up   infinite      1   idle localhost
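
If the node reports down or drained instead of idle, scontrol shows the per-node detail, including a Reason field explaining why:

$ scontrol show node localhost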

Testing Job Submission

Create sample_submission.sh

#!/bin/bash

#SBATCH --job-name=sample_job
#SBATCH --partition=localhost
#SBATCH --time=10:00
#SBATCH --ntasks=1

echo "First sample job running on localhost."
echo "Job Done well .. Exiting!"

exit 0

Submit Job and Check output

The 2 in the output below is the job ID assigned by Slurm

$ sbatch sample_submission.sh
Submitted batch job 2
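
While it is pending or running, the job is visible in the queue (this one finishes almost instantly, so be quick):

$ squeue --job 2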

Inspect the job

The output also includes the StdOut field, which points to the file holding the job's output, since we used echo statements to print to stdout

$ scontrol show jobid 2
JobId=2 JobName=sample_job
   UserId=ubuntu(1000) GroupId=ubuntu(1000) MCS_label=N/A
   Priority=1 Nice=0 Account=(null) QOS=(null)
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2024-05-30T01:20:22 EligibleTime=2024-05-30T01:20:22
   AccrueTime=2024-05-30T01:20:22
   StartTime=2024-05-30T01:20:23 EndTime=2024-05-30T01:20:23 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-05-30T01:20:23 Scheduler=Backfill
   Partition=localhost AllocNode:Sid=ip-172-31-21-208:1046
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=localhost
   BatchHost=localhost
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=953M,node=1,billing=1
   AllocTRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/ubuntu/sample_submission.sh
   WorkDir=/home/ubuntu
   StdErr=/home/ubuntu/slurm-2.out
   StdIn=/dev/null
   StdOut=/home/ubuntu/slurm-2.out
   Power=

Checking the output

$ cat /home/ubuntu/slurm-2.out
First sample job running on localhost.
Job Done well .. Exiting!
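
For one-off commands there is also srun, which runs a task through the scheduler synchronously, no batch script needed:

$ srun --partition=localhost --ntasks=1 hostname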
