This page documents how I set up a SLURM cluster on a set of Raspberry Pis running Raspbian. I heavily leveraged the documentation on this site for information. The setup relies on a domain controller to manage users. Here is how the machines are divided up:
- 1 Domain Controller – 192.168.3.50
- 7 Domain Members:
  - 1 NFS Server – 192.168.3.51 – this node also acts as the main SLURM controller and Build Bot master
  - 6 Domain Members (the compute nodes) – 192.168.3.101-192.168.3.106
All of these systems, excluding the domain controller, mount an NFS share from the NFS server node using autofs.
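The autofs setup itself is outside the scope of this page, but for reference, a minimal configuration along these lines produces the /mnt/nfs mount used later. The export path /srv/nfs is an assumption; substitute whatever the NFS server actually exports.
# /etc/auto.master – hand the /mnt directory to the auto.nfs map
/mnt /etc/auto.nfs --ghost
# /etc/auto.nfs – the key "nfs" becomes the /mnt/nfs mount point (export path assumed)
nfs -fstype=nfs,rw pinfs:/srv/nfs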
Setting up the Controller
- open the /etc/hosts file
sudo vim /etc/hosts
- I added the names of all the systems in my cluster; in the end the file looked like this
127.0.0.1 localhost
192.168.3.51 pinfs.lan.maltshoppe.com pinfs
192.168.3.101 pi1.lan.maltshoppe.com pi1
192.168.3.102 pi2.lan.maltshoppe.com pi2
192.168.3.103 pi3.lan.maltshoppe.com pi3
192.168.3.104 pi4.lan.maltshoppe.com pi4
192.168.3.105 pi5.lan.maltshoppe.com pi5
192.168.3.106 pi6.lan.maltshoppe.com pi6
- Install the Slurm Manager
sudo apt install slurm-wlm -y
- Create the basic Slurm settings
cd /etc/slurm-llnl
sudo cp /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz .
sudo gzip -d slurm.conf.simple.gz
sudo mv slurm.conf.simple slurm.conf
- Update the slurm.conf file
sudo vim /etc/slurm-llnl/slurm.conf
- Modify the following line to contain the name and IP of your slurm controller
SlurmctldHost=pinfs(192.168.3.51)
- Update the compute node settings to contain the basic information about each node. Note the varying number of cores across my setup
sudo vim /etc/slurm-llnl/slurm.conf
# COMPUTE NODES
NodeName=pi1 NodeAddr=192.168.3.101 CPUs=1 State=UNKNOWN
NodeName=pi2 NodeAddr=192.168.3.102 CPUs=1 State=UNKNOWN
NodeName=pi3 NodeAddr=192.168.3.103 CPUs=1 State=UNKNOWN
NodeName=pi4 NodeAddr=192.168.3.104 CPUs=4 State=UNKNOWN
NodeName=pi5 NodeAddr=192.168.3.105 CPUs=4 State=UNKNOWN
NodeName=pi6 NodeAddr=192.168.3.106 CPUs=4 State=UNKNOWN
- Slurm runs jobs on groups of nodes called partitions. Create one after the # COMPUTE NODES section; the node list must match the NodeName entries above
PartitionName=mycluster Nodes=pi[1-6] Default=YES MaxTime=INFINITE State=UP
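If you are not sure what values a node should advertise, slurmd can print the hardware it detects in slurm.conf format. Running this on a node (once slurmd is installed there) prints a NodeName line with CPU and memory figures you can compare against the entries above.
sudo slurmd -C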
- The latest version of SLURM supports cgroup kernel isolation. To configure it, create a config file.
sudo vim /etc/slurm-llnl/cgroup.conf
- Add the following
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
AllowedDevicesFile="/etc/slurm-llnl/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30
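Note that cgroup.conf only takes effect if slurm.conf is configured to use the cgroup plugins. Depending on the SLURM version, the simple example config may already set these, so treat the following as a check rather than a guaranteed addition.
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup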
- Now, whitelist system devices by creating the following file
sudo vim /etc/slurm-llnl/cgroup_allowed_devices_file.conf
- Add the following
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/mnt/nfs/*
- Copy the config files to the NFS share so that the compute nodes can use them in a bit.
sudo cp slurm.conf cgroup.conf cgroup_allowed_devices_file.conf /mnt/nfs/slurm/
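The compute node setup below also copies the munge key from this same NFS directory, so the controller's key needs to land there as well. Assuming the key is in its default location:
sudo cp /etc/munge/munge.key /mnt/nfs/slurm/munge.key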
- Enable and start all of the daemons
sudo systemctl enable munge
sudo systemctl start munge
sudo systemctl enable slurmd
sudo systemctl start slurmd
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
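Before rebooting, it doesn't hurt to confirm the daemons actually came up. At this point sinfo will likely report the compute nodes as down, which is expected until they are set up below.
sudo systemctl status munge slurmd slurmctld
sinfo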
- Reboot
Setting up the Compute Nodes
- install the software on each node
sudo apt install slurmd slurm-client -y
- On each compute node, update the /etc/hosts file as was done in the controller setup
sudo vim /etc/hosts
- My hosts file looks like this
127.0.0.1 localhost
192.168.3.51 pinfs.lan.maltshoppe.com pinfs
192.168.3.101 pi1.lan.maltshoppe.com pi1
192.168.3.102 pi2.lan.maltshoppe.com pi2
192.168.3.103 pi3.lan.maltshoppe.com pi3
192.168.3.104 pi4.lan.maltshoppe.com pi4
192.168.3.105 pi5.lan.maltshoppe.com pi5
192.168.3.106 pi6.lan.maltshoppe.com pi6
- All of the config files need to match, so copy over the ones we created during the controller setup
sudo cp /mnt/nfs/slurm/munge.key /etc/munge/munge.key
sudo cp /mnt/nfs/slurm/slurm.conf /etc/slurm-llnl/slurm.conf
sudo cp /mnt/nfs/slurm/cgroup* /etc/slurm-llnl
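Munge is strict about the ownership and permissions of its key file, and munged may refuse to start if the copy left the key owned by root or world-readable. Resetting them usually sorts that out; this is a general munge requirement rather than anything specific to this setup.
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key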
- Enable and start Munge and the SLURM compute daemon
sudo systemctl enable munge
sudo systemctl start munge
sudo systemctl enable slurmd
sudo systemctl start slurmd
- Reboot the node
sudo reboot now
Testing
- To test the cluster and munge, I run the following command as a domain user (in this case, buildbot)
ssh buildbot@pi1 munge -n | unmunge
- To get the following result
buildbot@pi1's password:
STATUS: Success (0)
ENCODE_HOST: pi1.lan.maltshoppe.com (192.168.3.101)
ENCODE_TIME: 2023-02-19 08:07:42 -0500 (1676812062)
DECODE_TIME: 2023-02-19 08:07:15 -0500 (1676812035)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: buildbot (111105)
GID: domain users (110513)
LENGTH: 0
- Resume all nodes by running the following command
sudo scontrol update NodeName=pi[1-6] state=RESUME
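After resuming, sinfo on the controller should list all six nodes as idle in the mycluster partition.
sinfo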
- Run the hostname command across all six nodes from the controller and each node should report its hostname.
admin@pinfs:~ $ srun --nodes=6 hostname
pi1
pi6
pi5
pi4
pi3
pi2
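As a final check, the same test can be run as a batch job. This is only a minimal sketch; the script name (hello.sh) and job options below are my own choices, not part of the original setup.
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=1
#SBATCH --output=hello_%j.out
srun hostname
Submit it with sbatch hello.sh; once the job finishes, the hostnames of all six nodes should appear in the hello_<jobid>.out file.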