Please keep in mind:
With great power comes great responsibility. You have sudo access on machines with people’s (sometimes personal) data on it. Please take this responsibility seriously.
We make decisions by consensus here, and the general philosophy is that SkyNet is a single cluster, not a union of sub-clusters (one for each lab). Please talk to others before making any sudo-level changes.
Some tips and tricks live here
You are now responsible for administering Skynet -- congrats! If you're like me, you aren't really trained as a system admin but are a decently intelligent person and will figure it out. With that said, welcome to a crap-guide to administering Skynet written by an untrained admin for an untrained admin -- good luck!
Slurm is our resource manager and the way users interact with the computation resources on Skynet. I encourage new admins to get familiar with the sinfo, scontrol, and sacctmgr commands in addition to the typical user commands of squeue, scancel, srun, and sbatch. It wouldn't hurt to have at least a surface understanding of nodes/partitions, generic resources, and quality of service in Slurm as well before continuing -- but hey, you're the admin.
Let's begin.
Important Slurm Processes: The cliff-notes version of Slurm is that each compute node runs a daemon, slurmd, that communicates over an encrypted channel with a central controller daemon, slurmctld, running on another node. In our case slurmctld runs on a persistent VM, marvin.cc.gatech.edu, and a backup controller runs on the compute node vicki.cc.gatech.edu -- the backup should only take over if the primary fails. You can check the status of the control daemons from any node by running the command scontrol ping. If that isn't fun enough, there is also a slurmdbd daemon that manages a database backend to keep track of user/group accounts and limits; it currently sits on vicki as well. Slurm also relies on the munged daemon for message encryption and the mysql daemon for the database backend on vicki. These should be configured to start automatically. I would recommend you get acquainted with the service command in Linux, which actually manages all of these daemons. For instance, if I want to check the status of the compute-node slurmd daemon, I could run sudo service slurmd status. If the status reports that the daemon is down and I wish to restart it, sudo service slurmd restart.
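As a quick health check after a reboot or outage, something like the following covers the usual suspects (a minimal sketch using only the daemons and hosts named above; run each command on the node it applies to):
scontrol ping                      # are the primary/backup controllers responding?
sudo service slurmd status         # on a compute node
sudo service slurmctld status      # on marvin
sudo service slurmdbd status       # on vicki
sudo service mysql status          # on vicki
sudo service munge status          # munged runs on every node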
Slurm State: In order to allow for switching between the primary and backup controller, the slurm state is stored over NFS at /srv/share/slurm/state
. This hopefully never matters, but is worth knowing in case something odd breaks. It also leaves us somewhat open to the danger of that partition filling up and Slurm running into an error.
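If Slurm starts misbehaving for no obvious reason, a quick check that the share hasn't filled up is cheap (standard tooling, nothing Slurm-specific):
df -h /srv/share
du -sh /srv/share/slurm/state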
Configuration: Slurm is configured through /etc/slurm-llnl/slurm.conf, /etc/slurm-llnl/gres.conf, and slurmdbd.conf, corresponding to the overall configuration, the generic resource configuration, and the database configuration respectively. We use an autoconfigured version of gres.conf which autodetects resources, so gres.conf is simply
autodetect=nvml
To update slurm.conf, edit it directly on marvin and vicki and reload via
sudo systemctl restart slurmctld
sudo scontrol reconfigure
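After a reconfigure it's worth a quick sanity check that both controllers are still up and the nodes/partitions look the way you expect (read-only commands):
scontrol ping
sinfo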
Partitions in our Slurm setup correspond to labs, and are set up as follows in slurm.conf
PartitionName=<lab-name> QOS=<lab-name> PriorityTier=2 Default=NO AllowAccounts=<lab-name>
Each server in Skynet is a node in Slurm with a corresponding number of GPU generic resources (gres) of the appropriate type. These nodes are grouped based on their resource types, and each of these groups is assigned a weight. Here's a snippet of slurm.conf as an example:
## -- A40 group 1 --
NodeName=DEFAULT Weight=20
NodeName=spot,major Gres=gpu:a40:4 CPUs=128 CoresPerSocket=32 RealMemory=515000 MemSpecLimit=8000
NodeName=consu Gres=gpu:a40:4 CPUs=128 CoresPerSocket=32 RealMemory=515000 MemSpecLimit=8000
# -- A40 group 2 --
NodeName=DEFAULT Weight=21
NodeName=ig-88,conroy,cyborg,omgwth,qt-1,spd-13,dave,nestor,sonny,robby,xaea-12,deebot,baymax,megabot,heistotron,chappie Gres=gpu:a40:8 CPUs=128 CoresPerSocket=32 RealMemory=515000 MemSpecLimit=8000
To figure out a node's memory, ssh into the node and check with htop (or similar). Note that memory in the config is in MB, i.e. take the memory in GB and multiply by 1000. To figure out the CPU specs, ssh into the node and run lscpu:
> lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
Stepping: 4
CPU MHz: 3348.675
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 5201.83
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d
Then take the values of CPU(s):, Thread(s) per core:, and Core(s) per socket:. Note that ThreadsPerCore only needs to be specified if it is not 2, as that is the default value.
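For the example lscpu output above, the CPU half of a NodeName line would look roughly like the following; the node name, GPU type/count, and memory values are placeholders you would fill in from nvidia-smi / htop (this is a sketch, not a real node in our config):
NodeName=[NEW_NODE] Gres=gpu:[GPU_TYPE]:[NUM_GPUS] CPUs=56 CoresPerSocket=14 RealMemory=[MEM_IN_MB] MemSpecLimit=8000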
As you'll notice in the actual config, we define many nodes with a single NodeName=... line. This allows Slurm to more efficiently find a node that satisfies a job's requirements, as it can check in batches. When adding a new node, either add it to an already existing group if it has the same hardware layout as other nodes, or create a new group.
Thanks to the growth of multi-GPU jobs, we want to try to compact jobs onto some small set of nodes -- leaving some other nodes with all GPUs free so multi-GPU jobs don't starve. However, if we do a strict "fill one node at a time" strategy, we end up with terrible load-balancing. We try to strike a balance by having multiple groups which are filled sequentially and that load-balance within themselves. To do this, we set groups of nodes to have identical Weight values.
We will occasionally need to take a node down for some reason or bring a downed node back up. This is done through the scontrol command, which I encourage you to read up on, but I will cover the basics here. There are quite a lot of different node states, but the ones you'll likely encounter most are IDLE, MIXED, DOWN, and DRAIN. If a node is down or drained, we can bring it back by changing its state to RESUME. If we want to take a node down for maintenance but don't want to suddenly kick everyone off of it, we can set its state to DRAIN and no new jobs will be scheduled on it. In sinfo, the node will report as in a drng state until all existing jobs have finished, at which point it will change to drain and we are free to reboot / otherwise muck around with it. This applies to all compute nodes but the head node. To muck with the head node you should probably notify all its users first.
scontrol show node [NODE_NAME]
sinfo -R
To take a node down for maintenance and bring it back:
1 scontrol update node=[NODE_NAME] state=drain reason=[TEXT REASON]
2 sudo shutdown --reboot 0. Wait for the node to boot back up.
3 Check nvidia-smi.
4 scontrol update node=[NODE_NAME] state=resume.
To resume multiple nodes at once: scontrol update node=[NODE_NAME1],[NODE_NAME2] state=resume.
Rather than buying, maintaining, and relying on particular hardware themselves, labs that contribute to Skynet are allotted GPU use equal to their contribution but gain the benefit of the cluster's idle resources and resilience in case of individual node failures. We keep track of contributions in this spreadsheet, with per-lab GPU limits being calculated on the Member Contributions tab. From time to time new groups need to be added or limits adjusted as new machines are added. This is done through the sacctmgr utility in Slurm (read more here).
To create a new group:
sudo sacctmgr add account [GROUP_NAME]
To set or modify group GPU limits:
sudo sacctmgr modify account name=[GROUP_NAME] set GrpTRES=gres/gpu=[NUM_GPUS],cpu=[NUM_CPUS]
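Before (and after) changing anything, it's handy to look at what a group's limits and members currently are (read-only):
sudo sacctmgr show account [GROUP_NAME] withassoc format=Account,User,GrpTRES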
Overcap: One thing worth understanding is how users from one lab can make use of idle resources from another lab -- i.e. how users can acquire more GPUs than they've contributed. The way we have it set up, a user that is otherwise maxed out on GPUs for their group can issue a job with the --account=overcap option (this will automatically tag it with --qos=overcap as well) to run on idle resources from other groups. However, if Skynet fills up, that job will be killed to make room for a job from a user whose group hasn't reached its limit. Setting this up required modifying the default normal quality of service (QoS) that all accounts get by setting its Preempt value to the overcap QoS -- enabling preemption of overcap jobs by normal group ones. We also had to create an overcap QoS with an outrageously huge GPU limit to override the user's group limit. This is set to 9999 GPUs right now -- may the old gods have mercy on your soul if Skynet has grown to the point this is insufficient. We also had to set a few lines in slurm.conf to enable QoS-based preemption, specifically:
PreemptMode=requeue
PreemptType=preempt/qos
JobRequeue=0
and to make sure folks are unable to create uninterruptible jobs in the overcap account, we also need to enforce that a QOS tag is valid for a given association:
AccountingStorageEnforce=limits,associations,qos
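For reference, the QoS side of this was set up with sacctmgr along these lines (a sketch reconstructed from the description above, not a transcript of the original commands):
sudo sacctmgr add qos overcap
sudo sacctmgr modify qos overcap set GrpTRES=gres/gpu=9999
sudo sacctmgr modify qos normal set Preempt=overcap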
Per-user GPU Limits: Another thing in this vein is how per-user GPU limits are implemented. Taking CVMLP as an example, there is a QoS, cvmlp-user-limits, that is set as the default QoS for cvmlp-lab (and is the only valid QoS for that account). This QoS limits a CVMLP user to only 16 GPUs running in short/long. There is then the user-overcap partition, which has a no-user-limits QoS applied to it and limits a user to 9999 GPUs running at once (again, may the gods have mercy on your soul). Due to the way Slurm works, the user GPU limit from the partition is the one it picks, as that is the one it sees first. This page goes into more details.
I explain all this not to bore you or to enshrine my efforts, but because it feels like the sort of thing someone may want changed / have questions about at some point.
New users come up all the time and will be a minor nuisance throughout your term as admin. Getting a new user up and running on Skynet requires:
1 The user mailing helpdesk@cc.gatech.edu with their Buzzport login user id, with a Skynet-associated admin / faculty member CC'ed, indicating they need Skynet access. This typically resolves within a few hours to a day at worst and may take a little time to propagate. This is an excellent time to remind a new user to go look over the Skynet usage wiki.
2 After this, the user should be able to log into any node on Skynet with their GT user id and password (not their CoC account), but they don't yet have access to slurm. To add them, an admin has to assign them to a group account and the overcap account.
To add a new user to a group account:
sudo sacctmgr add user [USER_ID] DefaultAccount=[GROUP_NAME]
Add a user to the overcap account:
sudo sacctmgr add user [USER_ID] account=overcap
Make sure you add a user to both their group account and the overcap account. A user can be added to both and all will be fine.
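You can confirm the associations landed with a read-only query:
sudo sacctmgr show user [USER_ID] withassoc format=User,DefaultAccount,Account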
To change a user's group account:
sudo sacctmgr remove user name=[USER_ID]
sudo sacctmgr add user [USER_ID] DefaultAccount=[GROUP_NAME]
sudo sacctmgr add user [USER_ID] account=overcap
(Okay... you're thinking surely there is another way than just deleting them. You are right, but it is annoying and requires adding the new account to the user, changing the default, and then removing the old one. The worst that deleting and recreating the user does is reset the tracker of how many resources they've used historically -- not something we currently use for anything, but it could become important if we shift to a different scheduler down the line. Until then......)
I did the steps to add a user above but they didn't work!
Sometimes the changes to add a user don't get picked up until the slurmdbd daemon is restarted. To do that:
ssh vicki
sudo service slurmdbd restart
A user will occasionally ask you why their job isn't running. Often this can be resolved by just doing an scontrol show job [JOBID] on it and looking at the reason it is queued. More often than not, it is because they've reached their GPU cap. With sudo you can also cancel other users' jobs with the scancel command.
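A quick way to see the pending reason for everything a user has queued (the format string is just one reasonable choice; %r is the reason column):
squeue -u [USER_ID] -o "%.10i %.12P %.8T %r"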
Non-Slurm Points: People will sometimes have a job killed by the out-of-memory killer on an overloaded server. This is not transparent to them and they will be disappointed and confused about why their jobs keep dying! Ask them which node they were on and then go check the syslogs; sudo cat /var/log/syslog | grep "Out of memory: Kill process" is a decent command to see when and which processes were killed this way.
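If the syslog has already rotated, the kernel ring buffer often still has the evidence (standard tooling, nothing Skynet-specific):
sudo dmesg -T | grep -i "out of memory"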
We use Ansible to configure all the nodes at once -- including updating Slurm configuration files, firewall management, and package installation. We keep our Ansible playbooks in a skynet_ansible git repo you should have access to. I would recommend installing Ansible on your local machine and launching commands from there [PLEASE READ THE README IN THAT REPO. MAKE SURE YOU HAVE A munge.key].
I'm not going to go into depth about how to write new Ansible plays; the documentation is good for that. I will however point out some tasks that come up frequently and our current protocols for addressing them.
Note: You will need to be either on campus or on the GT VPN when running Ansible, as the compute nodes are behind GT's firewall.
Note 2: Ansible has a -f argument that runs things in parallel. This is quite helpful, but don't increase it above 10. Above 10, GT's LDAP system will sometimes think there is a malicious attack going on with your account and temporarily ban you. This obviously causes issues.
Before configuring a new node, there is a little bit of configuration of your local machine that needs to be done. Ansible talks to the nodes over ssh and requires an ssh key. If you don't already have an ssh key set up on Skynet: first, sorry that you have been having to live your life without one; second, go set one up with the steps in the ssh section here: wiki/skynet#how-to-get-access.
Then run
ansible-playbook -K -u [YOUR_USERNAME] -f 10 -i hosts.cfg known_hosts.yml
to update your ~/.ssh/known_hosts (not already having all the nodes in your known_hosts is problematic). Note that this won't duplicate any already-known hosts.
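If you want to double check that Ansible can actually reach every node before running real plays, the stock ping module works (it only opens an ssh connection to each host; nothing is changed):
ansible all -i hosts.cfg -u [YOUR_USERNAME] -f 10 -m ping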
Once a node comes up and the IT folks have put the bare-bones on it, they will pass it to us. To bring it in line with the rest of our systems, we'll want to configure it with Ansible. First, we'll need to add it to the list of Ansible compute nodes in hosts.cfg: add it as a node at the top, then add the node name alias to the cvmlp_compute:children group. We also need to tell Slurm about the node and any GPUs it has by modifying /etc/slurm-llnl/slurm.conf appropriately on vicki and marvin. After these configuration changes have been made, we need to restart and reconfigure vicki and marvin. Finally, the spreadsheet that keeps track of nodes in the cluster needs to be updated to indicate the node is configured, the GPU cap for the contributing group needs to be increased, and users/admins need to be notified.
A more bare-bones but precise list of commands is in the subsection below.
In each configuration step we run some version of the following command:
ansible-playbook -K -u [YOUR_USERNAME] -l [NEW_NODE] -f 10 -i hosts.cfg cvmlp.yml | tee ansible_log
from within the root of the skynet_ansible repo. This command will log into the new node using your username, attempt to escalate to root with your entered password, and then execute the full Ansible playbook on that node.
I recommend looking over the execution report somewhat carefully to see if anything has FAILED -- the summary at the end is typically the best place to get an overview. In general, things shouldn't fail... but if they do, you'll need to puzzle out why; it may mean a change for the cluster as a whole or just a readjustment of the package versions installed through Ansible.
Typically I run most commands with --check before running them for real, just to make sure I know what they're going to do. The first time you run ansible-playbook against a fresh node, --check will probably whine that "python-apt must be installed to use check mode." Forego --check on that first run and use it for subsequent commands (the requirement will be met after the first run).
Step 0 Make sure all open pull-requests are closed (https://bitbucket.org/batra-mlp-lab/skynet_ansible/pull-requests). Also email helpdesk@cc.gatech.edu to set up autofs on the node (you cannot proceed until this is done)
Step 1
Modify hosts.cfg. Add the name of the new node at the top, then add its name alias to the cvmlp_compute:children group.
You may want to run the known_hosts playbook to propagate the changes.
ansible-playbook -K -u <username> -f 10 -i hosts.cfg known_hosts.yml
To install slurm, run the following playbook specifying the node you want to install it on.
ansible-playbook -K -u <username> -l <new_node> -i hosts.cfg slurm_upgrade.yml
Install NVIDIA drivers and CUDA. The NVIDIA installation procedure changes somewhat frequently, so our playbooks become out of date. Instead, just ssh directly into the node you are setting up and run the following commands. Please update these commands, as this too will eventually go out of date.
apt-get remove --purge '^nvidia-.*' -y && apt-get remove --purge '^libnvidia-.*' -y && apt-get remove --purge '^cuda-.*' -y && sudo apt-key del 7fa2af80 && apt autoremove -y
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
rm cuda-keyring_1.*
apt update && apt install cuda-toolkit cuda-drivers -y
apt-key list
cd /etc/apt/sources.list.d/
rm cuda.list
cd ~
apt update && apt install cuda-toolkit cuda-drivers -y
apt autoremove -y && apt update && apt full-upgrade -y && apt autoremove -y
reboot
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub
You can run nvidia-smi to verify your installation.
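Since gres.conf relies on autodetection, it can also be worth confirming that Slurm sees the new GPUs. slurmd can print the GRES it detects (if memory serves, the flag is -G; double check against an existing node first):
sudo slurmd -G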
Install the main configuration. This can fail, and often it works if you just run it again :/. If the error persists, try uninstalling and reinstalling Ansible. Also, make sure that your munge.key is there.
$ ansible-playbook -l alexa -K -i hosts.cfg cvmlp.yml -e "ubuntu=20"
Edit /etc/slurm-llnl/slurm.conf on both marvin and vicki (ssh in directly, e.g. ssh marvin), then commit the new version to bitbucket. Two changes are needed: add the name of the new node to the long list following PartitionName=DEFAULT, and add the new node to its corresponding group based on its specs (e.g. GPU type). Do the same thing on vicki. See section 1.2 for details on how to update slurm.conf.
Run the following commands so that your changes from the previous step take effect:
$ sudo systemctl restart slurmctld
$ sudo scontrol reconfigure
Finally, update the contributing lab's GPU cap (or, for a brand-new lab, create its QoS) with sacctmgr:
sacctmgr modify qos <lab name> set grptres=gres/gpu:<gpu_type>=<num gpu_type>
sacctmgr create qos <qosname> grptres=<>
Once this is done, you'll want to go onto a node and start poking around to verify nothing has broken. One good way to do that is to hop onto marvin and check the /var/log/slurm-llnl/slurmctld.log file to see if things have started up right. If you made any mistake in the syntax of the configuration files, it'll be reported there.
If everything seems right but some nodes are not showing up, or are showing as down in sinfo, you can use the scontrol show node command to get more details. Or you can check that services have started appropriately with sudo service slurmd status on any nodes that don't seem to be coming up.
Users will ask for new packages to be installed. When they do, bring it up on the cvmlp-servers channel to make sure no one knows anything terrible about those packages. If things seem good, install it manually on a single node and ask the requesting user to verify it works for them there. If so, we'll want to port it to all the nodes for the sake of homogeneity. Software packages are listed in /roles/cvmlp/tasks/base.yml, grouped by which tool installs them (i.e. pip vs apt-get). You can add, modify, or pin versions for package installs there. Once these are changed, you'll need to run the play on all nodes to install.
ansible-playbook -K -u [YOUR_USERNAME] -f 10 -i hosts.cfg cvmlp.yml | tee ansible_log
This command will run all plays on all nodes and may take a while.
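Per the earlier --check advice, it's reasonable to dry-run the change against a single node first (same play, limited with -l and run in check mode; [TEST_NODE] is whatever node you want to use as the guinea pig):
ansible-playbook -K -u [YOUR_USERNAME] -l [TEST_NODE] -f 10 -i hosts.cfg cvmlp.yml --check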
vicki is the node that slurmdbd runs on. slurmdbd manages the user accounts and partitions.
If slurmdbd is not running (check with ps aux | grep slurmdbd), restart it with sudo slurmdbd.
sudo slurmdbd -Dvvv will run it in debug mode and display useful error messages.
slurmdbd needs mysql running in the background: sudo service mysql start
If you need to reset the mysql root password, run mysqld with an init file containing an ALTER USER statement:
sudo mysqld --init-file=/coc/testnvme/aszot3/mysql-init.txt
where mysql-init.txt contains something like:
ALTER USER 'root'@'localhost' IDENTIFIED BY 'computervisionishard';
Edit /etc/mysql/my.cnf so it contains the following. Otherwise Slurm will complain about the settings.
[mysqld]
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
max_allowed_packet=16M
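You can confirm mysql actually picked up these settings with a quick read-only query (the password is the StoragePass below):
mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb%';"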
Then set the following properties in /etc/slurm-llnl/slurmdbd.conf:
StoragePass=computervisionishard
StorageType=accounting_storage/mysql
StorageUser=root
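If slurmdbd can reach the database, account queries should work from any node; a simple read-only smoke test:
sacctmgr show cluster
sacctmgr show account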
Sometimes an ansible run might fail with a message like
Failed to update apt cache:
To further diagnose this, run apt-get update on the failed machine to update its apt repository. Sometimes a GPG key has expired; apt-get update will only give a warning when this happens, but Ansible errors out completely. To fix it, run apt-get update to see which key is out of date and update it. Most GPG keys are managed under the gpg_keys tag in the cvmlp.yml playbook, so running the following command may also catch any out-of-date keys.
ansible-playbook -l ava --tags gpg_keys -K -i hosts.cfg cvmlp.yml
See this list, addressed toward new users:
http://optimus.cc.gatech.edu/wiki/skynet#how-to-get-access
Actions that admins need to take:
sudo sacctmgr add user [USER_ID] DefaultAccount=[GROUP_NAME]
sacctmgr add user [USER_ID] account=overcap
A few things need to happen to add skynet admins:
See here for how to add fileservers.