Notes About Power and Skynet Philosophy

Please keep in mind:

  • With great power comes great responsibility. You have sudo access on machines with people’s (sometimes personal) data on them. Please take this responsibility seriously.

  • We make decisions by consensus here, and the general philosophy is that SkyNet is a single cluster, not a union of sub-clusters (one for each lab). Please talk to others before making any sudo-level changes.

Tips and Tricks

Some tips and tricks live here

A Crap-Guide to Administrating Skynet


You are now responsible for administrating Skynet -- congrats! If you're like me, you aren't really trained as a system admin but are a decently intelligent person and will figure it out. With that said, welcome to a crap-guide to administrating Skynet written by an untrained admin for an untrained admin -- good luck!

1 Slurm

Slurm is our resource manager and the way users interact with the computation resources on Skynet. I encourage new admins to get familiar with the sinfo, scontrol, and sacctmgr commands in addition to the typical user commands of squeue, scancel, srun, and sbatch. It wouldn't hurt to have at least a surface understanding of nodes/partitions, generic resources (gres), and quality of service (QoS) in Slurm as well before continuing -- but hey, you're the admin.

Let's begin.

Important Slurm Processes: The cliff-notes version of Slurm is that each compute node runs a daemon slurmd that communicates over an encrypted channel with a central controller daemon slurmctld running on another node. In our case slurmctld runs on a persistent VM marvin.cc.gatech.edu and a backup controller runs on the compute node vicki.cc.gatech.edu -- the backup should only take over if the primary fails. You can check the status of the control daemons from any node by running the command scontrol ping. If that isn't fun enough, there is also a slurmdbd daemon that manages a database backend to keep track of user/group accounts and limits, which is currently sitting on vicki as well. Slurm also relies on the munged daemon for message encryption and the mysql daemon for the database backend on vicki. These should be configured to start automatically. I would recommend you get acquainted with the service command in Linux, which actually manages all of these daemons. For instance, if I want to check the status of the compute-node slurmd daemon, I could run sudo service slurmd status. If the status reports that the daemon is down and I wish to restart it, sudo service slurmd restart.
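
Putting the commands from that paragraph in one place, a typical health check looks something like:

# from any node: is the primary (marvin) or backup (vicki) controller answering?
scontrol ping
# on a compute node: check / restart the local slurmd
sudo service slurmd status
sudo service slurmd restart
# on marvin / vicki: the controller, database, and munge daemons
sudo service slurmctld status
sudo service slurmdbd status
sudo service munge status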

Slurm State: In order to allow for switching between the primary and backup controller, the slurm state is stored over NFS at /srv/share/slurm/state. This hopefully never matters, but is worth knowing in case something odd breaks. It also leaves us somewhat open to the danger of that partition filling up and Slurm running into an error.
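
A quick way to keep an eye on that (path from above):

df -h /srv/share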

1.0 Configless Slurm

Configuration: Slurm is configured through /etc/slurm-llnl/slurm.conf, /etc/slurm-llnl/gres.conf, and slurmdbd.conf, corresponding to the overall configuration, the generic resource (gres) configuration, and the database configuration. We use the autodetect feature in gres.conf, which detects GPUs automatically, so gres.conf is simply

AutoDetect=nvml

To update slurm.conf, directly edit it on nodes marvin and vicki and reload via

sudo systemctl restart slurmctld
sudo scontrol reconfigure

1.1 Managing Partitions

Partitions in our Slurm setup correspond to labs and are set up as follows in slurm.conf

PartitionName=<lab-name> QOS=<lab-name> PriorityTier=2 Default=NO AllowAccounts=<lab-name>
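
After editing slurm.conf and reconfiguring, you can sanity-check a partition with:

scontrol show partition <lab-name>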

1.2 Managing Nodes

Each server in Skynet is a node in Slurm with a corresponding number of GPU generic resources (gres) of the appropriate type. These nodes are grouped based on their resource types, and each of these groups is assigned a weight. Here's a snippet of slurm.conf as an example

slurm.conf

##     -- A40 group 1 --
NodeName=DEFAULT Weight=20
NodeName=spot,major Gres=gpu:a40:4 CPUs=128 CoresPerSocket=32 RealMemory=515000 MemSpecLimit=8000
NodeName=consu Gres=gpu:a40:4 CPUs=128 CoresPerSocket=32 RealMemory=515000 MemSpecLimit=8000
#     -- A40 group 2 --
NodeName=DEFAULT Weight=21
NodeName=ig-88,conroy,cyborg,omgwth,qt-1,spd-13,dave,nestor,sonny,robby,xaea-12,deebot,baymax,megabot,heistotron,chappie Gres=gpu:a40:8 CPUs=128 CoresPerSocket=32 RealMemory=515000 MemSpecLimit=8000

To figure out node memory, ssh into the node and check with htop (or similar). Note that memory in the config is in MB, i.e. take the memory in GB and multiply by 1000. To figure out the CPU specs, ssh into the node and run lscpu

> lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
Stepping:              4
CPU MHz:               3348.675
CPU max MHz:           3700.0000
CPU min MHz:           1000.0000
BogoMIPS:              5201.83
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              19712K
NUMA node0 CPU(s):     0-13,28-41
NUMA node1 CPU(s):     14-27,42-55
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d

Then take the values of CPU(s):, Thread(s) per core: and Core(s) per socket:. Note that ThreadsPerCore only needs to be specified if it is not 2, as that is the default value.
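
As a sketch, the lscpu output above (56 CPUs, 14 cores per socket, 2 threads per core) plus the node's memory would translate into a slurm.conf entry along these lines, where the node name, GPU gres, weight, and memory are placeholders you'd fill in:

#     -- hypothetical group for this hardware --
NodeName=DEFAULT Weight=[WEIGHT]
NodeName=[NODE_NAME] Gres=gpu:[GPU_TYPE]:[NUM_GPUS] CPUs=56 CoresPerSocket=14 RealMemory=[MEM_IN_MB] MemSpecLimit=8000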

As you'll notice in the actual config, we define many nodes with a single NodeName=... line. This allows Slurm to more efficiently find a node that satisfies a job's requirements, as it can check in batches. When adding a new node, either add it to an already existing group if it has the same hardware layout as other nodes, or create a new group.

Node Groups

Thanks to the growth of multi-GPU jobs, we want to try to compact jobs onto some small set of nodes -- leaving some other nodes with all GPUs free so multi-GPU jobs don't starve. However, if we do a strict "fill one node at a time" strategy, we end up with some terrible load-balancing. We try to strike a balance by having multiple groups which are filled sequentially and which load-balance within themselves. To do this, we set groups of nodes to have identical weight values.
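
One way to eyeball which weight group each node ended up in (%w prints the node's scheduling weight):

sinfo -N -o "%N %G %w"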

Changing Node States

We will occasionally need to take a node down for some reason or bring a downed node back up. This is done through the scontrol command, which I encourage you to look at, but I'll cover the basics here. There are quite a lot of different node states, but the ones you'll likely encounter the most are IDLE, MIXED, DOWN, and DRAIN. If a node is down or drained, we can bring it back by changing its state to RESUME. If we want to take a node down for maintenance but don't want to suddenly kick everyone off of it, we can set its state to DRAIN and no new jobs will be scheduled on it. In sinfo, the node will report as in a drng state until all existing jobs have finished, at which point it will change to drain and we are free to reboot / otherwise muck around with it. This applies to all compute nodes but the head node. To muck with the head node, you should probably notify all its users first.

  • Show detailed status of a node scontrol show node [NODE_NAME]
  • To check the status of all down nodes: sinfo -R
  • To take a node offline: scontrol update node=[NODE_NAME] state=drain reason="[TEXT REASON]"
  • If a node was assigned to down/drain by slurm, often a restart will fix it:
    • SSH into the node and restart it sudo shutdown --reboot 0. Wait for the node to boot back up.
    • When back, ensure the GPUs are all working correctly by SSHing into the machine and running nvidia-smi.
    • Put the node back in the resume state with scontrol update node=[NODE_NAME] state=resume.
  • Taking a batch of nodes back online: Script for restarting a batch of nodes, script for checking GPU status of a batch of nodes. When setting to resume you can specify a set of nodes like: scontrol update node=[NODE_NAME1],[NODE_NAME2] state=resume.
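
The linked batch scripts boil down to something like the following sketch (node names here are placeholders):

for node in node1 node2 node3; do
    echo "== ${node} =="
    ssh "${node}" nvidia-smi -L || echo "!! nvidia-smi failed on ${node}"
done

Once everything looks healthy: scontrol update node=node1,node2,node3 state=resume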

1.3 Managing Group Accounts

Rather than buying, maintaining, and relying on particular hardware themselves, labs that contribute to Skynet are allotted GPU usage equal to their contribution, but gain the benefit of the cluster's idle resources and resilience in case of individual node failures. We keep track of contributions in this spreadsheet, with per-lab GPU limits being calculated on the Member Contributions tab. From time to time, new groups need to be added or limits adjusted as new machines are added. This is done through the sacctmgr utility in Slurm (read more here).

Useful commands for managing group accounts:

To create a new group:

sudo sacctmgr add account [GROUP_NAME]

To set or modify group GPU limits:

sudo sacctmgr modify account name=[GROUP_NAME] set GrpTRES=gres/gpu=[NUM_GPUS],cpu=[NUM_CPUS]
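
For example, a hypothetical lab that contributed 64 GPUs and 512 CPUs would get something like the first line below (name and numbers are illustrative); the second line verifies the result:

sudo sacctmgr modify account name=examplelab set GrpTRES=gres/gpu=64,cpu=512
sudo sacctmgr show assoc account=examplelab format=account,user,grptres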

Overcap: One thing worth understanding is how users from one lab can make use of idle resources from another lab -- i.e. how users can acquire more GPUs than they've contributed. The way we have it set up, a user that is otherwise maxed out on GPUs for their group can submit a job with the --account=overcap option (this will automatically tag it with --qos=overcap as well) to run on idle resources from other groups. However, if Skynet fills up, that job will be killed to make room for a job from a user whose group hasn't reached its limit. Setting this up required modifying the default normal quality of service (QoS) that all accounts get by setting its Preempt value to the overcap QoS -- enabling preemption of overcap jobs by normal group ones. We also had to create an overcap QoS with an outrageously huge GPU limit to override the user's group limit. This is set to 9999 GPUs right now -- may the old gods have mercy on your soul if Skynet has grown to the point this is insufficient. We also had to set a few lines in slurm.conf to enable QoS-based preemption, specifically:

PreemptMode=requeue
PreemptType=preempt/qos
JobRequeue=0

To make sure folks are unable to create uninterruptible jobs in the overcap account, we also need to enforce that a QoS tag is valid for a given association:

AccountingStorageEnforce=limits,associations,qos
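
For reference, the accounting side of that setup would have looked roughly like the following sacctmgr commands (you shouldn't need to redo this; this is a sketch in case it ever needs to change):

sudo sacctmgr add qos overcap
sudo sacctmgr modify qos overcap set GrpTRES=gres/gpu=9999
sudo sacctmgr modify qos normal set Preempt=overcap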

Per user GPU Limits: Another thing in this vein is how per user GPU limits are implemented. Taking CVMLP as an example, there is a QoS, cvmlp-user-limits, that is set as the default QoS for cvmlp-lab (and is the only valid QoS for that account). This QoS limits a CVMLP user to only have 16 GPUs running in short/long. There is then the user-overcap partition that has a no-user-limits QoS applied to the partition and limits a user to have 9999 GPUs running at once (again, may the gods have mercy on your soul). Due to the way slurm works, the user GPU limit from the partition is the one that it picks as that is the one it sees first. This page goes into more details.
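
A sketch of how a per-user QoS like the CVMLP one described above could be created (names and the 16-GPU value follow the description; check the actual config and spreadsheet before changing anything):

sudo sacctmgr add qos cvmlp-user-limits
sudo sacctmgr modify qos cvmlp-user-limits set MaxTRESPerUser=gres/gpu=16
sudo sacctmgr modify account name=cvmlp-lab set DefaultQOS=cvmlp-user-limits QOS=cvmlp-user-limits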

I explain all this not to bore you or to enshrine my efforts, but because it feels like the sort of thing someone may want changed / have questions about at some point.

1.4 Managing Users

New users come up all the time and will be a minor nuisance throughout your term as admin. Getting a new user up and running on Skynet requires:

1 The user mailing helpdesk@cc.gatech.edu with their Buzzport login user id, with a Skynet-associated admin / faculty member CC'ed, indicating they need Skynet access. This typically resolves within a few hours to a day at worst and may take a little time to propagate. This is an excellent time to remind a new user to go look over the Skynet usage wiki.

2 After this, the user should be able to log into any node on Skynet with their GT user id and password (not their CoC account), but they don't yet have access to slurm. To add them, an admin has to assign them to a group account and the overcap account.
Useful commands for managing user slurm accounts:

To add a new user to a group account:

sudo sacctmgr add user [USER_ID] DefaultAccount=[GROUP_NAME]

Add a user to the overcap account:

sudo sacctmgr add user [USER_ID] account=overcap

Make sure you add a user to both their group account and the overcap account. A user can be added to both and all will be fine.
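
To double-check what a user ended up with:

sudo sacctmgr show assoc user=[USER_ID] format=user,account,qos,defaultqos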

To change a user's group account:

sudo sacctmgr remove user name=[USER_ID]
sudo sacctmgr add user [USER_ID] DefaultAccount=[GROUP_NAME]
sudo sacctmgr add user [USER_ID] account=overcap
(Okay... you're thinking surely there is another way than just deleting them. You are right, but it is annoying and requires adding the new account to the user, changing the default, and then removing the old one. The worst that deleting and recreating the user does is reset a tracker of how many resources they've used historically -- not something we currently use for anything, but it could become important if we shift to a different scheduler down the line. Until then......)
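
For the record, the non-destructive route sketched in that parenthetical looks roughly like:

sudo sacctmgr add user [USER_ID] account=[NEW_GROUP_NAME]
sudo sacctmgr modify user name=[USER_ID] set DefaultAccount=[NEW_GROUP_NAME]
sudo sacctmgr remove user name=[USER_ID] account=[OLD_GROUP_NAME]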

I did the steps to add a user above but they didn't work!

Sometimes the changes to add a user don't get picked up until the slurmdb daemon is restarted. To do that,

ssh vicki
sudo service slurmdbd restart

1.5 Managing Jobs

A user will occasionally ask you why their job isn't running. Often this can be resolved by just doing an scontrol show job [JOBID] on it and looking at the reason for it being queued. More often than not, it is because they've reached their GPU cap. With sudo you can also cancel other users' jobs with the scancel command.

Non-Slurm Points: People will sometimes have a job killed by the out-of-memory (OOM) killer on an overloaded server. This is not obvious to them, and they will be disappointed and confused about why their jobs keep dying! Ask them which node they were on and then go check the syslog; sudo cat /var/log/syslog | grep "Out of memory: Kill process" is a decent command to see when and which processes were killed this way.
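
The OOM killer also logs to the kernel ring buffer, so this is another quick check on the node in question:

sudo dmesg -T | grep -i "out of memory"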

2 Ansible


We use Ansible to configure all the nodes at once -- including updating Slurm configuration files, firewall management, and package installation. We keep our Ansible playbooks in a skynet_ansible git repo you should have access to. I would recommend installing Ansible on your local machine and launching commands from there [PLEASE READ THE README IN THAT REPO. MAKE SURE YOU HAVE A munge.key].

I'm not going into depth about how to write new Ansible plays, the documentation is good for that. I will however point out some tasks that come up frequently and our current protocols for addressing them.

Note: You will either need to be on campus or on the GT VPN when running Ansible, as the compute nodes are behind GT's firewall.

Note 2: Ansible has a -f argument that runs things in parallel. This is quite helpful, but don't increase it above 10. Above 10, GT's LDAP system will sometimes think there's a malicious attack going on with your account and temporarily ban you. This obviously causes issues.

2.0 Updating your machine

Before configuring a new node, there is a little bit of configuration of your local machine that needs to be done. Ansible talks to the nodes over ssh and requires an ssh key. If you don't already have an ssh key set up on Skynet, first, sorry that you have been having to live your life without one; second, go set one up with the steps in the ssh section here: wiki/skynet#how-to-get-access. Then run

ansible-playbook -K -u [YOUR_USERNAME] -f 10 -i hosts.cfg known_hosts.yml

to update your ~/.ssh/known_hosts (not already having all the nodes in your known_hosts is problematic). Note that this won't duplicate any already known hosts.

2.1 Configuring a New Node

Once a node comes up and the IT folks have put the bare-bones OS on it, they will pass it to us. To bring it in line with the rest of our systems, we'll want to configure it with Ansible. First, we'll need to add it to the list of Ansible compute nodes in hosts.cfg: add it as a node at the top, then add the node name alias to the cvmlp_compute:children group. We also need to tell Slurm about the node and any GPUs it has by modifying /etc/slurm-llnl/slurm.conf appropriately on vicki and marvin. After these configuration changes have been made, we need to restart and reconfigure vicki and marvin. Finally, the spreadsheet that keeps track of nodes in the cluster needs to be updated to indicate the node is configured, the GPU cap for the contributing group needs to be increased, and users/admins need to be notified.

A more bare-bones but precise list of commands is in the subsection below.

In each configuration step we run some version of the following command:

ansible-playbook -K -u [YOUR_USERNAME] -l [NEW_NODE] -f 10 -i hosts.cfg cvmlp.yml | tee ansible_log

from within the root of the skynet_ansible repo. This command will log into the new node using your username, attempt to escalate to root with your entered password, and then execute the full ansible playbook on that node.

I recommend looking over the execution report somewhat carefully to see if anything has FAILED -- the summary at the end will typically be the best place to get an overview. In general, things shouldn't fail... but if they do, you'll need to puzzle out why, and it may mean a change for the cluster as a whole or just a readjustment of the versions of packages installed through Ansible.

Typically I run most commands with --check before running them for real, just to make sure I know what they're going to do. The first time you run ansible-playbook against a node, --check will probably whine that "python-apt must be installed to use check mode." Forego --check for that first run and use it for subsequent commands (the requirement will be met after the first run).
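
Concretely, that just means adding --check to the usual invocation, e.g.:

ansible-playbook --check -K -u [YOUR_USERNAME] -l [NEW_NODE] -f 10 -i hosts.cfg cvmlp.yml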

Installing slurm to a new node (bare bones)

Step 0 Make sure all open pull-requests are closed (https://bitbucket.org/batra-mlp-lab/skynet_ansible/pull-requests). Also email helpdesk@cc.gatech.edu to set up autofs on the node (you cannot proceed until this is done)

Step 1

  • Modify hosts.cfg. Add the name of the new node at the top, then add its name alias to the cvmlp_compute:children group. You may want to run the known_hosts playbook to propagate the changes. ansible-playbook -K -u <username> -f 10 -i hosts.cfg known_hosts.yml

  • To install slurm, run the following playbook specifying the node you want to install it on. ansible-playbook -K -u <username> -l <new_node> -i hosts.cfg slurm_upgrade.yml

  • Install nvidia drivers and cuda. Nvidia installation changes somewhat frequently so our playbooks become out of date. Instead, just directly ssh into the node you are setting up and run the following commands. Please update these commands as this also will eventually go out of date.

apt-get remove --purge '^nvidia-.*' -y && apt-get remove --purge '^libnvidia-.*' -y && apt-get remove --purge '^cuda-.*' -y && sudo apt-key del 7fa2af80 && apt autoremove -y
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
rm cuda-keyring_1.*
apt update && apt install cuda-toolkit cuda-drivers -y
apt-key list
cd /etc/apt/sources.list.d/
rm cuda.list
cd ~
apt update && apt install cuda-toolkit cuda-drivers -y
apt autoremove -y && apt update && apt full-upgrade -y && apt autoremove -y
reboot

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub

You can run nvidia-smi to verify your installation.

  • Install main configuration. This can fail, and often it works if you just run it again :/. If the error persists, try uninstalling and reinstalling Ansible. Also, make sure that your munge.key is in place.

      $ ansible-playbook -l alexa -K -i hosts.cfg cvmlp.yml -e "ubuntu=20"
    
  • Edit /etc/slurm-llnl/slurm.conf on both marvin and vicki (directly ssh via ssh marvin). Then commit this new version to bitbucket. Two changes are needed: add the name of the new node to the long list following PartitionName=DEFAULT, and add the new node to its corresponding group based on its specs (e.g. GPU type). Do the same thing on vicki. See section 1.2 for details on how to update slurm.conf

  • Run the following commands such that your changes in the previous step will take effect:

      $ sudo systemctl restart slurmctld
    
      $ sudo scontrol reconfigure 
    
  • Update GPU cap in the spreadsheet. Then update this in slurm as well via...
    • Updating existing lab quota: sacctmgr modify qos <lab name> set grptres=gres/gpu:<gpu_type>=<num gpu_type>
    • Create new quota (new lab or maybe gpu type): sacctmgr create qos <qosname> grptres=<>
  • Notify everyone
  • Either commit and push on bitbucket or make a PR if you'd like to make sure everything was changed correctly. If you make a PR, please make sure it gets looked at and merged.

2.2 Debugging

Once this is done, you'll want to get onto a node and start poking around to verify nothing has broken. One good way to do that is to hop onto marvin and check the /var/log/slurm-llnl/slurmctld.log file to see if things have started up right. If you made any mistake in the syntax of the configuration files, it'll be reported there.

If everything seems right but some nodes are not showing up, or are showing as down in sinfo, you can use the scontrol show node command to get more details. Or you can check that services have started appropriately with sudo service slurmd status on any nodes that don't seem to be coming up.
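
A couple of concrete starting points (log path and commands from above):

# on marvin: check the controller log for config/startup errors
sudo tail -n 100 /var/log/slurm-llnl/slurmctld.log
# list nodes that are down/drained and the reason
sinfo -R
# on a node that won't come up: check the local slurmd
sudo service slurmd status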

2.3 Adding or Modifying Software Packages

Users will ask for new packages to be installed. When they do, bring it up on the cvmlp-servers channel to make sure no one knows anything terrible about those packages. If things seem good, install it manually on a single node and ask the requesting user to verify it works for them there. If so, we'll want to port it to all the nodes for the sake of homogeneity. Software packages are listed in /roles/cvmlp/tasks/base.yml depending on which tool installs them (i.e. pip vs apt-get). You can add, modify, or set versions for package installs there. Once these are changed, you'll need to run the play on all nodes to install.

ansible-playbook -K -u [YOUR_USERNAME] -f 10 -i hosts.cfg cvmlp.yml | tee ansible_log

This command will run all plays on all nodes and may take a while.
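
To spot-check that a package actually landed everywhere after the run, an ad-hoc Ansible command against the compute group works (hypothetical package name; this assumes the cvmlp_compute group from hosts.cfg):

ansible cvmlp_compute -u [YOUR_USERNAME] -f 10 -i hosts.cfg -m shell -a "dpkg -l | grep some-package"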

2.4 Slurmdbd

vicki is the node that slurmdbd is running on. slurmdbd manages the accounting database that tracks user/group accounts and limits.

  • Start slurmdbd: sudo slurmdbd. If slurmdbd is not running (check with ps aux | grep slurmdbd), you should restart it.
  • sudo slurmdbd -Dvvv will run it in debug mode and display useful error messages.
  • You need mysql running in the background:
    • sudo service mysql start
  • To setup the root password sudo mysqld --init-file=/coc/testnvme/aszot3/mysql-init.txt
    • File contains only: ALTER USER 'root'@'localhost' IDENTIFIED BY 'computervisionishard';
  • Modify /etc/mysql/my.cnf so it contains the following. Otherwise slurm will complain about the settings.
    [mysqld]
    innodb_buffer_pool_size=4096M
    innodb_log_file_size=64M
    innodb_lock_wait_timeout=900
    max_allowed_packet=16M
  • The database connection is configured in /etc/slurm-llnl/slurmdbd.conf via the following properties in that file:
    StoragePass=computervisionishard
    StorageType=accounting_storage/mysql
    StorageUser=root
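
To confirm the backend is actually up and reachable, something like the following works (by default the Slurm accounting database is named slurm_acct_db, unless StorageLoc is set otherwise in slurmdbd.conf):

sudo service mysql status
sudo mysql -u root -p -e "SHOW DATABASES;"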

2.5 Known Ansible Issues

GPG Keys

Sometimes an ansible run might fail with a message like

Failed to update apt cache:

To diagnose this further, run apt-get update on the failed machine. Often a GPG key has expired; apt-get update only prints a warning when this happens, but Ansible errors out completely. The output will tell you which key is out of date so you can update it. Most GPG keys are managed under the gpg_keys tag in the cvmlp.yml playbook, so running the following command may also catch any out-of-date keys.

ansible-playbook -l ava --tags gpg_keys -K -i hosts.cfg cvmlp.yml
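
To chase down which key is actually the problem, run the update by hand on the affected node and look for key errors:

ssh [NODE_NAME]
sudo apt-get update 2>&1 | grep -iE "expired|no_pubkey"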

3 Adding People to Skynet

Adding Users

See this list, addressed toward new users:

http://optimus.cc.gatech.edu/wiki/skynet#how-to-get-access

Actions that admins need to take:

  1. Add a new user to a lab slurm account: sudo sacctmgr add user [USER_ID] DefaultAccount=[GROUP_NAME]
  2. Add a new user to overcap: sudo sacctmgr add user [USER_ID] account=overcap
  3. Add a new user to slack: Ask for someone who has access to do it on the admin channel.

Adding Administrators

A few things need to happen to add skynet admins:

  1. Email Ken/Helpdesk to grant them sudo access.
  2. Add them to the admin slack channel.
  3. Add them to the email list.
  4. Send them this document.
  5. Add them to the ansible repo.
  6. Update the list of admins on the main skynet page.

Adding fileservers

See here for how to add fileservers.