High Performance Computing at Stanford
A step-by-step guide to using ClusterJob and Sherlock to run massive computational experiments at Stanford
In this tutorial, I’ll walk through the process of running a large computational experiment using two tools: ClusterJob and Sherlock. ClusterJob is an automation system for high-throughput reproducible computations, created by Hatef Monajemi and David L. Donoho. Sherlock is Stanford’s High Performance Computing (HPC) cluster.
You can also find an exact replica of this tutorial on a separate website I created for Statistics 285: Massive Computational Experiments, Painlessly, a course taught at Stanford University in Spring 2021. Note that Sherlock is, unfortunately, only accessible to the Stanford community.
My goal is to provide a relatively painless introduction to High Performance Computing at Stanford.
Step 1: Setting up Sherlock
Create an account
If you don’t already have a Sherlock account, now’s the time to request one. After creating an account, you’ll want to set up your access credentials using SSH keys, which allows for remote communication between your local machine and the cluster.
Set up SSH connection
To set up your connection, check if you already have SSH keys on your machine. You can do this in two ways:
☞︎ Go to your home directory and navigate to the hidden¹ .ssh folder
☞︎ Go to your terminal and enter ls -al ~/.ssh
If you see a .pub file, you already have SSH keys set up. If you don’t, run the following in your terminal:
ssh-keygen -t rsa -C "your_email@example.com"
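The check for existing keys can also be scripted. Here is a minimal Python sketch, assuming keys live in the default ~/.ssh location; the function name find_public_keys is hypothetical and not part of any tool mentioned in this tutorial:

```python
from pathlib import Path


def find_public_keys(ssh_dir=None):
    """Return the sorted names of public key files (*.pub) in ssh_dir.

    ssh_dir defaults to ~/.ssh, the usual OpenSSH location (an assumption).
    """
    directory = Path(ssh_dir) if ssh_dir is not None else Path.home() / ".ssh"
    if not directory.is_dir():
        return []  # no .ssh folder at all, so no keys yet
    return sorted(p.name for p in directory.glob("*.pub"))


if __name__ == "__main__":
    keys = find_public_keys()
    if keys:
        print("Found public keys:", ", ".join(keys))
    else:
        print("No public keys found; generate one with ssh-keygen.")
```

An empty result means you still need to run the ssh-keygen command above.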
Now, all you need to do is copy your keys over to the remote cluster.
ssh-copy-id your-username@sherlock.stanford.edu #run this in your terminal
Next, add the following to your ~/.ssh/config file to avoid Two-Factor Authentication every time you access the cluster, either from your local machine or through ClusterJob.
# This block enables direct access to Sherlock from your terminal
Host sherlock sherlock?? sherlock.stanford.edu sherlock??.stanford.edu
    ControlMaster auto
    ControlPath ~/.ssh/%r@%h:%p
    ControlPersist yes

# This block enables access to Sherlock through ClusterJob
Host login.sherlock.stanford.edu
    ControlMaster auto
    ControlPath ~/.ssh/%l%r@%h:%p
    ControlPersist yes
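The ControlPath lines tell SSH where to keep the socket file of the shared connection. In OpenSSH, %r expands to the remote username, %h to the remote host name, %p to the port, and %l to the local host name. The sketch below mimics that expansion for illustration only; expand_control_path is a hypothetical helper, not an SSH API, and it handles just these four tokens:

```python
def expand_control_path(template, remote_user, host, port=22, local_host="my-laptop"):
    """Expand the %r, %h, %p, and %l tokens of an ssh_config ControlPath.

    Illustrative only: OpenSSH supports more tokens than the four handled here.
    """
    tokens = {
        "%r": remote_user,   # remote username
        "%h": host,          # remote host name
        "%p": str(port),     # remote port
        "%l": local_host,    # local host name
    }
    expanded = template
    for token, value in tokens.items():
        expanded = expanded.replace(token, value)
    return expanded


print(expand_control_path("~/.ssh/%r@%h:%p", "your-username", "sherlock.stanford.edu"))
# prints: ~/.ssh/your-username@sherlock.stanford.edu:22
```

Because the path is identical for every connection to the same user, host, and port, later ssh sessions can reuse the first authenticated connection instead of prompting again.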
Show me how to edit my ~/.ssh/config file
Editing your ~/.ssh/config file can be done in two ways:
☞︎ Directly, by navigating to .ssh > config from your home directory
☞︎ Through your terminal, by running:
cd ~/.ssh #change directory to the .ssh folder
open config
You can now check whether your connection works by trying to connect directly to the server. In your terminal, run:
ssh your-username@sherlock.stanford.edu
The first time you do this, you’ll likely get a warning like this:
The authenticity of host 'login.sherlock.stanford.edu' can't be established.
ECDSA key fingerprint is SHA256:eB0bODKdaCWtPgv0pYozsdC5ckfcBFVOxeMwrNKdkmg.
Are you sure you want to continue connecting (yes/no)?
Simply type ‘yes’ and proceed. After completing these steps, your Sherlock account should be ready to go! 🎉
Step 2: Setting up ClusterJob
Create an account
Set up an account with ClusterJob using your @edu email account. Take note of your chosen ClusterJob ID and the ClusterJob Key assigned to your account.
Install ClusterJob
In your terminal, run:
git clone https://github.com/monajemi/clusterjob.git ~/CJ_install #clones CJ from GitHub
sudo cpan -i Data::Dumper Data::UUID FindBin File::chdir File::Basename File::Spec IO::Socket::INET IO::Socket::SSL Getopt::Declare Term::ReadLine JSON::PP JSON::XS Digest::SHA Time::Local Time::Piece Moo HTTP::Thin HTTP::Request::Common JSON URI #installs perl dependencies
alias cj='perl ~/CJ_install/src/CJ.pl'; #builds an alias for CJ
Set up SSH configuration
You just installed a CJ_install folder to your home directory. In it, you will find two important files that you’ll need to edit: cj_config and ssh_config.
In cj_config, you’ll provide your ClusterJob ID (the username you chose) and the ClusterJob Key you received when creating your account.
Copy the following into the cj_config file:
CJID your-id #edit this
CJKEY
eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJhZG1pbiI6MCwiZCI6eyJ1aWQiOiJzaGlsYWFuIiwiY2pwYXNzY29kZSI6IjJmYWQ2OWU1YjZlNDQxNDE2NjhhOTIxZThiMmNlNTYwIn19.P60piuQFOmzny9dFmwWoWDGeNrtsi6UHl_16OIdoICa-C6Y8KeGadT6pcMJvyKLlBs163rR_p1CXkm33l6L8fhH9tsJGG3UN4cMocWsVeWH_ORfZsdNvuWa24IO2Yh7MPMTj067e9UodDcOYe7N2swu9eWfvC82YBk7Ubna3ZDnHi4icK06exK1_mIj8jv0fDzHS4m0eWd5u0Sg1YecMp9YXU3DEc_l3Hxroyc_qnfVmK9WhiDTfAx6ZYoHxFF2VecWVsOB6-Pq6cjYKKw7BQIiLbQ0VLIZmwjX3QiQRTvi6vX4vsfwxHTvsNKGE_L2ru9NAfcuRigX1mgOCLBwU9g #edit this
SYNC_TYPE manual
SYNC_INTERVAL 300
Note that the cj_config file is very sensitive to spacing and line breaks. Make sure you add line breaks, as in the example above, or you’ll get obscure error messages when running ClusterJob.
In ssh_config, you’ll provide information about the Sherlock cluster. Copy the following into the ssh_config file:
[sherlock2]
Host login.sherlock.stanford.edu
User your-username #edit this to your own username
Bqs SLURM
Repo /scratch/users/your-username/CJRepo_Remote #edit this
Python python/3.8.8
Pythonlib IPython:pandas:numpy:libgcc:scipy:matplotlib:cvxpy:-c conda-forge
Alloc --time 48:00:00 --mem 32G
R R/3.4.0
Rlib ggplot2
[sherlock2]
Show me how to edit my cj_config and ssh_config files
Again, you can edit these two files through your terminal or manually:
☞︎ Directly navigate to CJ_install > cj_config/ssh_config from your home directory
☞︎ Through your terminal, run:
cd ~/CJ_install
open cj_config
open ssh_config
We’re ready to check if everything is working correctly. In your terminal, run:
cj init #initialize your CJ agent
cj who #check if the agent is installed
cj update #update to newest version
Step 3: Submit your first job ⚒︎
Here’s where the real benefit of ClusterJob comes in. To run a job on Sherlock, you normally have to write a job submission script describing your resource request and submission options. ClusterJob automates this process and does it for you! No need to learn anything about Sherlock’s job scheduler, Slurm.
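For a sense of what ClusterJob saves you from writing, here is a rough sketch of a hand-written Slurm batch script, generated from Python. The --time and --mem values mirror the Alloc line in the ssh_config example; the function name, job name, and module line are illustrative assumptions, and this is not the script ClusterJob itself generates:

```python
def make_sbatch_script(script, time="48:00:00", mem="32G", job_name="my-first-job"):
    """Build the text of a minimal Slurm batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",  # label shown in the queue
        f"#SBATCH --time={time}",          # wall-clock limit
        f"#SBATCH --mem={mem}",            # memory request
        "",
        "module load python/3.8.8  # assumed module name, taken from the ssh_config example",
        f"python {script}",
    ]
    return "\n".join(lines) + "\n"


print(make_sbatch_script("simpleExample.py"))
```

Without ClusterJob, you would save text like this to a file on the cluster and submit it with sbatch yourself.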
Run your first serial computation
The CJ_install folder comes with some example scripts to run, so let’s give those a try.
Show me the script I’m about to run
# This is a test Python script for CJ
# Author: Hatef Monajemi June 11 2017
import numpy as np;
import csv;
SUID = 'monajemi'
file = SUID+'_results.csv';
Var0 = np.array([1,2,3]);
Var1 = [1,2];
with open('file.txt','w') as myfile:
    for i in range(len(Var0)):
        for j in range(len(Var1)): # This is a comment
            # write to a text file for testing reduce
            with open(file,'a') as csvfile:
                resultswriter = csv.writer(csvfile,delimiter=',');
                resultswriter.writerow([i,j,Var0[i]+Var1[j] ]);
In your terminal, run:
cd ~/CJ_install/example/Python/ #change directory to the folder with Python example
cj run simpleExample.py sherlock2 -m "A message." #run your first serial job!
When you run the second command, starting with cj run, you’ll get a message like this:
CJmessage::initiating package 0df1b4e7
In this case, 0df1b4e7 is your job ID (referred to by ClusterJob as pid, for process identifier); take note of it.
Show me what my CJ messages should look like
CJmessage::initiating package 0df1b4e7
CJmessage::runing [simpleExample.py] on [sherlock2] with:
alloc: --time 48:00:00 --mem 32G
CJmessage::sending from: /Users/shilaan/CJ_install/example/Python
CJmessage::Creating/checking conda venv. This may take a while the first time...
CJmessage::Creating reproducible script(s) reproduce_simpleExample.py
CJmessage::compressing files to propagate...
CJmessage::sending 1.92 KB to: sherlock2:/scratch/users/shilaan/CJRepo_Remote/simpleExample
CJmessage::extracting package...
CJmessage::Submitting job...
CJmessage::1 job(s) submitted (24992894)
After you’ve successfully submitted your job, you can check its status by running:
cj state
When your job is done, you’ll see a message like this:
pid 0df1b4e7aed70b1a367212f861729d0bc8fcfc29
remote_account: shilaan@login.sherlock.stanford.edu
job_id: 24992894
state: COMPLETED
Now, another one of ClusterJob’s benefits: it’s really easy to get the results of your job back onto your local computer. Run:
cj get 0df1b4e7 #replace with your own job id
Now, you’ll get the following message:
CJmessage::Getting results from 'sherlock2'
CJmessage::Please see your last results in /Users/shilaan/CJ_get_tmp/0df1b4e7aed70b1a367212f861729d0bc8fcfc29
Your results are ready for viewing! 🥳 The mentioned folder will include your results, your original script, and a script that fully reproduces the results.
Run your first parallel computation
We can rerun the same job we just submitted, but this time in parallel. The script we ran contains two nested for loops that together cover 3 × 2 = 6 index combinations. Instead of running these serially, we can submit a separate job for each index combination. In other words, we’ll submit 6 separate jobs to Sherlock. Again, ClusterJob will fully automate this process for you.
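To see where the 6 jobs come from, the sketch below enumerates the index combinations of the example script in plain Python. It only illustrates the decomposition; it is not how ClusterJob splits the script internally, and run_single_job is a hypothetical name:

```python
from itertools import product

Var0 = [1, 2, 3]
Var1 = [1, 2]

# One (i, j) pair per parallel job: 3 values of i times 2 values of j.
index_combinations = list(product(range(len(Var0)), range(len(Var1))))


def run_single_job(i, j):
    """Compute the one CSV row that the job for indices (i, j) would write."""
    return [i, j, Var0[i] + Var1[j]]


rows = [run_single_job(i, j) for i, j in index_combinations]
print(len(index_combinations))  # prints: 6
print(rows[0], rows[-1])        # prints: [0, 0, 2] [2, 1, 5]
```

Each job is independent of the others, which is what makes it safe to run them at the same time.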
In your terminal, simply run:
cd ~/CJ_install/example/Python/ #change directory to the folder with Python example
cj parrun simpleExample.py sherlock2 -m "A message." #run your first parallel job!
Now, you should receive the following message:
CJmessage::6/6 job(s) submitted
Show me what my CJ messages should look like
CJmessage::initiating package 30236535
CJmessage::parruning [simpleExample.py] on [sherlock2] with:
alloc: --time 48:00:00 --mem 32G
CJmessage::sending from: /Users/shilaan/CJ_install/example/Python
CJmessage::Creating/checking conda venv. This may take a while the first time...
CJmessage::Invoking Python to find range of indices. Please be patient...
Checking command 'python' is available...
python available.
finding range of indices...
Closing Python session!
CJmessage::no SLURM partition specified. CJ is using default partition: long,normal
CJmessage::Creating reproducible script(s) reproduce_simpleExample.py
CJmessage::compressing files to propagate...
CJmessage::sending 2.96 KB to: sherlock2:/scratch/users/shilaan/CJRepo_Remote/simpleExample
CJmessage::extracting package...
CJmessage::Submitting job(s)
CJmessage::6/6 job(s) submitted (24994033-24994039)
To get the results, we first have to reduce the results of our parallel run into a single file. To do this, we have to identify the file that contains our results (in this case, monajemi_results.csv) and include our job ID.
In your terminal, run:
cj reduce monajemi_results.csv 30236535 #change to your own job ID
You’ll be asked if you want to submit the reduce script to the queue via srun. This is recommended for big jobs, but in this case you can simply answer n. Afterwards, you’ll get the following message:
CJmessage::Reducing results done! Use "CJ get 30236535 " to get your results.
Show me what my CJ messages should look like
CJmessage::30236535
CJmessage::Checking progress of runs...
CJmessage::Reducing monajemi_results.csv
CJmessage::Do you want to submit the reduce script to the queue via srun? (recommneded for big jobs) Y/N?
n
system:ssh shilaan@login.sherlock.stanford.edu 'cd /scratch/users/shilaan/CJRepo_Remote/simpleExample/3023653586353cab1f4601074519eec3c34f9346; bash -l cj_collect.sh'
SubPackage 1 Collected (16.67%)
SubPackage 2 Collected (33.33%)
SubPackage 3 Collected (50.00%)
SubPackage 4 Collected (66.67%)
SubPackage 5 Collected (83.33%)
SubPackage 6 Collected (100.00%)
CJmessage::Reducing results done! Use "CJ get 30236535 " to get your results.
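Conceptually, reducing means concatenating the per-job result files into one. The sketch below shows the idea in Python under an assumed layout in which each job writes its own results file into a numbered subfolder (1, 2, ..., 6). That layout and the function names are assumptions for illustration; the real work is done on the cluster by ClusterJob's cj_collect.sh:

```python
import csv
from pathlib import Path


def reduce_results(package_dir, filename, n_jobs):
    """Concatenate the rows of <package_dir>/<k>/<filename> for k = 1..n_jobs."""
    combined = []
    for k in range(1, n_jobs + 1):
        part = Path(package_dir) / str(k) / filename
        if not part.exists():
            continue  # a job that produced no output is skipped
        with open(part, newline="") as handle:
            combined.extend(csv.reader(handle))
    return combined


def write_reduced(rows, out_path):
    """Write the combined rows to a single CSV file."""
    with open(out_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
```

A call such as reduce_results("path/to/package", "monajemi_results.csv", 6) would then return all six rows in job order.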
To retrieve your results, run:
cj get 30236535 #change to your own job ID
Afterwards, you should get a message like this:
CJmessage::Please see your last results in /Users/shilaan/CJ_get_tmp/3023653586353cab1f4601074519eec3c34f9346
Navigate to the folder to see the results of your parallel computation!
Acknowledgements
❤︎ Statistics 285 Teaching Team
❤︎ Mahsa Lotfi and Andrew Donoho, who went above and beyond in helping me get set up
❤︎ The ElastiCluster and Clusterjob Tutorial, written by Mahsa Lotfi
❤︎ The ClusterJob documentation
❤︎ The Sherlock documentation
¹ On Mac, you can view your hidden folders using Command + Shift + .