ACM group: SHC Cluster info page

The relevant CACR help page for the SHC is here; it covers more topics than this page, but in less detail. This page is intended for first-time users of a large cluster.


Latest updates

Aug 2009: the cluster underwent some major changes this month. See the Updates section below for more details.

Oct 17, 2008: to use the graphical version of Matlab (or any other graphical program) on an interactive computing node, use qsub -I -X. If you use just qsub -I, X Window applications won't work.


Getting started

Logging in requires an ssh key; see the ssh-key instructions website. When you log in, you will be on a head node. Do not run computationally intensive programs here! You may run text editors, move files around, etc., but out of consideration for other users, do not run Matlab sessions on the head node.

The cluster runs a Portable Batch System (PBS), which is queueing software that controls when and where users' jobs are run. The software can give priority to certain users; in general, the less computation time a user or group has used, the higher the priority its jobs get. As of August 2008, the ACM group accounts for less than 1% of the computation on the system; however, certain groups automatically have higher priority, and our group is not one of them.

Basic commands on the queue are:
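
On a standard PBS/TORQUE installation the everyday commands are qsub (submit a job), qstat (show queued and running jobs) and qdel (remove a job). The sketch below is only a hedged illustration: it assumes the SHC provides these standard tools, msub.20 is a placeholder job script name, and it does nothing when qsub is not available (for example, on your own machine).

```shell
#!/bin/bash
# Sketch of the usual PBS/TORQUE client commands.
# Assumptions: the standard tools are installed, and "msub.20" is a
# placeholder for a job script that already exists.

JOB_SCRIPT="msub.20"

if command -v qsub >/dev/null 2>&1 && [ -f "$JOB_SCRIPT" ]; then
    JOB_ID=$(qsub "$JOB_SCRIPT")     # submit; qsub prints the new job id
    echo "submitted $JOB_ID"
    qstat -u "$USER"                 # list your own queued and running jobs
    # qdel "$JOB_ID"                 # uncomment to remove the job again
    PBS_STATUS="submitted"
else
    echo "qsub or $JOB_SCRIPT not found; run this on the head node"
    PBS_STATUS="skipped"
fi
```

The qdel line is left commented out so that running the sketch on the cluster does not immediately cancel the job it just submitted.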

In general, it is a good idea to specify the wall-time, which is an estimate of how long the job will take. There is a tradeoff here: if you give a short estimate, the software will prioritize your job and your computation will start sooner; but if the job takes longer than your estimate, it might be cut off. I think an estimate that is too long is harmless, apart from the fact that you are not given extra priority.

Options for qsub may be specified either on the command line or in a script. In the latter case, options follow a very specific format: the line must start with #PBS, followed by the option. (This is similar to how a line like #!/bin/bash dictates which shell runs the script, i.e. # is usually a comment symbol, but not always!) For example, at the beginning of the script you might have lines like

#PBS -N job_name 
#PBS -l walltime=10:30,mem=320kb
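
To make the two ways of passing options concrete, here is a small sketch that writes a job script using the same two directives and then lists the lines PBS would read as options. The file name example_job.sh is a placeholder. As far as I know, walltime=10:30 is read as 10 minutes 30 seconds, and PBS stops looking for #PBS lines at the first executable command, so the directives belong at the top of the script.

```shell
#!/bin/bash
# Write a small job script whose options are given as #PBS lines.
# "example_job.sh" and "job_name" are placeholder names.
cat > example_job.sh << 'EOF'
#!/bin/bash
#PBS -N job_name
#PBS -l walltime=10:30,mem=320kb
# Directives go before the first command; PBS stops reading
# them once it reaches an executable line.
echo "running on $(hostname)"
EOF

# The same request made on the command line instead would be:
#   qsub -N job_name -l walltime=10:30,mem=320kb example_job.sh

# Show which lines PBS would treat as directives.
grep '^#PBS' example_job.sh
```

The quoted 'EOF' keeps the shell from expanding $(hostname) while the file is written, so the command runs later on the compute node rather than on the head node.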


Example: running several Matlab scripts

The cluster requires a little bit of work to submit a computing job. Users normally submit a script, written in a scripting language like bash or tcsh. In general, the script is just a short wrapper that calls an executable (e.g. matlab). You might also have a meta-script that generates and submits several wrapper scripts, each with a different set of parameters. Below is an example, submitJobs.sh, that I wrote; it loops over several values of the variable j and submits one Matlab job for each.

#!/bin/bash

# This is an example of submitting several
# Matlab batch jobs
# Stephen Becker, 8/21/08
# Note: my username is "srbecker", so you
#   should replace this with your own username

RESULTS_DIRECTORY='/home/srbecker/results'

# -- we set up the common part of all the scripts --
cat > prefixScript.sh << EOF
#!/bin/bash

# ask for 1 node, for 1 minutes
#PBS -l nodes=1
#PBS -l walltime=00:01:00
EOF

# -- loop over one (or more) parameters of interest --
for (( j=0; j<=20; j+= 10 )); do
        cp prefixScript.sh msub.$j
        cat >> msub.$j << EOF
# put stdout and stderr where I want them
#PBS -o $RESULTS_DIRECTORY/out.$j
#PBS -e $RESULTS_DIRECTORY/err.$j
EOF

        # -- now, set up the matlab command we want --

        echo "matlab -nodisplay >& $RESULTS_DIRECTORY/log.$j << EOF" >> msub.$j
        echo "disp(primes($j+1))"              >> msub.$j
        echo "h = $j; % set matlab variables"  >> msub.$j
        echo "% we could call a script also, e.g.:"   >> msub.$j
        echo "%scriptName"                     >> msub.$j
        echo "datestr(now)"                    >> msub.$j
        echo "exit"                            >> msub.$j
        echo "EOF"                             >> msub.$j

        # -- and submit it to the batch system --

        qsub msub.$j
        #qsub -q weekend msub.$j

done

The above script creates several "msub" files, one for each value of j (here msub.0, msub.10 and msub.20). For example, the file msub.20 that it created looks like

#!/bin/bash

# ask for 1 node, for 1 minutes
#PBS -l nodes=1
#PBS -l walltime=00:01:00
# put stdout and stderr where I want them
#PBS -o /home/srbecker/results/out.20
#PBS -e /home/srbecker/results/err.20
matlab -nodisplay >& /home/srbecker/results/log.20 << EOF
disp(primes(20+1))
h = 20; % set matlab variables
% we could call a script also, e.g.:
%scriptName
datestr(now)
exit
EOF

By default, Matlab will start in your home directory. To change where it starts, add a line like cd DIRECTORY_NAME before the matlab -nodisplay ... command; for example, you might use cd $RESULTS_DIRECTORY.
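
As a hedged sketch of that change, the snippet below writes a job file in which the cd line comes before the matlab command. The file name msub.cd_example and the results path are placeholders. The commented alternative uses PBS_O_WORKDIR, which standard PBS/TORQUE sets to the directory where qsub was run; I am assuming the SHC does the same. Nothing is submitted here.

```shell
#!/bin/bash
# Sketch: make the job change directory before starting Matlab.
# RESULTS_DIRECTORY and msub.cd_example are placeholder names.
RESULTS_DIRECTORY="$HOME/results"

# Unquoted EOF: $RESULTS_DIRECTORY is expanded now, while \$PBS_O_WORKDIR
# is written literally so it is expanded later, inside the job.
cat > msub.cd_example << EOF
#!/bin/bash
#PBS -l nodes=1
#PBS -l walltime=00:01:00
cd $RESULTS_DIRECTORY || exit 1
# alternative: cd \$PBS_O_WORKDIR
matlab -nodisplay << MATLAB_EOF
disp(pwd)
exit
MATLAB_EOF
EOF

# Show the directory change that ended up in the job file.
grep '^cd ' msub.cd_example
```

The "|| exit 1" makes the job stop early if the directory does not exist, instead of silently running Matlab in the home directory.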

Other modifications. In Bash, you can get the output of commands using the ` quote marks (this is the character above the TAB key and to the left of the 1 key). For example, to set the directory at the beginning of the submitJobs.sh script file, there could be a line like

RESULTS_DIRECTORY=`pwd`

and then the script should be run from the desired directory (since pwd prints the current working directory).
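
Bash also accepts the $(...) form for the same thing, which many people find easier to read and which nests without extra escaping. A minimal sketch, with variable names chosen only for illustration:

```shell
#!/bin/bash
# Two equivalent ways to capture a command's output in Bash.
DIR_BACKTICKS=`pwd`      # backquote form
DIR_DOLLAR=$(pwd)        # $(...) form, same result

echo "$DIR_DOLLAR"

# Nesting example: the name of the current directory without its path.
DIR_NAME=$(basename "$(pwd)")
echo "$DIR_NAME"
```

Either form works for setting RESULTS_DIRECTORY; the quotes around "$(pwd)" matter only if the path contains spaces.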

You can also have the system email you when the job is done (by default, it emails you if the program is aborted, e.g. because it took longer than the requested walltime). Use the option -m e to be emailed when the program ends. You can also use a and b to be emailed when the program aborts and begins, respectively. For example, if we wish to be notified by email when the last job finishes, we could change submitJobs.sh to something like this:

... [ same as submitJobs.sh before ] ...
START=1
END=50
prefix=msub   # file-name prefix, so the files are named msub.$j as before
for (( j=$START; j<=$END; j+= 1 )); do
    cp prefixScript.sh $prefix.$j

    # send mail when the job ends (or is aborted)
    # (but only for last job, so that my inbox
    # isn't flooded!)
    if [ "$j" -eq "$END" ]; then
        echo "#PBS -m ae" >> $prefix.$j
    fi

    ... [ same as submitJobs.sh before ] ...
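
For a version of this idea that can be tried without touching the queue, the sketch below writes three small job files and adds the mail directive only to the last one. The mail_example.* file names and the small END value are my own choices for illustration, and the commented -M line (which sets the recipient in standard PBS) uses a made-up address. Nothing is submitted.

```shell
#!/bin/bash
# Sketch: add the mail directive to the last job file only.
# mail_example.$j is a placeholder name; nothing is submitted.
START=1
END=3
for (( j=START; j<=END; j++ )); do
    {
        echo '#!/bin/bash'
        echo '#PBS -l nodes=1'
        echo '#PBS -l walltime=00:01:00'
        if [ "$j" -eq "$END" ]; then
            echo '#PBS -m ae'        # mail on abort (a) and on end (e)
            # echo '#PBS -M you@example.com'   # hypothetical address
        fi
        echo "echo job $j"
    } > "mail_example.$j"
done

# List the files that contain a mail directive.
grep -l '^#PBS -m' mail_example.*
```

Grouping the echo commands in braces writes each file in one go, and keeps the #PBS lines ahead of the first command.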

Updates

June, 2009. MATLAB upgraded to version R2009a! This is great news.

August, 2009. The software stack is being upgraded and isn't backward compatible, so all MPI-based executables must be rebuilt. Details on rebuilding are here: http://www.cacr.caltech.edu/main/?page_id=440. The target shc configuration by the end of August is 229 nodes: 163 dual-processor/dual-core nodes and 66 dual-processor/quad-core nodes. The new cluster is shc-c, as opposed to shc-a and shc-b.

There are at least 32 quad-core, dual-CPU nodes (i.e. 8 cores in total per node). The PBS directive that specifies your request for a quad-core node is this:

 #PBS -l nodes=4:core8 

which would give you 8 threads. If you ask for more than 8 threads, then the first 8 will be sent to the first node, etc. More examples can be found at this link: http://www.cacr.caltech.edu/main/?page_id=440
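
To check how a request was actually spread over the nodes, a job can inspect the node file that PBS provides. This is a hedged sketch: PBS_NODEFILE is set inside jobs by standard PBS/TORQUE, and I am assuming the SHC behaves the same way. Outside a job the snippet only prints a notice.

```shell
#!/bin/bash
# Sketch: inside a running job, count how many slots each node received.
# Assumption: PBS_NODEFILE is set by the batch system, one line per slot.
if [ -n "$PBS_NODEFILE" ] && [ -f "$PBS_NODEFILE" ]; then
    sort "$PBS_NODEFILE" | uniq -c     # "count hostname" per node
    NODE_REPORT="listed"
else
    echo "PBS_NODEFILE is not set; this only works inside a PBS job"
    NODE_REPORT="not in a job"
fi
```

Placing these lines near the top of a job script puts the node list into the job's stdout file, next to the program's own output.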




page maintained by Stephen Becker
last update 8/18/09