Running Your Program on Kodiak
Contents
- The Batch System
- Submitting a Job
- Job Output
- Getting Job Info
- Stopping a Job
- More On Submitting Jobs
- Special Cases
- Multiple Job Submissions
The Batch System
Although you will typically edit and compile your programs on the Kodiak login node, you actually run the programs on Kodiak's compute nodes. But there are many compute nodes, possibly running other programs already. So how do you choose which compute node or nodes to run your program on?
You don't, at least, not explicitly. Instead you use Kodiak's batch system, also sometimes called a job scheduler. You tell the batch system the resources, such as how many nodes and processors, that your program requires. The batch system knows what is currently running on all of the compute nodes and will assign unused nodes to your program if they are available and automatically run your program on them. If none are available, your program will wait in a queue until they are. When dealing with the batch system, you will occasionally see the abbreviation "PBS" which stands for "Portable Batch System".
Note: Just to reiterate - Do not run resource intensive programs on the login node. You must use the batch system as described below to run your program on one or more compute nodes.
Submitting a Job
Below is a simple C program, "howdy.c". When we run it on the login node, we can see that the hostname is "login001" as expected. Again, you will not run your program on the login node. This is just an example.
Note: The web-based editor for these pages sometimes implicitly modifies the intended text, so some of the source code may not compile if you copy and paste it from the web page. For example, in a simple program you may see a "#include" line without the header file name. The line is actually trying to include stdio.h, but because the editor translates the escaped less-than and greater-than signs into the actual characters, your browser may interpret the header name as an HTML tag and not display it on the page.
$pwd
/home/bobby/examples/howdy
$ls
howdy.c  howdy.sh
$cat howdy.c
#include <stdio.h>
#include <unistd.h>

int main()
{
    char hostname[80];

    gethostname(hostname, 80);
    printf("Node %s says, \"Howdy!\"\n", hostname);
}
$gcc -o howdy howdy.c
$./howdy
Node login001 says, "Howdy!"
To run your program on Kodiak, you need to submit it to the batch system. This is done with the qsub
command and is known as submitting a job. Don't try to submit the executable program itself. Instead, submit a shell script that runs your program. The shell script can be extremely basic, possibly a single line that runs the program. Or it can be complex, performing multiple tasks in addition to running your program.
$pwd
/home/bobby/examples/howdy
$ls
howdy  howdy.c  howdy.sh
$cat howdy.sh
cd /home/bobby/examples/howdy
./howdy
$qsub howdy.sh
8675309.batch
Notice that the qsub
command above returned the text "8675309.batch". This is known as the job id. You will usually only care about the numeric part. At this point, your program is either running on one of the compute nodes or is in the queue waiting to run.
You can see in the howdy.sh shell script above that we included a line to cd
to change the working directory to the directory where the program is actually located. This is because a submitted job is essentially a new login session, only without the terminal. When it starts, the working directory is your $HOME directory, just as if you had logged in interactively. We changed to the job's directory and ran howdy with the relative path "./howdy", but we could have run the howdy program using its full path instead.
More detailed instructions on submitting jobs can be found below.
Job Output
The "howdy" program normally prints its output on the terminal, i.e., standard output. But when you submit your job it runs on a compute node that doesn't have a terminal. So where does the output go? By default, standard output is saved to a file named ".o". The default job_name is the name of the shell script submitted with qsub. So for our job above, the output file is "howdy.sh.o8675309". There is a similar file, "howdy.sh.e8675309" for standard error output as well. (Hopefully, the error file will be empty...) Note that these files will contain stdout and stderr only. If your program explicitly creates and writes to other data files, or if it uses ">" to redirect standard output to a file, that output will not appear in the job's output file.
$ls
howdy  howdy.c  howdy.sh  howdy.sh.e8675309  howdy.sh.o8675309
$cat howdy.sh.o8675309
Node n005 says, "Howdy!"
Getting Job Info
When you submit a job on Kodiak, it is placed in a queue. If your job's requested resources (nodes and processors) are available, then the job is run on a compute node right away. Because Kodiak runs jobs for many users, it is possible that all of the processors on all of the nodes are currently in use by others. When that happens, your job will wait in the queue until the requested resources are free. The queue is, for the most part, "first come, first served". If two jobs are waiting and both require the same resources, then the job that was submitted earlier will run first.
You can get a list of all of the jobs currently running or waiting to run with the qstat
command.
$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
8675291.batch prog_01a betty 1414:57: R batch
8675293.batch prog_01b betty 1414:57: R batch
8675294.batch prog_02a betty 1228:40: R batch
8675295.batch prog_02b betty 1228:40: R batch
8675296.batch prog_03a betty 0 Q batch
8675297.batch prog_03b betty 0 Q batch
8675301.batch test bubba 00:12:13 R gpu
8675309.batch howdy.sh bobby 0 R batch
You can see each job's job id, name, user, time used, job state, and queue. (There are actually multiple queues on Kodiak. The default queue is named "batch" and is the one you will usually use.) The "S" column shows the job's current state. An "R" means the job is running; "Q" means it is queued and waiting to run. Other state values that you might see are E (exiting), C (complete), and H (held).
You can display just your jobs by adding a -u
option. You can also display the nodes that a job is using with the -n
option. You can see below that your job is running on node n005.
$ qstat -n -u bobby
batch:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
8675309.batch bobby batch howdy.sh 139656 1 1 -- 1488: R 00:00
n005/1
Stopping a Job
If, for some reason, you want to stop your job, you can do so with the qdel
command. If your job is currently running, qdel
will terminate it on the compute node(s). Otherwise it will simply remove it from the queue. Specify the job's job id without the ".batch" bit. You can only qdel
your own jobs.
$ qdel 8675309
More On Submitting Jobs
Above, we submitted a minimal shell script with qsub
. By default, qsub
will allocate one processor on one node to the job. If you are submitting a parallel/MPI or multi-threaded program, you would need to specify more than that. Below is the parallel/MPI version of the "howdy" program, called "howdies", compiled with mpicc
. In addition to printing out the hostname, it prints out the MPI "rank" and "size".
Note: The mpicc
and mpiexec
(see below) commands are enabled when you load an MPI environment module. The examples in this document will use OpenMPI, but there are several different implementations of MPI available.
$cat howdies.c
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    char hostname[80];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    gethostname(hostname, 80);
    printf("Howdy! This is rank %d of %d running on %s.\n", rank, size, hostname);

    MPI_Finalize();

    return 0;
}
$module load openmpi-gcc/2.0.1
$mpicc -o howdies howdies.c
To run an MPI program, you don't just run the program itself. Instead you use the mpiexec
command with your program as one of its arguments and it will launch multiple instances of your program for you. You specify how many processes to launch with the -np
option and on which nodes to launch them with the -hostfile
option. The "host file" is a simple text file containing a list of nodes to run on.
$cat nodes.txt
n001
n001
n001
n001
n001
n001
n001
n001
n002
n002
n002
n002
n002
n002
n002
n002
$mpiexec -np 16 -hostfile nodes.txt ./howdies
So we need to put the above "mpiexec" line in a shell script and submit it with qsub
. But what about the host file? How do we know what nodes to place in there? We don't. Recall that the batch system allocates the nodes for our job based on availability. It will also create the host file for us and put the path to that host file in an environment variable, $PBS_NODEFILE, that our submitted script can use. Below is a quick test script that just displays $PBS_NODEFILE and its contents. Note that the created host file and the $PBS_NODEFILE environment variable only exist during the execution of a submitted job on a compute node so we'll need to qsub
it and look at the results in the job's output file.
$cat test.sh
echo "Host file: ${PBS_NODEFILE} echo "Contents:" cat $PBS_NODEFILE $qsub test.sh
8675309.batch $cat test.sh.o8675309
Host file: /var/spool/pbs/aux/8675309.batch Contents: n005 n005 n005 n005 n005 n005 n005 n005 n012 n012 n012 n012 n012 n012 n012 n012
We see that there are 16 lines in the host file, eight n005s and eight n012s. So when we call mpiexec
with an option -np 16
, eight processes will launch on n005 and eight will launch on n012. Recall that earlier we stated that the default behavior of the batch system is to allocate one processor on one node. So how did the batch system know to put 16 entries in the file instead of just 1 entry? Because this document cheated. The qsub
command above would not actually create the host file that was displayed. We also need to tell qsub
how many nodes we want as well as the number of processors on each node. We do this with the -l nodes=N:ppn=P
option. (That's a lower case L and not a one, and stands for "resource_list".)
With the -l
option, you are actually requesting the number of processors per node. MPI programs typically run one process per processor so it is often convenient to think of "ppn" as "processes per node" instead of "processors per node". But that's not entirely accurate, and not always the case, even with MPI programs.
The -l
option used to create the above host file was actually:
qsub -l nodes=2:ppn=8
which is 2 nodes x 8 processors per node = 16 total processors.
Recall that earlier in this document, our submitted shell script had a cd
command to change the working directory from $HOME to the directory where we ran qsub. The batch system sets an environment variable, $PBS_O_WORKDIR, to the directory from which the job was submitted, so we just need to cd $PBS_O_WORKDIR
in our script and we won't need to hard-code full paths to our program or data files.
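For example, the earlier howdy.sh could use $PBS_O_WORKDIR instead of the hard-coded path (a minimal sketch of the same script shown above):

#!/bin/bash
# Change to the directory the job was submitted from,
# rather than hard-coding /home/bobby/examples/howdy.
cd $PBS_O_WORKDIR
./howdy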
So now back to our MPI "howdies" program...
In addition to specifying (with module load
) which version of MPI to use when compiling our program, we will also need to specify the same version of MPI when we run it. This allows the system to find the correct MPI libraries and mpiexec
command. To do this, add the appropriate module load
command at the top of the script. It is probably a good idea to module purge
any other loaded modules first just to make sure there are no conflicts with modules that may have been loaded in your .bashrc or .bash_profile login scripts.
For this example, we want to run it with 8 processes, but only 4 per node. We will need to change the "mpiexec -np" option to run 8 MPI processes instead of 16.
$pwd
/home/bobby/examples/howdies
$ls
howdies  howdies.c  howdies.sh
$cat howdies.sh
#!/bin/bash

module purge
module load openmpi-gcc/2.0.1

cd $PBS_O_WORKDIR

echo "Job working directory:" $PBS_O_WORKDIR
echo

mpiexec -np 8 -hostfile $PBS_NODEFILE ./howdies
$qsub -l nodes=2:ppn=4 howdies.sh
8675309.batch
$ls
howdies  howdies.c  howdies.sh  howdies.sh.e8675309  howdies.sh.o8675309
$cat howdies.sh.o8675309
Job working directory: /home/bobby/examples/howdies

Howdy! This is rank 5 of 8 running on n027
Howdy! This is rank 1 of 8 running on n009
Howdy! This is rank 2 of 8 running on n009
Howdy! This is rank 3 of 8 running on n009
Howdy! This is rank 0 of 8 running on n009
Howdy! This is rank 7 of 8 running on n027
Howdy! This is rank 6 of 8 running on n027
Howdy! This is rank 4 of 8 running on n027
Note: You can see in the output file above that the printed statements are in no particular order.
Now let's run it but with only 4 processes (i.e., qsub -l nodes=2:ppn=2
). Again, we need to remember to modify the script and change the mpiexec -np
option. It would be useful if we could somehow calculate the total number of processes from the value(s) in qsub
's -l
option. Remember that the host file ($PBS_NODEFILE) lists nodes, 1 per processor. The total number of processes is just the number of lines in that file. We can use the following code:
cat $PBS_NODEFILE | wc -l
to return the number of lines. By using "command substitution", i.e., placing that code within "backticks" (the ` character) or within "$( xxx )", we can assign the result to a variable in our shell script and use that with mpiexec
.
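For example, a sketch of both forms of command substitution (they are equivalent in bash):

# Count the lines in the host file and store the result in a variable.
num=`cat $PBS_NODEFILE | wc -l`
# The same thing, using the $( ) form instead of backticks.
num=$(cat $PBS_NODEFILE | wc -l)
echo "Total processes: $num"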
Note: In addition to just being a useful shortcut, calculating the number of processes dynamically has another benefit. It will help you avoid "oversubscription", where your job is using more resources than are allocated to it. If you request 8 processors but you forget to update the script which is still trying to run on 32 processors, it may cause your job to run much slower. More importantly, it may cause other users' jobs to run much slower.
Another useful thing is to use the uniq
or sort -u
commands with the $PBS_NODEFILE. This will collapse the repeated lines in the file, essentially giving you a list of the nodes your job is running on. Normally, it shouldn't matter which node(s) your job runs on. But occasionally, something goes wrong on the system, and knowing which compute node(s) the job was running on makes tracking down issues much easier.
$cat howdies.sh
#!/bin/bash

module purge
module load openmpi-gcc/2.0.1

cd $PBS_O_WORKDIR

echo "Job working directory: $PBS_O_WORKDIR"
echo

num=`cat $PBS_NODEFILE | wc -l`
echo "Total processes: $num"
echo "Node(s):"
uniq $PBS_NODEFILE
echo

mpiexec -n $num -hostfile $PBS_NODEFILE ./howdies
$qsub -l nodes=2:ppn=2 howdies.sh
8675309.batch
$cat howdies.sh.o8675309
Job working directory: /home/bobby/examples/howdies

Total processes: 4
Node(s):
n006
n013

Howdy! This is rank 0 of 4 running on n006.
Howdy! This is rank 1 of 4 running on n006.
Howdy! This is rank 2 of 4 running on n013.
Howdy! This is rank 3 of 4 running on n013.
Below are descriptions of several other options that can be included with the qsub
command. A convenient feature of the batch system is the ability to include qsub options inside your script rather than having to type them at the command line every time. These are called "PBS directives". To do this, add lines that have the format "#PBS <option>" at the top of your script, after any "#!/bin/sh" line but before any other commands. For example, if your script will always run with "-l nodes=1:ppn=16" then add:
#PBS -l nodes=1:ppn=16
to the top of your shell script. Then just submit it with qsub
without the "-l" command line option. You can use multiple PBS directives in your script. From now on, this document will often include PBS directives in the sample scripts instead of qsub command line options.
You can have the batch system send you an email when the job begins, ends and/or aborts with the -m
and -M
options:
-m bea -M Bobby_Baylor@baylor.edu
With the qsub -N
option, you can specify a job name other than the name of the submitted shell script. The job name is shown in the output listing of the qstat
command and is also used for the first part of the output and error file names.
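For example (the job name and the job id shown here are hypothetical):

$qsub -N my_howdy howdy.sh
8675310.batch
$ls my_howdy.*
my_howdy.e8675310  my_howdy.o8675310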
If you are submitting jobs often, you may find yourself overwhelmed by the number of output files (jobname.o#####) and error files (jobname.e#####) in your directory. Some useful qsub
options are -o
and -e
which allow you to specify the job's output and error file names explicitly rather than using the job name and job id. Subsequent job submissions will replace the contents of these files.
$qsub -l nodes=1:ppn=2 -o howdies.out -e howdies.err howdies.sh
1748946.n131.localdomain
$ls
howdies  howdies.c  howdies.err  howdies.out  howdies.sh  howdy  howdy.c  howdy.sh  test.sh
$cat howdies.out
Job working directory: /home/bobby/howdy

Total processes: 2
Node(s):
n119

Howdy! This is rank 0 of 2 running on n119
Howdy! This is rank 1 of 2 running on n119
As mentioned above, if your qsub
options don't change between job submissions, you don't have to type them every time you run qsub
. Instead, you can add PBS directives to your shell script. These are special comment lines that appear immediately after the "#!" shebang line. Each directive line starts with "#PBS" followed by a qsub
option and any arguments.
$cat howdies.sh
#!/bin/sh

#PBS -l nodes=1:ppn=4
#PBS -o howdies.out
#PBS -e howdies.err
#PBS -N howdies
#PBS -m be -M Bobby_Baylor@baylor.edu

module purge
module load mvapich2/1.9-gcc-4.9.2

echo "------------------"
echo
echo "Job working directory: $PBS_O_WORKDIR"
echo

num=`cat $PBS_NODEFILE | wc -l`
echo "Total processes: $num"
echo

echo "Job starting at `date`"
echo

cd $PBS_O_WORKDIR
mpiexec -n $num -machinefile $PBS_NODEFILE ./howdies

echo
echo "Job finished at `date`"
$qsub howdies.sh
1748952.n131.localdomain
$cat howdies.out
Job working directory: /home/bobby/howdy

Total processes: 2

Howdy! This is rank 0 of 2 running on n119
Howdy! This is rank 1 of 2 running on n119
------------------

Job working directory: /home/bobby/howdy

Total processes: 4

Nodes:

Job starting at Fri Dec 6 12:23:49 CST 2013

Howdy! This is rank 0 of 4 running on n075
Howdy! This is rank 1 of 4 running on n075
Howdy! This is rank 2 of 4 running on n075
Howdy! This is rank 3 of 4 running on n075

Job finished at Fri Dec 6 12:23:51 CST 2013
The batch system's queue is more or less FIFO, "first in, first out", so it's possible that your job may be waiting behind another user's queued job. But if the other user's job is requesting multiple nodes/processors, some nodes will remain unused and idle while waiting on other nodes to free up so that the job can run. The batch system tries to be clever: if it knows that your job can run and finish before then, rather than force your job to wait, it will allocate an idle node to it and let it run. The default time limit for jobs is 5000 hours, which is considered "infinite" by the batch system. You can specify a much shorter time for your job by adding a "walltime=hh:mm:ss" limit to the qsub
command's -l
option.
qsub -l nodes=1:ppn=1,walltime=00:30:00 howdy.sh
The job may not run immediately, but after a set amount of time, it should run before the other user's "big" job. Be aware that the specified walltime is a hard limit. If your job actually runs longer than the specified walltime, it will be terminated.
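The same limit can also be set with a PBS directive inside the script, for example (the values here are only illustrative):

#PBS -l nodes=1:ppn=1,walltime=00:30:00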
In the "Getting Job Info" section above, we saw that you can see the nodes that a job is running on by adding a -n
option to qstat
. If you added the uniq $PBS_NODEFILE
command to your shell script, the nodes will be listed at the top of the job's output file. If necessary, you could log in to those nodes with ssh
, and run top
or ps
to see the status of the processes running on the node. You could use ssh
to run ps
on the node without actually logging in.
[bobby@login001 howdy]$qsub howdies.sh
1754935.n131.localdomain
[bobby@login001 howdy]$qstat -u bobby -n

n131.localdomain:
                                                                  Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------ --- ---  ------ ----- - -----
1754935.n131.loc     bobby    batch    howdies     23006   1   1      -- 5000: R    --
n010/0
[bobby@n130 howdy]$ssh n010
[bobby@n010 ~]$ps u -u bobby
USER       PID %CPU %MEM    VSZ  RSS TTY    STAT START   TIME COMMAND
bobby    23006  0.0  0.0  65924 1336 ?      Ss   10:52   0:00 -bash
bobby    23065  0.0  0.0  64348 1152 ?      S    10:52   0:00 /bin/bash /var/spool/torque/...
bobby    23072  0.2  0.0  82484 3368 ?      S    10:52   0:00 mpiexec -n 1 -machinefile /var/spool/torque/aux//1754935.n131.local
bobby    23073  0.3  0.0 153784 7260 ?      SLl  10:52   0:00 ./howdies
bobby    23078  0.0  0.0  86896 1708 ?      S    10:52   0:00 sshd: bobby@pts/1
bobby    23079  0.2  0.0  66196 1612 pts/1  Ss   10:52   0:00 -bash
bobby    23139  0.0  0.0  65592  976 pts/1  R+   10:52   0:00 ps u -u bobby
[bobby@n010 ~]$exit
[bobby@n130 howdy]$
[bobby@n130 howdy]$ssh n010 ps u -u bobby
USER       PID %CPU %MEM    VSZ  RSS TTY    STAT START   TIME COMMAND
bobby    23006  0.0  0.0  65924 1336 ?      Ss   10:52   0:00 -bash
bobby    23065  0.0  0.0  64348 1152 ?      S    10:52   0:00 /bin/bash /var/spool/torque/...
bobby    23072  0.0  0.0  82484 3368 ?      S    10:52   0:00 mpiexec -n 1 -machinefile /var/spool/torque/aux//1754935.n131.localdomain ./howdies
bobby    23073  0.1  0.0 153784 7260 ?      SLl  10:52   0:00 ./howdies
bobby    23146  0.0  0.0  86896 1672 ?      S    10:53   0:00 sshd: bobby@notty
bobby    23147  2.0  0.0  65592  972 ?      Rs   10:53   0:00 ps u -u bobby
This was easy for a trivial, one-process job. If your job is running on several nodes, it can be a hassle. A useful trick is to add the following function to your .bashrc startup script:
function myjobs()
{
if [ -f "$1" ]
then
for i in `cat "$1"`
do
echo "-----"
echo "Node $i:"
echo
ssh $i ps u -u $LOGNAME
done
fi
}
You could also modify the code above to be a standalone shell script if you wanted. Now within your shell script that you submit with qsub
, add the following code:
# Parse the Job Id (12345.n131.localdomain) to just get the number part
jobid_num=${PBS_JOBID%%.*}
nodelist=$PBS_O_WORKDIR/nodes.$jobid_num
sort -u $PBS_NODEFILE > $nodelist
What the above code does is parse the job id (environment variable $PBS_JOBID) to strip off the ".n131.localdomain" part and then create a file, "nodes.12345", that contains a list of nodes that the job is running on. Once the job has completed, the nodes.12345 file is no longer useful. You could rm $nodelist
at the bottom of the shell script if you wanted.
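Putting it together, the relevant lines of a submitted script might look like the following sketch (the rm at the end is optional):

# Parse the job id to get just the number part and build the node list file
jobid_num=${PBS_JOBID%%.*}
nodelist=$PBS_O_WORKDIR/nodes.$jobid_num
sort -u $PBS_NODEFILE > $nodelist

# Run the program as usual
mpiexec -n `cat $PBS_NODEFILE | wc -l` -machinefile $PBS_NODEFILE ./howdies

# Remove the node list once the job is finished
rm $nodelist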
$qsub howdies.sh
1754949.n131.localdomain
$ls nodes*
nodes.1754949
$myjobs nodes.1754949
-----
Node n022:

USER       PID %CPU %MEM    VSZ  RSS TTY  STAT START   TIME COMMAND
bobby    26684  0.0  0.0  73080 3108 ?    Ss   14:01   0:00 orted -mca ess tm...
bobby    26685  0.2  0.0 221420 7992 ?    SLl  14:01   0:00 ./howdies
bobby    26686  0.2  0.0 221420 7980 ?    SLl  14:01   0:00 ./howdies
bobby    26687  0.2  0.0 221420 7980 ?    SLl  14:01   0:00 ./howdies
bobby    26688  0.1  0.0 221420 7988 ?    SLl  14:01   0:00 ./howdies
bobby    26702  0.0  0.0  86004 1668 ?    S    14:02   0:00 sshd: bobby@notty
bobby    26703  0.0  0.0  65584  976 ?    Rs   14:02   0:00 ps u -u bobby
-----
Node n023:

USER       PID %CPU %MEM    VSZ  RSS TTY  STAT START   TIME COMMAND
bobby    22127  0.1  0.0  65908 1332 ?    Ss   14:01   0:00 -bash
bobby    22128  0.0  0.0  13284  804 ?    S    14:01   0:00 pbs_demux
bobby    22183  0.0  0.0  64332 1160 ?    S    14:01   0:00 /bin/bash /var/spool/...
bobby    22190  0.1  0.0  83004 3956 ?    S    14:01   0:00 mpiexec -n 8 -machinefile /var/spool/torque/aux//1754949.n131.localdomain ./howdies
bobby    22191  0.2  0.0 221420 7988 ?    SLl  14:01   0:00 ./howdies
bobby    22192  0.2  0.0 221420 7976 ?    SLl  14:01   0:00 ./howdies
bobby    22193  0.1  0.0 221420 7984 ?    SLl  14:01   0:00 ./howdies
bobby    22194  0.1  0.0 221420 7980 ?    SLl  14:01   0:00 ./howdies
bobby    22205  0.0  0.0  86004 1668 ?    S    14:02   0:00 sshd: bobby@notty
bobby    22206  2.0  0.0  65576  976 ?    Rs   14:02   0:00 ps u -u bobby
Special Cases
Exclusive Access
Because Kodiak is a multi-user system, it is possible that your job will share a compute node with another user's job if your job doesn't explicitly request all of the processors on the node. But what if your program is very memory intensive and you need to run fewer processes on a node? You can specify -l nodes=4:ppn=2
and run just 2 processes each on 4 different nodes (8 processes total). Unfortunately, although your job will only run 2 processes on each node, the batch system sees that there are still 6 unused processors on the nodes, and can assign them to other, possibly memory intensive, jobs. This defeats the purpose of specifying fewer processors per node.
Instead, to get exclusive access to a node, you need to force the batch system to allocate all 8 processors on the node by specifying ppn=8. If we specify -l nodes=2:ppn=8
, we get a $PBS_NODEFILE that looks like the following:
n001
n001
n001
n001
n001
n001
n001
n001
n002
n002
n002
n002
n002
n002
n002
n002
Calling mpiexec -np 16 -machinefile $PBS_NODEFILE ...
, the first process (rank 0) will run on the first host listed in the file (n001), the second process (rank 1) will run on the second host listed (also n001), etc. The ninth process (rank 8) will run on the ninth host listed (now n002), etc. If there are fewer lines in the machinefile than the specified number of processes (-np value), mpiexec will start back at the beginning of the list. If the machine file looked like the following:
n001
n002
The first process (rank 0) will run on host n001, the second (rank 1) will run on host n002, the third (rank 2) will run on host n001, the fourth (rank 3) will run on host n002, and so on. So all we have to do is call
mpiexec -np 8 -machinefile $PBS_NEW_NODEFILE ./howdies
and 4 processes will run on n001 and 4 will run on n002. But because the batch system has reserved all 8 processors on n001 and n002 for our job, no other jobs will be running on the nodes. Our job will have exclusive access, which is what we want. So how do we turn the original $PBS_NODEFILE, created by the batch system, into our trimmed down version? One simple way would be to use the sort -u
command to sort the file "uniquely", thus keeping one entry for each host.
PBS_NEW_NODEFILE=$PBS_O_WORKDIR/trimmed_machinefile.dat
sort -u $PBS_NODEFILE > $PBS_NEW_NODEFILE
mpiexec -np 8 -machinefile $PBS_NEW_NODEFILE ./howdies
Because modifying the $PBS_NODEFILE itself will cause problems with the batch system, you should always create a new machine file and use it instead.
Multi-threaded Programs
If your program is multi-threaded, either explicitly by using OpenMP or POSIX threads, or implicitly by using a threaded library such as the Intel Math Kernel Library (MKL), be careful not to have too many threads executing concurrently. You will need to request the appropriate number of processors per node (ppn) but launch fewer processes, or just one. This is similar to the method described in the "exclusive access" section above. Because each thread runs on a separate processor, you need to tell the batch system how many processors to allocate to your process.
By default, OpenMP and MKL will use all of the processors on a node (currently 8). If, for some reason, you want to decrease the number of threads, you can do so by setting environment variables within the shell script that you submit via qsub
. The threaded MKL library uses OpenMP internally for threading, so you can probably get by with modifying just the OpenMP environment variable.
OMP_NUM_THREADS=4
export OMP_NUM_THREADS
MKL_NUM_THREADS=4
export MKL_NUM_THREADS
You can also set the number of threads at runtime within your code.
// C/C++
#include "mkl_service.h"
mkl_set_num_threads(4);
! Fortran
use mkl_service
call mkl_set_num_threads(4)
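For a plain OpenMP program (such as the "thready" example below), the equivalent runtime call is omp_set_num_threads(); a minimal C sketch:

// C/C++ (OpenMP)
#include <omp.h>

omp_set_num_threads(4);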
Below is a trivial, multi-threaded (OpenMP) program, "thready". All it does is display the number of threads that will be used and initializes the values of an array. Then, in the parallel region, each thread modifies a specific value of the array and then sleeps for two minutes. After exiting the parallel region, we display the new (hopefully correct) values of the array. The program is compiled with the Intel C compiler icc
with the -openmp
option.
$cat thready.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

#define MAX 8

int main(int argc, char **argv)
{
    int i, arr[MAX];
    int num;

    num = omp_get_max_threads();
    printf("Howdy! We're about to split into %d threads...\n", num);

    /* Initialize the array values */
    for (i = 0; i < MAX; i++)
        arr[i] = -1;

    printf("Before:\n");
    for (i = 0; i < MAX; i++)
        printf("arr[%d] = %d\n", i, arr[i]);

    /* Each thread sets one element of the array and then sleeps for two minutes */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        arr[id] = id;
        sleep(120);
    }

    printf("After:\n");
    for (i = 0; i < MAX; i++)
        printf("arr[%d] = %d\n", i, arr[i]);

    return 0;
}
$icc -openmp -o thready thready.c
$./thready
Howdy! We're about to split into 4 threads...
Before:
arr[0] = -1
arr[1] = -1
arr[2] = -1
arr[3] = -1
arr[4] = -1
arr[5] = -1
arr[6] = -1
arr[7] = -1
After:
arr[0] = 0
arr[1] = 1
arr[2] = 2
arr[3] = 3
arr[4] = -1
arr[5] = -1
arr[6] = -1
arr[7] = -1
Next we need to submit the thready program. For this example, we want it to use 4 threads so we call qsub
with a -l nodes=1:ppn=4
option to reserve 4 processors for our threads.
$cat thready.sh
#!/bin/bash

cd $PBS_O_WORKDIR

echo "Node(s):"
sort -u $PBS_NODEFILE
echo

export OMP_NUM_THREADS=4
echo "OMP_NUM_THREADS: $OMP_NUM_THREADS"
echo

echo "Job starting at `date`"
echo

./thready

echo
echo "Job finished at `date`"
$qsub -l nodes=1:ppn=4 thready.sh
1755333.n131.localdomain
$tail thready.sh.o1755333
Node(s):
n005

OMP_NUM_THREADS: 4

Job starting at Wed Dec 18 11:35:32 CST 2013
Now let's make sure that the process and threads running on the compute node are what we expect. We can see that the job is running on node n005. If we run the ps -f -C thready
command on node n005, we can see that only one instance of the thready program is running. The only useful information is that its PID (process ID) is 27934. But if we add the -L
option to the ps
command, we can also get information about the threads.
$ssh n005 ps -f -C thready
UID        PID  PPID  C STIME TTY      TIME     CMD
bobby    27934 27931  0 10:22 ?        00:00:00 ./thready
$ssh n005 ps -f -L -C thready
UID        PID  PPID   LWP  C NLWP STIME TTY      TIME     CMD
bobby    27934 27931 27934  0    5 10:22 ?        00:00:00 ./thready
bobby    27934 27931 27935  0    5 10:22 ?        00:00:00 ./thready
bobby    27934 27931 27936  0    5 10:22 ?        00:00:00 ./thready
bobby    27934 27931 27937  0    5 10:22 ?        00:00:00 ./thready
bobby    27934 27931 27938  0    5 10:22 ?        00:00:00 ./thready
The LWP column above is the "lightweight process" ID (i.e., thread ID) and we can see that the thready process (PID 27934) has a total of 5 threads. The NLWP (number of lightweight processes) column confirms it. But why are there 5 threads instead of the 4 that were specified? Notice that the first thread has the same thread ID as its process ID. That is the master or main thread, which is created when the program first starts and will exist until the program exits. When the program reaches the parallel region, the other 4 threads are created and do their work while the main thread waits. At that point, there are 5 threads total. When the program finishes with the parallel region, the 4 threads are terminated and the master thread continues.
Interactive Sessions
Occasionally, you may need to log in to a compute node. Perhaps you want to run top
or ps
to check the status of a job submitted via qsub
. In that case, you can simply ssh
to the compute node. But there may be times where you need to do interactive work that is more compute intensive. Although you could just ssh
to some arbitrary compute node and start working, this is ill-advised because there may be other jobs already running on that node. Not only would those jobs affect your work, your work would affect those jobs.
Instead, you should use the interactive feature of qsub
by including the -I
option. This will use the batch system to allocate a node (or nodes) and a processor (or processors) just like you would for a regular, non-interactive job. The only difference is that once the requested resources are available, you will be automatically logged into the compute node and get a command prompt. The "batch" job will exist until you exit or log out from the interactive session.
[bobby@n130 ~]$qsub -I
qsub: waiting for job 1832724.n131.localdomain to start
qsub: job 1832724.n131.localdomain ready

[bobby@n066 ~]$top
top - 15:47:36 up 102 days, 23:20,  0 users,  load average: 7.50, 7.19, 7.57
Tasks: 289 total,   9 running, 280 sleeping,   0 stopped,   0 zombie
Cpu(s): 55.7%us,  1.0%sy,  0.0%ni, 42.8%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16438792k total,  2395964k used, 14042828k free,   162524k buffers
Swap: 18490804k total,   438080k used, 18052724k free,   454596k cached

  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 6715 betty  25   0 1196m 134m 1168 R 99.0  0.8  9:44.35 prog_01a
 6801 betty  25   0 1196m 134m 1168 R 99.0  0.8  7:56.28 prog_01a
 6832 betty  25   0 1196m 134m 1168 R 99.0  0.8  4:55.46 prog_01a
 6870 betty  25   0 1196m 134m 1164 R 99.0  0.8  1:25.80 prog_01a
 6919 betty  25   0 1196m 134m 1164 R 99.0  0.8  1:04.84 prog_01a
 6923 betty  25   0 1196m 134m 1164 R 99.0  0.8  0:55.40 prog_01a
 7014 betty  25   0 1196m 134m 1164 R 99.0  0.8  0:46.67 prog_01a
 7075 bobby  15   0 12864 1120  712 R  1.9  0.0  0:00.01 top
    1 root   15   0 10344   76   48 S  0.0  0.0  9:44.65 init
    2 root   RT  -5     0    0    0 S  0.0  0.0  0:00.82 migration/0
    3 root   34  19     0    0    0 S  0.0  0.0  0:00.13 ksoftirqd/0
    4 root   RT  -5     0    0    0 S  0.0  0.0  0:00.00 watchdog/0
    5 root   RT  -5     0    0    0 S  0.0  0.0  0:00.59 migration/1
[bobby@n066 ~]$cat $PBS_NODEFILE
n066
[bobby@n066 ~]$exit
logout

qsub: job 1832724.n131.localdomain completed
[bobby@n130 ~]$
If there aren't enough available nodes, your interactive job will wait in the queue like any other job. If you get tired of waiting, press ^C.
$ qsub -I -l nodes=20:ppn=8
qsub: waiting for job 1832986.n131.localdomain to start^C
Do you wish to terminate the job and exit (y|[n])?y
Job 1832986.n131.localdomain is being deleted
If you are logged into Kodiak and have enabled X11 forwarding (i.e., logged in with ssh -X bobby@kodiak.baylor.edu
from a Mac or Linux system, or have enabled the SSH X11 forwarding option in PuTTY for Windows), you can run X11 applications when logged into a compute node interactively. Because X11 from a compute node (or Kodiak's login node) can be slow, you typically won't want to do this. But there may be times when it is required, for example, when you need to run MATLAB interactively, or need to debug a program with a graphical debugger. This assumes you are running an X server (XQuartz for Mac, or Cygwin/X or Xming on Windows) on your desktop system.
First, on the Kodiak login node, make sure that X11 is, in fact, working. The xlogo
and xmessage
commands are a simple way to test this.
$echo $DISPLAY
localhost:10.0
$xlogo &
[1] 9927
$echo 'Howdy from node' `hostname` | xmessage -file - &
[2] 9931
If you are sure that X11 works from the login node, start an interactive session, as above, but add a -X
to tell qsub
that you wish to forward X11 from the compute node.
[bobby@n130 howdy]$qsub -I -X
qsub: waiting for job 2444343.n131.localdomain to start
qsub: job 2444343.n131.localdomain ready

[bobby@n083 ~]$echo 'Howdy from node' `hostname` | xmessage -file -
[bobby@n083 ~]$ exit
qsub: job 2444343.n131.localdomain completed
[bobby@n130 howdy]$
Multiple Job Submissions
There may be times where you want to run, or at least submit, multiple jobs at the same time. For example, you may want to run your program with varying input parameters. You could submit each job individually, but if there are many jobs, this might be problematic.
Job Arrays
If each run of your program has the same resource requirements, that is, the same number of nodes and processors, you can run multiple jobs as a job array. Below is a trivial program that simply outputs a word passed to it at the command line.
$ cat howdy2.c
#include <stdio.h>
int main(int argc, char **argv)
{
if (argc == 2)
{
printf("Howdy! You said, "%s".\n", argv[1]);
}
else
{
printf("Howdy!\n");
}
}
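Compile it just like the earlier serial example (assuming gcc here):

$ gcc -o howdy2 howdy2.c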
What we want to do is run the program three times, each with a different word. To run this as a job array, add the -t
option along with a range of numbers to the qsub
command. For example, qsub -t 0-2 array_howdy.sh
. This command will submit the array_howdy.sh script three times but with an extra environment variable, $PBS_ARRAYID, which will contain a unique value based on the range. You could test the value of $PBS_ARRAYID and run your program with different arguments.
$ cat array_howdy.sh
#!/bin/sh
#PBS -o array.out
#PBS -e array.err
cd $PBS_O_WORKDIR
if [ $PBS_ARRAYID == 0 ]
then
./howdy2 apple
elif [ $PBS_ARRAYID == 1 ]
then
./howdy2 banana
elif [ $PBS_ARRAYID == 2 ]
then
./howdy2 carrot
fi
For a more complex real-world program, you might have multiple input files named howdy_0.dat through howdy_N.dat. Instead of an unwieldy if-then-elif-elif-elif-etc construct, you could use $PBS_ARRAYID to specify an individual input file and then use a single command to launch your program.
$ls input
howdy_0.dat  howdy_1.dat  howdy_2.dat
$cat input/howdy_0.dat
apple
$cat input/howdy_1.dat
banana
$cat input/howdy_2.dat
carrot
$cat array_howdy.sh
#!/bin/sh

#PBS -o array.out
#PBS -e array.err

cd $PBS_O_WORKDIR

DATA=`cat ./input/howdy_$PBS_ARRAYID.dat`
./howdy2 $DATA
When you submit your job as a job array, you will see a slightly different job id, one with "[]" appended to it. This job id represents all of the jobs in the job array. For individual jobs, specify the array id within the brackets (e.g., 123456[0]). Normally, qstat
will display one entry for the job array. To see all of them, add a -t
option.
The job array's output and error files will be named jobname.o-#, where # is replaced by the individual job array id numbers.
$qsub -t 0-2 array_howdy.sh
2462930[].n131.localdomain
$qstat -u bobby

n131.localdomain:
Job ID               Username    Queue    Jobname          ...
-------------------- ----------- -------- ---------------- ...
2462930[].n131.l     bobby       batch    array_howdy.sh
$qstat -t -u bobby

n131.localdomain:
Job ID               Username    Queue    Jobname          ...
-------------------- ----------- -------- ---------------- ...
2462930[0].n131.     bobby       batch    array_howdy.sh-1
2462930[1].n131.     bobby       batch    array_howdy.sh-2
2462930[2].n131.     bobby       batch    array_howdy.sh-3
$ls array.out*
array.out-0  array.out-1  array.out-2
$cat array.out-0
Howdy! You said, "apple".
$cat array.out-1
Howdy! You said, "banana".
$cat array.out-2
Howdy! You said, "carrot".
Sequential, Non-concurrent Jobs
When you submit a job array as above, all of the individual jobs will run as soon as possible. But there may be times where you don't want them all to run concurrently. For example, the output of one job might be needed as the input for the next. Or perhaps you want this program to run, but not at the expense of some other, higher priority program you also need to run. Whatever the reason, you can limit the number of jobs that can run simultaneously by adding a slot limit to the -t
option. For example, qsub -t 0-9%1
will create a job array with 10 jobs (0 through 9) but only 1 will run at a time. The others will wait in the queue until the one that is running finishes. When running qstat
, you can see that the status of each non-running job is H (held) as opposed to Q (queued).
$qsub -t 0-2%1 array_howdy.sh
2463526[].n131.localdomain
$qstat -t -u bobby

n131.localdomain:
Job ID               Username    Queue    Jobname          ... S Time
-------------------- ----------- -------- ---------------- ... - -----
2463526[0].n131.     bobby       batch    array_howdy.sh-0     R    --
2463526[1].n131.     bobby       batch    array_howdy.sh-1     H    --
2463526[2].n131.     bobby       batch    array_howdy.sh-2     H    --
This works if you can run your job as a job array. But what if it can't? For example, you may want to submit multiple jobs but if each requires a different number of nodes or processors then a job array won't work. You will need to qsub
the jobs individually. You might be tempted to add a qsub
command at the end of your submitted script. However, this will not work because the script runs on the compute nodes but the qsub
command only works on Kodiak's login node. Instead, to have a job wait on another job to finish before starting, you add a -W
("additional attributes") option to the qsub
to specify a job dependency for the new job. The general format for the -W
option is:
qsub -W attr_name=attr_list ...
There are several possible attr_names that can be specified as additional attributes. In this case, the attribute name we want to use is "depend", and the attribute list is the type of dependency ("after" or "afterok") followed by the depend argument (the id of the job we are waiting on). The difference between after and afterok is that with the former, the new job will start when the first one finishes, no matter what the reason; with the latter, the new job will start only if the first one finished successfully and returned a 0 status. So the full option will be:
qsub -W depend=after:job_id ...
For the job id, you can use either the full job id (e.g., 123456.n131.localdomain) or just the number part.
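Because qsub prints the new job's id to standard output, you can also capture it in a shell variable and chain submissions from the login node; a sketch (the resource values are only examples):

# Submit the first job and capture its job id
first=$(qsub -l nodes=1:ppn=8 howdies.sh)

# The second job will not start until the first one finishes successfully
qsub -l nodes=1:ppn=4 -W depend=afterok:$first howdies.sh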
$cat howdies.sh
#!/bin/bash

#PBS -o howdies.out
#PBS -e howdies.err
#PBS -N howdies

module purge
module load mvapich2/1.9-gcc-4.9.2

cd $PBS_O_WORKDIR

echo "------------------"
echo
echo "Job id: $PBS_JOBID"
echo

num=`cat $PBS_NODEFILE | wc -l`
echo "Total processes: $num"
echo

echo "Job starting at `date`"
echo

mpiexec -n $num -machinefile $PBS_NODEFILE ./howdies

echo
echo "Job finished at `date`"
echo
$qsub -l nodes=1:ppn=8 howdies.sh
2463553.n131.localdomain
$qsub -l nodes=1:ppn=4 -W depend=afterok:2463553.n131.localdomain howdies.sh
2463555.n131.localdomain
$qsub -l nodes=1:ppn=2 -W depend=afterok:2463555.n131.localdomain howdies.sh
2463556.n131.localdomain
$qsub -l nodes=1:ppn=1 -W depend=afterok:2463556.n131.localdomain howdies.sh
2463557.n131.localdomain
$qstat -u bobby
n131.localdomain:
                                                         Req'd  Req'd   Elap
Job ID            Username  Queue    Jobname   SessID NDS TSK Memory Time  S Time
----------------- --------- -------- --------- ------ --- --- ------ ----- - -----
2463553.n131.loc  bobby     batch    howdies    10460   1   8     -- 5000: R 00:00
2463555.n131.loc  bobby     batch    howdies       --   1   4     -- 5000: H    --
2463556.n131.loc  bobby     batch    howdies       --   1   2     -- 5000: H    --
2463557.n131.loc  bobby     batch    howdies       --   1   1     -- 5000: H    --
$qstat -u bobby

n131.localdomain:
                                                         Req'd  Req'd   Elap
Job ID            Username  Queue    Jobname   SessID NDS TSK Memory Time  S Time
----------------- --------- -------- --------- ------ --- --- ------ ----- - -----
2463555.n131.loc  bobby     batch    howdies    15434   1   4     -- 5000: R 00:00
2463556.n131.loc  bobby     batch    howdies       --   1   2     -- 5000: H    --
2463557.n131.loc  bobby     batch    howdies       --   1   1     -- 5000: H    --
$qstat -u bobby

n131.localdomain:
                                                         Req'd  Req'd   Elap
Job ID            Username  Queue    Jobname   SessID NDS TSK Memory Time  S Time
----------------- --------- -------- --------- ------ --- --- ------ ----- - -----
2463556.n131.loc  bobby     batch    howdies    15525   1   2     -- 5000: R 00:01
2463557.n131.loc  bobby     batch    howdies       --   1   1     -- 5000: H    --
$qstat -u bobby

n131.localdomain:
                                                         Req'd  Req'd   Elap
Job ID            Username  Queue    Jobname   SessID NDS TSK Memory Time  S Time
----------------- --------- -------- --------- ------ --- --- ------ ----- - -----
2463557.n131.loc  bobby     batch    howdies    15613   1   1     -- 5000: R 00:04
$cat howdies.out
------------------

Job id: 2463553.n131.localdomain

Total processes: 8

Job starting at Thu Jan 30 13:48:41 CST 2014

Howdy! This is rank 7 of 8 running on n017
Howdy! This is rank 1 of 8 running on n017
Howdy! This is rank 5 of 8 running on n017
Howdy! This is rank 4 of 8 running on n017
Howdy! This is rank 6 of 8 running on n017
Howdy! This is rank 2 of 8 running on n017
Howdy! This is rank 3 of 8 running on n017
Howdy! This is rank 0 of 8 running on n017

Job finished at Thu Jan 30 13:53:43 CST 2014

------------------

Job id: 2463555.n131.localdomain

Total processes: 4

Job starting at Thu Jan 30 13:53:44 CST 2014

Howdy! This is rank 3 of 4 running on n082
Howdy! This is rank 0 of 4 running on n082
Howdy! This is rank 1 of 4 running on n082
Howdy! This is rank 2 of 4 running on n082

Job finished at Thu Jan 30 13:58:46 CST 2014

------------------

Job id: 2463556.n131.localdomain

Total processes: 2

Job starting at Thu Jan 30 13:58:47 CST 2014

Howdy! This is rank 1 of 2 running on n082
Howdy! This is rank 0 of 2 running on n082

Job finished at Thu Jan 30 14:03:49 CST 2014

------------------

Job id: 2463557.n131.localdomain

Total processes: 1

Job starting at Thu Jan 30 14:03:50 CST 2014

Howdy! This is rank 0 of 1 running on n082

Job finished at Thu Jan 30 14:08:52 CST 2014