Using Interactive Jobs to Increase Productivity

MSU HPC users often have a work process similar to the following:

1. they test their software or code on a development node with small amounts of data, or short time periods;

2. write a submission (qsub) script to submit to the queue;

3. let the system run their programs and wait for output; then

4. check the output for errors or valid results, and usually;

5. make changes and try again.

Sometimes the testing phase can take a really long time, especially when one has to wait for a job to complete to see if it fails or not. Many times, the development nodes are busy and they can get slow. Also a single development node is not the best environment for those users who want to test multi-node (parallel worker or MPI) software.

What if you could run a job on the cluster but have full interactivity just like on the development node? You can - with an "interactive job"

Jobs that need to be run interactively are not just for testing, but often require a lot of user input and run a graphical user interface (GUI). Some examples of typical interactive jobs include:

● Multi-node jobs you need to test

● Matlab code that requires a user interface and needs to run for more than 2 hours

● Large-memory interactive jobs

● Work that requires both a GUI and also high amount of memory and processors

● Debugging large OpenMP or MPI jobs (e.g using TotalView GUI or the

Our development nodes are designed for much of this kind of work, but by using the scheduler, you can run longer on a compute node and excede the CPU time limits imposed on the development nodes.

Scheduling an Interactive Job

To schedule a command-line or GUI interactive job, you must provide the -I option to the qsub command. For example:

qsub -I -l nodes=2:ppn=4,walltime=02:00:00,mem=32 gb -N MyInteractivejob

Note on syntax: we recommend that qsub options appear in a single qsub script, but you can also provide those directly on the command line as you see above. The above command will seek out 2 nodes that have 4 free cores and 16 gb each available, and reserve it for 2 hours for you.

After you hit enter to the command above, you will have to wait until the scheduler can provide the resources that you requested. Once that happens, then you will be logged in to a compute node directly that has been reserved for this job. If you reserve all of the cores (28 cores in the case of Laconia/Intel16 nodes) or memory then you will have the node to yourself (but it may take a long time to schedule. If you request less than all the cores, you will share this compute node with other jobs but is still less than on a busy dev-node. >

For the example above, you will be given 2 hours to work at which point the job will be cancelled and you will be logged out. If you exit from the compute node before the time is up, the job is cancelled. You can simply re-submit the qsub, and start again.

Example commands

1) Connect to the HPCC normally from your computer (terminal, moba xterm, putty, etc). Use the "-X" command if you are using MobaXterm or XQuartz and intend to use a GUI program (e.g Stata, Matlab, Ansys), but do not include the -X if your using Mac terminal

ssh -X yournetid@hpcc.msu.edu # -X is only neede for GUI jobs.

you will be connected to a gateway node.

2) Connect to any of the development nodes. The dev-node you select does not affect which type of compute node the interactive jobs runs on. Note that Interactive jobs do not always work from gateways (you may get the error "job apparently deleted"). Again use -X only if you are using XQuartz

ssh -X dev-intel14-k20

3) Issue the qsub command from any dev-node and issue from there. Again, if you plan on using GUI software, you need to include the "-X" option. For example, here is a qsub command asking for one node, 10 cores, and 2gb/cores. Note if you get an error "[qsub: DISPLAY not set] " then you are not using an X-enabled terminal (e.g. Mac Terminal) and should not use this option. (Use XQuartz/Mac or MobaXterm/Windows or our Remote Desktop Connect instead if you want to use GUI software)

qsub -I -X -l nodes=1:ppn=10,mem=20gb,walltime=04:00:00 -N testing

You should, after a while, see something like this:

[billspat@dev-intel14-k20 ~]$ qsub -I -l nodes=1:ppn=10,mem=20gb,walltime=01:00:00

-N testing

qsub: waiting for job 50482872.mgr-04.i to start

qsub: job 50482872.mgr-04.i ready

[billspat@csp-018 ~]$

While inside the job, you can use all of the QSUB variables like $PBS_JOBID etc for example qstat -f $PBS_JOBID to see current statistics of the job you are inside. To see all of the PBS variables and settings you can use the command

env |grep PBS

(which shows all environment variables and filters only those that contain PBS). When you are finished using your commands and need to exit the job, use 'exit' to quit

[billspat@csp-018 ~]$ exit

logout

qsub: job 50482872.mgr-04.i completed

[billspat@dev-intel14-k20 ~]$

Note that the regular job policies are in place : keeping your job to < 4 hours will schedule more quickly as it can run on idle buy-in nodes (see https://wiki.hpcc.msu.edu/display/hpccdocs/Scheduling+Jobs)

Below is a full interactive session, where I use the R system for stats and examine aspects of parallel computing ( see our R tutorial on the HPCC wiki by Nanye Long). Note I'm using R in text mode, not GUI mode (there is no -X parameter)

I hope these details help you find it easier to perform ad-hoc testing and analysis on our HPCC. As always, if you have any questions please contact us via https://contact.icer.msu.edu/contact

Example R session in an Interactive Job

$ qsub -I -l nodes=1:ppn=10,mem=20gb,walltime=01:00:00 -N testing

qsub: waiting for job 50482872.mgr-04.i to start

qsub: job 50482872.mgr-04.i ready

[billspat@csp-018 ~]$ # note - need to load modules in this new session

[billspat@csp-018 ~]$ module load GNU/4.9 OpenMPI R/3.3.2 3

[billspat@csp-018 ~]$ R

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"

Platform: x86_64-pc-linux-gnu (64-bit)

[Previously saved workspace restored]

> library(doParallel) > Loading required package: foreach

> foreach: simple, scalable parallel programming from Revolution Analytics

> Use Revolution R for scalability, fault tolerance and more.

> http://www.revolutionanalytics.com > Loading required package: iterators

> Loading required package: parallel > > # Request a single node (this uses the "multicore" functionality)

> registerDoParallel(cores=as.numeric(Sys.getenv("PBS_NUM_PPN")[1]))

> x <- iris[which(iris[,5] != "setosa"), c(1,5)]

> trials <- 10000

> ptime <- system.time({

- r <- foreach(icount(trials), .combine=cbind) %dopar% {

- ind <- sample(100, 100, replace=TRUE)

- result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit)) - coefficients(result1)

-}- })[3]

> cat("Time elapsed:", ptime, "\n")

> Time elapsed: 4.635

> Currently registered backend: doParallelMC

> cat("Number of workers used:", getDoParWorkers(), "\n")

> Number of workers used: 10

> q()

> Save workspace image? [y/n/c]: y

[billspat@csp-018 ~]$ qstat -f $PBS_JOBID

Job Id: 50482872.mgr-04.i

```Job_Name = testing

Job_Owner = billspat@dev-intel14-k20.i

resources_used.cput = 00:00:25

resources_used.energy_used = 0

resources_used.mem = 116460kb

resources_used.vmem = 524028kb

resources_used.walltime = 00:06:33

job_state = R

queue = main

server = mgr-04.i

Checkpoint = u

ctime = Thu Dec 14 11:20:57 2017

Error_Path = /dev/pts/0

exec_host = csp-018/0-9 4

Hold_Types = n

interactive = True

Join_Path = n

Keep_Files = n

Mail_Points = a

mtime = Thu Dec 14 11:21:44 2017

Output_Path = /dev/pts/0 Priority = 0

qtime = Thu Dec 14 11:20:57 2017

Rerunable = False

Resource_List.nodes = 1:ppn=10

Resource_List.mem = 20gb

Resource_List.walltime = 01:00:00

Resource_List.nodect = 1

session_id = 59066 ```

# etc... cut for brevity

# note - exit the job and return to dev node

[billspat@csp-018 ~]$ exit

logout

qsub: job 50482872.mgr-04.i completed

View and download photos of ICER's hardware.

Using Interactive Jobs to Increase Productivity