Tips for GPU Computing on HPCC

As more and more software gains GPU support to accelerate computation, nodes with GPU cards are in very high demand. Below we share a few tips to help users make better use of these resources on HPCC.

  1.  Know whether your software has GPU support and how to run it. Check the software's release notes to find out whether the version you are using has GPU support, and ensure that it is installed with GPU capability enabled. Some software packages let users decide whether to build or install the GPU capability and whether to turn it on or off at run time. Consult the user manual for details such as how many GPUs the software can use, whether a specific GPU can be selected, and whether GPU computation can be turned on and off.
  2. Run your GPU program on a node with GPUs installed. For development and testing, ssh to dev-intel14-k20 or dev-intel16-k80. To run your program as a batch job, make sure you request the right number of GPUs (see the example job script after this list). For details on how to request GPUs, please see the wiki page linked here. Note that only the GPUs a job requests are "visible" to that job.
  3. Monitor your job's GPU usage. When the program is running on a development node, run nvidia-smi or gpustat from another window on the same node to get a snapshot of the GPUs' current status. You can also keep a window open that "watches" the GPU status; to refresh the display every second, run watch -d -n 1 nvidia-smi or watch -d -n 1 gpustat. Make sure the process running your program appears in the display, and check the GPU memory usage and utilization percentage. If your program is running as a batch job on a compute node, first obtain the node name, then run the GPU monitoring command via a remote shell. For example, if your job is running on node nvl-000, you can run ssh nvl-000 gpustat or ssh nvl-000 nvidia-smi to get a snapshot of that node's GPU usage. You should see your job's processes in the output; if not, check that your job script correctly reserves GPUs and that your command actually turns on GPU computation. You can also include the monitoring command in your job script so that the job's output file contains a snapshot of your GPU usage for reference (see the second sketch after this list).
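
As an illustration of step 2, below is a minimal sketch of a batch job script that requests a single GPU. It assumes the SLURM scheduler; the exact directives, GPU type, module name, and the program name my_gpu_program are placeholders for illustration, so please follow the wiki page on requesting GPUs for the options to use on HPCC.

    #!/bin/bash
    #SBATCH --job-name=gpu-test      # a descriptive job name
    #SBATCH --time=01:00:00          # walltime limit
    #SBATCH --ntasks=1               # a single task
    #SBATCH --gres=gpu:1             # request one GPU (a type can be given, e.g. gpu:k80:1)
    #SBATCH --mem=8G                 # memory for the job

    module load CUDA                 # load the CUDA toolkit (the module name may differ)
    srun ./my_gpu_program            # replace with the command that runs your GPU program

The script would then be submitted with sbatch, for example sbatch gpu_job.sb.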
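
For step 3, one way to capture a snapshot of GPU usage in the job's output file is to launch the program in the background, wait until it has had time to start computing on the GPU, and then run the monitoring command from the same script. This is only a sketch: adjust the sleep time to your program, and note that my_gpu_program is again a placeholder.

    ./my_gpu_program &               # start your GPU program in the background
    sleep 60                         # give it time to begin computing on the GPU
    nvidia-smi                       # snapshot of GPU usage, written to the job output file
    wait                             # wait for the program to finish before the job ends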

Note that the more resources a job requests, the longer it may wait in the queue before it starts. You may also want to compare the execution time of your job with and without the GPU. In some cases, using the GPU may actually make the computation take longer, because of the overhead of moving code and data between the CPU and the GPU.
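
One simple way to make such a comparison on a development node, assuming your program can fall back to CPU computation when no GPU is visible (this depends on the software, so check its manual), is to time the program twice and hide the GPUs for the second run using the CUDA_VISIBLE_DEVICES environment variable; my_gpu_program below again stands for your own command.

    # run with the GPUs visible
    time ./my_gpu_program
    # hide all GPUs so that CUDA applications cannot see them
    export CUDA_VISIBLE_DEVICES=""
    time ./my_gpu_program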

If you have any questions regarding the performance of GPU computing, feel free to contact the ICER research consultant team.