GPU and CPU Usage Limitation Imposed on General Accounts
As many of our users have noticed, the HPCC job policy was updated recently. SLURM now enforces CPU and GPU hour limits on general accounts, and the command “SLURMUsage” now reports both CPU and GPU usage. For general account users, the CPU usage limit is reduced from 1,000,000 to 500,000 hours, and the GPU usage limit is reduced from unlimited to 10,000 hours. For more details, visit the ICER wiki page at Job Policies.
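As a rough illustration of how the limits are accounted (the job sizes below are invented for illustration; `SLURMUsage` reports your actual totals), usage accrues as the number of CPUs or GPUs reserved multiplied by the elapsed walltime:

```shell
#!/bin/sh
# Hypothetical illustration of how SLURM usage accrues against the
# general-account limits (500,000 CPU hours, 10,000 GPU hours).
# Usage = (number of CPUs or GPUs reserved) x (elapsed walltime in hours).

CPU_LIMIT=500000
GPU_LIMIT=10000

# Example job: 8 CPUs and 1 GPU reserved for 24 hours.
cpus=8; gpus=1; hours=24

cpu_hours=$((cpus * hours))   # 8 * 24 = 192 CPU hours charged
gpu_hours=$((gpus * hours))   # 1 * 24 = 24 GPU hours charged

echo "CPU hours charged: $cpu_hours (limit $CPU_LIMIT)"
echo "GPU hours charged: $gpu_hours (limit $GPU_LIMIT)"
```

Note that the charge depends on what was reserved, not on what the job actually used.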
Most of our users will not be affected by these changes, because they either have buy-in accounts or their usage is well below the limits. However, some users may need to take action to avoid exceeding the limits. Here are some tips to help these users use resources more efficiently.
Monitor your jobs frequently, and cancel a job as soon as you know it cannot produce correct results. For example, suppose you have an array of 20 similar jobs, each requesting 24 hours of walltime. If one of them crashes after 1 hour of execution, try to find the cause of the problem and determine whether it would affect the other 19 jobs. For problems caused by out-of-memory errors, disk space quotas, file access permissions, etc., consider cancelling all jobs in the array and resubmitting them after the problem is resolved. This reduces the rate of failed jobs. For more help on job management, see our wiki page at Job Management by SLURM.
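The triage above can be scripted. The sketch below is a dry run that only prints the `scancel` command it would issue; the job ID and the failed task number are hypothetical, and `scancel`/`sacct` are standard SLURM commands whose exact behavior you should confirm on your cluster:

```shell
#!/bin/sh
# Dry-run sketch: if one task of a 20-task array fails for a reason that
# would affect every task (out of memory, disk quota, file permissions),
# cancel the whole array instead of letting the remaining tasks burn
# CPU/GPU hours. The job ID 1234567 is hypothetical.

JOBID=1234567
shared_failure=yes   # set this after diagnosing the failed task, e.g. with
                     # "sacct -j ${JOBID}_3 --format=JobID,State,MaxRSS"

if [ "$shared_failure" = yes ]; then
    cmd="scancel $JOBID"        # cancels every task in the array
else
    cmd="scancel ${JOBID}_3"    # cancels only the failed task (task 3)
fi

echo "would run: $cmd"          # dry run; drop the echo to actually cancel
```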
Check and modify your job script to avoid overbooking resources. Users are responsible for reserving the right amount of resources, including the number of CPUs and GPUs, to ensure the successful execution of their jobs. Note that CPU/GPU time is calculated by multiplying the number of CPUs/GPUs by the elapsed time of the job, and the CPUs/GPUs are charged to the job whether or not they are actually used. For example, suppose you have a workflow consisting of a sequence of 5 tasks. You could either (a) wrap the whole workflow into one job and submit one job per run, or (b) wrap each task as a separate job and submit 5 jobs per run. Which method is preferable? Method (a) is preferable if all 5 tasks use the same number of CPUs/GPUs. Method (b) is preferable if the number of CPUs/GPUs needed by each task varies a lot: separating tasks into jobs allows each job to request exactly what its task needs, rather than reserving, for the whole run, resources sufficient for the most demanding task. Furthermore, suppose only one of the five tasks uses a GPU, and that task takes only about 10% of the total execution time of the workflow. With method (b), the GPU usage per run is only about 10% of the total execution time, while with method (a) it would equal the total execution time.
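To make the comparison concrete, here is a hypothetical example (the task durations and GPU counts are invented for illustration): a workflow totaling 10 hours of walltime, in which only one task, lasting 1 hour, needs a single GPU.

```shell
#!/bin/sh
# Hypothetical 5-task workflow: 10 hours of total walltime, of which
# only a 1-hour task needs one GPU.

total_hours=10
gpu_task_hours=1

# Method (a): one job wraps the whole workflow, so the GPU must be
# reserved (and charged) for the entire run.
gpu_hours_a=$((1 * total_hours))       # 10 GPU hours per run

# Method (b): only the job for the GPU task requests a GPU.
gpu_hours_b=$((1 * gpu_task_hours))    # 1 GPU hour per run

echo "Method (a) GPU hours per run: $gpu_hours_a"
echo "Method (b) GPU hours per run: $gpu_hours_b"
```

At 10 GPU hours per run, method (a) would exhaust the 10,000-hour GPU limit after 1,000 runs, while method (b) would allow 10,000 runs.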
If additional SLURM CPU/GPU hours are needed, users can make a request via the online requisition form at CPU/GPU Request Form.