CPU and GPU Hour Credits for Node Failures
ICER has identified an issue with the way CPU usage is recorded on our system, and we have implemented a fix to address it. To ensure fair access to computing resources, researchers using the HPCC who have not bought cluster hardware have an annual usage limit of 500,000 CPU hours and 10,000 GPU hours. The scheduler records usage against these limits continuously as jobs run, but does not consider whether a job completed successfully. Due to the complexity of HPCC systems, occasional node failures are unavoidable. When a node fails, jobs running on it may terminate unexpectedly through no fault of the user, leaving the user with resource usage counted against their annual limits and no results to show for it.
ICER does not want to penalize users for system failures outside their control, so we have credited the hours used by jobs affected by failed nodes back towards users’ annual limits. As a result, you may notice that your annual CPU or GPU hour limit has increased. Going forward, ICER will periodically credit resource hours back to users for jobs impacted by node failures. If you have any questions about resource limits, or are concerned that your account may be affected by this issue, please contact us.
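As a rough illustration of the accounting described above, the sketch below totals the hours that would be credited for a set of jobs. It is not ICER's actual tooling: the `Job` record, its field names, and the sample values are hypothetical, and it assumes that usage is charged as allocated cores (or GPUs) multiplied by elapsed wall-clock hours, which this announcement does not state.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    job_id: str
    cores: int            # CPU cores allocated to the job
    gpus: int             # GPUs allocated to the job
    elapsed_hours: float  # wall-clock time the job ran
    node_failed: bool     # True if the job ended because of a node failure


def hours_to_credit(jobs):
    """Return (cpu_hours, gpu_hours) used by jobs that ended in a node failure.

    Assumes usage = allocated resources x elapsed hours (an assumption,
    not a rule taken from the announcement).
    """
    failed = [j for j in jobs if j.node_failed]
    cpu_hours = sum(j.cores * j.elapsed_hours for j in failed)
    gpu_hours = sum(j.gpus * j.elapsed_hours for j in failed)
    return cpu_hours, gpu_hours


# Made-up sample data: one successful job and two hit by node failures.
jobs = [
    Job("a", cores=16, gpus=0, elapsed_hours=10.0, node_failed=False),
    Job("b", cores=32, gpus=2, elapsed_hours=5.0, node_failed=True),
    Job("c", cores=8, gpus=1, elapsed_hours=2.5, node_failed=True),
]

cpu_credit, gpu_credit = hours_to_credit(jobs)
print(f"CPU hours to credit: {cpu_credit}")
print(f"GPU hours to credit: {gpu_credit}")
```

In this sketch only jobs "b" and "c" contribute to the credit, since job "a" finished normally and its usage still counts against the annual limit.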
Steven Ford
ICER System Administrator