Troubleshooting Failed HPCC Jobs
If your submitted job has failed, the first thing to investigate is the SLURM output file. Most of the time, the error message in the output file will provide clues for fixing the problem. If not, the next step is to run the command from the job script directly on a dev-node. A short run is usually enough to reveal syntax errors or I/O problems. Finally, if you still cannot figure out why the job failed, feel free to send us a ticket. Remember to include the job ID; it is normally part of the SLURM output file name (e.g., slurm-100100.out).
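The first steps above can be sketched in shell. This is only a minimal sketch, assuming the default output file name pattern slurm-<jobid>.out. The helper names job_id_from_output and scan_output, and the keyword list passed to grep, are illustrative assumptions rather than an official ICER tool.

```shell
# Hypothetical helper: recover the job ID from a default SLURM output
# file name such as slurm-100100.out (assumes the default naming pattern).
job_id_from_output() (
    base="${1##*/}"          # drop any leading directory part
    base="${base#slurm-}"    # strip the "slurm-" prefix
    printf '%s\n' "${base%.out}"   # strip the ".out" suffix
)

# Hypothetical helper: print numbered lines that often point to a failure.
# The keyword list is an assumption; adjust it for your own software.
scan_output() {
    grep -n -i -E 'error|killed|out of memory|time limit|cancelled' "$1"
}

# Example usage (adjust the file name to your own job):
# scan_output slurm-100100.out
# job_id_from_output slurm-100100.out
```

If the scan turns up nothing useful, running the failing command by hand on a dev-node for a short time, as described above, is the next thing to try.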
On a related note, you can list your past jobs with the following command:
sacct -S 2022-01-01 -D -X --format="state%25,JobName%15,JobID%20,Start,End"
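Above, the "-S" flag sets the start date, so the output lists all jobs from that date up to now. The "-X" flag restricts the output to job allocations (individual job steps are not shown), and "-D" keeps duplicate job entries instead of removing them. Check out the sacct documentation if you want to learn more.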
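To narrow such a listing down to jobs that need troubleshooting, one option is to filter on the state column. The sketch below assumes pipe-delimited sacct output (the "-P" flag) with the state as the first field; the helper name unsuccessful_jobs is hypothetical, and treating everything other than COMPLETED, RUNNING, and PENDING as unsuccessful is a simplifying assumption.

```shell
# Hypothetical helper: read pipe-delimited sacct rows (State|JobName|JobID|...)
# from stdin and print only the jobs that did not finish successfully.
# Assumption: any state other than COMPLETED, RUNNING, or PENDING is a problem.
unsuccessful_jobs() {
    awk -F'|' '$1 != "COMPLETED" && $1 != "RUNNING" && $1 != "PENDING"'
}

# On the cluster, feed it from sacct (-n: no header line, -P: pipe-delimited):
# sacct -S 2022-01-01 -X -n -P --format="State,JobName,JobID,Start,End" | unsuccessful_jobs
```

The job IDs in the third field of the remaining rows can then be matched to the corresponding slurm-<jobid>.out files.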
Nanye Long, PhD
ICER Research Consultant