Good Practices for Using Local Disk Space
Recently, several users have had their HPCC jobs terminated with unexpected error messages, or canceled by a system administrator because the jobs were causing nodes to crash. In these cases, the cause was a full local disk on the affected nodes. To help users better understand local disk space and how to use it well, we present some FAQs here.
1. Does my program use any local disk space? There are several ways that a job may be using local disk space.
a. If you have I/O operations on files whose path includes the environment variable $TMPDIR;
b. If your program uses /tmp or /mnt/local as a prefix to the filename for I/O operations, such as file reads or writes;
c. If your program relies on a software platform that uses local disk space for some of its functions. For example, OpenMPI uses local disk to temporarily buffer data when transferring data between processes.
2. Will my job be affected by the full local disk issue? If you are using local disk space in any of the ways listed above AND the node your job is running on has a full local disk, your job may crash if there is no space for its I/O operations. Your job may be responsible for the crash or it may be a victim of another job running on the same node.
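One way to see whether the local disk is close to full is to check its free space from within the job script before your program starts writing. The following is a minimal sketch, assuming a POSIX `df` and `awk` are available on the node. The helper name `free_kb` and the 1 GB threshold are illustrative assumptions, not HPCC-provided settings.

```shell
#!/bin/bash
# Print the free space, in KB, of the file system that holds the given directory.
# free_kb is a hypothetical helper name, not an HPCC-provided command.
free_kb() {
    # -P prints one portable line per file system; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Fall back to /tmp when $TMPDIR is not set (for example, outside a job).
scratch_dir="${TMPDIR:-/tmp}"
required_kb=$((1024 * 1024))   # illustrative threshold: 1 GB

available_kb=$(free_kb "$scratch_dir")
if [ -z "$available_kb" ]; then
    echo "Could not determine free space on ${scratch_dir}" >&2
elif [ "$available_kb" -lt "$required_kb" ]; then
    echo "Warning: only ${available_kb} KB free on ${scratch_dir}" >&2
else
    echo "OK: ${available_kb} KB free on ${scratch_dir}"
fi
```

A check like this only reports the state of the disk at the moment it runs. Other jobs on the same node can still fill the disk afterwards.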
3. How can my job avoid the full local disk issue? By carefully choosing which disk space your job uses, you may be able to avoid, or at least greatly reduce, the cases in which full local disk issues occur. For more information about selecting disk space, please refer to https://wiki.hpcc.msu.edu/display/ITH/File+system. Additionally, the following precautions should be taken when using local disk space:
a. Be aware of the size of local disk space. Most compute nodes on HPCC have 170 GB of local disk space, which may be shared by several jobs and users. If you use local disk for storing temporary files, you should plan to remove them right after your program's execution. You can remove these files either within your program or within your job script. By cleaning up after using this space, the disk space is recycled for reuse sooner than if you wait for the automatic system cleanup process. Use $TMPDIR to refer to the local disk space in a job rather than hard-coding file names that begin with /tmp or /mnt/local, because the directory $TMPDIR will be cleaned up automatically by the job manager after job completion. If your program needs a large amount of disk space (for example, to read or write a large file or many temporary files), consider using a larger, dedicated disk space, such as your scratch, research or home space, to avoid flooding the local disk.
TICKET HIGHLIGHTS Mar. 2018
If you have any further questions, please feel free to contact us.
b. Be aware of the cost of I/O operations. Many users choose local disk space over a networked file system because I/O on the local disk does not have any network overhead. The flash file system, /mnt/ffs17, is another storage option on HPCC if your program performs high-frequency I/O operations on many small files. For details, please refer to our wiki page at https://wiki.hpcc.msu.edu/pages/viewpage.action?pageId=11895935. Note that the flash file system is not local; it is accessible from all nodes. It is faster because flash memory provides faster data access than a disk drive.
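The cleanup advice in item 3a can be sketched as a small job-script function: create a private working directory under $TMPDIR, do the temporary I/O there, copy the results to permanent storage, and remove the directory right away. This is a minimal sketch under stated assumptions. The names `run_job`, `myjob`, `temp.dat` and `result.dat` are hypothetical, and the `echo` line stands in for your real program.

```shell
#!/bin/bash
# Run a job step that keeps its temporary files on local disk and
# removes them as soon as the step finishes.
# run_job and all file names here are hypothetical, for illustration only.
run_job() {
    results_dir=$1

    # Create a private working directory on local disk.
    # Falls back to /tmp when $TMPDIR is not set (for example, outside a job).
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/myjob.XXXXXX") || return 1

    # Stand-in for the real program: write a temporary file on local disk.
    echo "intermediate data" > "$workdir/temp.dat"

    # Copy anything worth keeping to permanent storage.
    cp "$workdir/temp.dat" "$results_dir/result.dat"
    status=$?

    # Free the local disk space immediately instead of waiting for
    # the automatic cleanup.
    rm -rf "$workdir"
    return "$status"
}
```

In a job script, this could be called as `run_job "$PBS_O_WORKDIR"`, where `$PBS_O_WORKDIR` is the directory the job was submitted from under PBS. Whether that is the right destination for your results is an assumption; your home, research or scratch space may be more appropriate.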
4. Can I reserve the local disk space for my job to avoid overloading it? Users can request local file size by using #PBS -l file=