With the successful completion of the most recent Quest downtime, September 12-19, 2022, I am pleased to share some details about the work that was completed and its benefits.
Quest includes the Quest Analytics Nodes, the Genomics Compute Cluster (GCC), and the Kellogg Linux Cluster (KLC). Quest downtimes are typically scheduled to apply critical system patches and updates that cannot be performed while jobs are running. Regular hardware, firmware, and software updates keep the cluster stable and protect Quest from the security vulnerabilities that come with running outdated and unsupported software.
Highlights of the work performed during the September 2022 downtime are detailed below:
Operating System Updated to Red Hat version 7.9
- Brings updated filesystem software, drivers, and other system components
- Enables the use of updated research applications
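Once the downtime concluded, you can confirm the new operating system version from any Quest login node:

```bash
# Check the Red Hat release installed on the node
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)
```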
Job Scheduler Upgraded to Slurm version 22.05
- Addresses known issues in which certain jobs would get stuck in a “Completing Job” state and require manual intervention to recover
- For all jobs submitted after the upgrade, Slurm will save your job submission script for a year and make it accessible via the sacct command (see the example below). This new feature helps with both troubleshooting jobs and repeating successfully completed jobs, and is especially useful when scripts are needed to meet reproducible-research requirements
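As a quick sketch (the job ID below is a placeholder), a stored script can be printed with sacct's --batch-script option:

```bash
# Retrieve the submission script Slurm saved for a job
# (1234567 is a hypothetical job ID; substitute your own)
$ sacct -j 1234567 --batch-script
```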
Quest 7 Nodes Retired; Quest 10 Nodes Added
- Quest 7 nodes (28 cores and 128 GB of memory per node) have been replaced with Quest 10 nodes (52 cores and 192 GB of memory per node), increasing the compute and memory capacity of Quest. A sample job request for a full Quest 10 node is shown below. For more information about the nodes available on Quest, see the Quest Technical Specifications
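As an illustrative sketch, a batch job that claims a full Quest 10 node might look like the following; the account and partition names are placeholders for your own allocation:

```bash
#!/bin/bash
#SBATCH --account=p12345        # hypothetical allocation ID
#SBATCH --partition=normal      # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=52    # all 52 cores on a Quest 10 node
#SBATCH --mem=0                 # --mem=0 requests all memory on the node (192 GB)
#SBATCH --time=01:00:00

srun ./my_program               # hypothetical executable
```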
NVIDIA GPU Drivers Updated to version 515
- Upgrading the NVIDIA drivers allows you to run software compiled against the latest version of CUDA, version 11.7 (see the example below)
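For example, you can confirm the driver version on a GPU node with nvidia-smi; the CUDA module name below is an assumption, so check module avail cuda for the exact name on Quest:

```bash
# Report the installed NVIDIA driver version on a GPU node
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Load a CUDA 11.7 toolkit (module name is an assumption) and compile a kernel
$ module load cuda/11.7
$ nvcc -o saxpy saxpy.cu        # saxpy.cu is a hypothetical source file
```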
Filesystem Software, IBM General Parallel File System (GPFS), and Interconnect Drivers Updated
- Addresses a memory leak bug in the previous version of GPFS that caused periodic performance issues
- Jobs with heavy I/O (that is, reading and writing) should see improved performance with the upgraded GPFS
- Interconnect (InfiniBand) drivers improve performance and stability for node-to-node and node-to-storage communications
- GPFS and Interconnect driver updates enable us to make the most recent Message Passing Interface (MPI) performance libraries available on Quest for software parallelization (see the sketch after this list)
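As a minimal sketch of running an MPI program under the updated stack (the MPI module name is an assumption; check module avail for the stacks actually installed on Quest):

```bash
# Load an MPI implementation (module name is an assumption)
$ module load mpi/openmpi-4.1.4

# Write and build a trivial MPI hello-world
$ cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
$ mpicc -o hello_mpi hello_mpi.c

# Launch 8 ranks across 2 nodes through Slurm
$ srun --nodes=2 --ntasks-per-node=4 ./hello_mpi
```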
Graphical Interface for Remote Access, FastX, Upgraded to FastX3; FastX2 Retired
- The FastX3 desktop client provides faster graphics support than FastX2
- Provides access to the newest FastX features, such as session sharing, which are only available in FastX3