Research Computing is planning a complete HiPerGator service outage to perform maintenance starting Monday, May 6, 2019 at 8:00am and ending Thursday, May 9, 2019 at 8:00pm. This will impact all areas of HiPerGator including login and compute nodes, filesystems, network connectivity, and the Slurm scheduler. Users will not be able to access any portion of the HiPerGator cluster effective May 6, 2019 at 8:00am and any active sessions will be terminated at that time.
All maintenance was completed on time. However, there was a problem with the Orange file system, and it was not brought online with the rest of the cluster on May 9th.
After investigation, the cause of the Orange file system crashes was determined, and the file system was brought back online on May 13th, but without SAMBA support for Windows machines. Users who rely on SAMBA connections to the Orange file system are requested to use SFTP or SCP until the problem with the interface between SAMBA and the Orange file system is resolved. DDN (the hardware provider for Orange storage) has been contacted and a patch is in the works.
Job Scheduling: No jobs can be scheduled to run past 8:00am on 5/6.
The scheduling system has been adjusted so that jobs cannot run past the scheduled maintenance start time. Because of this, it may become increasingly difficult for your jobs to start as the maintenance date approaches. If you submit a job before the maintenance start time but its requested time limit would extend past that start time, the job will remain in the queue until maintenance has been completed and scheduling has resumed. Alternatively, you can decrease the time requested for a job by using the '--time' flag and setting it to a value that fits within the remaining window.
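To pick a '--time' value that fits within the remaining window, the arithmetic can be sketched as follows. This is only an illustration, not a Research Computing tool: the function name max_time_limit and the 10-minute safety margin are assumptions, and the maintenance start is taken from this announcement in cluster local time. The result uses Slurm's days-hours:minutes:seconds time format.

```python
from datetime import datetime, timedelta

# Maintenance start from this announcement: Monday, May 6, 2019 at 8:00am.
MAINTENANCE_START = datetime(2019, 5, 6, 8, 0)


def max_time_limit(now, start=MAINTENANCE_START, margin=timedelta(minutes=10)):
    """Return the longest time limit (D-HH:MM:SS) that ends before `start`.

    `margin` is a hypothetical safety buffer to allow for the delay between
    submitting a job and the job actually starting. Returns None when no
    time remains in the window.
    """
    total = int((start - now - margin).total_seconds())
    if total <= 0:
        return None
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    limit = max_time_limit(datetime.now())
    if limit is None:
        print("No time remains before the maintenance window.")
    else:
        # job.sh is a placeholder for your own batch script.
        print(f"sbatch --time={limit} job.sh")
```

The printed value is an upper bound only. A job that waits in the queue starts later than the moment the value was computed, so a shorter request has a better chance of starting before the outage.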
Work to be performed
There are two major tasks to be performed, and both require taking all services offline for a period of time. The first major task is the migration of the shared directories, /home and /apps, from an aging filesystem to a new, modern, scalable solution. Since all users mount /home immediately upon login, it is imperative that there is no activity on the cluster during this migration in order to avoid potential data corruption. The new storage system was announced in the UFIT News on March 12, 2019, and more information can be found on the UF RC website at www.rc.ufl.edu. The second major task involves a series of software upgrades for our latest Lustre storage environment, known to most as Orange storage. These upgrades will provide a more recent version of the Lustre filesystem, which includes several new features that will benefit the HPC community, as well as bug fixes, security patches, and firmware updates. The upgrade process, however, does involve taking the entire /orange filesystem offline.
Resumption of Services
Once all services have been restored and extensive testing has been completed by Research Computing staff, access to the cluster will be restored and the scheduler will be resumed. Please notify support immediately if you encounter any issues with HiPerGator services after the maintenance activities have been completed. Also, feel free to contact support if you have any questions or concerns regarding the upcoming maintenance.