Power Incident on 25 August
On 25 August a power failure in the data center caused the loss of some jobs on the HiPerGator cluster.
Cause of Incident
At approximately 7:00pm, electrical maintenance began and the backup generator for the cluster was started. Once the generator was running and supplying power, the UPS was switched from the city power lines to the generator so that maintenance on that side of the electrical system could begin. Unfortunately, part of the switching lineup was missed: although the backup generator was producing power, that power was not reaching the UPS. The UPS therefore drained over the next 5-10 minutes, and once it was exhausted, part of the cluster lost power. This happened at approximately 7:15pm.
Power was restored quickly once the loss was noticed, but by then operations within HiPerGator had already been disrupted.
The systems affected by this incident were:
- HiPerGator-RV: Most functionality was lost
- HiPerGator 3.0: Many compute nodes lost power, and the jobs running on them were lost
- HiPerGator 2.0: Some compute nodes were affected, but most continued to run the jobs on them
- HiPerGator-AI: Very low impact. The effects seen were due to a loss of power on switches that this part of the cluster is connected to.
- Open Science Grid CMS Storage: Partial loss of power on this system caused this filesystem to be inaccessible
Once UFRC staff were notified of the situation at around 8:00pm, the SLURM scheduler was paused so that no jobs would be started while staff brought systems back online. The scheduler was paused for approximately 90 minutes. The cluster was restored to working order (with a couple of nodes still shut down) by 10:00pm. The CMS storage system was returned to service around 10:30pm.
A couple of nodes are still down, which is not unexpected. Whenever a large number of nodes are shut down, we expect one or two to have issues afterwards simply due to mechanical or electronic failure. These will be addressed but should not affect running jobs.
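For users who want to check whether their own jobs were among those lost, the following is a minimal sketch that queries SLURM accounting with `sacct` and keeps jobs whose final state suggests a node failure. The helper names (`sacct_command`, `affected_jobs`, `report`) are hypothetical, and the choice of `NODE_FAIL` and `FAILED` as the states to look for is an assumption; the states actually recorded for jobs lost in this incident may differ.

```python
import subprocess

# Assumption: jobs killed by a node losing power are recorded with one of
# these states. Adjust the set if the accounting records show otherwise.
SUSPECT_STATES = {"NODE_FAIL", "FAILED"}

FIELDS = "JobID,JobName,State,End,NodeList"


def sacct_command(user, start, end):
    """Build a sacct command listing one user's jobs in a time window."""
    return [
        "sacct",
        "-u", user,          # only this user's jobs
        "-X",                # one line per job allocation, no job steps
        "-S", start,         # window start, e.g. "08/25-19:00"
        "-E", end,           # window end, e.g. "08/25-22:00"
        "--format=" + FIELDS,
        "--parsable2",       # fields separated by "|"
        "--noheader",
    ]


def affected_jobs(sacct_output):
    """Return (job_id, name, state, end, nodes) for jobs in a suspect state."""
    jobs = []
    for line in sacct_output.splitlines():
        parts = line.split("|")
        if len(parts) != 5:
            continue  # skip blank or unexpected lines
        job_id, name, state, end, nodes = parts
        # A state may carry a suffix (e.g. "CANCELLED by <uid>"); compare
        # only its first word.
        if state.split(" ")[0] in SUSPECT_STATES:
            jobs.append((job_id, name, state, end, nodes))
    return jobs


def report(user, start, end):
    """Run sacct on a login node and return that user's suspect jobs."""
    result = subprocess.run(
        sacct_command(user, start, end),
        capture_output=True, text=True, check=True,
    )
    return affected_jobs(result.stdout)
```

On a login node, calling `report("your_username", "08/25-19:00", "08/25-22:00")` would list jobs from the evening of the incident that ended in one of the suspect states. Any job found this way would need to be resubmitted.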