As some of you may know, UF Research Computing has been having issues with the Slurm scheduler on the cluster.
The problem appears to stem from a flaw in the way Slurm handles prolog scripts. Last week we applied a small patch to the scheduler that mitigated the effects of the crash by shortening the time between a crash and the subsequent restart, so users were less likely to notice those effects. However, this did not address the underlying problem.
Yesterday SchedMD (the makers of Slurm) released a patch that they expect will properly resolve this problem. We have installed the patch and are now monitoring the scheduler to confirm that it does.