Outage16Oct2018
From NJIT-ARCS HPC Wiki
Emergency maintenance on Kong 10/15-16/2018
Background
- In August 2017, the Grid Engine scheduler (GE) started dropping entire queue instances.
- A workaround was created that restored the proper nodes to their corresponding queues.
- Later in 2017, the disappearance of queue instances almost always occurred when GE was restarted after a crash. GE crashes were infrequent.
10/12-15/2018
- GE started crashing once or twice a day.
- On 10/15, a broken queue configuration was found that caused all subsequent queue configurations to fail when GE was restarted.
- Fixing this allowed GE to start properly, but it then ran many more jobs on nodes than the nodes had available cores.
- The decision was made to have a 24-hour outage in which to identify the causes and correct them.
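The oversubscription symptom described above (more jobs on a node than it has cores) can be expressed as a simple per-node comparison. The following is only a minimal sketch, not the procedure actually used on Kong: the function name, the node names, and the numbers are hypothetical, and it assumes the core count and the number of occupied slots per node have already been collected by some other means.

```python
from typing import Dict, List, Tuple


def find_oversubscribed(nodes: Dict[str, Tuple[int, int]]) -> List[Tuple[str, int]]:
    """Return (node, excess) pairs for nodes whose used slots exceed their cores.

    `nodes` maps a node name to (cores, used_slots). Both values are assumed
    to have been gathered beforehand; this function does not query the scheduler.
    Results are sorted by node name.
    """
    oversubscribed = []
    for name, (cores, used_slots) in sorted(nodes.items()):
        excess = used_slots - cores
        if excess > 0:
            oversubscribed.append((name, excess))
    return oversubscribed


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = {
        "node001": (8, 8),
        "node002": (8, 19),
        "node003": (16, 4),
    }
    for name, excess in find_oversubscribed(sample):
        print(f"{name}: {excess} slots over core count")
```

A check of this kind only flags the symptom; it does not identify which queue configuration allowed the extra jobs to be scheduled.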
10/16/2018
- Complete queue cleanup
- Update operating system on all nodes
- Update GE
- Run tests