Outage16Oct2018

From NJIT-ARCS HPC Wiki

Emergency maintenance on Kong 10/15-16/2018

Background

  • In August 2017, the Grid Engine scheduler (GE) started dropping entire queue instances.
  • A workaround was created that restored the proper nodes to their corresponding queues.
  • Later in 2017, queue instances would almost always disappear when GE was restarted after a crash. GE crashes were infrequent.
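
The check behind such a workaround can be sketched as a comparison between the hosts each queue is supposed to contain and the hosts it contains after a restart. This is only an illustrative sketch, not the script that was used on Kong: the function name, the queue names and the node names are all hypothetical, and the two mappings are assumed to have been collected beforehand from the scheduler configuration.

```python
def missing_queue_instances(expected, actual):
    """Return {queue: sorted hosts that should be in the queue but are not}."""
    missing = {}
    for queue, hosts in expected.items():
        # A queue that vanished entirely is treated as having no hosts.
        absent = set(hosts) - set(actual.get(queue, ()))
        if absent:
            missing[queue] = sorted(absent)
    return missing


# Hypothetical queue and node names, for illustration only.
expected = {"short.q": ["node001", "node002"], "long.q": ["node003"]}
actual = {"short.q": ["node001"]}  # long.q was dropped, short.q lost a node

print(missing_queue_instances(expected, actual))
```

Every host reported here would then need to be added back to its queue.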

10/12-15/2018

  • GE started crashing once or twice a day
  • On 10/15, a broken queue configuration was found that was causing all subsequent queue configurations to fail when GE was restarted
  • Fixing this allowed GE to start properly, but GE then ran many more jobs on the nodes than the nodes had available cores
  • The decision was made to have a 24-hour outage in which to identify the causes and correct them
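
The oversubscription described above, more running jobs on a node than it has cores, can be detected with a simple per-node comparison. The following is a minimal sketch under stated assumptions: the function name and the node names are hypothetical, each job is assumed to occupy one slot per core it uses, and the core counts and slot counts are assumed to have been gathered separately from the scheduler.

```python
def oversubscribed_nodes(cores, running_slots):
    """Return {node: (slots_in_use, cores)} for nodes running more slots than cores."""
    over = {}
    for node, used in running_slots.items():
        available = cores.get(node)
        # Nodes with an unknown core count are skipped rather than guessed at.
        if available is not None and used > available:
            over[node] = (used, available)
    return over


# Hypothetical node names and counts, for illustration only.
cores = {"node001": 8, "node002": 16}
running_slots = {"node001": 20, "node002": 16, "node999": 4}

print(oversubscribed_nodes(cores, running_slots))
```

A node that is exactly full is not reported; only nodes where the slots in use exceed the core count are.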

10/16/2018

  • Complete the queue cleanup
  • Update operating system on all nodes
  • Update GE
  • Run tests