Outage17Feb2016

From NJIT-ARCS HPC Wiki


At about 13:00 on Wednesday 2/17/2016, Kong was taken down to migrate the user home directories from the Kong headnode to a dedicated NFS server. The migration was intended to address the problem of poor response times on Kong.

From the start, the transfer of files from the headnode to the dedicated NFS server took much longer than expected, due in part to the misconfiguration described below, and did not complete until 2/25. As far as is known, no data was lost in the transfer. During the transfer, system response to user commands on the headnode was often extremely slow. The extremely slow transfer pointed to one or more serious underlying problems.

Starting on 2/18, we attempted to identify the cause of the problem. Despite intensive effort, we did not identify it until about 14:30 on 2/26, at which point the problem was corrected.

The source of the problem was a network misconfiguration of one Kong compute node. This node was misconfigured on 9/24/2015 but had been powered off since then, so it had not caused any issues. The entire Kong cluster was rebooted on 2/17/2016. The reboot brought the misconfigured node online, which caused the extremely poor performance of the Kong headnode when it came back online on 2/18.

We sincerely regret the incident and the time lost by researchers during this outage.