
This site is deprecated and will be decommissioned shortly. For current information regarding HPC visit our new site: hpc.njit.edu

Metrics

From NJIT-ARCS HPC Wiki

Historically, ARCS concentrated its resources on providing researchers with the best HPC environment it could: hardware, software, and consulting. ARCS did not track usage as a percentage of capacity.

ARCS can produce HPC usage and capacity metrics, but we need more time and a narrower scope to make those metrics reliable and meaningful.

  • Stock metrics programs (such as XDMOD) assume the cluster does not change in size over the accounting period; Kong had nodes removed and quite different nodes added (in terms of number of cores, RAM, and number of GPUs). Many man-hours were put into an effort to produce these metrics for the 3/2017-3/2018 period, but the results were untrustworthy: they showed usage patterns that did not make sense. A shorter effort covering the period 10/2017 to 3/2018 produced believable results for some categories (i.e., queues) but not for others.
  • The task is complicated by GPU nodes also being used as symmetric multiprocessor (SMP) high-RAM nodes: if a user wants all the CPU cores on such a node, but 2 are in use by GPU jobs, then that user's job has to wait. Conversely, if a user is using both CPU slots (a common occurrence), then GPU jobs wait even though the GPUs are not in use.
  • At that point (4/2018) the project was put on hold due to a backlog of equally high-priority tasks (iterative price quotes, hardware installation) for research faculty.
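
To illustrate the changing-capacity problem described above, here is a minimal sketch of computing utilization against a core count that varies during the accounting period. It is not how XDMOD or ARCS computes metrics; the function names, the `capacity_changes` structure, and the numbers in the usage note are hypothetical.

```python
from datetime import datetime


def core_hours_available(capacity_changes, period_start, period_end):
    """Integrate a piecewise-constant core count over the accounting period.

    capacity_changes is a hypothetical list of (datetime, total_cores) pairs,
    sorted by time. Each entry gives the core count in effect from that time
    until the next entry (or until period_end for the last entry).
    """
    total = 0.0
    for i, (changed_at, cores) in enumerate(capacity_changes):
        if i + 1 < len(capacity_changes):
            next_change = capacity_changes[i + 1][0]
        else:
            next_change = period_end
        # Clip the segment to the accounting period.
        seg_start = max(changed_at, period_start)
        seg_end = min(next_change, period_end)
        if seg_end > seg_start:
            hours = (seg_end - seg_start).total_seconds() / 3600.0
            total += cores * hours
    return total


def utilization(core_hours_used, capacity_changes, period_start, period_end):
    """Return used core-hours as a fraction of available core-hours."""
    available = core_hours_available(capacity_changes, period_start, period_end)
    if available == 0:
        raise ValueError("no capacity recorded for the accounting period")
    return core_hours_used / available
```

For example, with 100 cores for one day followed by 200 cores for one day, the available capacity is 100 × 24 + 200 × 24 = 7200 core-hours, so 3600 core-hours of usage is 50% utilization. Dividing by a fixed size of 100 or 200 cores would instead report 75% or 37.5%.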

  • As of 5/14/2018 the backlog has been somewhat cleared, so HPC metrics work now has higher priority and can resume. We are particularly hopeful about a new approach that uses one part of XDMOD, the "shredder" tool, which converts Grid Engine accounting records into MySQL database records. This will allow other (and faster) tools to analyze the records.
  • XDMOD will be implemented on Stheno when it moves to OHPC in summer 2018, and on Kong within the next few weeks.
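
The idea behind the shredder approach, turning Grid Engine accounting records into database rows that other tools can query, can be sketched as follows. This is not XDMOD's shredder: it is a small stand-in that uses Python's built-in SQLite module instead of MySQL, keeps only the first 14 colon-separated fields of each accounting record, and assumes no field contains a colon. The table name `jobs`, the column name `unix_group`, and both function names are hypothetical.

```python
import sqlite3

# Leading fields of a Grid Engine accounting record, in file order.
# "group" is renamed to unix_group here because GROUP is an SQL keyword.
FIELDS = [
    "qname", "hostname", "unix_group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
]
INT_FIELDS = [
    "job_number", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status",
]


def parse_accounting_line(line):
    """Parse one accounting line into a dict, or return None if it is skipped."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment header
    parts = line.split(":")
    if len(parts) < len(FIELDS):
        return None  # truncated or malformed record
    record = dict(zip(FIELDS, parts))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["ru_wallclock"] = float(record["ru_wallclock"])
    return record


def load_records(conn, lines):
    """Insert parsed records into a 'jobs' table; return the number inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "qname TEXT, hostname TEXT, unix_group TEXT, owner TEXT, "
        "job_name TEXT, job_number INTEGER, account TEXT, priority INTEGER, "
        "submission_time INTEGER, start_time INTEGER, end_time INTEGER, "
        "failed INTEGER, exit_status INTEGER, ru_wallclock REAL)"
    )
    rows = [r for r in (parse_accounting_line(l) for l in lines) if r is not None]
    placeholders = ", ".join(":" + name for name in FIELDS)
    conn.executemany("INSERT INTO jobs VALUES (" + placeholders + ")", rows)
    conn.commit()
    return len(rows)
```

Once the records are loaded, per-queue usage becomes an ordinary query, for example `SELECT qname, SUM(ru_wallclock) FROM jobs GROUP BY qname`. A real analysis would also need the slot count of each job, which appears later in the record and is omitted from this sketch.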