
IST Strategic HPC Planning Roadmap (Governance)


Executive Summary

This document provides a sustainable framework by which a high performance computing (HPC) infrastructure and support staff can meet the ongoing needs of NJIT researchers.

Motivation

To ensure continuation of NJIT's newly-gained Very High Research Activity (R1) status, we must recognize the rapidly increasing importance of research computing across all science and engineering disciplines.

NJIT lags far behind its peers in providing computational resources. This adversely affects current research programs, hampers attracting top new faculty and students, and jeopardizes R1 status.

NJIT must provide its growing body of researchers with state-of-the-art, professionally-managed, on-demand computing infrastructure and the necessary support staff. These resources should be designed, expanded, and regularly refreshed to meet modern research demands. By establishing a regular budgeting and procurement model, we can maintain hardware compatibility and maximize utility over the lifecycle of each investment.

Implementation Framework

Proposed: a flexible model that provides a range of computational services to support diverse modes of current and future research.

  • A Condominium Model (CM), whereby an NJIT-subsidized infrastructure would be established for researchers to purchase computational resources. IST will maintain a catalog of compatible resources that can be rapidly deployed. Computational resources for funded projects can be allocated immediately if available, or ordered through normal procurement. The catalog will reduce unnecessary variability and promote re-use of investment over the entire lifecycle of purchased equipment. The public-access nodes in the CM would be available to all researchers and their students, providing general access to modern computational resources, including high-speed processors, GPUs, a parallel file system, high-speed networks, and high-capacity storage.
  • Cloud-based resources for short-term projects and bursting ("cloud bursting"), as an adjunct to the on-premise infrastructure. Taking advantage of our strategic partnership with Amazon Web Services (AWS) and the high-speed direct connection provided by NJEDge, NJIT can make cloud resources available transparently to researchers. AWS provides rapid access to state-of-the-art and evergreen computational resources on an elastic, on-demand rental basis. AWS also provides access to practically unlimited storage capacity, e.g., for large data sets.
  • Research pods are an existing service which should be expanded. In situations where a researcher and their students/assistants need physical access to computational hardware, the research pod provides the best balance of self-service hosting and full control in a professionally-managed facility. IST provides an access-controlled computer room with cooling, power, network, and locking racks. Researchers, with the assistance of IST staff, can install their equipment and maintain unimpeded network and physical access to it. This service is currently available in the GITC 4320 server room.
  • Strategic partnerships are critical. NJIT is a member of the XSEDE Campus Champions program and the Eastern Regional Network (ERN). We will continue to seek additional public and private partnerships.
  • Augmentation of the Academic and Research Computing Systems staff by several positions.

Implementation Specifics

Computational Infrastructure

The computational infrastructure comprises CPU and GPU nodes and node interconnects for general use by all researchers, as well as the resources purchased by researchers for their own use.

Kong Public Access Nodes Deprecation and Retirement

The Kong public access nodes currently consist of resources obtained by donation. These resources are over 10 years old and have reached the end of their useful life.

The Kong public access nodes cannot be updated to current versions of supported operating systems. The current node interconnect, GigE (one gigabit/second), is severely inadequate for current MPI parallel simulations. Additionally, support for the operating system on Kong, Scientific Linux 6, ends November 30, 2020, and its successor, version 7, is incompatible with the node interconnects. Due to these significant limitations, the Kong public access nodes will be deprecated starting June 1, 2020, and taken out of service on January 4, 2021. See the Kong Public Nodes Retirement Timeline.
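
As a rough illustration of the interconnect gap, the short Python calculation below compares per-step communication time for a parallel simulation exchanging a hypothetical 64 MiB of boundary data per node, at GigE speed versus an assumed modern 100 Gb/s InfiniBand fabric:

    # Per-step communication time for a hypothetical MPI halo exchange,
    # comparing Kong's GigE interconnect with an assumed modern fabric.

    MESSAGE_BYTES = 64 * 1024**2   # hypothetical 64 MiB of boundary data per step
    LINKS = {
        "GigE (1 Gb/s)":         1e9,    # Kong's current interconnect
        "InfiniBand (100 Gb/s)": 100e9,  # assumed modern alternative
    }

    for name, bits_per_second in LINKS.items():
        seconds = MESSAGE_BYTES * 8 / bits_per_second
        print(f"{name:22s}: {seconds * 1000:7.1f} ms per exchange step")

    # GigE: ~537 ms per step; InfiniBand: ~5 ms. Over tens of thousands
    # of timesteps, the slower fabric dominates total simulation runtime.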

Note that the above does not impact any of the Data Science, faculty-purchased, or public GPU nodes. These newer nodes will be the foundation for a new cluster currently in the planning stage.

New Cluster Lochness

To remain competitive as an R1 research institution, both in providing computational resources for existing researchers and in attracting and retaining desirable new researchers, NJIT must make a significant and sustained investment in research computing infrastructure and staffing.

  1. Use the nodes purchased by NCE, CSLA, and others for installation in early Fall 2019 as the basis for a new cluster, Lochness. These nodes will be dedicated to the researchers who purchased them, but will also be shareable, as described below.
  2. NJIT will fund the initial public-access nodes for Lochness and establish a funding model for refreshing these nodes.

Condominium Model

Establish a new condominium model for researcher-purchased resources. Lochness racks with both CPU and GPU nodes and appropriate node interconnects and switches will be purchased and deployed. These nodes will be available for researcher purchase. When a certain percentage of the nodes has been purchased, an additional rack populated with nodes will be purchased and deployed.

  • Included in this model is a parallel file system (PFS). Researchers can purchase their own storage in the PFS.
  • The scheduler and resource manager for the Lochness cluster, SLURM, can dynamically allocate otherwise unused resources to jobs requesting them. Thus, investments made by the university in this infrastructure are inherently shareable among researchers (see the sketch following this list).
  • Plan for a major Lochness cluster replacement cycle of five years, with smaller annual expansions, so that the infrastructure is again state-of-the-art at each replacement.
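
As a concrete illustration of this sharing policy, the toy Python model below (a minimal sketch; the class, node counts, and method names are illustrative inventions, not the production SLURM configuration) shows public jobs borrowing idle owner-purchased nodes and being preempted when owners reclaim them:

    # Toy model of condominium scheduling: public jobs may borrow idle
    # owner-purchased nodes, and are preempted when the owner submits work.
    # Names and counts are illustrative, not the production SLURM setup.

    class Cluster:
        def __init__(self, owner_nodes, public_nodes):
            self.free = {"owner": owner_nodes, "public": public_nodes}
            self.borrowed = 0  # owner nodes currently lent to public jobs

        def submit_public(self, nodes):
            """Public jobs use public nodes first, then borrow idle owner nodes."""
            from_public = min(nodes, self.free["public"])
            from_owner = min(nodes - from_public, self.free["owner"])
            self.free["public"] -= from_public
            self.free["owner"] -= from_owner
            self.borrowed += from_owner
            return from_public + from_owner == nodes

        def submit_owner(self, nodes):
            """Owner jobs reclaim borrowed nodes by preempting public jobs."""
            if self.free["owner"] < nodes and self.borrowed > 0:
                reclaimed = min(nodes - self.free["owner"], self.borrowed)
                self.borrowed -= reclaimed          # preempt: requeue public jobs
                self.free["owner"] += reclaimed
            ok = self.free["owner"] >= nodes
            if ok:
                self.free["owner"] -= nodes
            return ok

    cluster = Cluster(owner_nodes=8, public_nodes=4)
    print(cluster.submit_public(10))  # True: 4 public + 6 borrowed owner nodes
    print(cluster.submit_owner(4))    # True: 2 borrowed nodes preempted and reclaimed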

Cloud-based Resources

Encourage researchers to take advantage of grant opportunities offered by cloud providers specifically aimed at research.

Continue to construct and refine a framework to transparently make cloud resources available to researchers.

Establish a sustained cloud-bursting budget so that jobs exceeding baseline on-premise resources can overflow into the cloud.
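
As one possible shape of such bursting, the sketch below uses the AWS boto3 SDK to launch and later release burst capacity; the AMI ID, instance type, and burst size are hypothetical placeholders, and AWS credentials and networking are assumed to already be configured:

    # Sketch: burst a batch of compute instances in AWS when on-premise
    # capacity is exhausted, then terminate them when the work drains.
    # AMI ID, instance type, and counts are hypothetical placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch extra compute capacity on demand.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder HPC node image
        InstanceType="c5.18xlarge",        # compute-optimized instance type
        MinCount=1,
        MaxCount=4,                        # burst size, capped by budget
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "purpose", "Value": "hpc-burst"}],
        }],
    )
    burst_ids = [i["InstanceId"] for i in resp["Instances"]]
    print("burst nodes launched:", burst_ids)

    # ... jobs run on the burst nodes ...

    # Release the capacity (and stop the spend) once the queue drains.
    ec2.terminate_instances(InstanceIds=burst_ids)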

Plan for use of regional, national, and commercial resources for those cases in which researchers' needs cannot be met by on-premise resources.

Research Pod Hosting

The current research pod hosting facility in GITC 4320 provides a 1 Gb/s uplink per pod; this should be upgraded to a 5 Gb/s uplink per pod.

Additional pod racks should be purchased, along with the required supporting electrical work.

Planning should anticipate eventually outgrowing the GITC 4320 server room and expanding this facility to another location.

Augmentation of the Academic and Research Computing Systems (ARCS) staff

ARCS staffing for research computing consists of 2 FTEs. Compared with most other R1 universities, NJIT research computing support is considerably understaffed, in both headcount and breadth of capability.

In order to adequately support research computing, the following additional positions are proposed.

Proposed Additional Computational Staffing

  • Computational scientist. This position works with researchers in areas such as algorithm choice, code optimization, parallel programming, data analysis, machine learning, porting and debugging code, and use of appropriate hardware resources.
  • System administrator. This position concentrates on all hardware-related aspects of HPC clusters.
  • Technical writer and documentation specialist. This position develops user documentation and provides user education and training workshops.

Cluster Management Software

The purchase of cluster management software may be a cost-effective method of enhancing the effectiveness of system administrators, and should be seriously considered.

Roadmap

  • Establish planning task force
  • Establish sustainable funding for:
    • Annual updates / maintenance
    • Expand / refresh, as warranted (typically every three years)
  • Short-term public-access research computing infrastructure, to replace the public-access capability being decommissioned in early January 2021
  • Medium- and long-term infrastructure to support research computing
    • Governance
      • Allocation of public resources: an HPC Resource Allocation Committee, composed of:
        • Researchers using HPC
        • Office of VP for Research
        • ARCS representative
        The committee decides the allocation of public HPC resources.