

ISTHPCPlanning

From NJIT-ARCS HPC Wiki


Executive Summary

This document provides a sustainable framework by which a high performance computing (HPC) infrastructure and support staff can meet the ongoing needs of NJIT researchers.

Motivation

To ensure continuation of NJIT's newly-gained Very High Research Activity (R1) status, we must recognize the rapidly increasing importance of research computing across all science and engineering disciplines.

NJIT lags far behind its peers in providing computational resources. This adversely affects current research programs, hampers attracting top new faculty and students, and jeopardizes R1 status.

NJIT must provide its growing body of researchers with state-of-the-art, professionally-managed, on-demand computing infrastructure and necessary support staff. These resources should be designed, expanded, and regularly refreshed to meet modern research demands. By establishing a regular budgeting and procurement model we can maintain hardware compatibility and maximize utility over the lifecycle of each investment.

Implementation Framework

We propose a flexible model that provides a range of computational services to support the diverse types of current and future research.

  • A baseline computational resource available to all researchers and their students, providing general access to the most common forms of computation in current demand. It will include high-speed processors, GPUs, a parallel file system, high-speed networks, and high-capacity storage to support big data research.
  • A Condominium model, whereby an NJIT-subsidized infrastructure would be established for researchers to purchase computational resources. IST will maintain a catalog of compatible resources that can be rapidly deployed. Computational resources for funded projects can be allocated immediately if available in the baseline resource, or ordered through normal procurement. The catalog will serve to reduce unnecessary variability and promote re-use of investment over the entire lifecycle of purchased equipment.
  • Cloud-based resources for short-term projects and bursting (AKA "Cloud Bursting"), as an adjunct to the on-premises infrastructure. Taking advantage of our strategic partnership with Amazon Web Services (AWS) and the high-speed direct connection provided by NJEDge, NJIT can utilize cloud resources transparently to researchers. AWS provides rapid access to state-of-the-art and evergreen computational resources on an elastic, on-demand rental basis. This means we can allocate resources for short-term projects immediately rather than waiting for a procurement cycle. AWS also provides access to practically unlimited storage capacity for large data sets.
  • Research Pods are an existing service which should be expanded. In situations where a researcher and their students/assistants need physical access to computational hardware, the research pod provides the best balance of self-service hosting and full control in a professionally managed facility. IST provides an access-controlled computer room with cooling, power, network, and locking racks. Researchers, with the assistance of IST staff, can install their equipment and maintain unimpeded network and physical access. Access is granted to research assistants and students by simple request. This service is currently available in the GITC 4320 server room.
  • Strategic partnerships are critical. NJIT is a member of XSEDE Campus Champions and the Eastern Regional Network (ERN). We will continue to seek additional public and private partnerships.
  • Augmentation of the Academic and Research Computing Systems staff by several positions.

Implementation Specifics

Computational Infrastructure

The computational infrastructure comprises two parts: CPU and GPU nodes and node interconnects for general use by all researchers, and the resources purchased by researchers for their own use.

The computing infrastructure for general use currently consists primarily of resources obtained by donation. These resources are over 10 years old, and are not capable of handling the research needs for which they are intended.

To remain competitive as an R1 research institution, both in providing computational resources for existing researchers and in attracting and retaining desirable new researchers, NJIT needs to make a significant and sustained investment in research computing infrastructure and staffing.

  1. Purchase a new cluster to serve as the general-access resource for all researchers and their students. This cluster is referred to as the baseline resource.
  2. Establish a new condominium model for researcher-purchased resources. Racks with both CPU and GPU nodes and appropriate node interconnects and switches will be purchased and deployed. These nodes will be available for researcher purchase. When a certain percentage of the nodes have been purchased, an additional rack populated with nodes will be purchased and deployed.
    • Included in this model is a parallel file system (PFS). Researchers can purchase their own storage in the PFS.
    • The scheduler and resource manager for the cluster, SLURM, is capable of dynamically allocating otherwise unused resources to jobs requesting those resources. Thus, investments made by the university in this infrastructure are inherently shareable among researchers.
  3. Plan for a major cluster replacement every five years, with the goal of state-of-the-art infrastructure at each replacement, and with smaller annual expansions in between.
  4. Plan for use of regional, national, and commercial resources for those cases in which researchers' needs cannot be met by on-premises resources.
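The expansion rule in item 2 (deploy another populated rack once a certain percentage of condominium nodes has been purchased) can be illustrated with a minimal Python sketch. The class name, the rack size, and the 75% threshold are hypothetical values chosen for illustration; this document does not specify them.

```python
from dataclasses import dataclass


@dataclass
class CondoPool:
    """Tracks researcher purchases against deployed condominium racks."""
    nodes_per_rack: int      # hypothetical rack size
    racks_deployed: int      # racks already purchased and deployed
    nodes_sold: int          # nodes purchased by researchers so far
    expand_threshold: float  # the "certain percentage", as a fraction (assumed)

    @property
    def capacity(self) -> int:
        """Total nodes available for researcher purchase."""
        return self.nodes_per_rack * self.racks_deployed

    def needs_new_rack(self) -> bool:
        """True when the sold fraction has reached the expansion threshold."""
        if self.capacity == 0:
            # Nothing deployed yet, so the first rack is always needed.
            return True
        return self.nodes_sold / self.capacity >= self.expand_threshold


if __name__ == "__main__":
    pool = CondoPool(nodes_per_rack=20, racks_deployed=1,
                     nodes_sold=15, expand_threshold=0.75)
    print(pool.needs_new_rack())
```

In practice the threshold would likely be set with procurement lead time in mind, so that a new rack arrives before the existing one sells out.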

Cloud-based Resources

Cloud computing concepts and resources are used by a number of existing research projects. We expect increased use of these resources. As applicable, we encourage researchers to take advantage of grant opportunities offered by providers specifically aimed at research.

We continue to construct and refine a framework to transparently make cloud resources available to researchers.

We should establish a sustained cloud-bursting budget to enable baseline resource jobs that require additional resources to run in the cloud.
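As a rough illustration of how a cloud-bursting budget might gate job placement, the following Python sketch sends a job to the cloud only when the baseline resource lacks free nodes and the remaining budget covers the estimated cost. The function name, the per-node-hour rate, and the simplification that a job runs entirely in one place are assumptions made for this example; they do not describe the actual framework or AWS pricing.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int    # nodes requested
    hours: float  # estimated run time in hours


def place_job(job: Job, free_nodes: int, budget_left: float,
              cost_per_node_hour: float) -> tuple[str, float]:
    """Return (placement, remaining_budget) for a single job.

    Placement is "baseline", "cloud", or "queue". All rates and budgets
    are hypothetical inputs supplied by the caller.
    """
    if job.nodes <= free_nodes:
        # The baseline resource can run the job, so no budget is spent.
        return "baseline", budget_left
    estimated_cost = job.nodes * job.hours * cost_per_node_hour
    if estimated_cost <= budget_left:
        # Burst to the cloud and charge the bursting budget.
        return "cloud", budget_left - estimated_cost
    # Neither option fits, so the job waits for baseline capacity.
    return "queue", budget_left


if __name__ == "__main__":
    job = Job(name="example", nodes=16, hours=2.0)
    print(place_job(job, free_nodes=8, budget_left=100.0,
                    cost_per_node_hour=1.5))
```

A sustained budget, as proposed above, would correspond to replenishing `budget_left` on a regular schedule rather than funding it once.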

Research Pod Hosting

The current research pod hosting facility in GITC 4320 provides reasonable network connectivity for most users, with a 1GB/s uplink per pod. We propose to upgrade this to a 5GB/s uplink per pod.

To complete the pod hosting concept, we will also make capital improvements to increase guest capacity by purchasing additional pod racks and carrying out the associated electrical work.

We should also plan for eventually outgrowing the GITC 4320 server room and expanding this facility to another location.

Augmentation of the Academic and Research Computing Systems (ARCS) staff

The ARCS staffing for research computing consists of 2 FTEs. In comparison with other R1 universities, NJIT is considerably understaffed in terms of both the number of staff and the range of capabilities.

In order to adequately support research computing, we propose the following additional staff.

Proposed Additional Computational Staffing

  • Computational scientist. This position works with researchers in areas such as algorithm choice, code optimization, parallel programming, data analysis, machine learning, porting and debugging code, and use of appropriate hardware resources.
  • System administrator. This position concentrates on all hardware-related aspects of HPC clusters.
  • Technical writer and documentation specialist. This position provides user documentation and user education and training workshops.