ForPNSFCCStar

CC* Form Structure

Preamble

This form gathers information for an NJIT proposal to NSF 21-528, Campus Cyberinfrastructure (CC*), Program Area 4, "Campus Computing and the Computing Continuum".

The NJIT proposal will seek funding for public-access GPU nodes, which will be part of the Lochness.njit.edu cluster. Program Area 4 supports awards of up to $400,000 for up to two years.

By participating, you will be providing data necessary for this proposal.

0. Participant info

0.1 What is your position? {button}

  • Faculty
    • Tenured
    • Tenure-track
    • Non-tenure-track
  • Academic research staff {text box}
  • Postdoc

0.1.1 What is your department? {dd menu}

Newark College of Engineering

  • Biomedical Engineering
  • Biological and Pharmaceutical Engineering
  • Department of Civil and Environmental Engineering
  • Electrical and Computer Engineering
  • Engineering Technology
  • Mechanical and Industrial Engineering

College of Science and Liberal Arts

  • Department of Aerospace Studies (AFROTC)
  • Department of Chemistry and Environmental Science
  • Department of Humanities
  • Department of Mathematical Sciences
  • Department of Physics
  • Federated Department of Biological Sciences
  • Federated Department of History
  • Rutgers/NJIT Theatre Arts Program

Ying Wu College of Computing

  • Department of Computer Science
  • Department of Informatics

Martin Tuchman School of Management

College of Architecture and Design

  • NJ School of Architecture
  • School of Art and Design

0.2 For approximately how long have you and/or your research group been using IST-managed HPC resources? {dd menu}

  • Never used
  • Less than 6 months
  • 6+ to 12 months
  • 1+ to 2 years
  • 2+ to 5 years
  • 5+ years
  • Don't know

0.3 What is the general classification of computations for which you and/or your research group use IST-managed HPC? {dd menu}

  • Bioinformatics
  • Bioinformatics, text analysis
  • Biophysics
  • Computational PDE
  • Computational biophysics
  • Computational chemistry
  • Computational fluid dynamics
  • Computational physics and chemistry
  • Condensed matter physics
  • Electromagnetism, Wave propagation
  • Granular science
  • Image forensics
  • Materials research
  • Monte Carlo
  • Neural networks, genetic algorithms
  • Software verification, static analysis
  • Statistical analysis
  • Steganalysis and image forensics
  • Transportation data analysis
  • Other {text box}

0.4 What is the specific description of the computations for which you and/or your research group use IST-managed HPC? {text box}

0.4.1 If response to 0.1 is "faculty/staff": Do you yourself use IST-managed HPC and/or BD resources, or are these resources used only by other members of your research group? {button yes/no}

If yes: proceed.

If no: Do you want to proceed anyway, or do you want to exit the survey now? {button proceed/exit}

0.5 Terminology: parallel and serial computations

The definitions of parallel and serial computing are technical, and the distinctions between the two are often unclear. For the purposes of this survey, one of whose goals is to determine how the existing infrastructure is used, refer to the following guidelines.

Definitions of parallel and serial computing

  • Parallel computations
    • Working definition: The application uses a set of independent cores that can work cooperatively on tasks at the same time in solving a problem
    • Common implementation platforms
      • Distributed memory clusters - e.g., Kong, Stheno
      • Shared memory machines - e.g., Kong "smp" queue, Gorgon, Cnrdp, Phi
      • Graphics processing units (GPUs) - present on Kong and Stheno only
      • The CPUs that are part of the GPU nodes
  • Serial computations
    • Working definition: The application can use only one core at a time, and processes tasks sequentially
    • Common implementation platforms
      • Distributed memory clusters - e.g., Kong, Stheno
      • Shared memory machines - e.g., Kong "smp" queue, Gorgon, Cnrdp, Phi
      • The CPUs that are part of the GPU nodes

Note that both parallel and serial computations can be done on both clusters and shared memory machines, including the CPUs that are part of the GPU nodes. The GPUs themselves are not suitable for serial processing.

Determining whether you are doing parallel or serial computation

  • You are doing parallel computation if your applications are using any of the following:
    • A message passing interface (MPI) protocol
    • OpenMP, an application programming interface (API) for multithreaded, shared memory parallelism
    • GPUs
  • You are doing serial computation if you are not doing parallel computation. If in doubt, you are probably doing serial computation. A minimal sketch contrasting the two follows.
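
The sketch below makes the distinction concrete by contrasting a serial loop with the same work divided across MPI ranks. It is a minimal illustration only, assuming the mpi4py package and an MPI launcher are available in your environment; it is not a statement about what is installed on the clusters.

    # Minimal illustration (assumes mpi4py and an MPI launcher, e.g. mpirun, are available).
    from mpi4py import MPI

    def serial_sum_of_squares(n=1000):
        # Serial: a single core processes every term sequentially.
        return sum(i * i for i in range(n))

    def parallel_sum_of_squares(n=1000):
        # Parallel: each MPI rank handles a strided share of the terms,
        # then the partial sums are combined across all ranks.
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        local = sum(i * i for i in range(rank, n, size))
        return comm.allreduce(local, op=MPI.SUM)

    if __name__ == "__main__":
        # Launched as, e.g., "mpirun -np 4 python sketch.py", this uses 4 cores.
        result = parallel_sum_of_squares()
        if MPI.COMM_WORLD.Get_rank() == 0:          # print once, from rank 0
            print(result, serial_sum_of_squares())  # both give the same answer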

0.6 For which type(s) of computations do you use IST-managed HPC resources? {multiple choice}

  • Parallel
  • Serial

Please select the resources on which you run parallel computation

  • Kong, including "smp" queue
  • Stheno
  • Cnrdp
  • Gorgon
  • Phi
  • GPU
  • Not doing parallel computation

1. HPC computational resources

1.1 You have previously indicated that you use HPC for parallel computations

  • Nodes are the discrete physical units in the HPC racks (typical dimensions: 26.5" wide, 36" deep, 1.75" high)
  • Nodes contain multiple CPUs
  • CPUs contain multiple cores; cores do the computations
  • Nodes are connected by a network internal to the cluster
    • Kong has a 1 Gigabit/second Ethernet (1GigE) node interconnect
    • Stheno has an InfiniBand node interconnect
    • Stheno's InfiniBand is 15 times faster than 1GigE

The following table, from the HPC and BD wiki, provides information on HPC hardware resources:

  • HPC specs extract image

1.1.1 Please indicate the adequacy of parallel resources for Kong, including "smp" queue {array}

  • Column headings: Adequate, Moderate increase needed, Large increase needed, Don't know
  • Rows
    • Number of cores
    • Number of nodes
    • CPU speed
    • Max RAM per node
    • Node interconnect (internal network) speed

1.1.1.1 You have indicated that at least one HPC parallel computational resource on Kong is inadequate.

Please rank the remediation options given below for addressing any inadequacies, independent of their number and type.

Resource provisioning terminology:

  • NJIT-provided HPC resources shared among users
  • User-purchased HPC resources at NJIT
  • NJIT-provided commercial off-premise resources shared among users
  • User-purchased commercial off-premise resources
  • Use publicly available HPC resources (e.g., at a national supercomputing center); a successful proposal by the researcher is required

Note: Off-premise providers include Amazon Web Services, Azure, Google Cloud Platform, IBM Bluemix, Oracle Cloud, Penguin Computing

{rank}

  • NJIT-provided HPC shared resources
  • User-purchased HPC resources at NJIT
  • NJIT-provided shared off-premise resources
  • User-purchased off-premise resources
  • Increase use of publicly available HPC resources

1.1.2 Please indicate the adequacy of parallel resources for Stheno, Cnrdp, Gorgon, Phi, GPU {array}

Stheno: same array as for Kong

Cnrdp: same array as for Kong minus "Node interconnect speed"

Gorgon: same array as for Kong minus "Node interconnect speed"

Phi: same array as for Kong minus "Node interconnect speed"

GPU: same array as for Kong minus "Node interconnect speed"

1.2 You have previously indicated that you use HPC for serial computations

Please select the resources on which you run serial computations

  • Kong, including "smp" queue
  • Stheno
  • Cnrdp
  • Gorgon
  • Phi
  • Not doing serial computation

  • Nodes are the discrete physical units in the HPC racks; typical dimensions: 26.5" wide, 36" deep, 1.75" high
  • Nodes contain multiple CPUs
  • CPUs contain multiple cores; cores do the computations
  • Nodes are connected by a network internal to the cluster
    • Kong has a 1 Gigabit/second Ethernet (1GigE) node interconnect
    • Stheno has an InfiniBand node interconnect
    • Stheno's InfiniBand is 15 times faster than 1GigE

The following table provides information on HPC hardware resources:

  • HPC specs extract image

1.2.2 Please indicate the adequacy of serial resources for Kong {array}

  • Column headings: Adequate, Moderate increase needed, Large increase needed, Don't know
  • Rows
    • Number of cores
    • Number of nodes
    • CPU speed
    • Max RAM per node

Please rank the remediation options given below for addressing any inadequacies, independent of their number and type.

Resource provisioning terminology: (Definition of terms in text box)

  • NJIT-provided HPC shared resources
  • User-purchased HPC resources at NJIT
  • NJIT-provided shared off-premise resources
  • User-purchased off-premise resources
  • Increase use of publicly available HPC resources

1.2.3 Please indicate the adequacy of serial resources for Stheno, Cnrdp, Gorgon, Phi {array}

Stheno: same array as for Kong

Cnrdp: same array as for Kong

Gorgon: same array as for Kong

Phi: same array as for Kong

1.3 Current use of non-NJIT HPC resources

1.3.1 Do you currently use any non-NJIT HPC computational resources?

If yes

1.3.2 Please list the non-NJIT HPC computational resources you currently use {text box}

1.3.3 Please select the reason(s) you are using non-NJIT HPC computational resources. The items below refer to NJIT resources. {multiple choice with comments}

  • Not enough CPU cores
  • CPU cores are too slow
  • Not enough RAM per core
  • For parallel computing, node interconnect speed is too slow
  • Not enough GPU cores
  • Read/write of temporary files is too slow
  • Storage space is inadequate
  • Required software is not available
  • Other - specify

1.3.4 Please provide any additional information on why you are using non-NJIT HPC computational resources {text box}

1.4 The following is a list of new processors that are being made available. Are any of them of interest to you for your research? {multiple choice with comments}

  • Intel Core i7 or i9 processor
  • Google Tensor Processing Unit (TPU)
  • Intel Nervana Neural Network Processor (NNP)
  • Intel Xeon Phi processor
  • AMD Epyc
  • Other
  • None


1.5 HPC documentation

1.5.1 Please indicate the adequacy of HPC computational resources documentation at the HPC and BD Wiki

  • Adequate
  • Moderately better needed
  • Much better needed
  • Don't know

1.5.2 You have indicated that the computational resources documentation is not adequate. Please suggest improvements. {text box}.

1.6 Please provide any comments on HPC computational resources. {text box}

2. HPC storage

Storage terminology:

  • AFS distributed filesystem: General computational use; accessible from all cluster nodes; accessible from all other HPC servers
  • NFS distributed filesystem: General computational use; separate filesystems accessible from Kong and Stheno; not accessible from other HPC servers

The base storage allocation for researchers is 500GB each of AFS and NFS space.

2.1 Please indicate the adequacy of base allocations {dual dd}

  • AFS storage
    • Adequate
    • Moderate increase needed
    • Large increase needed
    • Don't know
  • NFS storage
    • Adequate
    • Moderate increase needed
    • Large increase needed
    • Don't know

2.1.2 You have indicated that at least one HPC storage resource is inadequate

Please rank the remediation options given below for addressing any inadequacies, independent of their number and type.

Resource provisioning terminology: (Definition of terms in text box)

  • NJIT-provided HPC shared resources
  • User-purchased HPC resources at NJIT
  • NJIT-provided shared off-premise resources
  • User-purchased off-premise resources

2.2 Parallel file system

Background

  • The HPC cluster model uses hundreds of compute nodes, each containing several CPUs, each of which contains several processors (cores) that perform the calculations. The two NJIT HPC clusters, Kong (general-access) and Stheno (Department of Mathematical Sciences only), between them contain 3,448 cores. Jobs running on these cores write/read temporary files to/from disk as part of their computations.
  • When the temporary files are more than a few gigabytes in size, most of the time required for a computation is consumed by writing to and reading from temporary files on disk - i.e., disk I/O is the bottleneck. In addition, parallel jobs handling temporary files of any size will encounter this bottleneck.
  • The problem is exacerbated by the large increase in Kong's computational capacity and core count in summer 2015, which has produced a corresponding increase in the volume of temporary data that the compute nodes write to and read from disk.

Implication

  • Researchers at NJIT using HPC clusters are dealing with increasingly large sets of data, with the concomitant need for much higher I/O capacity for temporary space.

Parallel file system (PFS) appliance

    A PFS is a file system that distributes file data across multiple servers and provides for concurrent access by multiple tasks. A PFS can be used by both serial and parallel processes running on an HPC cluster. A PFS appliance can be connected to multiple clusters. PFS examples: IBM General Parallel File System (GPFS), Lustre
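
As a rough illustration of the concurrent access pattern a PFS is designed for, the sketch below has every MPI rank write its own block of a single shared scratch file at a distinct offset using MPI-IO. This is a hedged example only: it assumes mpi4py and numpy are available, the file name is a placeholder, and nothing here describes an actual NJIT configuration. On an ordinary NFS mount the same writes funnel through one file server, which is the I/O bottleneck described above; a PFS spreads them across multiple storage servers.

    # Hedged sketch: concurrent writes to one shared file via MPI-IO.
    # Assumes mpi4py and numpy; "scratch.dat" is a placeholder path.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank produces its own block of temporary data.
    block = np.full(1_000_000, rank, dtype=np.float64)

    # Each rank writes its block at a distinct byte offset of the same file,
    # so all ranks can write at the same time.
    fh = MPI.File.Open(comm, "scratch.dat", MPI.MODE_WRONLY | MPI.MODE_CREATE)
    fh.Write_at(rank * block.nbytes, block)
    fh.Close()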

2.2.1 Please indicate the importance for your research of having PFS capability in the HPC cluster(s) that you use {dd menu}

  • Very important
  • Moderately important
  • Not important
  • Don't know

2.2.2 You have indicated that a PFS would be either very or moderately important for your research. Please state the reason(s). {text box}

2.3 Cost of purchasing storage

2.3.1 Are you involved in decisions regarding the purchase of additional storage? {button yes/no}

If yes

Background:

Researchers can purchase storage in addition to the base allocation of 500GB each of AFS and NFS space.

2.3.2 Additional storage can be either Tier 1 (very high performance, suitable for high-speed transactional databases), or Tier 2 (high performance, suitable for most HPC applications). Backup choices are: daily, reduced frequency (two to three times per week), and no backup.

Please indicate if the costs listed below are suitable for your research needs

  • Tier 1, no backup: $870/TB/year
  • Tier 1, reduced frequency backup: $1010/TB/year
  • Tier 1, daily backup: $1160/TB/year
  • Tier 2, no backup: $250/TB/year
  • Tier 2, reduced frequency backup: $390/TB/year
  • Tier 2, daily backup: $540/TB/year
  • None suitable
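
As a worked example of the arithmetic behind these rates (illustrative only; the quantities are hypothetical), 2 TB of Tier 2 storage with daily backup would cost 2 x $540 = $1,080 per year:

    # Worked example using the $/TB/year rates listed above; quantities are hypothetical.
    RATES = {
        ("Tier 1", "no backup"): 870,
        ("Tier 1", "reduced frequency backup"): 1010,
        ("Tier 1", "daily backup"): 1160,
        ("Tier 2", "no backup"): 250,
        ("Tier 2", "reduced frequency backup"): 390,
        ("Tier 2", "daily backup"): 540,
    }

    def annual_cost_dollars(terabytes, tier, backup):
        # Annual charge in dollars for the given allocation.
        return terabytes * RATES[(tier, backup)]

    print(annual_cost_dollars(2, "Tier 2", "daily backup"))   # -> 1080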

2.3.3 If None

What is the maximum cost, in dollars per TB per year, at which you would purchase NJIT storage, including backup? {text box}

2.3.4 Currently, storage costs are paid annually. Is this satisfactory? {button yes/no}

If no

2.3.5 Please provide the reason that payment on a one-year (annual) basis is not satisfactory. {text box}

Over what period should NJIT storage costs be charged? {array}

  • Two year
  • Three year
  • Four year
  • Five year

2.4 Please indicate the importance of platform-independent access to HPC storage {dd menu}

(Definition of terms in text box) "Platform-independent access" means that file paths and authorization are independent of which platform - Linux, MacOSX, Windows - is being used to access files.

  • Very important
  • Moderately important
  • Not important
  • Don't know

2.5 Storage documentation

2.5.1 Please indicate the adequacy of HPC storage documentation at the HPC and BD wiki

  • Adequate
  • Somewhat better needed
  • Much better needed
  • Don't know

2.5.2 You have indicated that the HPC storage documentation is not adequate. Please suggest improvements. {text box}.

2.6 Please provide any comments on HPC storage {text box}

3. Big data computational and storage resources

The Hadoop/Spark infrastructure is a virtual environment based on VMware Big Data Extensions (BDE).

VMware introduced Big Data Extensions, or BDE, as a commercially supported version of Project Serengeti designed for enterprises seeking VMware support. BDE enables customers to run clustered, scale-out Hadoop applications on the vSphere platform, delivering all the benefits of virtualization to Hadoop users. BDE delivers operational simplicity with an easy-to-use interface, improved utilization through compute elasticity, and a scalable and flexible Big Data platform to satisfy changing business requirements. VMware has built BDE to support all major Hadoop distributions and associated Apache Hadoop projects such as Pig, Hive, and HBase.

The hardware (Horton.njit.edu) associated with BDE is as follows:

  • 2 x IBM iDataPlex dx360 M3 nodes, each with:
    • 2 x Intel Xeon E5-2680 CPUs (8 cores each)
    • 16 CPU cores @ 2.70GHz
    • 32 logical processors with hyperthreading
    • 128 GB RAM
  • 3 TB of HDFS (Hadoop Distributed File System) disk
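
For orientation, the sketch below shows the kind of job this Hadoop/Spark infrastructure is meant to run: a minimal PySpark word count that reads from and writes to HDFS. It is a hedged example, not a description of the Horton setup; the HDFS paths are placeholders, and it assumes Spark/PySpark is available to the user.

    # Hedged sketch: a minimal PySpark word count over HDFS.
    # Paths are placeholders; assumes Spark/PySpark is available on the cluster.
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("wordcount-sketch"))

    counts = (sc.textFile("hdfs:///user/example/input.txt")    # read input from HDFS
                .flatMap(lambda line: line.split())             # split lines into words
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))               # sum counts per word

    counts.saveAsTextFile("hdfs:///user/example/wordcounts")    # write results back to HDFS
    sc.stop()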

3.1 Please indicate the adequacy of big data computational and storage resources {array}

  • Column headings: Adequate, Moderate increase needed, Large increase needed, Don't know
  • Rows
    • Number of cores
    • Max RAM per node
    • Amount of HDFS storage

You have indicated that at least one big data computational and storage resource is inadequate.

Please rank the remediation options given below for addressing any inadequacies, independent of their number and type.

Resource provisioning terminology: (Definition of terms in text box)

  • NJIT-provided HPC shared resources
  • User-purchased HPC resources at NJIT
  • NJIT-provided shared off-premise resources
  • User-purchased off-premise resources

3.2 Cost of purchasing storage

3.2.1 Are you involved in decisions regarding the purchase of additional storage? {button yes/no}

If yes

Background:

Researchers can purchase storage in addition to the base allocation of 500GB each of AFS and NFS space.

3.2.2 Additional storage can be either Tier 1 (very high performance, suitable for high-speed transactional databases), or Tier 2 (high performance, suitable for most HPC applications). Backup choices are: daily, reduced frequency (two to three times per week), and no backup.

Please indicate if the costs listed below are suitable for your research needs

  • Tier 1, no backup: $870/TB/year
  • Tier 1, reduced frequency backup: $1010/TB/year
  • Tier 1, daily backup: $1160/TB/year
  • Tier 2, no backup: $250/TB/year
  • Tier 2, reduced frequency backup: $390/TB/year
  • Tier 2, daily backup: $540/TB/year
  • None suitable

3.2.3 If None

What is the maximum cost, in dollars per TB per year, at which you would purchase NJIT storage, including backup? {text box}

3.2.4 Currently, storage costs are paid annually. Is this satisfactory? {button yes/no}

If no

3.2.5 Please provide the reason that payment on a one-year (annual) basis is not satisfactory. {text box}

Over what period should NJIT storage costs be charged? {array}

  • Two year
  • Three year
  • Four year
  • Five year

3.3 Please provide any comments on big data computational and storage resources {text box}

4. Software environment

The software environment is the combination of applications (open source and commercial), libraries, utilities, and modules. Modules are used to set the user's environment for specific software; almost all HPC and BD software has an associated module. The following lists the modules currently available.

  • Images from the modules-available list (not current)

4.1 Please rate the suitability of the software environment for your work {dd menu}

  • Excellent
  • Good
  • Fair
  • Poor
  • Don't know

Please identify software not listed above that you wish to be available.

Does the software have any associated costs?

To what extent would you use the software for research and/or teaching?

Expected use choices:

  • Research - high, medium, low, no use
  • Teaching - high, medium, low, no use

Please provide any comments on the software environment {text box}

5. Internet bandwidth, Science DMZ

Internet bandwidth is the capacity of NJIT's connection to and from the Internet.
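
For scale, the sketch below computes how long a transfer of a given size takes at a given sustained bandwidth; for example, moving 1 TB at a sustained 1 Gb/s takes roughly 8,000 seconds, a bit over two hours. The numbers are illustrative only and say nothing about NJIT's actual link capacity.

    # Illustrative only: transfer time as a function of data size and sustained bandwidth.
    def transfer_time_hours(terabytes, gigabits_per_second):
        bits = terabytes * 8e12                        # 1 TB = 8 x 10^12 bits (decimal TB)
        seconds = bits / (gigabits_per_second * 1e9)   # sustained throughput in bits/second
        return seconds / 3600

    # Example: 1 TB at a sustained 1 Gb/s is about 2.2 hours.
    print(round(transfer_time_hours(1, 1), 1))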

5.1 Please rate the suitability of Internet bandwidth, including Internet2 if applicable, for your work {dd menu}

  • Excellent
  • Good
  • Fair
  • Poor
  • Don't know

Has your research been hampered by difficulties in transferring data from or to the Internet? {button yes/no}

If yes: Please describe in detail the nature of the difficulties in transferring data from or to the Internet {text box}

Please provide any comments on Internet bandwidth {text box}

5.2 Science DMZ

"Science DMZ" refers to a computer subnetwork that is structured to be secure, but without the performance limits that would otherwise result from passing data through a stateful firewall.

A Science DMZ is designed to handle the high-volume data transfers typical of scientific and high-performance computing.

It is typically deployed at or near the local network perimeter, and is optimized for a moderate number of high-speed flows rather than for general-purpose business systems or enterprise computing.

5.2.1 Please rate the desirability of implementing a Science DMZ at NJIT as it relates to your work

  • High
  • Moderate
  • Low
  • Don't know

Please provide any comments on Science DMZ {text box}

6. Consultation

Consultation is interaction with Academic and Research Computing Systems (ARCS) staff in areas such as getting started in HPC and BD, problems encountered when running jobs, optimizing throughput, running parallel jobs, managing disk space, and assistance in working with and managing big data.

6.1 Please rate the effectiveness of consultation in your work {dd menu}

  • Excellent
  • Good
  • Fair
  • Poor
  • Never consulted

Please provide any comments on consultation {text box}

7. End game

7.1 What is your and/or your research group's satisfaction with the use of IST-managed HPC and/or BD resources? {dd menu}

  • High
  • Medium
  • Low
  • Don't know

7.2 Please suggest up to five changes you would like to see in the NJIT HPC and/or BD environment. {text box}

7.3 Please provide any comments on your and/or your research group's use of IST-managed HPC and/or BD resources. {text box}

The final questions are about the device and browser that you used.

7.4 What device did you take this survey on? {dd menu}

  • Desktop computer
  • Laptop
  • Tablet
  • Smartphone

7.5 What browser did you use? {dd menu}

  • Chrome
  • Firefox
  • Internet Explorer / Edge
  • Safari
  • Other

7.6 Did you encounter any problems with visual presentation and/or navigating the survey? {button yes/no}

If yes

Please describe the visual presentation and/or navigation problem(s) you encountered in taking the survey {text box}