From HPC Wiki
| Num | System | What | Why | When | Comments |
|---|---|---|---|---|---|
| 1 | Kong | Migrate Bozzelli nodes from Sun X6220 to SuperMicro SS2016 | NJIT clusters have standardized on Intel processors; SS2016 are Intel, X6220 are AMD | Fall 2016 | Done |
| 2 | Kong | Retire the long-amd queue | This queue uses AMD processors, which are being retired in favor of Intel processors | Fall 2016 | Done |
| 3 | Kong | Provide debug queue, comprising 3 to 5 nodes, with maximum wall time of 15 minutes, and highest queue priority | Provides an environment for testing and debugging distinct from the production compute nodes | Fall 2016 | Done |
| 4 | Kong | OS upgrade to Scientific Linux 6.8 from 6.2 | OS is too old to support some needed builds on both CPUs and GPUs, and commercial applications | Winter 2016-2017 | - |
| 5 | Kong, Stheno | Automate prophylactic compute node reboots | Administration efficiency | Winter 2016-2017 | - |
| 6 | Kong, Stheno | Create PXE-booted virtualized login nodes for Kong and Stheno | Remove dependence on physical headnodes | Spring 2017 | - |
| 7 | Kong, Stheno | Virtualized compute node in virtual infrastructure | - | Spring 2017 | Tentative |
| 8 | Stheno | Virtualize headnode; deploy /home from virtualized headnode; possibly obtain InfiniBand interface from headnode to virtualized infrastructure | Administration efficiency, upgraded architecture | Spring 2017 | - |
| 9 | Stheno | OS upgrade to Scientific Linux 7.X | Needed for current gcc, needed builds, commercial applications | Spring 2017 | - |
| 10 | Kong | OS upgrade to Scientific Linux 7.X | Needed for current gcc, needed builds, commercial applications | Spring 2017 | - |
| 11 | Kong, Stheno | New HPC nodes catalog | Provide researchers standard supported hardware purchase options | Spring 2017 | - |
| 12 | Kong | Extend ipmi2do tool to work with Dell and iDataplex | Currently only supports Supermicro | Spring 2017 | - |
| 13 | Kong | Modify VNFS parameters and ipmi2do to redirect node console to serial-over-LAN (SOL) on boot | Administrative node access via IPMI | Spring 2017 | - |
| 14 | Kong, Stheno | Replace Son of Grid Engine (SGE) scheduler with Slurm [1] | Support for SGE has waned since Oracle purchased Sun Microsystems. Slurm is being actively developed and is in widespread use. | Spring-Summer 2017 | Requires extensive user education |
| 15 | Kong, Stheno | Deploy virtualized Warewulf (node provisioning) server | Administration efficiency, upgraded architecture | Spring-Summer 2017 | - |
| 16 | Kong, Stheno | Implementation of AUKS [2] | Like ksub, with improved functionality | Spring-Summer 2017 | - |
| 17 | Kong, Stheno | Automatic checkpointing of all checkpointable jobs | Jobs can be restarted after a system crash in the state they were in at the crash | Fall 2017 | Berkeley Lab Checkpoint/Restart (BLCR) [3] |
| 18 | Kong | Proof-of-concept InfiniBand-SMP using ScaleMP on test cluster (vapor) | Preparation for SMP on Kong | Fall 2017 | - |
| 19 | Kong | Very large RAM symmetric multiprocessing machine (SMP), using ScaleMP [4] to construct a virtual SMP (vSMP) using several nodes | Addresses some researchers' need for such a computational resource | Fall 2017 | - |
| 20 | Kong | Finalize plans for deployment of InfiniBand and 10GigE on switches, or the Open Compute Project Wedge 100 specification | Support for high-speed node interconnect on switches | TBD | - |
| 21 | Kong | Automated /nscratch/ purging process | Administration efficiency | TBD | - |
| 22 | Kong, Stheno | 1. Cross-mount Kong:/home to Stheno:/home and vice versa for DMS users with logins on both clusters; or 2. Mount same /home/?/$USERNAME from one cluster on both | Improve user experience | TBD | - |
| 23 | Kong, Stheno | Reconfigure queues so nodes of given type are usually in one rack, so that MPI message passing between racks is minimized | Increase computation efficiency for parallel jobs | TBD | - |
| 24 | Kong, Stheno | Burst computing into NJIT virtual infrastructure | Use VMs for short periods of high need that exceed physical machine capacity | TBD | Tentative |
| 25 | Kong, Stheno | Automatic node/rack powerdown during low utilization periods | Conserve power | TBD | Tentative |
| 26 | Kong | InfiniBand node interconnect | Current 1-gigabit Ethernet (1GigE) node interconnect is a bottleneck for many parallel jobs | Pending funding | - |
| 27 | Kong | 10GigE node interconnect | Current 1GigE node interconnect is a bottleneck for many parallel jobs | Pending funding | - |
| 28 | Kong, Stheno | Parallel file system appliance for storing temporary files | Large increase in read and write operations compared to current | Pending funding | - |
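
As an illustration of item 21 (automated /nscratch/ purging), the following is a minimal sketch of an age-based purge. It is not the planned implementation: the roadmap does not state a retention policy, so the age threshold, the use of modification time as the criterion, and the function names `find_stale_files` and `purge` are all assumptions made for the example. It defaults to a dry run so that nothing is deleted unless explicitly requested.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Return files under root whose mtime is older than max_age_days.

    Hypothetical helper: the age-based criterion is an assumption,
    not a policy taken from the roadmap.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    # os.walk does not follow symlinked directories by default
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # lstat judges a symlink by its own timestamp, not its target's
                if path.lstat().st_mtime < cutoff:
                    stale.append(path)
            except FileNotFoundError:
                # File vanished between listing and stat (e.g. a job removed it)
                continue
    return sorted(stale)


def purge(root, max_age_days, dry_run=True, now=None):
    """Delete stale files; with dry_run=True only report what would go."""
    stale = find_stale_files(root, max_age_days, now=now)
    if not dry_run:
        for path in stale:
            path.unlink(missing_ok=True)  # requires Python 3.8+
    return stale
```

A call such as `purge("/nscratch", 30)` would only list candidates; passing `dry_run=False` performs the deletion. The 30-day figure here is a placeholder, and a production process would likely also need user notification and exemptions for running jobs.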