
This site is deprecated and will be decommissioned shortly. For current information regarding HPC visit our new site: hpc.njit.edu

HPCBaselineAWS and HPCEnvironment

From NJIT-ARCS HPC Wiki
== Computational Cost ==

The purpose of this exercise is to provide approximate pricing for a baseline HPC resource hosted entirely in AWS.

[https://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-CA413A5A-E5B9-458E-BE72-F00F64AA9079 AWS pricing calculator]

(Note that the calculator link above includes pricing for support; support is not included in the totals quoted below or in the spreadsheet.)

AWS provides several instance pricing models, including <em>on-demand</em>, <em>reserved</em>, and <em>spot</em>.

For this exercise, reserved-instance pricing was chosen as the most appropriate. An HPC resource built with reserved instances most closely resembles an on-premise, always-available resource in both functionality and structure.

Costs for on-demand and spot instances can be highly variable and unpredictable. On-demand pricing can possibly be more cost-effective than reserved pricing. However, it is extremely difficult to predict the level of demand, even if accurate historical HPC usage is available:
 
<ul>
<li>Existing researchers' problem domains, sizes of models, and scales of analyses change</li>
<li>New researchers bring unknown needs for computational resources</li>
</ul>
  
On-demand instances are most cost-effective when historical HPC usage can be used to reliably predict future usage. This is not currently the case at NJIT, where the base level of HPC activity, augmented regularly by new researchers, is still changing significantly.

Spot instances can be terminated without warning, requiring workload checkpointing, which may be difficult to implement. This unpredictability would cause significant user frustration.
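As a rough illustration of what checkpointing a workload involves, the sketch below periodically saves its progress and resumes from the last save after an interruption. The file name, step count, and save interval are arbitrary placeholders, not part of any NJIT or AWS tooling.

<pre>
# Minimal checkpoint/restart sketch (illustration only; file name, work
# loop, and save interval are arbitrary, not part of any NJIT tooling).
import os
import pickle

CKPT = "state.ckpt"   # hypothetical checkpoint file
STEPS = 1_000_000

# Resume if an earlier run was cut short, e.g. by a spot reclamation.
if os.path.exists(CKPT):
    with open(CKPT, "rb") as f:
        step, total = pickle.load(f)
else:
    step, total = 0, 0.0

while step < STEPS:
    total += step * 1e-6          # stand-in for one unit of real work
    step += 1
    if step % 10_000 == 0:        # persist progress periodically
        with open(CKPT, "wb") as f:
            pickle.dump((step, total), f)

print(f"done: total={total:,.2f}")
os.remove(CKPT)                   # clean up after a successful run
</pre>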
The instances in this exercise were chosen to most closely resemble the on-premise HPC Baseline Resource:

Compute Nodes : 25<br />
Total Cores : 1,800<br />
Total RAM : 12.8 TB<br />
<br />
GPU Nodes : 5<br />
Total GPUs : 40<br />
Total Cores : 320<br />
Total RAM : 2.4 TB

Total cost for compute and GPU instances : <strong>$2,500,571.75 for 3 years</strong>

[https://docs.google.com/spreadsheets/d/1COS-qYL7FHgI_So1tNPrV7hHC4MPZ6bMG3GU6UbyrwI/edit?ts=5bf460dd#gid=0 AWS pricing Gsheet]
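For reference, the per-node shape implied by these totals can be derived directly; the sketch below just does that arithmetic. It is an illustration only; the actual instance-type mapping lives in the calculator and Gsheet linked above.

<pre>
# Per-node shape implied by the totals above (arithmetic illustration only;
# the actual instance-type mapping is in the calculator and Gsheet links).
compute = {"nodes": 25, "cores": 1800, "ram_tb": 12.8}
gpu     = {"nodes": 5, "gpus": 40, "cores": 320, "ram_tb": 2.4}

print(compute["cores"] / compute["nodes"],            # 72 cores per node
      compute["ram_tb"] * 1024 / compute["nodes"])    # ~524 GiB per node
print(gpu["gpus"] / gpu["nodes"],                     # 8 GPUs per node
      gpu["cores"] / gpu["nodes"],                    # 64 cores per node
      gpu["ram_tb"] * 1024 / gpu["nodes"])            # ~492 GiB per node
</pre>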
== Parallel File System Cost ==

The on-premise HPC Baseline Resource includes a 1-PB IBM Spectrum Scale (formerly GPFS) parallel file system (PFS).

Spectrum Scale licensing is not available on the AWS pricing calculator. Instead, 1 PB of storage was priced as 100 TB of EBS storage plus 900 TB of S3 storage.

Storage cost, 3 years : <strong>$699,738.84</strong>
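A minimal sketch of how such a blended estimate is assembled, multiplying each storage tier by a per-GB-month rate over the 36-month term. The rates below are assumed placeholders, not the figures behind the $699,738.84 quote; substitute current AWS list prices for the target region.

<pre>
# Assembling a blended EBS + S3 estimate for 1 PB over 3 years.
# NOTE: both rates are assumed placeholders, not actual AWS prices.
EBS_RATE = 0.10    # $/GB-month, placeholder
S3_RATE  = 0.021   # $/GB-month, placeholder

ebs_gb = 100 * 1000   # 100 TB of EBS
s3_gb  = 900 * 1000   # 900 TB of S3
months = 36           # three-year costing period

total = (ebs_gb * EBS_RATE + s3_gb * S3_RATE) * months
print(f"Estimated 3-year storage cost: ${total:,.2f}")
</pre>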
  
== Total Cost for Hosting the HPC Baseline Resource at AWS for Three Years ==

$2,500,571.75 (computational) + $699,738.84 (storage) : <strong>$3,200,310.59</strong>

Three years, the period used in this example, is a commonly used time frame for costing cloud services. The cost varies linearly with the length of time the service is used.
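The overall arithmetic, with the linear scaling made explicit. The two component figures are from the text above; the five-year figure is just an example of the scaling.

<pre>
COMPUTE_3YR = 2_500_571.75   # compute + GPU reserved instances
STORAGE_3YR = 699_738.84     # 100 TB EBS + 900 TB S3

total_3yr = COMPUTE_3YR + STORAGE_3YR   # $3,200,310.59

def cost_for_years(years):
    """Scale the 3-year total linearly, per the statement above."""
    return total_3yr * years / 3

print(f"3-year total: ${total_3yr:,.2f}")
print(f"5-year estimate: ${cost_for_years(5):,.2f}")
</pre>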
----


== Research computing environment overview ==

<table width="100%" style="border:1px solid black; border-collapse:collapse;">
<caption>HPC Environment Overview</caption>
<tr bgcolor="#dddddd">
  <th style="border:1px solid black;">Category</th>
  <th style="border:1px solid black;">Sub-category</th>
  <th style="border:1px solid black;">General<br />Access</th>
  <th style="border:1px solid black;">Node Age [1]<br />(years: %)</th>
  <th style="border:1px solid black;">DMS-only<br />Access [2]</th>
  <th style="border:1px solid black;">Node Age [1]<br />(years: %)</th>
  <th style="border:1px solid black;">Private<br />Access [3]</th>
  <th style="border:1px solid black;">Node Age [1]<br />(years: %)</th>
  <th style="border:1px solid black;">Notes</th>
</tr>
<tr>
  <td style="border:1px solid black;">CPU</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
</tr>
<tr>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">Nodes</td>
  <td style="border:1px solid black;">240</td>
  <td style="border:1px solid black;">>10: <<font color="red">7</font><br />3-10: <<font color="red">90</font><br />0-3: <font color="green">3</font></td>
  <td style="border:1px solid black;">31</td>
  <td style="border:1px solid black;">>10: <font color="red">0</font><br />3-10: <font color="red">100</font><br />0-3: <font color="red">0</font></td>
  <td style="border:1px solid black;">21</td>
  <td style="border:1px solid black;">>10: <font color="red">0</font><br />3-10: <font color="red">34</font><br />0-3: <font color="green">76</font></td>
  <td style="border:1px solid black;">About 1000 Kong cores are permanently out of service due to hardware failure</td>
</tr>
<tr>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">Cores</td>
  <td style="border:1px solid black;">1,896</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">380</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">168</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
</tr>
<tr>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">RAM, TB</td>
  <td style="border:1px solid black;">10.5</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">3.6</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">4.7</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
</tr>
<tr>
  <td style="border:1px solid black;">CPU with GPU</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
</tr>
<tr>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">Nodes</td>
  <td style="border:1px solid black;">2</td>
  <td style="border:1px solid black;">>10: <font color="red">0</font><br />3-10: <font color="red">100</font><br />0-3: <font color="green">0</font></td>
  <td style="border:1px solid black;">2</td>
  <td style="border:1px solid black;">>10: <font color="red">0</font><br />3-10: <font color="red">100</font><br />0-3: <font color="green">0</font></td>
  <td style="border:1px solid black;">8</td>
  <td style="border:1px solid black;">>10: <font color="red">0</font><br />3-10: <font color="red">0</font><br />0-3: <font color="green">100</font></td>
  <td style="border:1px solid black;"></td>
</tr>
<tr>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">GPU Cores</td>
  <td style="border:1px solid black;">10,752</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">15,320</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">64,512</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
</tr>
<tr>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">CPU Cores</td>
  <td style="border:1px solid black;">40</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">44</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">10</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
</tr>
<tr>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">RAM, TB</td>
  <td style="border:1px solid black;">0.25</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">0.26</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">2.0</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
</tr>
<tr>
  <td style="border:1px solid black;">Node interconnect</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">13 of 10 Gb/sec, 12 of 56 Gb/sec; rest 1 Gb/sec</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">40 Gb/sec</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">12 of 56 Gb/sec; rest 10 Gb/sec</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;"></td>
</tr>
<tr>
  <td style="border:1px solid black;">Parallel file system<br />(PFS)</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">None</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">None</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">None</td>
  <td style="border:1px solid black;"></td>
  <td style="border:1px solid black;">Traditionally used for temporary files, PFSs are now used for all kinds of storage. Researcher requests for storage are routinely 10 to 20 TB, compared to 50 to 100 GB two to three years ago</td>
</tr>
</table>
<p></p>
<strong>Legend</strong><br />
[1] :
<ul>
<li>Applies also to Cores and RAM</li>
<li><font color="red">Red numbers</font> mean that the nodes are <strong>Out of Warranty</strong></li>
<li><font color="green">Green numbers</font> mean that the nodes are <strong>In Warranty</strong></li>
</ul>

[2] : "DMS" refers to the Department of Mathematical Sciences. DMS owns the <em>Stheno</em> cluster.

[3] : "Private Access" refers to cluster hardware purchased by individual researchers. That hardware is dedicated to those researchers.

== High-speed node interconnect and parallel file system ==

[[ IBandPFS | Roles of internal Network and PFS ]]

== HPC cluster storage ==

[[ ClusterStorage | Storage accessible to HPC clusters ]]