
This site is deprecated and will be decommissioned shortly. For current information regarding HPC visit our new site: hpc.njit.edu

Difference between pages "GITC4320" and "FAQ"

From NJIT-ARCS HPC Wiki
GITC 4320 Data Center

Purpose

The purpose of the GITC 4320 data center is to host physical machines purchased by researchers that meet certain criteria, and are managed by the researcher.

Self-managed Machines

The owners of self-managed machines are responsible for all aspects of the management of those machines.

Justification

There are cases where NJIT's HPC or virtual infrastructure is not suitable for the computational needs of researchers, for a variety of reasons. In such cases, "bare metal" hardware, or a "physical machine", is needed.

The data center provides:

  • Power, including UPS
  • Data center-grade networking
  • HVAC
  • Racks/pods
  • Enterprise-level backups
  • Physical and network security
  • Self-service, on-demand access
      • Researchers can install computational resources and bring them on-line without coordinating with CST or waiting for DNS assignment

Racks/Pods

The basic storage unit is a pod.

  • Pod dimensions: 39"H x 24"W x 43"D
  • 2 pods per 78"H rack
  • Each pod has its own power, power distribution unit (PDU), and networking
  • Lockable
      • Provides physical security for sensitive data (e.g., medical, HIPAA)
      • Such security is not available in the GITC 5302 data center
  • Dedicated to a single researcher, or shared by a group of researchers
  • Room security and locking racks suitable for research involving sensitive data
  • Accessible from front and back

Research groups are allocated one or more pods for their exclusive use.

Wherever possible, researcher equipment located in pods should be rack-mountable.

Physical Access

  • GITC 4320 is locked and alarmed
  • Faculty and staff members of research groups are given card access to GITC 4320
  • Student members of research groups are not given card access
  • Students in research groups allowed entry into GITC 4320 by faculty/staff must be accompanied at all times by that faculty/staff person
  • ARCS staff are generally not available to accompany students when faculty/staff are not available. For emergency cases, contact arcs@njit.edu to see what, if any, arrangements can be made

Network Access

All machines are accessible from the NJIT network (includes VPN).

No machines are accessible from outside the NJIT network. This means, for example, that machines in GITC 4320 cannot act as a web server that is accessible from the Internet.

If it is desirable to have a web server that is open to the Internet serve data produced by a machine in GITC 4320, that data can be stored in AFS, where it is accessible from a web server running on a virtual machine in the GITC 5302 data center.

Agreement to Conditions

Researchers wishing to locate their machines in the GITC 4320 data center are required to accept the conditions stated on this page, via email to arcs@njit.edu.

FAQ

Revision as of 14:19, 4 January 2021

Accessing AFS Space

How can I access my AFS space from HPC cluster compute nodes ?

Jobs submitted by the SGE scheduler to compute nodes have to obtain the user's AFS token in order to access space in AFS that the user has access to. The current method for doing this is via "ksub". See UsingKsub.

Just before running ksub, be sure you have your Kerberos ticket and AFS token by :

kinit && aklog

An improved method for doing what ksub does, "auks", is in the process of being implemented.
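
As a hypothetical sequence only (the actual ksub usage is described in UsingKsub; the script name is a placeholder):

 kinit && aklog             # get Kerberos ticket and AFS token
 ksub my_submit_script.sh   # hypothetical invocation; see UsingKsub for the actual arguments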

Compilers

What compilers are available ?

In addition to the compilers listed by "module av", the GNU compilers that are part of the standard operating system installation are available.

See SoftwareModulesAvailable

Compiling CUDA code

How do I compile CUDA programs ?

  1. Get an interactive login on a GPU node, e.g., node151 or node152
    qlogin node151
  2. Load gcc and CUDA modules, e.g.,
    module load gcc/5.4.0 cuda
  3. Compile your code (see the example below)
  4. Log out of the GPU node
  5. Submit your job to the gpu queue using a submit script
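
For step 3, a minimal compile command might look like the following (the file and program names are placeholders):

 nvcc -O2 -o myprog myprog.cu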

See RunningCUDASamplesOnKong and KongQueuesTable

Conserving disk space

What are the best ways of conserving local space on the HPC clusters ?

  1. Remove unnecessary files
  2. If you have a research directory located at /afs/cad/research/.., move (and better yet, compress) files into a sub-directory of that directory (see the sketch below). This not only frees up space on the cluster, but keeps the user's results available even after their cluster account is removed.
  3. Compress files with bzip2 or gzip. bzip2 is usually more effective
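
A sketch of item 2, assuming a hypothetical research sub-directory path and a results directory in your cluster home directory:

 tar cjf results.tar.bz2 ~/results                  # compress with bzip2
 mv results.tar.bz2 /afs/cad/research/yourgroup/    # hypothetical research sub-directory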

"Error message: vsu_ClientInit: Could not get afs tokens, running unauthenticated" from qsub

I get this message when running qsub. Is it something I should be concerned about ?

This message indicates that you do not have your AFS token. The message can be safely ignored when running qsub.

Error message from a node

I get this message repeatedly in a terminal window when logged into Kong. Is it something I should be concerned about ?

This message indicates that there is a problem with a certain node. The message can be safely ignored, unless your job was using that node. The ARCS staff will take the offending node out of service as soon as possible.

Message from syslogd@nodeXXX at timestamp
 kernel: Code: 48 8d 45 d0 4c 89 4d f8 c7 45 b0 10 00 00 00 48 89 45 c0 e8 38 ff ff ff c9 c3 90 90 90 90 90 90 44 8d 46 3f 85 f6 55 44
 0f 49 c6 <31> d2 48 89 e5 41 c1 f8 06 45 85 c0 7e 24 48 83 3f 00 48 89 f8 

Getting access to HPC clusters

Who is eligible ?

All NJIT researchers; courses using HPC. See UserAccess.

Getting AFS token using qlogin

How do I get my AFS token when using qlogin ?

After qlogin :

	kinit && aklog && tokens

kinit gets your Kerberos ticket when you supply the correct password

aklog gets your AFS token from your Kerberos ticket

tokens displays your AFS token status

Kerberos ticket and AFS token status message

What does the "Your Kerberos ticket and AFS token status" message mean ?

A message such as the following is informational, telling the user :

  • Until what date their Kerberos ticket can be renewed
  • When their AFS token will expire, unless refreshed
=== === === Your Kerberos ticket and AFS token status === === ===
Kerberos :  Renew until 02/25/17 07:36:21, Flags: FRI
AFS      :  User's (AFS ID 22964) tokens for afs@cad.njit.edu [Expires Jan 31 17:36]

Getting cluster resource usage history

How can I view the historical usage of resources on the HPC clusters ?

Use Ganglia

Getting files into your local directory on kong or stheno

How do I get files from outside kong or stheno into my local directory on those machines ?

The easiest way to do this is to copy the files from AFS.

  • Since kong and stheno are AFS clients, a user logged into kong or stheno can copy files from any location in AFS to which the user has read access into their local kong or stheno directory (see the example below).
  • Conversely, a user can copy files from their local kong or stheno directory to any location in AFS to which the user has write access - e.g., a research directory or AFS home directory.
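
For example (the AFS paths are hypothetical):

 cp -r /afs/cad/research/yourgroup/dataset ~/dataset    # AFS -> local kong/stheno directory
 cp -r ~/results /afs/cad/research/yourgroup/           # local directory -> AFS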

Archiving data

How do I archive data from local storage on a cluster or AFS ?

  1. Use rclone: https://rclone.org/ (see the sketch below)
  2. rsync to a local disk - contact arcs@njit.edu for assistance
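
A sketch of an rclone copy, assuming a remote named "remote" has already been configured with "rclone config" (the remote name and paths are hypothetical):

 rclone copy ~/results remote:archive/results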

Getting Job Status

How can I get the status of jobs I have submitted to SGE ?

Use "qtsat" on a head node in various formats. See SonOfGridEngine.

How can I see an overall summary of queue activity ?

Use "qsummary" on a head node. For usage : "qsummary -h". See SonOfGridEngine.

Getting local files onto or accessible from an HPC cluster

How do I get files that are local to my computer onto or accessible from the kong or stheno clusters ?

Programs running on compute nodes can access files that are :

  1. in the user's local cluster home directory
  2. in an AFS directory the user has access to

1. Getting files from your local computer to your local cluster home directory

  1. make your local computer an AFS client; contact arcs@njit.edu for help with this
  2. copy the relevant files from your local computer into AFS space that you have write access to, usually a research directory
  3. log in to the cluster headnode; make sure you have your AFS token, via tokens; use "kinit && aklog" if you don't have a token
  4. use tar or cp to get the files in AFS into your local cluster home directory (see the sketch below). Contact arcs@njit.edu if you need help
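
A sketch of step 4, assuming the files were copied into a hypothetical research directory in step 2:

 cd /afs/cad/research/yourgroup && tar cf - input_data | (cd ~ && tar xf -)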

2. Accessing your files in AFS directly from compute nodes

  • Use ksub. See UsingKsub.

Languages

What languages are available ?

In addition to the languages listed by "module av", the languages that are part of the standard operating system installation are available - e.g., /usr/bin/perl, /usr/bin/python. See SoftwareModulesAvailable.

Performance, local machine vs. cluster

Why does an application run faster on my own computer than it does on the HPC cluster ?

This can happen for a variety of reasons, including :

  • your local computer has a faster CPU and more RAM than the cluster nodes
  • your job on the cluster shares the node's resources with other jobs
  • the disk I/O on your local computer is faster than on the cluster

It should be noted that even in such cases users can get significantly better throughput using a cluster by :

  • running many serial jobs simultaneously (e.g., as a job array; see the sketch below)
  • running jobs in parallel
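
A sketch of the job-array approach, as it might appear in an SGE submit script (the task range and command path are placeholders):

 #$ -t 1-100                              # run 100 independent tasks
 /full/path/to/cmd input.$SGE_TASK_ID     # each task reads its own input file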

Program runs on head node but not on compute nodes

My program runs OK when I run it on the head node, but produces different results, or doesn't run at all, on compute nodes. Why ?

There remain differences - scheduled to be remedied - between the head node and compute nodes. Code compiled on the head node may not run the same way on compute nodes. Fix : use qlogin to log in to an arbitrary compute node, and compile your code on that node.
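
For example (the module version is one already installed; the file names are placeholders):

 qlogin
 module load gcc/5.4.0
 gcc -O2 -o myprog myprog.c
 exit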

Resources for big data analysis

What resources exist for big data analysis ?

A Hadoop cluster for research and teaching came on-line 09 Dec 2015. Documentation on its use is being developed.

Resources writeups

What writeups are available that describe NJIT's high performance computing (HPC) and big data (BD) resources ?

IST support for researchers ISTResearcherSupport

Short Overview ShortOverview

Very Short Overview VeryShortOverview

Detailed hardware specs Cluster specs

Running Matlab

How do I specify a queue for MATLAB jobs ?

A queue can be specified within a MATLAB input file, e.g. :

ClusterInfo.setQueue ('medium')

Is there support for MATLAB Distributed Computing Server (MDCS) on the HPC clusters ?

Yes. See GettingStartedWithSerialAndParallelMATLABOnKongAndStheno.

Software

What software is available on the HPC clusters ?

The simplest way to see what software is available is to enter :

	module available

This will produce a list of most - but not all - available software.

You can refine your search to list specific software. E.g., to list all of the GCC compilers available :

	module available gcc

Recent list of modules SoftwareModulesAvailable

The "module available" command can be used on the kong and stheno head nodes, and on all public-access AFS Linux clients. The shortest form of this command is "module av".

What is the "module" Command ?

This command is used to set the environment variables for the software you want to use. See Environment Modules, and "man module".

How do I use the "module load <software>" command ?

Generally, this command is placed in your submit script, as in :

module load matlab

How do I request software that is not currently available ?

This is done by sending mail to arcs@njit.edu. The software must meet these criteria :

  • requested by faculty or staff
  • free
  • compatible with the current Linux version on the HPC clusters
  • used for research and/or courses

In general, such software will be installed in AFS, and will be accessible to all AFS Linux clients.

If the software is not free, a funding source will need to be identified.

The Stheno GCC version is 4.1.2. Can a newer version be installed ?

The problem is that the Stheno operating system, Scientific Linux SL release 5.5, is too old.

Although several newer versions of gcc are installed in AFS :


   "module avail gcc" gives :
   
   ------------------------------- /afs/cad.njit.edu/ucs/modulefiles
   gcc/4.8.1 gcc/4.8.2 gcc/4.9.2 gcc/5.2.0 gcc/5.3.0 gcc/5.4.0 gcc/6.1.0
 

none of them will load on Stheno, due to its operating system.

See Roadmap for scheduled Stheno operating system upgrade.

The operating system on Kong.njit.edu, Scientific Linux release 6.2, does support the gcc installations in AFS.

Users can get a login on Kong upon request to arcs@njit.edu from a faculty member.

Submit script behaves strangely

The statements in my submit script look correct, but it appears that the script is not being read correctly. Why is that ?

This behavior can happen if your submit script contains control characters that confuse the SGE job scheduler. The most common cause of the problem is the presence of Windows/DOS line-ending (carriage return) characters in the submit script. One way to remove these characters is via the Linux dos2unix command. See "man dos2unix".
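
For example, to convert a submit script in place:

 dos2unix my_submit_script.sh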

Submitting jobs

How are jobs submitted to the HPC cluster compute nodes ?

Jobs must be submitted using a "submit script". See SonOfGridEngine.
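
A minimal sketch of a submit script (the job name, queue, and command path are placeholders; see SonOfGridEngine and KongQueuesTable for the full set of options):

 #!/bin/bash
 #$ -N myjob       # job name
 #$ -q short       # queue to run in
 #$ -cwd           # run from the directory the job was submitted from
 /full/path/to/cmd and arguments

The script would then be submitted with "qsub myjob.sh".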

What are the valid queue names ?

Valid queue names include "short", "medium", and "long", among others. See KongQueuesTable and SthenoQueuesTable.

Can I run jobs on a head node ?

Jobs run on an HPC cluster head node that use non-negligible resources will be automatically terminated, with email to that effect sent to the owner of the job.

How can I run an interactive job on an HPC cluster compute node ?

Use "qlogin" on a head node. See KongQueues.

Testing if binaries will run

I have binaries, but no source code. How can I tell if the binaries should run on an HPC cluster ?

You can use the library.check (/usr/ucs/bin/library.check) utility to check for missing libraries and/or GLIBC_* versions. For usage : "library.check -h".

Using GPUs

What graphics processing units (GPUs) are available in the HPC clusters ?

See Cluster specs

How do I use those GPUs ?

See RunningCUDASamplesOnKong, MatlabGPUOnStheno

Using GPUs in parallel on Kong

What is the maximum number of GPUs that can be used in parallel on Kong ?

As of April 2016, there are 2 GPU nodes, each with 2 GPUs, so the maximum number of GPUs that can be used in parallel is 4.

Using local scratch for large writes and reads

How do I use the large scratch space local to each compute node to significantly improve throughput ?

In your submit script, put something like the following :

 ######################################
 # copy files needed to scratch
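 # (replace "ucid" in the paths below with your UCID)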
 
 mkdir -p /scratch/ucid/work
 cp <all needed files> /scratch/ucid/work
 cd /scratch/ucid/work
 #######################################
 # Run your program 
 
 /full/path/to/cmd and arguments
 ########################################
 # Copy results to local home directory

 cp <all results and needed files home> ~/results_directory
 #########################################
 # Delete scratch directory
 
 rm -rf /scratch/ucid

Using multithreading

My code uses multithreading. How do I tell the submit script to do multithreading ?

To use multithreading you need to use the threaded parallel environment. In the submit script :

#$ -pe threaded NUMBER_OF_CORES

NUMBER_OF_CORES should not exceed 8 for the short, medium or long queues.

For the smp queue, up to 32 cores can be used. In the submit script :

#$ -pe threaded 32
#$ -q smp