Accessing AFS Space
How can I access my AFS space from HPC cluster compute nodes ?
Jobs submitted by the SGE scheduler to compute nodes must obtain the user's AFS token in order to access AFS space that the user has access to. The current method for doing this is "ksub". See UsingKsub.
Just before running ksub, make sure you have your Kerberos ticket and AFS token by running :
kinit && aklog
An improved method for doing what ksub does, "auks", is in the process of being implemented.
What compilers are available ?
In addition to the compilers listed by "module av", the GNU compilers that are part of the standard operating system installation are available.
Compiling CUDA code
How do I compile CUDA programs ?
- Get an interactive login on a GPU node, e.g., node151 or node152
- Load the gcc and CUDA modules, e.g.,
module load gcc/5.4.0 cuda
- Compile your code
- Log out of the GPU node
- Submit your job to the gpu queue using a submit script
Conserving disk space
What are the best ways of conserving local space on the HPC clusters ?
- Remove unnecessary files
- If you have a research directory located at /afs/cad/research/.., move (and better yet compress) files into a sub-directory of that directory. This not only frees up space on the cluster, but also keeps your results available even after your cluster account is removed.
- Compress files with bzip2 or gzip. bzip2 usually produces smaller files, though it is slower than gzip
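The compression step can be sketched as a small shell function. This is a minimal sketch, not a site-provided tool: archive_dir is a hypothetical helper name, and the source and destination paths passed to it are placeholders for your own directories. It uses bzip2 when it is installed and falls back to gzip otherwise.

```shell
#!/bin/sh
# archive_dir is a hypothetical helper (not a cluster utility).
# Usage: archive_dir <directory to archive> <archive path without extension>
# Prints the name of the archive it created.
archive_dir() {
    src_dir=$1
    out_base=$2
    parent=$(dirname "$src_dir")
    name=$(basename "$src_dir")
    if command -v bzip2 >/dev/null 2>&1; then
        # -j : compress the tar stream with bzip2
        archive="$out_base.tar.bz2"
        tar -cjf "$archive" -C "$parent" "$name" || return 1
    else
        # -z : compress the tar stream with gzip
        archive="$out_base.tar.gz"
        tar -czf "$archive" -C "$parent" "$name" || return 1
    fi
    echo "$archive"
}
```

Check the archive with "tar -tf" before removing the original files.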
"Error message: vsu_ClientInit: Could not get afs tokens, running unauthenticated" from qsub
I get this message when running qsub. Is it something I should be concerned about ?
This message indicates that you do not have your AFS token. The message can be safely ignored when running qsub.
Error message from a node
I get this message repeatedly in a terminal window when logged into Kong. Is it something I should be concerned about ?
This message indicates that there is a problem with a certain node. The message can be safely ignored, unless your job was using that node. The ARCS staff will take the offending node out of service as soon as possible.
Message from syslogd@nodeXXX at timestamp kernel: Code: 48 8d 45 d0 4c 89 4d f8 c7 45 b0 10 00 00 00 48 89 45 c0 e8 38 ff ff ff c9 c3 90 90 90 90 90 90 44 8d 46 3f 85 f6 55 44 0f 49 c6 <31> d2 48 89 e5 41 c1 f8 06 45 85 c0 7e 24 48 83 3f 00 48 89 f8
Getting access to HPC clusters
Who is eligible ?
All NJIT researchers; courses using HPC. See UserAccess.
Getting AFS token using qlogin
How do I get my AFS token when using qlogin ?
After qlogin :
kinit && aklog && tokens
kinit gets your Kerberos ticket when you supply the correct password
aklog gets your AFS token from your Kerberos ticket
tokens displays your AFS token status
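A script can test for a token before doing AFS work. The following is a minimal sketch: token_line_present is a hypothetical helper name, and it assumes that "tokens" prints a line of the form "tokens for ... [Expires ...]" when a token is held, as in the status message shown elsewhere on this page.

```shell
#!/bin/sh
# token_line_present is a hypothetical helper: it reads the output of
# "tokens" on stdin and succeeds if a token line is present.
token_line_present() {
    grep -q 'tokens for .*\[Expires'
}

# Only run the check where the AFS "tokens" command exists.
if command -v tokens >/dev/null 2>&1; then
    if tokens | token_line_present; then
        echo "AFS token found"
    else
        echo "No AFS token: run 'kinit && aklog'"
    fi
fi
```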
Kerberos ticket and AFS token status message
What does the "Your Kerberos ticket and AFS token status" message mean ?
A message such as the following is informational, telling the user :
- Until what date their Kerberos ticket can be renewed
- When their AFS token will expire, unless refreshed
=== === === Your Kerberos ticket and AFS token status === === ===
Kerberos : Renew until 02/25/17 07:36:21, Flags: FRI
AFS : User's (AFS ID 22964) tokens for email@example.com [Expires Jan 31 17:36]
Getting cluster resource usage history
How can I view the historical usage of resources on the HPC clusters ?
Getting files into your local directory on kong or stheno
How do I get files from outside kong or stheno into my local directory on those machines ?
The easiest way to do this is to copy the files from AFS.
- Since kong and stheno are AFS clients, a user logged into kong or stheno can copy files from any location in AFS to which the user has read access into their local kong or stheno directory.
- Conversely, a user can copy files from their local kong or stheno directory to any location in AFS to which the user has write access - e.g., a research directory or AFS home directory.
Getting Job Status
How can I get the status of jobs I have submitted to SGE ?
Use "qstat" on a head node; it supports various output formats. See SonOfGridEngine.
How can I see an overall summary of queue activity ?
Use "qsummary" on a head node. For usage : "qsummary -h". See SonOfGridEngine.
Getting local files onto or accessible from an HPC cluster
How do I get files that are local to my computer onto or accessible from the kong or stheno clusters ?
Programs running on compute nodes can access files that are :
- in the user's local cluster home directory
- in an AFS directory the user has access to
1. Getting files from your local computer to your local cluster home directory
- make your local computer an AFS client; contact firstname.lastname@example.org for help with this
- copy the relevant files from your local computer into AFS space that you have write access to, usually a research directory
- log in to the cluster headnode; make sure you have your AFS token, via tokens; use "kinit && aklog" if you don't have a token
- use tar or cp to get the files in AFS into your local cluster home directory. Contact email@example.com if you need help
2. Accessing your files in AFS directly from compute nodes
- Use ksub. See UsingKsub.
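The tar step in item 1 can be sketched as follows. This is a minimal sketch: copy_tree is a hypothetical helper name, and both directory arguments are placeholders for your own AFS source directory and local cluster destination. Piping one tar into another copies a whole directory tree and keeps file permissions.

```shell
#!/bin/sh
# copy_tree is a hypothetical helper (not a cluster utility).
# Usage: copy_tree <source directory> <destination directory>
copy_tree() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # The first tar writes the tree to stdout; the second unpacks it
    # in the destination, preserving permissions (-p).
    ( cd "$src" && tar -cf - . ) | ( cd "$dest" && tar -xpf - )
}
```

For a handful of files, plain "cp" is simpler.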
What languages are available ?
In addition to the languages listed by "module av", the languages that are part of the standard operating system installation are available - e.g., /usr/bin/perl, /usr/bin/python. See SoftwareModulesAvailable.
Performance, local machine vs. cluster
An application runs faster on my own computer than it does on the HPC cluster ...?
This can happen for a variety of reasons, including :
- your local computer has a faster CPU or more RAM than the cluster nodes
- your job on the cluster node shares the node's resources with other jobs
- the disk I/O on your local computer is faster than on the cluster
Even in such cases, users can get significantly better throughput using a cluster by :
- running many serial jobs simultaneously
- running jobs in parallel
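One way to run many serial jobs simultaneously is an SGE array job, where one submit script is run once per task and each task reads its index from SGE_TASK_ID. The following is a minimal sketch, assuming array jobs are permitted on the cluster; the input file naming scheme and the program name are hypothetical.

```shell
#!/bin/sh
#$ -cwd
#$ -t 1-10
# The "-t 1-10" line asks SGE for 10 tasks; each task sees its own
# index in SGE_TASK_ID.

# Hypothetical naming scheme: task N processes input_N.dat
input_for_task() {
    printf 'input_%s.dat\n' "$1"
}

# Outside SGE the variable is unset, so default to task 1 for testing.
task_id=${SGE_TASK_ID:-1}
input_file=$(input_for_task "$task_id")
echo "Task $task_id would process $input_file"

# Replace the echo above with your own serial program, e.g. :
# ./my_program "$input_file" > "output_$task_id.txt"
```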
Program runs on head node but not on compute nodes
My program runs OK when I run it on the head node, but produces different results, or doesn't run at all, on compute nodes. Why ?
There remain differences - scheduled to be remedied - between the head node and compute nodes. Code compiled on the head node may not run the same way on compute nodes. Fix : use qlogin to log in to an arbitrary compute node, and compile your code on that node.
Resources for big data analysis
What resources exist for big data analysis ?
A Hadoop cluster for research and teaching came online on 09 Dec 2015. Documentation on its use is being developed.
What writeups are available that describe NJIT's high performance computing (HPC) and big data (BD) resources ?
- IST support for researchers : ISTResearcherSupport
- Short Overview : ShortOverview
- Very Short Overview : VeryShortOverview
- Detailed hardware specs : Cluster specs
How do I specify matlab jobs to use a specific queue ?
A queue can be specified within a matlab input file, e.g. :
Is there support for Matlab Distributed Computing Server (MDCS) on the HPC clusters ?
What software is available on the HPC clusters ?
The simplest way to see what software is available is to enter :
module available
This will produce a list of most - but not all - available software.
You can refine your search to list specific software. E.g., to list all of the GCC compilers available :
module available gcc
Recent list of modules : see SoftwareModulesAvailable.
The "module available" command can be used on the kong and stheno head nodes, and on all public-access AFS Linux clients. The shortest form of this command is "module av".
What is the "module" Command ?
This command is used to set the environment variables for the software you want to use. See Environment modules, and "man module".
How do I use the "module load <software>" command ?
Generally, this command is placed in your submit script, as in :
module load matlab
How do I request software that is not currently available ?
This is done by sending mail to firstname.lastname@example.org. The software must meet these criteria :
- requested by faculty or staff
- compatible with the current Linux version on the HPC clusters
- used for research and/or courses
In general, such software will be installed in AFS, and will be accessible to all AFS Linux clients.
If the software is not free, a funding source will need to be identified.
The Stheno GCC version is 4.1.2. Can a newer version be installed ?
The problem is that the Stheno operating system, Scientific Linux SL release 5.5, is too old.
Although several newer versions of gcc are installed in AFS :
"module avail gcc" gives :
------------------------------- /afs/cad.njit.edu/ucs/modulefiles
gcc/4.8.1 gcc/4.8.2 gcc/4.9.2 gcc/5.2.0 gcc/5.3.0 gcc/5.4.0 gcc/6.1.0
none of them will load on Stheno, due to its operating system.
See Roadmap for scheduled Stheno operating system upgrade.
The operating system on Kong.njit.edu, Scientific Linux release 6.2, does support the gcc installations in AFS.
Users can get a login on Kong upon request to email@example.com from a faculty member.
Submit script behaves strangely
The statements in my submit script look correct, but it appears that the script is not being read correctly. Why is that ?
This behavior can happen if your submit script contains control characters that confuse the SGE job scheduler. The most common cause is Windows (DOS) line endings, which add a carriage-return character at the end of every line of the submit script. One way to remove these characters is the Linux dos2unix command. See "man dos2unix".
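Where dos2unix is not installed, the same check and cleanup can be sketched with standard tools. This is a minimal sketch: has_cr and strip_cr are hypothetical helper names. Note that strip_cr removes every carriage return in the file, not only those at line ends, which is normally what you want for a submit script.

```shell
#!/bin/sh
# has_cr succeeds if the file contains at least one carriage return.
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# strip_cr removes all carriage returns from the file in place.
# Writing back with "cat" keeps the original file's permissions.
strip_cr() {
    tr -d '\r' < "$1" > "$1.tmp" && cat "$1.tmp" > "$1" && rm -f "$1.tmp"
}
```

Typical use : has_cr myscript.sh && strip_cr myscript.sh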
How are jobs submitted to the HPC cluster compute nodes ?
Jobs must be submitted using a "submit script". See SonOfGridEngine.
What are the valid queue names ?
Can I run jobs on a head node ?
Jobs run on an HPC cluster head node that use non-negligible resources will be automatically terminated, with email to that effect sent to the owner of the job.
How can I run an interactive job on an HPC cluster compute node ?
Use "qlogin" on a head node. See KongQueues.
Testing if binaries will run
I have binaries, but no source code. How can I tell if the binaries should run on an HPC cluster ?
You can use the library.check utility (/usr/ucs/bin/library.check) to check for missing libraries and/or GLIBC_* versions. For usage : "library.check -h".
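On a Linux machine without library.check, a rough stand-in for the missing-library part of the check is the standard "ldd" command, which lists the shared libraries a binary needs and marks unresolved ones as "not found". This is a minimal sketch, not a replacement for library.check; missing_libs is a hypothetical helper name, and it does not check GLIBC_* versions.

```shell
#!/bin/sh
# missing_libs prints the shared libraries that cannot be resolved
# for the given binary. It prints nothing if all of them are found.
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}
```

Run it on the machine where the binary is meant to execute, since the result depends on the libraries installed there.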
What graphical processing units (GPUs) are available in the HPC clusters ?
See Cluster specs
How do I use those GPUs ?
Using GPUs in parallel on Kong
What is the maximum number of GPUs that can be used in parallel on Kong ?
As of April 2016, there are 2 GPU nodes, each with 2 GPUs, so the maximum number of GPUs that can be used in parallel is 4.
Using local scratch for large writes and reads
How do I use the large scratch space local to each compute node to significantly improve throughput ?
In your submit script, put something like the following :
######################################
# copy files needed to scratch
mkdir -p /scratch/ucid/work
cp <all needed files> /scratch/ucid/work
cd /scratch/ucid/work
#######################################
# Run your program
/full/path/to/cmd and arguments
########################################
# Copy results to local home directory
cp <all results and needed files home> ~/results_directory
#########################################
# Delete scratch directory
rm -rf /scratch/ucid
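The same copy, run, copy back, clean up sequence can be sketched as a reusable shell function. This is a minimal sketch under stated assumptions: run_in_scratch is a hypothetical helper name, all paths are supplied by the caller, and the work directory is named after the SGE JOB_ID variable (or the shell's process ID outside SGE) so that two jobs on the same node do not share a directory. Unlike the template above, it removes only its own work directory rather than the whole per-user scratch directory.

```shell
#!/bin/sh
# run_in_scratch is a hypothetical helper (not a cluster utility).
# Usage: run_in_scratch <scratch root> <input dir> <results dir> <command...>
run_in_scratch() {
    scratch_root=$1     # e.g. your per-user directory under /scratch
    input_dir=$2
    results_dir=$3
    shift 3

    # One work directory per job
    work_dir="$scratch_root/work.${JOB_ID:-$$}"
    mkdir -p "$work_dir" "$results_dir" || return 1

    # Copy the needed files to scratch
    cp -R "$input_dir"/. "$work_dir"/ || return 1

    # Run the command inside the scratch directory
    ( cd "$work_dir" && "$@" )
    status=$?

    # Copy everything back (inputs included), then clean up
    cp -R "$work_dir"/. "$results_dir"/
    rm -rf "$work_dir"
    return "$status"
}
```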
My code uses multithreading. How do I tell the submit script to do multithreading ?
To use multithreading you need to use the threaded parallel environment. In the submit script :
#$ -pe threaded NUMBER_OF_CORES
NUMBER_OF_CORES should not exceed 8 for the short, medium or long queues.
For the smp queue, up to 32 cores can be used. In the submit script :
#$ -pe threaded 32
#$ -q smp
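Inside a job, SGE sets the NSLOTS variable to the number of slots granted by the parallel environment, so the thread count does not need to be repeated in the script. The following is a minimal sketch of a complete submit script. It assumes the program takes its thread count from OMP_NUM_THREADS (true for OpenMP programs, not for all threaded code); threads_for_job and the program name are hypothetical.

```shell
#!/bin/sh
#$ -cwd
#$ -pe threaded 8
#$ -q medium

# Number of cores granted by SGE; defaults to 1 when run outside SGE.
threads_for_job() {
    echo "${NSLOTS:-1}"
}

# Assumption: the program is OpenMP-based and honors OMP_NUM_THREADS.
OMP_NUM_THREADS=$(threads_for_job)
export OMP_NUM_THREADS
echo "Running with $OMP_NUM_THREADS thread(s)"

# Replace with your own threaded program, e.g. :
# ./my_threaded_program
```

Keeping the thread count tied to NSLOTS avoids starting more threads than the cores the scheduler reserved for the job.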