FAQ

From NJIT-ARCS HPC Wiki


Accessing AFS Space

How can I access my AFS space from HPC cluster compute nodes ?

Jobs submitted by the SGE scheduler to compute nodes must obtain the user's AFS token in order to access AFS space that the user has access to. The current method for doing this is "ksub". See UsingKsub.

Just before running ksub, make sure you have your Kerberos ticket and AFS token by running :

kinit && aklog

An improved method for doing what ksub does, "auks", is in the process of being implemented.

Compilers

What compilers are available ?

In addition to the compilers listed by "module av", the GNU compilers that are part of the standard operating system installation are available.

See SoftwareModulesAvailable

Compiling CUDA code

How do I compile CUDA programs ?

  1. Get an interactive login on a GPU node, e.g., node151 or node152
    qlogin node151
  2. Load gcc and CUDA modules, e.g.,
    module load gcc/5.4.0 cuda
  3. Compile your code
  4. Log out of the GPU node
  5. Submit your job to the gpu queue using a submit script

See RunningCUDASamplesOnKong and KongQueuesTable
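
As a rough sketch of step 5, the commands below write a submit script for the gpu queue and check it for shell syntax errors. The file name cuda_job.sh, the job name, and the executable vector_add are hypothetical placeholders, and the nvcc line in the comment is only one possible way to do step 3. See KongQueuesTable for the authoritative queue settings.

```shell
# Write a submit script for the gpu queue (file and program names are hypothetical).
cat > cuda_job.sh <<'EOF'
#!/bin/bash
#$ -N cuda_job
#$ -cwd
#$ -q gpu
#$ -j y

# Load the same modules that were used when compiling on the GPU node.
module load gcc/5.4.0 cuda

# vector_add is a placeholder for the executable built in step 3, e.g. with:
#   nvcc -o vector_add vector_add.cu
./vector_add
EOF

# Check the script for shell syntax errors before submitting it.
bash -n cuda_job.sh && echo "cuda_job.sh looks syntactically valid"

# Then, on the head node:
#   qsub cuda_job.sh
```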

Conserving disk space

What are the best ways of conserving local space on the HPC clusters ?

  1. Remove unnecessary files
  2. If you have a research directory located at /afs/cad/research/.., move (or, better yet, compress) files into a sub-directory of that directory. This not only frees up space on the cluster, but also keeps your results available even after your cluster account is removed.
  3. Compress files with bzip2 or gzip. bzip2 usually compresses more effectively.
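
Steps 2 and 3 can be combined. The function below is a minimal sketch, not a site-provided tool: archive_results is a hypothetical name, and both arguments are paths you supply yourself. It packs a directory into a compressed tar archive in a destination directory, uses bzip2 when it is installed and gzip otherwise, and removes the original only after the archive can be read back.

```shell
# archive_results SRC_DIR DEST_DIR  (hypothetical helper)
# Packs SRC_DIR into DEST_DIR/<name>.tar.bz2 (or .tar.gz if bzip2 is missing),
# verifies the archive, removes SRC_DIR, and prints the archive path.
archive_results() {
    local src_dir=$1 dest_dir=$2
    local name flag ext archive
    name=$(basename "$src_dir")
    if command -v bzip2 >/dev/null 2>&1; then
        flag=j ext=bz2
    else
        flag=z ext=gz
    fi
    archive="$dest_dir/$name.tar.$ext"
    mkdir -p "$dest_dir" || return 1
    tar "-c${flag}f" "$archive" -C "$(dirname "$src_dir")" "$name" || return 1
    # Remove the original only if the archive can be listed without error.
    tar "-t${flag}f" "$archive" >/dev/null || return 1
    rm -rf "$src_dir"
    echo "$archive"
}

# Example, with the destination being a sub-directory of your research directory:
#   archive_results ~/old_run <your research directory>/archive
```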

"Error message: vsu_ClientInit: Could not get afs tokens, running unauthenticated" from qsub

I get this message when running qsub. Is it something I should be concerned about ?

This message indicates that you do not have your AFS token. The message can be safely ignored when running qsub.

Error message from a node

I get this message repeatedly in a terminal window when logged into Kong. Is it something I should be concerned about ?

This message indicates that there is a problem with a certain node. The message can be safely ignored, unless your job was using that node. The ARCS staff will take the offending node out of service as soon as possible.

Message from syslogd@nodeXXX at timestamp
 kernel: Code: 48 8d 45 d0 4c 89 4d f8 c7 45 b0 10 00 00 00 48 89 45 c0 e8 38 ff ff ff c9 c3 90 90 90 90 90 90 44 8d 46 3f 85 f6 55 44
 0f 49 c6 <31> d2 48 89 e5 41 c1 f8 06 45 85 c0 7e 24 48 83 3f 00 48 89 f8 

Getting access to HPC clusters

Who is eligible ?

All NJIT researchers; courses using HPC. See UserAccess.

Getting AFS token using qlogin

How do I get my AFS token when using qlogin ?

After qlogin :

	kinit && aklog && tokens

kinit gets your Kerberos ticket when you supply the correct password

aklog gets your AFS token from your Kerberos ticket

tokens displays your AFS token status
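
To avoid typing your password when a valid token is already present, the check can be scripted. This is only a sketch: has_afs_token and ensure_afs_token are hypothetical names, and the pattern assumes that "tokens" prints a line of the form "tokens for afs@<cell> [Expires ...]" for a valid token, as in the status message shown elsewhere in this FAQ.

```shell
# Succeeds if the text on stdin (the output of "tokens") lists an unexpired
# AFS token. The matched line format is an assumption based on sample output.
has_afs_token() {
    grep -q 'tokens for afs@.*\[Expires '
}

# Get a Kerberos ticket and an AFS token only if no valid token is present.
ensure_afs_token() {
    if ! tokens 2>/dev/null | has_afs_token; then
        kinit && aklog
    fi
}

# Usage, after qlogin:
#   ensure_afs_token && tokens
```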

Kerberos ticket and AFS token status message

What does the "Your Kerberos ticket and AFS token status" message mean ?

A message such as the following is informational, telling the user :

  • Until what date their Kerberos ticket can be renewed
  • When their AFS token will expire, unless refreshed
=== === === Your Kerberos ticket and AFS token status === === ===
Kerberos :  Renew until 02/25/17 07:36:21, Flags: FRI
AFS      :  User's (AFS ID 22964) tokens for afs@cad.njit.edu [Expires Jan 31 17:36]

Getting cluster resource usage history

How can I view the historical usage of resources on the HPC clusters ?

Use Ganglia

Getting files into your local directory on kong or stheno

How do I get files from outside kong or stheno into my local directory on those machines ?

The easiest way to do this is to copy the files from AFS.

  • Since kong and stheno are AFS clients, a user logged into kong or stheno can copy files from any location in AFS to which the user has read access into their local kong or stheno directory.
  • Conversely, a user can copy files from their local kong or stheno directory to any location in AFS to which the user has write access - e.g., a research directory or AFS home directory.

Archiving data

How do I archive data from local storage on a cluster or AFS ?

  1. Use rclone : [rclone]
  2. rsync to a local disk - contact hpc@njit.edu for assistance

Getting Job Status

How can I get the status of jobs I have submitted to SGE ?

Use "qstat" on a head node; it can report job status in various formats. See SonOfGridEngine.
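
A small wrapper around two common qstat forms might look like the sketch below. The function name job_status is hypothetical. "qstat -u <user>" lists that user's jobs and "qstat -j <job_id>" shows details for one job; both are standard Grid Engine options, but the output layout on these clusters may differ from other sites.

```shell
# job_status [JOB_ID]  (hypothetical helper)
# With no argument, list your own jobs; with a job ID, show that job's details.
job_status() {
    if ! command -v qstat >/dev/null 2>&1; then
        echo "qstat not found; run this on a head node" >&2
        return 127
    fi
    if [ -n "$1" ]; then
        qstat -j "$1"
    else
        qstat -u "${USER:-$(id -un)}"
    fi
}

# Usage on a head node:
#   job_status          # all of your jobs
#   job_status 123456   # one job (the ID is a made-up example)
```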

How can I see an overall summary of queue activity ?

Use "qsummary" on a head node. For usage : "qsummary -h". See SonOfGridEngine.

Getting local files onto or accessible from an HPC cluster

How do I get files that are local to my computer onto or accessible from the kong or stheno clusters ?

Programs running on compute nodes can access files that are :

  1. in the user's local cluster home directory
  2. in an AFS directory the user has access to

1. Getting files from your local computer to your local cluster home directory

  1. make your local computer an AFS client; contact arsc_help@njit.edu for help with this
  2. copy the relevant files from your local computer into AFS space that you have write access to, usually a research directory
  3. log in to the cluster headnode; make sure you have your AFS token, via tokens; use "kinit && aklog" if you don't have a token
  4. use tar or cp to get the files in AFS into your local cluster home directory. Contact arsc_help@njit.edu if you need help

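2. Accessing your files in AFS directly from compute nodes

Jobs running on compute nodes need your AFS token in order to read or write files in AFS. See "Accessing AFS Space" in this FAQ, and UsingKsub.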

Languages

What languages are available ?

In addition to the languages listed by "module av", the languages that are part of the standard operating system installation are available - e.g., /usr/bin/perl, /usr/bin/python. See SoftwareModulesAvailable.

Performance, local machine vs. cluster

Why does an application run faster on my own computer than it does on the HPC cluster ?

This can happen for a variety of reasons, including :

  • your local computer has a faster CPU or more RAM than the cluster nodes
  • your job on the cluster node shares the node's resources with other jobs
  • the disk I/O on your local computer is faster than on the cluster

It should be noted that even in such cases users can get significantly better throughput using a cluster by :

  • running many serial jobs simultaneously
  • running jobs in parallel
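
One possible way to run many serial jobs simultaneously is a Grid Engine array job, sketched below. The "-t" option and the SGE_TASK_ID variable are standard Grid Engine features, but whether array jobs are the recommended approach on these clusters is an assumption; the job name and the input file naming scheme are hypothetical.

```shell
#!/bin/bash
#$ -N serial_array
#$ -cwd
#$ -q short
#$ -t 1-10

# The scheduler starts one task per index in the -t range and sets
# SGE_TASK_ID for each. The default of 1 lets the script run outside SGE.
task_id="${SGE_TASK_ID:-1}"

# Hypothetical naming scheme: one input file per task.
input="input_${task_id}.dat"

echo "Task ${task_id} would process ${input}"
# /full/path/to/cmd "$input"    # placeholder for your serial program
```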

Program runs on head node but not on compute nodes

My program runs OK when I run it on the head node, but produces different results, or doesn't run at all, on compute nodes. Why ?

There remain differences - scheduled to be remedied - between the head node and compute nodes. Code compiled on the head node may not run the same way on compute nodes. Fix : use qlogin to log in to an arbitrary compute node, and compile your code on that node.

Resources for big data analysis

What resources exist for big data analysis ?

A Hadoop cluster for research and teaching came on-line 09 Dec 2015. Documentation on its use is being developed.

Resources writeups

What writeups are available that describe NJIT's high performance computing (HPC) and big data (BD) resources ?

IST support for researchers ISTResearcherSupport

Short Overview ShortOverview

Very Short Overview VeryShortOverview

Detailed hardware specs Cluster specs

Running Matlab

How do I specify matlab jobs to use a specific queue ?

A queue can be specified within a matlab input file, e.g. :

ClusterInfo.setQueue ('medium')

Is there support for Matlab Distributed Computing Server (MDCS) on the HPC clusters ?

Yes. See GettingStartedWithSerialAndParallelMATLABOnKongAndStheno.

Software

What software is available on the HPC clusters ?

The simplest way to see what software is available is to enter :

	module available

This will produce a list of most - but not all - available software.

You can refine your search to list specific software. E.g., to list all of the GCC compilers available :

	module available gcc

Recent list of modules SoftwareModulesAvailable

The "module available" command can be used on the kong and stheno head nodes, and on all public-access AFS Linux clients. The shortest form of this command is "module av".

What is the "module" Command ?

This command is used to set the environment variables for the software you want to use. See Environment modules, and "man module".

How do I use the "module load <software>" command ?

Generally, this command is placed in your submit script, as in :

module load matlab

How do I request software that is not currently available ?

This is done by sending mail to arsc_help@njit.edu. The software must meet these criteria :

  • requested by faculty or staff
  • free
  • compatible with the current Linux version on the HPC clusters
  • used for research and/or courses

In general, such software will be installed in AFS, and will be accessible to all AFS Linux clients.

If the software is not free, a funding source will need to be identified.

Running Jupyter Notebook

  • copy /opt/site/examples/jupyter_notebook/jupyter_submit.sbatch.sh to your directory
  • Modify jupyter_submit.sbatch.sh as needed - instructions are provided in jupyter_submit.sbatch.sh
  • Submit jupyter_submit.sbatch.sh

Submit script behaves strangely

The statements in my submit script look correct, but it appears that the script is not being read correctly. Why is that ?

This behavior can happen if your submit script contains control characters that confuse the SGE job scheduler. The most common cause is the presence of Windows/DOS line endings (a carriage return before each line feed) in the submit script, which typically happens when the script was edited on a Windows machine. One way to remove these characters is via the Linux dos2unix command. See "man dos2unix".
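
The check and the fix can be sketched as two small shell functions. The names has_crlf and strip_crlf are hypothetical; the fallback to tr is an addition for machines where dos2unix is not installed.

```shell
# Succeeds if the file contains at least one carriage return character.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Remove carriage returns from the file in place.
# Uses dos2unix when it is installed; otherwise deletes every carriage
# return with tr (not only those at line ends).
strip_crlf() {
    local tmp_file
    if command -v dos2unix >/dev/null 2>&1; then
        dos2unix "$1"
    else
        tmp_file=$(mktemp) || return 1
        tr -d '\r' < "$1" > "$tmp_file" && cat "$tmp_file" > "$1"
        rm -f "$tmp_file"
    fi
}

# Usage (submit.sh is a placeholder for your submit script):
#   has_crlf submit.sh && strip_crlf submit.sh
```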

Submitting jobs

How are jobs submitted to the HPC cluster compute nodes ?

Jobs must be submitted using a "submit script". See SonOfGridEngine.
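
For orientation, a minimal submit script might look like the sketch below. The "#$" lines are standard Grid Engine directives (job name, run in the current directory, queue, merge stderr into stdout, output file); the job name and log file name are hypothetical, and SonOfGridEngine remains the authoritative reference for these clusters.

```shell
#!/bin/bash
#$ -N hello_job
#$ -cwd
#$ -q short
#$ -j y
#$ -o hello_job.log

# JOB_ID is set by the scheduler; the default lets the script run outside SGE.
msg="Job ${JOB_ID:-not-under-SGE} running on $(uname -n)"
echo "$msg"

# /full/path/to/cmd and arguments    # placeholder for your program
```

Saved as, e.g., hello_job.sh, the script would be submitted from a head node with "qsub hello_job.sh".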

What are the valid queue names ?

Valid queue names include "short", "medium", and "long", among others. See KongQueuesTable and SthenoQueuesTable.

Can I run jobs on a head node ?

Only if they use negligible resources. Jobs run on an HPC cluster head node that use non-negligible resources are terminated automatically, and email to that effect is sent to the owner of the job.

How can I run an interactive job on an HPC cluster compute node ?

Use "qlogin" on a head node. See KongQueues.

Testing if binaries will run

I have binaries, but no source code. How can I tell if the binaries should run on an HPC cluster ?

You can use the library.check utility (/usr/ucs/bin/library.check) to check for missing libraries and/or GLIBC_* versions. For usage : "library.check -h".
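
On a machine where library.check is not installed, a rough first check for missing shared libraries can be made with the standard ldd command. This is a generic substitute, not what library.check does internally, and the function name missing_libs is hypothetical. It does not check GLIBC_* versions.

```shell
# missing_libs BINARY  (hypothetical helper)
# Prints the shared libraries that the dynamic linker cannot resolve for
# BINARY. Succeeds (exit status 0) only if at least one library is missing.
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}

# Usage (my_binary is a placeholder):
#   if missing_libs ./my_binary; then echo "some libraries are missing"; fi
```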

Using GPUs

What graphics processing units (GPUs) are available in the HPC clusters ?

See Cluster specs

How do I use those GPUs ?

See RunningCUDASamplesOnKong, MatlabGPUOnStheno

Using GPUs in parallel on Kong

What is the maximum number of GPUs that can be used in parallel on Kong ?

As of April 2016, there are 2 GPU nodes, each with 2 GPUs, so the maximum number of GPUs that can be used in parallel is 4.

Using local scratch for large writes and reads

How do I use the large scratch space local to each compute node to significantly improve throughput ?

In your submit script, put something like the following :

 ######################################
 # copy files needed to scratch
 
 mkdir -p /scratch/ucid/work
 cp <all needed files> /scratch/ucid/work
 cd /scratch/ucid/work
 #######################################
 # Run your program 
 
 /full/path/to/cmd and arguments
 ########################################
 # Copy results to local home directory

 cp <all results and needed files home> ~/results_directory
 #########################################
 # Delete scratch directory
 
 rm -rf /scratch/ucid
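
The same pattern can be wrapped in a function that creates a unique scratch directory for each job and always cleans it up. This is a sketch under assumptions: run_in_scratch and the SCRATCH_ROOT variable are hypothetical names, /scratch is taken from the example above, and a single input file is assumed for brevity.

```shell
# run_in_scratch RESULTS_DIR INPUT_FILE COMMAND [ARGS...]  (hypothetical helper)
# Copies INPUT_FILE to a fresh scratch directory, runs COMMAND there,
# copies everything back to RESULTS_DIR, and removes the scratch directory.
run_in_scratch() {
    local results_dir=$1 input=$2
    shift 2
    local root=${SCRATCH_ROOT:-/scratch}
    local work status
    # A unique directory per job avoids clashes between jobs of the same user.
    work=$(mktemp -d "$root/${USER:-ucid}.XXXXXX") || return 1
    cp "$input" "$work"/ || { rm -rf "$work"; return 1; }
    # Run the command in a subshell so the caller's directory is unchanged.
    ( cd "$work" && "$@" )
    status=$?
    mkdir -p "$results_dir" && cp -r "$work"/. "$results_dir"/
    rm -rf "$work"
    return "$status"
}

# Usage in a submit script (paths are placeholders):
#   run_in_scratch ~/results_directory ~/input.dat /full/path/to/cmd input.dat
```

Unlike "rm -rf /scratch/ucid", this removes only the directory created for the current job, so other jobs of yours running on the same node keep their scratch files.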

Using multithreading

My code uses multithreading. How do I tell the submit script to do multithreading ?

To use multithreading you need to use the threaded parallel environment. In the submit script :

#$ -pe threaded NUMBER_OF_CORES

NUMBER_OF_CORES should not exceed 8 for the short, medium or long queues.

For the smp queue, up to 32 cores can be used. In the submit script :

#$ -pe threaded 32
#$ -q smp
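
Requesting cores does not by itself make a program use them. A common pattern, sketched below, is to pass the granted slot count to the program. NSLOTS is set by Grid Engine for parallel-environment jobs; OMP_NUM_THREADS applies only if your code uses OpenMP, which is an assumption here, and the job name is hypothetical.

```shell
#!/bin/bash
#$ -N threaded_job
#$ -cwd
#$ -q medium
#$ -pe threaded 8

# NSLOTS holds the number of cores granted by the scheduler.
# The default of 1 lets the script run outside SGE.
export OMP_NUM_THREADS="${NSLOTS:-1}"

echo "Running with ${OMP_NUM_THREADS} thread(s)"
# /full/path/to/cmd and arguments    # placeholder for your threaded program
```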