Running CUDA Samples on Kong

This tutorial demonstrates how to compile and run a GPU job using CUDA sample code.

Make a directory to hold the samples and change into it.
kong-41 ~>: mkdir gpu
kong-42 ~>: cd gpu

Copy the sample files from AFS. Make sure to copy all of the files.
kong-43 gpu>: cp -r /afs/cad/linux/cuda-6.5.14/samples/ .

Change to the matrixMul sample directory.
kong-44 gpu>: cd samples/0_Simple/matrixMul

Load the CUDA module.
kong-45 matrixMul>: module load cuda

Build the binary.
kong-46 matrixMul>: make
"/afs/cad/linux/cuda-6.5.14"/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_11,code=sm_11 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_50,code=compute_50 -o matrixMul matrixMul.o
nvcc warning : The 'compute_11', 'compute_12', 'compute_13', 'sm_11', 'sm_12', and 'sm_13' architectures are deprecated, and may be removed in a future release.
mkdir -p ../../bin/x86_64/linux/release
cp matrixMul ../../bin/x86_64/linux/release
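
The nvcc command line above repeats one -gencode flag per target architecture, plus a final PTX entry for the highest one. The sketch below shows how such a flag list can be assembled from a list of compute capabilities. The function name gencode_flags is hypothetical and is not part of the CUDA samples.

```shell
#!/bin/sh
# Sketch only: gencode_flags is a hypothetical helper, not part of the CUDA samples.
# It rebuilds the -gencode flags seen in the make output from a list of
# compute capabilities given as arguments (for example: 35 50).
gencode_flags() {
    flags=""
    last=""
    for sm in "$@"; do
        # One native-code target per listed architecture.
        flags="$flags -gencode arch=compute_$sm,code=sm_$sm"
        last=$sm
    done
    # PTX for the last (highest) listed architecture, as in the output above.
    if [ -n "$last" ]; then
        flags="$flags -gencode arch=compute_$last,code=compute_$last"
    fi
    # Strip the leading space before printing.
    echo "${flags# }"
}

# The architecture list that appears in the build output above:
gencode_flags 11 20 30 35 37 50
```

The GPU used later in this tutorial reports compute capability 3.5, so gencode_flags 35 lists the only flags that job actually needs. Whether the samples Makefile lets you override its architecture list is an assumption to check in the Makefile itself.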

Create a submit script named gpusubmit.sh

#!/bin/sh
# Usage: qsub gpusubmit.sh
# Change job name and email address as needed

# -- our name ---
#$ -N matrixMul
#$ -S /bin/sh
# Make sure that the .e and .o files arrive in the
# working directory
#$ -cwd
# Merge the standard out and standard error to one file
#$ -j y
# Send mail at the beginning and end of the job
#$ -m be
#$ -M UCID@njit.edu
# Request a gpu
#$ -l gpu=1

/bin/echo Running on host: `hostname`.
/bin/echo In directory: `pwd`
/bin/echo Starting on: `date`

# Load CUDA module
. /opt/modules/init/bash
module load cuda

# Full path to executable
/home/g/UCID/gpu/samples/0_Simple/matrixMul/matrixMul
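
The script uses UCID as a placeholder in both the email address and the executable path. A minimal sketch of filling it in with sed follows. The helper name fill_ucid and the template file name gpusubmit.sh.in are hypothetical, and the sketch assumes your login name ($USER) is your UCID.

```shell
#!/bin/sh
# Sketch only: fill_ucid is a hypothetical helper.
# It reads a submit-script template on stdin and writes a copy to stdout
# with every occurrence of the placeholder UCID replaced by the first argument.
fill_ucid() {
    ucid=$1
    sed "s/UCID/$ucid/g"
}

# Example, assuming the template was saved as gpusubmit.sh.in (hypothetical
# name) and that $USER holds your UCID:
#   fill_ucid "$USER" < gpusubmit.sh.in > gpusubmit.sh
```

Only the UCID placeholder is replaced. The rest of the home directory path is left as written, so check that it matches your own account.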

Submit the job
kong-47 matrixMul>: qsub gpusubmit.sh
Your job 390030 ("matrixMul") has been submitted
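
Because the script merges standard error into standard output and runs in the current working directory, the job writes a single file named after the job name and job ID. The sketch below derives that file name from the qsub confirmation message. The helper name job_output_file is hypothetical, and the message format is assumed to match the one shown above.

```shell
#!/bin/sh
# Sketch only: job_output_file is a hypothetical helper.
# It takes the qsub confirmation message as its argument and prints the
# name of the merged output file (JOBNAME.oJOBID).
job_output_file() {
    msg=$1
    # The job ID is the number that follows "Your job".
    id=$(printf '%s\n' "$msg" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p')
    # The job name is the quoted text inside the parentheses.
    name=$(printf '%s\n' "$msg" | sed -n 's/^Your job [0-9][0-9]* ("\(.*\)").*/\1/p')
    # Fail if the message did not have the expected shape.
    [ -n "$id" ] && [ -n "$name" ] || return 1
    echo "$name.o$id"
}

job_output_file 'Your job 390030 ("matrixMul") has been submitted'
# prints: matrixMul.o390030
```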

View the output
kong-48 matrixMul>: cat matrixMul.o390030
Running on host: node151.
In directory: /home/g/UCID/gpu/samples/0_Simple/matrixMul
Starting on: Wed Nov 5 14:46:48 EST 2014
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Tesla K20Xm" with compute capability 3.5
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 274.18 GFlop/s, Time= 0.478 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
Note: For peak performance, please refer to the matrixMulCUBLAS example.
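
To check a run without reading the whole file, the sketch below looks for the PASS line, extracts the reported rate, and recomputes the operation count. All three helper names are hypothetical, and the patterns assume the output format shown above.

```shell
#!/bin/sh
# Sketch only: all three helpers are hypothetical.
# result_passed and reported_gflops read the job output on stdin.

# Succeeds if the correctness check reported PASS.
result_passed() {
    grep -q 'Result = PASS'
}

# Prints the number from the "Performance= ... GFlop/s" line.
reported_gflops() {
    sed -n 's/^Performance= *\([0-9.]*\) GFlop\/s.*/\1/p'
}

# Prints 2 * a * b * c, the usual operation count for a dense matrix multiply.
matmul_ops() {
    echo $(( 2 * $1 * $2 * $3 ))
}

# Examples, using the output file from this tutorial:
#   result_passed < matrixMul.o390030 && echo "run passed"
#   reported_gflops < matrixMul.o390030
matmul_ops 320 320 640
# prints: 131072000
```

The value printed by matmul_ops 320 320 640 matches the Size field in the output above. Dividing it by the reported time of 0.478 msec gives roughly 274 GFlop/s, in line with the Performance field.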
