Wordcount Tutorial

From HPC Wiki

After completing this WordCount tutorial, you will know:

  • How to connect to the Hadoop cluster
  • How to transfer files to and from the Hadoop file system (HDFS)
  • How to compile Java source code for MapReduce programs
  • How to run MapReduce programs on the Hadoop cluster
  • How to view the output of a MapReduce program

1) Connecting to the Hadoop Cluster

Use ssh to log in to horton.njit.edu.

afsconnect2-17 guest24>: ssh horton.njit.edu
guest24@horton.njit.edu's password:
[guest24@ambari03-cluster-client-0 ~]$

2) Create a directory in the AFS home directory to store the files for the tutorial.

[guest24@ambari03-cluster-client-0 ~]$ mkdir wordcount_tutorial

3) Change the current working directory to wordcount_tutorial.

[guest24@ambari03-cluster-client-0 ~]$ cd wordcount_tutorial
[guest24@ambari03-cluster-client-0 wordcount_tutorial]$

4) Copy WordCount.java to the current working directory

WordCount.java

5) Set CLASSPATH so that the Java source can be compiled.

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ export CLASSPATH=$(hadoop classpath):$CLASSPATH
[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ echo $CLASSPATH
/usr/hdp/2.6.1.0-129/hadoop/conf:/usr/hdp/2.6.1.0-129/hadoop/lib/*:/usr/hdp/2.6.1.0-129/hadoop/.//*:/usr/hdp/2.6.1.0-129/hadoop-hdfs/./:/usr/hdp/2.6.1.0-129/hadoop-hdfs/lib/*:/usr/hdp/2.6.1.0-129/hadoop-hdfs/.//*:/usr/hdp/2.6.1.0-129/hadoop-yarn/lib/*:/usr/hdp/2.6.1.0-129/hadoop-yarn/.//*:/usr/hdp/2.6.1.0-129/hadoop-mapreduce/lib/*:/usr/hdp/2.6.1.0-129/hadoop-mapreduce/.//*:/usr/java/default/lib/tools.jar:mysql-connector-java-5.1.17.jar:mysql-connector-java.jar:/usr/hdp/2.6.1.0-129/tez/*:/usr/hdp/2.6.1.0-129/tez/lib/*:/usr/hdp/2.6.1.0-129/tez/conf:.:/afs/cad/u/g/u/guest24/classes:/afs/cad/linux/tomcat/common/lib/servlet-api.jar:/afs/cad/linux/tomcat/common/lib/jsp-api.jar:/afs/cad/linux/oraclient10.2/jdbc/lib/classes12.zip:/afs/cad/solaris/mysql/jdbc/mysql-connector-java-5.0.4-bin.jar

6) Create a directory to store the java classes.

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ mkdir wordcount_classes

7) Compile WordCount.java.

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ javac -d wordcount_classes WordCount.java

The resultant classes can be found in ~/wordcount_tutorial/wordcount_classes/org/apache/hadoop/examples/

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ ls ~/wordcount_tutorial/wordcount_classes/org/apache/hadoop/examples/
WordCount$IntSumReducer.class  WordCount$TokenizerMapper.class  WordCount.class
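
The class names in this listing point to the usual two-stage structure: a mapper that tokenizes each line and a reducer that sums the counts for each word. WordCount.java itself is not reproduced on this page, so the following is only a minimal plain-Java sketch of that logic with no Hadoop dependency. It assumes the program follows the stock Apache Hadoop example, which splits lines on whitespace. The class WordCountSketch and its method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical illustration only; this is not the tutorial's WordCount.java.
class WordCountSketch {

    // "Map" stage: split one line on whitespace and emit a (word, 1) pair per token.
    // Assumption: whitespace tokenization, as in the stock Hadoop example.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // "Reduce" stage: sum the counts emitted for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Run both stages over a list of lines.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat saw the dog,", "the dog ran");
        countWords(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Because tokens are split on whitespace only, punctuation stays attached to a word: in the sample input, "dog," and "dog" are counted as two different words.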

8) Create a Java archive (jar) file containing the compiled classes. The -C option makes jar change into wordcount_classes first, so the entries in the archive start at org/ and match the package name org.apache.hadoop.examples.

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ jar -cvf wordcount.jar -C wordcount_classes/ .
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/apache/(in = 0) (out= 0)(stored 0%)
adding: org/apache/hadoop/(in = 0) (out= 0)(stored 0%)
adding: org/apache/hadoop/examples/(in = 0) (out= 0)(stored 0%)
adding: org/apache/hadoop/examples/WordCount$TokenizerMapper.class(in = 1790) (out= 765)(deflated 57%)
adding: org/apache/hadoop/examples/WordCount$IntSumReducer.class(in = 1789) (out= 749)(deflated 58%)
adding: org/apache/hadoop/examples/WordCount.class(in = 1956) (out= 1038)(deflated 46%)
[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ ls
WordCount.java  wordcount.jar  wordcount_classes
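
A class loader finds a class inside a jar by converting the fully qualified class name into an entry path, so org.apache.hadoop.examples.WordCount must be stored as org/apache/hadoop/examples/WordCount.class. The sketch below shows one way to check an archive for that entry using the JDK's java.util.jar.JarFile. The class JarLayoutCheck and its methods are hypothetical helpers and are not part of Hadoop or of this tutorial.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper for illustration; not part of Hadoop or the tutorial.
class JarLayoutCheck {

    // Entry path a class loader looks for, given a fully qualified class name.
    static String entryPathFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    // True if the archive has an entry at exactly the expected path.
    static boolean containsClass(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryPathFor(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the jar name used in this tutorial.
        String jarPath = args.length > 0 ? args[0] : "wordcount.jar";
        String className = "org.apache.hadoop.examples.WordCount";
        if (!new File(jarPath).isFile()) {
            System.out.println("No such file: " + jarPath);
            return;
        }
        System.out.println(entryPathFor(className) + " present: "
                + containsClass(jarPath, className));
    }
}
```

Running jar -tf wordcount.jar lists the same entry paths from the command line.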

9) The WordCount program counts the number of times each word appears in a text file. The text file used in this tutorial is Ulysses by James Joyce. Download the text file.

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ wget http://www.gutenberg.org/files/4300/4300-0.zip
--2015-12-14 21:25:13--  http://www.gutenberg.org/files/4300/4300-0.zip
Resolving www.gutenberg.org... 152.19.134.47
Connecting to www.gutenberg.org|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 661646 (646K) [application/zip]
Saving to: `4300-0.zip'

100%[===================================================================================================================================>] 661,646     --.-K/s   in 0.1s

2015-12-14 21:25:13 (4.27 MB/s) - `4300-0.zip' saved [661646/661646]

10) Unzip the text file.

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ unzip 4300-0.zip
Archive:  4300-0.zip
  inflating: 4300-0.txt


11) Create a directory in the Hadoop Distributed File System (HDFS) to store the text file.

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ hdfs dfs -mkdir /user/guest24/wordcount

12) Copy the text file from AFS to HDFS.

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ hdfs dfs -put 4300-0.txt /user/guest24/wordcount

13) Verify the file was copied.

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ hdfs dfs -ls /user/guest24/wordcount
Found 1 items
-rw-r--r--   3 guest24 hdfs    1573079 2015-12-14 21:31 /user/guest24/wordcount/4300-0.txt

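14) Run the WordCount program. The relative input path wordcount resolves to /user/guest24/wordcount, because relative HDFS paths are resolved against the user's HDFS home directory. The output directory /user/guest24/wordcount-output must not exist before the job starts.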

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount wordcount /user/guest24/wordcount-output
18/05/22 13:57:03 INFO client.RMProxy: Connecting to ResourceManager at ambari03-cluster-master-0.hpcnet/10.102.2.19:8050
18/05/22 13:57:03 INFO client.AHSProxy: Connecting to Application History server at ambari03-cluster-master-0.hpcnet/10.102.2.19:10200
18/05/22 13:57:05 INFO input.FileInputFormat: Total input paths to process : 1
18/05/22 13:57:06 INFO mapreduce.JobSubmitter: number of splits:1
18/05/22 13:57:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1510073681808_0132
18/05/22 13:57:09 INFO impl.YarnClientImpl: Submitted application application_1510073681808_0132
18/05/22 13:57:09 INFO mapreduce.Job: The url to track the job: http://ambari03-cluster-master-0.hpcnet:8088/proxy/application_1510073681808_0132/
18/05/22 13:57:09 INFO mapreduce.Job: Running job: job_1510073681808_0132
18/05/22 13:57:24 INFO mapreduce.Job: Job job_1510073681808_0132 running in uber mode : false
18/05/22 13:57:24 INFO mapreduce.Job:  map 0% reduce 0%
18/05/22 13:57:38 INFO mapreduce.Job:  map 100% reduce 0%
18/05/22 13:57:48 INFO mapreduce.Job:  map 100% reduce 100%
18/05/22 13:57:48 INFO mapreduce.Job: Job job_1510073681808_0132 completed successfully

>>>>>>>>>>>SNIP<<<<<<<<<<<<<

        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=1580890
        File Output Format Counters
                Bytes Written=5223367


15) View the resultant output file.

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ hdfs dfs -ls /user/guest24
Found 4 items
drwx------   - guest24 hdfs          0 2015-12-14 19:25 /user/guest24/.Trash
drwx------   - guest24 hdfs          0 2015-12-14 21:35 /user/guest24/.staging
drwxr-xr-x   - guest24 hdfs          0 2015-12-14 21:31 /user/guest24/wordcount
drwxr-xr-x   - guest24 hdfs          0 2015-12-14 21:35 /user/guest24/wordcount-output
[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ hdfs dfs -ls /user/guest24/wordcount-output
Found 2 items
-rw-r--r--   3 guest24 hdfs          0 2018-05-22 13:57 /user/guest24/wordcount-output/_SUCCESS
-rw-r--r--   3 guest24 hdfs     522336 2018-05-22 13:57 /user/guest24/wordcount-output/part-r-00000
[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ hdfs dfs -cat /user/guest24/wordcount-output/part-r-00000
"Defects,"      1
"Information    1
"Plain  2
"Project        5
"Right  1
#4300]  1
$5,000) 1
%       3

>>>>>>>>>>>SNIP<<<<<<<<<<<<<

’pon    1
’s      1
’tis    4
’twas   4
’twas.  1
’twere, 1
“Come   1
“I      1
“J”     1
“Viator”        1
“YOU    1
•       1
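
The listing starts with ASCII punctuation and ends with curly quotes and a bullet. That order is consistent with keys being sorted by their raw UTF-8 bytes, which is how Hadoop's Text type compares values, rather than alphabetically. The sketch below reproduces that ordering in plain Java. It assumes the job's keys are Text, as in the stock example; the class ByteOrderSketch and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of byte-wise key ordering; not Hadoop code.
class ByteOrderSketch {

    // Compare two strings by their UTF-8 bytes, treating each byte as unsigned.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xff) - (y[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        // Equal prefix: the shorter byte sequence sorts first.
        return x.length - y.length;
    }

    // Return a sorted copy of the words using the byte-wise comparison.
    static List<String> sortLikeKeys(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(ByteOrderSketch::compareUtf8);
        return sorted;
    }

    public static void main(String[] args) {
        // \u2019 is the right single quote and \u2022 the bullet seen in the output.
        List<String> words = Arrays.asList("\u2022", "the", "\u2019tis", "%", "\"Plain", "The");
        System.out.println(sortLikeKeys(words));
    }
}
```

Under this comparison, uppercase ASCII letters sort before lowercase ones, and any multi-byte character such as ’ or • sorts after all ASCII text.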

16) Copy the resultant output file from HDFS to AFS

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ hdfs dfs -get /user/guest24/wordcount-output/part-r-00000

17) Verify the file was copied

[guest24@ambari03-cluster-client-0 wordcount_tutorial]$ ls
4300-0.txt  4300-0.zip  WordCount.java  part-r-00000  wordcount.jar  wordcount_classes
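
With part-r-00000 now in the local directory, it can be processed with ordinary tools. As an illustration, the sketch below reads the file and prints the ten most frequent words. It assumes each line has the form word, tab, count, which is the default text output format; the class TopWords and its methods are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing helper; not part of the tutorial's programs.
class TopWords {

    // Parse one "word<TAB>count" line; returns null for lines that do not match.
    static Map.Entry<String, Long> parseLine(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab <= 0) {
            return null;
        }
        try {
            long count = Long.parseLong(line.substring(tab + 1).trim());
            return new AbstractMap.SimpleEntry<>(line.substring(0, tab), count);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Return up to n entries with the highest counts, highest first.
    static List<Map.Entry<String, Long>> top(List<String> lines, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>();
        for (String line : lines) {
            Map.Entry<String, Long> entry = parseLine(line);
            if (entry != null) {
                entries.add(entry);
            }
        }
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name copied from HDFS in this tutorial.
        Path path = Paths.get(args.length > 0 ? args[0] : "part-r-00000");
        if (!Files.isRegularFile(path)) {
            System.out.println("No such file: " + path);
            return;
        }
        List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
        for (Map.Entry<String, Long> entry : top(lines, 10)) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

Compile it with javac TopWords.java and run it with java TopWords part-r-00000 from the wordcount_tutorial directory.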