So, you have decided you are interested in big data and data science and exploring what you can do with Hadoop and Map Reduce.
But... you find most of the tutorials too hard to wade through, inconsistent, or you simply encounter problems that you just can't solve. Hadoop is evolving so fast that often the documentation is unable to keep up.
Here I will run you through the process I followed to get the latest version of Hadoop (2.5.1) running so I could use it to test my Map Reduce programs.
Part One: Java
You need to make sure you have a compatible version of Java on your machine. Jump into your terminal and type
java -version
You preferably need an installation of Java 7.
When I run this I get:
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1~0.12.04.2)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
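If you would rather check the version from a script than by eye, a sketch like the following pulls the major version out of the quoted version string. The function name java_major is my own invention, and it assumes the output format shown above.

```shell
# Hypothetical helper: extract the major Java version from a line such as
#   java version "1.7.0_55"
# Old-style "1.x" strings report x; newer "11.0.2" style strings report 11.
java_major() {
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$ver" in
        1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;
        *)   printf '%s\n' "$ver" | cut -d. -f1 ;;
    esac
}

# Example with the version line from this walk-through.
java_major 'java version "1.7.0_55"'
```

On a real machine you would pass it the live output instead: java_major "$(java -version 2>&1)". The redirect is needed because java -version writes to stderr.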
Part Two: Other Software
You will need ssh and rsync installed. Chances are that they already are, but if not just run:
sudo apt-get install ssh
sudo apt-get install rsync
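To see in one go whether anything is missing before you reach for apt-get, a small sketch like this should work in any POSIX shell. The function name missing_tools is made up for this walk-through.

```shell
# Hypothetical helper: print the name of each listed tool that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Prints nothing when both tools are installed.
missing_tools ssh rsync
```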
Part Three: Grab a Release
Head to the Apache Hadoop Releases page, choose a mirror and grab the tarball (.tar.gz). Make sure you do not grab the source file by mistake (src). Remember: in this walk-through I have grabbed release 2.5.1.
Part Four: Unpack & Configure
Copy the tarball to wherever you want Hadoop to reside. I like to put it in the directory
/usr/local/hadoop
and then extract the contents with
tar -xvf hadoop-2.5.1.tar.gz
Then you will need to do some configuration. Open the file
vi hadoop-2.5.1/etc/hadoop/hadoop-env.sh
You will need to modify the line that currently looks like this
export JAVA_HOME=${JAVA_HOME}
You need to point this to your Java installation. If you are not sure where that is, just run
which java
and then copy the path (minus the bin/java at the end) into the hadoop config file to replace the text ${JAVA_HOME}.
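The "minus the bin/java" step can also be done by the shell. This is only a sketch: java_home_from_bin is a name I made up, and the path in the example is a typical Ubuntu OpenJDK 7 location, not necessarily yours.

```shell
# Hypothetical helper: turn the path of a java binary into a JAVA_HOME value
# by dropping the trailing /bin/java.
java_home_from_bin() {
    dirname "$(dirname "$1")"
}

# Example path only; substitute whatever `which java` reports on your machine.
java_home_from_bin /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
```

On Ubuntu, which java often reports a symlink such as /usr/bin/java. If you want the real installation directory rather than /usr, resolve the link first: java_home_from_bin "$(readlink -f "$(which java)")".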
Part Five: Test
First run a quick test to check that you have configured Java correctly. The following command should show you the version of Hadoop and its compilation information.
hadoop-2.5.1/bin/hadoop version
Part Six: Run Standalone
The simplest thing you can do with Hadoop is run a Map Reduce job in standalone mode, as a single local process. The Apache Docs give a great simple example: grepping a collection of files.
Run these commands:
mkdir input
cp hadoop-2.5.1/etc/hadoop/*.xml input
hadoop-2.5.1/bin/hadoop jar hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar grep input output 'dfs[a-z.]+'
When Hadoop completes that process you can open up the results file and have a look.
vi output/part-r-00000
You should see a single line for each match of the regular expression. Try changing the expression and seeing what you get. Now you can use this installation to test your Map Reduce jars against Hadoop 2.5.1.
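One thing to watch when you rerun the job with a different expression: Hadoop aborts a job if its output directory already exists, so the old results have to go first. Here is a rough sketch, where clear_output is a made-up helper and the new pattern is only an example of mine.

```shell
# Hypothetical helper: delete a previous job's output directory (default: ./output).
clear_output() {
    rm -rf -- "${1:-output}"
}

# Rerun the grep example from the directory that contains hadoop-2.5.1.
# The guard simply skips the run if Hadoop is not where this walk-through put it.
if [ -x hadoop-2.5.1/bin/hadoop ]; then
    clear_output output
    hadoop-2.5.1/bin/hadoop jar \
        hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar \
        grep input output 'mapred[a-z.]+'
    cat output/part-r-00000
fi
```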
Coming Next: Running Hadoop 2.5.1 in Pseudo Distributed Mode