April 22, 2008

Running Hadoop On The Cluster.

Since I have built the cluster I have been trying to find cool things to do with it. One cool project is Hadoop:

From Wikipedia: Apache Hadoop is a Free Java software framework that supports data intensive distributed applications running on large clusters of commodity computers. [1] It enables applications to easily scale out to thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

The MapReduce methodology has really caught on, Yahoo now uses it as well and actually use Hadoop specifically. Basically it can be used anywhere you have a large dataset that needs to be processed that can’t be handled by single machine.

Setting Hadoop up wasn’t a problem, Hadoop Wiki and this site had tons of good info. Basically the idea is that you have master and slave nodes. The master(s) coordinate the slaves and the slaves do all the work. You also have to create a HDFS filesystem for Hadoop to use. Once you have Hadoop setup on all the nodes, configured your hadoop-site.xml and created your HDFS filesystem you should be ready to start up your cluster. Needless to say there are many small details that I am not mentioning, the docs above do a pretty good job or outlining those so I won’t repeat them here.

There are a few things I did differently than the docs. First, I installed Hadoop in a shared partition that all my nodes can access. This should make upgrades fairly simple. My ‘conf’ directory is also symlink’ed in as well as the version that I am currently using:

[cluster@front hadoop]$ pwd
/opt/share/hadoop
[cluster@front hadoop]$ ls -la
total 28
drwxrwxr-x 7 cluster cluster 4096 Apr 22 13:39 .
drwxrwxr-x 8 cluster ucluster 4096 Apr 21 11:57 ..
drwxr-xr-x 2 cluster cluster 4096 Apr 22 13:40 conf
lrwxrwxrwx 1 cluster cluster 13 Apr 22 11:46 current -> hadoop-0.16.3
drwxrwxr-x 4 cluster cluster 4096 Apr 22 13:45 data
drwxr-xr-x 12 cluster cluster 4096 Apr 21 12:08 hadoop-0.15.3
drwxr-xr-x 12 cluster cluster 4096 Apr 22 13:16 hadoop-0.16.3
drwxrwxr-x 2 cluster cluster 4096 Apr 21 14:06 input
[cluster@front hadoop]$ echo $HADOOP_HOME
/opt/share/hadoop/current/
[cluster@front hadoop]$ ls -al current/conf
lrwxrwxrwx 1 cluster cluster 8 Apr 22 11:46 current/conf -> ../conf/

I also created ‘/opt/share/hadoop/data’ and ‘/opt/share/hadoop/input’. ‘data’ is my HDFS store. ‘input’ is the location where I store the input that is eventually copied into the HDFS store.

I also modified my hadoop-metrics.properties configuration to turn on the Ganglia monitoring. I’ll admit that I haven’t got it properly working with Ganglia yet and the documentation is fairly sparse. If you have any suggestions on how do this out side of whats on this page let me know.

Once everything is rolling you should be able to ’start-all.sh’ to startup the Hadoop daemons on all nodes. From there you can submit jobs, below is the wordcount example:

[cluster@front hadoop]$ /opt/share/hadoop/current/bin/hadoop jar /opt/share/hadoop/current/hadoop-0.16.3-examples.jar wordcount input output-wordcount6
08/04/22 14:18:27 INFO mapred.FileInputFormat: Total input paths to process : 6
08/04/22 14:18:28 INFO mapred.JobClient: Running job: job_200804221413_0001
08/04/22 14:18:29 INFO mapred.JobClient: map 0% reduce 0%
08/04/22 14:18:42 INFO mapred.JobClient: map 21% reduce 0%
08/04/22 14:18:44 INFO mapred.JobClient: map 40% reduce 0%
...

You can check your jobs and see how things are running by hitting ‘localhost:50070′ and ‘localhost:50030′ in a browser while Hadoop is running.

Lastly, I attempted to also create jobs to run hadoop in Sun Grid Engine, which I spoke about installing with Unicluster in this post. This worked like a charm, the normal q* commands and etc work like you would expect and the jobs run properly.

Hadoop is a pretty sweet utility, its no surprise that large internet search companies could use this to their advantage. We will likely see a lot more from Hadoop and the MapReduce framework in the future.

Edit: Check out, http://www.joeandmotorboat.com/2008/07/28/more-on-hadoop-metrics-in-ganglia/ for more info on ganglia and hadoop metrics.

9 Comments

  1. bob May 15, 2008 11:21 pm

    So did you get ganglia to work with hadoop by now?

  2. joe May 16, 2008 4:57 pm

    I haven’t mostly because I have been busy with other things. But I think it may have something to do with my ganglia installation. The ganglia installed with unicluster seems to have some issues: http://www.grid.org/forum/showthread.php?t=160 It seems that Univa UD may not know what the issue is either. So currently I am trying to identify the issue but I may end up installing my own ganglia and using that. If I end up getting it to work I will be sure to post it.

  3. Cloud Strife Jul 22, 2008 9:15 pm

    hi there, I’m trying to connect ganglia and hadoop, but no luck :( .
    Any progress on making it works, Joe?

  4. joe Jul 23, 2008 5:49 am

    i haven’t had any luck either but haven’t spent much time working on it. the documentation seems sparse, so that certainly doesn’t help. i will see if i can get something rolling in the next couple days and post it.

  5. Cloud Strife Jul 31, 2008 2:13 am

    Finally it works. Thank you, Joe. Your posts did help me a lot.

  6. joe Jul 31, 2008 7:03 am

    Glad they helped!

  7. LC Oct 09, 2008 12:20 pm

    pistolilla…

  8. sneha Mar 08, 2009 3:25 pm

    dear sir,

    i am doing my project in hadoop and would like ur help in this regard… i would like to which is the fornt end for hadoop …and how do we go abt doing it ..

    kindly help u!!
    thanking you
    sneha

  9. Dinesh Mar 04, 2010 10:54 pm

    Sneha am trying to do a project using hadoop !! We could help each other out !!

Leave a Comment

(required)

(will not be published) (required)