April 30, 2008

Map Reduce and MPI.

Over at GridGuru they have an interesting article on MapReduce and its applications. The MapReduce crowd has been growing of late and is outspoken about what a great tool it is. Without a doubt it is, but something I learned a long time ago is that for each job there is a correct tool. You don’t use a sledgehammer to fix your watch and you don’t use a pair of tweezers for demolition.

I am a skeptic, which is not to say I have anything against a generalized framework for distributing data to a large number of processors. Nor does it imply that I enjoy MPI and its coherence arising from cacophonous chatter (if all goes well). I just don’t think MapReduce is particularly “simple”. The key promoters of this algorithm, such as Yahoo and Google, have serious experts MapReducing their particular problem sets, and thus they make it look easy.

Sadly this implies that processing data in parallel is still hard, no matter how good a programmer you are or how sophisticated your programming language is.

sudo

April 22, 2008

Running Hadoop On The Cluster.

Since I built the cluster I have been trying to find cool things to do with it. One cool project is Hadoop:

From Wikipedia: Apache Hadoop is a Free Java software framework that supports data intensive distributed applications running on large clusters of commodity computers. It enables applications to easily scale out to thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

The MapReduce methodology has really caught on. Yahoo now uses it as well, and actually uses Hadoop specifically. Basically it can be used anywhere you have a large dataset that needs to be processed and can’t be handled by a single machine.
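To make the idea a bit more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps for a word count, written as a plain shell pipeline. It is only an analogy and not how Hadoop implements anything; the function names (map_words, shuffle, reduce_counts, wordcount) are made up for illustration.

```shell
#!/bin/bash
# Single-machine analogy of MapReduce word count.
# All function names here are hypothetical, chosen for illustration only.

# map: emit one lowercase word (the key) per line
map_words() {
    tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]'
}

# shuffle: bring identical keys next to each other
shuffle() {
    LC_ALL=C sort
}

# reduce: count each run of identical keys, printing "word<TAB>count"
reduce_counts() {
    awk 'NF {
             if ($0 != prev) {
                 if (prev != "") print prev "\t" n
                 prev = $0
                 n = 0
             }
             n++
         }
         END { if (prev != "") print prev "\t" n }'
}

# the whole job: text on stdin, counts on stdout
wordcount() {
    map_words | shuffle | reduce_counts
}
```

Feeding it a text file on stdin, for example `wordcount < some-file.txt`, prints one word and its count per line. Roughly speaking, what a framework like Hadoop adds is running many map and reduce steps on different machines, with the shuffle moving data between them.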

Setting Hadoop up wasn’t a problem; the Hadoop Wiki and this site had tons of good info. Basically the idea is that you have master and slave nodes. The master(s) coordinate the slaves and the slaves do all the work. You also have to create an HDFS filesystem for Hadoop to use. Once you have Hadoop set up on all the nodes, configured your hadoop-site.xml and created your HDFS filesystem, you should be ready to start up your cluster. Needless to say there are many small details that I am not mentioning. The docs above do a pretty good job of outlining those, so I won’t repeat them here.

There are a few things I did differently than the docs. First, I installed Hadoop in a shared partition that all my nodes can access. This should make upgrades fairly simple. My ‘conf’ directory is symlinked in, as is the version that I am currently using:

[cluster@front hadoop]$ pwd
/opt/share/hadoop
[cluster@front hadoop]$ ls -la
total 28
drwxrwxr-x 7 cluster cluster 4096 Apr 22 13:39 .
drwxrwxr-x 8 cluster ucluster 4096 Apr 21 11:57 ..
drwxr-xr-x 2 cluster cluster 4096 Apr 22 13:40 conf
lrwxrwxrwx 1 cluster cluster 13 Apr 22 11:46 current -> hadoop-0.16.3
drwxrwxr-x 4 cluster cluster 4096 Apr 22 13:45 data
drwxr-xr-x 12 cluster cluster 4096 Apr 21 12:08 hadoop-0.15.3
drwxr-xr-x 12 cluster cluster 4096 Apr 22 13:16 hadoop-0.16.3
drwxrwxr-x 2 cluster cluster 4096 Apr 21 14:06 input
[cluster@front hadoop]$ echo $HADOOP_HOME
/opt/share/hadoop/current/
[cluster@front hadoop]$ ls -al current/conf
lrwxrwxrwx 1 cluster cluster 8 Apr 22 11:46 current/conf -> ../conf/

I also created ‘/opt/share/hadoop/data’ and ‘/opt/share/hadoop/input’. ‘data’ is my HDFS store. ‘input’ is the location where I store the input that is eventually copied into the HDFS store.
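As a rough sketch of how an upgrade could work with this layout, the function below repoints ‘current’ at another release directory and links the shared ‘conf’ into it. The function name and the ‘conf.dist’ name for the release’s stock configuration are my own assumptions, not anything Hadoop ships.

```shell
#!/bin/bash
# Sketch: switch the shared install to another unpacked release.
# switch_hadoop_version and conf.dist are hypothetical names.
#   $1 = install root (e.g. /opt/share/hadoop)
#   $2 = release directory name (e.g. hadoop-0.16.3)
switch_hadoop_version() {
    local root="$1" version="$2"

    if [ ! -d "$root/$version" ]; then
        echo "no such release: $root/$version" >&2
        return 1
    fi

    # One shared conf directory for every release
    mkdir -p "$root/conf"

    # If the release still has its own real conf directory, move it
    # aside (assumed name: conf.dist) before linking in the shared one
    if [ -d "$root/$version/conf" ] && [ ! -L "$root/$version/conf" ]; then
        mv "$root/$version/conf" "$root/$version/conf.dist"
    fi
    ln -sfn ../conf "$root/$version/conf"

    # Repoint "current"; -n keeps ln from following the old link
    ln -sfn "$version" "$root/current"
}
```

Something like `switch_hadoop_version /opt/share/hadoop hadoop-0.16.3` would then leave HADOOP_HOME untouched, since it points at ‘current’ rather than at a specific release.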

I also modified my hadoop-metrics.properties configuration to turn on the Ganglia monitoring. I’ll admit that I haven’t got it working properly with Ganglia yet, and the documentation is fairly sparse. If you have any suggestions on how to do this beyond what’s on this page, let me know.

Once everything is rolling you should be able to run ‘start-all.sh’ to start up the Hadoop daemons on all nodes. From there you can submit jobs. Below is the wordcount example:

[cluster@front hadoop]$ /opt/share/hadoop/current/bin/hadoop jar /opt/share/hadoop/current/hadoop-0.16.3-examples.jar wordcount input output-wordcount6
08/04/22 14:18:27 INFO mapred.FileInputFormat: Total input paths to process : 6
08/04/22 14:18:28 INFO mapred.JobClient: Running job: job_200804221413_0001
08/04/22 14:18:29 INFO mapred.JobClient: map 0% reduce 0%
08/04/22 14:18:42 INFO mapred.JobClient: map 21% reduce 0%
08/04/22 14:18:44 INFO mapred.JobClient: map 40% reduce 0%
...
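For reference, the steps around a run like the one above can be wrapped in a small script. This is only a sketch under a few assumptions: the function names are made up, HADOOP_HOME defaults to the shared install path used in this post, the output is assumed to land in a single ‘part-00000’ file, and setting DRY_RUN makes it print the commands instead of running them.

```shell
#!/bin/bash
# Sketch: copy input into HDFS, run the wordcount example, show the result.
# run and run_wordcount are hypothetical helper names.
HADOOP_HOME="${HADOOP_HOME:-/opt/share/hadoop/current}"
EXAMPLES_JAR="${EXAMPLES_JAR:-$HADOOP_HOME/hadoop-0.16.3-examples.jar}"

# Run a command, or just print it when DRY_RUN is set
run() {
    if [ -n "$DRY_RUN" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

#   $1 = local input directory
#   $2 = HDFS input path
#   $3 = HDFS output path (must not exist yet)
run_wordcount() {
    local local_input="$1" hdfs_input="$2" hdfs_output="$3"

    # Copy the local input into HDFS
    run "$HADOOP_HOME/bin/hadoop" dfs -put "$local_input" "$hdfs_input" &&
    # Submit the example job
    run "$HADOOP_HOME/bin/hadoop" jar "$EXAMPLES_JAR" wordcount \
        "$hdfs_input" "$hdfs_output" &&
    # Print the result; part-00000 assumes a single output file
    run "$HADOOP_HOME/bin/hadoop" dfs -cat "$hdfs_output/part-00000"
}
```

Running `DRY_RUN=1 run_wordcount /opt/share/hadoop/input input output-demo` is a cheap way to check the three commands before touching the cluster; ‘output-demo’ is just a placeholder name.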

You can check your jobs and see how things are running by hitting ‘localhost:50070’ (the HDFS NameNode status page) and ‘localhost:50030’ (the JobTracker) in a browser while Hadoop is running.

Lastly, I also tried creating jobs to run Hadoop in Sun Grid Engine, which I spoke about installing with Unicluster in this post. This worked like a charm. The normal q* commands work like you would expect and the jobs run properly.

Hadoop is a pretty sweet utility, and it’s no surprise that large internet search companies could use it to their advantage. We will likely see a lot more from Hadoop and the MapReduce framework in the future.

Edit: Check out http://www.joeandmotorboat.com/2008/07/28/more-on-hadoop-metrics-in-ganglia/ for more info on Ganglia and Hadoop metrics.

April 20, 2008

The Cluster.

So, for the last couple of months I have been working on a project in clustering and high performance computing. I haven’t really dealt with much of it in my day job outside of clustering J2EE containers like Tomcat and Resin, but that sort of clustering is something entirely different for the most part. What I wanted to build was something that could run parallel jobs and solve hard math, like something you might find at a university or government lab. TACC is a great example, as is anything you might find on the Top500. Obviously something like TACC, with 3,936 nodes and 123TB of memory, isn’t something I can afford to buy.

So I decided to piece it together with cheap hardware from eBay. I first found a lot of three IBM NetVistas, each with 512MB of RAM and an 866MHz Pentium 3 CPU. Then I picked up a Nortel BayStack 450-24T switch. I wanted something I could manage if needed, with decent throughput, better than a Linksys or Netgear off the shelf at Best Buy. Lastly, I needed a machine to become the front/main node that would lead the cluster. For this I found a Compaq EVO with a 1.7GHz Pentium 4 and 512MB of RAM. After a few weeks I had all my hardware.

Next was choosing the software I wanted to run. I tried a few different setups. The first was PelicanHPC. This clustering distro is based on Debian and works pretty well. It is completely CD-based and there is very little setup involved; everything just boots over the network or off the CD. While it is quick and easy, I shied away from using it for just that reason. I wanted something that is installed and permanent.

Next in line was Rocks. Rocks is CentOS-based and uses Kickstart to distribute the OS to all the nodes of the cluster. It also has “rolls” that include all sorts of HPC software like Ganglia, Sun Grid Engine and Condor. It seemed like a nice all-in-one solution, but I didn’t have much luck with it because my main node only has 512MB of RAM and the installer needs at least 1GB to load its image. I tried it on my laptop, which has 4GB, and it worked nicely, but swapping hard drives in my laptop to play with the cluster was not something I wanted to do on a daily basis.

My next trial was CentOS 4 with OSCAR. OSCAR has a decent install process and a number of packages “built in”, like MPI and Ganglia. It also hasn’t been updated in a couple of years, so it requires an older distro such as CentOS 4 or Fedora Core 5. This isn’t really a problem (I happen to like CentOS 4 quite a bit), but as you might expect I ran into a number of weird package dependency issues during the install, which I was able to correct. Once I had it installed on my main node I began installing it over the network to the other nodes, a process that should have been trivial but that I was unable to get working properly. Between that and the annoying package dependencies, I moved on.

My final trial was CentOS 4 with Univa UD’s Unicluster Express 3.2. This includes Grid Engine, Globus, Ganglia (plus Univa UD’s own monitoring client) and a dead simple install process. Once it was installed on the main node I moved on to the rest of the nodes, and their installation script made this very easy. With that done I could try submitting jobs. I started with one of their example jobs, which basically just runs the ‘date’ command on the node that the job is sent to:

qsub /usr/local/unicluster/sge/examples/jobs/simple.sh

It worked! Grid Engine sent the job to one of the nodes and the node completed it. Next I wanted to try something a bit more intensive, so I tried their ‘worker.sh’ example, and it worked without an issue as well. After that I wanted to find out how to submit parallel jobs. After a few trial-and-error attempts with the wrong software, I installed MPICH2 in a shared partition on the cluster, set the path, started it up and tested it out:

mpdboot -n 3 -f hostfile
mpdtrace -l
mpiexec -n 3 hostname
mpdringtest 100

It worked without a hitch. I now needed to set up the queue to be ‘batch interactive parallel’ so I could submit parallel jobs and have them run immediately:

qconf -mattr queue qtype "BATCH INTERACTIVE" all.q

I also went ahead and wrote a script for the job I wanted to run, which is the pi example (cpi) included with MPICH2:

#!/bin/bash
#$ -cwd
#$ -N mpitest
#$ -pe mpi 3
#$ -now y
#$ -j y
/opt/share/mpich2/bin/mpiexec -l -n 1 -host front.esper /opt/share/mpich2/bin/cpi : -n 100 -host node01.esper /opt/share/mpich2/bin/cpi : -n 100 -host node02.esper /opt/share/mpich2/bin/cpi : -n 100 -host node03.esper /opt/share/mpich2/bin/cpi
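The long mpiexec line in that job script gets unwieldy as nodes are added. One way to keep it manageable is to generate the colon-separated part from a list of hosts, as in the sketch below. The helper name build_mpiexec_args is hypothetical, and it assumes hostnames and paths without spaces.

```shell
#!/bin/bash
# Sketch: build the ": -n N -host H prog" segments for mpiexec.
# build_mpiexec_args is a hypothetical helper name.
#   $1 = program to run
#   $2 = number of processes per host
#   remaining arguments = hostnames
build_mpiexec_args() {
    local prog="$1" per_node="$2"
    shift 2

    local args="" host
    for host in "$@"; do
        # Separate the per-host segments with " : "
        [ -n "$args" ] && args="$args : "
        args="$args-n $per_node -host $host $prog"
    done
    printf '%s\n' "$args"
}
```

In the job script this could be used as `mpiexec -l -n 1 -host front.esper $CPI : $(build_mpiexec_args "$CPI" 100 node01.esper node02.esper node03.esper)` with CPI set to /opt/share/mpich2/bin/cpi. The command substitution is left unquoted on purpose so the shell splits it back into separate arguments.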

When the mpitest job script is submitted to the cluster using SGE’s ‘qsub’, it runs 100 processes on each of the three compute nodes and 1 on the main node. Since this is a parallel job, they all work together to find the result. I was pretty excited to find that it produced a result. My next project was to get Linpack running to test the performance of the cluster. After a bit of playing with the Make script and installing ATLAS and Fortran, I was able to get it to compile. I kicked off a test in the same way as the pi script above and scored 1.919e-03 Gflops, or about 1.9 Mflops. Nothing to write home about, but it works and it’s good enough for me.

0: ============================================================================
0: T/V                N    NB     P     Q               Time             Gflops
0: ----------------------------------------------------------------------------
0: WR00R2R4          35     4     4     1               0.02          1.919e-03
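A quick sanity check on that number: the Gflops column reads 1.919e-03, which is 1.919 Mflops. The sketch below does the conversion and also estimates how long the run really took, assuming HPL’s usual operation count of 2/3 N^3 + 3/2 N^2 (that formula is my assumption, it does not appear in the output). The function names are made up for illustration.

```shell
#!/bin/bash
# Sketch: sanity-check an HPL result line.
# Function names are hypothetical; the flop count formula
# 2/3*N^3 + 3/2*N^2 is assumed, not taken from the output above.

# Gflops -> Mflops
gflops_to_mflops() {
    awk -v g="$1" 'BEGIN { printf "%.3f\n", g * 1000 }'
}

# Floating point operations assumed for a problem of size N
hpl_flops() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", (2 / 3) * n * n * n + 1.5 * n * n }'
}

# Run time in seconds implied by N and the reported Gflops
hpl_implied_seconds() {
    awk -v n="$1" -v g="$2" 'BEGIN {
        flops = (2 / 3) * n * n * n + 1.5 * n * n
        printf "%.4f\n", flops / (g * 1000000000)
    }'
}
```

For the line above, `gflops_to_mflops 1.919e-03` prints 1.919 and `hpl_flops 35` prints 30420.8, so the whole run was about thirty thousand operations finished in a few hundredths of a second. With a problem that small the timing is probably dominated by startup and communication, so a larger N would likely give a more representative figure.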

All in all I have been pretty happy with Unicluster, and I continue to work with it and figure out all the cool stuff you can do with Grid Engine. One issue I have run into is that the included monitoring software hooks into Ganglia, but currently I am not seeing anything on the graphs. Luckily the guys at Univa UD run Grid.org and are fairly active in the forums. They have helped me out a couple of times now.

It’s fairly easy to get a cluster running (just time-consuming), but as I am now finding out, it is more difficult to keep it rolling and make life easy on myself while it runs. Here are a few tips I can share from my experience:

  • Shared partitions are key. I use shared home directories and a directory ‘/opt/share’ mounted on each machine. I initially wasn’t doing this and found it cumbersome to do anything that needed to be on each machine. Have applications and libraries that need to be installed on each machine? Install them on a shared partition and set the PATH to the appropriate location in your bashrc. This allows you to install and set it once for all machines.
  • Use SSH keys without passphrases. Life is easier without passwords.
  • Read the docs. MPICH2, Unicluster and SGE all have pretty good docs and decent examples to get you rolling.
  • Understand how to configure SGE queues and how they work. Here’s a starter.
  • Script everything! Script your jobs, MPI startup and anything else you do on a regular basis.
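
The first two tips can be sketched as a bashrc fragment. The path_prepend helper is a hypothetical name, the two directories are the shared install locations used in these posts, and the key setup is left as comments because it only needs to be done once.

```shell
#!/bin/bash
# Sketch of a ~/.bashrc fragment for a cluster with shared home directories.
# path_prepend is a hypothetical helper name.

# Put a directory at the front of PATH unless it is already in there
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
}

# Software installed once on the shared partition
path_prepend /opt/share/mpich2/bin
path_prepend /opt/share/hadoop/current/bin
export PATH

# One-time setup of a passphrase-less key. With a shared home
# directory, authorizing it once covers every node:
#   ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
#   cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
#   chmod 600 ~/.ssh/authorized_keys
```

Because the home directory is shared, this one file takes effect on every node, which is the whole point of the shared partition tip.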

Lastly, there are likely things I failed to mention in this article so if you have a question or there is something missing just comment on this post and I will do my best to answer.

April 13, 2008

Grape Drink.
