May 30, 2008

gexec Success!

I was finally able to get a clean build of gexec with the ‘--with-ganglia’ option. Here’s what I did:

I downloaded the tarball available at http://therealms.org/oss/ganglia/gexec-0.3.8.tar.gz (thanks to Bernard on the Ganglia mailing list). Then I ran:

rpmbuild -tb gexec-0.3.8.tar.gz

This created an RPM and an SRPM. The RPM can be deleted; I installed the SRPM, which should be located at ‘/usr/src/redhat/SRPMS/gexec-0.3.8-4.src.rpm’. I then edited the SPEC file ‘/usr/src/redhat/SPECS/gexec.spec’, removing ‘%configure’ and adding the following line below ‘%build’ but above ‘make’:

./configure --with-ganglia --host=x86_64-redhat-linux-gnu --build=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info
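
For anyone who would rather script that SPEC edit than do it by hand, here is a minimal sketch. The helper name replace_configure is made up for illustration, and it assumes ‘%configure’ sits on a line of its own. On an RPM-based system, rpm --eval '%configure' prints what the macro expands to, which is a handy starting point for the explicit line.

```shell
# Hypothetical helper (not part of rpm or gexec): print a copy of a spec file
# with the %configure macro line swapped for an explicit configure command.
# Usage: replace_configure <spec-file> <replacement-line>
replace_configure() {
    spec=$1
    configure_line=$2
    awk -v repl="$configure_line" '
        /^[ \t]*%configure/ { print repl; next }  # drop the macro, emit our line
        { print }                                 # pass everything else through
    ' "$spec"
}
```

Because the result goes to stdout, you can redirect it to a new file and compare it with the original SPEC before swapping it in.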

Next, extract the tarball at ‘/usr/src/redhat/SOURCES/gexec-0.3.8.tar.gz’. Edit ‘configure.ac’ to use ‘AC_PREFIX_DEFAULT(/usr)’ rather than ‘AC_PREFIX_DEFAULT(/usr/local)’. Then change GANGLIA_LIB to use ‘/usr/lib/libganglia.a’ rather than ‘@libdir@/libganglia.a’. I also edited the Makefile to use ‘/usr/lib/libganglia.a’ rather than ‘@libdir@/libganglia.a’ in a couple of spots. Then rename gexec-0.3.8.tar.gz to gexec-0.3.8.tar.gz.OLD and run ‘tar zcvf gexec-0.3.8.tar.gz gexec-0.3.8’ to create a new tarball with the changes just made. At this point one can build and install the new RPM by running:

rpmbuild -ba /usr/src/redhat/SPECS/gexec.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/gexec-0.3.8-4.x86_64.rpm
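
The tarball edits described above can be scripted roughly as follows. This is a sketch under assumptions: repack_gexec and edit_file are names I made up, and rather than guessing which files mention ‘@libdir@/libganglia.a’ it simply greps the extracted tree for them.

```shell
# Hypothetical helper: rewrite a file through sed without relying on "sed -i".
edit_file() {
    sed "$1" "$2" > "$2.tmp" && mv "$2.tmp" "$2"
}

# Hypothetical helper: apply the source edits and rebuild the tarball.
# Usage: repack_gexec /usr/src/redhat/SOURCES
# The body runs in a subshell, so the caller's working directory is untouched.
repack_gexec() (
    name=gexec-0.3.8
    cd "$1" || exit 1
    tar zxf "$name.tar.gz" || exit 1

    # Default the install prefix to /usr instead of /usr/local.
    edit_file 's|AC_PREFIX_DEFAULT(/usr/local)|AC_PREFIX_DEFAULT(/usr)|' \
        "$name/configure.ac" || exit 1

    # Point every reference to the static library at /usr/lib.
    files=$(grep -rl '@libdir@/libganglia.a' "$name")
    for f in $files; do
        edit_file 's|@libdir@/libganglia.a|/usr/lib/libganglia.a|g' "$f" || exit 1
    done

    # Keep the pristine tarball around, then pack the edited tree.
    mv "$name.tar.gz" "$name.tar.gz.OLD" || exit 1
    tar zcf "$name.tar.gz" "$name"
)
```

Running repack_gexec /usr/src/redhat/SOURCES before the rpmbuild step should leave the same two tarballs in place as the manual procedure.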

I have made my SRPM available, you can download it here.

May 29, 2008

Is It You Or Me Ganglia?

So I began building a new head cluster node in a KVM, just as a test run and to refine my methodology. I decided to drop Unicluster due to an unresolved issue; this time around I decided to install everything myself. … Java, check … Hadoop, check … Pig, check … Grid Engine, check … OpenMPI, check … Ganglia, ugh …

Ganglia seems to be an interesting beast. I built the SRPMs and then installed the RPMs for the “ganglia monitor core” without a problem; it was easy and quick. I then moved on to the “gexec execution environment”, which includes gexec, gexecd, authd and libe.

The first issue I ran into in building from the SRPM was the dependencies. First, I started with authd and ran into dependency issues during the build. Sadly the SPEC file did not list what the package requires. I attempted the normal RPMs (found on Ganglia’s SourceForge page). Even those didn’t work properly due to a requirement for some old OpenSSL libraries unavailable in CentOS 5.

[root@m ganglia]# rpm -qa | grep openssl
openssl-devel-0.9.8b-8.3.el5_0.2
openssl-0.9.8b-8.3.el5_0.2
openssl-devel-0.9.8b-8.3.el5_0.2
openssl-0.9.8b-8.3.el5_0.2
[root@m ganglia]# rpm -ivh authd-0.2.1-1.i386.rpm
error: Failed dependencies:
libcrypto.so.2 is needed by authd-0.2.1-1.i386
libssl.so.2 is needed by authd-0.2.1-1.i386

So I went back to attempting to build the SRPM. Soon I found out that the above libraries had nothing to do with the build issues I was seeing. My issue was that the libe library was missing. Once I built and installed that, authd built and installed without a problem.

Next, I attempted to build gexec. This proved to have the same issue as authd: the SPEC in the SRPM did not include any requires, making it difficult to determine what needs to be installed as a prerequisite. I then started to investigate the errors I was seeing in the build:

gexec.c:39:33: error: ganglia/gexec_funcs.h: No such file or directory

Googling for this, I found a Ganglia Developers mailing list entry explaining that:

The gexec-0.3.6 available from http://www.theether.org/gexec does not
build with 3.0.* versions of Ganglia. It builds correctly only with 2.*
versions. If you want to build with Ganglia 3, edit the gexec.c to include
/usr/include/ganglia.h and not /usr/include/ganglia/gexec_funcs.h. Of
course, you have to have ganglia-devel installed for this to work. Another
thing, in addition to the above, you have to add #include to
gexec.c in order to successfully build the gexec.
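
The first edit from that email can be sketched with sed. The helper name use_ganglia3_header is hypothetical, and I am assuming the include in gexec.c is written as ‘ganglia/gexec_funcs.h’ (which is what the compiler error above reports). The second include the email mentions is not named in the text, so it is left out here.

```shell
# Hypothetical helper: print gexec.c with the Ganglia 2.x include replaced by
# the Ganglia 3 header. Handles both <...> and "..." include styles.
# Usage: use_ganglia3_header gexec.c > gexec.c.new
use_ganglia3_header() {
    sed 's|^\(#include[[:space:]]*[<"]\)ganglia/gexec_funcs\.h\([>"]\)|\1ganglia.h\2|' "$1"
}
```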

That works, so I edited the gexec.c in the source tarball to include the above changes. My next build attempt failed because the ‘e/llist.h’ include did not exist. ‘locate’ confirmed that it was not on my machine even though libe was installed. So I went back to that mailing list post and found this link:

http://svn.oscar.openclustergroup.org/svn/oscar-soc/soc-2006/hpcmetrics/ganglia/

Looking through the source I found http://svn.oscar.openclustergroup.org/svn/oscar-soc/soc-2006/hpcmetrics/ganglia/src/lib/llist.h and copied it into ‘/usr/include/e/’. This worked nicely, but as you might expect the build failed again, this time looking for libraries in ‘/lib’ rather than ‘/lib64’, which is to be expected since I am running x86_64. I symlinked the library into place and moved on.

Now I am at an error that I haven’t been able to figure out. My mailing list post describing the issue has not seen a reply.

gexec.c: In function ‘main’:
gexec.c:324: warning: ‘ips’ may be used uninitialized in this function
gcc -DHAVE_CONFIG_H -I. -I. -I. -I. -O2 -Wall -D_REENTRANT -g
-D_GNU_SOURCE -DDEBUG -c gexec_options.c
gcc -O2 -Wall -D_REENTRANT -g -D_GNU_SOURCE -DDEBUG -o gexec -L.
gexec.o gexec_options.o -lpthread -lgexec -le -lauth -lssl -lcrypto
/usr/lib/libganglia.a -lssl -lpthread -lcrypto
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x10c): undefined reference to `XML_ParserCreate'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x160): undefined reference to `XML_SetElementHandler'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x16b): undefined reference to `XML_SetUserData'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x178): undefined reference to `XML_GetBuffer'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x1c4): undefined reference to `XML_ParserFree'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x1f6): undefined reference to `XML_ParseBuffer'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x265): undefined reference to `XML_GetErrorCode'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x26c): undefined reference to `XML_ErrorString'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x277): undefined reference to `XML_GetCurrentLineNumber'
collect2: ld returned 1 exit status
make: *** [gexec] Error 1

After a bit of Googling, I found that these XML functions belong to expat. I installed expat-devel (as well as a number of other XML devel packages) and attempted to rebuild. Same thing, failure. Next, since the errors point at libganglia.a, I figured that perhaps it had not been built with expat support and needed to be rebuilt, so I rebuilt it with expat-devel installed. The gexec build still failed with the same error. After looking at the docs I noticed that the Ganglia SPEC file does not include ‘--enable-gexec’ in the configure step. I built the RPMs with this option and still ran into the error. I have attempted to build gexec from the SRPM as well as from straight source. In every case I get the above error. The error (“collect2: ld returned 1 exit status”) suggests to me that there is a library (or libraries) missing. But at this point I’m not really sure at all. If I come up with something (outside of running gexec in standalone mode) I will be sure to post it. If anyone else out there knows what’s up, post a comment.
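
One way to narrow down a link failure like this is to boil the linker output down to the list of missing symbols and then look for a library that defines them. Below is a small sketch; undefined_symbols is a made-up helper name, and it assumes the classic GNU ld message format shown above.

```shell
# Hypothetical helper: read linker output on stdin and print each symbol named
# in an "undefined reference" message, once, in sorted order.
# Usage: make 2>&1 | undefined_symbols
undefined_symbols() {
    sed -n "s/.*undefined reference to [\`']\([A-Za-z_][A-Za-z0-9_]*\)'.*/\1/p" | sort -u
}
```

With the symbol names in hand, something like nm -D --defined-only on a candidate shared library (for example libexpat.so, wherever it lives on your system) shows whether that library provides them. Keep in mind that when linking against a static archive such as libganglia.a, GNU ld resolves symbols left to right, so a library that satisfies the archive has to come after it on the link line. Whether adding -lexpat after libganglia.a would have fixed this particular build is a guess on my part; I have not tested it.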

This all leads me to the point of this post, which is … why is setting this up so difficult? Truth be told I have no clue, but I don’t think it should be. The Ganglia mailing list was helpful enough, but the documentation seems a little lacking should one run into any issues. One would think that if “The gexec-0.3.6 available from http://www.theether.org/gexec does not build with 3.0.* versions of Ganglia,” this should be documented. I don’t think that I am doing anything strange, and I am using CentOS 5, not some obscure distro.

You may be asking what all these problems with gexec have to do with Ganglia (a guy on the mailing list asked me just that: “What does this have to do with ganglia?”). Fair enough. Ganglia is not gexec and gexec is not Ganglia. My response was that the gexec SRPMs are downloadable side by side with all the Ganglia RPMs on SourceForge. This leads me to believe that questions to the Ganglia mailing list about gexec are not too far off base. Additionally, for someone who is trying to install these packages for the first time or is new to Ganglia, it seems that the mailing list would be the place to ask, as I imagine there are plenty of folks running gexec hosts in Ganglia. The Ganglia documentation even says of gexec that “integrating it with ganglia is a bit clumsy” but provides no information beyond how to run it in standalone mode and how to turn it off if you have configured it to be on by default. To boot, the gexec site hasn’t been updated since 2004.

Next, you may think that if this is broken and the documentation sucks, why don’t you fix it? It’s an open source project. That’s valid, and I will be happy to write up some documentation on how to build the RPMs for Ganglia and associated applications. For good measure I will even see if I can get it posted to the Ganglia wiki. Of course this hinges on me actually being able to build the RPMs and have everything work properly.

Lastly, here are a few lessons learned:

  • Something I learn time and time again, don’t assume anything.
  • Any time you create SRPMs make sure you add the “BuildRequires” directive. This alone would have likely solved my issue with gexec after I modified gexec.c, or at least would have pointed me in the right direction.
  • If source code modifications are required, or there are any other oddities in building an application, document them. Simply saying something is clunky or unintuitive is not enough.
  • If you have a software product you would like other people to use, provide installation procedures. Having install docs is almost as good as having a marketing team. If people find it easy to install and are happy with it they will tell others (example: WordPress).
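
On the “BuildRequires” point, a packager could run a quick sanity check along these lines before publishing an SRPM. It is only a sketch: has_buildrequires is a name I invented, and it checks for nothing more than the presence of at least one BuildRequires line.

```shell
# Hypothetical helper: succeed if the spec file declares at least one
# BuildRequires tag; otherwise print a warning to stderr and fail.
# Usage: has_buildrequires /usr/src/redhat/SPECS/gexec.spec
has_buildrequires() {
    if grep -q '^[[:space:]]*BuildRequires[[:space:]]*:' "$1"; then
        return 0
    fi
    echo "warning: $1 declares no BuildRequires" >&2
    return 1
}
```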

That’s it for my rant. Thanks. :)

May 23, 2008

More Hadoop, Grid Engine Goodness.

Over at GridEngine.info they found a link on DanT’s Sun blog that has a sweet tutorial on setting up Hadoop using SGE’s parallel environments with loose integration.

Here we are relying on master node to start othe daemons ( [rs]sh the machine and start daemons) and distribute jobs , and we donot have control on the TaskTracker threads. This way of setting a pe in Grid Engine is called loose-integration

With some more effort one could also achieve a tighter integration wherein the task of starting daemons and tasks on other slaves could be done by SGE. But this would require further understanding of Hadoop internals.

Pretty dope.

Using Pig with Hadoop.

Pig is a query language for use with Hadoop. It allows users to query Hadoop data much as they would a SQL database. Formally, according to their website:

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.

To get rolling you need the following:

  • A Java SDK Installed
  • Ant Installed
  • Subversion
  • A working installation of Hadoop

Once you are rolling with those items we can install Pig and test it out.

First, you need to download Pig from their Subversion repository. Once done you will need to build it with Ant.

svn co http://svn.apache.org/repos/asf/incubator/pig/trunk pig-svn
cd pig-svn
ant

From there you can run the following command to drop into the interactive shell.

java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main

Or you can run a pig script that you have already created.

java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main somescript.pig

HADOOPSITEPATH needs to point to the directory that contains the hadoop-site.xml file.
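
If you get tired of typing the classpath, a small wrapper can build it and check the directory first. This is a sketch only: pig_classpath and run_pig are names I made up, and it assumes pig.jar sits in the current directory as it does after the ant build.

```shell
# Hypothetical helper: print the classpath for Pig, after checking that the
# given directory really contains hadoop-site.xml.
pig_classpath() {
    conf_dir=$1
    if [ ! -f "$conf_dir/hadoop-site.xml" ]; then
        echo "error: no hadoop-site.xml in $conf_dir" >&2
        return 1
    fi
    printf '%s\n' "pig.jar:$conf_dir"
}

# Hypothetical wrapper: run_pig <hadoop-conf-dir> [script.pig]
# With no script argument it drops into the interactive grunt shell.
run_pig() {
    cp=$(pig_classpath "$1") || return 1
    shift
    java -cp "$cp" org.apache.pig.Main "$@"
}
```

The early check turns a wrong path into an immediate, readable error instead of whatever Pig reports once it fails to find its configuration.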

If you run into an issue such as:

Caused by: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.dfs.ClientProtocol version mismatch. (client = 29, server = 23)

You will need to upgrade Hadoop so the versions match.

In the end you should get something that looks like this:

[cluster@front pig-svn]$ java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main
2008-05-23 10:37:42,478 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: front.esper:9000
2008-05-23 10:37:42,585 [main] WARN org.apache.hadoop.fs.FileSystem - "front.esper:9000" is a deprecated filesystem name. Use "hdfs://front.esper:9000/" instead.
2008-05-23 10:37:43,117 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: front.esper:9001
2008-05-23 10:37:43,246 [main] WARN org.apache.hadoop.fs.FileSystem - "front.esper:9000" is a deprecated filesystem name. Use "hdfs://front.esper:9000/" instead.
grunt>

If you need more info on the above steps check out the Pig Wiki.

From here you can follow their tutorial or play around in the shell. Regarding the tutorial, I can’t seem to find the download of the archive they mention, “Pig tutorial file (*.gz)”. If anyone knows where that can be found, let me know and I will post it.

April 30, 2008

Map Reduce and MPI.

Over at GridGuru they have an interesting article regarding MapReduce and its applications. The MapReduce crowd has been growing of late and is outspoken about what a great tool it is. Without a doubt it is, but something I learned a long time ago is that for each job there is a correct tool. You don’t use a sledgehammer to fix your watch and you don’t use a pair of tweezers for demolition.

I am a skeptic, which is not to say I have anything against a generalized framework for distributing data to a large number of processors. Nor does it imply that I enjoy MPI and its coherence arising from cacophonous chatter (if all goes well). I just don’t think MapReduce is particularly “simple”. The key promoters of this algorithm such as Yahoo and Google have serious-experts MapReducing their particular problem sets and thus they make it look easy.

Sadly this implies that processing data in parallel is still hard, no matter how good a programmer you are or how sophisticated your programming language is.