May 31, 2008

NVIDIA GPU/CUDA Based Supercomputer.

Check out this sweet machine that the University of Antwerp built.

[Embedded YouTube video]

May 30, 2008

Ratatat.

I posted this video of the new Ratatat track over at musicsucks.net. It’s so flippin’ good I’ll post it here too.

[Embedded YouTube video]

Ganglia, gexec, authd and libe Install Procedure.

Install Ganglia

wget http://voxel.dl.sourceforge.net/sourceforge/ganglia/ganglia-3.0.7-1.src.rpm
rpm -Uhv http://apt.sw.be/redhat/el5/en/i386/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.i386.rpm
yum install libpng-devel libart_lgpl-devel rrdtool-devel freetype-devel
rpmbuild --rebuild ganglia-3.0.7-1.src.rpm
rpm -ivh /usr/src/redhat/RPMS/x86_64/ganglia-gmetad-3.0.7-1.x86_64.rpm /usr/src/redhat/RPMS/x86_64/ganglia-gmond-3.0.7-1.x86_64.rpm /usr/src/redhat/RPMS/x86_64/ganglia-devel-3.0.7-1.x86_64.rpm

Install libe

wget http://www.theether.org/libe/libe-0.3.0-1.src.rpm
rpmbuild --rebuild libe-0.3.0-1.src.rpm
rpm -ivh /usr/src/redhat/RPMS/x86_64/libe-0.3.0-1.x86_64.rpm

Install authd

yum install openssl-devel
wget http://www.theether.org/authd/authd-0.2.2-1.src.rpm
rpmbuild --rebuild authd-0.2.2-1.src.rpm

You will run into an error like the following. Don’t worry about it; we clean it up next.

Installing authd-0.2.2-1.src.rpm
warning: user bnc does not exist - using root
warning: group dusers does not exist - using root
error: Legacy syntax is unsupported: copyright
error: line 5: Unknown tag: Copyright: GPL

Finish up authd

mv /usr/src/redhat/SPECS/authd.spec /usr/src/redhat/SPECS/authd.spec.1
sed 's/Copyright/License/g' /usr/src/redhat/SPECS/authd.spec.1 > /usr/src/redhat/SPECS/authd.spec
rpmbuild -ba /usr/src/redhat/SPECS/authd.spec
openssl genrsa -out auth_priv.pem
chmod 600 auth_priv.pem
openssl rsa -in auth_priv.pem -pubout -out auth_pub.pem

Copy auth_priv.pem and auth_pub.pem to ‘/etc’ on each node of the cluster

rpm -ivh /usr/src/redhat/RPMS/x86_64/authd-0.2.2-1.x86_64.rpm
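The key-copy step above can be scripted. Here is a minimal sketch, assuming password-less root SSH to every node; the function name, the DRY_RUN switch and the node names in the usage line are all made up for illustration and are not part of authd.

```shell
#!/bin/sh
# Copy the authd key pair into /etc on each node given as an argument.
# Assumes root SSH access to the nodes; node names are placeholders.
# Set DRY_RUN=1 to print the scp commands instead of running them.
distribute_keys() {
    for node in "$@"; do
        for key in auth_priv.pem auth_pub.pem; do
            if [ "${DRY_RUN:-0}" = "1" ]; then
                echo "scp -p $key root@$node:/etc/$key"
            else
                # -p preserves the 600 mode set on the private key
                scp -p "$key" "root@$node:/etc/$key" || return 1
            fi
        done
    done
}

# Example (hypothetical hosts):
#   DRY_RUN=1 distribute_keys node01 node02
```

Run it from the directory that holds the two .pem files, and do a dry run first to check the host list.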

Installing gexec (using my SRPM, which includes the ‘--with-ganglia’ option)

echo "gexec 2875/tcp # Caltech GEXEC" >> /etc/services
yum install glibc gcc gcc-c++ authd expat-devel
rpm -ivh /usr/src/redhat/RPMS/x86_64/gexec-0.3.8-4.x86_64.rpm
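One caveat with the echo line above: running it a second time appends a duplicate entry to /etc/services. A minimal sketch of a guarded version follows; the function name and the optional file argument are my own additions so that it can be tried against a scratch file first.

```shell
#!/bin/sh
# Append the gexec service entry only if no gexec line is present yet.
# The first argument is the services file (defaults to /etc/services).
add_service_entry() {
    services_file=${1:-/etc/services}
    if ! grep -q '^gexec[[:space:]]' "$services_file"; then
        echo "gexec 2875/tcp # Caltech GEXEC" >> "$services_file"
    fi
}

# Example:
#   add_service_entry              # edits /etc/services
#   add_service_entry /tmp/test    # edits a scratch copy
```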

gexec Success!

I was finally able to get a clean build of gexec with the ‘--with-ganglia’ option. Here’s what I did:

I downloaded the tarball available at http://therealms.org/oss/ganglia/gexec-0.3.8.tar.gz (thanks to Bernard on the Ganglia mailing list) and then ran:

rpmbuild -tb gexec-0.3.8.tar.gz

This created an RPM and an SRPM. The RPM can be deleted; I installed the SRPM, which should be located at ‘/usr/src/redhat/SRPMS/gexec-0.3.8-4.src.rpm’. I then edited the SPEC file ‘/usr/src/redhat/SPECS/gexec.spec’, removing ‘%configure’ and adding the following line below the ‘%build’ line but above the ‘make’ line.

./configure --with-ganglia --host=x86_64-redhat-linux-gnu --build=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info

Next, extract the tarball at ‘/usr/src/redhat/SOURCES/gexec-0.3.8.tar.gz’. Edit ‘configure.ac’ to use ‘AC_PREFIX_DEFAULT(/usr)’ rather than ‘AC_PREFIX_DEFAULT(/usr/local)’, then change GANGLIA_LIB to use ‘/usr/lib/libganglia.a’ rather than ‘@libdir@/libganglia.a’. I also edited the Makefile to use ‘/usr/lib/libganglia.a’ rather than ‘@libdir@/libganglia.a’ in a couple of spots. Then rename gexec-0.3.8.tar.gz to gexec-0.3.8.tar.gz.OLD and run ‘tar zcvf gexec-0.3.8.tar.gz gexec-0.3.8’ to create a new tarball with the changes just made. At this point one can build and install the new RPM by running:

rpmbuild -ba /usr/src/redhat/SPECS/gexec.spec
rpm -ivh /usr/src/redhat/RPMS/x86_64/gexec-0.3.8-4.x86_64.rpm
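For anyone repeating the tarball edits by hand, they can be rolled into one function. This is only a sketch of the steps described above, not the exact commands I ran. The function name is made up, and the Makefile file names it loops over (Makefile.in, Makefile.am, Makefile) are a guess at where the ‘@libdir@/libganglia.a’ references live, so check the result before building.

```shell
#!/bin/sh
# Repack gexec-0.3.8.tar.gz with the prefix and libganglia.a path changes.
# The first argument is the directory holding the tarball
# (defaults to /usr/src/redhat/SOURCES). Runs in a subshell so the
# caller's working directory is left alone.
repack_gexec() (
    sources_dir=${1:-/usr/src/redhat/SOURCES}
    name=gexec-0.3.8
    cd "$sources_dir" || exit 1
    tar zxf "$name.tar.gz" || exit 1

    # Default install prefix: /usr/local -> /usr
    sed 's|AC_PREFIX_DEFAULT(/usr/local)|AC_PREFIX_DEFAULT(/usr)|' \
        "$name/configure.ac" > "$name/configure.ac.tmp" || exit 1
    mv "$name/configure.ac.tmp" "$name/configure.ac" || exit 1

    # Hard-code the static library path wherever @libdir@ was used.
    # Which Makefile variants exist in the tarball is an assumption.
    for f in configure.ac Makefile.in Makefile.am Makefile; do
        if [ -f "$name/$f" ]; then
            sed 's|@libdir@/libganglia.a|/usr/lib/libganglia.a|g' \
                "$name/$f" > "$name/$f.tmp" || exit 1
            mv "$name/$f.tmp" "$name/$f" || exit 1
        fi
    done

    # Keep the original tarball and build a new one from the edited tree
    mv "$name.tar.gz" "$name.tar.gz.OLD" || exit 1
    tar zcf "$name.tar.gz" "$name"
)

# Example:
#   repack_gexec /usr/src/redhat/SOURCES
```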

I have made my SRPM available; you can download it here.

May 29, 2008

Is It You Or Me Ganglia?

So I began building a new cluster head node in a KVM virtual machine, just as a test run and to refine my methodology. I decided to drop Unicluster due to an unresolved issue, so this time around I installed everything myself. … Java, check … Hadoop, check … Pig, check … Grid Engine, check … OpenMPI, check … Ganglia, ugh …

Ganglia seems to be an interesting beast. I built the SRPMs and then installed the RPMs for the “ganglia monitor core” without a problem; it was easy and quick. I then moved on to the “gexec execution environment”, which includes gexec, gexecd, authd and libe.

The first issue I ran into in building from the SRPMs was the dependencies. I started with authd and ran into dependency issues during the build. Sadly, the SPEC file did not list what the package requires. I then tried the prebuilt RPMs (found on Ganglia’s SourceForge page). Even those didn’t work properly, because they require some old OpenSSL libraries that are unavailable in CentOS 5.

[root@m ganglia]# rpm -qa | grep openssl
openssl-devel-0.9.8b-8.3.el5_0.2
openssl-0.9.8b-8.3.el5_0.2
openssl-devel-0.9.8b-8.3.el5_0.2
openssl-0.9.8b-8.3.el5_0.2
[root@m ganglia]# rpm -ivh authd-0.2.1-1.i386.rpm
error: Failed dependencies:
libcrypto.so.2 is needed by authd-0.2.1-1.i386
libssl.so.2 is needed by authd-0.2.1-1.i386

So I went back to attempting to build the SRPM. I soon found out that the above libraries had nothing to do with the build issues I was seeing. My issue was that the libe library was missing. Once I built and installed that, authd built and installed without a problem.

Next, I attempted to build gexec. This proved to have the same issue as authd: the SPEC file in the SRPM did not declare any requirements, making it difficult to determine what needs to be installed as a prerequisite. I then started to investigate the errors I was seeing in the build:

gexec.c:39:33: error: ganglia/gexec_funcs.h: No such file or directory

Googling for this, I found a post on the Ganglia developers mailing list that explained:

The gexec-0.3.6 available from http://www.theether.org/gexec does not
build with 3.0.* versions of Ganglia. It builds correctly only with 2.*
versions. If you want to build with Ganglia 3, edit the gexec.c to include
/usr/include/ganglia.h and not /usr/include/ganglia/gexec_funcs.h. Of
course, you have to have ganglia-devel installed for this to work. Another
thing, in addition to the above, you have to add #include to
gexec.c in order to successfully build the gexec.

That worked, so I rebuilt the source tarball with a gexec.c that included the above changes. My next build attempt failed because the ‘e/llist.h’ include did not exist. ‘locate’ confirmed that it did not exist on my machine even though libe was installed. So I went back to that mailing list post and found this link:

http://svn.oscar.openclustergroup.org/svn/oscar-soc/soc-2006/hpcmetrics/ganglia/

Looking through the source I found http://svn.oscar.openclustergroup.org/svn/oscar-soc/soc-2006/hpcmetrics/ganglia/src/lib/llist.h and copied it into ‘/usr/include/e/’. This worked nicely, but as you might expect the build failed again. This time it was looking for libraries in ‘/lib’ rather than ‘/lib64’, which is to be expected since I am running x86_64. I symlinked the library into place and moved on.
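The header swap suggested in the mailing list post can be done with sed rather than by hand. A minimal sketch follows; the function name is made up, and I am assuming the include path appears in gexec.c exactly as the compiler error prints it (‘ganglia/gexec_funcs.h’). It does not add the second include the post mentions, since the header name did not survive in the quoted text.

```shell
#!/bin/sh
# Point gexec.c at ganglia.h instead of ganglia/gexec_funcs.h.
# The first argument is the path to gexec.c (defaults to ./gexec.c).
patch_gexec_include() {
    src=${1:-gexec.c}
    # Write to a temporary file, then replace the original on success
    sed 's|ganglia/gexec_funcs\.h|ganglia.h|' "$src" > "$src.new" &&
        mv "$src.new" "$src"
}

# Example:
#   patch_gexec_include gexec-0.3.6/gexec.c
```

ganglia-devel has to be installed for the new include to resolve, as the post says.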

Now I am at an error that I haven’t been able to figure out. My mailing list post describing the issue has not seen a reply.

gexec.c: In function ‘main’:
gexec.c:324: warning: ‘ips’ may be used uninitialized in this function
gcc -DHAVE_CONFIG_H -I. -I. -I. -I. -O2 -Wall -D_REENTRANT -g
-D_GNU_SOURCE -DDEBUG -c gexec_options.c
gcc -O2 -Wall -D_REENTRANT -g -D_GNU_SOURCE -DDEBUG -o gexec -L.
gexec.o gexec_options.o -lpthread -lgexec -le -lauth -lssl -lcrypto
/usr/lib/libganglia.a -lssl -lpthread -lcrypto
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x10c): undefined reference to `XML_ParserCreate'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x160): undefined reference to `XML_SetElementHandler'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x16b): undefined reference to `XML_SetUserData'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x178): undefined reference to `XML_GetBuffer'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x1c4): undefined reference to `XML_ParserFree'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x1f6): undefined reference to `XML_ParseBuffer'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x265): undefined reference to `XML_GetErrorCode'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x26c): undefined reference to `XML_ErrorString'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x277): undefined reference to `XML_GetCurrentLineNumber'
collect2: ld returned 1 exit status
make: *** [gexec] Error 1

After a bit of Googling, I found that these XML functions belong to expat. I installed expat-devel (as well as a number of other XML devel packages) and attempted to rebuild. Same thing: failure. Next, since the errors point at libganglia.a, I suspected that it had been built without expat support and needed to be rebuilt, so I rebuilt it with expat-devel installed. The gexec build still failed with the same error. After looking at the documentation, I noticed that the ganglia SPEC file does not include ‘--enable-gexec’ in the configure step. I built the RPMs with this option and still ran into the error. I have attempted to build gexec from the SRPM as well as from straight source; in every case I get the above error. The linker failure (“collect2: ld returned 1 exit status”) suggests to me that a library (or libraries) is missing from the link, but at this point I’m not really sure. If I come up with something (outside of running gexec in standalone mode) I will be sure to post it. If anyone else out there knows what’s up, post a comment.
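When chasing linker errors like these, it helps to boil the log down to the list of missing symbols and then check which library defines them. Here is a minimal sketch; the function name and the build.log file are made up, it assumes the linker prints ASCII `name' quoting, and the libexpat path in the usage comment is a guess for an x86_64 box.

```shell
#!/bin/sh
# Print the unique symbols named in "undefined reference to" lines
# of a saved build log, one per line.
undefined_symbols() {
    sed -n "s/.*undefined reference to \`\([^']*\)'.*/\1/p" "$1" | sort -u
}

# Example:
#   make 2>&1 | tee build.log
#   undefined_symbols build.log
#   # Then check whether a candidate library defines one of them
#   # (library path is an assumption):
#   nm -D --defined-only /usr/lib64/libexpat.so | grep XML_ParserCreate
```

For what it is worth, the failing gcc line in the log above contains no ‘-lexpat’. Whether adding it would be enough is untested here.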

This all leads me to the point of this post, which is … why is setting this up so difficult? Truth be told I have no clue, but I don’t think it should be. The Ganglia mailing list was helpful enough, but the documentation seems a little lacking should one run into any issues. One would think that if “The gexec-0.3.6 available from http://www.theether.org/gexec does not build with 3.0.* versions of Ganglia”, this should be documented. I don’t think that I am doing anything strange, and I am using CentOS 5, not some obscure distro.

You may be asking what all these problems with gexec have to do with Ganglia (a guy on the mailing list asked me just that: “What does this have to do with ganglia?”). Fair enough. Ganglia is not gexec and gexec is not Ganglia. My response was that the gexec SRPMs are downloadable side by side with all the Ganglia RPMs on SourceForge. This leads me to believe that questions to the Ganglia mailing list about gexec aren’t too far off base. Additionally, for someone who is trying to install these packages for the first time or is new to Ganglia, the mailing list seems like the place to ask, as I imagine there are plenty of folks running gexec hosts in Ganglia. The Ganglia documentation even says of gexec that “integrating it with ganglia is a bit clumsy”, but provides no information beyond how to run it in standalone mode and how to turn it off if you have configured it to be on by default. To boot, the gexec site hasn’t been updated since 2004.

Next, you may think that if this is broken and the documentation sucks, why don’t I fix it? It’s an open source project. That’s valid, and I will be happy to write up some documentation on how to build the RPMs for Ganglia and associated applications. For good measure I will even see if I can get it posted to the Ganglia wiki. Of course, this hinges on me actually being able to build the RPMs and have everything work properly.

Lastly, here are a few lessons learned:

  • Something I learn time and time again: don’t assume anything.
  • Any time you create SRPMs, make sure you add the “BuildRequires” directive. This alone would likely have solved my issue with gexec after I modified gexec.c, or at least would have pointed me in the right direction.
  • If source code modifications are required, or there are any other oddities in building an application, document them. Simply saying something is clunky or unintuitive is not enough.
  • If you have a software product you would like other people to use, provide installation procedures. Having install docs is almost as good as having a marketing team. If people find it easy to install and are happy with it, they will tell others (example: WordPress).
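On the “BuildRequires” point, it is quick to check what a SPEC file declares before starting a build. A minimal sketch, with a made-up function name; it only matches the exact ‘BuildRequires:’ spelling at the start of a line.

```shell
#!/bin/sh
# List the build dependencies a SPEC file declares, one line per
# BuildRequires tag. Empty output means the SPEC declares none.
spec_build_requires() {
    sed -n 's/^BuildRequires:[[:space:]]*//p' "$1"
}

# Example:
#   spec_build_requires /usr/src/redhat/SPECS/gexec.spec
```

If this prints nothing for a package that clearly needs devel headers, expect to find the prerequisites the hard way.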

That’s it for my rant. Thanks. :)