<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Joe's Blog! &#187; Clustering</title>
	<atom:link href="http://www.joeandmotorboat.com/category/clustering/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.joeandmotorboat.com</link>
	<description></description>
	<lastBuildDate>Wed, 28 Jul 2010 03:00:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Beyond BigData.</title>
		<link>http://www.joeandmotorboat.com/2010/05/31/beyond-bigdata/</link>
		<comments>http://www.joeandmotorboat.com/2010/05/31/beyond-bigdata/#comments</comments>
		<pubDate>Mon, 31 May 2010 16:54:23 +0000</pubDate>
		<dc:creator>joe</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Dev]]></category>
		<category><![CDATA[Operations]]></category>

		<guid isPermaLink="false">http://www.joeandmotorboat.com/?p=979</guid>
		<description><![CDATA[BigData is a big deal. It&#8217;s changing how we look at data and analytics, but it isn&#8217;t the end. What are the enablers of BigData? First and foremost, cheap computing resources (CPU, disks, memory, bandwidth, etc) all thanks to Moore&#8217;s Law. Today even startups have the ability to afford huge amounts of computing power, the [...]]]></description>
			<content:encoded><![CDATA[<p>BigData is a big deal. It&#8217;s changing how we look at data and analytics, but it isn&#8217;t the end. What are the enablers of BigData? First and foremost, cheap computing resources (CPU, disks, memory, bandwidth, etc.), all thanks to <a href="http://en.wikipedia.org/wiki/Moore's_law">Moore&#8217;s Law</a>. Today even startups can afford huge amounts of computing power, the likes of which only the big boys could previously afford. Additionally, this has given rise to commodity hardware and cloud computing, which only furthers the proliferation of large amounts of cheap, quickly provisioned computing resources. Second, to apply all that power, we have open source data processing systems based on years of distributed systems research, like <a href="http://hadoop.apache.org/">Hadoop</a>, and many incarnations of <a href="http://en.wikipedia.org/wiki/Nosql">NoSQL</a>. The development of open source data processing systems has allowed the proliferation of systems that scale, which until recently only the highly capitalized could afford. These two things alone have allowed for the democratization of BigData. A guy in a garage can process terabytes of data with little more than a credit card and elbow grease.</p>
<p>With all these tools and recently acquired computing power, where are we going? Of course we can expect datasets to continue to grow, the computational complexity of our data processing to increase, and compute power to continue to rise (GPGPUs, multicore and so on). In addition, I anticipate the emergence of something I&#8217;m calling <em>NewData</em>. NewData will build on what we currently have with BigData, but will include some trends that are just beginning to take off. First, the development of ubiquitous public APIs (<a href="http://stochasticresonance.wordpress.com/2009/04/01/meatcloud-manifesto/">Meatcloud Manifesto</a>). Public APIs have yet to proliferate to all online systems. As a consequence, there is still a lot of screen scraping going on. With easily queryable and parseable datasets available through ubiquitous APIs, consuming the internet with machines becomes easier, making the application of BigData more powerful. <a href="http://developer.netflix.com/">Netflix</a> is a good example of this. Second, and similarly enabling, will be the development of standardized public datasets. Current datasets are generally hard to find and use; standardized dataset formats will make BigData analysis more productive, with less time wasted munging. <a href="http://www.data.gov/">Data.gov</a> is a start. These two developments are yet to be fully realized in current systems but will allow for the rise of NewData. As these developments begin to roll out, we will begin to see changes to how our BigData systems look. NewData systems will be less concerned with how big the data is and what it looks like, and will instead emphasize deriving more information from the data. <a href="http://techcrunch.com/2010/03/16/big-data-freedom/">Bradford Cross gets this</a>, and as a result <a href="http://flightcaster.com/">FlightCaster</a> is an early example of what I mean by <em>NewData</em>.</p>
<blockquote><p>The scale of data and computations is an important issue, but the data age is less about the raw size of your data, and more about the cool stuff you can do with it.</p></blockquote>
<p>Asking the right questions of the data is important, especially if you&#8217;re trying to do cool stuff. The <a href="http://freakonomics.blogs.nytimes.com/">Freakonomics</a> guys proved this a few times over. NewData will be about creating value from data, and asking the right questions is worth as much as the answers. The key enabler of this will be using newfound APIs and datasets to combine data from disparate sources in ways that BigData couldn&#8217;t, and to ask questions that we wouldn&#8217;t have thought to ask of BigData. Where BigData was about a handful of datasets at most, NewData will be about dozens of datasets. The mashup is the cornerstone of NewData.</p>
<p>That being said, we will need new systems to process this data and enable us to ask these questions. NewData analysis will need inter-process communication and collaboration. Currently, systems like Hadoop process data by splitting the data up and processing chunks in parallel on hundreds to thousands of machines. Processes are isolated from the other processes. This will continue, but NewData will require more from these systems to ask deeper questions. Complex inter-process communication will be needed to ask these questions. Think of the simplicity of writing Map/Reduce jobs, the robustness of Hadoop, the workflow and dataflow of <a href="http://www.cascading.org/">Cascading</a> and <a href="http://research.microsoft.com/en-us/projects/dryadlinq/">DryadLINQ</a>, respectively, and the power of a message passing system like <a href="http://en.wikipedia.org/wiki/Message_Passing_Interface">MPI</a>. These jobs will likely include large in-memory collaborative computations across thousands of machines. Where data locality was key in BigData, both data and memory-locality (<a href="http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access">NUMA/ccNUMA</a>) will be important in NewData.</p>
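<p>To make the split-and-process pattern concrete, here is a toy sketch in plain shell. It is not Hadoop and it does no inter-process communication; it only assumes standard tools (<code>split</code>, <code>tr</code>, <code>awk</code>, <code>sort</code>, <code>mktemp</code>), and the function names <code>map_chunk</code>, <code>reduce_counts</code> and <code>word_count</code> are made up for illustration. Each chunk is mapped by its own isolated background process, and a single reduce step merges the partial results.</p>

```shell
# Toy split/map/reduce word count (illustrative only, not Hadoop).

# map: emit one "word<TAB>1" pair per word found in a chunk file.
map_chunk() {
  tr -cs 'A-Za-z0-9' '\n' < "$1" | tr 'A-Z' 'a-z' | awk 'NF { print $0 "\t1" }'
}

# reduce: sum the counts per word and print them sorted by word.
reduce_counts() {
  awk -F '\t' '{ sum[$1] += $2 } END { for (w in sum) print w "\t" sum[w] }' | LC_ALL=C sort
}

# word_count FILE [LINES_PER_CHUNK]
# Runs in a subshell so its variables and background jobs stay private.
word_count() (
  input=$1
  lines=${2:-2}
  workdir=$(mktemp -d)
  # Split the input into chunks of a few lines each.
  split -l "$lines" "$input" "$workdir/chunk."
  # Map every chunk in its own background process; the processes never talk to each other.
  for chunk in "$workdir"/chunk.*; do
    map_chunk "$chunk" > "$chunk.mapped" &
  done
  wait
  # Reduce all partial results into the final counts.
  cat "$workdir"/chunk.*.mapped | reduce_counts
  rm -rf "$workdir"
)

# Small demonstration on made-up input.
sample=$(mktemp)
printf 'big data\nnew data\nbig big\n' > "$sample"
word_count "$sample" 1
rm -f "$sample"
```

<p>The mappers here share nothing, which is exactly the isolation described above; the kind of NewData job I have in mind would need those processes to exchange messages while they run.</p>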
<p>It is clear that BigData still has some runway before NewData takes over. However, if the trends in the democratization of compute and processing continue (beyond Hadoop and EC2), and open APIs and datasets proliferate online and off, NewData and its new questions, mashups, and systems are inevitable. Where having readily available compute resources and the software to use them defined BigData, NewData will be defined solely by asking the right questions, the algorithms to derive answers, and the systems used to produce them.</p>
<p><em>Thanks to <a href="http://twitter.com/mlmilleratmit">Mike Miller</a>, <a href="http://twitter.com/lusciouspear">Bradford Stephens</a> and my awesome wife <a href="http://twitter.com/xprimerw">Erin</a> for the help on this article.</em></p>
<p><strong><em>Follow me on <a href="http://twitter.com/williamsjoe">twitter</a>.<br />
</em></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.joeandmotorboat.com/2010/05/31/beyond-bigdata/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Disco.</title>
		<link>http://www.joeandmotorboat.com/2008/09/08/disco/</link>
		<comments>http://www.joeandmotorboat.com/2008/09/08/disco/#comments</comments>
		<pubDate>Mon, 08 Sep 2008 13:05:19 +0000</pubDate>
		<dc:creator>joe</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Dev]]></category>
		<category><![CDATA[Erlang]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://www.joeandmotorboat.com/?p=610</guid>
		<description><![CDATA[Something I happened to see over here this weekend was Disco. It is a Map/Reduce framework written in Erlang. A user/implementer doesn&#8217;t need to know a lick of Erlang to get rolling; according to their site, most folks use Python to write the actual jobs. If you ask me, a Map/Reduce framework built using [...]]]></description>
			<content:encoded><![CDATA[<p>Something I happened to see <a href="http://debasishg.blogspot.com/2008/09/more-erlang-with-disco.html">over here</a> this weekend was <a href="http://discoproject.org/">Disco</a>. It is a Map/Reduce framework written in Erlang. A user/implementer doesn&#8217;t need to know a lick of Erlang to get rolling; according to their site, most folks use Python to write the actual jobs. If you ask me, a Map/Reduce framework built using Erlang makes a great deal of sense due to its message passing and lightweight processes.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joeandmotorboat.com/2008/09/08/disco/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>More on Hadoop Metrics In Ganglia.</title>
		<link>http://www.joeandmotorboat.com/2008/07/28/more-on-hadoop-metrics-in-ganglia/</link>
		<comments>http://www.joeandmotorboat.com/2008/07/28/more-on-hadoop-metrics-in-ganglia/#comments</comments>
		<pubDate>Thu, 01 Jan 1970 00:00:00 +0000</pubDate>
		<dc:creator>joe</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Dev]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://www.joeandmotorboat.com/?p=511</guid>
		<description><![CDATA[I have gotten a few comments asking whether or not I was able to get Hadoop to talk to Ganglia. Sadly I wasn&#8217;t able to get this to work properly either, but I did contact the Hadoop mailing list (this thread) and got the following information. There is actually a bug. The link [...]]]></description>
			<content:encoded><![CDATA[<p>I have gotten a few comments asking whether or not I was able to get Hadoop to talk to Ganglia. Sadly I wasn&#8217;t able to get this to work properly either, but I did contact the Hadoop mailing list (<a href="http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200807.mbox/%3C488799B9.5070404@joetify.com%3E">this thread</a>) and got the following information. There is actually a <a href="https://issues.apache.org/jira/browse/HADOOP-3422">bug</a>. The link includes a patch, but note that the trunk has changed and the patch currently only works on Hadoop version 0.16.0. I have not had a chance to test everything out yet, but it is at least a step in the right direction for those of you who are curious. Hope this helps.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joeandmotorboat.com/2008/07/28/more-on-hadoop-metrics-in-ganglia/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>More gexec.</title>
		<link>http://www.joeandmotorboat.com/2008/06/04/more-gexec/</link>
		<comments>http://www.joeandmotorboat.com/2008/06/04/more-gexec/#comments</comments>
		<pubDate>Thu, 01 Jan 1970 00:00:00 +0000</pubDate>
		<dc:creator>joe</dc:creator>
				<category><![CDATA[Clustering]]></category>

		<guid isPermaLink="false">http://www.joeandmotorboat.com/?p=491</guid>
		<description><![CDATA[Bernard Li pushed a new version of gexec out in response to my inquiries on the mailing list, it includes the Ganglia switch. I did some further code changes and was able to generate a tarball which builds fine without any modification, can you please try it on your system and see if it works? [...]]]></description>
			<content:encoded><![CDATA[<p>Bernard Li pushed out a new version of gexec in response to <a href="http://sourceforge.net/mailarchive/message.php?msg_name=d4c731da0806021752u7c9b1feete4efe3f3473290b5%40mail.gmail.com">my inquiries on the mailing list</a>; it includes the Ganglia switch.</p>
<blockquote><p>I did some further code changes and was able to generate a tarball<br />
which builds fine without any modification, can you please try it on<br />
your system and see if it works?  All you need to is run `rpmbuild<br />
-tb` against the tarball:</p>
<p><a href="http://therealms.org/oss/ganglia/gexec-0.3.8.1375.tar.gz">http://therealms.org/oss/ganglia/gexec-0.3.8.1375.tar.gz</a>
</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.joeandmotorboat.com/2008/06/04/more-gexec/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>NVIDIA GPU/CUDA Based Supercomputer.</title>
		<link>http://www.joeandmotorboat.com/2008/05/31/nvidia-gpucuda-based-supercomputer/</link>
		<comments>http://www.joeandmotorboat.com/2008/05/31/nvidia-gpucuda-based-supercomputer/#comments</comments>
		<pubDate>Thu, 01 Jan 1970 00:00:00 +0000</pubDate>
		<dc:creator>joe</dc:creator>
				<category><![CDATA[Clustering]]></category>

		<guid isPermaLink="false">http://www.joeandmotorboat.com/?p=488</guid>
		<description><![CDATA[Check out this sweet machine that the University of Antwerp built.]]></description>
			<content:encoded><![CDATA[<p>Check out this <a href="http://www.dvhardware.net/article27538.html">sweet machine</a> that the University of Antwerp built.</p>
<p><a href="http://www.joeandmotorboat.com/2008/05/31/nvidia-gpucuda-based-supercomputer/"><em>Click here to view the embedded video.</em></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.joeandmotorboat.com/2008/05/31/nvidia-gpucuda-based-supercomputer/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Ganglia, gexec, authd and libe Install Procedure.</title>
		<link>http://www.joeandmotorboat.com/2008/05/30/ganglia-gexec-authd-and-libe-install-procedure/</link>
		<comments>http://www.joeandmotorboat.com/2008/05/30/ganglia-gexec-authd-and-libe-install-procedure/#comments</comments>
		<pubDate>Thu, 01 Jan 1970 00:00:00 +0000</pubDate>
		<dc:creator>joe</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://www.joeandmotorboat.com/?p=486</guid>
		<description><![CDATA[Install Ganglia wget http://voxel.dl.sourceforge.net/sourceforge/ganglia/ganglia-3.0.7-1.src.rpm rpm -Uhv http://apt.sw.be/redhat/el5/en/i386/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.i386.rpm yum install libpng-devel libart_lgpl-devel rrdtool-devel freetype-devel rpmbuild --rebuild ganglia-3.0.7-1.src.rpm rpm -ivh /usr/src/redhat/RPMS/x86_64/ganglia-gmetad-3.0.7-1.x86_64.rpm /usr/src/redhat/RPMS/x86_64/ganglia-gmond-3.0.7-1.x86_64.rpm /usr/src/redhat/RPMS/x86_64/ganglia-devel-3.0.7-1.x86_64.rpm Install libe wget http://www.theether.org/libe/libe-0.3.0-1.src.rpm rpmbuild --rebuild libe-0.3.0-1.src.rpm rpm -ivh /usr/src/redhat/RPMS/x86_64/libe-0.3.0-1.x86_64.rpm Install authd yum install openssl-devel wget http://www.theether.org/authd/authd-0.2.2-1.src.rpm rpmbuild --rebuild authd-0.2.2-1.src.rpm You will run into an error like the following, don&#8217;t worry about it [...]]]></description>
			<content:encoded><![CDATA[<p>Install Ganglia</p>
<blockquote><p>
wget http://voxel.dl.sourceforge.net/sourceforge/ganglia/ganglia-3.0.7-1.src.rpm<br />
rpm -Uhv http://apt.sw.be/redhat/el5/en/i386/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.i386.rpm<br />
yum install libpng-devel libart_lgpl-devel rrdtool-devel freetype-devel<br />
rpmbuild --rebuild ganglia-3.0.7-1.src.rpm<br />
rpm -ivh /usr/src/redhat/RPMS/x86_64/ganglia-gmetad-3.0.7-1.x86_64.rpm /usr/src/redhat/RPMS/x86_64/ganglia-gmond-3.0.7-1.x86_64.rpm /usr/src/redhat/RPMS/x86_64/ganglia-devel-3.0.7-1.x86_64.rpm</p></blockquote>
<p>Install libe</p>
<blockquote><p>wget http://www.theether.org/libe/libe-0.3.0-1.src.rpm<br />
rpmbuild --rebuild libe-0.3.0-1.src.rpm<br />
rpm -ivh /usr/src/redhat/RPMS/x86_64/libe-0.3.0-1.x86_64.rpm </p></blockquote>
<p>Install authd</p>
<blockquote><p>yum install openssl-devel<br />
wget http://www.theether.org/authd/authd-0.2.2-1.src.rpm<br />
rpmbuild --rebuild authd-0.2.2-1.src.rpm </p></blockquote>
<p>You will run into an error like the following. Don&#8217;t worry about it; we clean it up next.</p>
<blockquote><p>Installing authd-0.2.2-1.src.rpm<br />
warning: user bnc does not exist - using root<br />
warning: group dusers does not exist - using root<br />
error: Legacy syntax is unsupported: copyright<br />
error: line 5: Unknown tag: Copyright: GPL</p></blockquote>
<p>Finish up authd</p>
<blockquote><p>
mv /usr/src/redhat/SPECS/authd.spec /usr/src/redhat/SPECS/authd.spec.1<br />
sed 's/Copyright/License/g' /usr/src/redhat/SPECS/authd.spec.1 > /usr/src/redhat/SPECS/authd.spec<br />
rpmbuild -ba /usr/src/redhat/SPECS/authd.spec<br />
openssl genrsa -out auth_priv.pem<br />
chmod 600 auth_priv.pem<br />
openssl rsa -in auth_priv.pem -pubout -out auth_pub.pem
</p></blockquote>
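<p>If you want to see what the <code>sed</code> step does before touching the real SPEC file, here is a small sketch that runs it on a throwaway file. The sample SPEC contents and the helper name <code>fix_spec_license</code> are made up for illustration. Note that this sketch anchors the match to a <code>Copyright:</code> tag at the start of a line, which is narrower than the global replace used above and leaves any other mention of the word alone.</p>

```shell
# fix_spec_license OLD NEW
# Rewrite the legacy "Copyright:" tag as "License:" (only at the start of a line).
fix_spec_license() {
  sed 's/^Copyright:/License:/' "$1" > "$2"
}

# Demonstration on a made-up SPEC fragment in a temporary directory.
demo_dir=$(mktemp -d)
printf 'Name: authd\nCopyright: GPL\n# Copyright notice stays as it is\n' > "$demo_dir/authd.spec.1"
fix_spec_license "$demo_dir/authd.spec.1" "$demo_dir/authd.spec"
cat "$demo_dir/authd.spec"
rm -rf "$demo_dir"
```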
<p>Copy auth_priv.pem and auth_pub.pem to &#8216;/etc&#8217; on each node of the cluster</p>
<blockquote><p>
rpm -ivh /usr/src/redhat/RPMS/x86_64/authd-0.2.2-1.x86_64.rpm
</p></blockquote>
<p>Installing gexec (<a href="http://www.joeandmotorboat.com/files/gexec-0.3.8-4.src.rpm">using my SRPM</a>, includes the &#8216;--with-ganglia&#8217; option)</p>
<blockquote><p>echo "gexec   2875/tcp    # Caltech GEXEC" >> /etc/services<br />
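yum install glibc gcc gcc-c++ authd expat-devel<br />
wget http://www.joeandmotorboat.com/files/gexec-0.3.8-4.src.rpm<br />
rpmbuild --rebuild gexec-0.3.8-4.src.rpm<br />
rpm -ivh /usr/src/redhat/RPMS/x86_64/gexec-0.3.8-4.x86_64.rpm</p></blockquote>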
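<p>One caveat with the <code>echo ... >> /etc/services</code> line: running the procedure twice appends the entry twice. A minimal sketch of a guarded append follows; the helper name <code>add_service_entry</code> is made up, and the demonstration works on a temporary file rather than the real <code>/etc/services</code>.</p>

```shell
# add_service_entry FILE
# Append the gexec service line only if FILE has no gexec entry yet.
add_service_entry() {
  if ! grep -Eq '^gexec[[:space:]]' "$1"; then
    printf 'gexec   2875/tcp    # Caltech GEXEC\n' >> "$1"
  fi
}

# Demonstration: calling it twice still leaves a single entry.
services_copy=$(mktemp)
printf 'ssh     22/tcp\n' > "$services_copy"
add_service_entry "$services_copy"
add_service_entry "$services_copy"
grep -c '^gexec' "$services_copy"
rm -f "$services_copy"
```

<p>On a real node you would call it as <code>add_service_entry /etc/services</code>, as root.</p>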
]]></content:encoded>
			<wfw:commentRss>http://www.joeandmotorboat.com/2008/05/30/ganglia-gexec-authd-and-libe-install-procedure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>gexec Success!</title>
		<link>http://www.joeandmotorboat.com/2008/05/30/gexec-success/</link>
		<comments>http://www.joeandmotorboat.com/2008/05/30/gexec-success/#comments</comments>
		<pubDate>Thu, 01 Jan 1970 00:00:00 +0000</pubDate>
		<dc:creator>joe</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://www.joeandmotorboat.com/?p=485</guid>
		<description><![CDATA[I was finally able to get a clean build of gexec with the &#8216;--with-ganglia&#8217; option. Here&#8217;s what I did: I downloaded the tarball available at http://therealms.org/oss/ganglia/gexec-0.3.8.tar.gz (thanks to Bernard on the Ganglia mailing list). Then I ran: rpmbuild -tb gexec-0.3.8.tar.gz This created an RPM and an SRPM. The RPM can be deleted; I installed the SRPM. [...]]]></description>
			<content:encoded><![CDATA[<p>I was finally able to get a clean build of gexec with the &#8216;--with-ganglia&#8217; option. Here&#8217;s what I did:</p>
<p>I downloaded the tarball available at http://therealms.org/oss/ganglia/gexec-0.3.8.tar.gz <em>(thanks to Bernard on the Ganglia mailing list)</em>. Then I ran:</p>
<blockquote><p>rpmbuild -tb gexec-0.3.8.tar.gz</p></blockquote>
<p>This created an RPM and an SRPM. The RPM can be deleted; I installed the SRPM, which should be located at &#8216;/usr/src/redhat/SRPMS/gexec-0.3.8-4.src.rpm&#8217;. I then edited the SPEC file &#8216;/usr/src/redhat/SPECS/gexec.spec&#8217;, removing &#8216;%configure&#8217; and adding the following above the &#8216;make&#8217; line but below the &#8216;%build&#8217; line.</p>
<blockquote><p>./configure --with-ganglia --host=x86_64-redhat-linux-gnu --build=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info</p></blockquote>
<p>Next, extract the tarball at &#8216;/usr/src/redhat/SOURCES/gexec-0.3.8.tar.gz&#8217;. Edit &#8216;configure.ac&#8217; to use &#8216;AC_PREFIX_DEFAULT(/usr)&#8217; rather than &#8216;AC_PREFIX_DEFAULT(/usr/local)&#8217;. Then change GANGLIA_LIB to use &#8216;/usr/lib/libganglia.a&#8217; rather than &#8216;@libdir@/libganglia.a&#8217;. I also edited the Makefile to use &#8216;/usr/lib/libganglia.a&#8217; rather than &#8216;@libdir@/libganglia.a&#8217; in a couple of spots. Then move gexec-0.3.8.tar.gz to gexec-0.3.8.tar.gz.OLD and run &#8216;tar zcvf gexec-0.3.8&#8217; to create a new tarball with the changes just made. At this point one can build and install the new RPM by running:</p>
<blockquote><p>
rpmbuild -ba /usr/src/redhat/SPECS/gexec.spec<br />
rpm -ivh /usr/src/redhat/RPMS/x86_64/gexec-0.3.8-4.x86_64.rpm
</p></blockquote>
<p>I have made my SRPM available; you can download it <a href="http://www.joeandmotorboat.com/files/gexec-0.3.8-4.src.rpm">here</a>.</p>
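<p>For anyone who would rather script the extract, edit and re-tar steps than do them by hand, here is a rough sketch. It only covers the <code>AC_PREFIX_DEFAULT</code> change (the <code>libganglia.a</code> path edits are left out), the helper name <code>repack_with_prefix</code> is made up, and it assumes the tarball unpacks into a <code>gexec-0.3.8</code> directory containing <code>configure.ac</code>.</p>

```shell
# repack_with_prefix TARBALL
# Extract the tarball, switch the default prefix in configure.ac to /usr,
# keep the original as TARBALL.OLD and write a new tarball in its place.
# Runs in a subshell so the variables stay private.
repack_with_prefix() (
  # Resolve to an absolute path so later steps do not depend on the current directory.
  tarball=$(cd "$(dirname "$1")" && pwd)/$(basename "$1")
  workdir=$(mktemp -d)
  tar -xzf "$tarball" -C "$workdir"
  # In a basic regular expression the parentheses are literal characters.
  sed -i.bak 's|AC_PREFIX_DEFAULT(/usr/local)|AC_PREFIX_DEFAULT(/usr)|' "$workdir/gexec-0.3.8/configure.ac"
  rm -f "$workdir/gexec-0.3.8/configure.ac.bak"
  mv "$tarball" "$tarball.OLD"
  tar -czf "$tarball" -C "$workdir" gexec-0.3.8
  rm -rf "$workdir"
)
```

<p>Usage would look like <code>repack_with_prefix /usr/src/redhat/SOURCES/gexec-0.3.8.tar.gz</code>, followed by the <code>rpmbuild -ba</code> step above.</p>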
]]></content:encoded>
			<wfw:commentRss>http://www.joeandmotorboat.com/2008/05/30/gexec-success/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Is It You Or Me Ganglia?</title>
		<link>http://www.joeandmotorboat.com/2008/05/29/is-it-you-or-me-ganglia/</link>
		<comments>http://www.joeandmotorboat.com/2008/05/29/is-it-you-or-me-ganglia/#comments</comments>
		<pubDate>Thu, 01 Jan 1970 00:00:00 +0000</pubDate>
		<dc:creator>joe</dc:creator>
				<category><![CDATA[Clustering]]></category>

		<guid isPermaLink="false">http://www.joeandmotorboat.com/?p=484</guid>
		<description><![CDATA[So I began building a new head cluster node in a KVM, just as a test run and to refine my methodology. I decided to drop Unicluster due to an unresolved issue, this time around I decided to install everything myself. &#8230; Java, check &#8230; Hadoop, check &#8230; Pig, check &#8230; Grid Engine, check &#8230; [...]]]></description>
			<content:encoded><![CDATA[<p>So I began building a new cluster head node in a <a href="http://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine">KVM</a> virtual machine, just as a test run and to refine my methodology. I decided to drop Unicluster due to an <a href="http://www.grid.org/forum/showthread.php?t=160">unresolved issue</a>; this time around I decided to install everything myself. &#8230; Java, check &#8230; Hadoop, check &#8230; Pig, check &#8230; Grid Engine, check &#8230; OpenMPI, check &#8230; Ganglia, ugh &#8230;</p>
<p>Ganglia seems to be an interesting beast. I built the SRPMs and then installed the RPMs for the &#8220;ganglia monitor core&#8221; without a problem; it was easy and quick. I then moved on to the &#8220;gexec execution environment&#8221;, which includes gexec, gexecd, authd and libe.</p>
<p>The first issue I ran into in building from the SRPMs was the dependencies. I started with authd and ran into dependency issues during the build. Sadly the SPEC file did not declare what the package requires. I then tried the prebuilt RPMs (found on Ganglia&#8217;s <a href="http://sourceforge.net/project/showfiles.php?group_id=43021&amp;package_id=36388&amp;release_id=88941">SourceForge</a> page). Even those didn&#8217;t work properly, because they require some old OpenSSL libraries that are unavailable in CentOS 5.</p>
<blockquote><p>[root@m ganglia]# rpm -qa | grep openssl<br />
openssl-devel-0.9.8b-8.3.el5_0.2<br />
openssl-0.9.8b-8.3.el5_0.2<br />
openssl-devel-0.9.8b-8.3.el5_0.2<br />
openssl-0.9.8b-8.3.el5_0.2<br />
[root@m ganglia]# rpm -ivh authd-0.2.1-1.i386.rpm<br />
error: Failed dependencies:<br />
libcrypto.so.2 is needed by authd-0.2.1-1.i386<br />
libssl.so.2 is needed by authd-0.2.1-1.i386</p></blockquote>
<p>So I went back to attempting to build the SRPM. <a href="http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg03846.html">Soon I found out</a> that the above libraries have nothing to do with the build issues I was seeing. My issue was that the libe library was missing. Once I built and installed libe, authd built and installed without a problem.</p>
<p>Next, I attempted to build gexec. This proved to have the same issue as authd: the SPEC in the SRPM did not declare any requirements, making it difficult to determine what needs to be installed as a prerequisite. I then started to investigate the errors I was seeing in the build:</p>
<blockquote><p>gexec.c:39:33: error: ganglia/gexec_funcs.h: No such file or directory</p></blockquote>
<p>Googling for this, I found a <a href="http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg02443.html">Ganglia Developers email list entry</a> explaining that:</p>
<blockquote><p>The gexec-0.3.6 available from http://www.theether.org/gexec does not<br />
build with 3.0.* versions of Ganglia. It builds correctly only with 2.*<br />
versions. If you want to build with Ganglia 3, edit the gexec.c to include<br />
/usr/include/ganglia.h and not /usr/include/ganglia/gexec_funcs.h. Of<br />
course, you have to have ganglia-devel installed for this to work. Another<br />
thing, in addition to the above, you have to add #include  to<br />
gexec.c in order to successfully build the gexec.</p></blockquote>
<p>That works, so I applied the above changes to gexec.c inside the source tarball. My next attempt to build failed because the &#8216;e/llist.h&#8217; include did not exist. &#8216;locate&#8217; confirmed that it did not exist on my machine even though libe is installed. So I went back to that email list post and found this link:</p>
<blockquote><p>http://svn.oscar.openclustergroup.org/svn/oscar-soc/soc-2006/hpcmetrics/ganglia/</p></blockquote>
<p>Looking through the source I found http://svn.oscar.openclustergroup.org/svn/oscar-soc/soc-2006/hpcmetrics/ganglia/src/lib/llist.h and copied it into &#8216;/usr/include/e/&#8217;. This worked nicely, but as you might expect the build failed again, this time looking for libraries in &#8216;/lib&#8217; rather than &#8216;/lib64&#8217;, which is to be expected since I am running x86_64. I symlinked the library into place and moved on.</p>
<p>Now I am at an error that I haven&#8217;t been able to figure out. <a href="http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg03849.html">My mailing list post</a> describing the issue has not seen a reply.</p>
<blockquote><p>gexec.c: In function ‘main’:<br />
gexec.c:324: warning: ‘ips’ may be used uninitialized in this function<br />
gcc -DHAVE_CONFIG_H -I. -I. -I. -I.    -O2 -Wall -D_REENTRANT -g<br />
-D_GNU_SOURCE -DDEBUG -c gexec_options.c<br />
gcc  -O2 -Wall -D_REENTRANT -g -D_GNU_SOURCE -DDEBUG  -o gexec -L.<br />
gexec.o gexec_options.o -lpthread -lgexec -le -lauth -lssl -lcrypto<br />
/usr/lib/libganglia.a -lssl -lpthread -lcrypto<br />
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':<br />
(.text+0x10c): undefined reference to `XML_ParserCreate'<br />
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':<br />
(.text+0x160): undefined reference to `XML_SetElementHandler'<br />
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':<br />
(.text+0x16b): undefined reference to `XML_SetUserData'<br />
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':<br />
(.text+0x178): undefined reference to `XML_GetBuffer'<br />
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':<br />
(.text+0x1c4): undefined reference to `XML_ParserFree'<br />
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':<br />
(.text+0x1f6): undefined reference to `XML_ParseBuffer'<br />
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':<br />
(.text+0x265): undefined reference to `XML_GetErrorCode'<br />
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':<br />
(.text+0x26c): undefined reference to `XML_ErrorString'<br />
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':<br />
(.text+0x277): undefined reference to `XML_GetCurrentLineNumber'<br />
collect2: ld returned 1 exit status<br />
make: *** [gexec] Error 1</p></blockquote>
<p>After a bit of Googling, I found that these XML functions belong to expat. I installed expat-devel (as well as a number of other XML devel packages) and attempted to rebuild. Same thing: failure. Next, since the error relates to libganglia.a, I figured that perhaps it had not been built with expat support and needed to be rebuilt, so I rebuilt it with expat-devel installed. This failed with the same error as above. After looking at the <a href="http://ganglia.wiki.sourceforge.net/ganglia_readme">doc</a> I noticed that the Ganglia SPEC file does not include &#8216;--enable-gexec&#8217; in the configure. I built the RPMs with this option and still ran into the error. I have attempted to build gexec from the SRPM as well as straight from source. In every case I get the above error. The error (&#8220;collect2: ld returned 1 exit status&#8221;) suggests to me that a library (or libraries) is missing from the link. But at this point I&#8217;m not really sure at all. If I come up with something (outside of running gexec in standalone) I will be sure to post it. If anyone else out there knows what&#8217;s up, post a comment.</p>
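<p>When a link fails like this, it helps to boil the log down to the list of missing symbols. The sketch below does that with <code>awk</code>; the helper names <code>missing_symbols</code> and <code>suggest_libs</code> are made up, and the only mapping it knows is that <code>XML_*</code> symbols come from expat. One thing worth noting is that the link line in the log above never passes <code>-lexpat</code>, so one plausible (untested) reading is that it needs to be added after <code>/usr/lib/libganglia.a</code>.</p>

```shell
# missing_symbols LOG
# Print each symbol named in an "undefined reference to" line once, sorted.
missing_symbols() {
  awk '/undefined reference to/ {
         sym = $NF                        # last field, e.g. `XML_ParserCreate'\''
         gsub(/[^A-Za-z0-9_]/, "", sym)   # strip quotes and any trailing markup
         print sym
       }' "$1" | sort -u
}

# suggest_libs LOG
# Very small heuristic: XML_* symbols are provided by expat.
suggest_libs() {
  missing_symbols "$1" | awk '/^XML_/ { expat = 1 } END { if (expat) print "-lexpat" }'
}

# Demonstration on a few lines copied from the log above.
link_log=$(mktemp)
cat > "$link_log" <<'EOF'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x10c): undefined reference to `XML_ParserCreate'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x1c4): undefined reference to `XML_ParserFree'
EOF
missing_symbols "$link_log"
suggest_libs "$link_log"
rm -f "$link_log"
```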
<p>This all leads me to the point of this post, which is &#8230; <em>why is setting this up so difficult</em>? Truth be told I have no clue, but I don&#8217;t think it should be. The Ganglia mailing list was helpful enough, but the documentation seems a little lacking should one run into any issues. One would think that if &#8220;The gexec-0.3.6 available from http://www.theether.org/gexec does not<br />
build with 3.0.* versions of Ganglia.&#8221; this should be documented. I don&#8217;t think that I am doing anything strange, and I am using CentOS 5, not some obscure distro.</p>
<p>You may be asking what all these problems with gexec have to do with Ganglia (a guy on the mailing list asked me just that: &#8220;What does this have to do with ganglia?&#8221;). Fair enough. Ganglia is not gexec and gexec is not Ganglia. My response was that the gexec SRPMs are downloadable side by side with all the Ganglia RPMs on SourceForge. This leads me to believe that asking the Ganglia mailing list about gexec isn&#8217;t too far off base. Additionally, for someone who is trying to install these packages for the first time or is new to Ganglia, the mailing list seems like the place to ask, as I imagine there are plenty of folks running gexec hosts in Ganglia. The Ganglia documentation even says of gexec that &#8220;integrating it with ganglia is a bit clumsy&#8221; but provides no information beyond how to run it in standalone mode and how to turn it off if you have configured it to be on by default. To boot, the gexec site hasn&#8217;t been updated since 2004.</p>
<p>Next, you may think that if this is broken and the documentation sucks, why don&#8217;t you fix it? It&#8217;s an open source project. That&#8217;s valid, and I will be happy to write up some documentation on how to build the RPMs for Ganglia and associated applications. For good measure I will even see if I can get it posted to the Ganglia wiki. Of course this hinges on me actually being able to build the RPMs and have everything work properly.</p>
<p>Lastly, here are a few lessons learned:</p>
<ul>
<li>Something I learn time and time again: don&#8217;t assume anything.</li>
<li>Any time you create SRPMs, make sure you add the &#8220;BuildRequires&#8221; directive. This alone would likely have solved my issue with gexec after I modified gexec.c, or at least would have pointed me in the right direction.</li>
<li>If source code modifications or any other oddities are required to build an application, document them. Simply saying something is clunky or unintuitive is not enough.</li>
<li>If you have a software product you would like other people to use provide installation procedures. Having install docs is almost as good as having a marketing team. If people find it easy to install and are happy with it they will tell others (example: WordPress).</li>
</ul>
<p>That&#8217;s it for my rant. Thanks. <img src='http://www.joeandmotorboat.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.joeandmotorboat.com/2008/05/29/is-it-you-or-me-ganglia/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>More Hadoop, Grid Engine Goodness.</title>
		<link>http://www.joeandmotorboat.com/2008/05/23/more-hadoop-grid-engine-goodness/</link>
		<comments>http://www.joeandmotorboat.com/2008/05/23/more-hadoop-grid-engine-goodness/#comments</comments>
		<pubDate>Fri, 23 May 2008 00:00:00 +0000</pubDate>
		<dc:creator>joe</dc:creator>
				<category><![CDATA[Clustering]]></category>

		<guid isPermaLink="false">http://www.joeandmotorboat.com/?p=482</guid>
		<description><![CDATA[Over at GridEngine.info they found a link on DanT&#8217;s Sun blog that has a sweet tutorial on setting up Hadoop using SGE&#8217;s parallel environments with loose integration. Here we are relying on master node to start other daemons ([rs]sh the machine and start daemons) and distribute jobs, and we do not have control on [...]]]></description>
			<content:encoded><![CDATA[<p>Over at <a href="http://gridengine.info/articles/2008/05/23/creating-hadoop-pe-under-grid-engine">GridEngine.info</a> they found a link on <a href="http://blogs.sun.com/templedf/entry/hadoop_sun_grid_engine">DanT&#8217;s Sun blog</a> that has a sweet tutorial on <a href="http://blogs.sun.com/ravee/entry/creating_hadoop_pe_under_sge">setting up Hadoop using SGE&#8217;s parallel environments</a> with loose integration.</p>
<blockquote><p>Here we are relying on master node to start other daemons ([rs]sh the machine and start daemons) and distribute jobs, and we do not have control on the <em>TaskTracker</em> threads. This way of setting a pe in Grid Engine is called <a title="SGE Loose integration" href="http://gridengine.sunsource.net/howto/howto.html">loose-integration</a></p>
<p>With some more effort one could also achieve a <strong>tighter integration</strong> wherein the task of starting daemons and tasks on other slaves could be done by SGE. But this would require further understanding of Hadoop internals.</p></blockquote>
<p>Pretty dope.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joeandmotorboat.com/2008/05/23/more-hadoop-grid-engine-goodness/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Using Pig with Hadoop.</title>
		<link>http://www.joeandmotorboat.com/2008/05/23/using-pig-with-hadoop/</link>
		<comments>http://www.joeandmotorboat.com/2008/05/23/using-pig-with-hadoop/#comments</comments>
		<pubDate>Fri, 23 May 2008 00:00:00 +0000</pubDate>
		<dc:creator>joe</dc:creator>
				<category><![CDATA[Clustering]]></category>

		<guid isPermaLink="false">http://www.joeandmotorboat.com/?p=481</guid>
		<description><![CDATA[Pig is a query language for use with Hadoop. It lets users query Hadoop data much as they would a SQL database. Formally, according to their website: Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient [...]]]></description>
			<content:encoded><![CDATA[<p>Pig is a query language for use with Hadoop. It lets users query Hadoop data much as they would a SQL database. Formally, according to their website:</p>
<blockquote><p>Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.</p></blockquote>
<p>To get rolling you need the following:</p>
<ul>
<li>A Java SDK installed</li>
<li>Ant installed</li>
<li>Subversion</li>
<li>A working installation of Hadoop</li>
</ul>
<p>Once you are rolling with those items we can install Pig and test it out.</p>
<p>First, you need to download Pig from their Subversion repository. Once done you will need to build it with Ant.</p>
<blockquote><p>svn co http://svn.apache.org/repos/asf/incubator/pig/trunk pig-svn<br />
cd pig-svn<br />
ant</p></blockquote>
<p>From there you can run the following command to drop into the interactive shell.</p>
<blockquote><p>java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main</p></blockquote>
<p>Or you can run a pig script that you have already created.</p>
<blockquote><p>java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main somescript.pig</p></blockquote>
<p>HADOOPSITEPATH needs to point to the directory that contains the hadoop-site.xml file.</p>
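<p>As a rough sketch of the classpath handling described above, the shell function below joins the Pig jar and the Hadoop configuration directory into a classpath, and gives up if that directory has no hadoop-site.xml. The function name pig_classpath and the example paths are my own placeholders, not anything shipped with Pig or Hadoop.</p>
<pre>
```shell
#!/bin/sh
# Hypothetical helper (not part of Pig or Hadoop): print a classpath made of
# the Pig jar and the directory that holds hadoop-site.xml.
pig_classpath() {
    pig_jar="$1"
    hadoop_conf_dir="$2"
    # Return non-zero if the directory does not contain hadoop-site.xml.
    [ -f "$hadoop_conf_dir/hadoop-site.xml" ] || return 1
    printf '%s:%s\n' "$pig_jar" "$hadoop_conf_dir"
}

# Example use (the paths are placeholders, adjust them to your install):
#   cp=$(pig_classpath pig.jar /path/to/hadoop/conf) || exit 1
#   java -cp "$cp" org.apache.pig.Main
```
</pre>
<p>Checking for the file up front gives a clear failure at the shell, before Java is started with a classpath that has no Hadoop configuration on it.</p>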
<p>If you run into an issue such as:</p>
<blockquote><p>Caused by: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.dfs.ClientProtocol version mismatch. (client = 29, server = 23)</p></blockquote>
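<p>You will need to upgrade Hadoop so the versions match. In the error above, the cluster (server) speaks protocol version 23, while the Hadoop client code that Pig was built with expects version 29.</p>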
<p>In the end you should get something that looks like this:</p>
<blockquote><p>[cluster@front pig-svn]$ java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main<br />
2008-05-23 10:37:42,478 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: front.esper:9000<br />
2008-05-23 10:37:42,585 [main] WARN  org.apache.hadoop.fs.FileSystem - "front.esper:9000" is a deprecated filesystem name. Use "hdfs://front.esper:9000/" instead.<br />
2008-05-23 10:37:43,117 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: front.esper:9001<br />
2008-05-23 10:37:43,246 [main] WARN  org.apache.hadoop.fs.FileSystem - "front.esper:9000" is a deprecated filesystem name. Use "hdfs://front.esper:9000/" instead.<br />
grunt&gt;</p></blockquote>
<p>If you need more info on the above steps check out the <a href="http://wiki.apache.org/pig/GettingStarted">Pig Wiki</a>.</p>
<p>From here you can follow their <a href="http://wiki.apache.org/pig/PigTutorial">tutorial</a> or play around in the <a href="http://wiki.apache.org/pig/Grunt">shell</a>. Regarding the tutorial, I can&#8217;t seem to find the download for the archive they mention, &#8220;Pig tutorial file (*.gz)&#8221;. If anyone knows where it can be found, let me know and I will post it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.joeandmotorboat.com/2008/05/23/using-pig-with-hadoop/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
