July 27, 2010

SurgeCon 2010

If you haven’t heard about Surge, it’s a new web operations conference presented by the smart folks at OmniTI. They have amassed a good list of speakers including guys like John Allspaw and Theo Schlossnagle. I also happen to have been invited to talk about the cloud, Cloudant and all sorts of good stuff.

July 19, 2010

Adding Health Checks to Deckard from Chef.

Recently, we (at Cloudant) open sourced Deckard, an HTTP content check monitoring system based on CouchDB. One of the best bits about using Couch is that it gives you a REST API, and with Deckard that API can be used to add new health checks: a simple PUT adds a new URL to monitor.

At Cloudant we love Chef and use it for everything. Chef has things called resources and providers. Resources are abstractions that describe the state you want a machine to be in. Providers perform the actions described by a resource. A good example is the package resource: on CentOS it uses yum, while on Ubuntu it uses apt-get. The resource abstracts that away, letting the provider (and node) deal with the specifics of how to install the package. This keeps your recipes nice and DRY, since the same code installs packages on all sorts of platforms. There are resources and providers for everything from installing packages to one I wrote for executing Erlang code via erl_call. One resource that works well with Deckard is the HTTP request resource, which makes it very easy to add health checks from your cookbooks. We use something like the following code to add checks to new nodes at Cloudant:

This code will add the document describing the check to the monitor_content_check database and then create a file so we can use “not_if” and Chef won’t attempt to add the check twice. Pretty cool stuff and even more reason that everything should have an API. Even cooler than this example would be to use Chef Search to do the same thing but I’ll save that for another blog post.
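For readers who don’t run Chef, here is a rough stand-in for the same idea in plain Python. It is only a sketch and not the recipe we use: the CouchDB URL, the document ID and the fields inside the check document are hypothetical (Deckard’s real document schema isn’t described in this post), and the marker file simply plays the role of the “not_if” guard. The only name taken from the post is the monitor_content_check database.

```python
import json
import os
import urllib.request
from urllib.parse import quote

DB_NAME = "monitor_content_check"  # database name mentioned in the post


def build_check_request(couch_url, doc_id, check):
    """Build (but do not send) a PUT that stores one check document."""
    url = f"{couch_url.rstrip('/')}/{DB_NAME}/{quote(doc_id, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(check).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def _send(request):
    """Default sender: perform the HTTP call and return the status code."""
    with urllib.request.urlopen(request) as response:
        return response.status


def add_check_once(couch_url, doc_id, check, marker_path, send=_send):
    """Send the PUT unless the marker file exists; return True if sent."""
    if os.path.exists(marker_path):
        return False  # same role as the not_if guard in a recipe
    send(build_check_request(couch_url, doc_id, check))
    # Record that the check was added so later runs skip the request.
    with open(marker_path, "w") as marker:
        marker.write(doc_id + "\n")
    return True
```

The guard matters because CouchDB rejects a second PUT to an existing document unless the request carries the current revision, so blindly repeating the call on every run would fail with a conflict. The `send` parameter is there only so the function can be exercised without a live CouchDB.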

May 31, 2010

Availability, the Cloud and Everything

Finally posted my presentation at Erlang Factory, WTIA Cloud SIG and Seattle Scalability Meetup here on the blog.

Beyond BigData.

BigData is a big deal. It’s changing how we look at data and analytics, but it isn’t the end. What are the enablers of BigData? First and foremost, cheap computing resources (CPU, disks, memory, bandwidth, etc.), all thanks to Moore’s Law. Today even startups can afford huge amounts of computing power, the likes of which previously only the big boys could afford. Additionally, this has given rise to commodity hardware and cloud computing, which only furthers the proliferation of large amounts of cheap, quickly provisioned computing resources. Second, to apply all that power, we have open source data processing systems based on years of distributed systems research, like Hadoop and the many incarnations of NoSQL. The development of open source data processing systems has allowed the proliferation of systems that scale, which until recently only the highly capitalized could afford. These two things alone have allowed for the democratization of BigData. A guy in a garage can process terabytes of data with little more than a credit card and elbow grease.

With all these tools and recently acquired computing power, where are we going? Of course we can expect datasets to continue to grow, the computational complexity of our data processing to increase, and compute power to continue to rise (GPGPUs, multicore and so on). In addition, I anticipate the emergence of something I’m calling NewData. NewData will build on what we have currently with BigData, but will include some trends that are just beginning to take off. First, the development of ubiquitous public APIs (Meatcloud Manifesto). Public APIs have yet to proliferate to all online systems. As a consequence, there is still a lot of screen scraping going on. With easily queryable and parseable datasets available through ubiquitous APIs, consuming the internet with machines becomes easier, making the application of BigData more powerful. Netflix is a good example of this. Second and similarly enabling will be the development of standardized public datasets. Current datasets are generally hard to find and use; standardized dataset formats will make BigData analysis more productive, with less time wasted on munging. Data.gov is a start. These two developments are yet to be fully realized in current systems but will allow for the rise of NewData. As these developments begin to roll out, we will begin to see changes in how our BigData systems look. NewData systems will be less concerned with how big the data is and what it looks like, and will instead emphasize deriving more information from the data. Bradford Cross gets this, and as a result FlightCaster is an early example of what I mean by NewData.

The scale of data and computations is an important issue, but the data age is less about the raw size of your data, and more about the cool stuff you can do with it.

Asking the right questions of the data is important, especially if you’re trying to do cool stuff. The Freakonomics guys proved this a few times over. NewData will be about creating value from data, and asking the right questions is worth as much as the answers. The key enabler of this will be using newfound APIs and datasets to combine data from disparate sources in ways that BigData couldn’t, and asking questions that we wouldn’t have thought to ask of BigData. Where BigData was about a handful of datasets at most, NewData will be about dozens of datasets. The mashup is the cornerstone of NewData.

That being said, we will need new systems to process this data and enable us to ask these questions. NewData analysis will need inter-process communication and collaboration. Currently, systems like Hadoop process data by splitting it up and processing the chunks in parallel on hundreds to thousands of machines. Each process is isolated from the others. This will continue, but NewData will require more from these systems in order to ask deeper questions, and that means complex inter-process communication. Think of the simplicity of writing Map/Reduce jobs, the robustness of Hadoop, the workflow and dataflow of Cascading and DryadLINQ, respectively, and the power of a message passing system like MPI. These jobs will likely include large in-memory collaborative computations across thousands of machines. Where data locality was key in BigData, both data locality and memory locality (NUMA/ccNUMA) will be important in NewData.
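To make the “split it up and process the chunks in isolation” pattern concrete, here is a toy word count in Python. It is a single-process illustration of the Map/Reduce shape only, not Hadoop’s API, and the function names are mine. Each call to `map_chunk` sees its own chunk and nothing else, which is exactly the isolation described above.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Split records into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Count words in one chunk; it never sees any other chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Merge two partial results into one."""
    return left + right


def word_count(records, n_chunks=4):
    """Map over isolated chunks, then fold the partial counts together."""
    partials = map(map_chunk, split_into_chunks(records, n_chunks))
    return reduce(reduce_counts, partials, Counter())
```

Because the map step shares nothing, the chunks could be handed to separate machines without changing the result. The jobs I have in mind for NewData are the ones that don’t fit this mold, where a worker needs to exchange intermediate state with other workers while it is still running.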

It is clear that BigData still has some runway before NewData takes over. However, if the trends in the democratization of compute and processing continue (beyond Hadoop and EC2), and open APIs and datasets proliferate online and off, NewData and its new questions, mashups, and systems are inevitable. Where having readily available compute resources and the software to use them defined BigData, NewData will be defined solely by asking the right questions, the algorithms to derive answers, and the systems used to produce them.

Thanks to Mike Miller, Bradford Stephens and my awesome wife Erin for the help on this article.

Follow me on twitter.

December 15, 2009

Biodynamic Agriculture Applied to Datacenters.

While listening to the Green HPC podcast I had the thought that biodynamic agriculture could be applied to managing datacenters. Now I might be off my rocker but I think it might be a worthwhile way to think about it, hopefully without getting too hippy-ish.

From Wikipedia:

Biodynamic agriculture is a method of organic farming with homeopathic composts that treats farms as unified and individual organisms, emphasizing balancing the holistic development and interrelationship of the soil, plants, animals as a self-nourishing system without external inputs insofar as this is possible given the loss of nutrients due to the export of food.

To me this totally has an analog in datacenters, server farms (pun intended) and machine rooms. To paraphrase the above Wikipedia quote:

An electrodynamic datacenter is one that is treated as a unified and individual organism. That is, each datacenter is an autonomous entity and needs to be thought about as an organism where all the components (CRACs, servers, network, power, etc.) are balanced and interrelated without external inputs insofar as this is possible given the loss of capacity (bandwidth, compute, storage, etc.) due to export of data, compute or another resource.

Putting it like that seems pretty reasonable and would seem to lean toward making datacenters as efficient as possible. The goal is to reduce external inputs (power, bandwidth, etc.) while still getting the desired amount of output. Practices such as running datacenters hot, optimizing for data locality, or shutting down part (or all) of a datacenter when it is not needed would be commonplace. This would require tight monitoring, analysis, controls and automation on inputs and outputs. It also means developing a quantitative relationship between consumption/utilization and production, i.e., how much input is required for X amount of output. Certainly an interesting problem to solve and system to build, although I imagine some level of this has been implemented by the Googles of the world. While datacenters will likely never be self-sustaining, in the end this may be a reasonable way to think about datacenter controls and management, especially as we all try to go green for monetary and environmental reasons.
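As a very rough sketch of what “how much input is required for X amount of output” could look like in code, the Python below works from metered samples. Everything here is an assumption for illustration: the `Sample` fields, the idea of a single “work unit” as the output measure, and especially the linear extrapolation in `kwh_needed`, which a real datacenter almost certainly would not obey. PUE (total facility energy divided by IT equipment energy) is the one standard metric in the sketch.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    facility_kwh: float  # everything the datacenter drew in the interval
    it_kwh: float        # the portion consumed by servers, storage, network
    work_units: float    # hypothetical output measure, e.g. requests served


def pue(samples):
    """Power Usage Effectiveness: total facility energy / IT energy."""
    return sum(s.facility_kwh for s in samples) / sum(s.it_kwh for s in samples)


def kwh_per_unit(samples):
    """Average energy input consumed per unit of output."""
    total_kwh = sum(s.facility_kwh for s in samples)
    total_units = sum(s.work_units for s in samples)
    return total_kwh / total_units


def kwh_needed(samples, target_units):
    """Naive linear estimate of the input needed for a target output."""
    return kwh_per_unit(samples) * target_units
```

Tracking numbers like these over time is the monitoring half of the problem. The control half, deciding what to shut down or move based on them, is where it gets interesting.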