July 27, 2010

SurgeCon 2010

If you haven’t heard about Surge, it’s a new web operations conference presented by the smart folks at OmniTI. They have amassed a good list of speakers including guys like John Allspaw and Theo Schlossnagle. I also happen to have been invited to talk about the cloud, Cloudant and all sorts of good stuff.

July 19, 2010

Adding Health Checks to Deckard from Chef.

Recently, we (at Cloudant) open sourced Deckard, an HTTP content check monitoring system based on CouchDB. One of the best bits about using Couch is that it gives you a REST API, and with Deckard that API can be used to add new health checks: a simple PUT adds a new URL to monitor.

At Cloudant we love Chef and use it for everything. Chef has things called resources and providers. Resources are abstractions that describe the state you want a machine to be in. Providers perform the actions described by a resource. A good example is the package resource: on CentOS it uses yum, while on Ubuntu it uses apt-get. The resource abstracts that away, letting the provider (and node) deal with the specifics of how to install the package. This keeps your recipes nice and DRY, since the same code installs packages on all sorts of platforms. There are resources and providers for everything from installing packages to one I wrote for executing Erlang code via erl_call. One resource that works well with Deckard is the HTTP request resource; using it makes it very easy to add health checks from your cookbooks. We use something like the following code to add checks to new nodes at Cloudant:

This code will add the document describing the check to the monitor_content_check database and then create a marker file, so that a “not_if” guard keeps Chef from adding the check twice. Pretty cool stuff, and even more reason that everything should have an API. Even cooler than this example would be to use Chef Search to do the same thing, but I’ll save that for another blog post.
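For readers who want to try the same idea outside of Chef, here is a minimal Python sketch of the two steps described above: PUT a check document into the monitor_content_check database, then drop a marker file so the PUT is skipped on later runs. Only the database name comes from this post. The document fields ("url", "content"), the function names and the marker file are assumptions made for illustration, not Deckard's actual schema or the recipe we run.

```python
import json
import os
import urllib.request

# Database name taken from the post; everything else below is an assumption.
CHECK_DB = "monitor_content_check"


def build_check_request(couch_url, doc_id, check_doc):
    """Build (but do not send) the PUT that stores one check document."""
    url = f"{couch_url.rstrip('/')}/{CHECK_DB}/{doc_id}"
    body = json.dumps(check_doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def _http_send(request):
    """Default sender: perform the request and return the HTTP status."""
    with urllib.request.urlopen(request) as response:
        return response.status


def add_check_once(couch_url, doc_id, check_doc, marker_path, send=_http_send):
    """PUT the check unless the marker file exists (the job "not_if" does in Chef)."""
    if os.path.exists(marker_path):
        return False  # the check was already added on an earlier run
    send(build_check_request(couch_url, doc_id, check_doc))
    # Record success so that later runs skip the PUT.
    with open(marker_path, "w") as marker:
        marker.write(doc_id + "\n")
    return True
```

Calling add_check_once twice with the same marker path sends a single request; the second call returns False without touching the network. Passing a different send function makes it easy to dry-run the request before pointing it at a real Couch.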

June 4, 2010

Just Opensourced: Gaff and Deckard

This post was stolen from my original post on the Cloudant blog.

Today we released two open source projects that have been in use internally at Cloudant for some time now, Gaff and Deckard.

All of our infrastructure is in the cloud, and as such we need a way for disparate systems to all request resources. This is where Gaff comes in. Gaff is a pubsub daemon for asynchronously talking to cloud APIs using AMQP. Currently it supports a subset of the Dynect (DNS), Slicehost and EC2 APIs and uses geemus’ awesome fog Ruby library. The basic workflow for Gaff is to send JSON-RPC formatted messages to an AMQP exchange with a routing key corresponding to the API you are talking to. You could be sending these messages from a web application or another service. Each message gets routed to an API-specific queue, where it is picked up by Gaff and turned into the appropriate API call: starting, stopping or modifying your servers on EC2 or elsewhere.
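To make that workflow concrete, here is a toy Python sketch with no broker involved. It builds a JSON-RPC style message, "publishes" it to an in-memory stand-in for an exchange keyed by routing key, and dispatches it to a handler the way a consumer would. The routing key "ec2", the method name "start_server", its parameters and the 2.0-style envelope are all hypothetical; they are not Gaff's real message schema, and a real setup would publish through an AMQP client instead of ToyExchange.

```python
import json
import uuid
from collections import defaultdict, deque


def build_message(method, params, request_id=None):
    """Serialize a JSON-RPC style request (2.0-style envelope is an assumption)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id or uuid.uuid4().hex,
    })


class ToyExchange:
    """In-memory stand-in for an exchange: one queue per routing key."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def publish(self, routing_key, body):
        self.queues[routing_key].append(body)

    def consume(self, routing_key):
        return self.queues[routing_key].popleft()


def dispatch(body, handlers):
    """Decode one message and call the handler named by its method."""
    message = json.loads(body)
    return handlers[message["method"]](**message["params"])


# Hypothetical handler; a real consumer would call the cloud API here.
def start_server(image, size):
    return f"starting {size} server from {image}"


exchange = ToyExchange()
exchange.publish("ec2", build_message("start_server", {"image": "img-1", "size": "small"}))
result = dispatch(exchange.consume("ec2"), {"start_server": start_server})
```

The point of the design is that the sender only needs to know the routing key and the message shape; the consumer on each queue owns the API-specific details.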

We have a lot of CouchDB instances to keep tabs on, and to do this we wrote Deckard. Deckard is an HTTP check monitoring system based on CouchDB. Yo dawg! What better than to monitor CouchDB with CouchDB (and some Ruby)? Deckard supports basic HTTP content checks, email alerts, SMS alerts (via email) for on-call rotations, basic maintenance scheduling, replication latency alerts (between two Couches) and even has EC2 Elastic IP support for failover between two EC2 instances. Best of all, since it’s based on Couch you get an API for free: just PUT a doc in the HTTP checks database and you get a new HTTP check the next time Deckard runs.

Check out these and my other projects on GitHub and follow Cloudant and me on Twitter.

May 31, 2010

Availability, the Cloud and Everything

Finally posted my presentation at Erlang Factory, WTIA Cloud SIG and Seattle Scalability Meetup here on the blog.

Beyond BigData.

BigData is a big deal. It’s changing how we look at data and analytics, but it isn’t the end. What are the enablers of BigData? First and foremost, cheap computing resources (CPU, disks, memory, bandwidth, etc.), all thanks to Moore’s Law. Today even startups can afford huge amounts of computing power, the likes of which previously only the big boys could afford. Additionally, this has given rise to commodity hardware and cloud computing, which only furthers the proliferation of large amounts of cheap, quickly provisioned computing resources. Second, to apply all that power, we have open source data processing systems based on years of distributed systems research, like Hadoop and the many incarnations of NoSQL. The development of open source data processing systems has allowed the proliferation of systems that scale, which until recently only the highly capitalized could afford. These two things alone have allowed for the democratization of BigData. A guy in a garage can process terabytes of data with little more than a credit card and elbow grease.

With all these tools and recently acquired computing power, where are we going? Of course we can expect datasets to continue to grow, the computational complexity of our data processing to increase, and compute power to continue to rise (GPGPUs, multicore and so on). In addition, I anticipate the emergence of something I’m calling NewData. NewData will build on what we currently have with BigData, but will include some trends that are just beginning to take off. First, the development of ubiquitous public APIs (Meatcloud Manifesto). Public APIs have yet to proliferate to all online systems. As a consequence, there is still a lot of screen scraping going on. With easily queryable and parseable datasets available through ubiquitous APIs, consuming the internet with machines becomes easier, making the application of BigData more powerful. Netflix is a good example of this. Second, and similarly enabling, will be the development of standardized public datasets. Current datasets are generally hard to find and use; standardized dataset formats will make BigData analysis more productive, with less time wasted munging. Data.gov is a start. These two developments are yet to be fully realized in current systems but will allow for the rise of NewData. As these developments begin to roll out, we will begin to see changes to how our BigData systems look. NewData systems will be less concerned with how big the data is and what it looks like, and will emphasize deriving more information from the data. Bradford Cross gets this, and as a result FlightCaster is an early example of what I mean by NewData.

The scale of data and computations is an important issue, but the data age is less about the raw size of your data, and more about the cool stuff you can do with it.

Asking the right questions of the data is important, especially if you’re trying to do cool stuff. The Freakonomics guys proved this a few times over. NewData will be about creating value from data, and asking the right questions is worth as much as the answers. The key enablers of this will be using newfound APIs and datasets to combine data from disparate sources in ways that BigData couldn’t, and asking questions that we wouldn’t have thought to ask of BigData. Where BigData was about a handful of datasets at most, NewData will be about dozens of datasets. The mashup is the cornerstone of NewData.

That being said, we will need new systems to process this data and enable us to ask these questions. NewData analysis will need inter-process communication and collaboration. Currently, systems like Hadoop process data by splitting it up and processing the chunks in parallel on hundreds to thousands of machines. Each process is isolated from the others. This will continue, but NewData will require more from these systems: asking deeper questions will need complex inter-process communication. Think of the simplicity of writing Map/Reduce jobs, the robustness of Hadoop, the workflow and dataflow of Cascading and DryadLINQ, respectively, and the power of a message passing system like MPI. These jobs will likely include large in-memory collaborative computations across thousands of machines. Where data locality was key in BigData, both data locality and memory locality (NUMA/ccNUMA) will be important in NewData.
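As a reminder of what the isolated, split-and-process model looks like, here is a minimal word count in plain Python. It is a sketch of the Map/Reduce pattern only, not Hadoop's API: the function names are made up for illustration, and the chunks are processed one after another in a single process rather than on separate machines.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Split records into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: count words in one chunk without seeing any other chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def word_count(records, n_chunks=4):
    # Each map_chunk call is independent, so these could run on separate workers.
    partials = [map_chunk(chunk) for chunk in split_into_chunks(records, n_chunks)]
    return reduce(reduce_counts, partials, Counter())
```

Note that map_chunk never talks to another chunk's worker; the only coordination is the final merge. The collaborative, message passing style of computation described above is exactly what this pattern leaves out.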

It is clear that BigData still has some runway before NewData takes over. However, if the trends in the democratization of compute and processing continue (beyond Hadoop and EC2), and open APIs and datasets proliferate online and off, NewData and its new questions, mashups, and systems are inevitable. Where having readily available compute resources and the software to use them defined BigData, NewData will be defined solely by asking the right questions, the algorithms to derive answers, and the systems used to produce them.

Thanks to Mike Miller, Bradford Stephens and my awesome wife Erin for the help on this article.

Follow me on Twitter.