I am still alive and still maintain the blog, but I haven’t had time to post recently. Lots of cool stuff is brewing at Cloudant. Follow me on Twitter for more immediate updates. In the meantime, keep following me here and hopefully I’ll have some blog inspiration soon.
April 13, 2010
January 1, 2010
Fun with the CouchDB _changes feed and RabbitMQ.
I was recently introduced to yajl-ruby, Ruby bindings to the C-based yajl JSON parsing and encoding library. After discovering that it can parse HTTP streams, I thought it would be a perfect fit for use with CouchDB. A while back I wrote some code to push update notifications to RabbitMQ, and a commenter mentioned using the _changes feed instead. Combining the _changes feed and yajl-ruby’s HttpStream seemed like a good way to do it.
The _changes feed is a running list of all the documents that have changed in a database listed in order by sequence number. This is similar to update notifications but gives more information such as the document IDs and is HTTP based (with multiple feed styles) rather than stdout. Additionally you can create design document filters which can be specified as a query parameter to give you only the parts of the feed you want. All in all _changes is a pretty powerful feature.
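As a rough illustration of the feed described above, here is a minimal sketch that builds a _changes URL and parses a single line of the continuous feed. It uses only the Ruby standard library rather than yajl-ruby. The helper names changes_url and parse_change, the localhost:5984 address and the filter name app/important are assumptions made for the example, and the shape of the change line (seq, id, changes) is the usual CouchDB one.

```ruby
require "uri"
require "json"

# Build a _changes URL for a database.
# `base`, `db` and `filter` values are placeholders chosen for illustration.
def changes_url(base, db, feed: "continuous", since: 0, filter: nil)
  params = { "feed" => feed, "since" => since }
  # A design document filter is passed as "designdoc/filtername".
  params["filter"] = filter if filter
  "#{base}/#{db}/_changes?#{URI.encode_www_form(params)}"
end

# Parse one line of a continuous feed.
# Blank lines (heartbeats) return nil; anything else is decoded as JSON.
def parse_change(line)
  line = line.strip
  return nil if line.empty?
  JSON.parse(line)
end

puts changes_url("http://localhost:5984", "test", since: 42)
# => http://localhost:5984/test/_changes?feed=continuous&since=42

change = parse_change('{"seq":3,"id":"doc-a","changes":[{"rev":"1-abc"}]}')
puts change["id"]
# => doc-a
```

The document ID and sequence number are the two fields the rest of this post relies on: the ID to fetch the document, the sequence number to resume later.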
Now for the fun stuff, the code. I used a few dependencies to do this, chosen specifically to make it fast. As such, I used EventMachine-based libraries for AMQP and HTTP requests. The first bit of code takes the _changes feed for the “test” database, parses the feed, uses the document ID to request that document and publishes it to the queue. One key item to note is that this code requires the latest yajl-ruby from GitHub to run properly. Additionally, this works nicely with feed=continuous, so it grabs the documents as they are changed without a need for polling.
Note that there is a variable for since. This allows you to start from a specific sequence number so you can skip over old changes.
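The flow described above (read the feed, fetch each changed document, publish it, remember the sequence number) can be sketched as follows. This is not the original script: it swaps the EventMachine, AMQP and yajl-ruby pieces for plain standard-library Ruby, and the fetch_doc and publish callables are hypothetical stand-ins for an HTTP GET against CouchDB and a publish to RabbitMQ. The method name pump_changes is also made up for the example.

```ruby
require "json"
require "stringio"

# Read a continuous _changes stream line by line, fetch each changed
# document and publish it as JSON. Returns the last sequence number seen,
# which can be passed back in as `since` on the next run.
# `fetch_doc` and `publish` are caller-supplied callables (assumptions here).
def pump_changes(io, since:, fetch_doc:, publish:)
  last_seq = since
  io.each_line do |line|
    line = line.strip
    next if line.empty?              # heartbeat newline, nothing to do
    change = JSON.parse(line)
    next unless change.key?("id")    # e.g. a closing {"last_seq": ...} line
    # Deleted documents have no body to fetch, so only publish live ones.
    publish.call(JSON.generate(fetch_doc.call(change["id"]))) unless change["deleted"]
    last_seq = change["seq"]
  end
  last_seq
end

# Usage with in-memory stand-ins instead of a live CouchDB and RabbitMQ.
docs = { "a" => { "_id" => "a", "n" => 1 } }
feed = StringIO.new(%({"seq":5,"id":"a","changes":[{"rev":"1-x"}]}\n\n))
sent = []
resume_from = pump_changes(feed,
                           since: 4,
                           fetch_doc: ->(id) { docs.fetch(id) },
                           publish: ->(msg) { sent << msg })
puts resume_from  # sequence number to use as `since` next time
```

Updating last_seq only after a successful publish means a crash mid-message replays that change on restart instead of skipping it.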
The next bit of code works from the other side of the queue. It subscribes to the queue, parses the JSON, performs some operations on it and puts the results back into another CouchDB database called “results”.
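A minimal sketch of that consumer side, again in standard-library Ruby and not the original code. A Thread::Queue stands in for the RabbitMQ subscription, and the "operation" (counting a document's non-underscore fields) is a placeholder, since the post does not say what processing was done. The names process_message and results_request and the localhost:5984 address are assumptions.

```ruby
require "json"
require "net/http"
require "uri"

# Turn one queue message (a JSON-encoded document) into a result document.
# Counting fields is only a placeholder for real processing.
def process_message(payload)
  doc = JSON.parse(payload)
  user_fields = doc.keys.reject { |key| key.start_with?("_") }
  { "source_id" => doc["_id"], "field_count" => user_fields.size }
end

# Build (but do not send) the request that stores a result in the
# "results" database. POSTing JSON to a database URL asks CouchDB to
# create a document with a generated ID.
def results_request(base, result)
  req = Net::HTTP::Post.new(URI("#{base}/results"))
  req["Content-Type"] = "application/json"
  req.body = JSON.generate(result)
  req
end

# In-memory stand-in for the queue subscription.
queue = Queue.new
queue << JSON.generate({ "_id" => "a", "x" => 1, "y" => 2 })

result = process_message(queue.pop)
request = results_request("http://localhost:5984", result)
puts request.body
```

Against a running CouchDB, the request could be sent with Net::HTTP.start(host, port) { |http| http.request(request) }. It is left unsent here so the example runs without a server.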
What could it be used for? My first thought is some sort of parallel computation: boot up a few dozen EC2 nodes and start dumping data into CouchDB. Have all those nodes pop messages off the queue, process them and dump the results back into Couch. One could also chain these together to process the results again. The queue ends up being a simple job management system, with the EC2 nodes popping new messages as they finish processing the previous ones. With a little bit of work, a few more features and the right use case, I think this could be a pretty powerful system.
Check out the code and my other projects, and follow me on Twitter @williamsjoe.
[edit: made a slight improvement to changes_sub.rb on 20100107]
December 18, 2009
Best Music of 2009.
I do this every year and figured 2009 should be no different. Here are my picks for the best albums and EPs of the year in no specific order.
Best albums:
Wavves – Wavves
Telefon Tel Aviv – Immolate Yourself
Riceboy Sleeps – Riceboy Sleeps
Dirty Projectors – Bitte Orca
Black Moth Super Rainbow – Eating Us
Animal Collective – Merriweather Post Pavilion
Japandroids – Post-Nothing
People Under the Stairs – Carried Away
Fuck Buttons – Tarot Sport
Russian Circles – Geneva
Best EPs:
Extra Life/Nat Baldwin – A Split
LITE – Turns Red
Abe Vigoda – Reviver
STATS – Marooned
December 15, 2009
Biodynamic Agriculture Applied to Datacenters.
While listening to the Green HPC podcast, I had the thought that biodynamic agriculture could be applied to managing datacenters. Now, I might be off my rocker, but I think it might be a worthwhile way to think about it, hopefully without getting too hippy-ish.
From Wikipedia:
Biodynamic agriculture is a method of organic farming with homeopathic composts that treats farms as unified and individual organisms, emphasizing balancing the holistic development and interrelationship of the soil, plants, animals as a self-nourishing system without external inputs insofar as this is possible given the loss of nutrients due to the export of food.
To me this totally has an analog in datacenters, server farms (pun intended) and machine rooms. To paraphrase the above Wikipedia quote:
An electrodynamic datacenter is one that is treated as a unified and individual organism. That is, each datacenter is an autonomous entity and needs to be thought about as an organism where all the components (CRACs, servers, network, power, etc.) are balanced and interrelated, without external inputs insofar as this is possible given the loss of capacity (bandwidth, compute, storage, etc.) due to the export of data, compute or another resource.
Putting it like that seems pretty reasonable and would seem to lean toward making datacenters as efficient as possible. The goal is to reduce external inputs (power, bandwidth, etc.) while still getting the desired amount of output. Practices such as running datacenters hot, optimizing data locality or shutting down part (or all) of a datacenter when it is not needed would be commonplace. This would require tight monitoring, analysis, controls and automation on inputs and outputs. It also means developing a quantitative relationship between consumption/utilization and production, i.e. how much input is required for X amount of output. Certainly an interesting problem to solve and system to build, although I imagine some level of this has been implemented by the Googles of the world. While datacenters will likely never be self-sustaining, in the end this may be a reasonable way to think about datacenter controls and management, especially as we all try to go green for monetary and environmental reasons.
November 5, 2009
Baracus.
Just did my first official Cloudant blog post on a project I created called Baracus. It’s an httperf wrapper for benchmarking CouchDB. Check it out on GitHub.