Last night I decided to dig into CouchDB a bit more than I have in the past and set up a simple load-balanced, replicated configuration using HAProxy. In the end it was a pretty easy feat and seems to work fairly well. Here’s what I had to do.
First, I set up three instances of CouchDB on the same machine, using a different configuration file, PID file and loopback address for each. Three separate machines would work just as well. If you run them on the same machine, make sure you adjust DbRootDir, BindAddress and LogFile in each configuration file, and use a command like the following to start each instance so that the non-default configuration and PID location are used.
./couchdb -c SOME_PATH/couchdb2.ini -p SOME_PATH/couchdb2.pid
As you may already know, CouchDB has a nice web interface called Futon at http://HOSTNAME:5984/_utils/. Using Futon I created a database with the same name on all three instances. I then chose couchdb1 as my “master”, with couchdb2 and couchdb3 as the “slaves”. I put master and slave in quotes because, as far as I can tell, CouchDB has no such relationship built in. All instances can replicate to each other as long as they can connect to each other, so master-slave replication is simply the configuration I am enforcing with HAProxy and my replication POST commands. More on these bits later. I then created a document on my master node and used Futon’s replicator to replicate the changes to the other nodes. Next I wanted a way to automate or schedule this. You can initiate replication simply by sending a POST request to CouchDB, so I wrote a simple curl script to do just that.
First I created the replication POST body in a file:
{"source":"test_rep","target":"http://couchdb2:5984/test_rep"}
When run against the master this will replicate the master to couchdb2. I wrote a similar file for couchdb3 as well.
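Since the body files differ only in the target host, they can also be generated instead of written by hand. Here is a minimal Python sketch, assuming the database name and host names used in this post; the function name replication_body is made up for illustration and is not part of CouchDB.

```python
import json

def replication_body(db_name, target_host, port=5984):
    """Return the JSON body for replicating the local db_name to target_host."""
    return json.dumps({
        "source": db_name,
        "target": "http://%s:%d/%s" % (target_host, port, db_name),
    })

if __name__ == "__main__":
    # One body per slave, equivalent to the hand-written files.
    for host in ("couchdb2", "couchdb3"):
        print(replication_body("test_rep", host))
```

The printed lines can be redirected into files such as couchdb1_2_rep and couchdb1_3_rep.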
Then using curl I can send this body to the master:
curl -X POST --data @couchdb1_2_rep http://couchdb1:5984/_replicate
curl -X POST --data @couchdb1_3_rep http://couchdb1:5984/_replicate
After running these you should see output that starts with {"ok":true,"session_id …}, which means things went well. You should also see some output in the logs on both instances. These commands can be put in a cron job to run at specific intervals to keep the slaves updated. You can also create a script and configure DbUpdateNotificationProcess to replicate after each update. The latter is probably a nicer solution, but cron and curl should get you started.
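If you would rather have cron call one script than several curl commands, something like the following Python sketch could do the same job. It assumes the master URL and database name from this post; the script name replicate.py and the function names are hypothetical. The Content-Type header is my addition and not part of the curl commands above; as far as I know, newer CouchDB releases expect it on JSON requests.

```python
import json
import sys
import urllib.request

MASTER_URL = "http://couchdb1:5984"   # master address used in this post
DB_NAME = "test_rep"                  # database name used in this post

def build_request(master_url, db_name, slave_host):
    """Build (but do not send) the POST to the master's _replicate endpoint."""
    body = json.dumps({
        "source": db_name,
        "target": "http://%s:5984/%s" % (slave_host, db_name),
    }).encode("utf-8")
    return urllib.request.Request(
        master_url + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},  # assumption, see above
        method="POST",
    )

def replicate(master_url, db_name, slave_host, timeout=60):
    """Send the replication request; return True if the reply has "ok": true."""
    req = build_request(master_url, db_name, slave_host)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    return result.get("ok") is True

if __name__ == "__main__":
    # Slave host names come from the command line, e.g.:
    #   python replicate.py couchdb2 couchdb3
    for slave in sys.argv[1:]:
        try:
            ok = replicate(MASTER_URL, DB_NAME, slave)
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            ok = False
            print("error talking to master: %s" % err, file=sys.stderr)
        print("%s -> %s: %s" % (DB_NAME, slave, "ok" if ok else "failed"))
```

A single crontab entry that runs the script with both slave names then replaces the two curl lines.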
I then moved on to setting up HAProxy to load balance between the nodes. Since I wanted a master-slave relationship between the nodes, I needed HAProxy to send only POSTs, PUTs and DELETEs to the master and GET requests to the two slaves. After checking the docs and playing with a couple of different ACL configurations I hadn’t found a solution, so I asked the mailing list for advice and a solution came back quickly. They also pointed me to another piece of documentation I hadn’t found initially. My configuration for HAProxy is pretty basic, but it shows what needs to be done.
global
    maxconn 4096
    nbproc 2

defaults
    mode http
    clitimeout 150000
    srvtimeout 30000
    contimeout 4000
    balance roundrobin
    stats enable
    stats uri /haproxy?stats

frontend couchdb_lb
    bind localhost:8080
    acl master_methods method POST DELETE PUT
    use_backend master_backend if master_methods
    default_backend slave_backend

backend master_backend
    server couchdb1 couchdb1:5984 weight 1 maxconn 512 check

backend slave_backend
    server couchdb2 couchdb2:5984 weight 1 maxconn 512 check
    server couchdb3 couchdb3:5984 weight 1 maxconn 512 check
The part that enforces where the PUTs, DELETEs and POSTs go is the ACL definition: if HAProxy receives a POST, DELETE or PUT it uses the master backend, otherwise it uses the slave backend.
Once done, I started up HAProxy and tested it, and found that it worked nicely, with GETs going to the slaves in round-robin fashion and PUTs, DELETEs and POSTs going to the master. I then made a slight change to my curl commands from earlier to send the replication POSTs through HAProxy, just to make sure.
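To make the routing rule concrete, here is a toy Python model of the decision the ACL expresses. It is only a sketch of the logic, not how HAProxy works internally: it ignores weights, health checks and connection limits, and the name make_router is invented for illustration.

```python
from itertools import cycle

# Mirrors: acl master_methods method POST DELETE PUT
MASTER_METHODS = {"POST", "DELETE", "PUT"}

def make_router(master="couchdb1", slaves=("couchdb2", "couchdb3")):
    """Return a function mapping an HTTP method to the server it would reach."""
    slave_cycle = cycle(slaves)  # simple round-robin over the slaves

    def route(method):
        if method in MASTER_METHODS:
            return master            # use_backend master_backend if master_methods
        return next(slave_cycle)     # default_backend slave_backend

    return route

if __name__ == "__main__":
    route = make_router()
    for method in ["GET", "GET", "PUT", "GET", "POST", "DELETE"]:
        print(method, "->", route(method))
```

Note that in this model, as in the ACL, any method that is not listed (HEAD, for example) falls through to the slaves.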
curl -X POST --data @couchdb1_2_rep http://localhost:8080/_replicate
curl -X POST --data @couchdb1_3_rep http://localhost:8080/_replicate
If things are working properly you should find that the replication POST commands only go to the master node and the GET commands go to the two slaves.
CouchDB is pretty easy to get going and fun to work with. Hopefully this will help you get going.
Nice! I like how easy it is to get the master/slave behavior by simply configuring a proxy server.
I’m wondering about better ways to do this replication though. It looks like we have the following sequence of steps:
1. CouchDB instance A completes an update operation
2. Instance A runs some kind of user-supplied script?
3. Script sends instance A a POST request giving a command to replicate with CouchDB instance B
4. Instance A connects to instance B to get a list of documents on instance B.
5. Instance A determines which documents it has that instance B doesn’t have, and sends those documents to instance B using a batch request.
6. Go back to step 3, replacing B with C
7. …
That seems kind of inefficient. I’m wondering if we could do something more like:
1. CouchDB instance A completes an update operation
2. Instance A runs some kind of user-supplied script?
3. Script sends the same update to instance B
4. Script sends the same update to instance C
5. …
or, as a variation on that theme:
3. Script publishes the update to a messaging server. In AMQP terminology, publish the update to an exchange. In JMS terminology, publish it to a topic.
4. ‘Slave’ databases have a live connection over which they are listening for these updates.
Great Post. How has the HAProxy setup been working for you? I do worry that the replication step involves too many steps.
It works pretty well and is in production. Regarding replication I believe there will be configurable permanent replicators coming to couch soon.
Thought I’d add a bit more detail that clarifies a couple of gotchas I ran into while bringing up a second CouchDB instance on the same machine.
The first problem I had was in specifying the separate .ini file. I had assumed that -c SOME_PATH/couchdb2.ini would work just like local.ini did for the first instance, i.e., as an override on top of default.ini. Not so. -c (lowercase c) doesn’t build on the “system default” configuration file chain; -C (uppercase C) does. So you can either explicitly include both default.ini and your instance’s .ini file using -c, or include just your instance’s .ini file using -C. (Using -c with only your instance’s .ini file also works, provided that file doesn’t assume it’s building on top of default.ini.)
The second has to do with directory and file ownership. If you are running couchdb under the couchdb user, as I was (and as the default startup script does), the directory containing your database and view files needs to be readable and writable by the couchdb user. The log file also has to be writable by the couchdb user. An easy way to make sure this is the case is to chown the directories and log file to the couchdb user.
Thanks for the article!