MongoDB is great if you know how to use it. But our production database is reaching several limits.

We are heavily using SSD-backed EC2 instances, and our data volume is skyrocketing. Can we shard? Yes, but it would complicate our administration, and our Rails app was not ready for it: it was not designed ahead of time around the correct shard key. Moreover, the growth we are experiencing is overwhelming.

We have already done something like this once, splitting out some small but very read-intensive collections, but back then it was only a couple of GB and the entire dump/restore took only a couple of minutes.

Now we have two large collections to separate: foo, which has approximately 800 MM docs and on which our system is highly dependent, and bar, which has 200 MM docs and on which we can afford a small downtime.

So, how can we do it? Let's separate them into two different, independent replica sets!

Our initial setup is one replica set, RS1, with 3 servers in 3 different Availability Zones, holding both large collections. Our intended final setup is 2 replica sets (RS1 and RS2), each holding one of the large collections.


So let's get our hands dirty:


                                   
1 - Add 3 new servers to RS1 and let them sync.
2 - Stop all 3 new servers.
3 - Remove them from RS1. Fail to do that and the application will still try to read from them.
4 - Start each new server without the --replSet option, pointing at the correct data directory (a rough sketch of steps 1-4 is shown below).
5 - Update the local.system.replset doc on each new server with the new replica set name. You have to change it on every one of them.
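
Before touching local.system.replset, steps 1 to 4 look roughly like this; the hostnames are hypothetical and the _id values assume the existing RS1 members use 0, 1 and 2:

// Step 1 - on the RS1 PRIMARY, add the new servers as non-voting, priority-0
// members so they sync data but can never be elected (see Caveats below):
rs.add({ _id: 3, host: "new1.example.com:27017", priority: 0, votes: 0 })
rs.add({ _id: 4, host: "new2.example.com:27017", priority: 0, votes: 0 })
rs.add({ _id: 5, host: "new3.example.com:27017", priority: 0, votes: 0 })

// Step 2 - once they report SECONDARY and are fully synced, stop mongod on each one.

// Step 3 - back on the PRIMARY, remove them from RS1:
rs.remove("new1.example.com:27017")
rs.remove("new2.example.com:27017")
rs.remove("new3.example.com:27017")

// Step 4 - on each new server, start mongod again without the --replSet option
// (for example, with the replSet line commented out of /etc/mongod.conf),
// so it comes up standalone on the same data directory.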

Here's a little snippet showing step 5, with credit to this Stack Overflow answer (http://stackoverflow.com/questions/11265997/can-i-change-the-name-of-my-replica-set-while-mongod-processes-are-running):

use local
cfg = db.system.replset.findOne()

// rename the replica set
cfg._id = "RS2"

// the 3 new servers sit at positions 3, 4 and 5 of the members array here;
// adjust the indexes to match your own config. Give them back their votes
// (and their priority, if you added them as priority 0):
cfg.members[3].votes = 1
cfg.members[4].votes = 1
cfg.members[5].votes = 1
cfg.members[3].priority = 1
cfg.members[4].priority = 1
cfg.members[5].priority = 1

// keep only the new servers in the config and save it back
cfg.members = [cfg.members[3], cfg.members[4], cfg.members[5]]
db.system.replset.update({}, cfg)



6 - Shut them down again.
7 - Change the replSet option in /etc/mongod.conf to the new name, "RS2", on the new servers.
8 - Start them all up with the original options.
9 - Check that the new RS is serving reads and writes (see the sketch after this list).
10 - Remove collection foo from RS1 and remove bar from RS2.
11 - (optional) Rest for a month or so, until you grow another 200%. Again.
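
A minimal sketch of steps 7, 9 and 10; the database name mydb is hypothetical, so adjust it (and the hostnames) to your setup:

// Step 7 - in /etc/mongod.conf on each new server:
//     replSet = RS2

// Step 9 - connected to one of the new servers:
rs.status()                          // RS2 should show one PRIMARY and two SECONDARYs
db.getSiblingDB("mydb").foo.count()  // reads work; try a test write on the PRIMARY too

// Step 10 - connected to the PRIMARY of RS1:
db.getSiblingDB("mydb").foo.drop()
// ... and connected to the PRIMARY of RS2:
db.getSiblingDB("mydb").bar.drop()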


Caveats:
- Make the new servers non-voters when you add them to the old RS, and voters on the new one.
- Run tests on dummies before applying this in production.
- Rename the less critical one, because the rename requires downtime. If you REALLY need that collection kept up to date during the downtime, there are some open-source tools in Java (https://github.com/wordnik/wordnik-oss) or Python (http://pypi.python.org/pypi/OplogReplay/0.1.4) which can do the trick, with a little hack to read the oplog for just the right collection and replay it on the second replica (a rough sketch follows this list).
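
For reference, here is a rough sketch of that hack in the mongo shell, assuming a hypothetical database mydb and that you noted the oplog position (lastTs) right before the downtime started; the linked tools do essentially the same thing, just more robustly:

// on a member of the replica set that kept receiving writes,
// collect every oplog entry touching mydb.bar since the cut-off
// (fine for a small backlog that fits in memory):
var lastTs = Timestamp(1356998400, 1)   // hypothetical oplog position at the start of the downtime
var ops = db.getSiblingDB("local").oplog.rs
            .find({ ns: "mydb.bar", ts: { $gt: lastTs } })
            .sort({ $natural: 1 })
            .toArray()

// then, connected to the PRIMARY of the other replica set, replay them:
db.adminCommand({ applyOps: ops })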

UPDATE:

We did it last week. It didn't go quite as smoothly as planned: when two of the new members came up, the third, which was a bit ahead of the others when it was stopped, refused to sync and went into a permanent FATAL state.
Long story short, we successfully did a full resync on it, and now we have two databases instead of one; as expected, the %lock on each is roughly half of the original value at peak hours.
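
For the record, the full resync was nothing fancy, just the usual wipe-and-initial-sync procedure on the stuck member (the service name and dbpath are hypothetical):

sudo service mongod stop
rm -rf /var/lib/mongodb/*        # wipe its data files
sudo service mongod start        # it initial-syncs everything back from the other members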

The pressure on the whole system went down; we now have far fewer exceptions, and responsiveness is very good.


NEXT:

Cassandra will save the day, replacing our back end for analytics and aggregation. Our preliminary tests are very promising, and we hope to put it into production next week.