We just finished up a pretty large migration of about 30 Cassandra boxes to a new datacenter.
We'll be migrating to about 60 boxes in the next month, so scalability (and being able to scale cleanly) is important.

We also completed an Elasticsearch migration at the same time. The ES migration worked fine. It had a few small problems with relocating nodes too often, but all in all it was fairly painless. At one point we were doing 200 shard relocations in parallel and pushing about 2-4Gbit...

The Cassandra migration, however, was a LOT harder.

One quick thing I wanted to point out - we're hiring. So if you're a killer Java devops guy, drop me an email... Anyway. Back to the story.

Obviously we did a bunch of research beforehand to make sure we had plenty of bandwidth. This was a migration from Washington DC to Germany. Using iperf, we could consistently push about 2Gbit in both directions between DC and Germany. That held for TCP as well, once we switched to large window sizes.

The big problem we had was that we could only bootstrap one node at a time. That ends up taking a LOT more time, because you have to keep checking on each node before you can start the next one. I imagine one could write a coordinator script (a rough sketch of what I mean is at the bottom of this mail), but we had so many problems with Cassandra that it wouldn't have worked even if we'd tried.

We ran into a handful of main problems (and picked up a few lessons along the way):

1. Sometimes streams would just stop and lock up, with no explanation why. They would just hang and never resume. We'd wait 10-15 minutes with no progress, which forced us to abort and retry. Had we upgraded to Cassandra 2.2 beforehand, I think the new streaming resume support would have helped.

2. Some of our keyspaces created via Thrift caused "too few resources" exceptions when trying to bootstrap. Dropping these keyspaces fixed the problem. They were just test keyspaces, so it didn't matter.

3. Because of #1, it's probably better to make sure you have 2x or more disk space on the remote end before you do the migration. That way you can boot the same number of nodes you had before and just decommission the old ones quickly (er, use nodetool removenode - see below).

4. We're not sure why, but our OLDER machines kept locking up during this process, which kept forcing us to do rolling restarts across all the older nodes. We suspect GC, since we were seeing single cores pegged at 100%. I didn't have time to attach a profiler; we were all burned out at that point and just wanted to get it over with. This made #1 worse, because our old boxes would either refuse to send streams or refuse to accept them. It seemed to get better once we upgraded the older boxes to Java 8.

5. Don't use nodetool decommission if you have a large number of nodes. Use nodetool removenode instead. It's MUCH faster and does the replication directly between the remaining nodes (M-to-N). The downside is that you drop to N-1 replicas while it runs. But it was easily 20-30x faster and probably saved me about 5 hours of sleep!

In hindsight, I'm not sure what we would have done differently. Maybe bought more boxes. Maybe upgraded to Cassandra 2.2, and probably Java 8 as well. Setting it up as a proper datacenter migration might have worked out better too.

Kevin

-- 

We're hiring if you know of any awesome Java Devops or Linux Operations Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
... or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
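
P.S. Here's the rough shape of the coordinator script I mentioned above. This is just an untested sketch, not something we actually ran: it assumes password-less SSH to every new node, nodetool on the PATH over there, a plain `service cassandra start` to kick off the bootstrap, and the NEW_NODES / SEED_NODE values below are made-up placeholders you'd fill in for your own cluster.

#!/usr/bin/env python
# Untested sketch of a one-node-at-a-time bootstrap coordinator.
# Assumptions: password-less SSH to every new node, nodetool on the PATH,
# and `service cassandra start` is how Cassandra gets launched there.

import subprocess
import time

NEW_NODES = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]  # hypothetical new-DC IPs
SEED_NODE = "10.0.1.1"                               # node we poll for ring state
POLL_SECS = 60

def ssh(host, cmd):
    """Run a command on a remote host and return its stdout as text."""
    return subprocess.check_output(["ssh", host, cmd]).decode()

def is_up_normal(host):
    """True once `nodetool status` on the seed reports this host as UN (Up/Normal)."""
    for line in ssh(SEED_NODE, "nodetool status").splitlines():
        parts = line.split()
        if len(parts) > 1 and parts[1] == host and parts[0] == "UN":
            return True
    return False

def bootstrap(host):
    """Start Cassandra on one node and block until it has fully joined the ring."""
    ssh(host, "sudo service cassandra start")
    while not is_up_normal(host):
        # A smarter version would also watch `nodetool netstats` here and
        # abort/retry when streams stop making progress (problem #1 above).
        time.sleep(POLL_SECS)

if __name__ == "__main__":
    for node in NEW_NODES:
        bootstrap(node)  # strictly one node at a time
        print("bootstrapped %s" % node)

The only interesting design point is that it polls the ring from a seed node rather than trusting the new node itself, and it never starts node N+1 until node N shows up as UN - which is exactly the babysitting we were doing by hand.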