We just finished up a pretty large migration of about 30 Cassandra boxes to a new datacenter.
We'll be migrating to about 60 boxes in the next month, so scalability (and being able to scale cleanly) is important.

We also completed an Elasticsearch migration at the same time. The ES migration worked fine. It had a few small problems with relocating nodes too often, but all in all it was fairly painless. At one point we were doing 200 shard relocations in parallel and pushing about 2-4Gbit...

The Cassandra migration, however, was a LOT harder.

One quick thing I wanted to point out - we're hiring. So if you're a killer Java devops guy, drop me an email... Anyway. Back to the story.

Obviously we did a bunch of research beforehand to make sure we had plenty of bandwidth. This was a migration from Washington DC to Germany. Using iperf, we could consistently push about 2Gbit in both directions between DC and Germany. That held for TCP as well, once we switched to large window sizes.

The big problem we had was that we could only bootstrap one node at a time. That ends up taking a LOT more time, because you have to keep checking on each node before you can start the next one. I imagine one could write a coordinator script (a rough sketch of what I mean is at the bottom of this mail), but we had so many problems with Cassandra that it wouldn't have worked even if we'd tried.

We ran into a handful of main problems (and picked up a few lessons along the way):

1. Sometimes streams would just stop and lock up, with no explanation why. They would just hang and never resume. We'd wait 10-15 minutes with no progress, which forced us to abort and retry. Had we upgraded to Cassandra 2.2 beforehand, I think the new streaming resume support would have helped.

2. Some of our keyspaces created via Thrift caused "too few resources" exceptions when trying to bootstrap. Dropping these keyspaces fixed the problem. They were just test keyspaces, so it didn't matter.

3. Because of #1, it's probably better to make sure you have 2x or more disk space on the remote end before you do the migration. That way you can boot the same number of nodes you had before and just decommission the old ones quickly (er, use nodetool removenode - see below).

4. We're not sure why, but our OLDER machines kept locking up during this process, which kept forcing us to do rolling restarts across all the older nodes. We suspect GC, since we were seeing single cores pegged at 100%. I didn't have time to attach a profiler; we were all burned out at that point and just wanted to get it over with. This made #1 worse, because our old boxes would either refuse to send streams or refuse to accept them. It seemed to get better once we upgraded the older boxes to Java 8.

5. Don't use nodetool decommission if you have a large number of nodes. Use nodetool removenode instead. It's MUCH faster and does the replication directly between the remaining nodes (M-to-N). The downside is that you drop to N-1 replicas while it runs. But it was easily 20-30x faster and probably saved me about 5 hours of sleep!

In hindsight, I'm not sure what we would have done differently. Maybe bought more boxes. Maybe upgraded to Cassandra 2.2, and probably Java 8 as well. Setting it up as a proper datacenter migration might have worked out better too.

Kevin

-- 

We're hiring if you know of any awesome Java Devops or Linux Operations Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
... or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
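
P.S. Here's the rough shape of the coordinator script I mentioned above. This is just an untested sketch, not something we actually ran: it assumes password-less SSH to every new node, nodetool on the PATH over there, a plain `service cassandra start` to kick off the bootstrap, and the NEW_NODES / SEED_NODE values below are made-up placeholders you'd fill in for your own cluster.

#!/usr/bin/env python
# Untested sketch of a one-node-at-a-time bootstrap coordinator.
# Assumptions: password-less SSH to every new node, nodetool on the PATH,
# and `service cassandra start` is how Cassandra gets launched there.

import subprocess
import time

NEW_NODES = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]  # hypothetical new-DC IPs
SEED_NODE = "10.0.1.1"                               # node we poll for ring state
POLL_SECS = 60

def ssh(host, cmd):
    """Run a command on a remote host and return its stdout as text."""
    return subprocess.check_output(["ssh", host, cmd]).decode()

def is_up_normal(host):
    """True once `nodetool status` on the seed reports this host as UN (Up/Normal)."""
    for line in ssh(SEED_NODE, "nodetool status").splitlines():
        parts = line.split()
        if len(parts) > 1 and parts[1] == host and parts[0] == "UN":
            return True
    return False

def bootstrap(host):
    """Start Cassandra on one node and block until it has fully joined the ring."""
    ssh(host, "sudo service cassandra start")
    while not is_up_normal(host):
        # A smarter version would also watch `nodetool netstats` here and
        # abort/retry when streams stop making progress (problem #1 above).
        time.sleep(POLL_SECS)

if __name__ == "__main__":
    for node in NEW_NODES:
        bootstrap(node)  # strictly one node at a time
        print("bootstrapped %s" % node)

The only interesting design point is that it polls the ring from a seed node rather than trusting the new node itself, and it never starts node N+1 until node N shows up as UN - which is exactly the babysitting we were doing by hand.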