This will be a very selective response, not at all as exhaustive as it should be to truly cover what you bring up. Sorry, but here are some random tidbits.
> On the cassandra user list, I noticed a thread on a user that literally
> wrote his cluster to death. Correct me if I'm wrong, but based on that
> thread, it seemed like if one node runs out of space, it kills the
> cluster. I hope that is not the case.

A single node running out of disk doesn't kill a cluster as such. However, if you have a cluster of nodes where each is roughly the same size (and assuming even balancing of data) and you are so close to being full everywhere that one of them actually becomes full, you are in dangerous territory if you're worried about uptime.

I'd say there are two main types of issues here:

(1) Monitoring and predicting disk space usage with Cassandra is a bit more difficult than in a traditional system, because of the way usage varies over time and the way writes can have impacts on disk space that last for some time beyond the write itself. I.e., delayed effects.

(2) The fact that operational things like moving nodes and repair operations need disk space means it can be more difficult to get out of a bad situation.

> If I have a cluster of 3 computers that is filling up in disk space and
> I add a bunch of machines, how does cassandra deal with that situation?
> Does cassandra know not to keep writing to the initial three machines
> because the disk is near full and write to the new machines only? At
> some point my machines are going to be full of data and I don't want to
> write any more data to those machines until I add more storage to those
> machines. Is this possible to do?

Cassandra selects replica placement based on its replication strategy and ring layout (nodes and their tokens). There is no "fallback" to putting data elsewhere because nodes are full (doing so would likely introduce tremendous complexity, I think).

If a node runs catastrophically out of disk, it will probably stop working and effectively be offline in the cluster. Not necessarily though; it could be that e.g. compactions are not completing but writes still work, as do reads, and over time reads become less performant due to lack of compaction. Alternatively, if disks are completely full and memtables cannot be flushed, I believe you'd expect it to essentially go down.

I would say that running out of disk space is not something which is very gracefully handled right now, although there is some code to mitigate it. For example, during compaction there is some logic to check for disk space availability and try to compact smaller files first to possibly make room for larger compactions, but that does not address the overall issue.

> I read somewhere that it is a bad
> idea to use more than 50% of the drive utilization because of the
> compaction requirements.

For a single column family, a major compaction (i.e., one involving all of the CF's data) can currently double disk space usage if data is not overwritten or removed. In addition, things like nodetool repair and nodetool cleanup need disk space too. Here's where I'm not really up on all the details; it would be nice to arrive at a figure for the absolute worst-case disk space expansion that is possible.
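Purely as an illustration of why peak usage matters more than steady-state usage, here is a back-of-the-envelope sketch. The 2x-per-major-compaction figure and the 10% repair overhead are assumptions for the sake of the example, not measured or authoritative values:

    # Back-of-the-envelope headroom estimate. Assumptions (mine, not gospel):
    #  - a major compaction of one CF can temporarily need ~2x that CF's size
    #  - repair/cleanup adds some additional transient overhead on top

    def estimated_peak_gb(cf_sizes_gb, repair_overhead=0.1):
        total = sum(cf_sizes_gb)
        largest = max(cf_sizes_gb)
        # Worst case: the largest CF exists twice while it is being compacted,
        # plus some fraction of the total for repair/cleanup activity.
        return total + largest + repair_overhead * total

    disk_gb = 1000.0
    cfs = [400.0, 150.0, 50.0]   # hypothetical per-CF on-disk sizes
    peak = estimated_peak_gb(cfs)
    print("estimated peak: %.0f GB of %.0f GB" % (peak, disk_gb))
    # -> roughly 1060 GB on a 1000 GB volume, i.e. already too tight even
    #    though "only" 600 GB of it is live data

In a real cluster you would of course base this on your actual workload and observed behavior; the point is just that peak usage can be substantially above the steady-state size.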
> I was planning on using ssd for the sstables
> and standard hard drives for the compaction and writelogs. Is this
> possible to do?

It doesn't make sense for compaction, because compaction involves replacing sstables with new ones. You could write new sstables to separate storage and then copy them into the sstable storage location, but that doesn't really do anything except increase the total amount of I/O you're doing.

> I didn't see any options for specifying where the write
> logs are written and compaction is done.

There is a commitlog_directory option. Alternatively, the directory might be a symlink. Compaction is not done in a separate location (see above).

> Also, is it possible to add
> more drives to a running machine and have cassandra utilize a mounted
> directory with free space?

Sort of, but not really the way you want it. Short version is, "pretend the answer is no". The real answer is somewhere between yes and no (if someone feels like doing a write-up...).

I would suggest volume growth and file system resize if that is something you plan on doing, assuming you have a setup where such operations are sufficiently trusted. Probably RAID + LVM. But beware of LVM and its implications on correctness (e.g., write barriers). Err... basically, no, I don't recommend planning to fiddle with that. The potential for problems is probably too high. KISS.

> Cassandra should know to stop writing to the node once the directory
> with the sstables is near full, but it doesn't seem like it does
> anything like that.

It's kind of non-trivial to handle this gracefully in a way that satisfies both the "I don't want to be stuck" requirement and the "I don't want to be artificially down because of conservative estimates made by Cassandra, when I know my workload will survive with this disk space" requirement. But disk space management is definitely an area of improvement.

> My second set of questions relate to how Cassandra deals with recovering
> from node failure. How long does it take for Cassandra to fully recover
> from a node failure? Does it slow down the other nodes while it is
> recovering?

Anti-compaction and streaming are done to move data from nodes that have it (that are in the replica set). This implies CPU, I/O and network load on the source node, so it does have an impact. See http://wiki.apache.org/cassandra/Streaming among others.

(Here's where I'm not sure, someone please confirm/deny.) In 0.6, I believe this required disk space on the originating node. In 0.7, I *think* the need for disk space on the source node is removed.

> When a node goes down that hasn't had a chance to replicate
> the commit logs, then I'm assuming that data is lost forever. That
> seems like a pretty bad thing. How long does it take for commit logs
> to replicate?

Commit logs don't replicate. Commit logs are only used for internal node consistency. Data is replicated as part of (1) RPC relay directly as a result of traffic (the normal case), (2) read repair, and (3) anti-entropy (nodetool repair). See 'consistency' on http://wiki.apache.org/cassandra/Operations

You will have to select an appropriate consistency level on reads and writes, and configure your nodes appropriately (for example batch vs. periodic commitlog sync), depending on your requirements.
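To make the consistency level point a bit more concrete: you choose the level per read and per write from the client. Below is a rough sketch of what that looks like from Python using pycassa; the keyspace, column family and host names are made up, and the exact API differs a bit between client versions, so treat it as illustrative rather than authoritative:

    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel

    # Connect to a couple of nodes in the cluster (hypothetical names).
    pool = pycassa.ConnectionPool('MyKeyspace', ['node1:9160', 'node2:9160'])
    users = pycassa.ColumnFamily(pool, 'Users')

    # Write at QUORUM: a majority of replicas acknowledge the write before
    # the call returns, so losing a single node afterwards does not lose data.
    users.insert('jsmith', {'email': 'jsmith@example.com'},
                 write_consistency_level=ConsistencyLevel.QUORUM)

    # Read at QUORUM: a majority of replicas are consulted, which together
    # with QUORUM writes gives you read-your-writes behavior.
    row = users.get('jsmith', read_consistency_level=ConsistencyLevel.QUORUM)

Whether QUORUM (or ONE, or ALL) is appropriate depends entirely on your durability and availability requirements, which is the point of the operations page linked above.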
> If a node goes down, after we fix it and get it back up, do we need to
> do anything other than the nodetool add/remove commands?

If it only went down temporarily, just start it up again, as long as it hasn't been down longer than GCGraceSeconds (google it). If it's actually being replaced - see the operations wiki page linked above.

> How long does
> it take to get back to normal once it's back online. Is there any way
> to see the status of recovery?

If it's a node replacement, then whatever time it takes to bootstrap that node into the ring, which means streaming data from other nodes. So it directly depends on the amount of data. That said, streaming is pretty efficient and should hopefully saturate either your network or your disks.

> 1) Ability to specify which directory compactions occur in. (It's very,
> very painful to lose 50% of my ssd storage because of where compaction
> is occurring)

Doesn't quite make sense (see above), but I sympathize with the need to know disk space requirements.

> 2) Ability to add additional directories for storage of sstables. (I
> should be able to add a hard drive and have cassandra start using it for
> storage)

See above - sort of supported, but pretend it's not. In the future (see https://issues.apache.org/jira/browse/CASSANDRA-1608) it may again become useful to point Cassandra at multiple sstable directories, if limited-size sstables become a reality. That should also help in general with disk space requirements for large CFs.

> 4) Ability to specify a primary storage directory for main requests
> and secondary storage directory for replicas. Primary storage could be
> on a fast raid or ssd. Incoming requests could prefer primary storage
> that uses ssds. This will enable us to get the speed of ssd, but the
> cheap replication capabilities of standard disk.

That assumes that traffic goes only to the primary replica unless it goes down, which is only the case under specific circumstances. Also, it doesn't help you if, when a node goes down, all traffic goes to the remaining node(s) and kills them completely because they have to serve the same traffic off of slower storage.

> 5) Utilize all replicas for servicing requests. Not sure if this is
> already done, but if I have 5 replicas, are all 5 being used to service
> read requests?

In general yes, but it's possible to affect that. You also have to be mindful of how the consistency level of reads affects the number of reads that have to be done, in addition to read repair (if enabled).

-- 
/ Peter Schuller