> This is what I thought. I was wishing there might be another way to reclaim
> the space.

Be sure you really need this first :) Normally you just let it happen in the background.

> The problem is that the more data you have, the more time it will take
> Cassandra to respond.

Relative to what though? There are definitely important side-effects
of having very large data sets, and part of that involves compactions,
but in a normal steady-state system you should never be in the
position of "waiting" for a major compaction to run. Compactions are
intended to run every now and then in the background. They will result
in variations in disk space within certain bounds, which is expected.

Certainly the situation can be improved, and the current disk space
utilization is not perfect, but the above suggests to me that you're
trying to do something that is not really intended to be done.

> Reclaiming the space of deleted rows in the biggest SSTable requires a
> major compaction. This compaction can be triggered by adding 2x data (or
> 4x data in the default configuration) to the system or by executing it
> manually using JMX.

You can indeed choose to trigger major compactions with e.g. cron jobs.
But be aware that if you're operating close to running out of disk
space, you have other concerns too, such as periodic repair operations
also needing disk space.
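
For example, here is a minimal sketch of what such a cron-driven
trigger might invoke over JMX. The MBean and operation names are my
assumptions for the 0.6/0.7-era StorageService, as are the default JMX
port 8080 and the "MyKeyspace" placeholder; verify against your
version ("nodetool compact" wraps the same call):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class TriggerMajorCompaction {
        public static void main(String[] args) throws Exception {
            // Connect to the node's JMX port (8080 assumed; adjust to your config).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Assumed MBean/operation names; "MyKeyspace" is a placeholder.
                ObjectName ss = new ObjectName(
                        "org.apache.cassandra.service:type=StorageService");
                mbs.invoke(ss, "forceTableCompaction",
                        new Object[] { "MyKeyspace" },
                        new String[] { "java.lang.String" });
            } finally {
                connector.close();
            }
        }
    }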

Also: suppose you're overwriting lots of data (or replacing it by
deleting and adding other data). It is not necessarily true that you
need 4x the space you would otherwise need, just because of the
compaction threshold.

Keep in mind that compactions already need extra space anyway. If
you're *not* overwriting or adding data, a compaction of a single CF
is expected to need up to twice the amount of space that it occupies.
If you're doing more overwrites and deletions though, as you point out,
you will have more "dead" data at any given point in time. But on the
other hand, the peak disk space usage during compactions is lower. So
the peak disk space usage (which is what matters, since you must have
that much disk space available) is actually helped by the
deletions/overwrites too.
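
To make that concrete (illustrative numbers only): say a CF occupies
100 GB. With no dead data, a major compaction rewrites essentially all
of it, so the peak is roughly 100 + 100 = 200 GB while the old and new
SSTables coexist. If 40% of the CF is overwritten/deleted data, the
output is only roughly 60 GB, for a peak of roughly 100 + 60 = 160 GB,
even though you were "wasting" 40 GB on dead data between compactions.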

Further, suppose you trigger major compactions more often. That means
each compaction will have a higher relative spike of disk usage
because less data has had time to be overwritten or removed.

So in a sense, it's like the disk space demand is being moved between
the category of "dead data retained for longer than necessary" and
"peak disk usage during compaction".

Also keep in mind that the *low* point of disk space usage is not
subject to any fragmentation concerns. Depending on the size of your
data compared to e.g. column names, that disk space usage might be
significantly lower than what you would get with an in-place-updating
database. There are lots of trade-offs :)

You say you have to "wait" for deletions though, which sounds like
you're doing something unusual. Are you doing stuff like deleting lots
of data in bulk from one CF, only to then write data to *another* CF?
Such that you're actually having to wait for disk space to be freed to
make room for data somewhere else?

> In the case of a system that deletes data regularly, needs to serve
> customers all day, and must respond in milliseconds, this is a problem.

Not in general. I am afraid there may be some misunderstanding here.
Unless disk space is a problem for you (i.e., you're running out of
space), there is no need to wait for compactions. Certainly, whether
you can serve traffic 24/7 at low-millisecond latencies is an important
consideration, and it does become complex when disk I/O is involved,
but it is not about disk *space*. If you have important performance
requirements, make sure you can service the read load at all given
your data set size. If you're running out of disk, I presume your
data is big. See
http://wiki.apache.org/cassandra/LargeDataSetConsiderations

Perhaps you can describe your situation in more detail?

> It appears to me that in order to use Cassandra you must have a process
> that will trigger a major compaction on the nodes once every X amount
> of time.

For some cases this will be beneficial, but not always. It's been
further improved for 0.7 too w.r.t. tombstone handling in non-major
compactions (I don't have the JIRA ticket number handy). It's
certainly not a hard requirement and would only ever be relevant if
you're operating nodes that are significantly full.

> One case where you would do that is when you don't (or hardly ever)
> delete data.

Or just in most cases where you aren't pushing disk space limits.

> Another one is when your upper limit on response time is very high, so
> a major compaction will not hurt you.

To be really clear: Compaction is a background operation. It is never
the case that reads or writes somehow "wait" for compaction to
complete.
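
If you want to convince yourself of that, you can watch compaction
activity from the side while serving traffic, e.g. by polling the
CompactionManager MBean over JMX. A minimal sketch, where the MBean
and attribute names are again my assumptions to check against your
version (nodetool exposes similar information):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class WatchCompactions {
        public static void main(String[] args) throws Exception {
            // Same assumed JMX endpoint as in the earlier sketch.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Assumed MBean/attribute names; check your version.
                ObjectName cm = new ObjectName(
                        "org.apache.cassandra.db:type=CompactionManager");
                Object pending = mbs.getAttribute(cm, "PendingTasks");
                System.out.println("Pending compaction tasks: " + pending);
            } finally {
                connector.close();
            }
        }
    }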

-- 
/ Peter Schuller
