Re: Does replicate_on_write=true imply that CL.QUORUM for reads is unnecessary?
This is incorrect. IMO that page is misleading. replicate_on_write should normally always be turned on, or the change will only be recorded on one node. Replicate-on-write is asynchronous with respect to the request and doesn't affect consistency level at all.

On Wed, May 29, 2013 at 7:32 PM, Andrew Bialecki andrew.biale...@gmail.com wrote: To answer my own question, directly from the docs: http://www.datastax.com/docs/1.0/configuration/storage_configuration#replicate-on-write. It appears the answer to this is: yes, CL.QUORUM isn't necessary for reads. Essentially, replicate_on_write sets the CL to ALL regardless of what you actually set it to (and for good reason).

On Wed, May 29, 2013 at 9:47 AM, Andrew Bialecki andrew.biale...@gmail.com wrote: Quick question about counter columns. In looking at the replicate_on_write setting, assuming you go with the default of true, my understanding is that it writes the increment to all replicas on any increment. If that's the case, doesn't that mean there's no point in using CL.QUORUM for reads, because all replicas have the same values? Similarly, what effect does read_repair_chance have on counter columns, since they shouldn't need to read repair on write? In anticipation of a possible answer - that both CL.QUORUM for reads and read_repair_chance only end up mattering for counter deletions - it's safe to only use CL.ONE and disable the read repair if we're never deleting counters. (And, of course, if we did start deleting counters, we'd need to revert those client and column family changes.)

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Read IO
Is this correct ? Yes, at least under optimal conditions and assuming a reasonably sized row. Things like read-ahead (at the kernel level) will play into it; and if your read (even if assumed to be small) straddles two pages you might or might not take another read depending on your kernel settings (typically trading pollution of page cache vs. number of I/O:s). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Simulating a failed node
Operation [158320] retried 10 times - error inserting key 0158320 ((UnavailableException)) This means that at the point where the thrift request to write data was handled, the co-ordinator node (the one your client is connected to) believed that, among the replicas responsible for the key, too many were down to satisfy the consistency level. The most likely causes are that you're in fact not using RF 2 (e.g., the RF is really 1 for the keyspace you're inserting into), or you're in fact not using ONE. I'm sure my naive setup is flawed in some way, but what I was hoping for was when the node went down it would fail to write to the downed node and instead write to one of the other nodes in the cluster. So question is why are writes failing even after a retry? It might be that the stress client doesn't pool connections. Writes always go to all responsible replicas that are up, and when enough return (according to consistency level), the insert succeeds. If replicas fail to respond you may get a TimeoutException. UnavailableException means it didn't even try, because it didn't have enough replicas to write to. (Note though: reads are a bit of a different story, and if you want to test behavior when nodes go down I suggest including them. See CASSANDRA-2540 and CASSANDRA-3927.) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
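The co-ordinator's availability check can be sketched as a tiny model (this is illustrative Python, not Cassandra's actual code; the function names and the simple RF/CL table are my own):

```python
# Illustrative model of the co-ordinator's availability check -- not
# Cassandra's actual code; names and the RF/CL table are made up.
def required_replicas(consistency_level, replication_factor):
    return {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }[consistency_level]

def check_available(live_replica_count, consistency_level, replication_factor):
    """Raise before even attempting the write if too few replicas are
    believed up -- this is the UnavailableException case."""
    needed = required_replicas(consistency_level, replication_factor)
    if live_replica_count < needed:
        raise RuntimeError(
            "UnavailableException: need %d live replicas, have %d"
            % (needed, live_replica_count))

# With RF=2 and CL.ONE, one live replica suffices; with RF=1, the only
# replica being down makes the key unavailable even at ONE.
check_available(1, "ONE", 2)
```

This also shows why a retry against the same co-ordinator doesn't help: the failure is based on the co-ordinator's view of which replicas are up, not on any attempted write.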
Re: Java 7 support?
FWIW, we're using openjdk7 on most of our clusters. For those where we are still on openjdk6, it's not because of an issue - just haven't gotten to rolling out the upgrade yet. We haven't had any issues that I recall with upgrading the JDK. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: nodetool cleanup
On Oct 22, 2012 11:54 AM, B. Todd Burruss bto...@gmail.com wrote: does nodetool cleanup perform a major compaction in the process of removing unwanted data? No.
Re: Why data tripled in size after repair?
It looks like what I need. A couple of questions: does it work with RandomPartitioner only? I use ByteOrderedPartitioner. I believe it should work with BOP, based on a cursory re-examination of the patch. I could be wrong. I don't see it as part of any release. Am I supposed to build my own version of Cassandra? It's in the 1.1 branch; I don't remember if it went into a release yet. If not, it'll be in the next 1.1.x release. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Why data tripled in size after repair?
What is strange: every time I run repair, the data takes almost 3 times more space - 270G; then I run compaction and get 100G back. https://issues.apache.org/jira/browse/CASSANDRA-2699 outlines the main issues with repair. In short, in your case the limited granularity of merkle trees is causing too much data to be streamed (effectively duplicate data). https://issues.apache.org/jira/browse/CASSANDRA-3912 may be a bandaid for you in that it allows granularity to be much finer, and the process to be more incremental. A 'nodetool compact' decreases disk space temporarily, as you have noticed, but it may also have a long-term negative effect on steady-state disk space usage depending on your workload. If you've got a workload that's not limited to insertions only (i.e., you have overwrites/deletes), a major compaction will tend to push steady-state disk space usage up - because you're creating a single sstable bigger than what would normally happen, and it takes more total disk space before it will be part of a compaction again. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
It is not a total waste, but practically your time is better spent in other places. The problem is that just about everything is a moving target: schema, request rate, hardware. Generally tuning nudges a couple of variables in one direction or the other and you see some decent returns. But each nudge takes a restart and a warm-up period, and with how Cassandra distributes requests you likely have to flip several nodes or all of them before you can see the change! By the time you do that it's probably a different day or week. Essentially, finding out if one setting is better than the other is like a 3-day test in production. Before C* I used to deal with this in Tomcat. Once in a while we would get a dev that read some article about tuning, something about a new JVM, or a collector. With bright-eyed enthusiasm they would want to try tuning our current cluster. They would spend a couple of days, measure something, and say it was good: lower memory usage. Meanwhile someone else would come to me and report higher 95th percentile response times. More short pauses, fewer long pauses; great taste, less filling. That's why blind blackbox testing isn't the way to go. Understanding what the application does, what the GC does, and the goals you have in mind is more fruitful. For example, are you trying to improve p99? Maybe you want to improve p999 at the cost of worse p99? What about failure modes (non-happy cases)? Perhaps you don't care about few-hundred-ms pauses but want to avoid full gc:s? There are lots of different goals one might have, and workloads. Testing is key, but only in combination with some directed choice of what to tweak. Especially since it's hard to test for the non-happy cases (e.g., a node takes a burst of traffic and starts promoting everything into old-gen prior to processing a request, resulting in a death spiral). G1 is the perfect example of a time suck.
Claims low pause latency for big heaps, and delivers something regarded by the Cassandra community (and HBase's as well) as working worse than CMS. If you spent 3 hours switching tuning knobs and analysing, that is 3 hours of your life you will never get back. This is similar to saying that someone told you to switch to CMS (or use some particular flag, etc.), you tried it, and it didn't have the result you expected. G1 and CMS have different trade-offs. Neither one will consistently result in better latencies across the board. It's all about the details. Better to let SUN and other people worry about tuning (at least from where I sit) They're not tuning. They are providing very general-purpose default behavior, including things that make *no* sense at all with Cassandra. For example, the default behavior with CMS is to try to make the marking phase run as late as possible so that it finishes just prior to heap exhaustion, in order to optimize for throughput; except that's not a good idea in many cases, because it exacerbates fragmentation problems in old-gen by pushing usage very high repeatedly, and it increases the chance of full gc because marking started too late (even if you don't hit promotion failures due to fragmentation). Sudden changes in workloads (e.g., compaction kicks in) also make it harder for CMS's mark-triggering heuristics to work well. As such, the default options for Cassandra use certain settings that diverge from the default behavior of the JVM, because Cassandra-in-general is a much more specific use-case than the completely general target audience of the JVM. Similarly, a particular cluster (with certain workloads/goals/etc.) is a yet more specific use-case than Cassandra-in-general and may be better served by settings that differ from those of default Cassandra.
But, I certainly agree with this (which I think roughly matches what you're saying): don't randomly pick options someone claims are good in a blog post and expect them to just make things better. If it were that easy, it would be the default behavior for obvious reasons. The reason it's not is likely that it depends on the situation. Further, even if you do play the lottery and win - if you don't know *why*, how are you able to extrapolate the behavior of the system with slightly changed workloads? It's very hard to blackbox-test GC settings, which is probably why GC tuning can be perceived as a useless game of whack-a-mole. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Invalid Counter Shard errors?
I don't understand what the three values in parentheses are exactly. I guess the last number is the count and the middle one is the number of increments, is that true? What is the first string (identical in all the errors)? It's (UUID, clock, increment). Very briefly, counter columns in Cassandra are made up of multiple shards. In the write path, a particular counter increment is executed by one leader, which is one of the replicas of the counter. The leader will increment its own value, read its own full value (this is why Replicate On Write has to do reads in the write path for counters) and replicate it to other nodes. The UUID roughly corresponds to a node in the cluster (UUID:s are sometimes refreshed, so it's not a strict correlation). The clock is supposed to be monotonically increasing for a given UUID. How can the last number (assuming it's the count) be negative knowing that I only sum positive numbers? I don't see a negative number in your paste? Another point is that the highest value seems to be *always* the good one (assuming this time that the middle number is the number of increments). DISCLAIMER: This is me responding off the cuff without digging into it further. It depends on the source of the problem. If the problem, as theorized in the ticket, is caused by non-clean shutdown of nodes, the expected result *should* be that we effectively lose counter increments. Given a particular leader among the replicas, suppose you increment counter C by N1, followed by an un-clean shutdown with the value never having been written to the commit log. On the next increment of C by N2, a counter shard would be generated whose value is base+N2 instead of base+N1 (assuming the memtable wasn't flushed and no other writes to the same counter column happened). When this gets replicated to other nodes, they would see a value based on N1 and a value based on N2, both with the same clock. They would choose the higher one.
In either case, as far as I can tell (off the top of my head), *some* counter increment is lost. The only way I can see (again, off the top of my head) the resulting value being correct is if the later increment (N2 in this case) somehow includes N1 as well (e.g., because it was generated by first reading the current counter value). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
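The lost-increment scenario above can be sketched as a toy shard reconciliation (a simplified model of my understanding, not Cassandra's actual merge code; the names and tuple layout are illustrative):

```python
# Toy model of counter shard reconciliation -- a simplified sketch of
# the behavior described above, not Cassandra's actual merge code.
# A shard is (leader_uuid, clock, value).
def merge_shards(a, b):
    """Higher clock wins; on a clock tie, keep the higher value."""
    assert a[0] == b[0], "shards must come from the same leader UUID"
    if a[1] != b[1]:
        return a if a[1] > b[1] else b
    return a if a[2] >= b[2] else b

# Unclean-shutdown scenario: an increment by N1=5 never reaches the
# commit log, then an increment by N2=3 produces a shard with the same
# clock for the same leader.
base = 100
s_n1 = ("leader-uuid", 7, base + 5)  # value based on N1
s_n2 = ("leader-uuid", 7, base + 3)  # value based on N2, N1 lost
winner = merge_shards(s_n1, s_n2)
# The higher value (base + 5) survives; the N2 increment is lost.
```

Either way the merge resolves, one of the two increments disappears, which is why the higher value "winning" does not make the resulting count correct.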
Re: Invalid Counter Shard errors?
The significance I think is: if it is indeed the case that the higher value is always *in fact* correct, I think that's inconsistent with the hypothesis that unclean shutdown is the sole cause of these problems - as long as the client is truly submitting non-idempotent counter increments without a read-before-write. As a side note: if you're using these counters for stuff like determining amounts of money to be paid by somebody, consider the non-idempotence of counter increments. Any write that increments a counter and fails with e.g. a timeout *MAY OR MAY NOT* have been applied, and cannot be safely retried. Cassandra counters are generally not useful if *strict* correctness is desired, for this reason. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
Generally tuning the garbage collector is a waste of time. Sorry, that's BS. It can be absolutely critical when done right, and only useless when done wrong. There's a spectrum in between. Just follow someone else's recommendation and use that. No, don't. Most recommendations out there are completely useless in the general case because someone did some very specific benchmark under very specific circumstances and then recommends some particular combination of options. In order to understand whether a particular recommendation applies to you, you need to know enough about your use-case that I suspect you're better off just reading up on the available options and figuring things out. Of course, randomly trying various different settings to see which seems to work well may be realistic - but you lose predictability (in the face of changing patterns of traffic, for example) if you don't know why it's behaving like it is. If you care about GC-related behavior, you want to understand how the application behaves, how the garbage collector behaves, and what your requirements are, and select settings based on those requirements and on how the application and GC behavior combine to produce emergent behavior. The best GC options may vary *wildly* depending on the nature of your cluster and your goals. There are also non-GC settings (in the specific case of Cassandra) that affect the interaction with the garbage collector, like whether you're using row/key caching, or things like phi conviction threshold and/or timeouts. It's very hard for anyone to give generalized recommendations. If it weren't, Cassandra would ship with The One True set of settings that are always the best and there would be no discussion. It's very unfortunate that GC in the freely available JVM:s is in this state, given that there exist known and working algorithms (and at least one practical implementation) that mostly avoid it. But it's the situation we're in.
The only way around it that I know of, if you're on Hotspot, is to have the application behave in such a way that it avoids the causes of un-predictable behavior w.r.t. GC, by being careful about its memory allocation and *retention* profile. For the specific case of avoiding *ever* seeing a full gc, it gets even more complex. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Changing bloom filter false positive ratio
I have a hunch that the SSTable selection based on the min and max keys in ColumnFamilyStore.markReferenced() means that a higher false positive rate has less of an impact. It's just a hunch, I've not tested it. For leveled compaction, yes. For non-leveled, I can't see how it would, since each sstable will effectively cover almost the entire range (you're effectively spraying random tokens at it, unless clients are writing data in md5 order). (Maybe it's different for ordered partitioning, though.) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
Relatedly, I'd love to learn how to reliably reproduce full GC pauses on C* 1.1+. Our full gc:s are typically not very frequent - a few days or even weeks in between, depending on the cluster. But it happens on several clusters; I'm guessing most (though I haven't done a systematic analysis). The only question is how often. But given the lack of handling of such failure modes, the effect on clients is huge. I recommend data reads by default to mitigate this and a slew of other sources of problems (and for counter increments, we're rolling out least-active-request routing). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
I was able to run IBM Java 7 with Cassandra (could not do it with 1.6 because of snappy). It has a new garbage collection policy (called balanced) that is good for very large heap sizes (over 8 GB), documented here, that looks promising for Cassandra. I have not tried it but I'd like to see it in action. FWIW, J9's balanced collector is very similar to G1 in its design. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
Our full gc:s are typically not very frequent. Few days or even weeks in between, depending on cluster. *PER NODE* that is. On a cluster of hundreds of nodes, that's pretty often (and all it takes is a single node). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
I am currently profiling a Cassandra 1.1.1 set up using G1 and JVM 7. It is my feeble attempt to reduce Full GC pauses. Has anyone had any experience with this? Anyone tried it? Have tried; for some workloads it's looking promising. This is without key cache and row cache, and with a pretty large young gen. The main thing you'll want to look for is whether your post-mixed-mode-collection heap usage remains stable or keeps growing. The main issue with G1 that causes fallbacks to full GC is regions becoming effectively uncollectable due to high remembered set scanning costs (driven by inter-region pointers). If you can avoid that, one might hope to avoid full gc:s altogether. The jury is still out on my side; but like I said, I've seen promising indications. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra 1.1.1 on Java 7
Has anyone tried running 1.1.1 on Java 7? Have been running jdk 1.7 on several clusters on 1.1 for a while now. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Invalid Counter Shard errors?
This problem is not new to 1.1. On Sep 6, 2012 5:51 AM, Radim Kolar h...@filez.com wrote: i would migrate to 1.0 because 1.1 is highly unstable.
Re: force gc?
I think I described the problem wrong :) I don't want to do Java's memory GC. I want to do cassandra's GC - that is, I want to really remove deleted rows from a column family and get my disc space back. I think that was clear from your post. I don't see a problem with your process. Setting gc grace to 0 and forcing compaction should indeed return you to the smallest possible on-disk size. Did you really not see a *decrease*, or are you just comparing the final size with that of PostgreSQL? Keep in mind that in many cases (especially if not using compression) the Cassandra on-disk format is not as compact as PostgreSQL's. For example, column names are duplicated in each row, and the row key is stored twice (once in the index, once in the data file). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: force gc?
I think that was clear from your post. I don't see a problem with your process. Setting gc grace to 0 and forcing compaction should indeed return you to the smallest possible on-disk size. (But may be unsafe as documented; can cause deleted data to pop back up, etc.) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: force gc?
Maybe there is some tool to analyze it? It would be great if I could somehow export each row of a column family into a separate file - so I could see their count and sizes. Is there any such tool? Or maybe you have some better thoughts... Use something like pycassa to non-obnoxiously iterate over all rows: for row_id, row in your_column_family.get_range(): https://github.com/pycassa/pycassa -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
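As a rough sketch of what the counting/sizing could look like: the summarize() helper below is generic, and the keyspace/column family names in the commented pycassa lines are made up (I haven't run this against a live cluster).

```python
# Sketch of summarizing row count and approximate per-row sizes.  The
# summarize() helper works on any iterable of rows; the commented
# pycassa lines show how you'd feed it real rows (keyspace/CF names
# are hypothetical).
def summarize(rows):
    """rows: iterable of (row_key, {column_name: value}) pairs.
    Returns (row_count, {row_key: approximate_size_in_bytes})."""
    sizes = {}
    for key, columns in rows:
        sizes[key] = sum(len(str(name)) + len(str(value))
                         for name, value in columns.items())
    return len(sizes), sizes

# Against a live cluster:
# import pycassa
# pool = pycassa.ConnectionPool('your_keyspace')
# cf = pycassa.ColumnFamily(pool, 'your_column_family')
# count, sizes = summarize(cf.get_range())
```

This avoids writing one file per row; you get the counts and sizes in memory and can sort or histogram them from there.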
Re: Memory Usage of a connection
Could these 500 connections/second cause (on average) 2600Mb memory usage per 2 seconds ~ 1300Mb/second, or for 1 connection around 2-3Mb? In terms of garbage generated, it's much less about the number of connections than about what you're doing with them. Are you, for example, requesting large amounts of data? Large or many columns (or both), etc.? Essentially all working data that your request touches is allocated on the heap and contributes to allocation rate and ParNew frequency. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JMX(RMI) dynamic port allocation problem still exists?
I can recommend Jolokia highly for providing an HTTP/JSON interface to JMX (it can be trivially run in agent mode by just altering JVM args): http://www.jolokia.org/ -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Node forgets about most of its column families
I can confirm having seen this (no time to debug). One method of recovery is to jump the node back into the ring with auto_bootstrap set to false and an appropriate token set, after deleting system tables. That assumes you're willing to have the node take a few bad reads until you're able to run nodetool disablegossip and make other nodes not send requests to it. Disabling thrift would also be advised, or even firewalling it prior to restart. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Why so slow?
You're almost certainly using a client that doesn't set TCP_NODELAY on the thrift TCP socket. The Nagle algorithm is enabled, leading to 200 ms latency for each request, and thus 5 requests/second. http://en.wikipedia.org/wiki/Nagle's_algorithm -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
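For illustration, here's how a client disables Nagle on a plain Python socket - the kind of thing a thrift client library should be doing on the socket it opens to Cassandra (a generic sketch, not any particular client's code):

```python
import socket

# Generic sketch: disabling Nagle's algorithm on a client socket.
# Without TCP_NODELAY, small request writes can sit in the kernel
# buffer waiting for an ACK of the previous segment, which caps a
# synchronous request/response client at a handful of requests/second.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Verify the option took effect (non-zero means Nagle is disabled).
assert sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

If your client library exposes the underlying socket, checking this one option is usually the fastest way to confirm or rule out Nagle as the cause of a suspiciously round 200 ms per-request delay.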
Re: nodetool repair uses insane amount of disk space
How come a node would consume 5x its normal data size during the repair process? https://issues.apache.org/jira/browse/CASSANDRA-2699 It's likely a variation based on how out of sync you happen to be, and whether you have a neighbor that's also been repaired and bloated up already. My setup is kind of strange in that it's only about 80-100GB of data on a 35 node cluster, with 2 data centers and 3 racks, however the rack assignments are unbalanced. One data center has 8 nodes, and the other data center is split into 2 racks with one rack of 9 nodes, and the other with 18 nodes. However, within each rack, the tokens are distributed equally. It's a long sad story about how we ended up this way, but it basically boils down to having to utilize existing resources to resolve a production issue. https://issues.apache.org/jira/browse/CASSANDRA-3810 In terms of DCs: different DC:s are effectively independent of each other in terms of replica placement, so there is no need or desire for two DC:s to be symmetrical. The racks are important, though, if you are trying to take advantage of racks being somewhat independent failure domains (for reasons outlined in 3810 above). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: 0.8 -- 1.1 Upgrade: Any Issues?
We currently have a 0.8 production cluster that I would like to upgrade to 1.1. Are there any know compatibility or upgrade issues between 0.8 and 1.1? Can a rolling upgrade be done or is it all-or-nothing? If you have lots of keys: https://issues.apache.org/jira/browse/CASSANDRA-3820 -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Invalid Counter Shard errors?
We're running a three node cluster of cassandra 1.1 servers, originally 1.0.7, and immediately after the upgrade the error logs of all three servers began filling up with the following message: The message you are receiving is new, but the problem it identifies is not. The checking for this condition, and the logging, was added so that certain kinds of counter corruption would be self-healed eventually instead of remaining forever incorrect. Likely nothing is wrong that wasn't wrong before; you're just seeing it being logged now. And I can confirm having seen this on 1.1, so the root cause remains unknown as far as I can tell (I had previously hoped the root cause was thread-unsafe shard merging, or one of the other counter-related issues fixed during the 0.8 run). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: TTL 3 hours + GC grace 0
I am using TTL 3 hours and GC grace 0 for a CF. I have a normal CF that has records with TTL 3 hours and I don't send any delete request. I just wonder if using GC grace 0 will cause any problem except extra Memory/IO/network load. I know that gc grace is for not transferring deleted records after a down node comes back. So I assumed that transferring expired records will not cause any problem. Do you have any idea? Thank you! If you do not perform any deletes at all, a GC grace of 0 should be fine. But if you don't, GC grace should not really be relevant either, so I suggest leaving it high in case you do start doing deletes. Columns with TTL:s will disappear regardless of GC grace. If you do decide to run with a short GC grace, be aware of the consequences: http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: OOM opening bloom filter
How did this bloom filter get too big? Bloom filters grow with the number of row keys you have. It is natural that they grow bigger over time. The question is whether there is something wrong with this node (for example, lots of sstables and disk space used due to compaction not running, etc.) or whether your cluster is simply increasing its use of row keys over time. You'd want graphs to be able to see the trends. If you don't have them, I'd start by comparing this node with other nodes in the cluster and figure out whether there is a very significant difference or not. In any case, a bigger heap will allow you to start up again. But you should definitely make sure you know what's going on (natural growth of data vs. some problem) if you want to avoid problems in the future. If it is legitimate use of memory, you *may*, depending on your workload, want to adjust target bloom filter false positive rates: https://issues.apache.org/jira/browse/CASSANDRA-3497 -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
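For a back-of-the-envelope feel for the sizes involved, the standard Bloom filter formula gives the approximate optimal bit count for a given key count and target false positive rate (a generic sketch; the key count below is illustrative, not taken from any particular node):

```python
import math

# Back-of-the-envelope Bloom filter sizing using the standard formula
# m = -n * ln(p) / (ln 2)^2, where n is the number of keys and p the
# target false positive rate.  The numbers here are hypothetical.
def bloom_filter_bits(n_keys, false_positive_rate):
    return -n_keys * math.log(false_positive_rate) / (math.log(2) ** 2)

n = 100_000_000  # hypothetical row count on one node
one_percent = bloom_filter_bits(n, 0.01)   # ~9.6 bits/key
ten_percent = bloom_filter_bits(n, 0.10)   # ~4.8 bits/key
# Relaxing the target false positive rate from 1% to 10% roughly
# halves the filter's memory footprint.
```

This is why CASSANDRA-3497's tunable false positive rate matters: filter size is linear in the row key count, so a steadily growing key population translates directly into steadily growing heap usage at startup.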
Re: how to increase compaction rate?
multithreaded_compaction: false Set to true. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: LeveledCompaction and/or SnappyCompressor causing memory pressure during repair
However, when I run a repair, my CMS usage graph no longer shows sudden drops but rather gradual slopes, and only manages to clear around 300MB each GC. This seems to occur on 2 other nodes in my cluster around the same time; I assume this is because they're the replicas (we use 3 replicas). ParNew collections look about the same on my graphs with or without repair running, so no trouble there as far as I can tell. I don't know why leveled/snappy would affect it, but disregarding that, I would have suggested that you are seeing additional heap usage because long-running repairs retain sstables and delay their unload/removal (index sampling/bloom filters filling your heap). If it really only happens for leveled/snappy, however, I don't know what the cause might be. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Frequency of Flushing in 1.0
if a node goes down, it will take longer for commitlog replay. commit log replay time is insignificant. most time during node startup is wasted on index sampling. Index sampling here runs for about 15 minutes. Depends entirely on your situation. If you have few keys and lots of writes, index sampling will be insignificant. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Analysis of performance benchmarking - unexpected results
1. Changing consistency level configurations from Write.ALL + Read.ONE to Write.ALL + Read.ALL increases write latency (expected) and decreases read latency (unexpected). When you tested at CL.ONE, was read repair turned on? The two ways I can think of right now by which read latency might improve are:

* You're benchmarking at saturation (rather than at reasonable capacity), and the decreased overall throughput (thus load) causes better latencies.
* No (or low) read repair could lead to a different selection of read endpoints by the co-ordinator, such as a node never getting the chance to be snitched close due to lack of traffic.

In addition, I should mention that it *would* be highly expected to see better latencies at ALL if https://issues.apache.org/jira/browse/CASSANDRA-2540 were done (same with QUORUM). 2. Changing from a single-region Cassandra cluster to a multi-region Cassandra cluster on EC2 significantly increases write latency (approx 2x, again expected) but slightly decreases read latency (approx -10%, again unexpected). Are all benchmarking clients in a single region? I could imagine some kind of snitch effect here, possibly. But not if your benchmarking clients are connecting to random co-ordinators (assuming the inter-region latency is in fact worse ;)). Something is highly likely bogus, IMO. If you have significantly higher latencies across regions, there's just no way CL.ALL would get you lower latencies than CL.ONE. Even if the clients selected hosts randomly across the entire cluster, all requests that happen to go to a region-local co-ordinator should see better latencies. That's ASSUMING you've set up the multi-region cluster as multi-DC in Cassandra. If not, Cassandra would not have knowledge about topology and relative latency other than what is driven by traffic, and I could imagine this happening if read repair were turned completely off. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
I actually have the opposite 'problem'. I have a pair of servers that have been static since mid last week, but have seen performance vary significantly (x10) for exactly the same query. I hypothesised it was various caches, so I shut down Cassandra, flushed the O/S buffer cache and then brought it back up. The performance wasn't significantly different to the pre-flush performance. I don't get this thread at all :) Why would restarting with clean caches be expected to *improve* performance? And why is key cache loading involved, other than to delay start-up and hopefully pre-populate caches for better (not worse) performance? If you want to figure out why queries seem to be slow relative to normal, you'll need to monitor the behavior of the nodes. Look at disk I/O statistics primarily (everyone reading this who runs Cassandra and isn't intimately familiar with iostat -x -k 1 should go and read up on it right away; make sure you understand the utilization and average queue size columns), CPU usage, whether compaction is happening, etc. One easy way to see sudden bursts of poor behavior is to be heavily reliant on cache, and then have sudden decreases in performance due to compaction evicting data from page cache while also generating more I/O. But that's total speculation. It is also the case that you cannot expect consistent performance on EC2, and that might be it. But my #1 advice: log into the node while it is being slow, and observe. Figure out what the bottleneck is. iostat, top, nodetool tpstats, nodetool netstats, nodetool compactionstats. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
Yep - I've been looking at these - I don't see anything in iostat/dstat etc. that points strongly to a problem. There is quite a bit of I/O load, but it looks roughly uniform on slow and fast instances of the queries. The last compaction ran 4 days ago - which was before I started seeing variable performance [snip] I know why it is slow - it's clearly I/O bound. I am trying to hunt down why it is sometimes much faster even though I have (tried) to replicate the same conditions What does "clearly I/O bound" mean, and what is "quite a bit of I/O load"? In general, if you have queries that come in at some rate that is determined by outside sources (rather than by the time the last query took to execute), you will typically either get more queries than your cluster can take, or fewer. If fewer, there is a non-trivially sized grey area where the overall I/O throughput needed is lower than that available, but the closer you are to capacity the more often requests have to wait for other I/O to complete, for purely statistical reasons. If you're running close to maximum capacity, it would be expected that the variation in query latency is high. That said, if you're seeing consistently bad latencies for a while where you sometimes see consistently good latencies, that sounds different but would hopefully be observable somehow. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
I'm making an assumption . . . I don't yet know enough about cassandra to prove they are in the cache. I have my keycache set to 2 million, and am only querying ~900,000 keys. so after the first time I'm assuming they are in the cache. Note that the key cache only caches the index positions in the data file, and not the actual data. The key cache will only ever eliminate the I/O that would have been required to lookup the index entry; it doesn't help to eliminate seeking to get the data (but as usual, it may still be in the operating system page cache). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
For one thing, what does ReadStage's pending look like if you repeatedly run nodetool tpstats on these nodes? If you're simply bottlenecking on I/O on reads, that is the easiest and most direct way to observe it empirically. If you're saturated, you'll see active close to maximum at all times, and pending racking up consistently. If you're just close, you'll likely see spikes sometimes. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
What is your total data size (nodetool info/nodetool ring) per node, your heap size, and the amount of memory on the system? -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
the servers spending 50% of the time in io-wait Note that I/O wait is not necessarily a good indicator, depending on the situation. In particular if you have multiple drives, I/O wait can mostly be ignored. Similarly if you have non-trivial CPU usage in addition to disk I/O, it is also not a good indicator. I/O wait is essentially giving you the amount of time CPUs spend doing nothing because the only processes that would otherwise be runnable are waiting on disk I/O. But even a single process waiting on disk I/O can produce lots of I/O wait, even if you have 24 drives. The per-disk % utilization is generally a much better indicator (assuming no hardware RAID device, and assuming no SSD), along with the average queue size. In general, if you have queries that come in at some rate that is determined by outside sources (rather than by the time the last query took to execute), That's an interesting approach - is that likely to give close to optimal performance ? I just mean that it all depends on the situation. If you have, for example, some N number of clients that are doing work as fast as they can, bottlenecking only on Cassandra, you're essentially saturating the Cassandra cluster no matter what (until the client/network becomes a bottleneck). Under such conditions (saturation) you should generally never expect good latencies. For most non-batch-job production use-cases, you tend to have incoming requests driven by something external such as user behavior or automated systems not related to the Cassandra cluster. In these cases, you tend to have a certain amount of incoming requests at any given time that you must serve within a reasonable time frame, and that's where the question comes in of how much I/O you're doing in relation to maximum. For good latencies, you always want to be significantly below maximum - particularly when platter-based disk I/O is involved.
That may well explain it - I'll have to think about what that means for our use case as load will be extremely bursty To be clear though, even your typical un-bursty load is still bursty once you look at it at sufficient resolution, unless you have something specifically ensuring that it is entirely smooth. A completely random distribution over time, for example, would look very even on almost any graph you can imagine unless you have sub-second resolution, but would still exhibit un-evenness and have an effect on latency. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
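The "grey area" below saturation can be illustrated with a textbook M/M/1 queue. This is a deliberately crude stand-in for a single disk; the 100-IOPS service rate is an assumption for illustration, not a number from the thread:

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue: service time plus queueing delay."""
    assert arrival_rate < service_rate, "at/after saturation the queue grows without bound"
    return 1.0 / (service_rate - arrival_rate)

# A disk doing 100 random IOPS (10 ms average service time):
for utilization in (0.5, 0.8, 0.95):
    ms = mm1_response_time(utilization * 100, 100) * 1000
    print(f"{utilization:.0%} utilized -> {ms:.0f} ms mean response time")
```

Mean response time goes from 20 ms at 50% utilization to 200 ms at 95%, which is the statistical waiting-on-other-I/O effect described above: latency degrades long before throughput runs out.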
Re: keycache persisted to disk ?
Yep, the readstage is backlogging consistently - but the thing I am trying to explain is why it is good sometimes in an environment that is pretty well controlled - other than being on EC2 So pending is constantly &gt; 0? What are the clients? Is it batch jobs or something similar where there is a feedback mechanism implicit in that the higher latencies of the cluster are slowing down the clients, thus reaching an equilibrium? Or are you just teetering on the edge, dropping requests constantly? Under typical live-traffic conditions, you never want to be running with read stage pending backing up constantly. If on the other hand these are batch jobs where throughput is the concern, it's not relevant. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: keycache persisted to disk ?
2 Node cluster, 7.9GB of RAM (ec2 m1.large), RF=2, 11GB per node, Quorum reads, 122 million keys, heap size is 1867M (default from the AMI I am running), I'm reading about 900k keys Ok, so basically a very significant portion of the data fits in page cache, but not all. As I was just going through cfstats - I noticed something I don't understand: Key cache capacity: 906897 Key cache size: 906897 I set the key cache to 2 million; it's somehow got to a rather odd number You're on 1.0+? Nowadays there is code to actively make caches smaller if Cassandra detects that you seem to be running low on heap. Watch cassandra.log for messages to that effect (don't remember the exact message right now). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Multiple data center nodetool ring output display 0% owns
There was a thread on this a couple days ago -- short answer, the 'owns %' column is effectively incorrect when you're using multiple DCs. If you had all 3 servers in 1 DC, since server YYY has token 1 and server XXX has token 0, then server XXX would truly 'own' 0% (actually, 1/(2^128) :) ), and depending on your replication factor might have no data (if replication were 1). It's also incorrect for rack awareness if your topology is such that the rack awareness changes ownership (see https://issues.apache.org/jira/browse/CASSANDRA-3810). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra 1.0.6 multi data center question
It is expected that the schema is replicated everywhere, but *data* won't be in the DC with 0 replicas. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra 1.0.6 multi data center question
Thanks for the reply. But I can see the data setting inserted in DC1 in DC2. So that means data also getting replicated to DC2. data setting inserted? You should not be seeing data in the DC with 0 replicas. What does data setting inserted mean? Do you have sstables on disk with data for the keyspace? -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra 1.0.6 multi data center question
Again the *schema* gets propagated and the keyspace will exist everywhere. You should just have exactly zero amount of data for the keyspace in the DC w/o replicas. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: read-repair?
sorry to be dense, but which is it? do i get the old version or the new version? or is it indeterminate? Indeterminate, depending on which nodes happen to be participating in the read. Eventually you should get the new version, unless the node that took the new version permanently crashed with data loss prior to the data making it elsewhere. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Can you query Cassandra while it's doing major compaction
If every node in the cluster is running major compaction, would it be able to answer any read request? And is it wise to write anything to a cluster while it's doing major compaction? Compaction is something that is supposed to be continuously running in the background. As noted, it will have a performance impact in that it (1) generates I/O, (2) leads to cache eviction, and (if you're CPU bound rather than disk bound) (3) adds CPU load. But there is no intention that clients should have to care about compaction; it's to be viewed as a background operation that is continuously happening. A good rule of thumb is that an individual node should be able to handle your traffic while doing compaction; you don't want to be in the position where you're just barely dealing with the traffic, only for a node doing compaction to be unable to handle it. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: read-repair?
I have RF=3, my row/column lives on 3 nodes right? if (for some reason, eg a timed-out write at quorum) node 1 has a 'new' version of the row/column (eg clock = 10), but nodes 2 and 3 have 'old' versions (clock = 5), when i try to read my row/column at quorum, what do i get back? You either get back the new version or the old version, depending on whether node 1 participated in the read. In your scenario, the previous write at quorum failed (since it only made it to one node), so this is not a violation of the contract. Once node 2 and/or 3 return their response, read repair (if it is active) will cause a re-read and reconciliation, followed by a row mutation being sent to the nodes to correct the column. do i get the clock 5 version because that is what the quorum agrees on, and No; a quorum of nodes is waited for, and the newest column wins. This accomplishes the reads-see-writes invariant. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
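The newest-column-wins reconciliation from the scenario above can be sketched like this. The dict-based "replicas" are hypothetical; real Cassandra compares column timestamps the same way:

```python
def reconcile(responses):
    """Coordinator-side resolution: the column with the newest clock wins."""
    return max(responses, key=lambda r: r["clock"])

# Node 1 took the (failed-at-quorum) write; nodes 2 and 3 did not.
replicas = {
    1: {"clock": 10, "value": "new"},
    2: {"clock": 5, "value": "old"},
    3: {"clock": 5, "value": "old"},
}

# A quorum is any 2 of the 3 replicas, so the answer depends on participation:
print(reconcile([replicas[1], replicas[2]])["value"])  # new
print(reconcile([replicas[2], replicas[3]])["value"])  # old
```

Hence "indeterminate": whether you see clock 10 or clock 5 depends entirely on whether node 1 is among the replicas that answered, not on any majority vote over values.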
Re: how to delete data with level compaction
I'm using level compaction and I have about 200GB compressed in my largest CFs. The disks are getting full. This is time-series data so I want to drop data that is a couple of months old. It's pretty easy for me to iterate through the relevant keys and delete the rows. But will that do anything? I currently have the majority of sstables at generation 4. Deleting rows will initially just create a ton of tombstones. For them to actually free up significant space they need to get promoted to gen 4 and cause a compaction there, right? nodetool compact doesn't do anything with level compaction, it seems. Am I doomed? (Ok, I'll whip out my CC and order more disk ;-) There is a delay before you will see the effects of deletions in terms of disk space, yes. However, this is not normally a problem because you effectively reach a steady state of disk usage. It only becomes a problem if you're almost *entirely* full and are trying to delete data in a panic. How far away are you from entirely full? Are you just worried about the future or are you about to run out of disk space right now? -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: how to delete data with level compaction
I'm at 80%, so not quite panic yet ;-) I'm wondering, in the steady state, how much of the space used will contain deleted data. That depends entirely on your workload, including: * How big the data that you are deleting is in relation to the size of tombstones * How long the average piece of data lives before being deleted * How much other data never being deleted you have in relation to the data that gets deleted * How it varies over time and where you are in your cycle I can't offer any mathematics that will give you the actual answer for leveled compaction (and I don't have a good feel for it as I haven't used leveled compaction in production systems yet). I strongly recommend graphing disk space for your particular workload/application and seeing how it behaves over time. And *have alerts* on disk space running out. Have a good amount of margin. Less so with leveled compaction than size-tiered compaction, but still important. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: ideal cluster size
Thanks for the responses! We'll definitely go for powerful servers to reduce the total count. Beyond a dozen servers there really doesn't seem to be much point in trying to increase count anymore for Just be aware that if big servers imply *lots* of data (especially in relation to memory size), it's not necessarily the best trade-off. Consider the time it takes to do repairs, streaming, node start-up, etc. If it's only about CPU resources then bigger nodes probably make more sense if the h/w is cost effective. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Garbage collection freezes cassandra node
Thanks for this very helpful info. It is indeed a production site which I cannot easily upgrade. I will try the various gc knobs and post any positive results. *IF* your data size, or at least hot set, is small enough that you're not extremely reliant on the current size of page cache, and in terms of short-term relief, I recommend: * Significantly increasing the heap size. Like double it or more. * Decrease the occupancy trigger such that it kicks in around the time it already does (in terms of amount of heap usage used). * Increase the young generation size (to lessen promotion into old-gen). Experiment on a single node, making sure you're not causing too much disk I/O by stealing memory otherwise used by page cache. Once you have something that works you might try slowly going back down. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: delay in data deleting in cassadra
The problem occurs when this thread is invoked for the second time. In that step , it returns some of data that i already deleted in the third step of the previous cycle. In order to get a guarantee about a subsequent read seeing a write, you must read and write at QUORUM (or LOCAL_QUORUM if it's only within a DC). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
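The underlying rule is just quorum intersection: a read is guaranteed to see a prior write exactly when the read and write replica sets must overlap. A minimal check:

```python
def read_sees_write(read_replicas, write_replicas, replication_factor):
    """Read-your-writes is guaranteed iff the replica sets must
    intersect: R + W > N."""
    return read_replicas + write_replicas > replication_factor

RF = 3
QUORUM = RF // 2 + 1          # 2 for RF=3

print(read_sees_write(QUORUM, QUORUM, RF))  # QUORUM reads + QUORUM writes: guaranteed
print(read_sees_write(1, 1, RF))            # ONE + ONE: a read may miss the write
```

This is why deleting at ONE and then reading at ONE (as in the cycle described above) can legitimately return already-deleted data: the read may land entirely on replicas that have not yet seen the tombstone.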
Re: Garbage collection freezes cassandra node
On node 172.16.107.46, I see the following: 21:53:27.192+0100: 1335393.834: [GC 1335393.834: [ParNew (promotion failed): 319468K-&gt;324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 6000844K-&gt;3298251K(8005248K), 10.8526193 secs] 6310427K-&gt;3298251K(8350272K), [CMS Perm : 26355K-&gt;26346K(44268K)], 10.9832679 secs] [Times: user=11.15 sys=0.03, real=10.98 secs] 21:53:38,174 GC for ConcurrentMarkSweep: 10856 ms for 1 collections, 3389079904 used; max is 8550678528 I have not yet tested the -XX:+DisableExplicitGC switch. Is the right thing to do to decrease the CMSInitiatingOccupancyFraction setting? * Increasing the total heap size can definitely help; the only kink is that if you need to increase the heap size unacceptably much, it is not helpful. * Decreasing the occupancy trigger can help, yes, but you will get very much diminishing returns as your trigger fraction approaches the actual live size of data on the heap. * I just re-checked your original message - you're on Cassandra 0.7? I *strongly* suggest upgrading to 1.x. In general that holds true, but also specifically relating to this are significant improvements in memory allocation behavior that significantly reduce the probability and/or frequency of promotion failures and full GCs. * Increasing the size of the young generation can help by causing less promotion to old-gen (see the cassandra.in.sh script or equivalent for Windows). * Increasing the number of parallel threads used by CMS can help CMS complete its marking phase quicker, but at the cost of a greater impact on the mutator (Cassandra). I think the most important thing is - upgrade to 1.x before you run these benchmarks. Particularly detailed tuning of GC issues is pretty useless on 0.7 given the significant changes in 1.0. Don't even bother spending time on this until you're on 1.0, unless this is about a production cluster that you cannot upgrade for some reason. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
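A quick way to pull stop-the-world pauses like the one quoted out of a -XX:+PrintGCDetails log. The regex is a sketch matched to that specific line shape, not a general GC-log parser, and the embedded log line is a reconstruction of the one above:

```python
import re

log_line = ("1335393.834: [GC 1335393.834: [ParNew (promotion failed): "
            "319468K->324959K(345024K), 0.1304456 secs]"
            "1335393.964: [CMS: 6000844K->3298251K(8005248K), 10.8526193 secs] "
            "[Times: user=11.15 sys=0.03, real=10.98 secs]")

promotion_failed = "promotion failed" in log_line
# 'real' is wall-clock time: the actual length of the stop-the-world pause.
real_pause = float(re.search(r"real=([\d.]+) secs", log_line).group(1))

if promotion_failed and real_pause > 1.0:
    print(f"promotion failure -> {real_pause:.2f}s stop-the-world pause")
```

Grepping for "promotion failed" and "concurrent mode failure" and plotting the real= times is usually enough to tell CMS fallbacks apart from ordinary short ParNew pauses.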
Re: ideal cluster size
We're embarking on a project where we estimate we will need on the order of 100 cassandra nodes. The data set is perfectly partitionable, meaning we have no queries that need to have access to all the data at once. We expect to run with RF=2 or =3. Is there some notion of ideal cluster size? Or perhaps asked differently, would it be easier to run one large cluster or would it be easier to run a bunch of, say, 16 node clusters? Everything we've done to date has fit into 4-5 node clusters. Certain things certainly become harder with many nodes just due to the sheer number: increased need to automate administrative tasks, etc. But mostly, this would apply equally to e.g. 10 clusters of 10 nodes, as it does to one cluster of 100 nodes. I'd prefer running one cluster unless there is a specific reason to do otherwise, just because it means you have one thing to keep track of both mentally and in terms of e.g. monitoring/alerting, instead of having another level of grouping applied to your hosts. I can't think of significant benefits to small clusters that still hold true when you have many of them, as opposed to a correspondingly big single cluster. It is probably more useful to try to select hardware such that you have a greater number of smaller nodes, than it is to focus on node count (although once you start reaching the few-hundreds level you're entering territory with less actual real-life production testing). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cannot start cassandra node anymore
(too much to quote) Looks to me like this should be fixable by removing hints from the node in question (I don't remember whether this is a bug that's been identified and fixed or not). I may be wrong because I'm just basing this on a quick look at the stack trace but it seems to me there are hints on the node, for other nodes, that contain writes to a deleted column family. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Consistency Level
Thanks for the response Peter! I checked everything and it looks good to me. I am stuck with this for almost 2 days now. Has anyone had this issue? While it is certainly possible that you're running into a bug, it seems unlikely to me since it is the kind of bug that would affect almost anyone if it is failing with Unavailable due to unrelated (not in replica sets) nodes being down. Can you please post back with (1) the ring layout ('nodetool ring'), and (2) the exact row key that you're testing with? You might also want to run with DEBUG level (modify log4j-server.properties at the top); the strategy (assuming you are using NetworkTopologyStrategy) will log selected endpoints, so you can confirm that it's indeed picking the endpoints you think it should based on getendpoints. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Consistency Level
exception May not be enough replicas present to handle consistency level Check for mistakes in using getendpoints. Cassandra says Unavailable when there are not enough replicas *IN THE REPLICA SET FOR THE ROW KEY* to satisfy the consistency level. I tried to read data using cassandra-cli but I am getting null. This is just cassandra-cli quirkiness IIRC; I think you get null on exceptions. With consistency level ONE, I would assume that with just one node up and running (of course the one that has the data) I should get my data back. But this is not happening. 1 node among the ones in the replica set of your row has to be up. Will the read repair happen automatically even if I read and write using the consistency level ONE? Yes, assuming it's turned on. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: better anti OOM
I will investigate the situation more closely using gc via jconsole, but isn't the bloom filter for a new sstable entirely in memory? On disk there are only 2 files, Index and Data. -rw-r--r-- 1 root wheel 1388969984 Dec 27 09:25 sipdb-tmp-hc-4634-Index.db -rw-r--r-- 1 root wheel 10965221376 Dec 27 09:25 sipdb-tmp-hc-4634-Data.db The bloom filter can be that big. I have experience that if I trigger major compaction on a 180 GB CF (Compacted row mean size: 130) it will OOM the node after 10 seconds, so I am sure that compaction eats memory pretty well. Yes you're right, you'll definitely spike in memory usage by whatever amount corresponds to index sampling/BF for the thing being compacted. This can be mitigated by never running full compactions (i.e., not running 'nodetool compact'), but won't be gone altogether. Also, if your version doesn't yet have https://issues.apache.org/jira/browse/CASSANDRA-2466 applied, another side-effect is that the sudden large allocations for bloom filters can cause promotion failures even if there is free memory. yes, it prints messages like heap is almost full and after some time it usually OOMs during large compaction. Ok, in that case it seems even more clear that you simply need a larger heap. How large are the bloom filters in total? I.e., the sizes of the *-Filter.db files. In general, don't expect to be able to run at close to heap capacity; there *will* be spikes. In this particular case, leveled compaction in 1.0 should mitigate the effect quite significantly since it only compacts up to 10% of the data set at a time, so memory usage should be considerably more even (as will disk space usage be). That would allow you to run a bit closer to heap capacity than regular compaction. Also, consider tweaking compaction throughput settings to control the rate of allocation generated during a compaction, even if you don't need it for disk I/O purposes. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
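For a sense of scale, the textbook bloom filter sizing formula can be applied here. The 1.4 billion row count is a guess for illustration only (180 GB of small rows), and 0.000744 is the Cassandra default FP rate cited later in this archive:

```python
import math

def bloom_filter_bytes(num_keys, fp_rate):
    """Optimal bloom filter size: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -num_keys * math.log(fp_rate) / (math.log(2) ** 2)
    return bits / 8

# Hypothetical: 1.4 billion rows at a 0.000744 false-positive rate.
gib = bloom_filter_bytes(1_400_000_000, 0.000744) / 2**30
print(f"~{gib:.1f} GiB of bloom filter")
```

At roughly 15 bits per key for that FP rate, a CF with enough small rows can easily carry a multi-gigabyte bloom filter, which makes the sudden compaction-time allocation spike described above plausible.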
Re: index sampling
on a node with 300m rows (small node), there will be 585937 index sample entries with 512 sampling. Let's say 100 bytes per entry; this will be ~56 MB. Bloom filters are 884 MB. With the default sampling of 128, sampled entries will use a significant amount of node memory. Index sampling should be reworked like bloom filters to avoid allocating one large array per sstable. The hadoop mapfile uses a sampling of 128 by default too, and it reads the entire mapfile index into memory. The index summary does have an ArrayList which will be backed by an array which could become large; however, larger than that array (which is going to be 1 object reference per sample, or 1-2 taking into account internal growth of the array list) will be the overhead of the objects in the array (regular Java objects). This is also why it is non-trivial to report on the data size. it should be clearly documented in http://wiki.apache.org/cassandra/LargeDataSetConsiderations - that bloom filters + index sampling will be responsible for most memory used by the node. Caching itself has minimal use on a large data set used for OLAP. I added some information at the end. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
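The arithmetic, with the ~100-bytes-per-entry figure from the question made explicit (it is a rough guess at Java per-object overhead, not a measured constant):

```python
def index_sample_entries(total_rows, index_interval):
    """Roughly one in-memory index sample per index_interval rows."""
    return total_rows // index_interval

rows = 300_000_000
for interval in (128, 512):
    entries = index_sample_entries(rows, interval)
    mb = entries * 100 / 2**20   # assumed 100 bytes per entry
    print(f"interval {interval}: {entries:,} entries, ~{mb:.0f} MB")
```

So going from the default interval of 128 to 512 cuts the sample count (and thus this memory) by 4x, at the cost of scanning up to 4x more index entries per uncached lookup.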
Re: will compaction delete empty rows after all columns expired?
Compaction should delete empty rows once gc_grace_seconds is passed, right? Yes. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: will compaction delete empty rows after all columns expired?
Compaction should delete empty rows once gc_grace_seconds is passed, right? Yes. But just to be extra clear: Data will not actually be removed until the row in question participates in a compaction. Compactions will not be actively triggered by Cassandra for tombstone-processing reasons. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: improving cassandra-vs-mongodb-vs-couchdb-vs-redis
Also when comparing these technologies very subtle differences in design have profound effects on operation and performance. Thus someone trying to paper over 6 technologies and compare them with a few bullet points is really doing the world an injustice. +1. Same goes for 99% of all benchmarks ever published on the internet... -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Restart for change of endpoint_snitch ?
If I change endpoint_snitch from SimpleSnitch to PropertyFileSnitch, does it require restart of cassandra on that node ? Yes. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: reported bloom filter FP ratio
but reported ratio is Bloom Filter False Ratio: 0.00495 which is higher than my computed ratio 0.000145. If that were true, then the reported ratio should be lower than the one I computed from CF reads, because there are more reads to sstables than to the CF. The ratio is the ratio of false positives to true positives *per sstable*. It's not the number of false positives in each sstable *per CF read*. Thus, there is no expectation of higher vs. lower, and the magnitude of the discrepancy is easily explained by the fact that you only have 10 false positives. That's not a statistically significant sample set. from investigation of bloom filter FP ratio it seems that the default bloom filter FP ratio (soon user configurable) should be higher. HBase defaults to 1%; Cassandra defaults to 0.000744. Bloom filters are using quite a bit of memory now. I don't understand how you reached that conclusion. There is a direct trade-off between memory use and false positive hit rate, yes. That does not mean that HBase's 1% is magically the correct choice. I definitely think it should be tweakable (and IIRC there's work happening on a JIRA to make this an option now), but a 1% false positive hit rate will be completely unacceptable in some circumstances. In others, perfectly acceptable due to the decrease in memory use and few reads. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: reported bloom filter FP ratio
I don't understand how you reached that conclusion. On my nodes most memory is consumed by bloom filters. Also 1.0 creates The point is that just because that's the problem you have, doesn't mean the default is wrong, since it quite clearly depends on use-case. If your relative number of rows is low compared to the cost of sustaining a read-heavy workload, the trade-off is different. Cassandra does not measure memory used by index sampling yet; I suspect that it will be memory hungry too and can be safely lowered by default. I see very little difference by changing index sampling from 64 to 512. Bloom filters and index sampling are the two major contributors to memory use that scale with the number of rows (and thus typically with data size). This is known. Index sampling can indeed be significant. The default is 128 though, not 64. Here again it's a matter of trade-offs; 512 may have worked for you, but it doesn't mean it's an appropriate default (I am not arguing for 128 either, I am just saying that it's more complex than observing that in your particular case you didn't see a problem with 512). Part of the trade-off is additional CPU usage implied in streaming and deserializing a larger amount of data per average sstable index read; part of the trade-off is also effects on I/O; a sparser index sampling could result in a higher number of seeks per index lookup. The basic problem with Cassandra daily administration which I am currently solving is that memory consumption grows with your dataset size. I don't really like this design - you put more data in and the cluster can OOM. This makes Cassandra a suboptimal solution for use in data archiving. It will get better once tunable bloom filters are committed. That is a good reason for both to be configurable IMO. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: better anti OOM
If node is low on memory 0.95+ heap used it can do: 1. stop repair 2. stop largest compaction 3. reduce number of compaction slots 4. switch compaction to single threaded flushing largest memtable/ cache reduce is not enough Note that the emergency flushing is just a stop-gap. You should run with appropriately sized heaps under normal conditions; the emergency flushing stuff is intended to mitigate the effects of having a too small heap size; it is not expected to avoid completely the detrimental effects. Also note that things like compaction does not normally contribute significantly to the live size on your heap, but it typically does contribute to allocation rate which can cause promotion failures or concurrent mode failures if your heap size is too small and/or concurrent mark/sweep settings not aggressive enough. Aborting compaction wouldn't really help anything other than short-term avoiding a fallback to full GC. I suggest you describe exactly what the problem is you have and why you think stopping compaction/repair is the appropriate solution. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: better anti OOM
I suggest you describe exactly what the problem is you have and why you think stopping compaction/repair is the appropriate solution. compacting a 41.7 GB CF with about 200 million rows adds ~600 MB to heap, node logs messages like: I don't know what you are basing that on. It seems unlikely to me that the working set of a compaction is 600 MB. However, it may very well be that the allocation rate is such that it contributes to an additional 600 MB average heap usage after a CMS phase has completed. After node boot Heap Memory (MB) : 1157.98 / 1985.00 disabled gossip + thrift, only compaction running Heap Memory (MB) : 1981.00 / 1985.00 Using nodetool info to monitor heap usage is not really useful unless done continuously over time, observing the free heap after CMS phases have completed. Regardless, the heap is always expected to grow in usage to the occupancy trigger which kick-starts CMS. That said, 1981/1985 does indicate a non-desirable state for Cassandra, but it does not mean that compaction is using 600 MB as such (in terms of live set). You might say that it implies &gt;= 600 MB extra heap required at your current heap size and GC settings. If you want to understand what's happening I suggest attaching with visualvm/jconsole and looking at the GC behavior, and running with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps. When attached with visualvm/jconsole you can hit "perform gc" and see how far it drops, to judge what the actual live set is. Also, you say it's "pretty dead". What exactly does that mean? Does it OOM? I suspect you're just seeing fallbacks to full GC and long pauses because you're allocating and promoting to old-gen fast enough that CMS is just not keeping up, rather than it having to do with memory use per se.
In your case, I suspect you simply need to run with a bigger heap or reconfigure CMS to use additional threads for concurrent marking (-XX:ParallelCMSThreads=XXX - try XXX = number of CPU cores for example in this case). Alternatively, a larger young gen to avoid so much getting promoted during compaction. But really, in short: The easiest fix is probably to increase the heap size. I know this e-mail doesn't begin to explain details but it's such a long story. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: reported bloom filter FP ratio
Read Count: 68844 [snip] why is the reported bloom filter FP ratio not computed like this: 10/68844.0 = 0.00014525594096798558? Because the read count is the total number of reads to the CF, while the bloom filter is per sstable. The number of individual reads to sstables will be higher than the number of reads to the CF (unless you happen to have exactly one sstable or no rows ever span sstables). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
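The accounting difference described above can be sketched numerically. The per-read sstable count here is an assumption for illustration; it is not from the original question.

```python
# Why the naive CF-level calculation under-reports the denominator:
# each CF read may consult the bloom filter of several sstables, so
# the FP ratio is relative to sstable reads, not CF reads.
cf_reads = 68844                 # total reads to the CF (from cfstats)
false_positives = 10
sstables_per_read = 3            # assumed average; illustrative only

naive_ratio = false_positives / cf_reads            # what the question computed
sstable_reads = cf_reads * sstables_per_read
reported_ratio = false_positives / sstable_reads    # closer to what Cassandra reports

# With more sstable reads in the denominator, the reported ratio is lower.
assert reported_ratio < naive_ratio
```

With exactly one sstable and no rows spanning sstables, the two denominators coincide and the naive calculation would match.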
Re: Routine nodetool repair
One other thing to consider is: are you creating a few very large rows? You can check the min, max and average row size using nodetool cfstats. Normally I agree, but assuming the two-node cluster has RF 2 it would actually not matter ;) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra stress test and max vs. average read/write latency.
Thanks for your input. Can you tell me more about what we should be looking for in the gc log? We've already got the gc logging turned on, and we've already done the plotting to show that in most cases the outliers are happening periodically (with a period of 10s of seconds to a few minutes, depending on load and tuning) Are you measuring writes or reads? If writes, https://issues.apache.org/jira/browse/CASSANDRA-1991 is still relevant I think (sorry no progress from my end on that one). Also, I/O scheduling issues can easily cause problems with the commit log latency (on fsync()). Try switching to periodic commit log mode and see if it helps, just to eliminate that (if you're not already in periodic; if so, try upping the interval). For reads, I am generally unaware of much aside from GC and legitimate jitter (scheduling/disk I/O etc) that would generate outliers. At least that I can think of off hand... And w.r.t. the GC log - yeah, correlating in time is one thing. Another thing is to confirm what kind of GC pauses you're seeing. Generally you want to be seeing lots of ParNew:s of shorter duration, and those are tweakable by changing the young generation size. The other thing is to make sure CMS is not failing (promotion failure/concurrent mode failure) and falling back to a stop-the-world serial compacting GC of the entire heap. You might also use -XX:+PrintGCApplicationStoppedTime to get a more obvious and greppable report for each pause, regardless of type/cause. I've tried to correlate the times of the outliers with messages either in the system log or the gc log. There seems to be some (but not complete) correlation between the outliers and system log messages about memtable flushing. I cannot find anything in the gc log that seems to be an obvious problem, or that matches up with the times of the outliers. 
And these are still the very extreme (500+ ms and such) outliers that you're seeing w/o GC correlation? Off the top of my head, that seems very unexpected (assuming a non-saturated system) and would definitely invite investigation IMO. If you're willing to start iterating with the source code I'd start bisecting down the call stack and see where it's happening. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Routine nodetool repair
Could the lack of routine repair be why nodetool ring reports: node(1) Load - 78.24 MB and node(2) Load - 67.21 MB? The load span between the two nodes has been increasing ever so slowly... No. Generally there will be a variation in load depending on what state compaction happens to be in on the given node (I am assuming you're not using leveled compaction). That is in addition to any imbalance that might result from your population of data in the cluster. Running repair can affect the live size, but *lack* of repair won't cause a live size divergence. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Garbage collection freezes cassandra node
During the garbage collections, Cassandra freezes for about ten seconds. I observe the following log entries: “GC for ConcurrentMarkSweep: 11597 ms for 1 collections, 1887933144 used; max is 8550678528” Ok, first off: Are you certain that it is actually pausing, or are you assuming that due to the log entry above? Because the log entry in no way indicates a 10 second pause; it only indicates that CMS took 10 seconds - which is entirely expected, and most of CMS is concurrent and implies only short pauses. A full pause can happen, but that log entry is expected and is not in and of itself indicative of a stop-the-world 10 second pause. It is fully expected using the CMS collector that you'll have a sawtooth pattern as young gen is being collected, and then a sudden drop as CMS does its job concurrently without pausing the application for a long period of time. I will second the recommendation to run with -XX:+DisableExplicitGC (or -XX:+ExplicitGCInvokesConcurrent) to eliminate that as a source. I would also run with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps and report back the results (i.e., the GC log around the time of the pause). Your graph is looking very unusual for CMS. It's possible that everything is as it otherwise should be and CMS is kicking in too late, but I am kind of skeptical towards that given the extremely smooth look of your graph. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Garbage collection freezes cassandra node
I should add: If you are indeed actually pausing due to promotion failed or concurrent mode failure (which you will see in the GC log if you enable it with the options I suggested), the first thing I would try to mitigate is: * Decrease the occupancy trigger (search for occupancy) of CMS to a lower percentage, making the concurrent mark phase start earlier. * Increase heap size significantly (probably not necessary based on your graph, but for good measure). If it then goes away, report back and we can perhaps figure out details. There are other things that can be done. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
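The "search the GC log" advice in the two messages above can be sketched as a simple scan. The log lines below are made-up, abbreviated examples of the HotSpot CMS log format, not real output:

```python
# Scan a CMS GC log for the stop-the-world fallbacks discussed above:
# "promotion failed", "concurrent mode failure", and serial "Full GC".
gc_log = [
    "2011-12-06T10:00:01.123+0000: 12.345: [GC [ParNew: ...] 0.04 secs]",
    "2011-12-06T10:00:09.456+0000: 20.678: [GC [ParNew (promotion failed): ...]"
    " [CMS: ...concurrent mode failure...] 9.87 secs]",
    "2011-12-06T10:00:30.000+0000: 41.000: [Full GC [CMS: ...] 11.5 secs]",
]

MARKERS = ("promotion failed", "concurrent mode failure", "Full GC")
bad = [line for line in gc_log if any(m in line for m in MARKERS)]

# Two of the three entries indicate a stop-the-world event; frequent
# ParNew collections (the first line) are the expected, short pauses.
assert len(bad) == 2
```

Frequent ParNew lines are normal; it is the other two patterns that correspond to the long pauses being discussed.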
Re: Cassandra stress test and max vs. average read/write latency.
I'm trying to understand if this is expected or not, and if there is Without careful tuning, outliers around a couple of hundred ms are definitely expected in general (not *necessarily*, depending on workload) as a result of garbage collection pauses. The impact will be worsened a bit if you are running under high CPU load (or even maxing it out with stress) because post-pause, if you are close to max CPU usage you will take considerably longer to catch up. Personally, I would just log each response time and feed it to gnuplot or something. It should be pretty obvious whether or not the latencies are due to periodic pauses. If you are concerned with eliminating or reducing outliers, I would: (1) Make sure that when you're benchmarking, that you're putting Cassandra under a reasonable amount of load. Latency benchmarks are usually useless if you're benchmarking against a saturated system. At least, start by achieving your latency goals at 25% or less CPU usage, and then go from there if you want to up it. (2) One can affect GC pauses, but it's non-trivial to eliminate the problem completely. For example, the length of frequent young-gen pauses can typically be decreased by decreasing the size of the young generation, leading to more frequent shorter GC pauses. But that instead causes more promotion into the old generation, which will result in more frequent very long pauses (relative to normal; they would still be infrequent relative to young gen pauses) - IF your workload is such that you are suffering from fragmentation and eventually seeing Cassandra fall back to full compacting GC:s (stop-the-world) for the old generation. I would start by adjusting young gen so that your frequent pauses are at an acceptable level, and then see whether or not you can sustain that in terms of old-gen. 
Start with this in any case: Run Cassandra with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
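The point about pauses and outliers above is easy to demonstrate: a handful of GC-pause-sized responses barely move the average but completely dominate the max. The numbers here are made up for illustration:

```python
# Why max latency and average latency tell different stories:
# a few ~500 ms pauses are invisible in the mean but own the max.
import random

random.seed(1)
latencies_ms = [random.uniform(1, 5) for _ in range(10_000)]  # normal requests
latencies_ms += [500.0] * 5                                   # a few pause-sized outliers

avg = sum(latencies_ms) / len(latencies_ms)
worst = max(latencies_ms)

assert avg < 10        # the mean stays low despite the outliers
assert worst == 500.0  # the max is entirely outlier-driven
```

This is why logging each response time and plotting it (as suggested above) makes periodic pauses obvious where averages hide them.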
Re: CPU bound workload
I would first eliminate or confirm any GC hypothesis by running all nodes with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps. Is full GC not being logged through GCInspector with the defaults? The GC inspector tries its best, but it's polling. Unfortunately the JDK provides no way to properly monitor for GC events within the Java application. The GC inspector can miss a GC. Also, the GC inspector only tells you time + type of GC; a GC log will provide all sorts of details. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Crazy compactionstats
Exception in thread main java.io.IOException: Repair command #1: some repair session(s) failed (see log for details). For why repair failed, you unfortunately need to look at the logs, as it suggests. I still see pending tasks in nodetool compactionstats, and their number goes into hundreds which I haven't seen before. What's going on? The compactions pending count is not very useful. It just says how many tasks are pending that MIGHT compact. Typically it will either be 0, or it will be steadily increasing while compactions are happening until suddenly snapping back to 0 again once compactions catch up. Whether or not non-zero is a problem depends on the Cassandra version, how many concurrent compactors you are running, and your column families/data sizes/flushing speeds etc. (Sorry, kind of a long story) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: cassandra in production environment
We are currently testing cassandra in RHEL 6.1 64 bit environment running on ESXi 5.0 and are experiencing issues with data file corruptions. If you are using linux for production environment can you please share which OS/version you are using? It would probably be a good idea if you could be a bit more specific about the nature of the corruption and the observations made, and the version of Cassandra you are using. As for production envs; lots of people are bound to use various environments in production; I suppose the only interesting bit would be if someone uses RHEL 6.1 specifically? I mean I can say that I've run Cassandra on Debian Squeeze in production, but that doesn't really help you ;) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Consistence for node shutdown and startup
How about the conflict data when the two node on line separately. How it synchronized by two nodes when they both on line finally? Briefly, it's based on timestamp conflict resolution. This may be a good resource: http://www.datastax.com/docs/1.0/dml/about_writes#about-transactions -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
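The timestamp-based conflict resolution mentioned above can be sketched in a few lines. The function and column representation here are illustrative, not Cassandra's actual API:

```python
# Minimal sketch of "last write wins" column reconciliation: each
# column carries a client-supplied timestamp, and on read/repair the
# column with the highest timestamp wins on every replica.
def reconcile(col_a, col_b):
    """Each column is a (value, timestamp) tuple; newest timestamp wins."""
    return col_a if col_a[1] >= col_b[1] else col_b

# Two nodes accepted conflicting writes while separated:
node1_version = ("red", 100)   # written at t=100 on node 1
node2_version = ("blue", 205)  # written at t=205 on node 2

# Once both are back online, both converge on the newer write.
winner = reconcile(node1_version, node2_version)
assert winner == ("blue", 205)
```

Because the rule is deterministic, both nodes converge to the same value regardless of the order in which they exchange data.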
Re: read/write counts
1) Are KS level counts and CF level counts for whole cluster or just for an individual node? Individual node. Also note that the CF level counts will refer to local reads/writes submitted to the node, while the statistics you get from StorageProxy (in JMX) are for requests routed. In general, you will see a magnification by a factor of RF on the local statistics (in aggregate) relative to the StorageProxy stats. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
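The "magnification by a factor of RF" above is simple arithmetic, sketched here with made-up numbers:

```python
# Each coordinator-level (StorageProxy) write fans out to RF local
# writes, one per replica. So summing per-node CF-level write counts
# across the cluster over-counts client requests by a factor of RF.
rf = 3
coordinator_writes = 1000                       # StorageProxy-level requests
local_writes_aggregate = coordinator_writes * rf  # sum of per-node CF counts

assert local_writes_aggregate == 3000
# Dividing the aggregate local count by RF recovers the request count.
assert local_writes_aggregate // rf == coordinator_writes
```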
Re: Meaning of values in tpstats
With the slicing, I'm not sure off the top of my head. I'm sure someone else can chime in. For e.g. a multi-get, they end up as independent tasks. So if I multiget 10 keys, they are fetched in //, consolidated by the coordinator and then sent back ? Took me a while to figure out that // == parallel :) I'm pretty sure (but not entirely, I'd have to check the code) that the request is forwarded as one request to the necessary node(s); what I was saying rather was that the individual gets get queued up as individual tasks to be executed internally in the different stages. That does lead to parallelism locally on the node (subject to the concurrent readers setting). Agreed, I followed someone's suggestion some time ago to reduce my batch sizes and it has helped tremendously. I'm now doing multigetslices in batches of 512 instead of 5000 and I find I no longer have Pendings up so high. The most I see now is a couple hundred. In general, the best balance will depend on the situation. For example the benefit of batching increases as the latency to the cluster (and within it) increases, and the negative effects increase as you have higher demands of low latency on other traffic to the cluster. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: ParNew and caching
After re-reading my post, what I meant to say is that I switched from the Serializing cache provider to the ConcurrentLinkedHash cache provider and then saw better performance, but still far worse than no caching at all: - no caching at all : 25-30ms - with Serializing provider : 1300+ms - with Concurrent provider : 500ms 100% cache hit rate. ParNew is the only stat that I see out of line, so it seems like there is still a lot of copying In general, if you want to get to the bottom of this stuff and you think GC is involved, always run with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps so that the GC activity can be observed. 1300+ ms should not be GC unless you are having fallbacks to full GC:s (would be possible to see with gc logging) and it should definitely be possible to avoid full gc:s being extremely common (but eliminating them entirely may not be possible). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: CPU bound workload
I've got a batch process running every so often that issues a bunch of counter increments. I have noticed that when this process runs without being throttled it will raise the CPU to 80-90% utilization on the nodes handling the requests. This in turn causes timeouts and general lag on queries running on the cluster. This much is entirely expected. If you are not bottlenecking anywhere else and saturating the cluster, you will be bound by it, and it will affect the latency of other traffic, no matter how fast or slow Cassandra is. You do say nodes handling the requests. Two things to always keep in mind are to (1) spread the requests evenly across all members of the cluster, and (2) if you are doing a lot of work per row key, spread it around and be concurrent so that you're not hitting a single row at a time, which will be under the responsibility of a single set of RF nodes (you want to put load on the entire cluster evenly if you want to maximize throughput). Is there anything that can be done to increase the throughput, I've been looking on the wiki and the mailing list and didn't find any optimization suggestions (apart from spreading the load on more nodes). Cluster is 5 node, BOP, RF=3, AMD opteron 4174 CPU (6 x 2.3 Ghz cores), Gigabit ethernet, RAID-0 SATA2 disks For starters, what *is* the throughput? How many counter mutations are you submitting per second? -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: CPU bound workload
Counter increment is a special case in cassandra because they incur a local read before write. Normal column writes do not do this. So counter writes are intensive. If possible batch up the increments for fewer rpc calls and fewer reads. Note though that the CPU usage impact of this should be limited, in comparison to the impact when your reads end up going down to disk. I.e., the most important performance characteristic for people to consider is that counter writes may need to read from disk, contrary to all other writes in Cassandra which only imply sequential I/O done asynchronously. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra behavior too fragile?
Thing is, why is it so easy for the repair process to break? OK, I admit I'm not sure why nodes are reported as dead once in a while, but it's absolutely certain that they simply don't fall off the edge, are knocked out for 10 min or anything like that. Why is there no built-in tolerance/retry mechanism so that a node that may seem silent for a minute can be contacted later, or, better yet, a different node with a relevant replica is contacted? As was evident from some presentations at Cassandra-NYC yesterday, failed compactions and repairs are a major problem for a number of users. The cluster can quickly become unusable. I think it would be a good idea to build more robustness into these procedures, I am trying to argue for removing the failure-detector-kills-repair in https://issues.apache.org/jira/browse/CASSANDRA-3569, but I don't know whether that will happen since there is opposition. However, that only fixes the particular issue you are having right now. There are significant problems with repair, and the answer to why there is no retry is probably because it takes non-trivial amounts of work to make the current repair process be fault-tolerant in the face of TCP connections dying. Personally, my pet ticket to fix repair once and for all is https://issues.apache.org/jira/browse/CASSANDRA-2699 which should, at least as I envisioned it, fix a lot of problems, including making it much much much more robust to transient failures (it would just automatically be robust without specific code necessary to deal with it, because repair work would happen piecemeal and incrementally in a repeating fashion anyway). Nodes could basically be going up and down in any wild haywire mode and things would just automatically continue to work in the background. Repair would become irrelevant to cluster maintenance, and you wouldn't really have to think about whether or not someone is repairing. You would also not have to think about repair vs. 
gc grace time because it would all just sit there and work without intervention. It's a pretty big ticket though and not something I'm gonna be working on in my spare time, so I don't know whether or when I would actually work on that ticket (depends on priorities). I have the ideas but I can't promise to fix it :) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra not suitable?
I'm quite desperate about Cassandra's performance in our production cluster. We have 8 real-HW nodes, 32core CPU, 32GB memory, 4 disks in raid10, cassandra 0.8.8, RF=3 and Hadoop. We have four keyspaces; one is the large one, it has 2 CFs, one is kind of index, the other holds data. There are about 7 million rows, mean row size is 7kB. We run several mapreduce tasks, most of them just read from Cassandra and write to hdfs, but one fetches rows from Cassandra, computes something and writes it back; for each row we compute three new json values, about 1kB each (they get overwritten next round). We got lots and lots of Timeout exceptions, LiveSSTablesCount over 100. Repair doesn't finish even in 24 hours, reading from the other keyspaces timeouts as well. We set compaction_throughput_mb_per_sec: 0 but it didn't help. Exactly why you're seeing timeouts would depend on quite a few factors. In general however, my observation is that you have ~ 100 gig per node on nodes with 32 gigs of memory, *AND* you say you're running map reduce jobs. In general, I would expect that any performance problems you have are probably due to cache misses and simply bottlenecking on disk I/O. What to do about it depends very much on the situation and it's difficult to give a concrete suggestion without more context. Some things that might mitigate effects include using row cache for the hot data set (if you have a very small hot data set that should work well since the row cache is unaffected by e.g. mapreduce jobs), selecting a different compaction strategy (leveled can be better, depending), running map reduce on a separate DC that takes writes but is separated from the live cluster that takes reads (unless you're only doing batch requests). But those are just some random things thrown in the air; do not take that as concrete suggestions for your particular case. 
The key is understanding the access pattern, what the bottlenecks are, in combination with how the database works - and figure out what the most cost-effective solution is. Note that if you're bottlenecking on disk I/O, it's not surprising at all that repairing ~ 100 gigs of data takes more than 24 hours. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Repair failure under 0.8.6
I capped heap and the error is still there. So I keep seeing node dead messages even when I know the nodes were OK. Where and how do I tweak timeouts? You can increase phi_convict_threshold in the configuration. However, I would rather want to find out why they are being marked as down to begin with. In a healthy situation, especially if you are not putting extreme load on the cluster, there is very little reason for hosts to be marked as down unless there's some bug somewhere. Is this cluster under constant traffic? Are you seeing slow requests from the point of view of the client (indicating that some requests are routed to nodes that are temporarily inaccessible)? With respect to GC, I would recommend running with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps and then looking at the system log. A fallback to full GC should be findable by grepping for Full. Also, is this a problem with one specific host, or is it happening to all hosts every now and then? And I mean either the host being flagged as down, or the host that is flagging others as down. As for uncapped heap: Generally a larger heap is not going to make it more likely to fall back to full GC; usually the opposite is true. However, a larger heap can make some of the non-full GC pauses longer, depending. In either case, running with the above GC options will give you specific information on GC pauses and should allow you to rule that out (or not). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Repair failure under 0.8.6
I will try to increase phi_convict -- I will just need to restart the cluster after the edit, right? You will need to restart the nodes for which you want the phi convict threshold to be different. You might want to do it on e.g. half of the cluster for A/B testing. I do recall that I see nodes temporarily marked as down, only to pop up later. I recommend grepping through the logs on all the nodes (e.g., cat /var/log/cassandra/cassandra.log | grep UP | wc -l). That should tell you quickly whether they all seem to be seeing roughly as many node flaps, or whether some particular node or set of nodes is/are over-represented. Next, look at the actual nodes flapping (remove wc -l) and see if all nodes are flapping or if it is a single node, or a subset of the nodes (e.g., sharing a switch perhaps). In the current situation, there is no load on the cluster at all, outside maintenance like the repair. Ok. So what I'm getting at then is that there may be real legitimate connectivity problems that you aren't noticing in any other way since you don't have active traffic to the cluster. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
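The grep-and-count idea above can also be done in a few lines of Python. The log lines below are made-up approximations of the Gossiper's UP/dead messages:

```python
# Count UP/dead flap messages per node to see whether one node (or a
# subset, e.g. sharing a switch) is over-represented in the flapping.
from collections import Counter
import re

log = [
    "INFO Gossiper InetAddress /10.0.0.1 is now UP",
    "INFO Gossiper InetAddress /10.0.0.2 is now dead.",
    "INFO Gossiper InetAddress /10.0.0.2 is now UP",
    "INFO Gossiper InetAddress /10.0.0.2 is now dead.",
    "INFO Gossiper InetAddress /10.0.0.2 is now UP",
]

flaps = Counter(
    m.group(1)
    for line in log
    if (m := re.search(r"/(\S+) is now (UP|dead)", line))
)

# 10.0.0.2 flaps far more than its peers -> investigate that node/link.
assert flaps["10.0.0.2"] == 4
assert flaps["10.0.0.1"] == 1
```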
Re: Repair failure under 0.8.6
As a side effect of the failed repair (so it seems) the disk usage on the affected node prevents compaction from working. It still works on the remaining nodes (we have 3 total). Is there a way to scrub the extraneous data? This is one of the reasons why killing an in-process repair is a bad thing :( If you do not have enough disk space for any kind of compaction to work, then no, unfortunately there is no easy way to get rid of the data. You can go to extra trouble such as moving the entire node to some other machine (e.g. firewalled from the cluster) with more disk and run compaction there and then move it back, but that is kind of painful to do. Another option is to decommission the node and replace it. However, be aware that (1) that leaves the ring with less capacity for a while, and (2) when you decommission, the data you stream from that node to others would be artificially inflated due to the repair so there is some risk of infecting the other nodes with a large data set. I should mention that if you have no traffic running against the cluster, one way is to just remove all the data and then run repair afterwards. But that implies that you're trusting that (1) no reads are going to the cluster (else you might serve reads based on missing data) and (2) that you are comfortable with loss of the data on the node. (2) might be okay if you're e.g. writing at QUORUM at all times and have RF = 3 (basically, this is as if the node would have been lost due to hardware breakage). A faster way to reconstruct the node would be to delete the data from your keyspaces (except the system keyspace), start the node (now missing data), and run 'nodetool rebuild' after https://issues.apache.org/jira/browse/CASSANDRA-3483 is done. The patch attached to that ticket should work for 0.8.6 I suspect (but no guarantees). This also assumes you have no reads running against the cluster. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Repair failure under 0.8.6
I don't quite understand how Cassandra declared a node dead (in the below). Was it a timeout? How do I fix that? I was about to respond to say that repair doesn't fail just due to failure detection, but this appears to have been broken by CASSANDRA-2433 :( Unless there is a subtle bug, the exception you're seeing should be indicative that it really was considered Down by the node. You might grep the log for references to the node in question (UP or DOWN) to confirm. The question is why though. I would check if the node has maybe automatically restarted, or went into full GC, etc. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Repair failure under 0.8.6
Filed https://issues.apache.org/jira/browse/CASSANDRA-3569 to fix it so that streams don't die due to conviction. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Decommission a node causing high IO usage on 2 other nodes
What may be causing high IO wait on 2.new and 3.new? Do you have RF=3? The problem with 'nodetool ring' in terms of interpreting those '0%' is that it does not take RF and replication strategy into account. If you have RF=3, whatever data has its primary range assigned to node N will also be on N+1 and N+2. Additionally, whatever data has its primary range assigned to N-1 and N-2 will also have a copy on N. So despite all the old nodes showing up as 0% in nodetool ring, they are in fact responsible for data. And you would expect to see nodes N+2 and N-2 be the two old nodes affected by decommissioning node N. (Unless I'm tripping myself up somewhere now...) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
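The ring arithmetic above can be sketched with a toy SimpleStrategy-style placement (node names, ring order, and RF=3 are assumptions for illustration):

```python
# With RF=3, a key's replicas are its primary node plus the next
# RF-1 nodes walking clockwise around the ring.
def replicas(primary_index, ring, rf=3):
    return [ring[(primary_index + i) % len(ring)] for i in range(rf)]

ring = ["n1", "n2", "n3", "n4", "n5"]

# Data whose primary range belongs to n3 also lives on n4 and n5 (N+1, N+2):
assert replicas(2, ring) == ["n3", "n4", "n5"]

# Conversely, n3 holds copies of the primary ranges of n1 and n2
# (N-2, N-1) in addition to its own - so a "0%" owner is not empty:
holders = [ring[i] for i in range(len(ring)) if "n3" in replicas(i, ring)]
assert holders == ["n1", "n2", "n3"]
```

This is why decommissioning a node involves nodes up to two positions away on the ring even when nodetool ring shows them owning 0%.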
Re: Pending ReadStage is exploding on only one node
I'm measuring a high load value on a few nodes during the update process (which is normal), but one node keeps the high load after the process for a long time. I would say that either the reading that you do is overloading that one node and other traffic is getting piled up as a result, or you're stomping on page cache by reading a lot from that one node (e.g. using CL.ONE) and you're then seeing readstage backed up until the page cache or row cache is warm again. In general, unless you're running at close to full CPU capacity it sounds like you're completely disk bound, and that'll show up as a huge amount of pending ReadStage. iostat -x -k 1 should confirm it. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Local quorum reads
Keyspace: NetworkTopologyStrategy, 1DC, RF=2 1 node goes down and we cannot read from the ring. My expectation is that LOCAL_QUORUM dictates that it will return a record once a majority (N/2 +1) of replicas reports back. Yes, the RF and the number of hosts that are up within the replica set of the row in question is what matters. 2/2 + 1 = 2 I originally thought N=3 for 3 nodes but someone has corrected me on this that it's the replicas (2) but when I use the cli, setConsistencyLevel AS LOCAL_QUORUM nothing comes back. Set it back to ONE I can see the data Is there something I'm missing? I don't know, but I am guessing this is just a poor UI issue. The CLI is probably getting an Unavailable exception and just giving you nothing instead of reporting a problem, but that is speculation on my part. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Local quorum reads
My expectation is that LOCAL_QUORUM dictates that it will return a record once a majority (N/2 +1) of replicas reports back. Yes, the RF and the number of hosts that are up within the replica set of the row in question is what matters. And note that this is for fundamental reasons. A node that is not part of the replica set (and is not expected to have any data for the row) cannot usefully participate in the read. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
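The quorum arithmetic from the two messages above, as a quick sanity check:

```python
# Quorum is computed over the replication factor, not the cluster size:
# floor(RF/2) + 1 replicas must respond.
def quorum(rf):
    return rf // 2 + 1

assert quorum(2) == 2   # RF=2: BOTH replicas must answer
assert quorum(3) == 2   # RF=3: one replica can be down

# With RF=2 and one replica down, only 1 of the required 2 is alive,
# so a LOCAL_QUORUM read is unavailable (a proper error, not an
# empty result). CL.ONE still succeeds with the single live replica.
alive = 1
assert alive < quorum(2)
assert alive >= 1  # CL.ONE is satisfiable
```

This is why RF=2 gives no fault tolerance at quorum: the quorum of 2 equals the full replica set, while RF=3 tolerates one replica down.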
Re: Local quorum reads
No it's not just the cli tool, our app has the same issue coming back with read issues. You are supposed to not be able to read it. But you should be getting a proper error, not an empty result. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Random access
you can access cassandra data at any node and keys can be accessed at random. Including individual columns in a row. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Off-heap caching through ByteBuffer.allocateDirect when JNA not available ?
I would like to know it also - actually it should be similar, plus there are no dependencies on sun.misc packages. I don't remember the discussion, but I assume the reason is that allocateDirect() is not explicitly freeable; the memory is only released once the buffer object is garbage collected. This is enforced by the API in order to enable safe use of allocated memory without it being possible to use to break out of the JVM sandbox. JNA or misc.unsafe allows explicit freeing (at the cost of application bugs maybe segfaulting the JVM or causing other side-effects; i.e., breaking out of the managed runtime sandbox). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Mass deletion -- slowing down
Deletions in Cassandra imply the use of tombstones (see http://wiki.apache.org/cassandra/DistributedDeletes) and under some circumstances reads can turn O(n) with respect to the number of columns deleted, depending. It sounds like this is what you're seeing. For example, suppose you're inserting a range of columns into a row, deleting it, and inserting another non-overlapping subsequent range. Repeat that a bunch of times. In terms of what's stored in Cassandra for the row you now have: tomb tomb tomb tomb actual data If you then do something like a slice on that row with the end-points being such that they include all the tombstones, Cassandra essentially has to read through and process all those tombstones (for the PostgreSQL aware: this is similar to the effect you can get if implementing e.g. a FIFO queue, where MIN(pos) turns O(n) with respect to the number of deleted entries until the last vacuum; improved in modern versions). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
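The O(n) tombstone effect described above can be sketched directly. The row layout and column names here are illustrative, not Cassandra internals:

```python
# A row where repeated insert-then-delete cycles left a long run of
# tombstones before the live data: a slice from the start of the row
# must step over every tombstone before reaching a live column.
row = [("col%05d" % i, "TOMBSTONE") for i in range(10_000)]   # deleted columns
row += [("col%05d" % i, "live") for i in range(10_000, 10_010)]  # actual data

def slice_first_live(row):
    """Return the first live column name and how many cells were scanned."""
    scanned = 0
    for name, value in row:  # columns are stored in sorted order
        scanned += 1
        if value != "TOMBSTONE":
            return name, scanned
    return None, scanned

name, scanned = slice_first_live(row)
assert name == "col10000"
assert scanned == 10_001  # cost is O(n) in the tombstones passed over
```

Narrowing the slice's start point past the deleted range (or letting the tombstones expire past gc_grace and be compacted away) avoids the scan.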