Re: Counter value becomes incorrect after several dozen reads writes
On Mon, Jun 24, 2013 at 6:42 PM, Josh Dzielak j...@keen.io wrote: There is only 1 thread running this sequence, and consistency levels are set to ALL. The behavior is fairly repeatable - the unexpectation mutation will happen at least 10% of the time I run this program, but at different points. When it does not go awry, I can run this loop many thousands of times and keep the counter exact. But if it starts happening to a specific counter, the counter will never recover and will continue to maintain it's incorrect value even after successful subsequent writes. Sounds like a corrupt counter shard. Hard to understand how it can happen at ALL. If I were you I would file a JIRA including your repro path... =Rob
Re: Counter value becomes incorrect after several dozen reads writes
If you can reproduce the invalid behavior 10+% of the time with steps to repro that take 5-10s/iteration, that sounds extremely interesting for getting to the bottom of the invalid shard issue (if that's what the root cause ends up being). Would be very interested in the set up to see if the behavior can be duplicated. Andrew On Tue, Jun 25, 2013 at 2:18 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Jun 24, 2013 at 6:42 PM, Josh Dzielak j...@keen.io wrote: There is only 1 thread running this sequence, and consistency levels are set to ALL. The behavior is fairly repeatable - the unexpectation mutation will happen at least 10% of the time I run this program, but at different points. When it does not go awry, I can run this loop many thousands of times and keep the counter exact. But if it starts happening to a specific counter, the counter will never recover and will continue to maintain it's incorrect value even after successful subsequent writes. Sounds like a corrupt counter shard. Hard to understand how it can happen at ALL. If I were you I would file a JIRA including your repro path... =Rob
Counter value becomes incorrect after several dozen reads writes
I have a loop that reads a counter, increments it by some integer, then goes off and does about 500ms of other work. After about 10 iterations of this loop, the counter value *sometimes* appears to be corrupted. Looking at the logs, a sequence that just happened is: Read counter - 15000 Increase counter by - 353 Read counter - 15353 Increase counter by - 1067 Read counter - 286079 (the new counter value is *very* different than what the increase should have produced, but usually, suspiciously, around 280k) Increase counter by - 875 Read counter - 286079 (the counter stops changing at a certain point) There is only 1 thread running this sequence, and consistency levels are set to ALL. The behavior is fairly repeatable - the unexpectation mutation will happen at least 10% of the time I run this program, but at different points. When it does not go awry, I can run this loop many thousands of times and keep the counter exact. But if it starts happening to a specific counter, the counter will never recover and will continue to maintain it's incorrect value even after successful subsequent writes. I'm using the latest Astyanax driver on Cassandra 1.2.3 in a 3-node test cluster. It's also happened in development. Has anyone seem something like this? It feels almost too strange to be an actual bug but I'm stumped and have been looking at it too long :) Thanks, Josh -- Josh Dzielak VP Engineering • Keen IO Twitter • @dzello (https://twitter.com/dzello) Mobile • 773-540-5264
Re: Counter value becomes incorrect after several dozen reads writes
Hi Josh, are you looking at the read counter produced by cfstats? If so it is not for a CF, but the entire KS and not tied to a specific operation, but rather per the entire lifetime of JVM. Just in case, some supporting info: http://stackoverflow.com/questions/9431590/cassandra-cfstats-and-meaning-of-read-write-latency /Arthur From: Josh Dzielak Sent: Monday, June 24, 2013 9:42 PM To: user@cassandra.apache.org Subject: Counter value becomes incorrect after several dozen reads writes I have a loop that reads a counter, increments it by some integer, then goes off and does about 500ms of other work. After about 10 iterations of this loop, the counter value *sometimes* appears to be corrupted. Looking at the logs, a sequence that just happened is: Read counter - 15000 Increase counter by - 353 Read counter - 15353 Increase counter by - 1067 Read counter - 286079 (the new counter value is *very* different than what the increase should have produced, but usually, suspiciously, around 280k) Increase counter by - 875 Read counter - 286079 (the counter stops changing at a certain point) There is only 1 thread running this sequence, and consistency levels are set to ALL. The behavior is fairly repeatable - the unexpectation mutation will happen at least 10% of the time I run this program, but at different points. When it does not go awry, I can run this loop many thousands of times and keep the counter exact. But if it starts happening to a specific counter, the counter will never recover and will continue to maintain it's incorrect value even after successful subsequent writes. I'm using the latest Astyanax driver on Cassandra 1.2.3 in a 3-node test cluster. It's also happened in development. Has anyone seem something like this? It feels almost too strange to be an actual bug but I'm stumped and have been looking at it too long :) Thanks, Josh -- Josh Dzielak VP Engineering • Keen IO Twitter • @dzello Mobile • 773-540-5264
Re: Counter value becomes incorrect after several dozen reads writes
Hi Arthur, This is actually for a column in a counter column family, i.e. CounterColumnType. Will check out that thread though, thanks. Best, Josh -- Josh Dzielak VP Engineering • Keen IO Twitter • @dzello (https://twitter.com/dzello) Mobile • 773-540-5264 On Monday, June 24, 2013 at 8:01 PM, Arthur Zubarev wrote: Hi Josh, are you looking at the read counter produced by cfstats? If so it is not for a CF, but the entire KS and not tied to a specific operation, but rather per the entire lifetime of JVM. Just in case, some supporting info: http://stackoverflow.com/questions/9431590/cassandra-cfstats-and-meaning-of-read-write-latency /Arthur From: Josh Dzielak (mailto:j...@keen.io) Sent: Monday, June 24, 2013 9:42 PM To: user@cassandra.apache.org (mailto:user@cassandra.apache.org) Subject: Counter value becomes incorrect after several dozen reads writes I have a loop that reads a counter, increments it by some integer, then goes off and does about 500ms of other work. After about 10 iterations of this loop, the counter value *sometimes* appears to be corrupted. Looking at the logs, a sequence that just happened is: Read counter - 15000 Increase counter by - 353 Read counter - 15353 Increase counter by - 1067 Read counter - 286079 (the new counter value is *very* different than what the increase should have produced, but usually, suspiciously, around 280k) Increase counter by - 875 Read counter - 286079 (the counter stops changing at a certain point) There is only 1 thread running this sequence, and consistency levels are set to ALL. The behavior is fairly repeatable - the unexpectation mutation will happen at least 10% of the time I run this program, but at different points. When it does not go awry, I can run this loop many thousands of times and keep the counter exact. But if it starts happening to a specific counter, the counter will never recover and will continue to maintain it's incorrect value even after successful subsequent writes. I'm using the latest Astyanax driver on Cassandra 1.2.3 in a 3-node test cluster. It's also happened in development. Has anyone seem something like this? It feels almost too strange to be an actual bug but I'm stumped and have been looking at it too long :) Thanks, Josh -- Josh Dzielak VP Engineering • Keen IO Twitter • @dzello (https://twitter.com/dzello) Mobile • 773-540-5264