Re: Counter value becomes incorrect after several dozen reads writes

2013-06-25 Thread Robert Coli
On Mon, Jun 24, 2013 at 6:42 PM, Josh Dzielak j...@keen.io wrote:
 There is only 1 thread running this sequence, and consistency levels are set
 to ALL. The behavior is fairly repeatable - the unexpectation mutation will
 happen at least 10% of the time I run this program, but at different points.
 When it does not go awry, I can run this loop many thousands of times and
 keep the counter exact. But if it starts happening to a specific counter,
 the counter will never recover and will continue to maintain it's
 incorrect value even after successful subsequent writes.

Sounds like a corrupt counter shard. Hard to understand how it can
happen at ALL. If I were you I would file a JIRA including your repro
path...

=Rob


Re: Counter value becomes incorrect after several dozen reads writes

2013-06-25 Thread Andrew Bialecki
If you can reproduce the invalid behavior 10+% of the time with steps to
repro that take 5-10s/iteration, that sounds extremely interesting for
getting to the bottom of the invalid shard issue (if that's what the root
cause ends up being). Would be very interested in the set up to see if the
behavior can be duplicated.

Andrew


On Tue, Jun 25, 2013 at 2:18 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Jun 24, 2013 at 6:42 PM, Josh Dzielak j...@keen.io wrote:
  There is only 1 thread running this sequence, and consistency levels are
 set
  to ALL. The behavior is fairly repeatable - the unexpectation mutation
 will
  happen at least 10% of the time I run this program, but at different
 points.
  When it does not go awry, I can run this loop many thousands of times and
  keep the counter exact. But if it starts happening to a specific counter,
  the counter will never recover and will continue to maintain it's
  incorrect value even after successful subsequent writes.

 Sounds like a corrupt counter shard. Hard to understand how it can
 happen at ALL. If I were you I would file a JIRA including your repro
 path...

 =Rob



Counter value becomes incorrect after several dozen reads writes

2013-06-24 Thread Josh Dzielak
I have a loop that reads a counter, increments it by some integer, then goes 
off and does about 500ms of other work. After about 10 iterations of this loop, 
the counter value *sometimes* appears to be corrupted.

Looking at the logs, a sequence that just happened is:

Read counter - 15000
Increase counter by - 353
Read counter - 15353
Increase counter by - 1067
Read counter - 286079 (the new counter value is *very* different than what the 
increase should have produced, but usually, suspiciously, around 280k)
Increase counter by - 875
Read counter - 286079  (the counter stops changing at a certain point)


There is only 1 thread running this sequence, and consistency levels are set to 
ALL. The behavior is fairly repeatable - the unexpectation mutation will happen 
at least 10% of the time I run this program, but at different points. When it 
does not go awry, I can run this loop many thousands of times and keep the 
counter exact. But if it starts happening to a specific counter, the counter 
will never recover and will continue to maintain it's incorrect value even 
after successful subsequent writes.

I'm using the latest Astyanax driver on Cassandra 1.2.3 in a 3-node test 
cluster. It's also happened in development. Has anyone seem something like 
this? It feels almost too strange to be an actual bug but I'm stumped and have 
been looking at it too long :)

Thanks,
Josh

--
Josh Dzielak 
VP Engineering • Keen IO
Twitter • @dzello (https://twitter.com/dzello)
Mobile • 773-540-5264



Re: Counter value becomes incorrect after several dozen reads writes

2013-06-24 Thread Arthur Zubarev
Hi Josh,

are you looking at the read counter produced by cfstats?

If so it is not for a CF, but the entire KS and not tied to a specific 
operation, but rather per the entire lifetime of JVM.

Just in case, some supporting info: 
http://stackoverflow.com/questions/9431590/cassandra-cfstats-and-meaning-of-read-write-latency

/Arthur

From: Josh Dzielak 
Sent: Monday, June 24, 2013 9:42 PM
To: user@cassandra.apache.org 
Subject: Counter value becomes incorrect after several dozen reads  writes

I have a loop that reads a counter, increments it by some integer, then goes 
off and does about 500ms of other work. After about 10 iterations of this loop, 
the counter value *sometimes* appears to be corrupted.

Looking at the logs, a sequence that just happened is:

Read counter - 15000
Increase counter by - 353
Read counter - 15353
Increase counter by - 1067
Read counter - 286079 (the new counter value is *very* different than what the 
increase should have produced, but usually, suspiciously, around 280k)
Increase counter by - 875
Read counter - 286079  (the counter stops changing at a certain point)

There is only 1 thread running this sequence, and consistency levels are set to 
ALL. The behavior is fairly repeatable - the unexpectation mutation will happen 
at least 10% of the time I run this program, but at different points. When it 
does not go awry, I can run this loop many thousands of times and keep the 
counter exact. But if it starts happening to a specific counter, the counter 
will never recover and will continue to maintain it's incorrect value even 
after successful subsequent writes.

I'm using the latest Astyanax driver on Cassandra 1.2.3 in a 3-node test 
cluster. It's also happened in development. Has anyone seem something like 
this? It feels almost too strange to be an actual bug but I'm stumped and have 
been looking at it too long :)

Thanks,
Josh

--
Josh Dzielak
VP Engineering • Keen IO
Twitter • @dzello
Mobile • 773-540-5264


Re: Counter value becomes incorrect after several dozen reads writes

2013-06-24 Thread Josh Dzielak
Hi Arthur,  

This is actually for a column in a counter column family, i.e. 
CounterColumnType. Will check out that thread though, thanks.

Best,
Josh

--
Josh Dzielak 
VP Engineering • Keen IO
Twitter • @dzello (https://twitter.com/dzello)
Mobile • 773-540-5264


On Monday, June 24, 2013 at 8:01 PM, Arthur Zubarev wrote:

 Hi Josh,
   
 are you looking at the read counter produced by cfstats?
   
 If so it is not for a CF, but the entire KS and not tied to a specific 
 operation, but rather per the entire lifetime of JVM.
   
 Just in case, some supporting info: 
 http://stackoverflow.com/questions/9431590/cassandra-cfstats-and-meaning-of-read-write-latency
   
 /Arthur
   
 From: Josh Dzielak (mailto:j...@keen.io)  
 Sent: Monday, June 24, 2013 9:42 PM
 To: user@cassandra.apache.org (mailto:user@cassandra.apache.org)  
 Subject: Counter value becomes incorrect after several dozen reads  writes
  
  
   
  
 I have a loop that reads a counter, increments it by some integer, then goes 
 off and does about 500ms of other work. After about 10 iterations of this 
 loop, the counter value *sometimes* appears to be corrupted.
   
 Looking at the logs, a sequence that just happened is:
   
 Read counter - 15000
 Increase counter by - 353
 Read counter - 15353
 Increase counter by - 1067
 Read counter - 286079 (the new counter value is *very* different than what 
 the increase should have produced, but usually, suspiciously, around 280k)
 Increase counter by - 875
 Read counter - 286079  (the counter stops changing at a certain point)
  
   
 There is only 1 thread running this sequence, and consistency levels are set 
 to ALL. The behavior is fairly repeatable - the unexpectation mutation will 
 happen at least 10% of the time I run this program, but at different points. 
 When it does not go awry, I can run this loop many thousands of times and 
 keep the counter exact. But if it starts happening to a specific counter, the 
 counter will never recover and will continue to maintain it's incorrect 
 value even after successful subsequent writes.
   
 I'm using the latest Astyanax driver on Cassandra 1.2.3 in a 3-node test 
 cluster. It's also happened in development. Has anyone seem something like 
 this? It feels almost too strange to be an actual bug but I'm stumped and 
 have been looking at it too long :)
   
 Thanks,
 Josh
   
 --
 Josh Dzielak 
 VP Engineering • Keen IO
 Twitter • @dzello (https://twitter.com/dzello)
 Mobile • 773-540-5264