[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2013-05-29 Thread Phil Pirozhkov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13669539#comment-13669539
 ] 

Phil Pirozhkov commented on CASSANDRA-4417:
---

Cassandra 1.2.5, single node dev local installation.
Schema:
{code}
CREATE TABLE reporting (
zoom int,
time timestamp,
total counter,
PRIMARY KEY (zoom, time)
)
WITH CLUSTERING ORDER BY (time ASC);

update reporting set total = total + 1 where zoom = 0 and time = 1234142142141;
update reporting set total = total + 1 where zoom = 1 and time = 1234142142141;
update reporting set total = total + 1 where zoom = 2 and time = 1234142142141;
{code}

The query {code}select * from reporting where zoom=0;{code} produces inconsistent 
results: either an rpc timeout, or a row where 'total' is null.
Nodetool repair does nothing and hangs from time to time.
The problem reproduces about 50% of the time.
I tried changing to batch commitlog mode: same result (but about 10 times 
slower).

 invalid counter shard detected 
 ---

 Key: CASSANDRA-4417
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4417
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.1.1
 Environment: Amazon Linux
Reporter: Senthilvel Rangaswamy
 Attachments: cassandra-mck.log.bz2, err.txt


 Seeing errors like these:
 2012-07-06_07:00:27.22662 ERROR 07:00:27,226 invalid counter shard detected; 
 (17bfd850-ac52-11e1--6ecd0b5b61e7, 1, 13) and 
 (17bfd850-ac52-11e1--6ecd0b5b61e7, 1, 1) differ only in count; will pick 
 highest to self-heal; this indicates a bug or corruption generated a bad 
 counter shard
 What does it mean ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2013-01-14 Thread Janne Jalkanen (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552492#comment-13552492
 ] 

Janne Jalkanen commented on CASSANDRA-4417:
---

Turns out that no amount of repair (I ran both repair -pr and full repair) 
allows the counter values to converge.  One node had consistently wrong counts 
that would not be repaired no matter what. In the end I took out the node, 
removed all data and brought it back into the cluster and let it reinitialize 
itself.  Now the values are converged.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2013-01-09 Thread Janne Jalkanen (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13549412#comment-13549412
 ] 

Janne Jalkanen commented on CASSANDRA-4417:
---

I'm seeing this while running repair -pr. Three-node cluster, RF 3. Straight 
upgrade from 1.0.12 to 1.1.8; no topology changes.  I see two invalid shard 
IDs; counts differ by more than one, sometimes even by 3000 or more.  Seems 
random to my eyes.

Our counters are in a composite column family, no TTLs in use.

I did disablegossip, disablethrift, drain, upgrade, restart on every node in a 
rolling fashion.  Then I did upgradesstables and repair -pr on every node when 
the entire cluster had been upgraded.
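
The upgrade sequence described above can be written down as an ordered command plan. The following is only a sketch: the function name `rolling_upgrade_plan` and the host names are hypothetical, the upgrade and restart step is left as a placeholder because the comment does not say how it was done, and the order of upgradesstables and repair -pr across nodes in the second phase is an assumption.

```python
from typing import List

# nodetool subcommands named in the comment, in the order given there.
PRE_UPGRADE = ["disablegossip", "disablethrift", "drain"]
POST_UPGRADE = ["upgradesstables", "repair -pr"]


def rolling_upgrade_plan(hosts: List[str]) -> List[str]:
    """Build the ordered list of steps; nothing is executed here."""
    steps: List[str] = []
    # Phase 1: quiesce, upgrade and restart one node at a time.
    for host in hosts:
        for command in PRE_UPGRADE:
            steps.append(f"nodetool -h {host} {command}")
        # Placeholder: how the upgrade and restart are done is site-specific.
        steps.append(f"# {host}: upgrade Cassandra and restart")
    # Phase 2: only once every node runs the new version.
    for host in hosts:
        for command in POST_UPGRADE:
            steps.append(f"nodetool -h {host} {command}")
    return steps


if __name__ == "__main__":
    # Hypothetical host names for a three-node cluster.
    for step in rolling_upgrade_plan(["node1", "node2", "node3"]):
        print(step)
```

Keeping the plan as data makes it easy to review the ordering before anything is run against a cluster.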



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-11-20 Thread Ed Solovey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13501249#comment-13501249
 ] 

Ed Solovey commented on CASSANDRA-4417:
---

We are on 1.1.6 and are seeing this on a three node cluster with replication 
factor of 2.  Is there a workaround for this?  Corrupted counters are a 
showstopper for us and we'll have to move off Cassandra if we can't resolve 
this.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-11-18 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13500026#comment-13500026
 ] 

Michael Kjellman commented on CASSANDRA-4417:
-

Unfortunately, we are hitting this as well, and we also increment by different 
values. RF=3 on 1.1.6. It happened after I did a nodetool drain and restarted a 
node; when it came back up, I started seeing it being logged.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-30 Thread Ivan Sobolev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13486872#comment-13486872
 ] 

Ivan Sobolev commented on CASSANDRA-4417:
-

{quote}
[~slebresne]
Quick question: do you always increment by the same value by any chance? {quote}

The attached log has more than just +1 increments (though I don't think any log 
would help you here :) )

We run 1.1.5, no upgradesstables, most probably unclean shutdown too.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-29 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13486232#comment-13486232
 ] 

Jonathan Ellis commented on CASSANDRA-4417:
---

On a bootstrap sounds more like CASSANDRA-4071.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-26 Thread Eric Lubow (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13485333#comment-13485333
 ] 

Eric Lubow commented on CASSANDRA-4417:
---

We are getting this on DSE 2.2 (C* 1.1.5) on a new node during bootstrap.  We 
upgraded the cluster from C* 1.0.10 about 10 days ago; upgradesstables was 
run on every node and we repaired the entire cluster.  We've been 
getting this error sporadically on various nodes at various points, but it's not 
consistent.  I've double- and triple-checked every node looking for sstable 
files named *-hd-* and I don't see any (assuming that's enough to tell that 
the sstable has been upgraded).  If this error is an effect of requiring one to 
run upgradesstables, then how would it happen during a bootstrap? All nodes 
involved in this cluster are 1.1.5.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-23 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13482370#comment-13482370
 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
-

bq. No, sorry just happened to pick that example.

That's ok, thanks nonetheless.

I'm really starting to think that CASSANDRA-4071 is likely the main cause of 
this, and it is very easy to reproduce in that case. The commit log issue we've 
discussed earlier can also trigger that error, but it's probably much harder to 
trigger. The good news is that, as explained on the ticket, CASSANDRA-4071 won't 
corrupt counters unless you are at RF=1 (in which case that's not good news). 
The bad news is that I'm not really sure how to fix it. 



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-23 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13482552#comment-13482552
 ] 

Chris Herron commented on CASSANDRA-4417:
-

bq. I'm really starting to think that CASSANDRA-4071 is likely the main cause 
for this and is very easy to reproduce in that case. The commit log we've 
discussed earlier can also trigger that error, but it's probably much harder to 
trigger.

In our case:
* We haven't made any topology changes
* Our test drops and recreates the affected CFs. No nodes die during the test 
(w.r.t. unclean shutdown and commit log)
* After previous load test runs under different configuration (see below), no 
nodes die, and we use nodetool drain before restarting with updated configs.

Note that in my earlier comment above I said:

bq. In investigating CASSANDRA-4687 we disabled key cache, repeated the 
load+upgradesstables test and these invalid counter shard warnings did not 
appear.

Given that we don't have a topology change, can you think of a scenario where a 
commitlog issue is still contributing?





[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-22 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13481526#comment-13481526
 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
-

Quick question: do you always increment by the same value by any chance? I'm 
asking because the last log you've pasted indicates that the conflicting 
information found corresponds to 1 increment only: in the first case the value 
is 1, in the second the value is 2. If you always increment by, say, 1, that 
would tell us which one is wrong (I'm not yet sure which conclusion I would draw 
from that, but more info can't hurt :)).



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-22 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13481539#comment-13481539
 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
-

bq. Could this be related to: CASSANDRA-4071?

I missed that earlier on, but yes, Bartłomiej is correct, CASSANDRA-4071 will 
totally trigger 'invalid counter shard detected' messages. As described in the 
ticket, if you don't use RF=1, this shouldn't actually create data loss, but it 
would still trigger the log until things get compacted away.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-22 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13481729#comment-13481729
 ] 

Chris Herron commented on CASSANDRA-4417:
-

bq. Quick question: do you always increment by the same value by any chance?

No, sorry just happened to pick that example. We have many other log entries 
where both values are higher and don't differ by 1.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-19 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13480069#comment-13480069
 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
-

Ok. The fact that you only reproduce when using upgradesstables is definitely 
interesting. I'll check if I can see something causing that in upgradesstables. 
I'll keep you posted.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-19 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13480121#comment-13480121
 ] 

Chris Herron commented on CASSANDRA-4417:
-

Another observation since: in previous runs with key cache disabled we were not 
seeing any errors. However I've since found some invalid counter shard errors 
that are occurring during normal compaction. 

{code}
ERROR [CompactionExecutor:6] 2012-10-19 15:43:50,920 
org.apache.cassandra.db.context.CounterContext invalid counter shard detected; 
(15b843e0-ff7c-11e0--07f4b18563ff, 1, 1) and 
(15b843e0-ff7c-11e0--07f4b18563ff, 1, 2) differ only
 in count; will pick highest to self-heal; this indicates a bug or corruption 
generated a bad counter shard
{code}

So to be clear, this particular scenario is:
* C* 1.1.6 with key cache disabled. 
* Load test ran earlier against this same setup; but no upgradesstables during 
that run; no errors under load during that test run.
* Later, some nightly jobs ran that read from Super CF counters, write to other 
CFs.
* Compaction activity occurs later after load test and nightly jobs complete. 
Invalid counter shard errors are seen for some CFs. Gleaning from the log 
output order, the affected CF's:
** *Did* have upgradesstables run upon them in previous configurations (1.1.6, 
key cache on)
** Have not been written to at all for the purpose of the load test I've been 
mentioning.
** Have been read from for these nightly jobs.
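
Since several comments in this thread turn on how far apart the two counts are, the logged message can be parsed mechanically. The following is a minimal sketch, not part of Cassandra: `parse_invalid_shard` and `count_gap` are hypothetical helper names, and the (id, clock, count) field order is an assumption inferred from the wording "differ only in count".

```python
import re
from typing import Optional, Tuple

Shard = Tuple[str, int, int]  # assumed field order: (id, clock, count)

_SHARD = r"\(([^,()]+),\s*(-?\d+),\s*(-?\d+)\)"
_PATTERN = re.compile(
    r"invalid counter shard detected;\s*" + _SHARD + r"\s+and\s+" + _SHARD
)


def parse_invalid_shard(message: str) -> Optional[Tuple[Shard, Shard]]:
    """Extract both shards from one log message, or None if it does not match."""
    # Mail archives wrap long log lines, so collapse all whitespace first.
    normalized = " ".join(message.split())
    match = _PATTERN.search(normalized)
    if match is None:
        return None
    left = (match.group(1), int(match.group(2)), int(match.group(3)))
    right = (match.group(4), int(match.group(5)), int(match.group(6)))
    return left, right


def count_gap(message: str) -> Optional[int]:
    """Absolute difference between the two logged counts."""
    parsed = parse_invalid_shard(message)
    if parsed is None:
        return None
    left, right = parsed
    return abs(left[2] - right[2])
```

Applied to the log entry quoted in this comment, `count_gap` returns 1, since the two shards carry counts 1 and 2.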











[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-18 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13479561#comment-13479561
 ] 

Chris Herron commented on CASSANDRA-4417:
-

bq. Also, during that test, is there anything involving streaming going on (a 
repair, a node bootstrapping/moving/decommissioning)?

There are definitely no repairs or node bootstrapping/moving/decommissioning 
happening during the test.
Re-ran the test and the JMX stats for StreamStage indicated zero tasks on all 
nodes after the test completed.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-17 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477670#comment-13477670
 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
-

bq. After reading Sylvain's comments above, tried running the same test with 
commitlog_sync: batch - we get a similar volume of the same errors

Just to clarify, using batch commit log should only prevent the initial problem 
from occurring (assuming the analysis of the problem is correct, of course). 
However, contrary to what the error message suggests, the existing invalid 
counter shards don't heal themselves as soon as the message is logged. In 
fact, the message is logged each time we merge counter columns that have 
conflicting shards, and when that merge is triggered by a compaction, it will 
indeed heal the shard. But we also merge each time we read, for instance. In 
other words, even if batch commit log fixes the problem, one will need to 
compact everything/wait for everything to be compacted to have all logged 
messages disappear. Unless you've been able to reproduce on a brand new cluster 
where the commit log was set to batch from the beginning (in which case, if you 
have an easy way to reproduce, that would be interesting to know).
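
The behaviour described in this comment can be illustrated with a small model. This is a sketch under stated assumptions, not Cassandra's implementation: `Shard`, `merge_shards`, `read` and `compact` are hypothetical names, the (id, clock, count) field order is inferred from the log message, and treating a compaction as rewriting a single merged shard is a simplification.

```python
from typing import List, NamedTuple, Tuple


class Shard(NamedTuple):
    counter_id: str  # assumed: first field of the logged tuple
    clock: int       # assumed: second field
    count: int       # third field ("differ only in count")


def merge_shards(left: Shard, right: Shard) -> Tuple[Shard, bool]:
    """Merge two shards of one counter id; return (winner, conflict)."""
    if left.counter_id != right.counter_id:
        raise ValueError("shards belong to different counter ids")
    if left.clock != right.clock:
        # Normal case: the shard with the higher clock supersedes the other.
        return (left if left.clock > right.clock else right), False
    if left.count == right.count:
        return left, False
    # Same id and clock but different counts: the "invalid counter shard"
    # case. As the log message says, keep the highest count.
    return (left if left.count > right.count else right), True


def merge_all(shards: List[Shard]) -> Tuple[Shard, int]:
    """Fold a non-empty list of shards; count how many merges conflicted."""
    winner, conflicts = shards[0], 0
    for shard in shards[1:]:
        winner, conflict = merge_shards(winner, shard)
        conflicts += int(conflict)
    return winner, conflicts


def read(stored: List[Shard]) -> Tuple[Shard, int]:
    """A read merges in memory only; the stored shards are left untouched."""
    return merge_all(stored)


def compact(stored: List[Shard]) -> List[Shard]:
    """A compaction merges and rewrites, so only the winner remains."""
    winner, _ = merge_all(stored)
    return [winner]
```

In this model, reading `[Shard("x", 1, 13), Shard("x", 1, 1)]` reports one conflict on every read, and only after `compact` does a read report none, which matches the point that the messages disappear once everything has been compacted.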




[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-17 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477943#comment-13477943
 ] 

Chris Herron commented on CASSANDRA-4417:
-

bq. Unless you've been able to reproduce on a brand new cluster where the 
commit log was set to batch from the beginning (in which case, if you have an 
easy way to reproduce, that would be interesting to know)

In our test the affected Super CF is completely deleted and recreated - so in 
that sense the commit log was set to batch from the beginning. Is that 
equivalent?

This does reproduce for every test run. Unfortunately our test is non-trivial 
to share. It involves heavy writes and moderate reads to counters, while 
simultaneously running upgradesstables on all nodes upon multiple CF's 
(including the affected one). Interestingly, the symptom does appear even 
before compaction reaches the Super CF that's active during the test.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-17 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478255#comment-13478255
 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
-

bq. Is that equivalent?

Probably. How was it deleted/recreated? Did you drop and recreate?

bq. This does reproduce for every test run.

Interesting, you may be on to something. Would it make sense for you to perform 
the same test without the upgradesstables part (i.e. only the writes and reads)? 
If so, does that change something? Also, during that test, is there anything 
involving streaming going on (a repair, a node 
bootstrapping/moving/decommissioning)? I'm trying to narrow down what's involved 
in your test.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-17 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13478505#comment-13478505
 ] 

Chris Herron commented on CASSANDRA-4417:
-

bq. Probably, how was it deleted/recreated. Did you drop and recreate?

Yes, dropped (the schema migration flavor) and recreated a CF of the same name.

bq. Perform the same test without the upgradesstables part (i.e. only the 
writes and reads). If so, does that change something?

Have already tested that scenario. When running this load test without the 
concurrent upgradesstables compaction activity, the problem does not appear.

bq. during that test, is there anything involving streaming going on (a repair, 
a node bootstrapping/moving/decommissioning)?

Not that I know of. I can test again and monitor for streaming activity to see.

By the way, as we've been testing in preparation for a 1.1.x upgrade, we were 
seeing symptoms of CASSANDRA-4571 and CASSANDRA-4687 as well as this issue on C* 
1.1.6. In investigating CASSANDRA-4687 we disabled the key cache, repeated the 
load+upgradesstables test, and these invalid counter shard warnings did not 
appear.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-10-16 Thread Chris Herron (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13477556#comment-13477556
 ] 

Chris Herron commented on CASSANDRA-4417:
-

We are seeing large volumes of this error on all nodes when running a load test 
while also running upgradesstables on multiple CFs on each node.

After reading Sylvain's comments above, I tried running the same test with 
commitlog_sync: batch, and we get a similar volume of the same errors.

(Running a build from branch cassandra-1.1 at commit 
4d2e5e73b127dc0b335176ddc1dec1f0244e7f6d, with Java 6u35 on Amazon Linux 2.6.35)



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457504#comment-13457504
 ] 

Bartłomiej Romański commented on CASSANDRA-4417:


Is it possible to predict how dangerous this bug could be? We are already 
experiencing very serious problems with CASSANDRA-4639. Our counter values 
suddenly became a few times higher than expected. As you can imagine, this is a 
disaster from the business point of view. We are already seriously thinking 
about going back to SQL databases :/ I wonder how (if) this bug (and possibly 
other counter-related bugs) can affect us. We rely heavily on counters.

Can this bug possibly lead to incorrect counter values? Temporarily or 
permanently? Will running repair fix it? 

How incorrect could counter values be? Losing a couple of increments immediately 
preceding a node failure is probably acceptable in most cases. Is it possible 
to lose more increments? Or to end up with completely incorrect counter values, 
as in CASSANDRA-4639?

What exactly would happen after hitting this bug? Would running repair fix it? 
Would the self-healing mechanism actually make the counter consistent again? Or 
will we get these error messages over and over?

Sorry for writing a comment full of questions, but I've got very limited 
knowledge of Cassandra internals. I'd be very thankful if someone could address 
the questions above.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457511#comment-13457511
 ] 

Bartłomiej Romański commented on CASSANDRA-4417:


In the previous comment I wanted to point directly to CASSANDRA-4436; I 
mixed up the numbers.

One more thing: could hinted handoff possibly be related to this issue somehow? 
We've got a problem with it (CASSANDRA-4673) which was discovered at (more or 
less) the same time as our counter problems. Is there a possibility that 
sending a hinted handoff a few times ends up incrementing counters a few 
times?




[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457528#comment-13457528
 ] 

Bartłomiej Romański commented on CASSANDRA-4417:


And the last comment. Could this be related to CASSANDRA-4071? If I understand 
the description correctly, any topology change (adding a node, moving a node) 
when the counter is spread across more than one sstable can result in the 
invalid counter shard detected error message during reads. Am I right?



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-12 Thread Peter Schuller (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453798#comment-13453798
 ] 

Peter Schuller commented on CASSANDRA-4417:
---

@Sylvain I know it wouldn't be correlated with the *same* node; I was referring 
to uncontrolled shutdowns in general in the cluster.

@Omid: Presumably the premise was that the mutation goes through the commit log 
on the leader prior to replication. I'm not sure if this is the case, but if it 
is, then it should work.

@jbellis FWIW, our counter use-cases are such that going to a synchronous commit 
log is probably not feasible due to very high write throughput. Doesn't mean 
other people's use-cases are the same, and of course I *fully* support the idea 
of being correct by default (as opposed to performant by default).

@Sylvain again: I agree about refreshing nodeid on every unclean restart being 
potentially dangerous. Counters are already huge due to the size of counter 
shards, and refreshing nodeids in any situation which might result in en-masse 
refreshment can definitely be dangerous both from a CPU usage perspective as 
well as a disk space one.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-11 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453058#comment-13453058
 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
-

That's a very good point. Counters do rely on the fact that nodes do not lose 
the increments they are leader for (or that they don't reuse the same nodeId 
if they do), but unless the commit log uses batch mode, this can happen. And 
that will lead to exactly the exception seen here, so I'd say there's a very 
good chance this is the problem.

I'll note that if that is indeed a problem, it's very possible that the error 
logged happens only much later (after the unclean shutdown) and on some other 
node than the one having died. So not being able to correlate the error to an 
unclean shutdown doesn't really indicate that it's not related.

The consequence of this happening is that the increments held only in the 
un-synced commit log are lost. Meaning that with the default 
configuration, one could lose up to 10 seconds of increments (for which the 
dying node is leader). However, I think it is also possible for read results 
to miss slightly more than that, though that last part should fix 
itself if the counter is incremented again.

As for the error message logged, it's possible that lots of them are logged 
even though only a small number of counters are affected, since it's printed 
during column reconciliation and thus could be logged many times for the same 
counter.

A simple workaround is to use batch commit log, but that has a potentially 
important performance impact.

Another solution I've thought of would be to try to detect an unclean shutdown 
(by marking something during clean shutdown and checking for that at startup) 
and, if we detect one, to renew the nodeId. The problem with that is that this 
potentially means renewing the nodeId pretty often. And each time we do that, 
the internal representation of counters grows, and I'm really afraid it will be 
a problem in that case. And while we have a mechanism to shrink counters back 
by merging sub-counts when the nodeId is renewed too often, that mechanism 
assumes that the node owning the nodeId has the most up-to-date value for this 
sub-count, which is exactly the problem here. So overall I don't have any good 
idea to fix this. Other ideas?
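
For illustration only, the "mark on clean shutdown, check at startup" idea could be sketched as below. This is not Cassandra code: the marker file name and the functions `startup` and `clean_shutdown` are hypothetical, and the sketch deliberately ignores the counter-growth problem described above.

```python
import os
import tempfile
import uuid

MARKER = "clean_shutdown"   # hypothetical marker file name

def startup(data_dir, node_id):
    """Return the node id to use; renew it if the last shutdown was unclean."""
    marker = os.path.join(data_dir, MARKER)
    if os.path.exists(marker):
        os.remove(marker)          # consume the marker so a later crash is detected
        return node_id
    return str(uuid.uuid1())       # no marker: assume unclean shutdown, renew the id

def clean_shutdown(data_dir):
    """Leave a marker behind so the next startup knows the stop was orderly."""
    with open(os.path.join(data_dir, MARKER), "w") as f:
        f.write("ok")

with tempfile.TemporaryDirectory() as d:
    clean_shutdown(d)
    same = startup(d, "node-1")     # marker present: id is kept
    renewed = startup(d, same)      # marker gone (simulated crash): id is renewed
    print(same, renewed != same)
```

Note that in this sketch a very first startup also finds no marker and therefore renews the id; a real implementation would have to decide how to treat that case.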




[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-11 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453105#comment-13453105
 ] 

Jonathan Ellis commented on CASSANDRA-4417:
---

Maybe it's time to make commitlog mode (off/periodic/batch) per-CF instead of 
a mix of global and per-KS.  Then we could automatically force counter CFs to 
batch.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-11 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453117#comment-13453117
 ] 

Brandon Williams commented on CASSANDRA-4417:
-

Under multiple-CF concurrency, wouldn't you effectively end up with batch mode?



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-11 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453123#comment-13453123
 ] 

Jonathan Ellis commented on CASSANDRA-4417:
---

You would do more fsyncs, but only the CFs in actual batch mode would have to 
block for them.  Periodic mode just queues the CL op and moves on to memtable 
append immediately.
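
The difference between the two modes can be sketched with a toy model. Everything here (`ToyCommitLog`, its fields, the string entries) is hypothetical and greatly simplified; in particular, real batch mode groups writes over a short window rather than syncing on every append.

```python
class ToyCommitLog:
    """Toy commit log shared by CFs in different sync modes."""

    def __init__(self):
        self.pending = []   # queued entries, not yet fsynced
        self.durable = []   # entries that would survive a crash
        self.fsyncs = 0

    def sync(self):
        # Stand-in for fsync: everything queued so far becomes durable.
        self.durable.extend(self.pending)
        self.pending.clear()
        self.fsyncs += 1

    def append(self, entry, mode):
        self.pending.append(entry)
        if mode == "batch":
            self.sync()      # the writer waits for the sync before acknowledging
        # "periodic": return at once; a timer would call sync() later

    def crash(self):
        # Whatever was still only queued is lost.
        lost = list(self.pending)
        self.pending.clear()
        return lost

cl = ToyCommitLog()
cl.append("counter_cf: +1", "batch")
cl.append("other_cf: write", "periodic")
lost = cl.crash()
print(cl.durable, lost)
```

Because `sync()` flushes everything queued, a batch-mode write also makes earlier periodic-mode entries durable, which is the "more fsyncs" effect mentioned above.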



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-11 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453132#comment-13453132
 ] 

Sylvain Lebresne commented on CASSANDRA-4417:
-

That's an option. Though not exactly a short-term one (I suspect mixing 
periodic and batch CFs on the same commit log might require a bit of care; 
unless you were thinking of having multiple commit logs, but I'm not sure that 
would be a good thing).

But hey, I don't have a much better solution so far, so looking at that option 
is definitely worth it (since it's generally useful anyway).



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-11 Thread Omid Aladini (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453537#comment-13453537
 ] 

Omid Aladini commented on CASSANDRA-4417:
-

{quote}
A simple workaround is to use batch commit log, but that has a potentially 
important performance impact.
{quote}

I'm a bit confused why batch commit would solve the problem. If Cassandra 
crashes before the batch is fsynced, the counter mutations which it was the 
leader for will still be lost although they might have been applied on other 
replicas. The difference would be that the mutations won't be acknowledged to 
the client, and since counters aren't idempotent, the client won't know whether 
to retry or not. Am I missing something?



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-10 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13451823#comment-13451823
 ] 

Fabien Rousseau commented on CASSANDRA-4417:


Could it be that a counter increment is written to the commitlog and sent to 
other replicas, BUT the commitlog is not yet flushed to disk AND Cassandra is 
stopped/killed?

Example:
 - node A receives an increment of 1 for c1
 - it stores (A, 1, 1)
 - it sends the increment to the replicas (A, 1, 1)
 - node is killed (without the commitlog being flushed to disk)
 - on restart, node A receives an increment of 3 for c1
 - it stores (A, 1, 3) (because it has no way of knowing the clock 1 was 
already attributed)
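
The sequence above can be replayed in a small simulation. This is only a sketch under assumptions: a shard is modeled as a (node id, clock, count) tuple as in the logged error, and the `Leader` class with its `fsync` and `crash_and_restart` methods is a hypothetical stand-in, not Cassandra code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Shard:
    node_id: str
    clock: int
    count: int

class Leader:
    """Toy leader for a single counter."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.memory = None   # latest shard in memory, lost on crash
        self.synced = None   # latest shard whose commit log entry reached disk

    def increment(self, delta):
        prev = self.memory
        clock = prev.clock + 1 if prev else 1
        count = prev.count + delta if prev else delta
        self.memory = Shard(self.node_id, clock, count)
        return self.memory   # this shard is what gets sent to the replicas

    def fsync(self):
        self.synced = self.memory

    def crash_and_restart(self):
        # Only what was synced survives the restart.
        self.memory = self.synced

a = Leader("A")
replicated = a.increment(1)      # (A, 1, 1) is sent to the replicas
a.crash_and_restart()            # killed before fsync: the increment is forgotten
local = a.increment(3)           # (A, 1, 3): clock 1 is handed out a second time

# Same node id and clock but a different count: the situation the error reports.
conflict = (replicated.node_id == local.node_id
            and replicated.clock == local.clock
            and replicated.count != local.count)
print(replicated, local, conflict)
```

Calling `a.fsync()` before the crash makes the second increment produce clock 2 instead, and the conflict disappears.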




[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-10 Thread Charles Brophy (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13452016#comment-13452016
 ] 

Charles Brophy commented on CASSANDRA-4417:
---

Yes, we operate under a heavy write load and we do have frequent compaction as 
a result, but under normal conditions I never see this exception - or at least 
it doesn't happen often enough for me to catch. Following a repair, however, 
it's a guarantee for us. Could it be as simple as:

* the two servers are participants in the same key-range replicant and the 
sstables contain the same key/row/column references
* the process of streaming repair is sending a set of key/row/column references 
to the requestor in the same sstable as the out-of-sync data it's already aware 
of via repair
* Compaction finds the duplicate references in the recently received sstables - 
they're basically the other replicant's copies of that data

It seems that the act of sending the sstables from one server to the other when 
both are replicants of the same key range would be expected to result in 
duplicate references. I'm probably way off.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-10 Thread Peter Schuller (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13452689#comment-13452689
 ] 

Peter Schuller commented on CASSANDRA-4417:
---

@Fabien That sounds plausible to me upon first read at least. I cannot confirm 
or deny whether it's possible that I've only seen this under circumstances 
where a non-clean shutdown has taken place. In particular, I cannot (off the 
top of my head anyway) think of a way in which the situation you describe is 
ever *prevented* from happening. So regardless of whether or not this is *the* 
explanation for this problem, it seems to me to at least be *an* explanation 
for it.



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-07 Thread Charles Brophy (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13450766#comment-13450766
 ] 

Charles Brophy commented on CASSANDRA-4417:
---

We have a six-node cluster with even key range balance, random partitioner, and 
with replication factor=2. I get these errors immediately following running 
nodetool repair but ONLY if a streaming repair happens as a result. We are 
serving live updates to our counters from our clickstream. My guess is that the 
sstable being streamed between the servers winds up becoming out of date for 
the duration of the streaming process and ends up containing these duplicates 
that are vetted during the subsequent compaction. In any case, for us it is 
100% reproducible via: nodetool repair - streaming repair - subsequent 
compaction. Let me know if you need more details. Hope this helps!



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-09-07 Thread Peter Schuller (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13451240#comment-13451240
 ] 

Peter Schuller commented on CASSANDRA-4417:
---

I am not 100% certain, but I am fairly certain, that we've seen this on nodes 
that haven't done any streaming whatsoever.

With respect to duplicates: It's certainly not *supposed* to happen. A given 
counter shard, from a given node id, and with a given clock, should only ever 
be produced exactly once by exactly one node. Obviously the bug isn't supposed 
to happen to begin with, so that doesn't mean the bug isn't related to 
streaming.

Hmmm.

Do you have a lot of writes normally? Is it possible that the correlation with 
streaming is because of the fact that it initiates significant amounts of 
compaction?
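
The invariant described above, together with the behaviour the error message reports ("differ only in count; will pick highest to self-heal"), can be sketched as a reconcile function. This is a simplified assumption about the merge rule, not the actual CounterContext code; `Shard` and `reconcile` are hypothetical names.

```python
import logging
from typing import NamedTuple

log = logging.getLogger("counter_sketch")

class Shard(NamedTuple):
    node_id: str
    clock: int
    count: int

def reconcile(a: Shard, b: Shard) -> Shard:
    """Pick the surviving shard when two copies for the same node id meet."""
    if a.node_id != b.node_id:
        raise ValueError("shards from different node ids are not compared here")
    if a.clock != b.clock:
        return a if a.clock > b.clock else b      # normal case: newer clock wins
    if a.count != b.count:
        # Should never happen: one (node id, clock) pair produced two counts.
        log.error("invalid counter shard detected; %s and %s differ only in count",
                  a, b)
        return a if a.count > b.count else b      # pick highest to self-heal
    return a                                      # identical copies

print(reconcile(Shard("n", 1, 13), Shard("n", 1, 1)))
```

With the values from the reported log lines, the pair with counts 13 and 1 resolves to 13, and the pair with counts -1 and 1 resolves to 1.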



[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-07-24 Thread Michael Theroux (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13421432#comment-13421432
 ] 

Michael Theroux commented on CASSANDRA-4417:


I just hit this myself on 1.1.2 on two nodes of a six node cluster.  The 
cluster has been stable for a couple of weeks.

If it makes any difference, we recently enabled row-caching.

...
INFO [AntiEntropyStage:1] 2012-07-24 11:05:55,537 AntiEntropyService.java (line 
206) [repair #b9355020-d57e-11e1--7c4549350fdf] Received merkle tree for 
caches from /10.29.214.111
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,532 CounterContext.java 
(line 381) invalid counter shard detected; 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, -1) and 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, 1) differ only in count; will pick 
highest to self-heal; this indicates a bug or corruption generated a bad 
counter shard
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,533 CounterContext.java 
(line 381) invalid counter shard detected; 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, 1) and 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, -1) differ only in count; will pick 
highest to self-heal; this indicates a bug or corruption generated a bad 
counter shard
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,534 CounterContext.java 
(line 381) invalid counter shard detected; 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, -1) and 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, 1) differ only in count; will pick 
highest to self-heal; this indicates a bug or corruption generated a bad 
counter shard
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,534 CounterContext.java 
(line 381) invalid counter shard detected; 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, 1) and 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, -1) differ only in count; will pick 
highest to self-heal; this indicates a bug or corruption generated a bad 
counter shard
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,534 CounterContext.java 
(line 381) invalid counter shard detected; 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, -1) and 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, 1) differ only in count; will pick 
highest to self-heal; this indicates a bug or corruption generated a bad 
counter shard
ERROR [CompactionExecutor:183] 2012-07-24 11:05:58,535 CounterContext.java 
(line 381) invalid counter shard detected; 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, 1) and 
(6be74ab0-6cc6-11e1--242d50cf1fd7, 1, -1) differ only in count; will pick 
highest to self-heal; this indicates a bug or corruption generated a bad 
counter shard
 INFO [AntiEntropyStage:1] 2012-07-24 11:06:05,541 AntiEntropyService.java 
(line 206) [repair #b9355020-d57e-11e1--7c4549350fdf] Received merkle tree 
for caches from /10.144.15.6
...






[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-07-24 Thread Michael Theroux (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13421440#comment-13421440
 ] 

Michael Theroux commented on CASSANDRA-4417:


Ignore that comment about row-caching.  I see these errors in the log dating 
back to the 11th of July (long before we enabled row-caching).






[jira] [Commented] (CASSANDRA-4417) invalid counter shard detected

2012-07-06 Thread Peter Schuller (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13407747#comment-13407747
 ] 

Peter Schuller commented on CASSANDRA-4417:
---

You are triggering a self-healing mechanism introduced in CASSANDRA-3641. I can 
confirm we're still seeing it on 1.1 too, even on sstables that haven't been 
upgraded from older versions. I don't think anyone knows exactly why it's 
happening.

The condition being checked for is never supposed to happen. Prior to 
CASSANDRA-3641, the result would be counters with values that never converge, 
even with read-repair; post-CASSANDRA-3641 the values converge. But depending 
on the root cause, it's unclear whether there is a danger of incorrect counter 
values.
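
To make the logged condition concrete, here is a rough sketch of the rule the 
error message describes. It is not Cassandra's CounterContext code. It assumes 
a simplified model in which a shard is an (id, clock, count) triple and, for 
the same id, the shard with the higher clock normally wins. The names Shard 
and merge_shards, and the id "node-a", are made up for illustration. The only 
part taken from the log message itself is the last branch: same id, same 
clock, different count, so pick the highest count.

```python
import logging
from typing import NamedTuple, Tuple

logger = logging.getLogger("counter_shard_sketch")


class Shard(NamedTuple):
    """Hypothetical stand-in for one counter shard: (id, clock, count)."""
    counter_id: str
    clock: int
    count: int


def merge_shards(left: Shard, right: Shard) -> Tuple[Shard, bool]:
    """Merge two shards with the same id.

    Returns (winner, invalid), where invalid is True only when the shards
    have the same id and clock but different counts.
    """
    if left.counter_id != right.counter_id:
        raise ValueError("shards belong to different counter ids")

    # Assumed normal case: the shard with the higher clock supersedes.
    if left.clock != right.clock:
        return (left if left.clock > right.clock else right), False

    # Same id, same clock, same count: identical shards, nothing to resolve.
    if left.count == right.count:
        return left, False

    # Same id and clock but different counts: this should never happen.
    # Following the log message, pick the highest count so that all
    # replicas resolve the conflict the same way.
    logger.error(
        "invalid counter shard detected; %s and %s differ only in count; "
        "will pick highest to self-heal",
        tuple(left), tuple(right),
    )
    return (left if left.count > right.count else right), True


if __name__ == "__main__":
    winner, invalid = merge_shards(Shard("node-a", 1, -1), Shard("node-a", 1, 1))
    print(winner, invalid)
```

Because the tie-break depends only on the two shards, every replica picks the 
same winner, which is why the values converge. Whether the winning count is 
the correct one is a separate question, and the sketch says nothing about it.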


