[jira] [Updated] (CASSANDRA-4436) Counters in columns don't preserve correct values after cluster restart

2012-07-26 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-4436:


Attachment: 4436-1.1-2.txt
4436-1.0-2.txt

You're completely right, I'm still thinking too much in terms of 
SizeTieredCompaction.

Updated patches to use a Set.

 Counters in columns don't preserve correct values after cluster restart
 ---

 Key: CASSANDRA-4436
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4436
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.0.10
Reporter: Peter Velas
Assignee: Sylvain Lebresne
 Fix For: 1.1.3

 Attachments: 4436-1.0-2.txt, 4436-1.0-2.txt, 4436-1.0.txt, 
 4436-1.1-2.txt, 4436-1.1-2.txt, 4436-1.1.txt, increments.cql.gz


 Similar to #3821, but affecting normal columns.
 Set up a 2-node cluster with rf=2.
 1. Create a counter column family and increment 100 keys in a loop 5000 
 times. 
 2. Do a rolling restart of the cluster. 
 3. Increment another 5000 times.
 4. Do a rolling restart of the cluster.
 5. Increment another 5000 times.
 6. Do a rolling restart of the cluster.
 After step 6 we were able to reproduce the bug with bad counter values. 
 The expected value was 15000. The values returned from the cluster are 
 higher than 15000 by some random amount.
 Rolling restarts are done with nodetool drain, always waiting until the 
 second node discovers that the drained node is down and then killing the 
 java process. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-4436) Counters in columns don't preserve correct values after cluster restart

2012-07-24 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-4436:


Attachment: 4436-1.1-2.txt
4436-1.0-2.txt

bq. Looks like skipCompacted in Directories.SSTableLister can be removed (since 
we scrubDataDirectories on startup and no new compacted components will be 
created).

True, though there is the (arguably remote) possibility that people call 
loadNewSSTables() (or the offline scrub from CASSANDRA-4441) on sstables that 
have some -Compacted components. So I would prefer leaving it in 1.1 and 
removing it during the merge to trunk, just to be sure minor upgrades are as 
little disruptive as possible.

bq. Using a List means we can add an ancestor multiple times. Suggest using a 
Set instead.

But we won't have the same ancestor multiple times. Otherwise that would be a 
bug (and, at least for counters, a particularly bad one). For sanity, though, 
I've added an assertion to check that this doesn't happen. I've kept a List, 
however: since the list will be small, the difference between List.contains() 
and Set.contains() will be negligible, and it's only checked in an assertion, 
once, at sstable creation. On the other hand, Lists have a smaller memory 
footprint. Though I admit that in either case we're talking about minor 
differences.
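
The two options discussed here can be compared with a minimal sketch. It is not taken from the attached patches: the class name and the addToList/addToSet helpers are hypothetical, and generations are modelled as plain integers. With a List, a duplicate ancestor has to be detected by an explicit contains() check; with a Set, add() itself reports whether the element was already present.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the code from the 4436 patches.
final class AncestorCollectionSketch {
    // List variant: nothing stops a duplicate, so check before adding.
    static void addToList(List<Integer> ancestors, int generation) {
        if (ancestors.contains(generation)) {
            throw new IllegalStateException("ancestor added twice: " + generation);
        }
        ancestors.add(generation);
    }

    // Set variant: add() returns false when the element is already present.
    static void addToSet(Set<Integer> ancestors, int generation) {
        if (!ancestors.add(generation)) {
            throw new IllegalStateException("ancestor added twice: " + generation);
        }
    }

    public static void main(String[] args) {
        List<Integer> asList = new ArrayList<>();
        addToList(asList, 1);
        addToList(asList, 2);

        Set<Integer> asSet = new LinkedHashSet<>();
        addToSet(asSet, 1);
        addToSet(asSet, 2);

        System.out.println(asList.size() + " " + asSet.size());
    }
}
```

For the handful of generations involved in a single compaction, either variant costs about the same, which matches the "minor differences" point above.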

bq. would prefer Ancestor to LiveAncestor, since we only check liveness at 
creation time, so Live is misleading when iterating over them later.

Renamed.

bq. the deleting code feels more at home in CFS constructor than 
addInitialSSTables.

Moved.

bq. tracker parameter is unused now in SSTR.open

Removed. I realized that setTrackedBy was already always called through 
DataTracker.addNewSSTablesSize, so I also removed the duplicate call.







[jira] [Updated] (CASSANDRA-4436) Counters in columns don't preserve correct values after cluster restart

2012-07-18 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-4436:


Attachment: 4436-1.1.txt
4436-1.0.txt

Thanks a lot, Peter, for helping to reproduce this issue.

The problem is that when a node stops just as a compaction is finishing, it is 
possible for some of the source files of that compaction to lack -Compacted 
components even though the resulting compacted file is no longer temporary. 
(This also applies when a node is drained: we don't wait for all compactions 
to end during drain, as this could mean waiting for a very long time, at least 
with SizeTieredCompaction.) In other words, it is possible that when the node 
is restarted, it will load both the compacted file and some of the files that 
were compacted to produce it. While this is harmless (though inefficient) for 
normal column families, it means overcounting for counters.
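
To make the overcounting concrete, here is a minimal sketch in Java. It is not Cassandra code: the class name, the map-based stand-in for an sstable, and readCounter are all hypothetical, and real counter columns are more involved than a plain sum of per-file deltas. It only shows why loading a compacted file together with one of its source files inflates a counter.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model: each "sstable" maps a row key to the
// counter delta stored in that file.
final class OvercountSketch {
    // A read sums the contribution of every loaded file.
    static long readCounter(List<Map<String, Long>> loaded, String key) {
        long total = 0;
        for (Map<String, Long> sstable : loaded) {
            total += sstable.getOrDefault(key, 0L);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> source1 = Map.of("k", 5000L);
        // Stands for the result of compacting source1 with a second source
        // file that also held 5000 for "k".
        Map<String, Long> compacted = Map.of("k", 10000L);

        // Only the compacted file is loaded: the value is correct.
        System.out.println(readCounter(List.of(compacted), "k")); // prints 10000

        // The compacted file and a leftover source file are both loaded:
        // the source file's delta is counted twice.
        System.out.println(readCounter(List.of(compacted, source1), "k")); // prints 15000
    }
}
```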

I'll note that even though I can't reproduce the counter bug on 1.1 with the 
test case above, that is just luck: 1.1 is affected as well.

What we need to guarantee is that we never use both a compacted file and one 
of its ancestors. One way to ensure that is to keep, in the metadata of the 
compacted file, the list of its ancestors (we only need to keep the 
generation). Then, when a node starts, it can gather all the ancestors of all 
the sstables in the data dir and delete every sstable that is in this ancestor 
set. Since we don't want to keep an ever-growing list of ancestors, however, a 
newly compacted sstable only needs to keep the list of its still-live 
ancestors (which 99% of the time means keeping only the generations of the 
files that were compacted to obtain it). I note that if we do that, we don't 
need to generate -Compacted components.
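
The startup step described above can be sketched as follows. This is an illustration under stated assumptions, not the code in the attached patches: the class name and obsoleteGenerations are hypothetical, and each sstable is reduced to its generation number plus the set of ancestor generations recorded in its metadata.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the startup cleanup; not the 4436 patch itself.
final class AncestorCleanupSketch {
    // Input: generation of each sstable found on disk -> ancestors recorded
    // in that sstable's metadata. Output: generations that are still on disk
    // but are an ancestor of some other sstable, and so should be deleted.
    static Set<Integer> obsoleteGenerations(Map<Integer, Set<Integer>> ancestorsByGeneration) {
        Set<Integer> allAncestors = new HashSet<>();
        for (Set<Integer> ancestors : ancestorsByGeneration.values()) {
            allAncestors.addAll(ancestors);
        }
        // Ancestors that were already deleted are not on disk; ignore them.
        Set<Integer> obsolete = new TreeSet<>(allAncestors);
        obsolete.retainAll(ancestorsByGeneration.keySet());
        return obsolete;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> onDisk = new HashMap<>();
        onDisk.put(1, Set.of());      // source file left behind by the stop
        onDisk.put(3, Set.of(1, 2));  // result of compacting 1 and 2
        onDisk.put(4, Set.of());      // unrelated flushed sstable
        // Generation 2 was deleted before the node stopped.

        System.out.println(obsoleteGenerations(onDisk)); // prints [1]
    }
}
```

In this example only generation 1 is removed; generation 3 already contains its data, so the counter values it holds are read once.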

Attaching patches that implement this, one for 1.0 and one for 1.1 (they 
aren't very different). I wrote the 1.0 version because that is the version on 
which I knew how to reproduce the counter bug reliably, and I've checked that 
this patch does fix the issue. However, this patch doesn't only affect counter 
code and is not trivial per se, so I don't know how I feel about risking 
breaking things on 1.0 for non-counter users at this point. I think it might be 
wiser to put this in 1.1.3 only and say that counter users should either apply 
the attached patch at their own risk or upgrade to 1.1.3.







[jira] [Updated] (CASSANDRA-4436) Counters in columns don't preserve correct values after cluster restart

2012-07-17 Thread Peter Velas (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Velas updated CASSANDRA-4436:
---

Attachment: increments.cql.gz






[jira] [Updated] (CASSANDRA-4436) Counters in columns don't preserve correct values after cluster restart

2012-07-17 Thread Peter Velas (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Velas updated CASSANDRA-4436:
---

Attachment: increments.cql.gz

Increment for batch loading through cassandra-cli.






[jira] [Updated] (CASSANDRA-4436) Counters in columns don't preserve correct values after cluster restart

2012-07-17 Thread Peter Velas (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-4436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Velas updated CASSANDRA-4436:
---

Attachment: (was: increments.cql.gz)

