[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644421#comment-14644421
 ] 

Benedict commented on CASSANDRA-7066:
-

[~stefania]: Jake has correctly pointed out that we should document these changes 
in NEWS.txt. Would you mind composing a suitable blurb briefly explaining how 
the new system works from an operator's POV?

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to clean up incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily clean up completed files, or conversely fail to clean up 
> files that have been superseded); and 2) it's only used for regular 
> compaction - no other compaction types are guarded in the same way, which can 
> result in duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way, as soon as we finish writing, we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.
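
For illustration, a minimal sketch of the proposed startup cleanup, assuming 
each sstable records the generations of its direct ancestors in metadata; all 
names here (SSTableMeta, ancestorGenerations, etc.) are hypothetical, not the 
actual patch:

{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AncestorCleanup
{
    // Delete any sstable whose generation appears in the union of all
    // recorded ancestor sets: it has been superseded by a completed write.
    static void deleteSuperseded(List<SSTableMeta> sstables)
    {
        Set<Integer> ancestors = new HashSet<>();
        for (SSTableMeta t : sstables)
            ancestors.addAll(t.ancestorGenerations());

        for (SSTableMeta t : sstables)
            if (ancestors.contains(t.generation()))
                t.delete();
    }

    interface SSTableMeta
    {
        int generation();
        Set<Integer> ancestorGenerations();
        void delete();
    }
}
{code}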



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8894) Our default buffer size for (uncompressed) buffered reads should be smaller, and based on the expected record size

2015-07-28 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-8894:

Fix Version/s: (was: 3.x)
   3.0 alpha 1

> Our default buffer size for (uncompressed) buffered reads should be smaller, 
> and based on the expected record size
> --
>
> Key: CASSANDRA-8894
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8894
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>  Labels: benedict-to-commit
> Fix For: 3.0 alpha 1
>
> Attachments: 8894_25pct.yaml, 8894_5pct.yaml, 8894_tiny.yaml
>
>
> A large contributor to buffered reads being slower than mmapped reads is 
> likely that we read a full 64KB at once, when average record sizes may be as 
> low as 140 bytes in our stress tests. The TLB has only 128 entries on a 
> modern core, and each read will touch 32 of these, meaning we will almost 
> never hit the TLB, and will incur at least 30 unnecessary misses each time 
> (as well as the other costs of larger-than-necessary accesses). When working 
> with an SSD there is little to no benefit in reading more than 4KB at once, 
> and in either case reading more data than we need is wasteful. So I propose 
> selecting a buffer size that is the next larger power of 2 than our average 
> record size (with a minimum of 4KB), so that we expect to read in one 
> operation. I also propose that we create a pool of these buffers up-front, 
> and that we ensure they are all exactly aligned to a virtual page, so that 
> the source and target operations each touch exactly one virtual page per 4KB 
> of expected record size.
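
As a rough sketch of the sizing rule described above (round the average record 
size up to a power of two, with a 4KB floor) - illustrative only, not the 
committed code:

{code}
final class BufferSizing
{
    static final int MIN_BUFFER = 4096; // one 4KB virtual page

    // Round avgRecordSize up to the next power of two, no smaller than 4KB,
    // so a single read is expected to fetch a whole record.
    static int bufferSizeFor(int avgRecordSize)
    {
        int size = Integer.highestOneBit(Math.max(avgRecordSize, 1));
        if (size < avgRecordSize)
            size <<= 1;
        return Math.max(size, MIN_BUFFER);
    }
}
{code}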



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8894) Our default buffer size for (uncompressed) buffered reads should be smaller, and based on the expected record size

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1466#comment-1466
 ] 

Benedict commented on CASSANDRA-8894:
-

Thanks. Committed.

I'll hold off closing this ticket until we can get those performance tests. 
[~enigmacurry], any idea if/when cstar will support the readahead options?



> Our default buffer size for (uncompressed) buffered reads should be smaller, 
> and based on the expected record size
> --
>
> Key: CASSANDRA-8894
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8894
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>  Labels: benedict-to-commit
> Fix For: 3.0 alpha 1
>
> Attachments: 8894_25pct.yaml, 8894_5pct.yaml, 8894_tiny.yaml
>
>
> A large contributor to buffered reads being slower than mmapped reads is 
> likely that we read a full 64KB at once, when average record sizes may be as 
> low as 140 bytes in our stress tests. The TLB has only 128 entries on a 
> modern core, and each read will touch 32 of these, meaning we will almost 
> never hit the TLB, and will incur at least 30 unnecessary misses each time 
> (as well as the other costs of larger-than-necessary accesses). When working 
> with an SSD there is little to no benefit in reading more than 4KB at once, 
> and in either case reading more data than we need is wasteful. So I propose 
> selecting a buffer size that is the next larger power of 2 than our average 
> record size (with a minimum of 4KB), so that we expect to read in one 
> operation. I also propose that we create a pool of these buffers up-front, 
> and that we ensure they are all exactly aligned to a virtual page, so that 
> the source and target operations each touch exactly one virtual page per 4KB 
> of expected record size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9908) Potential race caused by async cleanup of transaction log files

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644500#comment-14644500
 ] 

Benedict commented on CASSANDRA-9908:
-

Hmm. So our secondary index files don't use their cfId in their directory name?

> Potential race caused by async cleanup of transaction log files
> ---
>
> Key: CASSANDRA-9908
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9908
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sam Tunnicliffe
>Assignee: Stefania
> Fix For: 3.0 beta 1
>
>
> There seems to be a potential race in the cleanup of transaction log files, 
> introduced in CASSANDRA-7066.
> It's pretty hard to trigger on trunk, but it's possible to hit it via 
> {{o.a.c.db.SecondaryIndexTest#testCreateIndex}} 
> That test creates an index, then removes it to check that the removal is 
> correctly recorded, then adds the index again to assert that it gets rebuilt 
> from the existing data. 
> The removal causes the SSTables of the index CFS to be dropped, which is a 
> transactional operation and so writes a transaction log. When the drop is 
> completed and the last reference to an SSTable is released, the cleanup of 
> the transaction log is scheduled on the periodic tasks executor. The issue is 
> that re-creating the index re-creates the index CFS. When this happens, it's 
> possible for the cleanup of the txn log to have not yet happened. If so, the 
> initialization of the CFS attempts to read the log to identify any orphaned 
> temporary files. The cleanup can happen between finding the log file and 
> reading its contents, which results in a {{NoSuchFileException}}:
> {noformat}
> [junit] java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] java.lang.RuntimeException: java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] at 
> org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:620)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionFile.getTrackedFiles(TransactionLogs.java:190)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionData.getTemporaryFiles(TransactionLogs.java:338)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs.getTemporaryFiles(TransactionLogs.java:739)
> [junit] at 
> org.apache.cassandra.db.lifecycle.LifecycleTransaction.getTemporaryFiles(LifecycleTransaction.java:541)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.getFilter(Directories.java:652)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:641)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.list(Directories.java:606)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:351)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:313)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:511)
> [junit] at 
> org.apache.cassandra.index.internal.CassandraIndexer.addIndexedColumn(CassandraIndexer.java:115)
> [junit] at 
> org.apache.cassandra.index.SecondaryIndexManager.addIndexedColumn(SecondaryIndexManager.java:265)
> [junit] at 
> org.apache.cassandra.db.SecondaryIndexTest.testIndexCreate(SecondaryIndexTest.java:467)
> [junit] Caused by: java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> [junit] at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> [junit] at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> [junit] at 
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
> [junit] at java.nio.file.Files.newByteChannel(Files.java:361)
> [junit] at java.nio.file.Files.newByteChannel(Files.java:407)
> [junit] at 
> java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
> [junit] at java.nio.file.Files.newInputStream(Files.java:152)
> [junit] at java.nio.file.Files.newBufferedReader(Files.java:2784)
> [junit] at java.nio.file.Files.readAllLines(Fil

[jira] [Commented] (CASSANDRA-9908) Potential race caused by async cleanup of transaction log files

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644539#comment-14644539
 ] 

Benedict commented on CASSANDRA-9908:
-

Doesn't that sort of obviate much of the supposed benefit of unique cfIds?

Anyway, I guess in this instance we can just make sure that all txn logs have 
been fully cleared before creating a secondary index. The rub will be if the 
cleanup fails, though (deletion on Windows, for instance), or if there are 
still extant readers. In either case the cleanup can't safely complete.

> Potential race caused by async cleanup of transaction log files
> ---
>
> Key: CASSANDRA-9908
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9908
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sam Tunnicliffe
>Assignee: Stefania
> Fix For: 3.0 beta 1
>
>
> There seems to be a potential race in the cleanup of transaction log files, 
> introduced in CASSANDRA-7066.
> It's pretty hard to trigger on trunk, but it's possible to hit it via 
> {{o.a.c.db.SecondaryIndexTest#testCreateIndex}} 
> That test creates an index, then removes it to check that the removal is 
> correctly recorded, then adds the index again to assert that it gets rebuilt 
> from the existing data. 
> The removal causes the SSTables of the index CFS to be dropped, which is a 
> transactional operation and so writes a transaction log. When the drop is 
> completed and the last reference to an SSTable is released, the cleanup of 
> the transaction log is scheduled on the periodic tasks executor. The issue is 
> that re-creating the index re-creates the index CFS. When this happens, it's 
> possible for the cleanup of the txn log to have not yet happened. If so, the 
> initialization of the CFS attempts to read the log to identify any orphaned 
> temporary files. The cleanup can happen between finding the log file and 
> reading its contents, which results in a {{NoSuchFileException}}:
> {noformat}
> [junit] java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] java.lang.RuntimeException: java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] at 
> org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:620)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionFile.getTrackedFiles(TransactionLogs.java:190)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionData.getTemporaryFiles(TransactionLogs.java:338)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs.getTemporaryFiles(TransactionLogs.java:739)
> [junit] at 
> org.apache.cassandra.db.lifecycle.LifecycleTransaction.getTemporaryFiles(LifecycleTransaction.java:541)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.getFilter(Directories.java:652)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:641)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.list(Directories.java:606)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:351)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:313)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:511)
> [junit] at 
> org.apache.cassandra.index.internal.CassandraIndexer.addIndexedColumn(CassandraIndexer.java:115)
> [junit] at 
> org.apache.cassandra.index.SecondaryIndexManager.addIndexedColumn(SecondaryIndexManager.java:265)
> [junit] at 
> org.apache.cassandra.db.SecondaryIndexTest.testIndexCreate(SecondaryIndexTest.java:467)
> [junit] Caused by: java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> [junit] at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> [junit] at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> [junit] at 
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
> [junit] at java.nio.file.Files.newByteChannel(Files.java:361)
> [junit] at java.nio.file.Files.newByteChannel(Files.java:4

[jira] [Commented] (CASSANDRA-9382) Snapshot file descriptors not getting purged (possible fd leak)

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644592#comment-14644592
 ] 

Benedict commented on CASSANDRA-9382:
-

+1.

This bug shouldn't affect 2.1+; however, there's no reason to track hotness 
anyway, even if it closes the resources safely.

> Snapshot file descriptors not getting purged (possible fd leak)
> ---
>
> Key: CASSANDRA-9382
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9382
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Mark Curtis
>Assignee: Yuki Morishita
> Fix For: 2.1.x, 2.0.x
>
> Attachments: yjp-heapdump.png
>
>
> OpsCenter has the repair service which does a lot of small range repairs. 
> Each repair would generate a snapshot as per normal. The cluster was showing 
> a steady increase in disk space over the course of a couple of days, and the 
> only way to work around the issue was to restart the node.
> Upon further inspection it was seen that lsof output for the Cassandra 
> process was still showing file descriptors for snapshots that no longer 
> existed on the file system. For example:
> {code}
> java 5822 cassandra  DEL  REG  202,32  7359833 
> /media/ephemeral1/cassandra/data/somekeyspace/table1/snapshots/669a3a30-f3d3-11e4-bec6-3f6c4fb06498/somekeyspace-table1-jb-897689-Data.db
> {code}
> We also took a heap dump, which basically showed the same thing: lots of 
> references to these file handles. We checked the logs for any errors, 
> especially relating to repairs that might have failed, but nothing was 
> observed.
> The repair service logs in OpsCenter also showed that all repairs (1000s of 
> them) had completed successfully, again indicating that there was no repair 
> issue.
> I have not yet been able to reproduce the issue locally on a test box. The 
> cluster that this original issue appeared on was a production cluster with 
> the following spec:
> cassandra_versions: 2.0.14.352
> cluster_cores : 8, 
> cluster_instance_types : i2.2xlarge
> cluster_os : Amazon linux amd64 
> node count: 4
> node java version: Oracle Java 1.7.0_51



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644607#comment-14644607
 ] 

Benedict commented on CASSANDRA-7066:
-

If you lose one of these files you are SOL; there's not a lot that can be done 
about a filesystem failure affecting one of them. If you want to try to 
recover, the best we can do is point you to the description of how they work 
and let you have a go at fixing it however you think best - preferably by 
restoring from backup.

If operators are messing with their sstable directory contents, this change 
simply means any file present at startup will be considered a live file. That's 
all the operator should really need to know, from that POV.

If restoring from backup, these logs should definitely all be deleted. I'm not 
sure what tools we offer to hand-hold backup/restore, but they should perhaps 
be made to understand this.

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to clean up incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily clean up completed files, or conversely fail to clean up 
> files that have been superseded); and 2) it's only used for regular 
> compaction - no other compaction types are guarded in the same way, which can 
> result in duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way, as soon as we finish writing, we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644623#comment-14644623
 ] 

Benedict commented on CASSANDRA-7066:
-

bq. Is that SOL as in "Now we can't tell which is new so we have to keep both 
and do redundant compaction" or as in "Now the node can't start up?"

Well, SOL in that we have no way of guaranteeing C* will be able to notice this 
has happened and do anything to avoid breaking things. If we see just one of 
the two files present, under non-fs failure that means we were part way through 
cleaning up, and need to finish deleting its contents. So if we lose the "new" 
file, say, then we will delete the old files on startup, none the wiser. The 
result is a partial replacement of the sstables (perhaps with nothing at all). 
There's really nothing you can do to guarantee anything in the face of 
filesystem corruption.

If the operator _knows this has happened_ before starting up they can delete 
all the log files, sure, and then we'll keep everything.

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to clean up incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily clean up completed files, or conversely fail to clean up 
> files that have been superseded); and 2) it's only used for regular 
> compaction - no other compaction types are guarded in the same way, which can 
> result in duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way, as soon as we finish writing, we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644635#comment-14644635
 ] 

Benedict commented on CASSANDRA-7066:
-

Well, if your file system is losing files you're not safe either way. The 
question is how noticeable the loss of files is.

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to clean up incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily clean up completed files, or conversely fail to clean up 
> files that have been superseded); and 2) it's only used for regular 
> compaction - no other compaction types are guarded in the same way, which can 
> result in duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way, as soon as we finish writing, we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644685#comment-14644685
 ] 

Benedict commented on CASSANDRA-7066:
-

Well, a bug in the implementation could screw us up either way, and I'm not 
sure one is more robust to it than any other. But I'm certainly not terribly 
opposed to changing it again. It shouldn't be a drastic change.

However, if we're doing that, I'd rather we just went with a simple log file 
that represents new and old in one; i.e., we write lines like:

{noformat}
add:sstable-3
remove:sstable-2
commit
{noformat}

commit is only written at the very end, once we are removing the old files and 
adding the new ones. Otherwise we roll back.

This makes the changes pretty minimal, as behaviourally it's identical; it's 
just the on-disk representation that changes. It also retains the benefit of 
not double-counting your data. If we want to be _really_ secure, we can 
suffix each line with a checksum of the entire file (up to that point), and 
if any do not match we retain every file as a last-ditch fallback. We can also 
log panics in that case, so the operator knows for sure something bad is 
happening with their filesystem. (If only the last line does not match, and it 
is not "commit", we're as safe as we can be to roll back - but perhaps in this 
case we just log less panic-stricken warnings suggesting they consider deleting 
the duplicate files.)
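
To make the checksum idea concrete, a minimal sketch assuming the line format 
above, with a running CRC32 of everything written so far appended to each 
line; the class and method names are illustrative only:

{code}
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class TxnLogWriter
{
    private final Writer out;
    private final CRC32 crc = new CRC32(); // checksum of all records so far

    TxnLogWriter(Writer out) { this.out = out; }

    // e.g. append("add:sstable-3") or append("remove:sstable-2")
    void append(String record) throws IOException
    {
        crc.update(record.getBytes(StandardCharsets.UTF_8));
        out.write(record + "," + crc.getValue() + "\n");
        out.flush();
    }

    // written last, only once the replacement is complete
    void commit() throws IOException { append("commit"); }
}
{code}

On replay, recomputing the running checksum distinguishes the cases described 
above: a mismatch on any interior line means retain everything and log a 
panic; a mismatch only on a final non-commit line means a rollback is still 
safe.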

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to clean up incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily clean up completed files, or conversely fail to clean up 
> files that have been superseded); and 2) it's only used for regular 
> compaction - no other compaction types are guarded in the same way, which can 
> result in duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way, as soon as we finish writing, we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644685#comment-14644685
 ] 

Benedict edited comment on CASSANDRA-7066 at 7/28/15 5:23 PM:
--

Well, a bug in the implementation could screw us up either way, and I'm not 
sure one is more robust to it than any other. But I'm certainly not terribly 
opposed to changing it again. It shouldn't be a drastic change.

However, if we're doing that, I'd rather we just went with a simple log file 
that represents new and old in one; i.e., we write lines like:

{noformat}
add:sstable-3
remove:sstable-2
commit
{noformat}

commit is only written at the very end, once we are removing the old files and 
adding the new ones. Otherwise we roll back.

This makes the changes pretty minimal, as behaviourally it's identical; it's 
just the on-disk representation that changes. It also retains the benefit of 
not double-counting your data. If we want to be _really_ secure, we can 
suffix each line with a checksum of the entire file (up to that point), and 
if any do not match we retain every file as a last-ditch fallback. We can also 
log panics in that case, so the operator knows for sure something bad is 
happening with their filesystem. (If only the last line does not match, and it 
is not "commit", we're as safe as we can be to roll back - but perhaps in this 
case we just log less panic-stricken warnings suggesting they consider deleting 
the duplicate files.)


was (Author: benedict):
Well, a bug in the implementation could screw us up either way, and I'm not 
sure one is more robust to it than any other. But I'm certainly not terribly 
opposed to changing it again. It shouldn't be a drastic change.

However, if we're doing that, I'd rather we just went with a simple log file 
that represents new and old in one; i.e., we write lines like:

{{noformat}}
add:sstable-3
remove:sstable-2
commit
{{noformat}}

commit is only written at the very end, once we are removing the old files and 
adding the new ones. Otherwise we roll back.

This makes the changes pretty minimal, as behaviourally it's identical; it's 
just the on-disk representation that changes. It also retains the benefit of 
not double-counting your data. If we want to be _really_ secure, we can 
suffix each line with a checksum of the entire file (up to that point), and 
if any do not match we retain every file as a last-ditch fallback. We can also 
log panics in that case, so the operator knows for sure something bad is 
happening with their filesystem. (If only the last line does not match, and it 
is not "commit", we're as safe as we can be to roll back - but perhaps in this 
case we just log less panic-stricken warnings suggesting they consider deleting 
the duplicate files.)

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to clean up incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily clean up completed files, or conversely fail to clean up 
> files that have been superseded); and 2) it's only used for regular 
> compaction - no other compaction types are guarded in the same way, which can 
> result in duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way, as soon as we finish writing, we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644685#comment-14644685
 ] 

Benedict edited comment on CASSANDRA-7066 at 7/28/15 5:34 PM:
--

Well, a bug in the implementation could screw us up either way, and I'm not 
sure one is more robust to it than any other. But I'm certainly not terribly 
opposed to changing it again. It shouldn't be a drastic change.

However, if we're doing that, I'd rather we just went with a simple log file 
that represents new and old in one; i.e., we write lines like:

{noformat}
add:sstable-3
remove:sstable-2
commit
{noformat}

commit is only written at the very end, once we are removing the old files and 
adding the new ones. Otherwise we roll back.

This makes the changes pretty minimal, as behaviourally it's identical; it's 
just the on-disk representation that changes. It also retains the benefit of 
not double-counting your data. If we want to be _really_ secure, we can 
suffix each line with a checksum of the entire file (up to that point), and 
if any do not match we retain every file as a last-ditch fallback. We can also 
log panics in that case, so the operator knows for sure something bad is 
happening with their filesystem. (If only the last line does not match, and it 
is not "commit", we're as safe as we can be to roll back - but perhaps in this 
case we just log less panic-stricken warnings suggesting they consider deleting 
the duplicate files.)

(Edit: this is actually pretty necessary, thinking about it, since deletion of 
the old files can be _very_ delayed if they're involved in a long-running 
operation, so we run a pretty significant risk of duplicated data - perhaps 
much more than doubling...)


was (Author: benedict):
Well, a bug in the implementation could screw us up either way, and I'm not 
sure one is more robust to it than any other. But I'm certainly not terribly 
opposed to changing it again. It shouldn't be a drastic change.

However, if we're doing that, I'd rather we just went with a simple log file 
that represents new and old in one; i.e., we write lines like:

{noformat}
add:sstable-3
remove:sstable-2
commit
{noformat}

commit is only written at the very end, once we are removing the old files and 
adding the new ones. Otherwise we roll back.

This makes the changes pretty minimal, as behaviourally it's identical; it's 
just the on-disk representation that changes. It also retains the benefit of 
not double-counting your data. If we want to be _really_ secure, we can 
suffix each line with a checksum of the entire file (up to that point), and 
if any do not match we retain every file as a last-ditch fallback. We can also 
log panics in that case, so the operator knows for sure something bad is 
happening with their filesystem. (If only the last line does not match, and it 
is not "commit", we're as safe as we can be to roll back - but perhaps in this 
case we just log less panic-stricken warnings suggesting they consider deleting 
the duplicate files.)

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to clean up incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily clean up completed files, or conversely fail to clean up 
> files that have been superseded); and 2) it's only used for regular 
> compaction - no other compaction types are guarded in the same way, which can 
> result in duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way, as soon as we finish writing, we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644718#comment-14644718
 ] 

Benedict commented on CASSANDRA-7066:
-

bq. The value of the 'old way' was the naming was intrinsically tied to the 
state of the sstable and the renames were atomic.

A single rename was atomic. A set of renames was not.

There's no reason to ever back up these files, I don't think? As said, the 
worst-case scenario if they're missing is that you have too much data. 
However, backups should only be operating over snapshots, shouldn't they? So 
they don't even see these files.


> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to clean up incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily clean up completed files, or conversely fail to clean up 
> files that have been superseded); and 2) it's only used for regular 
> compaction - no other compaction types are guarded in the same way, which can 
> result in duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way, as soon as we finish writing, we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-28 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict reopened CASSANDRA-7066:
-

[~stefania]: I'm not sure we entirely settled on a modification, but it sounds 
like we should revisit this a little, so I'm reopening.

It would be best if this system were finalised prior to the beta, and if 
possible the alpha, so we should probably focus on it ahead of other tickets.

I think the proposed changes should deal with any of the outlined concerns; 
however, I think they can be made even more robust to backup interplay: if we 
also log the last-modified time of each file we're deleting alongside it, then 
we can bail out if this mismatches for _any_ file, and log some panics. This 
should all make us very paranoid and prefer retention of data, as well as make 
us a little more robust to filesystem failures and perhaps also bugs of our 
own making, although we will never be truly shielded from any of these issues.
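
A hedged sketch of that check, assuming the log records each file's path along 
with the last-modified time captured when the deletion was queued; all naming 
here is hypothetical:

{code}
import java.io.File;
import java.util.Map;

final class DeletionGuard
{
    // Returns false if any file's last-modified time no longer matches what
    // the log recorded - in which case we bail out, keep all files, and panic.
    static boolean safeToDelete(Map<String, Long> recordedLastModified)
    {
        for (Map.Entry<String, Long> e : recordedLastModified.entrySet())
        {
            File f = new File(e.getKey());
            if (f.exists() && f.lastModified() != e.getValue())
                return false;
        }
        return true;
    }
}
{code}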

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to clean up incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily clean up completed files, or conversely fail to clean up 
> files that have been superseded); and 2) it's only used for regular 
> compaction - no other compaction types are guarded in the same way, which can 
> result in duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way, as soon as we finish writing, we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645754#comment-14645754
 ] 

Benedict commented on CASSANDRA-7066:
-

Thanks Stef.

A couple of comments about the description:
* It would be helpful to elaborate on "replace the compactions_in_progress 
system table" to include a mention of the elimination of temporary file 
markers.
* It would be helpful to mention that if restoring from backup, these files 
should either be deleted (in the case of a full backup) or confirmed not to 
reference the files being restored.
** Perhaps we should offer a tool, like sstablelister, to complete any 
transaction log files offline, which could be specified as prep work for a 
backup restore.
* I've noticed a few English speakers (including in this conversation) use 
"loose" for "lose", which no doubt encouraged you to do the same. But it's 
incorrect :)

As far as the patch goes, it looks good to me on the whole (though I'll give it 
a full review once we've finalised the details). In the meantime, the 
incremental checksum at the very least seems like a good idea to detect even 
slight corruption of the log file (on the assumption that such corruption 
might indicate we've lost the commit record, or had been in the process of 
writing it, which seems to be one of the concerns).

I agree about individual files over-complicating things; however, we _could_ 
perhaps take the maximum last-modified time of all files with that descriptor, 
and then delete the files in ascending order of last-modified time. This would 
just help the log files be robust to backups being copied in with these files 
still present and referencing them as old, and also to some potential effects 
of problems like CASSANDRA-9908.

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to clean up incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily clean up completed files, or conversely fail to clean up 
> files that have been superseded); and 2) it's only used for regular 
> compaction - no other compaction types are guarded in the same way, which can 
> result in duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way, as soon as we finish writing, we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9738) Migrate key-cache to be fully off-heap

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645807#comment-14645807
 ] 

Benedict commented on CASSANDRA-9738:
-

Looks like improved occupancy is also a factor (comparing run 2 to run 3). I'd 
prefer to eliminate the malloc/free from cache maintenance before we hook OHC 
into the key cache, but these numbers are pretty compelling and we should 
probably consider it for 3.0.

I suspect CASSANDRA-8930 will permit us to deliver a more efficient and high 
occupancy key-cache, but that shouldn't prevent us doing something in the 
meantime.

> Migrate key-cache to be fully off-heap
> --
>
> Key: CASSANDRA-9738
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9738
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Robert Stupp
>Assignee: Robert Stupp
> Fix For: 3.x
>
>
> Key cache still uses a concurrent map on-heap. This could go to off-heap and 
> feels doable now after CASSANDRA-8099.
> Evaluation should be done in advance based on a POC to prove that a pure 
> off-heap key cache buys a performance and/or gc-pressure improvement.
> In theory, elimination of on-heap management of the map should buy us some 
> benefit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9669) If sstable flushes complete out of order, on restart we can fail to replay necessary commit log records

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645845#comment-14645845
 ] 

Benedict commented on CASSANDRA-9669:
-

So, I have a patch available for this 
[here|https://github.com/belliottsmith/cassandra/tree/9669-2.0]

I managed to make it less invasive than I had anticipated, but it still 
requires an sstable version increment. The patch:

* Introduces a commitLogLowerBound to the memtable, which tracks the commit log 
position at its creation
* Changes sstable metadata's "replayPosition" into "commitLogLowerBound" and 
"commitLogUpperBound" in the new sstable version
* Delays exposing a new sstable to the compaction strategy until all of its 
preceding flushes have completed
* On compaction, extends the new sstable's lower/upper bounds to the min/max of 
all sstables we're replacing. Given (3), we only extend over ranges that are 
known to already be covered by other sstables.
* On replay, we treat any range covered by an sstable as not needing replay 
(and any range prior to the earliest known safe range is also ignored)
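
A rough sketch of the replay rule in the final bullet, under the assumption 
that each sstable carries an inclusive [lower, upper] commit log interval per 
the new metadata; the names here are illustrative only:

{code}
import java.util.List;

final class ReplayFilter
{
    static final class Interval
    {
        final long lower, upper; // commit log positions, inclusive
        Interval(long lower, long upper) { this.lower = lower; this.upper = upper; }
    }

    // A record needs replay only if no flushed sstable's interval covers it.
    static boolean shouldReplay(long position, List<Interval> flushed)
    {
        for (Interval i : flushed)
            if (position >= i.lower && position <= i.upper)
                return false;
        return true;
    }
}
{code}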

Test Engineering: there are failures in the dtests, but I cannot tell if these 
are new or pre-existing. Mostly they look like flaky tests. The one that looks 
most worrisome to me is the counter upgrade test; could you take a look and 
tell me what you think of the test situation in general? Modifying 2.0 makes 
me uncomfortable.

> If sstable flushes complete out of order, on restart we can fail to replay 
> necessary commit log records
> ---
>
> Key: CASSANDRA-9669
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9669
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Critical
>  Labels: correctness
> Fix For: 3.x, 2.1.x, 2.2.x, 3.0.x
>
>
> While {{postFlushExecutor}} ensures it never expires CL entries out-of-order, 
> on restart we simply take the maximum replay position of any sstable on disk, 
> and ignore anything prior. 
> It is quite possible for there to be two flushes triggered for a given table, 
> and for the second to finish first by virtue of containing a much smaller 
> quantity of live data (or perhaps the disk is just under less pressure). If 
> we crash before the first sstable has been written, then on restart the data 
> it would have represented will disappear, since we will not replay the CL 
> records.
> This looks to be a bug present since time immemorial, and also seems pretty 
> serious.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9908) Potential race caused by async cleanup of transaction log files

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645925#comment-14645925
 ] 

Benedict commented on CASSANDRA-9908:
-

Probably we should at least introduce a {{TransactionLogs.waitForDeletions}} 
call after any schema change. We should also make certain that dropping an 
index awaits a new writeOrder and readOrder barrier prior to this.

> Potential race caused by async cleanup of transaction log files
> ---
>
> Key: CASSANDRA-9908
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9908
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sam Tunnicliffe
>Assignee: Stefania
> Fix For: 3.0 beta 1
>
> Attachments: TEST-org.apache.cassandra.db.SecondaryIndexTest.log
>
>
> There seems to be a potential race in the cleanup of transaction log files, 
> introduced in CASSANDRA-7066.
> It's pretty hard to trigger on trunk, but it's possible to hit it via 
> {{o.a.c.db.SecondaryIndexTest#testCreateIndex}} 
> That test creates an index, then removes it to check that the removal is 
> correctly recorded, then adds the index again to assert that it gets rebuilt 
> from the existing data. 
> The removal causes the SSTables of the index CFS to be dropped, which is a 
> transactional operation and so writes a transaction log. When the drop is 
> completed and the last reference to an SSTable is released, the cleanup of 
> the transaction log is scheduled on the periodic tasks executor. The issue is 
> that re-creating the index re-creates the index CFS. When this happens, it's 
> possible for the cleanup of the txn log to have not yet happened. If so, the 
> initialization of the CFS attempts to read the log to identify any orphaned 
> temporary files. The cleanup can happen between finding the log file and 
> reading its contents, which results in a {{NoSuchFileException}}:
> {noformat}
> [junit] java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] java.lang.RuntimeException: java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] at 
> org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:620)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionFile.getTrackedFiles(TransactionLogs.java:190)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionData.getTemporaryFiles(TransactionLogs.java:338)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs.getTemporaryFiles(TransactionLogs.java:739)
> [junit] at 
> org.apache.cassandra.db.lifecycle.LifecycleTransaction.getTemporaryFiles(LifecycleTransaction.java:541)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.getFilter(Directories.java:652)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:641)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.list(Directories.java:606)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:351)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:313)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:511)
> [junit] at 
> org.apache.cassandra.index.internal.CassandraIndexer.addIndexedColumn(CassandraIndexer.java:115)
> [junit] at 
> org.apache.cassandra.index.SecondaryIndexManager.addIndexedColumn(SecondaryIndexManager.java:265)
> [junit] at 
> org.apache.cassandra.db.SecondaryIndexTest.testIndexCreate(SecondaryIndexTest.java:467)
> [junit] Caused by: java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> [junit] at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> [junit] at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> [junit] at 
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
> [junit] at java.nio.file.Files.newByteChannel(Files.java:361)
> [junit] at java.nio.file.Files.newByteChannel(Files.java:407)
> [junit] at 
> java.nio.file.spi.FileSystemProvider.newInputStream(FileSyste

[jira] [Comment Edited] (CASSANDRA-9738) Migrate key-cache to be fully off-heap

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645807#comment-14645807
 ] 

Benedict edited comment on CASSANDRA-9738 at 7/29/15 2:25 PM:
--

Looks like improved occupancy is also a factor (comparing run 2 to run 3). I'd 
prefer to eliminate the malloc/free from cache maintenance before we hook OHC 
into the key cache, but these numbers are pretty compelling and we should 
probably consider it for 3.0.

I suspect CASSANDRA-8931 will permit us to deliver a more efficient and high 
occupancy key-cache, but that shouldn't prevent us doing something in the 
meantime.


was (Author: benedict):
Looks like improved occupancy is also a factor (comparing run 2 to run 3). I'd 
prefer to eliminate the malloc/free from cache maintenance before we hook OHC 
into the key cache, but these numbers are pretty compelling and we should 
probably consider it for 3.0.

I suspect CASSANDRA-8930 will permit us to deliver a more efficient and high 
occupancy key-cache, but that shouldn't prevent us doing something in the 
meantime.

> Migrate key-cache to be fully off-heap
> --
>
> Key: CASSANDRA-9738
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9738
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Robert Stupp
>Assignee: Robert Stupp
> Fix For: 3.x
>
>
> Key cache still uses a concurrent map on-heap. This could go to off-heap and 
> feels doable now after CASSANDRA-8099.
> Evaluation should be done in advance based on a POC to prove that a pure 
> off-heap key cache buys a performance and/or gc-pressure improvement.
> In theory, elimination of on-heap management of the map should buy us some 
> benefit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9738) Migrate key-cache to be fully off-heap

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646112#comment-14646112
 ] 

Benedict commented on CASSANDRA-9738:
-

Yes. Thanks

> Migrate key-cache to be fully off-heap
> --
>
> Key: CASSANDRA-9738
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9738
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Robert Stupp
>Assignee: Robert Stupp
> Fix For: 3.x
>
>
> Key cache still uses a concurrent map on-heap. This could go to off-heap and 
> feels doable now after CASSANDRA-8099.
> Evaluation should be done in advance based on a POC to prove that a pure 
> off-heap key cache buys a performance and/or gc-pressure improvement.
> In theory, elimination of on-heap management of the map should buy us some 
> benefit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8931) IndexSummary (and Index) should store the token, and the minimal key to unambiguously direct a query

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646156#comment-14646156
 ] 

Benedict commented on CASSANDRA-8931:
-

Right

> IndexSummary (and Index) should store the token, and the minimal key to 
> unambiguously direct a query
> 
>
> Key: CASSANDRA-8931
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8931
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>  Labels: performance
>
> Since these files are likely sticking around a little longer, it is probably 
> worth optimising them. A relatively simple change to Index and IndexSummary 
> could reduce the amount of space required significantly, reduce the CPU 
> burden of lookup, and hopefully bound the amount of space needed as key size 
> grows. On writing first we always store the token before the key (if it is 
> different to the key); then we simply truncate the whole record to the 
> minimum length necessary to answer an inequality search. Since the data file 
> contains the key also, we can corroborate we have the right key once we've 
> looked up. Since BFs are used to reduce unnecessary lookups, we don't save 
> much by ruling the false positives out one step earlier. 
>  An improved follow up version would be to use a trie of shortest length to 
> answer inequality lookups, as this would also ensure very long keys with 
> common prefixes would not significantly increase the size of the index or 
> summary. This would translate to a trie index for the summary keying into a 
> static trie page for the index.
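
A minimal sketch of the truncation idea, for illustration only (the helper name is hypothetical, and real key comparison is unsigned while Java bytes are signed):

{code:java}
import java.util.Arrays;

public class IndexSeparators
{
    // Shortest prefix of 'curr' that still sorts strictly after 'prev'
    // (prev < curr assumed, as index entries are unique and ordered).
    // Storing only this prefix is enough to direct an inequality search;
    // the data file still holds the full key for corroboration.
    static byte[] minimalSeparator(byte[] prev, byte[] curr)
    {
        int i = 0;
        while (i < prev.length && i < curr.length && prev[i] == curr[i])
            i++;
        // keep up to and including the first differing (or extra) byte
        return Arrays.copyOf(curr, Math.min(i + 1, curr.length));
    }
}
{code}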



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9918) Introduce extra Ref usage when debugrefcount=true

2015-07-29 Thread Benedict (JIRA)
Benedict created CASSANDRA-9918:
---

 Summary: Introduce extra Ref usage when debugrefcount=true
 Key: CASSANDRA-9918
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9918
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 3.0.x


To help us catch API misuse during testing, especially around resource handling, we 
can introduce an abstract base class implementing AutoCloseable that under normal 
circumstances is empty, but when ref count debugging is enabled instantiates a 
Ref instance that it releases on close. This will let us detect objects we are 
misusing that we consider too high-traffic for normal systems to have tracked 
by Refs. Ideally CASSANDRA-9379 will be delivered around the same time, so that 
this can be enabled on live systems without _much_ negative impact.
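
By way of illustration, a minimal sketch (assuming the existing {{Ref}} class and its {{release()}} method; the class name and system property here are made up):

{code:java}
// Hypothetical sketch only: empty in normal operation, Ref-tracked when the
// debug property is set, so a forgotten close() surfaces as a leaked Ref.
public abstract class DebugCloseable implements AutoCloseable
{
    private static final boolean DEBUG = Boolean.getBoolean("cassandra.debugrefcount");

    // only allocated when debugging, so high-traffic objects cost nothing otherwise
    private final Ref<DebugCloseable> ref = DEBUG ? new Ref<>(this, null) : null;

    @Override
    public void close()
    {
        if (ref != null)
            ref.release();
    }
}
{code}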



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9919) Cleanup closeable iterator usage

2015-07-29 Thread Benedict (JIRA)
Benedict created CASSANDRA-9919:
---

 Summary: Cleanup closeable iterator usage
 Key: CASSANDRA-9919
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9919
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 3.0.0 rc1


Our iterators are all sensibly AutoCloseable, but we still have a lot of places 
where we close the iterator if we reach the end in next() / hasNext(). This 
behaviour will only mask problems, as any exceptions in normal processing will 
miss these pathways. Since we much more heavily depend on iterators now, we 
need to be certain they are rock solid. So I would prefer we remove our 
crutches and work through any pain earlier.
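
To illustrate the crutch (a toy sketch, not the actual code; {{source}} and {{process}} are placeholders):

{code:java}
import java.util.Iterator;

// The pattern being removed: close() inside hasNext() only fires on clean
// exhaustion, so an exception thrown mid-iteration leaks silently.
class SelfClosingIterator<T> implements Iterator<T>, AutoCloseable
{
    private final Iterator<T> wrapped;
    SelfClosingIterator(Iterator<T> wrapped) { this.wrapped = wrapped; }

    public boolean hasNext()
    {
        if (wrapped.hasNext())
            return true;
        close();               // never reached if next() threw earlier
        return false;
    }

    public T next() { return wrapped.next(); }
    public void close() { /* release underlying resources */ }
}
{code}

With the crutch gone, callers own the lifecycle explicitly:

{code:java}
try (SelfClosingIterator<String> iter = new SelfClosingIterator<>(source))
{
    while (iter.hasNext())
        process(iter.next());
}   // closed on both normal and exceptional exit
{code}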

CASSANDRA-9918 can then help us catch any misuse more easily.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9865) Broken vint encoding, at least when interacting with OHCProvider

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646328#comment-14646328
 ] 

Benedict commented on CASSANDRA-9865:
-

+1

> Broken vint encoding, at least when interacting with OHCProvider
> 
>
> Key: CASSANDRA-9865
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9865
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sylvain Lebresne
>Assignee: Ariel Weisberg
> Fix For: 3.0 alpha 1
>
> Attachments: 9865-hacky-test.txt
>
>
> I haven't investigated this very closely so I only have a slightly hacky way 
> to show the problem, but if you apply the patch attached, you'll see that the 
> vints serialized and the ones deserialized are not the same. If you 
> remove the use of vints (as is currently on trunk, but only due to this issue 
> because we do want to use vints), everything works correctly.
> I'm honestly not sure where the problem is, but it sounds like it could be 
> either in {{NIODataInputStream}} or in the {{OHCProvider}} since it's used on 
> that test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9925) Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder directly

2015-07-29 Thread Benedict (JIRA)
Benedict created CASSANDRA-9925:
---

 Summary: Sorted + Unique BTree.Builder usage can use a 
thread-local TreeBuilder directly
 Key: CASSANDRA-9925
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9925
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 3.0.x


There are now a number of situations where we use a BTree.Builder that could be 
made to go directly to the TreeBuilder, since they only perform in-order unique 
additions to build an initial tree. This would potentially avoid a number of 
array allocations/copies.
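
For illustration, roughly the shape of the change (all names here are approximations of the internal APIs, not the actual signatures):

{code:java}
// Today (sketch): sorted, unique input still passes through the Builder's
// scratch array before a TreeBuilder ever sees it.
static Object[] viaBuilder(Iterable<Clustering> sortedUnique, Comparator<Clustering> cmp)
{
    BTree.Builder<Clustering> builder = BTree.builder(cmp);
    for (Clustering c : sortedUnique)
        builder.add(c);       // buffered, and re-checked for order/duplicates
    return builder.build();
}

// Proposed (sketch): callers that guarantee in-order unique additions skip
// the buffering and feed a per-thread TreeBuilder directly.
static Object[] direct(Iterable<Clustering> sortedUnique)
{
    return TreeBuilder.threadLocal().build(sortedUnique);   // hypothetical fast path
}
{code}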



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9896) Add ability to disable commitlog recycling

2015-07-29 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9896:

Fix Version/s: (was: 2.1.x)
   2.1.9

> Add ability to disable commitlog recycling
> --
>
> Key: CASSANDRA-9896
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9896
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Brandon Williams
>Assignee: Benedict
> Fix For: 2.1.9
>
>
> See CASSANDRA-9533 for background, specifically the graphs I linked.  
> Benedict suggests this is due the commitlog recycling and I agree, so the 
> simplest solution is to be able to disable that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9525) Commitlog allocation failure doesn't stop the entire node.

2015-07-29 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9525:

Fix Version/s: (was: 2.1.x)

> Commitlog allocation failure doesn't stop the entire node.
> --
>
> Key: CASSANDRA-9525
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9525
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Jeremiah Jordan
>
> If there is an error opening a commit log segment, the thread dies, but it 
> doesn't stop the whole node.
> Got the following on a node:
> {noformat}
> ERROR [COMMIT-LOG-ALLOCATOR] 2015-05-29 17:06:45,722  CommitLog.java:397 - 
> Failed managing commit log segments. Commit disk failure policy is stop; 
> terminating thread
> org.apache.cassandra.io.FSWriteError: java.io.FileNotFoundException: 
> /var/lib/cassandra/commitlog/CommitLog-4-1432937194590.log (Permission denied)
>   at 
> org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:177)
>  ~[cassandra-all-2.1.5.jar:2.1.5]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogSegmentManager$4.call(CommitLogSegmentManager.java:397)
>  ~[cassandra-all-2.1.5.jar:2.1.5]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogSegmentManager$4.call(CommitLogSegmentManager.java:394)
>  ~[cassandra-all-2.1.5.jar:2.1.5]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogSegmentManager$1.runMayThrow(CommitLogSegmentManager.java:152)
>  ~[cassandra-all-2.1.5.jar:2.1.5]
>   at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> [cassandra-all-2.1.5.jar:2.1.5]
>   at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
> Caused by: java.io.FileNotFoundException: 
> /var/lib/cassandra/commitlog/CommitLog-4-1432937194590.log (Permission denied)
>   at java.io.RandomAccessFile.open0(Native Method) ~[na:1.8.0_45]
>   at java.io.RandomAccessFile.open(RandomAccessFile.java:316) 
> ~[na:1.8.0_45]
>   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243) 
> ~[na:1.8.0_45]
>   at 
> org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:155)
>  ~[cassandra-all-2.1.5.jar:2.1.5]
>   ... 5 common frames omitted
> {noformat}
> And the node stayed kind of up; nothing seemed wrong until the node 
> died of OOM because some threads were dead and others weren't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9471) Columns should be backed by a BTree, not an array

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646593#comment-14646593
 ] 

Benedict commented on CASSANDRA-9471:
-

Patch available [here|https://github.com/belliottsmith/cassandra/tree/9471]

> Columns should be backed by a BTree, not an array
> -
>
> Key: CASSANDRA-9471
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9471
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
> Fix For: 3.0 beta 1
>
>
> Follow up to 8099. 
> We have pretty terrible lookup performance as the number of columns grows 
> (linear). In at least one location, this results in quadratic performance. 
> We don't however want this structure to be either any more expensive to 
> build, nor to store. Some small modifications to BTree will permit it to 
> serve here, by permitting efficient lookup by index, and calculation _of_ 
> index for a given key.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9586) ant eclipse-warnings fails in trunk

2015-07-29 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9586:

Fix Version/s: (was: 3.x)
   3.0 alpha 1

> ant eclipse-warnings fails in trunk
> ---
>
> Key: CASSANDRA-9586
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9586
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Michael Shuler
>Assignee: Stefania
> Fix For: 3.0 alpha 1
>
>
> {noformat}
> eclipse-warnings:
> [mkdir] Created dir: /home/mshuler/git/cassandra/build/ecj
>  [echo] Running Eclipse Code Analysis.  Output logged to 
> /home/mshuler/git/cassandra/build/ecj/eclipse_compiler_checks.txt
>  [java] incorrect classpath: 
> /home/mshuler/git/cassandra/build/cobertura/classes
>  [java] --
>  [java] 1. ERROR in 
> /home/mshuler/git/cassandra/src/java/org/apache/cassandra/io/util/RandomAccessReader.java
>  (at line 81)
>  [java] super(new ChannelProxy(file), DEFAULT_BUFFER_SIZE, -1L, 
> BufferType.OFF_HEAP);
>  [java]   ^^
>  [java] Potential resource leak: '<unassigned Closeable value>' may not 
> be closed
>  [java] --
>  [java] 1 problem (1 error)
> BUILD FAILED
> {noformat}
> (checked 2.2 and did not find this issue)
> git blame on line 81 shows commit 17dd4cc for CASSANDRA-8897



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9307) test-long contains tests that are not appropriate for running on every commit

2015-07-29 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9307:

Fix Version/s: (was: 3.x)
   3.0 alpha 1

> test-long contains tests that are not appropriate for running on every commit
> -
>
> Key: CASSANDRA-9307
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9307
> Project: Cassandra
>  Issue Type: Test
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
> Fix For: 3.0 alpha 1
>
>
> After discussion with Benedict it seems like some of the tests in test-long 
> are not long unit tests they are burn in tests for components that have not 
> changed since the tests were introduced. I propose creating a new test-target 
> called test-burn that isn't part of test-all, commenting test-all to note the 
> existence of test-burn, and then moving burn in tests to this target.
> I'll create a separate JIRA for making sure that test-burn is run regularly 
> on hardware appropriate for those tests just to prevent bit rot of the test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9926) Avoid AbstractRow.isEmpty() where possible

2015-07-29 Thread Benedict (JIRA)
Benedict created CASSANDRA-9926:
---

 Summary: Avoid AbstractRow.isEmpty() where possible
 Key: CASSANDRA-9926
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9926
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benedict
Assignee: Benedict
Priority: Minor


This method performs not-insignificant duplicate work, so we should both 
implement it more efficiently where possible, and avoid using it if we can



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9894) Serialize the header only once per message

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646840#comment-14646840
 ] 

Benedict edited comment on CASSANDRA-9894 at 7/29/15 10:27 PM:
---

I've pushed an initial version 
[here|https://github.com/belliottsmith/cassandra/tree/9894]. This is based on 
my patch for CASSANDRA-9471.

I tried starting from Sylvain's patch, and then starting from scratch, and 
ultimately I didn't like where either led. So this attacks the problem a 
little differently: it uses the column filter sent to the replica to help 
encode the response, knowing that the response columns must be a subset. With a 
normal number of columns this translates to a presence bitmap (otherwise it is 
a sequence of ints either adding or removing from the set, but these codepaths 
should rarely be taken). If the columns are identical, a single 0 byte is sent 
for all the columns.
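
As a toy illustration of the common case (the real encoding differs in detail; this assumes at most 64 columns and marks _missing_ columns, i.e. the inverted form mentioned below, so that identical sets encode as 0):

{code:java}
import java.util.List;
import java.util.Set;

public class SubsetEncoding
{
    // Both sides already know the superset (the column filter sent with the
    // query), so the response only needs membership: one bit per column.
    static long encodeSubset(List<String> superset, Set<String> subset)
    {
        long missing = 0;
        for (int i = 0; i < superset.size(); i++)
            if (!subset.contains(superset.get(i)))
                missing |= 1L << i;
        return missing;   // 0 == "identical": a single byte once vint encoded
    }
}
{code}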

This permits us to save work when serializing even single partitions, and also 
permits us to encode per-partition encoding stats, so that our timestamps can 
most likely be more efficiently encoded. It also touches far less code.

I am not 100% certain I haven't broken things, as dtests are a little tricky to 
read right now, but nothing jumps out at me. I still need to introduce some 
unit tests, and also want to invert the bitmap to make it more efficiently vint 
encoded. But the patch is generally ready for a first round of review, as it 
will change only minimally.


was (Author: benedict):
I've pushed an initial version 
[here|https://github.com/belliottsmith/cassandra/tree/9894]. This is based on 
my patch for CASSANDRA-9471.

I tried starting from Sylvain's patch, and then starting from scratch, and 
ultimately I didn't like where either led. So this attacks the problem a 
little differently: it uses the column filter sent to the coordinator for a 
query to encode the response, knowing that the columns must be a subset. With a 
normal number of columns this translates to a bitmap of presence in the 
response for each column in the request (otherwise it is a sequence of vint 
encoded ints, but these codepaths should rarely be taken), and if the columns 
are identical (what we should expect), a single 0 byte is sent for all the 
columns.

This permits us to save work when serializing even single partitions, and also 
permits us to encode per-partition encoding stats, so that our timestamps can 
most likely be more efficiently encoded. It also touches far less code.

I am not 100% certain I haven't broken things, as dtests are a little tricky to 
read right now, but nothing jumps out at me. I still need to introduce some 
unit tests, and also want to invert the bitmap to make it more efficiently vint 
encoded. But the patch is generally ready for a first round of review, as it 
will change only minimally.

> Serialize the header only once per message
> --
>
> Key: CASSANDRA-9894
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9894
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Sylvain Lebresne
>Assignee: Benedict
> Fix For: 3.0 beta 1
>
>
> One last improvement I'd like to do on the serialization side is that we 
> currently serialize the {{SerializationHeader}} for each partition. That 
> header contains the serialized columns in particular and for range queries, 
> serializing that for every partition is wasted (note that it's only a problem 
> for the messaging protocol as for sstable we only write the header once per 
> sstable).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9932) Make all partitions btree backed

2015-07-29 Thread Benedict (JIRA)
Benedict created CASSANDRA-9932:
---

 Summary: Make all partitions btree backed
 Key: CASSANDRA-9932
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9932
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 3.0.0 rc1


Following on from the other btree related refactors, this patch makes all 
partition (and partition-like) objects backed by the same basic structure: 
{{AbstractBTreePartition}}. With two main offshoots: 
{{ImmutableBTreePartition}} and {{AtomicBTreePartition}}

The main upshot is a 30% net code reduction, meaning better exercise of btree 
code paths and fewer new code paths to go wrong. A secondary upshot is that, 
by funnelling all our comparisons through a btree, there is a higher likelihood 
of icache occupancy and we have only one area to focus delivery of improvements 
for their enjoyment by all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9932) Make all partitions btree backed

2015-07-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646885#comment-14646885
 ] 

Benedict commented on CASSANDRA-9932:
-

An in progress patch is available 
[here|https://github.com/belliottsmith/cassandra/tree/btreepartition]. It is 
close to review-ready, but not quite.

> Make all partitions btree backed
> 
>
> Key: CASSANDRA-9932
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9932
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
> Fix For: 3.0.0 rc1
>
>
> Following on from the other btree related refactors, this patch makes all 
> partition (and partition-like) objects backed by the same basic structure: 
> {{AbstractBTreePartition}}. With two main offshoots: 
> {{ImmutableBTreePartition}} and {{AtomicBTreePartition}}
> The main upshot is a 30% net code reduction, meaning better exercise of btree 
> code paths and fewer new code paths to go wrong. A secondary upshot is that, 
> by funnelling all our comparisons through a btree, there is a higher 
> likelihood of icache occupancy and we have only one area to focus delivery of 
> improvements for their enjoyment by all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CASSANDRA-9888) BTreeBackedRow and ComplexColumnData

2015-07-30 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict resolved CASSANDRA-9888.
-
Resolution: Fixed

Thanks. I've ninja fixed.

(Although I'm having weird problems running this test locally, it's an obvious 
fix from the stack trace)

> BTreeBackedRow and ComplexColumnData
> 
>
> Key: CASSANDRA-9888
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9888
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
> Fix For: 3.0 alpha 1
>
>
> I found ArrayBackedRow a little hard to follow, especially around building, 
> so I've converted it to BTreeBackedRow, along with ComplexColumnData. Both 
> now rely on BTree.Builder, which introduces a little extra functionality to 
> permit these classes to be implemented more declaratively. Transformations of 
> these classes are also now uniform and more declarative, also depending on 
> some new functionality in BTree that permits applying a 
> transformation/filtration to an existing btree (this could be optimised at a 
> later date, but should suffice for now).
> The result is IMO both clearer and should scale more gracefully to larger 
> numbers of columns and complex cells.
> This hasn't taken all of the possible improvements of the back of this change 
> to their natural conclusion, as we are somewhat time pressed and I would 
> prefer to get the ball rolling with this first round.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9908) Potential race caused by async cleanup of transaction log files

2015-07-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647351#comment-14647351
 ] 

Benedict commented on CASSANDRA-9908:
-

Take a look in {{Keyspace.dropCf}} for another example. In that case it is done 
to ensure there are no writes to the commit log for the table after it's 
dropped. In this case that's unimportant, but we _do_ want to be certain that 
when we call {{waitForDeletions}} we can be confident the deletions have all 
actually happened, and that the log has been cleaned up. This means waiting 
until nobody is using any of the sstables.

Really, though, we should wait for any writes to complete, then clean our 
memtable and perform a blocking flush (which waits for any in-progress flushes 
to complete), then wait for any reads to complete, and _then_ drop our tables 
and wait for deletions. This should all happen after the index is no longer 
reachable by any new operations (see the sketch below).
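
For concreteness, roughly that sequence (method names are hypothetical, apart from those already discussed here):

{code:java}
// Sketch only; precondition: the index is no longer reachable by new operations.
void tearDownIndexCfs(ColumnFamilyStore cfs) throws InterruptedException
{
    waitForInFlightWrites(cfs);   // 1. no more writes can arrive
    cfs.forceBlockingFlush();     // 2. also waits for in-progress flushes
    waitForInFlightReads(cfs);    // 3. nobody still holds sstable references
    cfs.dropSSTables();           // 4. drop, then...
    waitForDeletions();           // 5. ...block until the txn log is cleaned up
}
{code}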

Waiting for deletions should be fine to put in {{dropSSTables}}, however I 
think it might be better put in {{invalidate}}, since there's no need to wait 
for them during truncation, for instance.


> Potential race caused by async cleanup of transaction log files
> ---
>
> Key: CASSANDRA-9908
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9908
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sam Tunnicliffe
>Assignee: Stefania
> Fix For: 3.0 beta 1
>
> Attachments: TEST-org.apache.cassandra.db.SecondaryIndexTest.log
>
>
> There seems to be a potential race in the cleanup of transaction log files, 
> introduced in CASSANDRA-7066
> It's pretty hard to trigger on trunk, but it's possible to hit it via 
> {{o.a.c.db.SecondaryIndexTest#testCreateIndex}} 
> That test creates an index, then removes it to check that the removal is 
> correctly recorded, then adds the index again to assert that it gets rebuilt 
> from the existing data. 
> The removal causes the SSTables of the index CFS to be dropped, which is a 
> transactional operation and so writes a transaction log. When the drop is 
> completed and the last reference to an SSTable is released, the cleanup of 
> the transaction log is scheduled on the periodic tasks executor. The issue is 
> that re-creating the index re-creates the index CFS. When this happens, it's 
> possible for the cleanup of the txn log to have not yet happened. If so, the 
> initialization of the CFS attempts to read the log to identify any orphaned 
> temporary files. The cleanup can happen between finding the log file and 
> reading its contents, which results in a {{NoSuchFileException}}
> {noformat}
> [junit] java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] java.lang.RuntimeException: java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] at 
> org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:620)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionFile.getTrackedFiles(TransactionLogs.java:190)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionData.getTemporaryFiles(TransactionLogs.java:338)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs.getTemporaryFiles(TransactionLogs.java:739)
> [junit] at 
> org.apache.cassandra.db.lifecycle.LifecycleTransaction.getTemporaryFiles(LifecycleTransaction.java:541)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.getFilter(Directories.java:652)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:641)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.list(Directories.java:606)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:351)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:313)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:511)
> [junit] at 
> org.apache.cassandra.index.internal.CassandraIndexer.addIndexedColumn(CassandraIndexer.java:115)
> [junit] at 
> org.apache.cassandra.index.SecondaryIndexManager.addIndexedColumn(SecondaryIndexManager.java:265)
> [junit] at 
> org.apache.cassandra.db.SecondaryIndexTest.testIndexCreate(SecondaryIndexTest.java:467)
> [junit] Caused by: java.nio.file.NoSuchFileException: 
> build/test/cassa

[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647469#comment-14647469
 ] 

Benedict commented on CASSANDRA-7066:
-

bq. not sure what a panic error is exactly?

It's just hyperbolic language to mean a loud-looking error. Like, perhaps with 
some capital letters in there and some panic-stricken language, so that 
operators might take notice.

bq. Here we are talking about old files only, right? 

Right, exactly. New files should never be a problem for backups or anything of 
that ilk, because they cannot be backed up until after we've finished our 
lifecycle transaction and written commit.



> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to cleanup incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily cleanup completed files, or conversely not cleanup files 
> that have been superseded); and 2) it's only used for a regular compaction - 
> no other compaction types are guarded in the same way, so can result in 
> duplication if we fail before deleting the replacements.
> I'd like to see each sstable store in its metadata its direct ancestors, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way as soon as we finish writing we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9894) Serialize the header only once per message

2015-07-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648324#comment-14648324
 ] 

Benedict commented on CASSANDRA-9894:
-

bq. Looks good to me conditional on a unit test for Columns.

bq. , but in the common case will there be enough columns to benefit?

Those are both comments for a different ticket, CASSANDRA-9471. This is built 
upon that simply because it made the patch a little easier. I will respond to 
these comments there.

bq. In Columns when you serialize -1 for the encoding/number of columns can you 
add a comment that using the unsigned varint is intentional?

The reason is we cannot use any other encoding: the other two branches use 
it, so we _have_ to in order for deserialization to know what's going on. 
I'll add a comment to that effect, but I felt this was conveyed by the code.


> Serialize the header only once per message
> --
>
> Key: CASSANDRA-9894
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9894
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Sylvain Lebresne
>Assignee: Benedict
> Fix For: 3.0 beta 1
>
>
> One last improvement I'd like to do on the serialization side is that we 
> currently serialize the {{SerializationHeader}} for each partition. That 
> header contains the serialized columns in particular and for range queries, 
> serializing that for every partition is wasted (note that it's only a problem 
> for the messaging protocol as for sstable we only write the header once per 
> sstable).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9471) Columns should be backed by a BTree, not an array

2015-07-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648337#comment-14648337
 ] 

Benedict commented on CASSANDRA-9471:
-

[~aweisberg] said:
{quote}
It would be nice to see a microbenchmark and benchmark demonstrating that the 
use of BTree is a win. I am wondering how many additional instructions and 
branches are hidden inside BTree that you pay for in the common case of small 
Columns set. A BTree is more instructions for more cache locality, but in the 
common case will there be enough columns to benefit?
{quote}

If we want to go down that avenue, we should IMO simply look into specialising 
the btree iterator for the situation where we have just one leaf, and this is 
something I've considered. That's pretty much the only likely win. We don't 
really have time to do the kind of microbenchmarking you're talking about 
though, and it is of limited use anyway, since one of the main benefits 
_performancewise_ (in the common case) is improved icache\* occupancy, which 
cannot be accounted for in a microbenchmark. In the uncommon case this also 
gives us better lower bounds on behaviour, and access to a more powerfully 
expressive toolkit. For instance, it permits us to trivially deliver a more 
efficient multi-way merge (which we do very often), and that can be improved 
further still with less trivial enhancements.

Mostly, though, this patch simply guarantees us a good lower bound on 
performance, with improved code clarity. Right now I would very much expect 
performance in the common case to be a wash, especially on any of our existing 
workloads.

> Columns should be backed by a BTree, not an array
> -
>
> Key: CASSANDRA-9471
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9471
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
> Fix For: 3.0 beta 1
>
>
> Follow up to 8099. 
> We have pretty terrible lookup performance as the number of columns grows 
> (linear). In at least one location, this results in quadratic performance. 
> We don't however want this structure to be either any more expensive to 
> build, nor to store. Some small modifications to BTree will permit it to 
> serve here, by permitting efficient lookup by index, and calculation _of_ 
> index for a given key.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9894) Serialize the header only once per message

2015-07-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648349#comment-14648349
 ] 

Benedict commented on CASSANDRA-9894:
-

bq. I thought the reason you did it was to have the most compact encoding for 
the positive numbers which are going to be more common.

I think those two statements add up to the same thing :)

I'll add a comment explaining the full thought process, as it's actually a 
little more subtle than that. The encoding is designed to tend towards a single 
byte in the most common cases and, most importantly, to guarantee a single byte 
in the single most common case (equality).

> Serialize the header only once per message
> --
>
> Key: CASSANDRA-9894
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9894
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Sylvain Lebresne
>Assignee: Benedict
> Fix For: 3.0 beta 1
>
>
> One last improvement I'd like to do on the serialization side is that we 
> currently serialize the {{SerializationHeader}} for each partition. That 
> header contains the serialized columns in particular and for range queries, 
> serializing that for every partition is wasted (note that it's only a problem 
> for the messaging protocol as for sstable we only write the header once per 
> sstable).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9471) Columns should be backed by a BTree, not an array

2015-07-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648337#comment-14648337
 ] 

Benedict edited comment on CASSANDRA-9471 at 7/30/15 9:39 PM:
--

[~aweisberg] said:
{quote}
It would be nice to see a microbenchmark and benchmark demonstrating that the 
use of BTree is a win. I am wondering how many additional instructions and 
branches are hidden inside BTree that you pay for in the common case of small 
Columns set. A BTree is more instructions for more cache locality, but in the 
common case will there be enough columns to benefit?
{quote}

If we want to go down that avenue, we should IMO simply look into specialising 
the btree iterator for the situation where we have just one leaf, and this is 
something I've considered. That's pretty much the only likely win. We don't 
really have time to do the kind of microbenchmarking you're talking about 
though, and it is of limited use anyway, since one of the main benefits 
_performancewise_ (in the common case) is improved icache\* occupancy, which 
cannot be accounted for in a microbenchmark. In the uncommon case this also 
gives us better lower bounds on behaviour, and access to a more powerfully 
expressive toolkit. For instance, it permits us to trivially deliver a more 
efficient multi-way merge (which we do very often), and that can be improved 
further still with less trivial enhancements. Other future improvements that 
can be deployed widely are efficient intersection for iteration of subsets, 
which again we do very often.

Mostly, though, for now this patch simply guarantees us a good lower bound on 
performance, with improved code clarity. Right now I would very much expect 
performance in the common case to be a wash, especially on any of our existing 
workloads.


was (Author: benedict):
[~aweisberg] said:
{quote}
It would be nice to see a microbenchmark and benchmark demonstrating that the 
use of BTree is a win. I am wondering how many additional instructions and 
branches are hidden inside BTree that you pay for in the common case of small 
Columns set. A BTree is more instructions for more cache locality, but in the 
common case will there be enough columns to benefit?
{quote}

If we want to go down that avenue, we should IMO simply look into specialising 
the btree iterator for the situation where we have just one leaf, and this is 
something I've considered. That's pretty much the only likely win. We don't 
really have time to do the kind of microbenchmarking you're talking about 
though, and it is of limited use anyway, since one of the main benefits 
_performancewise_ (in the common case) is improved icache\* occupancy, which 
cannot be accounted for in a microbenchmark. In the uncommon case this also 
gives us better lower bounds on behaviour, and access to a more powerfully 
expressive toolkit. For instance, it permits us to trivially deliver a more 
efficient multi-way merge (which we do very often), and that can be improved 
further still with less trivial enhancements.

Mostly, though, this patch simply guarantees us a good lower bound on 
performance, with improved code clarity. Right now I would very much expect 
performance in the common case to be a wash, especially on any of our existing 
workloads.

> Columns should be backed by a BTree, not an array
> -
>
> Key: CASSANDRA-9471
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9471
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
> Fix For: 3.0 beta 1
>
>
> Follow up to 8099. 
> We have pretty terrible lookup performance as the number of columns grows 
> (linear). In at least one location, this results in quadratic performance. 
> We don't however want this structure to be either any more expensive to 
> build, nor to store. Some small modifications to BTree will permit it to 
> serve here, by permitting efficient lookup by index, and calculation _of_ 
> index for a given key.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648509#comment-14648509
 ] 

Benedict commented on CASSANDRA-7066:
-

[~rcoli]: while this patch does not remove the ancestor maintenance, it does 
make it completely irrelevant, since it is no longer used for any behaviour. 
However your concern highlights a much scarier prospect: the replay positions 
stored within the sstables could be at an arbitrary position, potentially in 
the future of the node's commit log. So they could break commit log replay 
during startup. In fact they probably have a high chance of doing so, although 
CASSANDRA-9669 may mitigate that.

We should actually remove this data from {{CompactionMetadata}}, and mention 
this in the NEWS.txt as well.

bq. Is it possible to have the refresh command and possibly node startup fail 
and/or throw an error if this step doesn't happen?

There's unfortunately no good way for the node to know that it should not 
replay these files, since the whole point of their existence is that they 
should be replayed. The changes we've just discussed will make us _somewhat_ 
robust to any problems that might arise from operators doing this, and the 
likelihood of copying files covered by such a file is fairly low anyway. 
[~stefania]: perhaps we should offer another command line tool that just 
performs the cleanup (i.e. either rolls back or finishes commit, whichever is 
necessary) that an operator can run to clear the way, and we can make this 
explicit in NEWS.txt

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to cleanup incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily cleanup completed files, or conversely not cleanup files 
> that have been superseded); and 2) it's only used for a regular compaction - 
> no other compaction types are guarded in the same way, so can result in 
> duplication if we fail before deleting the replacements.
> I'd like to see each sstable store in its metadata its direct ancestors, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way as soon as we finish writing we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-07-31 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649193#comment-14649193
 ] 

Benedict commented on CASSANDRA-7066:
-

Thanks. Looks like great work as always. 

A few comments (I haven't done a full review):

* The last modified sort order was necessary for the same reason a full 
checksum of all of the old files is too heavy-handed: restarting in the middle 
of deleting the files. We want to finish deleting them, and if we take the max 
timestamp we can say that will safely stay the same. We could keep the full 
checksum and include a count; if the count on disk is lower, and our max 
timestamp matches, we can proceed; if the count is the same we require a 
checksum match (see the sketch after this list). This may or may not be 
overkill, but there's no harm in paranoia here.
* It looks like serialization of the sstables now assumes the ancestors won't 
be present, but this will break upgrade. We need to check the sstable version, 
and at least skip the ancestors if they're present
* The compaction metadata equality is probably not sensibly employed anywhere; 
we should just confirm this and remove it IMO since it makes very little sense, 
especially now that the ancestors have been removed
* I think you took my comments about panic warnings a little too literally. We 
should just use strong language to make it clear to the operator we've detected 
disk corruption of some kind, and have taken the most pessimistic course of 
action
** This warning should be less severe if only the last line is _incomplete_ 
(and all other lines are parsed correctly), as this would be expected if we 
crashed part way through serialization
** We should perhaps not worry at all, and continue if this last situation 
occurs, in the event that _all_ files logged are still present on the file 
system, and our other metadata all matches for them. In this case we can be 
very confident we just shutdown in the middle of writing a record, and can just 
clean up all of the new files. This should make us robust to the scary 
situation while also minimising the pessimism, hopefully giving us the best of 
both worlds.
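
For the first bullet, something like this (names hypothetical, semantics simplified):

{code:java}
import java.io.File;
import java.util.List;

public class LeftoverCheck
{
    // The log records {count, checksum, max last-modified} of the old files.
    // If we crashed mid-deletion the count shrinks but timestamps don't move.
    static boolean safeToFinishDeleting(int loggedCount, long loggedChecksum,
                                        long loggedMaxTimestamp,
                                        List<File> remaining, long remainingChecksum)
    {
        long maxTs = 0;
        for (File f : remaining)
            maxTs = Math.max(maxTs, f.lastModified());
        if (maxTs > loggedMaxTimestamp)
            return false;                  // a file changed after logging: be pessimistic
        if (remaining.size() < loggedCount)
            return true;                   // interrupted mid-deletion: finish the job
        // full set still present: demand an exact checksum match
        return remaining.size() == loggedCount && remainingChecksum == loggedChecksum;
    }
}
{code}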

WDYT?

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to cleanup incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily cleanup completed files, or conversely not cleanup files 
> that have been superseded); and 2) it's only used for a regular compaction - 
> no other compaction types are guarded in the same way, so can result in 
> duplication if we fail before deleting the replacements.
> I'd like to see each sstable store in its metadata its direct ancestors, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way as soon as we finish writing we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9908) Potential race caused by async cleanup of transaction log files

2015-08-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651564#comment-14651564
 ] 

Benedict commented on CASSANDRA-9908:
-

Yep, that all looks good.

bq.  Isn't this already guaranteed by SecondaryIndexManager.removeIndexedColumn?

Yes, it was just a general statement, without having yet dived into this piece 
of code sufficiently, that we needed to ensure this.

(sorry for the delay, I thought I'd already responded to this)

> Potential race caused by async cleanup of transaction log files
> ---
>
> Key: CASSANDRA-9908
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9908
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sam Tunnicliffe
>Assignee: Stefania
>  Labels: benedict-to-commit
> Fix For: 3.0 beta 1
>
> Attachments: TEST-org.apache.cassandra.db.SecondaryIndexTest.log
>
>
> There seems to be a potential race in the cleanup of transaction log files, 
> introduced in CASSANDRA-7066
> It's pretty hard to trigger on trunk, but it's possible to hit it via 
> {{o.a.c.db.SecondaryIndexTest#testCreateIndex}} 
> That test creates an index, then removes it to check that the removal is 
> correctly recorded, then adds the index again to assert that it gets rebuilt 
> from the existing data. 
> The removal causes the SSTables of the index CFS to be dropped, which is a 
> transactional operation and so writes a transaction log. When the drop is 
> completed and the last reference to an SSTable is released, the cleanup of 
> the transaction log is scheduled on the periodic tasks executor. The issue is 
> that re-creating the index re-creates the index CFS. When this happens, it's 
> possible for the cleanup of the txn log to have not yet happened. If so, the 
> initialization of the CFS attempts to read the log to identify any orphaned 
> temporary files. The cleanup can happen between finding the log file and 
> reading its contents, which results in a {{NoSuchFileException}}
> {noformat}
> [junit] java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] java.lang.RuntimeException: java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] at 
> org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:620)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionFile.getTrackedFiles(TransactionLogs.java:190)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionData.getTemporaryFiles(TransactionLogs.java:338)
> [junit] at 
> org.apache.cassandra.db.lifecycle.TransactionLogs.getTemporaryFiles(TransactionLogs.java:739)
> [junit] at 
> org.apache.cassandra.db.lifecycle.LifecycleTransaction.getTemporaryFiles(LifecycleTransaction.java:541)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.getFilter(Directories.java:652)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:641)
> [junit] at 
> org.apache.cassandra.db.Directories$SSTableLister.list(Directories.java:606)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:351)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:313)
> [junit] at 
> org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:511)
> [junit] at 
> org.apache.cassandra.index.internal.CassandraIndexer.addIndexedColumn(CassandraIndexer.java:115)
> [junit] at 
> org.apache.cassandra.index.SecondaryIndexManager.addIndexedColumn(SecondaryIndexManager.java:265)
> [junit] at 
> org.apache.cassandra.db.SecondaryIndexTest.testIndexCreate(SecondaryIndexTest.java:467)
> [junit] Caused by: java.nio.file.NoSuchFileException: 
> build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> [junit] at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> [junit] at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> [junit] at 
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
> [junit] at java.nio.file.Files.newByteChannel(Files.java:361)
> [junit] at 

[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers

2015-08-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651571#comment-14651571
 ] 

Benedict commented on CASSANDRA-7066:
-

bq. Will this cause problems if the alpha goes out before this ticket is 
committed?

We should be fine, since we're changing on-wire serialization between alpha and 
beta also, so I guess we're OK with some upgrade problems between the two (we 
don't really want or expect anyone to use alpha except for dev, I would guess). 
[~jbellis]?

bq. However all classes implementing MetadataComponent implement them so I left 
them with a comment.  Also, without it it seems a scrub test was failing 
(unsure if related though).

I guess it's possible we compare metadata at the end of a scrub to confirm it's 
equivalent. It's somewhat ugly that we will treat all compaction metadata as 
equal, but probably not worth worrying about here.

Otherwise SGTM, I'll review it shortly.

> Simplify (and unify) cleanup of compaction leftovers
> 
>
> Key: CASSANDRA-7066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Stefania
>Priority: Minor
>  Labels: benedict-to-commit, compaction
> Fix For: 3.0 alpha 1
>
> Attachments: 7066.txt
>
>
> Currently we manage a list of in-progress compactions in a system table, 
> which we use to cleanup incomplete compactions when we're done. The problem 
> with this is that 1) it's a bit clunky (and leaves us in positions where we 
> can unnecessarily cleanup completed files, or conversely not cleanup files 
> that have been superseded); and 2) it's only used for a regular compaction - 
> no other compaction types are guarded in the same way, so can result in 
> duplication if we fail before deleting the replacements.
> I'd like to see each sstable store in its metadata its direct ancestors, and 
> on startup we simply delete any sstables that occur in the union of all 
> ancestor sets. This way as soon as we finish writing we're capable of 
> cleaning up any leftovers, so we never get duplication. It's also much easier 
> to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9471) Columns should be backed by a BTree, not an array

2015-08-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651724#comment-14651724
 ] 

Benedict commented on CASSANDRA-9471:
-

I've rebased this and introduced unit tests to cover the majority of 
functionality, all of which was already present but not covered.

> Columns should be backed by a BTree, not an array
> -
>
> Key: CASSANDRA-9471
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9471
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
> Fix For: 3.0 beta 1
>
>
> Follow up to 8099. 
> We have pretty terrible lookup performance as the number of columns grows 
> (linear). In at least one location, this results in quadratic performance. 
> We don't however want this structure to be either any more expensive to 
> build, nor to store. Some small modifications to BTree will permit it to 
> serve here, by permitting efficient lookup by index, and calculation _of_ 
> index for a given key.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9894) Serialize the header only once per message

2015-08-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651953#comment-14651953
 ] 

Benedict commented on CASSANDRA-9894:
-

I've rebased against CASSANDRA-9471, to which I've added a new ColumnsTest. 
I've then added to this tests for the existing serialization and the new 
serialization introduced by this patch. I've also introduce some brief comments 
in {{serializeSubset()}} explaining at a high level how the encoding works, and 
why.

> Serialize the header only once per message
> --
>
> Key: CASSANDRA-9894
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9894
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Sylvain Lebresne
>Assignee: Benedict
> Fix For: 3.0 beta 1
>
>
> One last improvement I'd like to do on the serialization side is that we 
> currently serialize the {{SerializationHeader}} for each partition. That 
> header contains the serialized columns in particular and for range queries, 
> serializing that for every partition is wasted (note that it's only a problem 
> for the messaging protocol as for sstable we only write the header once per 
> sstable).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9962) WaitQueueTest is flakey

2015-08-03 Thread Benedict (JIRA)
Benedict created CASSANDRA-9962:
---

 Summary: WaitQueueTest is flakey
 Key: CASSANDRA-9962
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9962
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 3.x


While the test is a little noddy, and superfluous, it shouldn't fail even 
vanishingly infrequently. [~aweisberg] has spotted it doing so, and I have also 
encountered it once, so I suspect that a change in hardware/OS may have turned 
a vanishingly unlikely failure into just a pretty unlikely one, which is even less good. 

Right now it depends on {{Thread.start()}} completing before the new thread 
starts; this isn't guaranteed. This patch fixes that.
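
The fix amounts to an explicit handshake rather than an assumption, e.g. (illustrative only, not the actual test code):

{code:java}
import java.util.concurrent.CountDownLatch;

public class StartedThreads
{
    // Returns only once the new thread is actually running its body,
    // rather than assuming that start() returning implies it.
    static Thread startAndAwait(Runnable work) throws InterruptedException
    {
        CountDownLatch running = new CountDownLatch(1);
        Thread t = new Thread(() -> { running.countDown(); work.run(); });
        t.start();
        running.await();
        return t;
    }
}
{code}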



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9962) WaitQueueTest is flakey

2015-08-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653118#comment-14653118
 ] 

Benedict commented on CASSANDRA-9962:
-

Patch available [here|https://github.com/belliottsmith/cassandra/tree/9962]

> WaitQueueTest is flakey
> ---
>
> Key: CASSANDRA-9962
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9962
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Minor
> Fix For: 3.x
>
>
> While the test is a little noddy, and superfluous, it shouldn't fail even 
> vanishingly infrequently. [~aweisberg] has spotted it doing so, and I have 
> also encountered it once, so I suspect that a change in hardware/OS may have 
> turned a vanishingly unlikely failure into just a pretty unlikely one, which is even less good. 
> Right now it depends on {{Thread.start()}} completing before the new thread 
> starts; this isn't guaranteed. This patch fixes that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9974) Improve debuggability of 3.0

2015-08-03 Thread Benedict (JIRA)
Benedict created CASSANDRA-9974:
---

 Summary: Improve debuggability of 3.0
 Key: CASSANDRA-9974
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9974
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
 Fix For: 3.0.0 rc1


While 8099 has brought a number of improvements, currently it is making 
debugging a bit of a nightmare (for me at least). This slows down development 
and test resolution, and so we should fix it sooner rather than later. This ticket is 
intended to aggregate tickets that will improve this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9975) Flatten RowIterator call hierarchy with a shared RowTransformer

2015-08-03 Thread Benedict (JIRA)
Benedict created CASSANDRA-9975:
---

 Summary: Flatten RowIterator call hierarchy with a shared 
RowTransformer
 Key: CASSANDRA-9975
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9975
 Project: Cassandra
  Issue Type: Sub-task
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 3.0.0 rc1


Stepping through a read response is made exceedingly difficult by the sheer 
depth of the call hierarchy, and how rapidly your context jumps around. This 
ticket intends to partially address that, by flattening one of the main causes 
of it: iterator transformations.

I have a patch that attempts to mitigate (but not entirely eliminate) this, 
through the introduction of a {{RowTransformer}} class that all transformations 
are applied through. If a transformation has already been applied, the 
{{RowTransformer}} class does not wrap a new iterator, but instead returns a 
new {{RowTransformer}} that wraps the original underlying (untransformed) 
iterator and both transformations. This can accumulate an arbitrary number of 
transformations and, quite importantly, can apply the filtration step 
{{Unfiltered -> Row}} in the same instance as well, the intention being that 
the majority of control flow happens inside this {{RowTransformer}}, so there 
is far less context jumping to cope with.
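
For a flavour of the approach, a minimal sketch of the accumulation idea; the 
names here ({{TransformingIterator}}, {{andThen}}) are illustrative only, not 
the actual patch API:

{code}
// Illustrative sketch only - not the actual patch. The point: record each
// transformation in the same iterator rather than wrapping a new one, so the
// whole pipeline executes in a single stack frame.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

final class TransformingIterator<T> implements Iterator<T>
{
    private final Iterator<T> source;                 // untransformed input
    private final List<UnaryOperator<T>> transformations = new ArrayList<>();

    TransformingIterator(Iterator<T> source) { this.source = source; }

    // Instead of returning a new wrapping iterator, accumulate the function
    // and return this same instance, keeping the call hierarchy flat.
    TransformingIterator<T> andThen(UnaryOperator<T> f)
    {
        transformations.add(f);
        return this;
    }

    public boolean hasNext() { return source.hasNext(); }

    public T next()
    {
        T value = source.next();
        for (UnaryOperator<T> f : transformations)    // all applied in one frame
            value = f.apply(value);
        return value;
    }
}
{code}

The real patch additionally folds the {{Unfiltered -> Row}} filtration into the 
same instance, but the control-flow shape is the same.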



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9976) Flatten PartitionIterator call hierarchy with PartitionTransformer

2015-08-03 Thread Benedict (JIRA)
Benedict created CASSANDRA-9976:
---

 Summary: Flatten PartitionIterator call hierarchy with 
PartitionTransformer
 Key: CASSANDRA-9976
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9976
 Project: Cassandra
  Issue Type: Sub-task
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 3.0.0 rc1


Much like CASSANDRA-9975, except for partitions. This is less involved, and 
probably less essential, but still helpful. It has been split out to keep the 
changes iterative and piecemeal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9932) Make all partitions btree backed

2015-08-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653186#comment-14653186
 ] 

Benedict commented on CASSANDRA-9932:
-

Patch available [here|https://github.com/belliottsmith/cassandra/tree/9932]

> Make all partitions btree backed
> 
>
> Key: CASSANDRA-9932
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9932
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
> Fix For: 3.0.0 rc1
>
>
> Following on from the other btree related refactors, this patch makes all 
> partition (and partition-like) objects backed by the same basic structure: 
> {{AbstractBTreePartition}}. With two main offshoots: 
> {{ImmutableBTreePartition}} and {{AtomicBTreePartition}}
> The main upshot is a 30% net code reduction, meaning better exercise of btree 
> code paths and fewer new code paths to go wrong. A secondary upshot is that, 
> by funnelling all our comparisons through a btree, there is a higher 
> likelihood of icache occupancy, and we have only one area on which to focus 
> delivery of improvements for their enjoyment by all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9974) Improve debuggability

2015-08-04 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9974:

Summary: Improve debuggability  (was: Improve debuggability of 3.0)

> Improve debuggability
> -
>
> Key: CASSANDRA-9974
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9974
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
> Fix For: 3.0.0 rc1
>
>
> While 8099 has brought a number of improvements, currently it is making 
> debugging a bit of a nightmare (for me at least). This slows down development 
> and test resolution, and so we should fix it sooner than later. This ticket 
> is intended to aggregate tickets that will improve this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9926) Remove AbstractRow.isEmpty

2015-08-04 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9926:

Summary: Remove AbstractRow.isEmpty  (was: Avoid AbstractRow.isEmpty() 
where possible)

> Remove AbstractRow.isEmpty
> --
>
> Key: CASSANDRA-9926
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9926
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benedict
>Assignee: Benedict
>Priority: Minor
>
> This method performs not-insignificant duplicate work, so we should both 
> implement it more efficiently where possible, and avoid using it if we can



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9926) Remove AbstractRow.isEmpty

2015-08-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653194#comment-14653194
 ] 

Benedict commented on CASSANDRA-9926:
-

Since we only have one implementation of {{AbstractRow}}, we can safely remove 
the implementation in {{AbstractRow}} and - more importantly - introduce an 
efficient implementation in the concrete class.

> Remove AbstractRow.isEmpty
> --
>
> Key: CASSANDRA-9926
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9926
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benedict
>Assignee: Benedict
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> This method performs not-insignificant duplicate work, so we should both 
> implement it more efficiently where possible, and avoid using it if we can



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9926) Remove AbstractRow.isEmpty

2015-08-04 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9926:

Priority: Trivial  (was: Minor)

> Remove AbstractRow.isEmpty
> --
>
> Key: CASSANDRA-9926
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9926
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benedict
>Assignee: Benedict
>Priority: Trivial
> Fix For: 3.0 beta 1
>
>
> This method performs not-insignificant duplicate work, so we should both 
> implement it more efficiently where possible, and avoid using it if we can



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

2015-08-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653263#comment-14653263
 ] 

Benedict commented on CASSANDRA-8630:
-

Yes, spot on (at a high level - specifics may yet have more to discuss). 
[~aweisberg] may have comments / thoughts, since this is the direct mirror of 
CASSANDRA-9500

> Faster sequential IO (on compaction, streaming, etc)
> 
>
> Key: CASSANDRA-8630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core, Tools
>Reporter: Oleg Anastasyev
>Assignee: Stefania
>  Labels: compaction, performance
> Fix For: 3.x
>
> Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, 
> flight_recorder_001_files.tar.gz
>
>
> When a node is doing a lot of sequential IO (streaming, compacting, etc) a 
> lot of CPU is lost in calls to RAF's int read() and DataOutputStream's 
> write(int).
> This is because the default implementations of readShort, readLong, etc, as 
> well as their matching write* methods, are built from numerous byte-by-byte 
> reads and writes. 
> This makes a lot of syscalls as well.
> A quick microbench shows that just reimplementing these methods either way 
> gives an 8x speed increase.
> The attached patch implements the RandomAccessReader.read and 
> SequentialWriter.write methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and 
> ColumnNameHelper.maxComponents, which were on my profiler's hotspot method 
> list during tests.
> Stress tests on my laptop show that this patch makes compaction 25-30% 
> faster on uncompressed sstables and 15% faster for compressed ones.
> A deployment to production shows much less CPU load for compaction. 
> (I attached a cpu load graph from one of our production clusters; orange is 
> niced CPU load - i.e. compaction; yellow is user - i.e. tasks not related to 
> compaction)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

2015-08-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653676#comment-14653676
 ] 

Benedict commented on CASSANDRA-8630:
-

We already have a branch in every case? This won't change the typical behavior 
at all. That branch is also highly likely to be predicted correctly on just 
about every execution.

> Faster sequential IO (on compaction, streaming, etc)
> 
>
> Key: CASSANDRA-8630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core, Tools
>Reporter: Oleg Anastasyev
>Assignee: Stefania
>  Labels: compaction, performance
> Fix For: 3.x
>
> Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, 
> flight_recorder_001_files.tar.gz
>
>
> When a node is doing a lot of sequential IO (streaming, compacting, etc) a 
> lot of CPU is lost in calls to RAF's int read() and DataOutputStream's 
> write(int).
> This is because the default implementations of readShort, readLong, etc, as 
> well as their matching write* methods, are built from numerous byte-by-byte 
> reads and writes. 
> This makes a lot of syscalls as well.
> A quick microbench shows that just reimplementing these methods either way 
> gives an 8x speed increase.
> The attached patch implements the RandomAccessReader.read and 
> SequentialWriter.write methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and 
> ColumnNameHelper.maxComponents, which were on my profiler's hotspot method 
> list during tests.
> Stress tests on my laptop show that this patch makes compaction 25-30% 
> faster on uncompressed sstables and 15% faster for compressed ones.
> A deployment to production shows much less CPU load for compaction. 
> (I attached a cpu load graph from one of our production clusters; orange is 
> niced CPU load - i.e. compaction; yellow is user - i.e. tasks not related to 
> compaction)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

2015-08-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653738#comment-14653738
 ] 

Benedict commented on CASSANDRA-8630:
-

My expectation was that we would extract the existing "is sufficient 
remaining?" check, and use this to drive the decision. It doesn't really matter 
what we do inside this block, because it is executed infrequently. I would have 
a {{long readBytesSlowly(int count)}} method, whose result we downcast to 
whatever width we have actually read. That method would call {{readNext()}} as 
necessary, and would be prevented from being inlined. But that's just how I 
would do it, since I prefer not to incur complexity when we know the cost will 
be amortized away, nor to inline that method and inflate our code size for the 
same reason. 

I don't think it matters _terribly_, though, and if you or [~stefania] have a 
strong opinion I won't stand in the way (unless it happens to trigger a strong 
and unexpected anti-opinion)

> Faster sequential IO (on compaction, streaming, etc)
> 
>
> Key: CASSANDRA-8630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core, Tools
>Reporter: Oleg Anastasyev
>Assignee: Stefania
>  Labels: compaction, performance
> Fix For: 3.x
>
> Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, 
> flight_recorder_001_files.tar.gz
>
>
> When a node is doing a lot of sequential IO (streaming, compacting, etc) a 
> lot of CPU is lost in calls to RAF's int read() and DataOutputStream's 
> write(int).
> This is because the default implementations of readShort, readLong, etc, as 
> well as their matching write* methods, are built from numerous byte-by-byte 
> reads and writes. 
> This makes a lot of syscalls as well.
> A quick microbench shows that just reimplementing these methods either way 
> gives an 8x speed increase.
> The attached patch implements the RandomAccessReader.read and 
> SequentialWriter.write methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and 
> ColumnNameHelper.maxComponents, which were on my profiler's hotspot method 
> list during tests.
> Stress tests on my laptop show that this patch makes compaction 25-30% 
> faster on uncompressed sstables and 15% faster for compressed ones.
> A deployment to production shows much less CPU load for compaction. 
> (I attached a cpu load graph from one of our production clusters; orange is 
> niced CPU load - i.e. compaction; yellow is user - i.e. tasks not related to 
> compaction)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9962) WaitQueueTest is flakey

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654957#comment-14654957
 ] 

Benedict commented on CASSANDRA-9962:
-

bq.  A semaphore would be an easier way to wait for the thread(s) to be 
initialized?

Define easier :)

The main benefit of a {{Semaphore}} is that it does not busy spin; otherwise 
spinning on an atomic counter is just as easy. However stylistically I'd have 
no issue switching. The main benefit of busy spinning is it increases the 
likelihood of interesting thread interleavings, which is useful for testing 
this kind of functionality. Really, though, we should have a burn test, and I'm 
kind of surprised we do not. I thought we used to. Really I would like to merge 
[my concurrent utils|https://github.com/belliottsmith/bes-utils] with the tree, 
or link them, as we've had a few bugs and behavioural issues that could have 
been avoided with them, and they have pretty comprehensive burn tests, 
including for its variant on {{WaitQueue}}.

bq.  Maybe we should define a few constants in Util.java for how long to wait 
for this kind of condition?

I'm not sure there's a universally good large constant, but I've introduced a 
{{Util.joinThread()}} method with a 10s timeout, which may be too large, but 
will probably suffice for now.

bq. There is also no atomicity between signaling ready and getting into the 
wait queue.

Sure there is. We signal only after we add ourselves to the queue.
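
For the record, a minimal sketch of the pattern (illustrative only, not the 
test code itself):

{code}
// Each worker registers itself in the queue first, and only then increments
// the counter; the test busy-spins until all workers have signalled. The spin
// (vs a Semaphore) deliberately produces more varied thread interleavings.
import java.util.concurrent.atomic.AtomicInteger;

public class SpinReadyExample
{
    public static void main(String[] args)
    {
        final int threadCount = 4;
        final AtomicInteger ready = new AtomicInteger();

        for (int i = 0; i < threadCount; i++)
        {
            new Thread(() ->
            {
                // ... add ourselves to the wait queue here, then ...
                ready.incrementAndGet();   // signal only after registration
                // ... block on the queue and await the test's signal ...
            }).start();
        }

        while (ready.get() < threadCount)
        {
            // busy spin: cheap here, and exercises interesting interleavings
        }
    }
}
{code}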

> WaitQueueTest is flakey
> ---
>
> Key: CASSANDRA-9962
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9962
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Minor
> Fix For: 3.x
>
>
> While the test is a little noddy, and superfluous, it shouldn't fail even 
> vanishingly infrequently. [~aweisberg] has spotted it doing so, and I have 
> also encountered it once, so I suspect that a change in hardware/OS may have 
> turned a vanishingly unlikely failure into a merely pretty unlikely one, 
> which is even less good. 
> Right now it depends on {{Thread.start()}} completing before the new thread 
> starts; this isn't guaranteed. This patch fixes that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654962#comment-14654962
 ] 

Benedict commented on CASSANDRA-8630:
-

I think we can do something much simpler, without any intermediate objects. e.g.

{code}
@Override
public int readInt() throws IOException
{
    // Fast path: the whole primitive is available in the current buffer.
    if (buffer.remaining() >= 4)
        return buffer.getInt();
    else
        return (int) readPrimitiveSlowly(4);
}

@DontInline
private long readPrimitiveSlowly(int bytes) throws IOException
{
    // Slow path, deliberately kept out of line so it doesn't inflate the
    // caller's code size: assemble the value a byte at a time across refills.
    long result = 0;
    for (int i = 0; i < bytes; i++)
        result = (result << 8) | (readByte() & 0xff); // mask: avoid sign-extension
    return result;
}
{code}

> Faster sequential IO (on compaction, streaming, etc)
> 
>
> Key: CASSANDRA-8630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core, Tools
>Reporter: Oleg Anastasyev
>Assignee: Stefania
>  Labels: compaction, performance
> Fix For: 3.x
>
> Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, 
> flight_recorder_001_files.tar.gz
>
>
> When a node is doing a lot of sequential IO (streaming, compacting, etc) a 
> lot of CPU is lost in calls to RAF's int read() and DataOutputStream's 
> write(int).
> This is because the default implementations of readShort, readLong, etc, as 
> well as their matching write* methods, are built from numerous byte-by-byte 
> reads and writes. 
> This makes a lot of syscalls as well.
> A quick microbench shows that just reimplementing these methods either way 
> gives an 8x speed increase.
> The attached patch implements the RandomAccessReader.read and 
> SequentialWriter.write methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and 
> ColumnNameHelper.maxComponents, which were on my profiler's hotspot method 
> list during tests.
> Stress tests on my laptop show that this patch makes compaction 25-30% 
> faster on uncompressed sstables and 15% faster for compressed ones.
> A deployment to production shows much less CPU load for compaction. 
> (I attached a cpu load graph from one of our production clusters; orange is 
> niced CPU load - i.e. compaction; yellow is user - i.e. tasks not related to 
> compaction)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9985) Introduce our own AbstractIterator

2015-08-05 Thread Benedict (JIRA)
Benedict created CASSANDRA-9985:
---

 Summary: Introduce our own AbstractIterator
 Key: CASSANDRA-9985
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9985
 Project: Cassandra
  Issue Type: Sub-task
  Components: Core
Reporter: Benedict
Assignee: Benedict
Priority: Trivial
 Fix For: 3.0.0 rc1


The Guava AbstractIterator not only has unnecessary method call depth, it is 
difficult to debug without attaching source. Since it's absolutely trivial to 
write our own, and it's used widely within the codebase, I think we should do 
so.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9533) Make batch commitlog mode easier to tune

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654985#comment-14654985
 ] 

Benedict commented on CASSANDRA-9533:
-

Really trivial patch 
[here|https://github.com/belliottsmith/cassandra/commits/9533-2.1]
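
To spell out the semantic change at issue, a rough sketch with the IO 
abstracted behind hypothetical methods - not the actual commitlog code:

{code}
// Sketch only: the two interpretations of the batch window.
abstract class BatchSyncSketch
{
    abstract void awaitPendingWrites() throws InterruptedException; // hypothetical
    abstract long millisSinceLastSync();                            // hypothetical
    abstract void fsync();                                          // hypothetical

    // Pre-2.1: window is the *maximum* wait - sync as soon as work arrives,
    // pausing only if we synced very recently.
    void runWindowAsMaximum(long windowMillis) throws InterruptedException
    {
        while (true)
        {
            awaitPendingWrites();
            long sinceLast = millisSinceLastSync();
            if (sinceLast < windowMillis)
                Thread.sleep(windowMillis - sinceLast);
            fsync();
        }
    }

    // 2.1: window is the *minimum* wait - every batched writer waits out the
    // full window before its fsync, so the window must be kept very small.
    void runWindowAsMinimum(long windowMillis) throws InterruptedException
    {
        while (true)
        {
            Thread.sleep(windowMillis);
            fsync();
        }
    }
}
{code}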

> Make batch commitlog mode easier to tune
> 
>
> Key: CASSANDRA-9533
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9533
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
> Fix For: 3.x
>
>
> As discussed in CASSANDRA-9504, 2.1 changed commitlog_sync_batch_window_in_ms 
> from the maximum time to wait between fsyncs to the minimum time, so one must 
> be very careful to keep it small enough that most writers aren't kept waiting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9926) Remove AbstractRow.isEmpty

2015-08-05 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9926:

Fix Version/s: (was: 3.0 beta 1)
   3.0.0 rc1

> Remove AbstractRow.isEmpty
> --
>
> Key: CASSANDRA-9926
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9926
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benedict
>Assignee: Benedict
>Priority: Trivial
> Fix For: 3.0.0 rc1
>
>
> This method performs not-insignificant duplicate work, so we should both 
> implement it more efficiently where possible, and avoid using it if we can



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9986) Remove SliceableUnfilteredRowIterator

2015-08-05 Thread Benedict (JIRA)
Benedict created CASSANDRA-9986:
---

 Summary: Remove SliceableUnfilteredRowIterator
 Key: CASSANDRA-9986
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9986
 Project: Cassandra
  Issue Type: Sub-task
  Components: Core
Reporter: Benedict
Assignee: Benedict
Priority: Minor
 Fix For: 3.0.x


After CASSANDRA-9932, there is only one instance where this class is needed, 
and that is {{SSTableIterator}}. It would be much simpler if, like in 
{{AbstractBTreePartition}}, the slices were passed into the {{SSTableIterator}} 
on construction. This would make the control flow easier to follow both: 
* logically, because:
**  memtables and sstables would have the same access pattern; and
** our SSTableIterator would not have two modes of (parallel) operation, which 
is kind of confusing (the slices are independent of the non-sliced iteration)
* in the debugger, because we would have one fewer wrapping iterator





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (CASSANDRA-9974) Improve debuggability

2015-08-05 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict reassigned CASSANDRA-9974:
---

Assignee: Benedict

> Improve debuggability
> -
>
> Key: CASSANDRA-9974
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9974
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
> Fix For: 3.0.0 rc1
>
>
> While 8099 has brought a number of improvements, currently it is making 
> debugging a bit of a nightmare (for me at least). This slows down development 
> and test resolution, and so we should fix it sooner than later. This ticket 
> is intended to aggregate tickets that will improve this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9974) Improve debuggability

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655000#comment-14655000
 ] 

Benedict commented on CASSANDRA-9974:
-

Assigning to myself so that this encapsulating issue is tracked.

> Improve debuggability
> -
>
> Key: CASSANDRA-9974
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9974
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
> Fix For: 3.0.0 rc1
>
>
> While 8099 has brought a number of improvements, currently it is making 
> debugging a bit of a nightmare (for me at least). This slows down development 
> and test resolution, and so we should fix it sooner than later. This ticket 
> is intended to aggregate tickets that will improve this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9978) Split/Scrub tools no longer remove original sstable

2015-08-05 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9978:

Reviewer: Benedict

> Split/Scrub tools no longer remove original sstable
> ---
>
> Key: CASSANDRA-9978
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9978
> Project: Cassandra
>  Issue Type: Bug
>Reporter: T Jake Luciani
>Assignee: Stefania
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> Looks like CASSANDRA-7066 broke the scrub and split tools: the original 
> sstable is no longer removed.
> I fixed the sstablesplit dtest so you should see the issue now. The max 
> sstable size doesn't match the expected value because it's the original 
> sstable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9978) Split/Scrub tools no longer remove original sstable

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655003#comment-14655003
 ] 

Benedict commented on CASSANDRA-9978:
-

Yep, no problem. Also: SGTM.



> Split/Scrub tools no longer remove original sstable
> ---
>
> Key: CASSANDRA-9978
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9978
> Project: Cassandra
>  Issue Type: Bug
>Reporter: T Jake Luciani
>Assignee: Stefania
>Priority: Minor
> Fix For: 3.0 beta 1
>
>
> Looks like CASSANDRA-7066 broke the scrub and split tools: the original 
> sstable is no longer removed.
> I fixed the sstablesplit dtest so you should see the issue now. The max 
> sstable size doesn't match the expected value because it's the original 
> sstable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9986) Remove SliceableUnfilteredRowIterator

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655007#comment-14655007
 ] 

Benedict commented on CASSANDRA-9986:
-

Unassigning, in case somebody else wants to take this before I have time.

> Remove SliceableUnfilteredRowIterator
> -
>
> Key: CASSANDRA-9986
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9986
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Benedict
>Priority: Minor
> Fix For: 3.0.x
>
>
> After CASSANDRA-9932, there is only one instance where this class is needed, 
> and that is {{SSTableIterator}}. It would be much simpler if, like in 
> {{AbstractBTreePartition}}, the slices were passed into the 
> {{SSTableIterator}} on construction. This would make the control flow easier 
> to follow both: 
> * logically, because:
> **  memtables and sstables would have the same access pattern; and
> ** our SSTableIterator would not have two modes of (parallel) operation, 
> which is kind of confusing (the slices are independent of the non-sliced 
> iteration)
> * in the debugger, because we would have one fewer wrapping iterator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9986) Remove SliceableUnfilteredRowIterator

2015-08-05 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9986:

Assignee: (was: Benedict)

> Remove SliceableUnfilteredRowIterator
> -
>
> Key: CASSANDRA-9986
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9986
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Benedict
>Priority: Minor
> Fix For: 3.0.x
>
>
> After CASSANDRA-9932, there is only one instance where this class is needed, 
> and that is {{SSTableIterator}}. It would be much simpler if, like in 
> {{AbstractBTreePartition}}, the slices were passed into the 
> {{SSTableIterator}} on construction. This would make the control flow easier 
> to follow both: 
> * logically, because:
> **  memtables and sstables would have the same access pattern; and
> ** our SSTableIterator would not have two modes of (parallel) operation, 
> which is kind of confusing (the slices are independent of the non-sliced 
> iteration)
> * in the debugger, because we would have one fewer wrapping iterator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655039#comment-14655039
 ] 

Benedict commented on CASSANDRA-9985:
-

Trivial patch available 
[here|https://github.com/belliottsmith/cassandra/tree/abstract-iterator]

I realise this may (or may not) be contentious, so it is just a suggestion. 
Opinions solicited.

> Introduce our own AbstractIterator
> --
>
> Key: CASSANDRA-9985
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9985
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Trivial
> Fix For: 3.0.0 rc1
>
>
> The Guava AbstractIterator not only has unnecessary method call depth, it is 
> difficult to debug without attaching source. Since it's absolutely trivial to 
> write our own, and it's used widely within the codebase, I think we should do 
> so.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9987) BTree Related Optimisations

2015-08-05 Thread Benedict (JIRA)
Benedict created CASSANDRA-9987:
---

 Summary: BTree Related Optimisations
 Key: CASSANDRA-9987
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9987
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 3.x


This ticket is to track a number of small improvements to BTree related 
functionality. Assigning to myself for tracking purposes only.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9988) Introduce leaf-only iterator

2015-08-05 Thread Benedict (JIRA)
Benedict created CASSANDRA-9988:
---

 Summary: Introduce leaf-only iterator
 Key: CASSANDRA-9988
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9988
 Project: Cassandra
  Issue Type: Sub-task
Reporter: Benedict
Priority: Minor
 Fix For: 3.x


In many cases we have small btrees, small enough to fit in a single leaf page. 
In this case it _may_ be more efficient to specialise our iterator.
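
A minimal sketch of what such a specialisation might look like, assuming 
(purely for illustration) that a single-leaf tree is just a sorted 
{{Object[]}}; this is not the actual BTree layout:

{code}
// Iteration over a single leaf page reduces to a bounds-checked array walk,
// with none of the parent/stack bookkeeping a full tree iterator needs.
import java.util.Iterator;
import java.util.NoSuchElementException;

final class LeafIterator<V> implements Iterator<V>
{
    private final Object[] leaf;  // the single leaf page (assumed layout)
    private int i;

    LeafIterator(Object[] leaf) { this.leaf = leaf; }

    public boolean hasNext() { return i < leaf.length; }

    @SuppressWarnings("unchecked")
    public V next()
    {
        if (i >= leaf.length)
            throw new NoSuchElementException();
        return (V) leaf[i++];
    }
}
{code}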



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9989) Optimise BTree.Builder

2015-08-05 Thread Benedict (JIRA)
Benedict created CASSANDRA-9989:
---

 Summary: Optimise BTree.Builder
 Key: CASSANDRA-9989
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9989
 Project: Cassandra
  Issue Type: Sub-task
Reporter: Benedict
Priority: Minor
 Fix For: 3.x


BTree.Builder could reduce its copying, and exploit toArray more efficiently, 
with some work. It's not very important right now, because we don't make as 
much use of its bulk-add methods as we otherwise might; over time, however, 
this work will become more useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9990) Use TreeBuilder directly where possible

2015-08-05 Thread Benedict (JIRA)
Benedict created CASSANDRA-9990:
---

 Summary: Use TreeBuilder directly where possible
 Key: CASSANDRA-9990
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9990
 Project: Cassandra
  Issue Type: Sub-task
  Components: Core
Reporter: Benedict
Priority: Minor
 Fix For: 3.x


In cases where we iteratively build a btree with sorted and unique values, we 
can use a TreeBuilder directly and avoid unnecessary copies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-9991) Implement efficient btree removal

2015-08-05 Thread Benedict (JIRA)
Benedict created CASSANDRA-9991:
---

 Summary: Implement efficient btree removal
 Key: CASSANDRA-9991
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9991
 Project: Cassandra
  Issue Type: Sub-task
  Components: Core
Reporter: Benedict
 Fix For: 3.x


Currently removal is implemented as a reconstruction by filtering an iterator 
over the original btree. This could be much more efficient, editing just the 
necessary nodes. 
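
For illustration, the current strategy amounts to something like the following, 
with a flat sorted array standing in for the tree (a sketch, not the real 
BTree code):

{code}
// The rebuild-by-filtering approach: every element is visited and copied,
// even though a targeted removal would only need to rewrite the O(log n)
// nodes on the path to the removed key, sharing all untouched subtrees.
import java.util.ArrayList;
import java.util.Comparator;

final class RemoveByRebuild
{
    @SuppressWarnings("unchecked")
    static <V> Object[] without(Object[] tree, V toRemove, Comparator<? super V> cmp)
    {
        ArrayList<V> kept = new ArrayList<>(tree.length);
        for (Object o : tree)                    // "iterate over the original btree"
            if (cmp.compare((V) o, toRemove) != 0)
                kept.add((V) o);                 // "...filtering..."
        return kept.toArray();                   // "...and reconstruct"
    }
}
{code}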



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655149#comment-14655149
 ] 

Benedict commented on CASSANDRA-8630:
-

bq. For the fast path, the built-in BB methods should still be faster, right?

Right.

bq. readByte() would result in one unsafe get call per byte.

The unsafe calls here are all intrinsics, but still - even for fully inlined 
method calls and unrolled loop we're talking something like 24x the work, but 
then we have the virtual invocation costs involved (the behaviour of which for 
a sequence of 8 identical calls I'm not certain - I would hope there is some 
sharing of the method call burden through loop unrolling, but I don't count on 
it), and we are probably 100x+ using rigorous finger-in-air maths.

> Faster sequential IO (on compaction, streaming, etc)
> 
>
> Key: CASSANDRA-8630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core, Tools
>Reporter: Oleg Anastasyev
>Assignee: Stefania
>  Labels: compaction, performance
> Fix For: 3.x
>
> Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, 
> flight_recorder_001_files.tar.gz
>
>
> When a node is doing a lot of sequential IO (streaming, compacting, etc) a 
> lot of CPU is lost in calls to RAF's int read() and DataOutputStream's 
> write(int).
> This is because the default implementations of readShort, readLong, etc, as 
> well as their matching write* methods, are built from numerous byte-by-byte 
> reads and writes. 
> This makes a lot of syscalls as well.
> A quick microbench shows that just reimplementing these methods either way 
> gives an 8x speed increase.
> The attached patch implements the RandomAccessReader.read and 
> SequentialWriter.write methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and 
> ColumnNameHelper.maxComponents, which were on my profiler's hotspot method 
> list during tests.
> Stress tests on my laptop show that this patch makes compaction 25-30% 
> faster on uncompressed sstables and 15% faster for compressed ones.
> A deployment to production shows much less CPU load for compaction. 
> (I attached a cpu load graph from one of our production clusters; orange is 
> niced CPU load - i.e. compaction; yellow is user - i.e. tasks not related to 
> compaction)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655149#comment-14655149
 ] 

Benedict edited comment on CASSANDRA-8630 at 8/5/15 10:33 AM:
--

bq. For the fast path, the built-in BB methods should still be faster, right?

Right.

bq. readByte() would result in one unsafe get call per byte.

The unsafe calls here are all intrinsics, but still - even for fully inlined 
method calls and unrolled loop we're talking something like 24x the work, but 
then we have the virtual invocation costs involved (the behaviour of which for 
a sequence of 8 identical calls I'm not certain - I would hope there is some 
sharing of the method call burden through loop unrolling, but I don't count on 
it), and we are probably 100x+ using rigorous finger-in-air maths.

bq. Do we care about little-endian ordering as well

Good point. I think we may actually depend on this in MemoryInputStream for 
some classes that were persisted in weird byte order. However I'm tempted to 
say we should start pushing this upstream to the places that care, as there are 
few, and we basically consider them all broken. It's not the first time this 
has caused annoyance. I'd be tempted to just forbid it, and patch up the few 
places that need it.


was (Author: benedict):
bq. For the fast path, the built-in BB methods should still be faster, right?

Right.

bq. readByte() would result in one unsafe get call per byte.

The unsafe calls here are all intrinsics, but still - even for fully inlined 
method calls and unrolled loop we're talking something like 24x the work, but 
then we have the virtual invocation costs involved (the behaviour of which for 
a sequence of 8 identical calls I'm not certain - I would hope there is some 
sharing of the method call burden through loop unrolling, but I don't count on 
it), and we are probably 100x+ using rigorous finger-in-air maths.

> Faster sequential IO (on compaction, streaming, etc)
> 
>
> Key: CASSANDRA-8630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core, Tools
>Reporter: Oleg Anastasyev
>Assignee: Stefania
>  Labels: compaction, performance
> Fix For: 3.x
>
> Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, 
> flight_recorder_001_files.tar.gz
>
>
> When a node is doing a lot of sequential IO (streaming, compacting, etc) a 
> lot of CPU is lost in calls to RAF's int read() and DataOutputStream's 
> write(int).
> This is because the default implementations of readShort, readLong, etc, as 
> well as their matching write* methods, are built from numerous byte-by-byte 
> reads and writes. 
> This makes a lot of syscalls as well.
> A quick microbench shows that just reimplementing these methods either way 
> gives an 8x speed increase.
> The attached patch implements the RandomAccessReader.read and 
> SequentialWriter.write methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and 
> ColumnNameHelper.maxComponents, which were on my profiler's hotspot method 
> list during tests.
> Stress tests on my laptop show that this patch makes compaction 25-30% 
> faster on uncompressed sstables and 15% faster for compressed ones.
> A deployment to production shows much less CPU load for compaction. 
> (I attached a cpu load graph from one of our production clusters; orange is 
> niced CPU load - i.e. compaction; yellow is user - i.e. tasks not related to 
> compaction)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655149#comment-14655149
 ] 

Benedict edited comment on CASSANDRA-8630 at 8/5/15 10:34 AM:
--

bq. For the fast path, the built-in BB methods should still be faster, right?

Right.

bq. readByte() would result in one unsafe get call per byte.

The unsafe calls here are all intrinsics, but still - even for fully inlined 
method calls and unrolled loop we're talking something like 24x the work, but 
then we have the virtual invocation costs involved (the behaviour of which for 
a sequence of 8 identical calls I'm not certain - I would hope there is some 
sharing of the method call burden through loop unrolling, but I don't count on 
it), and we are probably 100x+ using rigorous finger-in-air maths.

bq. Do we care about little-endian ordering as well

Good point. I think we may actually depend on this in MemoryInputStream for 
some classes that were persisted in weird byte order. However I'm tempted to 
say we should start pushing this upstream to the places that care, as there are 
few, and we basically consider them all broken. It's not the first time this 
has caused annoyance. I'd be tempted to just forbid it, and patch up the few 
places that need it. (My opinion, only, and if it looks like a hassle we can 
just add support here)


was (Author: benedict):
bq. For the fast path, the built-in BB methods should still be faster, right?

Right.

bq. readByte() would result in one unsafe get call per byte.

The unsafe calls here are all intrinsics, but still - even for fully inlined 
method calls and unrolled loop we're talking something like 24x the work, but 
then we have the virtual invocation costs involved (the behaviour of which for 
a sequence of 8 identical calls I'm not certain - I would hope there is some 
sharing of the method call burden through loop unrolling, but I don't count on 
it), and we are probably 100x+ using rigorous finger-in-air maths.

bq. Do we care about little-endian ordering as well

Good point. I think we may actually depend on this in MemoryInputStream for 
some classes that were persisted in weird byte order. However I'm tempted to 
say we should start pushing this upstream to the places that care, as there are 
few, and we basically consider them all broken. It's not the first time this 
has caused annoyance. I'd be tempted to just forbid it, and patch up the few 
places that need it.

> Faster sequential IO (on compaction, streaming, etc)
> 
>
> Key: CASSANDRA-8630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core, Tools
>Reporter: Oleg Anastasyev
>Assignee: Stefania
>  Labels: compaction, performance
> Fix For: 3.x
>
> Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, 
> flight_recorder_001_files.tar.gz
>
>
> When a node is doing a lot of sequential IO (streaming, compacting, etc) a 
> lot of CPU is lost in calls to RAF's int read() and DataOutputStream's 
> write(int).
> This is because the default implementations of readShort, readLong, etc, as 
> well as their matching write* methods, are built from numerous byte-by-byte 
> reads and writes. 
> This makes a lot of syscalls as well.
> A quick microbench shows that just reimplementing these methods either way 
> gives an 8x speed increase.
> The attached patch implements the RandomAccessReader.read and 
> SequentialWriter.write methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and 
> ColumnNameHelper.maxComponents, which were on my profiler's hotspot method 
> list during tests.
> Stress tests on my laptop show that this patch makes compaction 25-30% 
> faster on uncompressed sstables and 15% faster for compressed ones.
> A deployment to production shows much less CPU load for compaction. 
> (I attached a cpu load graph from one of our production clusters; orange is 
> niced CPU load - i.e. compaction; yellow is user - i.e. tasks not related to 
> compaction)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CASSANDRA-9925) Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder directly

2015-08-05 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict resolved CASSANDRA-9925.
-
Resolution: Duplicate

> Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder 
> directly
> ---
>
> Key: CASSANDRA-9925
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9925
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Minor
> Fix For: 3.0.x
>
>
> There are now a number of situations where we use a BTree.Builder that could 
> be made to go directly to the TreeBuilder, since they only perform in-order 
> unique additions to build an initial tree. This would potentially avoid a 
> number of array allocations/copies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CASSANDRA-9600) DescriptorTypeTidy and GlobalTypeTidy do not benefit from being full fledged Ref instances

2015-08-05 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict resolved CASSANDRA-9600.
-
Resolution: Won't Fix

We've already removed DescriptorTypeTidy, and I don't think it's worth worrying 
about GlobalTypeTidy any time soon.

> DescriptorTypeTidy and GlobalTypeTidy do not benefit from being full fledged 
> Ref instances
> --
>
> Key: CASSANDRA-9600
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9600
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Minor
>
> These inner SSTableReader tidying classes do not benefit from being a full 
> fledged Ref because they are managed in such a small scope. This increases 
> our surface area to problems such as CASSANDRA-9549 (these were the affected 
> instances, ftr).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658273#comment-14658273
 ] 

Benedict commented on CASSANDRA-9985:
-

I wouldn't care if it weren't for the sheer number of these calls that we chain 
together now, so that 75% of your code stepping is through a linked jar. With 
multiple versions it is hard to link the source, and so the majority of your 
debug stepping is through a morass of unknown code. This patch also removes one 
method call from that chain, which shrinks the height of the call tree 
significantly when they're added together.

But, as I say, I figured this little patch might split opinion, so I only 
propose it for inclusion at the consensus of the community.

It's worth noting that CASSANDRA-9975 also mitigates the problem; however, it 
leaves a lot of these iterators in the call tree.

> Introduce our own AbstractIterator
> --
>
> Key: CASSANDRA-9985
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9985
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Trivial
> Fix For: 3.0.0 rc1
>
>
> The Guava AbstractIterator not only has unnecessary method call depth, it is 
> difficult to debug without attaching source. Since it's absolutely trivial to 
> write our own, and it's used widely within the codebase, I think we should do 
> so.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator

2015-08-05 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658310#comment-14658310
 ] 

Benedict commented on CASSANDRA-9985:
-

FTR, I've had to update the patch to closer to Guava because we do in fact use 
null returns (I had thought we did not).

To be honest, I would be totally comfortable just porting the Guava code 
wholesale (or almost wholesale; the removal of tryComputeNext is still nice).

> Introduce our own AbstractIterator
> --
>
> Key: CASSANDRA-9985
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9985
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Trivial
> Fix For: 3.0.0 rc1
>
>
> The Guava AbstractIterator not only has unnecessary method call depth, it is 
> difficult to debug without attaching source. Since it's absolutely trivial to 
> write our own, and it's used widely within the codebase, I think we should do 
> so.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator

2015-08-06 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659823#comment-14659823
 ] 

Benedict commented on CASSANDRA-9985:
-

bq. Can I bring up the icache boogyman? 

You can. Since I'm its main champion, it's a powerful weapon against me. 
However, in this case a little analysis shows that Guava doesn't even use its 
own {{AbstractIterator}} (either of them!) very much at all. In fact, nowhere 
that we do, from what I can tell. I believe CASSANDRA-9471 uses 
{{Iterables.mergeSorted}}, which does - but we can perform a more efficient 
merge since it's only 2-way, and otherwise should prefer our own 
{{MergeIterator}}, which has a more efficient heap implementation. So I will 
update it to avoid using this function.

Which is a relief, since I am not in favour of overriding external packages if 
we can avoid it, as it plays havoc with anyone wanting to use signed jars.

The only other place we use Guava {{AbstractIterator}} is in OHC. Since this is 
the "HotKey" iterator, it is used infrequently and probably not to be troubled 
over.

Of course, we do run the risk of in future introducing the occasional usage, 
but those risks are small given how narrowly it is deployed in Guava. We have a 
lot of minor issues around duplication, and we should probably file a ticket to 
specifically track and fix it en masse. Especially regarding duplicate 
functionality between Guava and java.util.function, but no harm setting up some 
analysis for everything. It would be great if we had some automated analysis to 
tell us of duplicate code candidates, in general. I'm certain such tools exist.

I'll take a look at purloining any Guava tests.
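
For reference, a 2-way merge needs only a single comparison per element, with 
no heap at all. A sketch (not our {{MergeIterator}}, and assuming non-null 
elements, since null is used as the exhaustion sentinel):

{code}
// Merge two sorted iterators with one comparison per step.
import java.util.Comparator;
import java.util.Iterator;

final class TwoWayMerge<T> implements Iterator<T>
{
    private final Iterator<T> left, right;
    private final Comparator<? super T> cmp;
    private T l, r;   // one-element lookahead; null means exhausted

    TwoWayMerge(Iterator<T> left, Iterator<T> right, Comparator<? super T> cmp)
    {
        this.left = left;
        this.right = right;
        this.cmp = cmp;
        this.l = left.hasNext() ? left.next() : null;
        this.r = right.hasNext() ? right.next() : null;
    }

    public boolean hasNext() { return l != null || r != null; }

    public T next()
    {
        // A single comparison decides which side to consume - no heap needed.
        if (r == null || (l != null && cmp.compare(l, r) <= 0))
        {
            T result = l;
            l = left.hasNext() ? left.next() : null;
            return result;
        }
        T result = r;
        r = right.hasNext() ? right.next() : null;
        return result;
    }
}
{code}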

> Introduce our own AbstractIterator
> --
>
> Key: CASSANDRA-9985
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9985
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Trivial
> Fix For: 3.0.0 rc1
>
>
> The Guava AbstractIterator not only has unnecessary method call depth, it is 
> difficult to debug without attaching source. Since it's absolutely trivial to 
> write our own, and it's used widely within the codebase, I think we should do 
> so.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator

2015-08-06 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659837#comment-14659837
 ] 

Benedict commented on CASSANDRA-9985:
-

So, I've pushed a version with Guava's test case included, but with two failing 
tests removed because IMO they aren't things we should worry about:

1) Reentrancy in {{hasNext()}}, i.e. if you create a {{computeNext()}} method 
that recursively calls {{hasNext()}} then it will throw ISE
2) Continuing to use an iterator that threw an exception during 
{{computeNext()}}, which will also throw ISE

My view is that neither check buys us much, and I prefer the code clarity of 
the current implementation. However support is easily added at the expense of a 
bit of ugliness.
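
The shape of the guard, for the record: a small state machine in which the 
FAILED state is the "bit of ugliness". This is a sketch of the pattern, not 
the committed code:

{code}
// computeNext()/endOfData() pattern with guards for the two edge cases above:
// reentrant hasNext() during computeNext(), and reuse after an exception.
import java.util.Iterator;
import java.util.NoSuchElementException;

abstract class AbstractIterator<V> implements Iterator<V>
{
    private enum State { MUST_FETCH, HAS_NEXT, DONE, FAILED }
    private State state = State.MUST_FETCH;
    private V next;

    protected abstract V computeNext();   // call endOfData() when exhausted

    protected V endOfData() { state = State.DONE; return null; }

    public boolean hasNext()
    {
        if (state == State.MUST_FETCH)
        {
            state = State.FAILED;         // remains FAILED if computeNext throws
            next = computeNext();
            if (state == State.FAILED)    // i.e. endOfData() was not called
                state = State.HAS_NEXT;
        }
        if (state == State.FAILED)        // reentrancy, or reuse after failure
            throw new IllegalStateException();
        return state == State.HAS_NEXT;
    }

    public V next()
    {
        if (!hasNext())
            throw new NoSuchElementException();
        state = State.MUST_FETCH;
        return next;
    }
}
{code}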

> Introduce our own AbstractIterator
> --
>
> Key: CASSANDRA-9985
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9985
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Trivial
> Fix For: 3.0.0 rc1
>
>
> The Guava AbstractIterator not only has unnecessary method call depth, it is 
> difficult to debug without attaching source. Since it's absolutely trivial to 
> write our own, and it's used widely within the codebase, I think we should do 
> so.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9894) Serialize the header only once per message

2015-08-06 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660099#comment-14660099
 ] 

Benedict commented on CASSANDRA-9894:
-

bq.  I changed it to return the assembled list and then the test didn't pass.

Thanks. I've added two commits, one of which fixes this, and another which 
switches to delta encoding, which naturally collapsed both huge serialization 
paths into one, making it much cleaner.

bq. There are a couple of other methods like digest, selectOrderIterator, 
getComplex, and getSimple that don't run. I get surprised pretty regularly so I 
would test those as well.

{{selectOrderIterator}} was already being tested, and I've added coverage for 
{{getComplex}} and {{getSimple}}. 

However digest and {{getDroppedColumnDefinition}} I'm not so sure about: I kind 
of feel they only make sense in a larger picture, i.e. in a dtest. I don't mind 
doing a noddy "check it does what the code already says it does" kind of test, 
but I'm not convinced it adds value by itself.

bq. Can you use a random with a known or logged seed for the test?

This turned out to be much more annoying: {{ThreadLocalRandom}} does not 
support setting (or fetching) its seed, and regular {{Random}} has a more 
rubbish API - it lacks a lot of useful methods. We should file a ticket for 
exposing an easily reseeded/logged Random with the full TLR API, I think.

> Serialize the header only once per message
> --
>
> Key: CASSANDRA-9894
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9894
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Sylvain Lebresne
>Assignee: Benedict
> Fix For: 3.0 beta 1
>
>
> One last improvement I'd like to do on the serialization side is that we 
> currently serialize the {{SerializationHeader}} for each partition. That 
> header contains the serialized columns in particular and for range queries, 
> serializing that for every partition is wasted (note that it's only a problem 
> for the messaging protocol as for sstable we only write the header once per 
> sstable).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9925) Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder directly

2015-08-06 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9925:

Fix Version/s: (was: 3.0.x)

> Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder 
> directly
> ---
>
> Key: CASSANDRA-9925
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9925
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Minor
>
> There are now a number of situations where we use a BTree.Builder that could 
> be made to go directly to the TreeBuilder, since they only perform in-order 
> unique additions to build an initial tree. This would potentially avoid a 
> number of array allocations/copies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9962) WaitQueueTest is flakey

2015-08-06 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9962:

Fix Version/s: (was: 3.x)
   3.0 beta 1
   2.2.1
   2.1.9

> WaitQueueTest is flakey
> ---
>
> Key: CASSANDRA-9962
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9962
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Minor
> Fix For: 2.1.9, 2.2.1, 3.0 beta 1
>
>
> While the test is a little noddy, and superfluous, it shouldn't fail even 
> vanishingly infrequently. [~aweisberg] has spotted it doing so, and I have 
> also encountered it once, so I suspect a change in hardware/OS may have made 
> the vanishingly unlikely merely pretty unlikely, which is even less good. 
> Right now the test depends on {{Thread.start()}} completing before the new 
> thread begins running; this isn't guaranteed. This patch fixes that.
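
The standard way to remove this kind of ordering race is to have the new 
thread announce that it is actually running, e.g. via a {{CountDownLatch}} - 
a minimal sketch, not the actual test code:

{code:java}
import java.util.concurrent.CountDownLatch;

// Sketch: never assume anything about when Thread.start() returns relative
// to the new thread's first actions; have the thread signal explicitly.
public final class StartRaceDemo
{
    public static void main(String[] args) throws InterruptedException
    {
        CountDownLatch started = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            started.countDown(); // announce that we are actually running
            // ... body of the test thread ...
        });
        t.start();
        started.await(); // block until the thread is demonstrably running
        // ... the rest of the test can now rely on the thread having started ...
        t.join();
    }
}
{code}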



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9998) LEAK DETECTED with snapshot/sequential repairs

2015-08-06 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661382#comment-14661382
 ] 

Benedict commented on CASSANDRA-9998:
-

+1

> LEAK DETECTED with snapshot/sequential repairs
> --
>
> Key: CASSANDRA-9998
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9998
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
> Fix For: 3.x, 2.1.x, 2.2.x
>
>
> http://cassci.datastax.com/job/cassandra-2.1_dtest/lastCompletedBuild/testReport/repair_test/TestRepair/simple_sequential_repair_test/
> does not happen if I add -par to the test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator

2015-08-06 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661392#comment-14661392
 ] 

Benedict commented on CASSANDRA-9985:
-

Right, it seems exceedingly difficult to get into that state, and it is a 
worse misuse of the iterator than we can reasonably be expected to guard 
against.

My main reason for not wanting to address it, though, is that we will be 
stepping through this {{AbstractIterator.hasNext()}} often, and I'd like it to 
be easy and attractive to do so. I had a quick go at keeping it easy to read 
and, hopefully, to step through, while still catching this edge case. I've 
pushed a version that I think meets the criteria.

> Introduce our own AbstractIterator
> --
>
> Key: CASSANDRA-9985
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9985
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Trivial
> Fix For: 3.0.0 rc1
>
>
> The Guava AbstractIterator not only has unnecessary method call depth, it is 
> difficult to debug without attaching source. Since it's absolutely trivial to 
> write our own, and it's used widely within the codebase, I think we should do 
> so.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)

2015-08-07 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661560#comment-14661560
 ] 

Benedict commented on CASSANDRA-8630:
-

It would be better, from that perspective, to use native order. However:

# Swapping byte order is a single-cycle instruction, so it is probably not a 
big deal given how inefficient we are otherwise right now
# We would have to require that all of a user's nodes share the same 
architecture (or at least endianness), which may or may not be limiting. 
Alternatively, we could pick the most common platform endianness and go 
little-endian
# We would have to support both formats, at least during upgrade

All told, I think it's easier to stick with the big-endian format, which is 
the Java standard and standard across our codebase. But I'm not opposed to the 
alternative of providing a patch to support both, followed by an upgrade 
period and dropping support for big endian in 4.0
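
To make the byte-order point concrete, a standalone sketch (not from any 
patch): {{ByteBuffer}} defaults to big endian, and {{Long.reverseBytes}} is a 
JIT intrinsic that compiles to a single byte-swap instruction on x86:

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Standalone demo: ByteBuffer defaults to big endian (the Java standard);
// reading in native order on a little-endian machine yields the byte-swapped
// value, and Long.reverseBytes converts cheaply between the two.
public final class ByteOrderDemo
{
    public static void main(String[] args)
    {
        ByteBuffer buf = ByteBuffer.allocate(8); // big-endian by default
        buf.putLong(0, 0x0102030405060708L);
        long big = buf.getLong(0);

        buf.order(ByteOrder.nativeOrder());
        long nativeRead = buf.getLong(0);

        System.out.println(ByteOrder.nativeOrder());
        // true on either architecture: equal, or related by a single byte swap
        System.out.println(big == nativeRead || big == Long.reverseBytes(nativeRead));
    }
}
{code}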



> Faster sequential IO (on compaction, streaming, etc)
> 
>
> Key: CASSANDRA-8630
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8630
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core, Tools
>Reporter: Oleg Anastasyev
>Assignee: Stefania
>  Labels: compaction, performance
> Fix For: 3.x
>
> Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, 
> flight_recorder_001_files.tar.gz
>
>
> When a node is doing a lot of sequential IO (streaming, compacting, etc) a lot 
> of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int).
> This is because the default implementations of readShort, readLong, etc, as 
> well as their matching write* methods, are implemented with numerous 
> byte-by-byte reads and writes. 
> This makes a lot of syscalls as well.
> A quick microbench shows that just reimplementing these methods either way 
> gives an 8x speed increase.
> The attached patch implements the RandomAccessReader.read and 
> SequentialWriter.write methods in a more efficient way.
> I also eliminated some extra byte copies in CompositeType.split and 
> ColumnNameHelper.maxComponents, which were on my profiler's hotspot method 
> list during tests.
> Stress tests on my laptop show that this patch makes compaction 25-30% 
> faster on uncompressed sstables and 15% faster on compressed ones.
> A deployment to production shows much less CPU load for compaction. 
> (I attached a cpu load graph from one of our production clusters; orange is 
> niced CPU load - i.e. compaction; yellow is user - i.e. non-compaction tasks)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8921) Experiment with a probabilistic tree of membership for maxPurgeableTimestamp

2015-08-07 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661802#comment-14661802
 ] 

Benedict commented on CASSANDRA-8921:
-

Hi Daniel,

No one is working on this, no, and we'd love you to have a go. Since it's a 
really non-trivial ticket, though, I feel I should issue a brief warning that 
it will be more difficult than your average patch to get included. It's some 
fairly fundamental algorithmic work, and effectively a research piece, so it 
will receive a lot of scrutiny. It's also not *100%* certain that it will pay 
dividends. I expect it will, but it may have costs for some workloads, and 
there are open questions: 

* do we rebuild this tree each time we cross an sstable boundary? (this may 
lead to high CPU cost in some scenarios)
* do we build one tree up-front, and then post-filter? (this may lead to a 
significant increase in memory requirement)
* do we hybridise, and chain new BFs on wholesale, and rebalance the tree once 
we have enough that the cost will be sufficiently amortized? 

This all makes it a really fun piece of work, but I would recommend thinking 
through how the algorithm would work and posting a high-level design here, 
preferably with a little blurb about your expectations for the tradeoffs. That 
way we can discuss it up-front, and hopefully ease the gradual process of 
inclusion.
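
To make the structure concrete, a minimal, hypothetical sketch of such a tree 
(in the spirit of the Bloofi paper): each inner node holds the union of its 
children's filters, so a negative test at a node prunes its entire subtree. 
The {{BloomFilter}} interface below is a stand-in, not Cassandra's actual 
class:

{code:java}
import java.util.List;

// Hypothetical sketch of a tree of bloom filters: each inner node's filter is
// the bitwise union of its descendants', so a negative membership test prunes
// the whole subtree. BloomFilter is a stand-in interface, not Cassandra's.
interface BloomFilter
{
    boolean mightContain(byte[] key);
}

final class FilterTreeNode
{
    final BloomFilter union;             // union of all descendant filters
    final List<FilterTreeNode> children; // empty for leaves
    final Object sstable;                // the leaf payload, e.g. an sstable

    FilterTreeNode(BloomFilter union, List<FilterTreeNode> children, Object sstable)
    {
        this.union = union;
        this.children = children;
        this.sstable = sstable;
    }

    // Collect the leaves that might contain key; negative nodes are pruned,
    // giving roughly logarithmic work when few leaves match.
    void query(byte[] key, List<Object> possibleMatches)
    {
        if (!union.mightContain(key))
            return; // prune this entire subtree
        if (children.isEmpty())
            possibleMatches.add(sstable);
        else
            for (FilterTreeNode child : children)
                child.query(key, possibleMatches);
    }
}
{code}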

> Experiment with a probabilistic tree of membership for maxPurgeableTimestamp
> 
>
> Key: CASSANDRA-8921
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8921
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>
> maxPurgeableTimestamp appears to be a significant cost for some workloads, 
> the majority of which stemming from the cost of membership tests across the 
> overlapping tables. It would be possible to construct a tree of bloom filters 
> from the existing filters, that could yield queries of the set of possible 
> membership of a given key with logarithmic performance, and it appears there 
> is a research paper (that I haven't dived into yet) that outlines something 
> like this http://www.usna.edu/Users/cs/adina/research/Bloofi%20_CloudI2013.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10002) Repeated slices on RowSearchers are incorrect

2015-08-07 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662156#comment-14662156
 ] 

Benedict commented on CASSANDRA-10002:
--

It's worth pointing out that this code will be removed by CASSANDRA-9932.

> Repeated slices on RowSearchers are incorrect
> -
>
> Key: CASSANDRA-10002
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10002
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Tyler Hobbs
>Assignee: Tyler Hobbs
> Fix For: 3.0 beta 1
>
>
> In {{AbstractThreadUnsafePartition}}, repeated {{slice()}} calls on a 
> {{RowSearcher}} can produce incorrect results.  This is caused by only 
> performing a binary search over a sublist (based on {{nextIdx}}), but not 
> taking {{nextIdx}} into account when using the search result index.
> I made a quick fix in [this 
> commit|https://github.com/thobbs/cassandra/commit/73725ea6825c9c0da1fa4986b01f39ae08130e10]
>  on one of my branches, but the full fix also needs to cover 
> {{ReverseRowSearcher}} and include a test to reproduce the issue.
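
The bug pattern is easy to show in isolation - a generic sketch, not the 
actual RowSearcher code: a binary search over a sublist returns an index 
relative to the sublist, which must be translated back before indexing the 
full list:

{code:java}
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Generic illustration of the off-by-offset pattern described above;
// not the actual RowSearcher code.
public final class SubListSearchDemo
{
    public static void main(String[] args)
    {
        List<Integer> rows = Arrays.asList(10, 20, 30, 40, 50);
        int nextIdx = 2; // resume the search from index 2 onwards

        int rel = Collections.binarySearch(rows.subList(nextIdx, rows.size()), 40);

        // WRONG: rel is relative to the sublist, so rows.get(rel) reads 20.
        // RIGHT: translate back into the coordinates of the full list
        // (negative results encode an insertion point and shift similarly).
        int abs = rel >= 0 ? rel + nextIdx : -(-rel - 1 + nextIdx) - 1;

        System.out.println(rows.get(abs)); // prints 40
    }
}
{code}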



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9500) SequentialWriter should extend BufferedDataOutputStreamPlus

2015-08-08 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9500:

Reviewer: Benedict

> SequentialWriter should extend BufferedDataOutputStreamPlus
> ---
>
> Key: CASSANDRA-9500
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9500
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benedict
>Assignee: Ariel Weisberg
>Priority: Minor
>
> SequentialWriter is the last piece left not implementing 
> DataOutputStreamPlus. It should not be too tricky to extend it, and it should 
> result in improved write throughput. Further, it will permit it to exploit 
> the more efficient implementation of writeVInt being delivered in 
> CASSANDRA-9499



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9985) Introduce our own AbstractIterator

2015-08-08 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-9985:

Fix Version/s: (was: 3.0.0 rc1)
   3.0 beta 1

> Introduce our own AbstractIterator
> --
>
> Key: CASSANDRA-9985
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9985
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
>Priority: Trivial
> Fix For: 3.0 beta 1
>
>
> The Guava AbstractIterator not only has unnecessary method call depth, it is 
> difficult to debug without attaching source. Since it's absolutely trivial to 
> write our own, and it's used widely within the codebase, I think we should do 
> so.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9533) Make batch commitlog mode easier to tune

2015-08-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662908#comment-14662908
 ] 

Benedict commented on CASSANDRA-9533:
-

If you're trickling one erase block every 10ms, I think most consumer-grade 
SSDs will still last years, so even that is probably not worth worrying about.

It has also been shown that having a delay may help spinning disks in some 
workloads, but my theoretical understanding is that this was unique to our 
tool: it caused the work producer (stress) to phase its work production so 
that there were fewer writes. This might be beneficial to some users as well, 
but that's hard to say, and probably only to a minority.

I've committed to 2.1, 2.2, 3.0 and trunk.


> Make batch commitlog mode easier to tune
> 
>
> Key: CASSANDRA-9533
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9533
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Jonathan Ellis
>Assignee: Benedict
> Fix For: 3.x
>
>
> As discussed in CASSANDRA-9504, 2.1 changed commitlog_sync_batch_window_in_ms 
> from a maximum time to wait between fsync to the minimum time, so one must be 
> very careful to keep it small enough that most writers aren't kept waiting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9932) Make all partitions btree backed

2015-08-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662973#comment-14662973
 ] 

Benedict commented on CASSANDRA-9932:
-

bq. This change to a large extent looks like a refactor and not new code so I 
am trying not to wade too deep into the existing code

This ticket is, in essence, a part of my review of CASSANDRA-8099, but filed as 
its own ticket to parallelize and make digestible. As such, I'm not sure how 
strongly it counts as "existing code" and nothing should be taken totally for 
granted here.

bq. With default methods it really seems to me that we always want @Override 
annotations

I agree, but rather than wait for a PMC adjudication on a change of practice, 
I've just opted to roll back the use of a default method in that particular 
change, since it wasn't really necessary.

bq. I ran code coverage for some tests and there seems to be untested stuff in 
BTree such as transformAndFilter. reverse() which is new for this patch also 
has no coverage.

{{transformAndFilter}} is already covered by {{LongBTreeTest}}, and is not new 
to this patch. I've added {{reverse()}} coverage. 

bq.  I just want to focus on how convincing the tests are
bq. I don't see unit tests for the various partition types in isolation 
AtomicBTreePartition from PartitionTest doesn't cover stuff like waste tracking 
or pessimistic locking. Is that coverage supposed to come indirectly via 
Memtable tests? AbstractBTreePartition also has stuff that isn't tested.

Test coverage is still a bit of a wasteland here. I would rather not 
bottleneck on it, however, since this patch is explicitly designed to reduce 
our exposure to poorly tested new code. We certainly can and should file a 
follow-up ticket to improve all of these areas, and I will try to address that 
before the 3.0 release, once the more pressing 8099 review-generated work is 
done.

> Make all partitions btree backed
> 
>
> Key: CASSANDRA-9932
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9932
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Benedict
>Assignee: Benedict
> Fix For: 3.0.0 rc1
>
>
> Following on from the other btree related refactors, this patch makes all 
> partition (and partition-like) objects backed by the same basic structure: 
> {{AbstractBTreePartition}}. With two main offshoots: 
> {{ImmutableBTreePartition}} and {{AtomicBTreePartition}}
> The main upshot is a 30% net code reduction, meaning better exercise of btree 
> code paths and fewer new code paths to go wrong. A secondary upshort is that, 
> by funnelling all our comparisons through a btree, there is a higher 
> likelihood of icache occupancy and we have only one area to focus delivery of 
> improvements for their enjoyment by all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

