[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644421#comment-14644421 ] Benedict commented on CASSANDRA-7066:
-------------------------------------
[~stefania]: Jake has correctly pointed out that we should document these changes in NEWS.txt. Would you mind composing a suitable blurb briefly explaining how the new system works from an operator's POV?

> Simplify (and unify) cleanup of compaction leftovers
> ----------------------------------------------------
>
>                 Key: CASSANDRA-7066
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7066
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>            Priority: Minor
>              Labels: benedict-to-commit, compaction
>             Fix For: 3.0 alpha 1
>
>         Attachments: 7066.txt
>
> Currently we manage a list of in-progress compactions in a system table, which we use to clean up incomplete compactions when we're done. The problem with this is that 1) it's a bit clunky (and leaves us in positions where we can unnecessarily clean up completed files, or conversely not clean up files that have been superseded); and 2) it's only used for regular compaction - no other compaction types are guarded in the same way, so we can end up with duplication if we fail before deleting the replacements.
> I'd like to see each sstable store its direct ancestors in its metadata, and on startup we simply delete any sstables that occur in the union of all ancestor sets. This way, as soon as we finish writing we're capable of cleaning up any leftovers, so we never get duplication. It's also much easier to reason about.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
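The startup rule proposed in the description (delete any sstable whose id appears in the union of all ancestor sets) can be sketched in a few lines. This is an illustrative model only, assuming integer generation ids; `SSTableInfo` and `leftovers` are hypothetical names, not Cassandra's actual classes:

```java
import java.util.*;

// Illustrative sketch: each sstable records the ids of the sstables it was
// compacted from ("ancestors"); any sstable whose id appears in the union of
// all ancestor sets found on disk is a superseded leftover.
public class LeftoverCleanup
{
    static class SSTableInfo
    {
        final int id;
        final Set<Integer> ancestors; // ids of the sstables this one replaced

        SSTableInfo(int id, Set<Integer> ancestors)
        {
            this.id = id;
            this.ancestors = ancestors;
        }
    }

    // Returns the ids to delete on startup: sstables present on disk whose id
    // occurs in some other sstable's ancestor set.
    static Set<Integer> leftovers(Collection<SSTableInfo> onDisk)
    {
        Set<Integer> ancestorUnion = new HashSet<>();
        for (SSTableInfo t : onDisk)
            ancestorUnion.addAll(t.ancestors);

        Set<Integer> toDelete = new HashSet<>();
        for (SSTableInfo t : onDisk)
            if (ancestorUnion.contains(t.id))
                toDelete.add(t.id); // superseded by a completed compaction
        return toDelete;
    }
}
```

For example, if sstable 3 lists ancestors {1, 2} and all three files are present, 1 and 2 are deleted; if only sstable 3 survived, nothing is.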
[jira] [Updated] (CASSANDRA-8894) Our default buffer size for (uncompressed) buffered reads should be smaller, and based on the expected record size
[ https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-8894:
--------------------------------
    Fix Version/s:     (was: 3.x)
                   3.0 alpha 1

> Our default buffer size for (uncompressed) buffered reads should be smaller, and based on the expected record size
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8894
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8894
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>              Labels: benedict-to-commit
>             Fix For: 3.0 alpha 1
>
>         Attachments: 8894_25pct.yaml, 8894_5pct.yaml, 8894_tiny.yaml
>
> A large contributor to buffered reads being slower than mmapped is likely that we read a full 64KB at once, when average record sizes may be as low as 140 bytes in our stress tests. The TLB has only 128 entries on a modern core, and each read will touch 32 of them, meaning we are almost never hitting the TLB and will be incurring at least 30 unnecessary misses each time (as well as the other costs of larger-than-necessary accesses). When working with an SSD there is little to no benefit to reading more than 4KB at once, and in either case reading more data than we need is wasteful. So, I propose selecting a buffer size that is the next larger power of 2 than our average record size (with a minimum of 4KB), so that we can expect to read a record in one operation. I also propose that we create a pool of these buffers up-front, and that we ensure they are all exactly aligned to a virtual page, so that the source and target operations each touch exactly one virtual page per 4KB of expected record size.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
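The sizing rule in the proposal (next power of two at or above the average record size, floored at 4KiB) is simple enough to sketch. This is a hypothetical illustration, not the actual patch; the 64KiB cap mirrors the old fixed default mentioned above:

```java
// Sketch of the proposed buffer sizing rule (illustrative, not Cassandra's
// code): round the average record size up to the next power of two, with a
// 4KiB floor so each read maps to whole virtual pages, and a 64KiB ceiling
// matching the previous fixed buffer size.
public class ReadBufferSize
{
    static final int MIN_BUFFER_SIZE = 4096;   // one 4KiB virtual page
    static final int MAX_BUFFER_SIZE = 65536;  // the old fixed default

    static int bufferSize(int avgRecordSize)
    {
        // highestOneBit rounds DOWN to a power of two; double it if the
        // average was not already an exact power of two.
        int size = Integer.highestOneBit(Math.max(avgRecordSize, 1));
        if (size < avgRecordSize)
            size <<= 1;
        return Math.min(Math.max(size, MIN_BUFFER_SIZE), MAX_BUFFER_SIZE);
    }
}
```

With the 140-byte average from the stress tests this yields the 4KiB floor, so a single read touches one source page instead of sixteen.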
[jira] [Commented] (CASSANDRA-8894) Our default buffer size for (uncompressed) buffered reads should be smaller, and based on the expected record size
[ https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1466#comment-1466 ] Benedict commented on CASSANDRA-8894:
-------------------------------------
Thanks. Committed. I'll hold off closing this ticket until we can get those performance tests. [~enigmacurry], any idea if/when cstar will support the readahead options?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9908) Potential race caused by async cleanup of transaction log files
[ https://issues.apache.org/jira/browse/CASSANDRA-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644500#comment-14644500 ] Benedict commented on CASSANDRA-9908:
-------------------------------------
Hmm. So our secondary index files don't use their cfId in their directory name?

> Potential race caused by async cleanup of transaction log files
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-9908
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9908
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sam Tunnicliffe
>            Assignee: Stefania
>             Fix For: 3.0 beta 1
>
> There seems to be a potential race in the cleanup of transaction log files, introduced in CASSANDRA-7066. It's pretty hard to trigger on trunk, but it's possible to hit it via {{o.a.c.db.SecondaryIndexTest#testCreateIndex}}.
> That test creates an index, then removes it to check that the removal is correctly recorded, then adds the index again to assert that it gets rebuilt from the existing data.
> The removal causes the SSTables of the index CFS to be dropped, which is a transactional operation and so writes a transaction log. When the drop is completed and the last reference to an SSTable is released, the cleanup of the transaction log is scheduled on the periodic tasks executor. The issue is that re-creating the index re-creates the index CFS. When this happens, it's possible for the cleanup of the txn log to have not yet happened. If so, the initialization of the CFS attempts to read the log to identify any orphaned temporary files.
> The cleanup can happen between finding the log file and reading its contents, which results in a {{NoSuchFileException}}:
> {noformat}
> [junit] java.nio.file.NoSuchFileException: build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] java.lang.RuntimeException: java.nio.file.NoSuchFileException: build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] 	at org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:620)
> [junit] 	at org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionFile.getTrackedFiles(TransactionLogs.java:190)
> [junit] 	at org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionData.getTemporaryFiles(TransactionLogs.java:338)
> [junit] 	at org.apache.cassandra.db.lifecycle.TransactionLogs.getTemporaryFiles(TransactionLogs.java:739)
> [junit] 	at org.apache.cassandra.db.lifecycle.LifecycleTransaction.getTemporaryFiles(LifecycleTransaction.java:541)
> [junit] 	at org.apache.cassandra.db.Directories$SSTableLister.getFilter(Directories.java:652)
> [junit] 	at org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:641)
> [junit] 	at org.apache.cassandra.db.Directories$SSTableLister.list(Directories.java:606)
> [junit] 	at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:351)
> [junit] 	at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:313)
> [junit] 	at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:511)
> [junit] 	at org.apache.cassandra.index.internal.CassandraIndexer.addIndexedColumn(CassandraIndexer.java:115)
> [junit] 	at org.apache.cassandra.index.SecondaryIndexManager.addIndexedColumn(SecondaryIndexManager.java:265)
> [junit] 	at org.apache.cassandra.db.SecondaryIndexTest.testIndexCreate(SecondaryIndexTest.java:467)
> [junit] Caused by: java.nio.file.NoSuchFileException: build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log
> [junit] 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> [junit] 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> [junit] 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> [junit] 	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
> [junit] 	at java.nio.file.Files.newByteChannel(Files.java:361)
> [junit] 	at java.nio.file.Files.newByteChannel(Files.java:407)
> [junit] 	at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
> [junit] 	at java.nio.file.Files.newInputStream(Files.java:152)
> [junit] 	at java.nio.file.Files.newBufferedReader(Files.java:2784)
> [junit] 	at java.nio.file.Files.readAllLines(Fil
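A minimal sketch of the kind of defensive read that tolerates this race, assuming nothing about the actual fix: a log found by a directory listing may be deleted by the async cleanup before we read it, so a missing file is treated as "already cleaned up" instead of surfacing as a RuntimeException. `TxnLogReader` and `readLinesIfPresent` are hypothetical names:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

// Illustrative race-tolerant read: the periodic-tasks executor may delete the
// transaction log between the directory listing and the read.
public class TxnLogReader
{
    static List<String> readLinesIfPresent(Path txnLog) throws IOException
    {
        try
        {
            return Files.readAllLines(txnLog);
        }
        catch (NoSuchFileException e)
        {
            // The async cleanup won the race and removed the log; there are
            // no temporary files left to track for this transaction.
            return Collections.emptyList();
        }
    }
}
```

An exists() check before the read would not close the window; catching the exception (or serializing CFS creation behind log cleanup, as discussed below) is what actually removes it.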
[jira] [Commented] (CASSANDRA-9908) Potential race caused by async cleanup of transaction log files
[ https://issues.apache.org/jira/browse/CASSANDRA-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644539#comment-14644539 ] Benedict commented on CASSANDRA-9908:
-------------------------------------
Doesn't that sort of obviate much of the supposed benefit of unique cfIds? Anyway, I guess in this instance we can just make sure that all txn logs have been fully cleared before creating a secondary index. The rub will be if the cleanup fails, though (deletion on Windows, for instance), or there are still extant readers. In either case the cleanup can't safely complete.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9382) Snapshot file descriptors not getting purged (possible fd leak)
[ https://issues.apache.org/jira/browse/CASSANDRA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644592#comment-14644592 ] Benedict commented on CASSANDRA-9382:
-------------------------------------
+1. This bug shouldn't affect 2.1+; however, there's no reason to track hotness anyway, even if it closes the resources safely.

> Snapshot file descriptors not getting purged (possible fd leak)
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-9382
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9382
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Mark Curtis
>            Assignee: Yuki Morishita
>             Fix For: 2.1.x, 2.0.x
>
>         Attachments: yjp-heapdump.png
>
> OpsCenter has the repair service, which does a lot of small range repairs. Each repair would generate a snapshot as per normal. The cluster was showing a steady increase in disk space over the course of a couple of days, and the only way to work around the issue was to restart the node.
> Upon some further inspection it was seen that lsof output for the cassandra process was still showing file descriptors for snapshots that no longer existed on the file system. For example:
> {code}
> java 5822 cassandra DEL REG 202,32 7359833 /media/ephemeral1/cassandra/data/somekeyspace/table1/snapshots/669a3a30-f3d3-11e4-bec6-3f6c4fb06498/somekeyspace-table1-jb-897689-Data.db
> {code}
> We also took a heap dump, which basically showed the same thing: lots of references to these file handles. We checked the logs for any errors, especially relating to repairs that might have failed, but there was nothing observed.
> The repair service logs in OpsCenter also showed that all repairs (1000s of them) had completed successfully, again showing that there was no repair issue.
> I have not yet been able to reproduce the issue locally on a test box. The cluster that this original issue appeared on was a production cluster with the following spec:
> cassandra_version: 2.0.14.352
> cluster_cores: 8
> cluster_instance_types: i2.2xlarge
> cluster_os: Amazon Linux amd64
> node count: 4
> node java version: Oracle Java 1.7.0_51

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644607#comment-14644607 ] Benedict commented on CASSANDRA-7066:
-------------------------------------
If you lose one of these files you are SOL; there's not a lot that can be done to help with a filesystem failure over one of these. If you want to try to recover, the best we can do is point you to the description of how they work and let you have a go at fixing it however you feel best. Preferably restoring from backup.
If operators are messing with their sstable directory contents, this change simply means any file present at startup will be considered a live file. That's all the operator should really need to know, from that POV. If restoring from backup, these logs should definitely all be deleted. I'm not sure what tools we offer to hand-hold backup/restore, but they should perhaps be made to understand this.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644623#comment-14644623 ] Benedict commented on CASSANDRA-7066:
-------------------------------------
bq. Is that SOL as in "Now we can't tell which is new so we have to keep both and do redundant compaction" or as in "Now the node can't start up"?

Well, SOL in that we have no way of guaranteeing C* will be able to notice this has happened and do anything to avoid breaking things. If we see just one of the two files present, under non-fs failure that means we were part way through cleaning up, and need to finish deleting its contents. So if we lose the "new" file, say, then we will delete the old files on startup, none the wiser. The result is a partial replacement of the sstables (perhaps with nothing at all).
There's really nothing you can do to guarantee anything in the face of filesystem corruption. If the operator _knows this has happened_ before starting up, they can delete all the log files, sure, and then we'll keep everything.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644635#comment-14644635 ] Benedict commented on CASSANDRA-7066:
-------------------------------------
Well, if your file system is losing files you're not safe either way. The question is how noticeable the loss of files is.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644685#comment-14644685 ] Benedict commented on CASSANDRA-7066:
-------------------------------------
Well, a bug in the implementation could screw us up either way, and I'm not sure one is more robust to it than any other. But I'm certainly not terribly opposed to changing it again. It shouldn't be a drastic change. However, if we're doing that, I'd rather we just went with a simple log file that represents new and old in one. i.e., we write lines like:
{noformat}
add:sstable-3
remove:sstable-2
commit
{noformat}
"commit" is only written very last, if we are removing the old files and adding the new ones. Otherwise we roll back. This makes the changes pretty minimal, as behaviourally it's identical; it's just the on-disk representation that changes. It also retains the benefit of not double-counting your data.
If we want to be _really_ secure, we can postfix each line with a checksum for the entire file (up to that point), and if any do not match we retain every file as a last-ditch fallback. We can also log panics in that case, so the operator knows for sure something bad is happening with their filesystem. (If only the last line does not match, and it is not "commit", we're as safe as we can be to roll back - but perhaps in this case we just log less panic-stricken warnings that they can consider deleting the duplicate files.)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
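The recovery semantics of the log format sketched in the comment above can be expressed in a few lines. `CompactionTxnLog` and `filesToDelete` are hypothetical names, and the checksum refinement is omitted: if the final "commit" marker made it to disk, the old ("remove") files are the leftovers to delete; otherwise the partially written new ("add") files are rolled back.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative replay of the proposed add/remove/commit log on startup.
public class CompactionTxnLog
{
    // Given the lines of one transaction log, return the files to delete.
    static List<String> filesToDelete(List<String> logLines)
    {
        List<String> added = new ArrayList<>();
        List<String> removed = new ArrayList<>();
        boolean committed = false;

        for (String line : logLines)
        {
            if (line.startsWith("add:"))
                added.add(line.substring(4));
            else if (line.startsWith("remove:"))
                removed.add(line.substring(7));
            else if (line.equals("commit"))
                committed = true;
        }
        // Committed: the new files are live; finish deleting the old ones.
        // Not committed: roll back by deleting the partially written new files.
        return committed ? removed : added;
    }
}
```

So for the example log above, recovery deletes sstable-2; if the crash happened before "commit" was written, it deletes sstable-3 instead and the old generation remains live.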
[jira] [Comment Edited] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644685#comment-14644685 ] Benedict edited comment on CASSANDRA-7066 at 7/28/15 5:34 PM: -- Well, a bug in the implementation could screw us up either way, and I'm not sure one is more robust to it than any other. But I'm certainly not terribly opposed to changing it again. It shouldn't be a drastic change. However if we're doing that, I'd rather we just went with a simple log file that represents new and old in one. i.e., we write lines like: {noformat} add:sstable-3 remove:sstable-2 commit {noformat} commit is only written very last if we are removing the old ones and adding the new ones. Otherwise we rollback. This makes the changes pretty minimal, as behaviourally it's identical, it's just the on-disk representation that changes. It also retains the benefit of not double-counting your data. If we want to be _really_ secure, we can post-fix each line with a checksum for the entire file (up to the point), and if any do not match we retain every file as a last-ditch fallback. We can also log panics in that case, so the operator knows for sure something bad is happening with their filesystem. (if only the last line does not match, and it is not "commit", we're as safe as we can be to rollback - but perhaps in this case we just log less panic-stricken warnings that they can consider deleting the duplicate files). (edit: this is actually pretty necessary, thinking about it, since deletion of the old files can be _very_ delayed if they're involved in a long running operation, so we run a pretty significant risk of duplicated data. perhaps much more than doubling...) was (Author: benedict): Well, a bug in the implementation could screw us up either way, and I'm not sure one is more robust to it than any other. But I'm certainly not terribly opposed to changing it again. It shouldn't be a drastic change. 
However if we're doing that, I'd rather we just went with a simple log file that represents new and old in one. i.e., we write lines like:

{noformat}
add:sstable-3
remove:sstable-2
commit
{noformat}

commit is only written very last if we are removing the old ones and adding the new ones. Otherwise we rollback. This makes the changes pretty minimal, as behaviourally it's identical; it's just the on-disk representation that changes. It also retains the benefit of not double-counting your data.

If we want to be _really_ secure, we can post-fix each line with a checksum for the entire file (up to that point), and if any do not match we retain every file as a last-ditch fallback. We can also log panics in that case, so the operator knows for sure something bad is happening with their filesystem. (If only the last line does not match, and it is not "commit", we're as safe as we can be to rollback - but perhaps in this case we just log less panic-stricken warnings that they can consider deleting the duplicate files.)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
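The startup-time interpretation of the single log file sketched above can be expressed compactly. This is a minimal illustrative sketch, not code from the actual patch; the class and record prefixes ("add:" / "remove:") are hypothetical names chosen to match the example lines in the comment.

```java
import java.util.List;

// Hypothetical sketch of the single-log-file scheme: each line records an
// added or removed sstable, and a final "commit" line marks the transaction
// as complete. On startup, a committed log means the old (removed) sstables
// are leftovers to delete; an uncommitted log means the new (added) ones
// must be rolled back instead.
class TxnLog {
    static boolean isCommitted(List<String> lines) {
        return !lines.isEmpty() && lines.get(lines.size() - 1).equals("commit");
    }

    /** Files that should be deleted on startup, given the log contents. */
    static List<String> leftoversToDelete(List<String> lines) {
        String prefix = isCommitted(lines) ? "remove:" : "add:";
        return lines.stream()
                    .filter(l -> l.startsWith(prefix))
                    .map(l -> l.substring(prefix.length()))
                    .toList();
    }
}
```

Either branch deletes files that are guaranteed redundant, which is why the scheme avoids double-counting data regardless of where a crash lands.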
[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644718#comment-14644718 ] Benedict commented on CASSANDRA-7066:
-
bq. The value of the 'old way' was the naming was intrinsically tied to the state of the sstable and the renames were atomic.

A single rename was atomic. A set of renames was not.

There's no reason to ever backup these files, I don't think? As said, the worst-case scenario if they're missing is that you have too much data. However backups should only be operating over snapshots, shouldn't they? So they don't even see these files.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reopened CASSANDRA-7066:
-
[~stefania]: I'm not sure if we entirely settled on a modification, but it sounds like we should revisit this a little, so I'm reopening. It would be best if this system got finalised prior to the beta, and if possible the alpha, so we should probably focus on this ahead of other tickets.

I think the proposed changes should deal with any of the outlined concerns, however I think they can be made even more robust to backup interplay: if we also log the last modified time of the file we're deleting alongside it, then we can bail out if any of these mismatch for _any_ file, and we can log some panics. This should all make us very paranoid and prefer retention of data, as well as make us a little more robust to filesystem failures and perhaps also bugs of our own making. Although we will never be truly shielded from any of these issues.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645754#comment-14645754 ] Benedict commented on CASSANDRA-7066:
-
Thanks Stef. A couple of comments about the description:
* It would be helpful to elaborate "replace the compactions_in_progress system table" to include a mention of the elimination of temporary file markers.
* It would be helpful to mention that if restoring from backup, these files should either be deleted (in the case of a full backup), or confirmed not to reference the files they're restoring
** Perhaps we should offer a tool, like sstablelister, to offline complete any transaction log files, which can be specified as prep-work for backup restore
* I've noticed a few English speakers (including in this conversation here) use "loose" for "lose", which no doubt encouraged you to do the same. But it's incorrect :)

As far as the patch goes, it looks good to me on the whole (though I'll give it a full review once we've finalised the details). In the meantime, the incremental checksum at the very least seems like a good idea to detect even slight corruption of the log file (on the assumption that such corruption might indicate we've lost the commit record, and had been in the process of it, which seems to be one of the concerns).

I agree about individual files over-complicating things, however we _could_ perhaps take the maximum last modified of all files with that descriptor, and then delete the files in ascending order of last modified. This would just help the log files be robust to backups being copied in with these files still present and referencing them as old, and also to some potential effects of problems like CASSANDRA-9908.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
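The incremental checksum discussed above - each line suffixed with a checksum of the whole file up to that point - can be sketched as follows. This is an illustrative sketch only, using CRC32 and hypothetical names, not the representation the actual patch chose.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Each record is suffixed with a CRC32 over everything written so far, so
// truncation or corruption anywhere in the log is detectable on replay:
// verification stops at the first line whose running checksum mismatches.
class ChecksummedLog {
    static List<String> withChecksums(List<String> records) {
        CRC32 crc = new CRC32();
        List<String> out = new ArrayList<>();
        for (String r : records) {
            crc.update(r.getBytes(StandardCharsets.UTF_8));
            out.add(r + "," + crc.getValue());
        }
        return out;
    }

    /** Returns the prefix of records whose running checksum verifies. */
    static List<String> verified(List<String> lines) {
        CRC32 crc = new CRC32();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            int comma = line.lastIndexOf(',');
            if (comma < 0) break;
            String record = line.substring(0, comma);
            crc.update(record.getBytes(StandardCharsets.UTF_8));
            if (!line.substring(comma + 1).equals(Long.toString(crc.getValue()))) break;
            out.add(record);
        }
        return out;
    }
}
```

If verification stops before a "commit" record, the reader knows the transaction may not have completed and can fall back to retaining every file, as the comments above suggest.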
[jira] [Commented] (CASSANDRA-9738) Migrate key-cache to be fully off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645807#comment-14645807 ] Benedict commented on CASSANDRA-9738: - Looks like improved occupancy is also a factor (comparing run 2 to run 3). I'd prefer to eliminate the malloc/free from cache maintenance before we hook OHC into the key cache, but these numbers are pretty compelling and we should probably consider it for 3.0. I suspect CASSANDRA-8930 will permit us to deliver a more efficient and high occupancy key-cache, but that shouldn't prevent us doing something in the meantime. > Migrate key-cache to be fully off-heap > -- > > Key: CASSANDRA-9738 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9738 > Project: Cassandra > Issue Type: Sub-task >Reporter: Robert Stupp >Assignee: Robert Stupp > Fix For: 3.x > > > Key cache still uses a concurrent map on-heap. This could go to off-heap and > feels doable now after CASSANDRA-8099. > Evaluation should be done in advance based on a POC to prove that pure > off-heap counter cache buys a performance and/or gc-pressure improvement. > In theory, elimination of on-heap management of the map should buy us some > benefit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9669) If sstable flushes complete out of order, on restart we can fail to replay necessary commit log records
[ https://issues.apache.org/jira/browse/CASSANDRA-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645845#comment-14645845 ] Benedict commented on CASSANDRA-9669:
-
So, I have a patch available for this [here|https://github.com/belliottsmith/cassandra/tree/9669-2.0]. I managed to make it less invasive than I had anticipated, but it still requires an sstable version increment. The patch:
* Introduces a commitLogLowerBound to the memtable, which tracks the commit log position at its creation
* Changes sstable metadata's "replayPosition" into "commitLogLowerBound" and "commitLogUpperBound" in the new sstable version
* Delays exposing a new sstable to the compaction strategy until all of its preceding flushes have completed
* On compaction, extends the new sstable's lower/upper bounds to the min/max of all sstables we're replacing. Given (3), we only extend over ranges that are known to already be covered by other sstables.
* On replay, we take any range covered by an sstable to not need replay (and any range prior to the earliest known safe range is also ignored)

Test Engineering: there are failures on dtests, but I cannot tell if these are new or existing. Mostly they look like flaky tests. The one that looks most worrisome to me is the counter upgrade test, but could you take a look and tell me what you think of the test situation in general?
Modifying 2.0 makes me uncomfortable > If sstable flushes complete out of order, on restart we can fail to replay > necessary commit log records > --- > > Key: CASSANDRA-9669 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9669 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Critical > Labels: correctness > Fix For: 3.x, 2.1.x, 2.2.x, 3.0.x > > > While {{postFlushExecutor}} ensures it never expires CL entries out-of-order, > on restart we simply take the maximum replay position of any sstable on disk, > and ignore anything prior. > It is quite possible for there to be two flushes triggered for a given table, > and for the second to finish first by virtue of containing a much smaller > quantity of live data (or perhaps the disk is just under less pressure). If > we crash before the first sstable has been written, then on restart the data > it would have represented will disappear, since we will not replay the CL > records. > This looks to be a bug present since time immemorial, and also seems pretty > serious. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
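The replay rule in the bullet list above can be sketched as an interval-coverage check. This is a simplified illustration with hypothetical types (commit log positions reduced to plain longs), not the actual patch's ReplayPosition handling.

```java
import java.util.List;

// Each sstable on disk carries a [commitLogLowerBound, commitLogUpperBound]
// interval. A commit log position needs no replay if some sstable's interval
// covers it - its data is already durably flushed. Only uncovered positions
// are replayed, which is what makes out-of-order flush completion safe.
class ReplayBounds {
    public record Interval(long lower, long upper) {}

    /** True if {@code position} falls inside some sstable's flushed interval. */
    static boolean covered(long position, List<Interval> flushed) {
        return flushed.stream().anyMatch(i -> i.lower() <= position && position <= i.upper());
    }
}
```

Extending a compacted sstable's bounds to the min/max of its inputs (bullet 4) keeps this check correct after the inputs are deleted, since the union of covered intervals never shrinks.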
[jira] [Commented] (CASSANDRA-9908) Potential race caused by async cleanup of transaction log files
[ https://issues.apache.org/jira/browse/CASSANDRA-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645925#comment-14645925 ] Benedict commented on CASSANDRA-9908: - Probably we should at least introduce a {{TransactionLogs.waitForDeletions}} call after any schema change. We should also make certain dropping an index awaits a new writeOrder and readOrder barrier prior to this. > Potential race caused by async cleanup of transaction log files > --- > > Key: CASSANDRA-9908 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9908 > Project: Cassandra > Issue Type: Bug >Reporter: Sam Tunnicliffe >Assignee: Stefania > Fix For: 3.0 beta 1 > > Attachments: TEST-org.apache.cassandra.db.SecondaryIndexTest.log > > > There seems to be a potential race in the cleanup of transaction log files, > introduced in CASSANDRA-7066 > It's pretty hard to trigger on trunk, but it's possible to hit it via > {{o.a.c.db.SecondaryIndexTest#testCreateIndex}} > That test creates an index, then removes it to check that the removal is > correctly recorded, then adds the index again to assert that it gets rebuilt > from the existing data. > The removal causes the SSTables of the index CFS to be dropped, which is a > transactional operation and so writes a transaction log. When the drop is > completed and the last reference to an SSTable is released, the cleanup of > the transaction log is scheduled on the periodic tasks executor. The issue is > that re-creating the index re-creates the index CFS. When this happens, it's > possible for the cleanup of the txn log to have not yet happened. If so, the > initialization of the CFS attempts to read the log to identify any orphaned > temporary files. 
The cleanup can happen between the finding the log file and > reading it's contents, which results in a {{NoSuchFileException}} > {noformat} > [junit] java.nio.file.NoSuchFileException: > build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log > [junit] java.lang.RuntimeException: java.nio.file.NoSuchFileException: > build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log > [junit] at > org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:620) > [junit] at > org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionFile.getTrackedFiles(TransactionLogs.java:190) > [junit] at > org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionData.getTemporaryFiles(TransactionLogs.java:338) > [junit] at > org.apache.cassandra.db.lifecycle.TransactionLogs.getTemporaryFiles(TransactionLogs.java:739) > [junit] at > org.apache.cassandra.db.lifecycle.LifecycleTransaction.getTemporaryFiles(LifecycleTransaction.java:541) > [junit] at > org.apache.cassandra.db.Directories$SSTableLister.getFilter(Directories.java:652) > [junit] at > org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:641) > [junit] at > org.apache.cassandra.db.Directories$SSTableLister.list(Directories.java:606) > [junit] at > org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:351) > [junit] at > org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:313) > [junit] at > org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:511) > [junit] at > org.apache.cassandra.index.internal.CassandraIndexer.addIndexedColumn(CassandraIndexer.java:115) > [junit] at > 
org.apache.cassandra.index.SecondaryIndexManager.addIndexedColumn(SecondaryIndexManager.java:265) > [junit] at > org.apache.cassandra.db.SecondaryIndexTest.testIndexCreate(SecondaryIndexTest.java:467) > [junit] Caused by: java.nio.file.NoSuchFileException: > build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log > [junit] at > sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) > [junit] at > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) > [junit] at > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) > [junit] at > sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) > [junit] at java.nio.file.Files.newByteChannel(Files.java:361) > [junit] at java.nio.file.Files.newByteChannel(Files.java:407) > [junit] at > java.nio.file.spi.FileSystemProvider.newInputStream(FileSyste
[jira] [Comment Edited] (CASSANDRA-9738) Migrate key-cache to be fully off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645807#comment-14645807 ] Benedict edited comment on CASSANDRA-9738 at 7/29/15 2:25 PM:
--
Looks like improved occupancy is also a factor (comparing run 2 to run 3). I'd prefer to eliminate the malloc/free from cache maintenance before we hook OHC into the key cache, but these numbers are pretty compelling and we should probably consider it for 3.0. I suspect CASSANDRA-8931 will permit us to deliver a more efficient and high occupancy key-cache, but that shouldn't prevent us doing something in the meantime.

was (Author: benedict): Looks like improved occupancy is also a factor (comparing run 2 to run 3). I'd prefer to eliminate the malloc/free from cache maintenance before we hook OHC into the key cache, but these numbers are pretty compelling and we should probably consider it for 3.0. I suspect CASSANDRA-8930 will permit us to deliver a more efficient and high occupancy key-cache, but that shouldn't prevent us doing something in the meantime.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9738) Migrate key-cache to be fully off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-9738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646112#comment-14646112 ] Benedict commented on CASSANDRA-9738:
-
Yes. Thanks
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8931) IndexSummary (and Index) should store the token, and the minimal key to unambiguously direct a query
[ https://issues.apache.org/jira/browse/CASSANDRA-8931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646156#comment-14646156 ] Benedict commented on CASSANDRA-8931: - Right > IndexSummary (and Index) should store the token, and the minimal key to > unambiguously direct a query > > > Key: CASSANDRA-8931 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8931 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Stefania > Labels: performance > > Since these files are likely sticking around a little longer, it is probably > worth optimising them. A relatively simple change to Index and IndexSummary > could reduce the amount of space required significantly, reduce the CPU > burden of lookup, and hopefully bound the amount of space needed as key size > grows. On writing first we always store the token before the key (if it is > different to the key); then we simply truncate the whole record to the > minimum length necessary to answer an inequality search. Since the data file > contains the key also, we can corroborate we have the right key once we've > looked up. Since BFs are used to reduce unnecessary lookups, we don't save > much by ruling the false positives out one step earlier. > An improved follow up version would be to use a trie of shortest length to > answer inequality lookups, as this would also ensure very long keys with > common prefixes would not significantly increase the size of the index or > summary. This would translate to a trie index for the summary keying into a > static trie page for the index. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
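The truncation idea in the CASSANDRA-8931 description - store only "the minimum length necessary to answer an inequality search" - can be illustrated on strings. This is a hypothetical sketch of the general technique, not the ticket's actual byte-level implementation; the class and method names are made up.

```java
// Keep only the shortest prefix of each key that still sorts strictly after
// its predecessor. Any lookup key k with previous < k <= current is routed
// correctly by the truncated separator, and the data file's full key is used
// afterwards to corroborate the match, as the ticket describes.
class MinimalKeys {
    static String minimalSeparator(String previous, String current) {
        for (int len = 1; len <= current.length(); len++) {
            String prefix = current.substring(0, len);
            if (prefix.compareTo(previous) > 0)
                return prefix;
        }
        return current; // keys with a shared prefix may need full length
    }
}
```

This is also why long keys with common prefixes motivate the follow-up trie design: the minimal separator of "apple" and "apricot" is just "apr", but two keys differing only in their last byte cannot be truncated at all.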
[jira] [Created] (CASSANDRA-9918) Introduce extra Ref usage when debugrefcount=true
Benedict created CASSANDRA-9918: --- Summary: Introduce extra Ref usage when debugrefcount=true Key: CASSANDRA-9918 URL: https://issues.apache.org/jira/browse/CASSANDRA-9918 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Fix For: 3.0.x To help us in testing catch API misuse, especially resource handling, we can introduce an abstract base class implementing AutoCloseable that under normal circumstances is empty, but when ref count debugging is enabled instantiates a Ref instance that it releases on close. This will let us detect objects we are misusing that we consider too high-traffic for normal systems to have tracked by Refs. Ideally CASSANDRA-9379 will be delivered around the same time, so that this can be enabled on live systems without _much_ negative impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
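The CASSANDRA-9918 proposal above can be sketched as a base class. Everything here is illustrative - the class name, the system property, and the boolean stand-in for the real Ref instance are all assumptions, not the ticket's implementation.

```java
// Hypothetical sketch: an AutoCloseable base class that is a no-op under
// normal circumstances, but acquires a leak-tracking reference when
// ref-count debugging is enabled. In the real proposal the tracking state
// would be a Ref that reports a leak if the object is garbage collected
// without close() having been called; here it is modelled as a flag.
abstract class DebugTracked implements AutoCloseable {
    static final boolean DEBUG_REF_COUNT = Boolean.getBoolean("cassandra.debugrefcount");

    private boolean tracked = DEBUG_REF_COUNT;

    boolean isTracked() { return tracked; }

    @Override
    public void close() {
        tracked = false; // release the debug reference, if any
    }
}
```

Because the base class is empty when the flag is off, high-traffic objects pay nothing in production while tests can still catch missing close() calls.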
[jira] [Created] (CASSANDRA-9919) Cleanup closeable iterator usage
Benedict created CASSANDRA-9919: --- Summary: Cleanup closeable iterator usage Key: CASSANDRA-9919 URL: https://issues.apache.org/jira/browse/CASSANDRA-9919 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Fix For: 3.0.0 rc1 Our iterators are all sensibly AutoCloseable, but we still have a lot of places where we close the iterator if we reach the end in next() / hasNext(). This behaviour will only mask problems, as any exceptions in normal processing will miss these pathways. Since we much more heavily depend on iterators now, we need to be certain they are rock solid. So I would prefer we remove our crutches and work through any pain earlier. CASSANDRA-9918 can then help us catch any misuse more easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
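The discipline CASSANDRA-9919 argues for - the caller, not the iterator, owns the close - can be sketched like this. The names are illustrative, not Cassandra's actual interfaces.

```java
import java.util.Iterator;

// If an iterator closes itself when hasNext() reaches the end, any exception
// that aborts iteration early skips the close and leaks the resource. With
// try-with-resources the close is guaranteed on every path, success or not.
class Iterators {
    interface CloseableIterator<T> extends Iterator<T>, AutoCloseable {
        @Override void close(); // narrowed: no checked exception
    }

    /** Consumes the iterator, always closing it; returns the item count. */
    static <T> int countAndClose(CloseableIterator<T> iter) {
        try (iter) {
            int n = 0;
            while (iter.hasNext()) { iter.next(); n++; }
            return n;
        }
    }
}
```

Removing the self-closing "crutch" means a forgotten close fails loudly (which CASSANDRA-9918's debug Refs can then catch) rather than being silently papered over on the happy path.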
[jira] [Commented] (CASSANDRA-9865) Broken vint encoding, at least when interacting with OHCProvider
[ https://issues.apache.org/jira/browse/CASSANDRA-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646328#comment-14646328 ] Benedict commented on CASSANDRA-9865: - +1 > Broken vint encoding, at least when interacting with OHCProvider > > > Key: CASSANDRA-9865 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9865 > Project: Cassandra > Issue Type: Bug >Reporter: Sylvain Lebresne >Assignee: Ariel Weisberg > Fix For: 3.0 alpha 1 > > Attachments: 9865-hacky-test.txt > > > I haven't investigated this very closely so I only have a slightly hacky way > to show the problem, but if you apply the patch attached, you'll see that the > vints serialized and the one deserialized are not the same ones. If you > remove the use of vints (as is currently on trunk, but only due to this issue > because we do want to use vints), everything works correctly. > I'm honestly not sure where the problem is, but it sounds like it could be > either in {{NIODataInputStream}} or in the {{OHCProvider}} since it's used on > that test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9925) Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder directly
Benedict created CASSANDRA-9925: --- Summary: Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder directly Key: CASSANDRA-9925 URL: https://issues.apache.org/jira/browse/CASSANDRA-9925 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 3.0.x There are now a number of situations where we use a BTree.Builder that could be made to go directly to the TreeBuilder, since they only perform in-order unique additions to build an initial tree. This would potentially avoid a number of array allocations/copies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9896) Add ability to disable commitlog recycling
[ https://issues.apache.org/jira/browse/CASSANDRA-9896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9896: Fix Version/s: (was: 2.1.x) 2.1.9 > Add ability to disable commitlog recycling > -- > > Key: CASSANDRA-9896 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9896 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Brandon Williams >Assignee: Benedict > Fix For: 2.1.9 > > > See CASSANDRA-9533 for background, specifically the graphs I linked. > Benedict suggests this is due to the commitlog recycling and I agree, so the > simplest solution is to be able to disable that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9525) Commitlog allocation failure doesn't stop the entire node.
[ https://issues.apache.org/jira/browse/CASSANDRA-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9525: Fix Version/s: (was: 2.1.x) > Commitlog allocation failure doesn't stop the entire node. > -- > > Key: CASSANDRA-9525 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9525 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Jeremiah Jordan > > If there is an error opening a commit log segment, the thread dies, but it > doesn't stop the whole node. > Got the following on a node: > {noformat} > ERROR [COMMIT-LOG-ALLOCATOR] 2015-05-29 17:06:45,722 CommitLog.java:397 - > Failed managing commit log segments. Commit disk failure policy is stop; > terminating thread > org.apache.cassandra.io.FSWriteError: java.io.FileNotFoundException: > /var/lib/cassandra/commitlog/CommitLog-4-1432937194590.log (Permission denied) > at > org.apache.cassandra.db.commitlog.CommitLogSegment.(CommitLogSegment.java:177) > ~[cassandra-all-2.1.5.jar:2.1.5] > at > org.apache.cassandra.db.commitlog.CommitLogSegmentManager$4.call(CommitLogSegmentManager.java:397) > ~[cassandra-all-2.1.5.jar:2.1.5] > at > org.apache.cassandra.db.commitlog.CommitLogSegmentManager$4.call(CommitLogSegmentManager.java:394) > ~[cassandra-all-2.1.5.jar:2.1.5] > at > org.apache.cassandra.db.commitlog.CommitLogSegmentManager$1.runMayThrow(CommitLogSegmentManager.java:152) > ~[cassandra-all-2.1.5.jar:2.1.5] > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > [cassandra-all-2.1.5.jar:2.1.5] > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45] > Caused by: java.io.FileNotFoundException: > /var/lib/cassandra/commitlog/CommitLog-4-1432937194590.log (Permission denied) > at java.io.RandomAccessFile.open0(Native Method) ~[na:1.8.0_45] > at java.io.RandomAccessFile.open(RandomAccessFile.java:316) > ~[na:1.8.0_45] > at java.io.RandomAccessFile.(RandomAccessFile.java:243) > ~[na:1.8.0_45] > at > 
org.apache.cassandra.db.commitlog.CommitLogSegment.(CommitLogSegment.java:155) > ~[cassandra-all-2.1.5.jar:2.1.5] > ... 5 common frames omitted > {noformat} > And the node stayed kind of up, didn't notice something wrong until the node > died OOM because some threads were dead and others weren't. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9471) Columns should be backed by a BTree, not an array
[ https://issues.apache.org/jira/browse/CASSANDRA-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646593#comment-14646593 ] Benedict commented on CASSANDRA-9471: - Patch available [here|https://github.com/belliottsmith/cassandra/tree/9471] > Columns should be backed by a BTree, not an array > - > > Key: CASSANDRA-9471 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9471 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Fix For: 3.0 beta 1 > > > Follow up to 8099. > We have pretty terrible lookup performance as the number of columns grows > (linear). In at least one location, this results in quadratic performance. > We don't however want this structure to be either any more expensive to > build, nor to store. Some small modifications to BTree will permit it to > serve here, by permitting efficient lookup by index, and calculation _of_ > index for a given key. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9586) ant eclipse-warnings fails in trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-9586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9586: Fix Version/s: (was: 3.x) 3.0 alpha 1 > ant eclipse-warnings fails in trunk > --- > > Key: CASSANDRA-9586 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9586 > Project: Cassandra > Issue Type: Bug >Reporter: Michael Shuler >Assignee: Stefania > Fix For: 3.0 alpha 1 > > > {noformat} > eclipse-warnings: > [mkdir] Created dir: /home/mshuler/git/cassandra/build/ecj > [echo] Running Eclipse Code Analysis. Output logged to > /home/mshuler/git/cassandra/build/ecj/eclipse_compiler_checks.txt > [java] incorrect classpath: > /home/mshuler/git/cassandra/build/cobertura/classes > [java] -- > [java] 1. ERROR in > /home/mshuler/git/cassandra/src/java/org/apache/cassandra/io/util/RandomAccessReader.java > (at line 81) > [java] super(new ChannelProxy(file), DEFAULT_BUFFER_SIZE, -1L, > BufferType.OFF_HEAP); > [java] ^^ > [java] Potential resource leak: '' may not > be closed > [java] -- > [java] 1 problem (1 error) > BUILD FAILED > {noformat} > (checked 2.2 and did not find this issue) > git blame on line 81 shows commit 17dd4cc for CASSANDRA-8897 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9307) test-long contains tests that are not appropriate for running on every commit
[ https://issues.apache.org/jira/browse/CASSANDRA-9307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9307: Fix Version/s: (was: 3.x) 3.0 alpha 1 > test-long contains tests that are not appropriate for running on every commit > - > > Key: CASSANDRA-9307 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9307 > Project: Cassandra > Issue Type: Test >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg > Fix For: 3.0 alpha 1 > > > After discussion with Benedict it seems like some of the tests in test-long > are not long unit tests they are burn in tests for components that have not > changed since the tests were introduced. I propose creating a new test-target > called test-burn that isn't part of test-all, commenting test-all to note the > existence of test-burn, and then moving burn in tests to this target. > I'll create a separate JIRA for making sure that test-burn is run regularly > on hardware appropriate for those tests just to prevent bit rot of the test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9926) Avoid AbstractRow.isEmpty() where possible
Benedict created CASSANDRA-9926: --- Summary: Avoid AbstractRow.isEmpty() where possible Key: CASSANDRA-9926 URL: https://issues.apache.org/jira/browse/CASSANDRA-9926 Project: Cassandra Issue Type: Improvement Reporter: Benedict Assignee: Benedict Priority: Minor This method performs not-insignificant duplicate work, so we should both implement it more efficiently where possible, and avoid using it if we can -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-9894) Serialize the header only once per message
[ https://issues.apache.org/jira/browse/CASSANDRA-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646840#comment-14646840 ] Benedict edited comment on CASSANDRA-9894 at 7/29/15 10:27 PM: --- I've pushed an initial version [here|https://github.com/belliottsmith/cassandra/tree/9894]. This is based on my patch for CASSANDRA-9471. I tried starting from Sylvain's patch, and then starting from scratch, and ultimately I didn't like where either led. So this attacks the problem a little differently: it uses the column filter sent to the replica to help encode the response, knowing that the response columns must be a subset. With a normal number of columns this translates to a presence bitmap (otherwise it is a sequence of ints either adding or removing from the set, but these codepaths should rarely be taken). If the columns are identical, a single 0 byte is sent for all the columns. This permits us to save work when serializing even single partitions, and also permits us to encode per-partition encoding stats, so that our timestamps can most likely be more efficiently encoded. It also touches far less code. I am not 100% certain I haven't broken things, as dtests are a little tricky to read right now, but nothing jumps out at me. I still need to introduce some unit tests, and also want to invert the bitmap to make it more efficiently vint encoded. But the patch is generally ready for a first round of review, as it will change only minimally. was (Author: benedict): I've pushed an initial version [here|https://github.com/belliottsmith/cassandra/tree/9894]. This is based on my patch for CASSANDRA-9471. I tried starting from Sylvain's patch, and then starting from scratch, and ultimately I didn't like where either led. So this attacks the problem a little differently: it uses the column filter sent to the coordinator for a query to encode the response, knowing that the columns must be a subset. 
With a normal number of columns this translates to a bitmap of presence in the response for each column in the request (otherwise it is a sequence of vint encoded ints, but these codepaths should rarely be taken), and if the columns are identical (what we should expect), a single 0 byte is sent for all the columns. This permits us to save work when serializing even single partitions, and also permits us to encode per-partition encoding stats, so that our timestamps can most likely be more efficiently encoded. It also touches far less code. I am not 100% certain I haven't broken things, as dtests are a little tricky to read right now, but nothing jumps out at me. I still need to introduce some unit tests, and also want to invert the bitmap to make it more efficiently vint encoded. But the patch is generally ready for a first round of review, as it will change only minimally. > Serialize the header only once per message > -- > > Key: CASSANDRA-9894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9894 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Sylvain Lebresne >Assignee: Benedict > Fix For: 3.0 beta 1 > > > One last improvement I'd like to do on the serialization side is that we > currently serialize the {{SerializationHeader}} for each partition. That > header contains the serialized columns in particular and for range queries, > serializing that for every partition is wasted (note that it's only a problem > for the messaging protocol as for sstable we only write the header once per > sstable). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
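The presence-bitmap idea in the comment above can be sketched as follows. This is an illustrative toy, not Cassandra's actual {{Columns}} serialization (which also vint-encodes the result and special-cases identical sets with a single 0 byte); the class and method names here are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch: encode a response column set as a presence bitmap relative to
// the (ordered) superset of columns named in the query. One bit per superset
// column; this only covers the common case of at most 64 columns.
class ColumnSubsetCodec
{
    static long encode(List<String> superset, Set<String> subset)
    {
        long bits = 0;
        for (int i = 0; i < superset.size(); i++)
            if (subset.contains(superset.get(i)))
                bits |= 1L << i;
        return bits;
    }

    static Set<String> decode(List<String> superset, long bits)
    {
        Set<String> result = new HashSet<>();
        for (int i = 0; i < superset.size(); i++)
            if ((bits & (1L << i)) != 0)
                result.add(superset.get(i));
        return result;
    }
}
```

Because both sides already agree on the superset (it travelled with the query), only the bitmap needs to cross the wire.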
[jira] [Created] (CASSANDRA-9932) Make all partitions btree backed
Benedict created CASSANDRA-9932: --- Summary: Make all partitions btree backed Key: CASSANDRA-9932 URL: https://issues.apache.org/jira/browse/CASSANDRA-9932 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Fix For: 3.0.0 rc1 Following on from the other btree related refactors, this patch makes all partition (and partition-like) objects backed by the same basic structure: {{AbstractBTreePartition}}. With two main offshoots: {{ImmutableBTreePartition}} and {{AtomicBTreePartition}}. The main upshot is a 30% net code reduction, meaning better exercise of btree code paths and fewer new code paths to go wrong. A secondary upshot is that, by funnelling all our comparisons through a btree, there is a higher likelihood of icache occupancy and we have only one area to focus delivery of improvements for their enjoyment by all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9932) Make all partitions btree backed
[ https://issues.apache.org/jira/browse/CASSANDRA-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646885#comment-14646885 ] Benedict commented on CASSANDRA-9932: - An in progress patch is available [here|https://github.com/belliottsmith/cassandra/tree/btreepartition]. It is close to review-ready, but not quite. > Make all partitions btree backed > > > Key: CASSANDRA-9932 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9932 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Fix For: 3.0.0 rc1 > > > Following on from the other btree related refactors, this patch makes all > partition (and partition-like) objects backed by the same basic structure: > {{AbstractBTreePartition}}. With two main offshoots: > {{ImmutableBTreePartition}} and {{AtomicBTreePartition}} > The main upshot is a 30% net code reduction, meaning better exercise of btree > code paths and fewer new code paths to go wrong. A secondary upshort is that, > by funnelling all our comparisons through a btree, there is a higher > likelihood of icache occupancy and we have only one area to focus delivery of > improvements for their enjoyment by all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-9888) BTreeBackedRow and ComplexColumnData
[ https://issues.apache.org/jira/browse/CASSANDRA-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-9888. - Resolution: Fixed Thanks. I've ninja fixed. (Although I'm having weird problems running this test locally, it's an obvious fix from the stack trace) > BTreeBackedRow and ComplexColumnData > > > Key: CASSANDRA-9888 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9888 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Fix For: 3.0 alpha 1 > > > I found ArrayBackedRow a little hard to follow, especially around building, > so I've converted it to BTreeBackedRow, along with ComplexColumnData. Both > now rely on BTree.Builder, which introduces a little extra functionality to > permit these classes to be implemented more declaratively. Transformations of > these classes are also now uniform and more declarative, also depending on > some new functionality in BTree that permits applying a > transformation/filtration to an existing btree (this could be optimised at a > later date, but should suffice for now). > The result is IMO both clearer and should scale more gracefully to larger > numbers of columns and complex cells. > This hasn't taken all of the possible improvements of the back of this change > to their natural conclusion, as we are somewhat time pressed and I would > prefer to get the ball rolling with this first round. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9908) Potential race caused by async cleanup of transaction log files
[ https://issues.apache.org/jira/browse/CASSANDRA-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647351#comment-14647351 ] Benedict commented on CASSANDRA-9908: - Take a look in {{Keyspace.dropCf}} for another example. In that case it is done to ensure there are no writes to the commit log for the table after it's dropped. In this case that's unimportant, but we _do_ want to be certain that when we call {{waitForDeletions}} we can be confident the deletions have all actually happened, and that the log has been cleaned up. This means waiting until nobody is using any of the sstables. Really, though, we should be waiting for any writes to complete, then we should clean our memtable, perform a blocking flush (which waits for any in-progress flushes to complete), then waiting for any reads to complete, and _then_ drop our tables and wait for deletions. This should all happen after the index is no longer reachable by any new operations. Waiting for deletions should be fine to put in {{dropSSTables}}, however I think it might be better put in {{invalidate}}, since there's no need to wait for them during truncation, for instance. > Potential race caused by async cleanup of transaction log files > --- > > Key: CASSANDRA-9908 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9908 > Project: Cassandra > Issue Type: Bug >Reporter: Sam Tunnicliffe >Assignee: Stefania > Fix For: 3.0 beta 1 > > Attachments: TEST-org.apache.cassandra.db.SecondaryIndexTest.log > > > There seems to be a potential race in the cleanup of transaction log files, > introduced in CASSANDRA-7066 > It's pretty hard to trigger on trunk, but it's possible to hit it via > {{o.a.c.db.SecondaryIndexTest#testCreateIndex}} > That test creates an index, then removes it to check that the removal is > correctly recorded, then adds the index again to assert that it gets rebuilt > from the existing data. 
> The removal causes the SSTables of the index CFS to be dropped, which is a > transactional operation and so writes a transaction log. When the drop is > completed and the last reference to an SSTable is released, the cleanup of > the transaction log is scheduled on the periodic tasks executor. The issue is > that re-creating the index re-creates the index CFS. When this happens, it's > possible for the cleanup of the txn log to have not yet happened. If so, the > initialization of the CFS attempts to read the log to identify any orphaned > temporary files. The cleanup can happen between the finding the log file and > reading it's contents, which results in a {{NoSuchFileException}} > {noformat} > [junit] java.nio.file.NoSuchFileException: > build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log > [junit] java.lang.RuntimeException: java.nio.file.NoSuchFileException: > build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log > [junit] at > org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:620) > [junit] at > org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionFile.getTrackedFiles(TransactionLogs.java:190) > [junit] at > org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionData.getTemporaryFiles(TransactionLogs.java:338) > [junit] at > org.apache.cassandra.db.lifecycle.TransactionLogs.getTemporaryFiles(TransactionLogs.java:739) > [junit] at > org.apache.cassandra.db.lifecycle.LifecycleTransaction.getTemporaryFiles(LifecycleTransaction.java:541) > [junit] at > org.apache.cassandra.db.Directories$SSTableLister.getFilter(Directories.java:652) > [junit] at > org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:641) > [junit] at > 
org.apache.cassandra.db.Directories$SSTableLister.list(Directories.java:606) > [junit] at > org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:351) > [junit] at > org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:313) > [junit] at > org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:511) > [junit] at > org.apache.cassandra.index.internal.CassandraIndexer.addIndexedColumn(CassandraIndexer.java:115) > [junit] at > org.apache.cassandra.index.SecondaryIndexManager.addIndexedColumn(SecondaryIndexManager.java:265) > [junit] at > org.apache.cassandra.db.SecondaryIndexTest.testIndexCreate(SecondaryIndexTest.java:467) > [junit] Caused by: java.nio.file.NoSuchFileException: > build/test/cassa
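One defensive option for the list-then-read race described above is to treat a concurrently deleted log file as already cleaned up. A minimal sketch with hypothetical names, not the actual {{TransactionLogs}} fix:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

// If the async cleanup deletes the txn log between the directory listing and
// the read, NoSuchFileException just means there is nothing left to track.
class TolerantReader
{
    static List<String> readLinesIfPresent(Path path)
    {
        try
        {
            return Files.readAllLines(path);
        }
        catch (NoSuchFileException e)
        {
            return Collections.emptyList(); // deleted concurrently: done
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e); // real I/O errors still surface
        }
    }
}
```

This narrows the window but does not remove the underlying ordering problem, which is why the ticket discusses waiting for deletions instead.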
[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647469#comment-14647469 ] Benedict commented on CASSANDRA-7066: - bq. not sure what a panic error is exactly? It's just hyperbolic language to mean a loud looking error. Like, perhaps with some capital letters in there and some panic-stricken language, so that operators might take notice. bq. Here we are talking about old files only, right? Right, exactly. New files should never be a problem for backups or anything of that ilk, because they cannot be backed up until after we've finished our lifecycle transaction and written commit. > Simplify (and unify) cleanup of compaction leftovers > > > Key: CASSANDRA-7066 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7066 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Stefania >Priority: Minor > Labels: benedict-to-commit, compaction > Fix For: 3.0 alpha 1 > > Attachments: 7066.txt > > > Currently we manage a list of in-progress compactions in a system table, > which we use to cleanup incomplete compactions when we're done. The problem > with this is that 1) it's a bit clunky (and leaves us in positions where we > can unnecessarily cleanup completed files, or conversely not cleanup files > that have been superceded); and 2) it's only used for a regular compaction - > no other compaction types are guarded in the same way, so can result in > duplication if we fail before deleting the replacements. > I'd like to see each sstable store in its metadata its direct ancestors, and > on startup we simply delete any sstables that occur in the union of all > ancestor sets. This way as soon as we finish writing we're capable of > cleaning up any leftovers, so we never get duplication. It's also much easier > to reason about. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9894) Serialize the header only once per message
[ https://issues.apache.org/jira/browse/CASSANDRA-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648324#comment-14648324 ] Benedict commented on CASSANDRA-9894: - bq. Looks good to me conditional on a unit test for Columns. bq. , but in the common case will there be enough columns to benefit? Those are both comments for a different ticket, CASSANDRA-9471. This is built upon that simply because it made the patch a little easier. I will respond to these comments there. bq. In Columns when you serialize -1 for the encoding/number of columns can you add a comment that using the unsigned varint is intentional? The reason is we cannot use any other encoding: the other two branches use it, so we _have_ to in order for deserialization to know what's going on. I'll add a comment to that effect, but I felt this was already conveyed by the code. > Serialize the header only once per message > -- > > Key: CASSANDRA-9894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9894 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Sylvain Lebresne >Assignee: Benedict > Fix For: 3.0 beta 1 > > > One last improvement I'd like to do on the serialization side is that we > currently serialize the {{SerializationHeader}} for each partition. That > header contains the serialized columns in particular and for range queries, > serializing that for every partition is wasted (note that it's only a problem > for the messaging protocol as for sstable we only write the header once per > sstable). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9471) Columns should be backed by a BTree, not an array
[ https://issues.apache.org/jira/browse/CASSANDRA-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648337#comment-14648337 ] Benedict commented on CASSANDRA-9471: - [~aweisberg] said: {quote} It would be nice to see a microbenchmark and benchmark demonstrating that the use of BTree is a win. I am wondering how many additional instructions and branches are hidden inside BTree that you pay for in the common case of small Columns set. A BTree is more instructions for more cache locality, but in the common case will there be enough columns to benefit? {quote} If we want to go down that avenue, we should IMO simply look into specialising the btree iterator for the situation where we have just one leaf, and this is something I've considered. That's pretty much the only likely win. We don't really have time to do the kind of microbenchmarking you're talking about though, and it is of limited use anyway, since one of the main benefits _performancewise_ (in the common case) is improved icache\* occupancy, which cannot be accounted for in a microbenchmark. In the uncommon case this also gives us better lower bounds on behaviour, and access to a more powerfully expressive toolkit. For instance, it permits us to trivially deliver a more efficient multi-way merge (which we do very often), and that can be improved further still with less trivial enhancements. Mostly, though, this patch simply guarantees us a good lower bound on performance, with improved code clarity. Right now I would very much expect performance in the common case to be a wash, especially on any of our existing workloads. > Columns should be backed by a BTree, not an array > - > > Key: CASSANDRA-9471 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9471 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Fix For: 3.0 beta 1 > > > Follow up to 8099. 
> We have pretty terrible lookup performance as the number of columns grows > (linear). In at least one location, this results in quadratic performance. > We don't however want this structure to be either any more expensive to > build, nor to store. Some small modifications to BTree will permit it to > serve here, by permitting efficient lookup by index, and calculation _of_ > index for a given key. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
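The two operations the ticket asks of the structure (lookup by index, and calculation of index for a given key) can be illustrated on a plain sorted array via binary search. This is a conceptual stand-in, not Cassandra's BTree; the names are hypothetical.

```java
import java.util.Arrays;

// Conceptual stand-in for the BTree modifications: a sorted array supports
// both get-by-index and index-of-key in O(log n) or better, which is what
// replaces the linear scan over Columns.
class IndexedColumns
{
    static String get(String[] sortedNames, int index)
    {
        return sortedNames[index];
    }

    static int indexOf(String[] sortedNames, String name)
    {
        return Arrays.binarySearch(sortedNames, name); // negative if absent
    }
}
```

A btree can offer the same pair of operations, one common approach being to track subtree sizes in internal nodes, at the cost of a few extra branches per level.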
[jira] [Commented] (CASSANDRA-9894) Serialize the header only once per message
[ https://issues.apache.org/jira/browse/CASSANDRA-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648349#comment-14648349 ] Benedict commented on CASSANDRA-9894: - bq. I thought the reason you did it was to have the most compact encoding for the positive numbers which are going to be more common. I think those two statements add up to the same thing :) I'll add a comment explaining the full thought process, as it's actually a little more subtle than that. The encoding is designed to tend towards a single byte in the most common cases, and, most importantly, to be consistent with a guaranteed single byte in the most common case (equality). > Serialize the header only once per message > -- > > Key: CASSANDRA-9894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9894 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Sylvain Lebresne >Assignee: Benedict > Fix For: 3.0 beta 1 > > > One last improvement I'd like to do on the serialization side is that we > currently serialize the {{SerializationHeader}} for each partition. That > header contains the serialized columns in particular and for range queries, > serializing that for every partition is wasted (note that it's only a problem > for the messaging protocol as for sstable we only write the header once per > sstable). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
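To illustrate why small values, including a 0 "columns identical" marker, cost exactly one byte, here is a generic LEB128-style unsigned varint. Cassandra's actual {{VIntCoding}} format differs (it encodes the length in the leading byte), so treat this purely as a sketch of the principle.

```java
import java.io.ByteArrayOutputStream;

// Generic unsigned varint: seven payload bits per byte, high bit set on every
// byte except the last, so any value below 128 fits in a single byte.
class UVarint
{
    static byte[] encode(long value)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0)
        {
            out.write((int) ((value & 0x7FL) | 0x80L)); // more bytes follow
            value >>>= 7;
        }
        out.write((int) value); // final byte, high bit clear
        return out.toByteArray();
    }
}
```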
[jira] [Comment Edited] (CASSANDRA-9471) Columns should be backed by a BTree, not an array
[ https://issues.apache.org/jira/browse/CASSANDRA-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648337#comment-14648337 ] Benedict edited comment on CASSANDRA-9471 at 7/30/15 9:39 PM: -- [~aweisberg] said: {quote} It would be nice to see a microbenchmark and benchmark demonstrating that the use of BTree is a win. I am wondering how many additional instructions and branches are hidden inside BTree that you pay for in the common case of small Columns set. A BTree is more instructions for more cache locality, but in the common case will there be enough columns to benefit? {quote} If we want to go down that avenue, we should IMO simply look into specialising the btree iterator for the situation where we have just one leaf, and this is something I've considered. That's pretty much the only likely win. We don't really have time to do the kind of microbenchmarking you're talking about though, and it is of limited use anyway, since one of the main benefits _performancewise_ (in the common case) is improved icache\* occupancy, which cannot be accounted for in a microbenchmark. In the uncommon case this also gives us better lower bounds on behaviour, and access to a more powerfully expressive toolkit. For instance, it permits us to trivially deliver a more efficient multi-way merge (which we do very often), and that can be improved further still with less trivial enhancements. Other future improvements that can be deployed widely are efficient intersection for iteration of subsets, which again we do very often. Mostly, though, for now this patch simply guarantees us a good lower bound on performance, with improved code clarity. Right now I would very much expect performance in the common case to be a wash, especially on any of our existing workloads. was (Author: benedict): [~aweisberg] said: {quote} It would be nice to see a microbenchmark and benchmark demonstrating that the use of BTree is a win. 
I am wondering how many additional instructions and branches are hidden inside BTree that you pay for in the common case of small Columns set. A BTree is more instructions for more cache locality, but in the common case will there be enough columns to benefit? {quote} If we want to go down that avenue, we should IMO simply look into specialising the btree iterator for the situation where we have just one leaf, and this is something I've considered. That's pretty much the only likely win. We don't really have time to do the kind of microbenchmarking you're talking about though, and it is of limited use anyway, since one of the main benefits _performancewise_ (in the common case) is improved icache\* occupancy, which cannot be accounted for in a microbenchmark. In the uncommon case this also gives us better lower bounds on behaviour, and access to a more powerfully expressive toolkit. For instance, it permits us to trivially deliver a more efficient multi-way merge (which we do very often), and that can be improved further still with less trivial enhancements. Mostly, though, this patch simply guarantees us a good lower bound on performance, with improved code clarity. Right now I would very much expect performance in the common case to be a wash, especially on any of our existing workloads. > Columns should be backed by a BTree, not an array > - > > Key: CASSANDRA-9471 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9471 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Fix For: 3.0 beta 1 > > > Follow up to 8099. > We have pretty terrible lookup performance as the number of columns grows > (linear). In at least one location, this results in quadratic performance. > We don't however want this structure to be either any more expensive to > build, nor to store. 
Some small modifications to BTree will permit it to > serve here, by permitting efficient lookup by index, and calculation _of_ > index for a given key. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648509#comment-14648509 ] Benedict commented on CASSANDRA-7066: - [~rcoli]: while this patch does not remove the ancestor maintenance, it does make it completely irrelevant, since it is no longer used for any behaviour. However your concern highlights a much scarier prospect: the replay positions stored within the sstables could be at an arbitrary position, potentially in the future of the node's commit log. So they could break commit log replay during startup. In fact they probably have a high chance of doing so, although CASSANDRA-9669 may mitigate that. We should actually remove this data from {{CompactionMetadata}}, and mention this in the NEWS.txt as well. bq. Is it possible to have the refresh command and possibly node startup fail and/or throw an error if this step doesn't happen? There's unfortunately no good way for the node to know that it should not replay these files, since the whole point of their existence is that they should be replayed. The changes we've just discussed will make us _somewhat_ robust to any problems that might arise from operators doing this, and the likelihood of copying files covered by such a file is fairly low anyway. [~stefania]: perhaps we should offer another command line tool that just performs the cleanup (i.e. 
either rolls back or finishes commit, whichever is necessary) that an operator can run to clear the way, and we can make this explicit in NEWS.txt > Simplify (and unify) cleanup of compaction leftovers > > > Key: CASSANDRA-7066 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7066 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Stefania >Priority: Minor > Labels: benedict-to-commit, compaction > Fix For: 3.0 alpha 1 > > Attachments: 7066.txt > > > Currently we manage a list of in-progress compactions in a system table, > which we use to cleanup incomplete compactions when we're done. The problem > with this is that 1) it's a bit clunky (and leaves us in positions where we > can unnecessarily cleanup completed files, or conversely not cleanup files > that have been superceded); and 2) it's only used for a regular compaction - > no other compaction types are guarded in the same way, so can result in > duplication if we fail before deleting the replacements. > I'd like to see each sstable store in its metadata its direct ancestors, and > on startup we simply delete any sstables that occur in the union of all > ancestor sets. This way as soon as we finish writing we're capable of > cleaning up any leftovers, so we never get duplication. It's also much easier > to reason about. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649193#comment-14649193 ] Benedict commented on CASSANDRA-7066: - Thanks. Looks like great work as always. A few comments (I haven't done a full review): * The last modified sort order was necessary for the same reason a full checksum of all of the old files is too heavy-handed: restarting in the middle of deleting the files. We want to finish deleting them, and if we take the max timestamp we can say that will safely stay the same. We could keep the full checksum and include a count; if the count on disk is lower, and our max timestamp matches, we can proceed; if the count is the same we require a checksum match. This may or may not be overkill, but there's no harm in paranoia here. * It looks like serialization of the sstables now assumes the ancestors won't be present, but this will break upgrade. We need to check the sstable version, and at least skip the ancestors if they're present * The compaction metadata equality is probably not sensibly employed anywhere; we should just confirm this and remove it IMO since it makes very little sense, especially now that the ancestors have been removed * I think you took my comments about panic warnings a little too literally. We should just use strong language to make it clear to the operator we've detected disk corruption of some kind, and have taken the most pessimistic course of action ** This warning should be less severe if only the last line is _incomplete_ (and all other lines are parsed correctly), as this would be expected if we crashed part way through serialization ** We should perhaps not worry at all, and continue if this last situation occurs, in the event that _all_ files logged are still present on the file system, and our other metadata all matches for them. 
In this case we can be very confident we just shutdown in the middle of writing a record, and can just clean up all of the new files. This should make us robust both to the scary situation, but also minimise the pessimism, hopefully giving us the best of both worlds. WDYT? > Simplify (and unify) cleanup of compaction leftovers > > > Key: CASSANDRA-7066 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7066 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Stefania >Priority: Minor > Labels: benedict-to-commit, compaction > Fix For: 3.0 alpha 1 > > Attachments: 7066.txt > > > Currently we manage a list of in-progress compactions in a system table, > which we use to cleanup incomplete compactions when we're done. The problem > with this is that 1) it's a bit clunky (and leaves us in positions where we > can unnecessarily cleanup completed files, or conversely not cleanup files > that have been superceded); and 2) it's only used for a regular compaction - > no other compaction types are guarded in the same way, so can result in > duplication if we fail before deleting the replacements. > I'd like to see each sstable store in its metadata its direct ancestors, and > on startup we simply delete any sstables that occur in the union of all > ancestor sets. This way as soon as we finish writing we're capable of > cleaning up any leftovers, so we never get duplication. It's also much easier > to reason about. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
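The count/timestamp/checksum scheme proposed in the first bullet above can be sketched as a single predicate. All names here are hypothetical: the record would store, at commit time, the count of old files, their maximum last-modified timestamp, and a checksum over them.

```java
// Sketch of the proposed restart-time check on the old files:
// - fewer files than recorded, same max timestamp: we died mid-deletion, proceed;
// - same count: demand a checksum match;
// - anything else (e.g. extra files): treat as suspect and warn loudly.
class LeftoverCheck
{
    static boolean consistent(int recordedCount, long recordedMaxTimestamp, long recordedChecksum,
                              int actualCount, long actualMaxTimestamp, long actualChecksum)
    {
        if (actualCount < recordedCount)
            return actualMaxTimestamp == recordedMaxTimestamp;
        return actualCount == recordedCount && actualChecksum == recordedChecksum;
    }
}
```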
[jira] [Commented] (CASSANDRA-9908) Potential race caused by async cleanup of transaction log files
[ https://issues.apache.org/jira/browse/CASSANDRA-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651564#comment-14651564 ] Benedict commented on CASSANDRA-9908: - Yep, that all looks good. bq. Isn't this already guaranteed by SecondaryIndexManager.removeIndexedColumn? Yes, it was just a general statement, without having yet dived into this piece of code sufficiently, that we needed to ensure this. (sorry for the delay, I thought I'd already responded to this) > Potential race caused by async cleanup of transaction log files > --- > > Key: CASSANDRA-9908 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9908 > Project: Cassandra > Issue Type: Bug >Reporter: Sam Tunnicliffe >Assignee: Stefania > Labels: benedict-to-commit > Fix For: 3.0 beta 1 > > Attachments: TEST-org.apache.cassandra.db.SecondaryIndexTest.log > > > There seems to be a potential race in the cleanup of transaction log files, > introduced in CASSANDRA-7066 > It's pretty hard to trigger on trunk, but it's possible to hit it via > {{o.a.c.db.SecondaryIndexTest#testCreateIndex}} > That test creates an index, then removes it to check that the removal is > correctly recorded, then adds the index again to assert that it gets rebuilt > from the existing data. > The removal causes the SSTables of the index CFS to be dropped, which is a > transactional operation and so writes a transaction log. When the drop is > completed and the last reference to an SSTable is released, the cleanup of > the transaction log is scheduled on the periodic tasks executor. The issue is > that re-creating the index re-creates the index CFS. When this happens, it's > possible for the cleanup of the txn log to have not yet happened. If so, the > initialization of the CFS attempts to read the log to identify any orphaned > temporary files. 
The cleanup can happen between finding the log file and > reading its contents, which results in a {{NoSuchFileException}} > {noformat} > [junit] java.nio.file.NoSuchFileException: > build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log > [junit] java.lang.RuntimeException: java.nio.file.NoSuchFileException: > build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log > [junit] at > org.apache.cassandra.io.util.FileUtils.readLines(FileUtils.java:620) > [junit] at > org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionFile.getTrackedFiles(TransactionLogs.java:190) > [junit] at > org.apache.cassandra.db.lifecycle.TransactionLogs$TransactionData.getTemporaryFiles(TransactionLogs.java:338) > [junit] at > org.apache.cassandra.db.lifecycle.TransactionLogs.getTemporaryFiles(TransactionLogs.java:739) > [junit] at > org.apache.cassandra.db.lifecycle.LifecycleTransaction.getTemporaryFiles(LifecycleTransaction.java:541) > [junit] at > org.apache.cassandra.db.Directories$SSTableLister.getFilter(Directories.java:652) > [junit] at > org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:641) > [junit] at > org.apache.cassandra.db.Directories$SSTableLister.list(Directories.java:606) > [junit] at > org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:351) > [junit] at > org.apache.cassandra.db.ColumnFamilyStore.(ColumnFamilyStore.java:313) > [junit] at > org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:511) > [junit] at > org.apache.cassandra.index.internal.CassandraIndexer.addIndexedColumn(CassandraIndexer.java:115) > [junit] at > 
org.apache.cassandra.index.SecondaryIndexManager.addIndexedColumn(SecondaryIndexManager.java:265) > [junit] at > org.apache.cassandra.db.SecondaryIndexTest.testIndexCreate(SecondaryIndexTest.java:467) > [junit] Caused by: java.nio.file.NoSuchFileException: > build/test/cassandra/data:1/SecondaryIndexTest1/CompositeIndexToBeAdded-d0885f60323211e5a5e8ad83a3dc3e9c/.birthdate_index/transactions/unknowncompactiontype_d4b69fc0-3232-11e5-a5e8-ad83a3dc3e9c_old.log > [junit] at > sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) > [junit] at > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) > [junit] at > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) > [junit] at > sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) > [junit] at java.nio.file.Files.newByteChannel(Files.java:361) > [junit] at
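One defensive shape for this kind of race, shown here only as an illustration (the actual fix in the ticket may take a different approach), is to tolerate the log file disappearing between the directory listing and the read:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

final class SafeLogReader
{
    // Read the log lines if the file still exists; an async cleanup may
    // legitimately delete it between our directory listing and this read,
    // in which case an empty result (no tracked files) is the right answer.
    static List<String> readLinesIfPresent(Path log)
    {
        try
        {
            return Files.readAllLines(log);
        }
        catch (NoSuchFileException e)
        {
            return Collections.emptyList(); // deleted by concurrent cleanup
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e); // genuine I/O failure
        }
    }
}
```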
[jira] [Commented] (CASSANDRA-7066) Simplify (and unify) cleanup of compaction leftovers
[ https://issues.apache.org/jira/browse/CASSANDRA-7066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651571#comment-14651571 ] Benedict commented on CASSANDRA-7066: - bq. Will this cause problems if the alpha goes out before this ticket is committed? We should be fine, since we're changing on-wire serialization between alpha and beta also, so I guess we're OK with some upgrade problems between the two (we don't really want or expect anyone to use alpha except for dev, I would guess). [~jbellis]? bq. However all classes implementing MetadataComponent implement them so I left them with a comment. Also, without it it seems a scrub test was failing (unsure if related though). I guess it's possible we compare metadata at the end of a scrub to confirm it's equivalent. It's somewhat ugly that we will treat all compaction metadata as equal, but probably not worth worrying about here. Otherwise SGTM, I'll review it shortly. > Simplify (and unify) cleanup of compaction leftovers > > > Key: CASSANDRA-7066 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7066 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Stefania >Priority: Minor > Labels: benedict-to-commit, compaction > Fix For: 3.0 alpha 1 > > Attachments: 7066.txt > > > Currently we manage a list of in-progress compactions in a system table, > which we use to cleanup incomplete compactions when we're done. The problem > with this is that 1) it's a bit clunky (and leaves us in positions where we > can unnecessarily cleanup completed files, or conversely not cleanup files > that have been superceded); and 2) it's only used for a regular compaction - > no other compaction types are guarded in the same way, so can result in > duplication if we fail before deleting the replacements. 
> I'd like to see each sstable store in its metadata its direct ancestors, and > on startup we simply delete any sstables that occur in the union of all > ancestor sets. This way as soon as we finish writing we're capable of > cleaning up any leftovers, so we never get duplication. It's also much easier > to reason about. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9471) Columns should be backed by a BTree, not an array
[ https://issues.apache.org/jira/browse/CASSANDRA-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651724#comment-14651724 ] Benedict commented on CASSANDRA-9471: - I've rebased this and introduced unit tests to cover the majority of functionality. All of which was already present but not covered. > Columns should be backed by a BTree, not an array > - > > Key: CASSANDRA-9471 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9471 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Fix For: 3.0 beta 1 > > > Follow up to 8099. > We have pretty terrible lookup performance as the number of columns grows > (linear). In at least one location, this results in quadratic performance. > We don't however want this structure to be either any more expensive to > build, nor to store. Some small modifications to BTree will permit it to > serve here, by permitting efficient lookup by index, and calculation _of_ > index for a given key. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9894) Serialize the header only once per message
[ https://issues.apache.org/jira/browse/CASSANDRA-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651953#comment-14651953 ] Benedict commented on CASSANDRA-9894: - I've rebased against CASSANDRA-9471, to which I've added a new ColumnsTest. I've then added tests to it for the existing serialization and the new serialization introduced by this patch. I've also introduced some brief comments in {{serializeSubset()}} explaining at a high level how the encoding works, and why. > Serialize the header only once per message > -- > > Key: CASSANDRA-9894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9894 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Sylvain Lebresne >Assignee: Benedict > Fix For: 3.0 beta 1 > > > One last improvement I'd like to do on the serialization side is that we > currently serialize the {{SerializationHeader}} for each partition. That > header contains the serialized columns in particular and for range queries, > serializing that for every partition is wasted (note that it's only a problem > for the messaging protocol as for sstable we only write the header once per > sstable). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9962) WaitQueueTest is flakey
Benedict created CASSANDRA-9962: --- Summary: WaitQueueTest is flakey Key: CASSANDRA-9962 URL: https://issues.apache.org/jira/browse/CASSANDRA-9962 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 3.x While the test is a little noddy, and superfluous, it shouldn't fail even vanishingly infrequently. [~aweisberg] has spotted it doing so, and I have also encountered it once, so I suspect that a change in hardware/OS may have made vanishingly unlikely just pretty unlikely, which is even less good. Right now it depends on {{Thread.start()}} completing before the new thread starts; this isn't guaranteed. This patch fixes that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9962) WaitQueueTest is flakey
[ https://issues.apache.org/jira/browse/CASSANDRA-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653118#comment-14653118 ] Benedict commented on CASSANDRA-9962: - Patch available [here|https://github.com/belliottsmith/cassandra/tree/9962] > WaitQueueTest is flakey > --- > > Key: CASSANDRA-9962 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9962 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Minor > Fix For: 3.x > > > While the test is a little noddy, and superfluous, it shouldn't fail even > vanishingly infrequently. [~aweisberg] has spotted it doing so, and I have > also encountered it once, so I suspect that a change in hardware/OS may have > made vanishingly unlikely just pretty unlikely, which is even less good. > Right now it depends on {{Thread.start()}} completing before the new thread > starts; this isn't guaranteed. This patch fixes that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9974) Improve debuggability of 3.0
Benedict created CASSANDRA-9974: --- Summary: Improve debuggability of 3.0 Key: CASSANDRA-9974 URL: https://issues.apache.org/jira/browse/CASSANDRA-9974 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Fix For: 3.0.0 rc1 While 8099 has brought a number of improvements, currently it is making debugging a bit of a nightmare (for me at least). This slows down development and test resolution, and so we should fix it sooner than later. This ticket is intended to aggregate tickets that will improve this situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9975) Flatten RowIterator call hierarchy with a shared RowTransformer
Benedict created CASSANDRA-9975: --- Summary: Flatten RowIterator call hierarchy with a shared RowTransformer Key: CASSANDRA-9975 URL: https://issues.apache.org/jira/browse/CASSANDRA-9975 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Benedict Assignee: Benedict Fix For: 3.0.0 rc1 Stepping through a read response is made exceedingly difficult by the sheer depth of the call hierarchy, and how rapidly your context jumps around. This ticket intends to partially address that, by flattening one of the main causes of this: iterator transformations. I have a patch that attempts to mitigate (but not entirely eliminate) this, through the introduction of a {{RowTransformer}} class that all transformations are applied through. If a transformation has already been applied, the {{RowTransformer}} class does not wrap a new iterator, but instead returns a new {{RowTransformer}} that wraps the original underlying (untransformed) iterator and both transformations. This can accumulate an arbitrary number of transformations and, quite importantly, can apply the filtration step {{Unfiltered -> Row}} in the same instance as well. The intention being that a majority of control flow happens inside this {{RowTransformer}}, so there is far less context jumping to cope with. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
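The flattening idea described in the ticket can be sketched as follows: wrapping an already-transformed source composes the functions instead of nesting another wrapper, so call depth stays constant. These names are illustrative assumptions, not the actual {{RowTransformer}} API:

```java
import java.util.function.Function;

// Illustrative-only sketch: rather than stacking wrapper upon wrapper, a
// second transformation is folded into the existing one, keeping a single
// wrapper around the original, untransformed source.
final class Transformed<T>
{
    final Object source;            // the original, untransformed source
    final Function<T, T> transform; // all accumulated transformations

    Transformed(Object source, Function<T, T> transform)
    {
        this.source = source;
        this.transform = transform;
    }

    static <T> Transformed<T> wrap(Object source, Function<T, T> f)
    {
        if (source instanceof Transformed)
        {
            // flatten: compose with the prior chain instead of nesting
            @SuppressWarnings("unchecked")
            Transformed<T> prior = (Transformed<T>) source;
            return new Transformed<>(prior.source, prior.transform.andThen(f));
        }
        return new Transformed<>(source, f);
    }

    T apply(T input) { return transform.apply(input); }
}
```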
[jira] [Created] (CASSANDRA-9976) Flatten PartitionIterator call hierarchy with PartitionTransformer
Benedict created CASSANDRA-9976: --- Summary: Flatten PartitionIterator call hierarchy with PartitionTransformer Key: CASSANDRA-9976 URL: https://issues.apache.org/jira/browse/CASSANDRA-9976 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Benedict Assignee: Benedict Fix For: 3.0.0 rc1 Much like CASSANDRA-9975, except for partitions. This is less involved, and probably less essential, but still helpful. This has been split out to keep the change iterative and piecemeal. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9932) Make all partitions btree backed
[ https://issues.apache.org/jira/browse/CASSANDRA-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653186#comment-14653186 ] Benedict commented on CASSANDRA-9932: - Patch available [here|https://github.com/belliottsmith/cassandra/tree/9932] > Make all partitions btree backed > > > Key: CASSANDRA-9932 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9932 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Fix For: 3.0.0 rc1 > > > Following on from the other btree related refactors, this patch makes all > partition (and partition-like) objects backed by the same basic structure: > {{AbstractBTreePartition}}. With two main offshoots: > {{ImmutableBTreePartition}} and {{AtomicBTreePartition}} > The main upshot is a 30% net code reduction, meaning better exercise of btree > code paths and fewer new code paths to go wrong. A secondary upshot is that, > by funnelling all our comparisons through a btree, there is a higher > likelihood of icache occupancy and we have only one area to focus delivery of > improvements for their enjoyment by all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9974) Improve debuggability
[ https://issues.apache.org/jira/browse/CASSANDRA-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9974: Summary: Improve debuggability (was: Improve debuggability of 3.0) > Improve debuggability > - > > Key: CASSANDRA-9974 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9974 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict > Fix For: 3.0.0 rc1 > > > While 8099 has brought a number of improvements, currently it is making > debugging a bit of a nightmare (for me at least). This slows down development > and test resolution, and so we should fix it sooner than later. This ticket > is intended to aggregate tickets that will improve this situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9926) Remove AbstractRow.isEmpty
[ https://issues.apache.org/jira/browse/CASSANDRA-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9926: Summary: Remove AbstractRow.isEmpty (was: Avoid AbstractRow.isEmpty() where possible) > Remove AbstractRow.isEmpty > -- > > Key: CASSANDRA-9926 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9926 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Assignee: Benedict >Priority: Minor > > This method performs not-insignificant duplicate work, so we should both > implement it more efficiently where possible, and avoid using it if we can -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9926) Remove AbstractRow.isEmpty
[ https://issues.apache.org/jira/browse/CASSANDRA-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653194#comment-14653194 ] Benedict commented on CASSANDRA-9926: - Since we only have one implementation of {{AbstractRow}}, we can safely remove the implementation in {{AbstractRow}} and - more importantly - introduce an efficient implementation in the concrete class > Remove AbstractRow.isEmpty > -- > > Key: CASSANDRA-9926 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9926 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Assignee: Benedict >Priority: Minor > Fix For: 3.0 beta 1 > > > This method performs not-insignificant duplicate work, so we should both > implement it more efficiently where possible, and avoid using it if we can -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9926) Remove AbstractRow.isEmpty
[ https://issues.apache.org/jira/browse/CASSANDRA-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9926: Priority: Trivial (was: Minor) > Remove AbstractRow.isEmpty > -- > > Key: CASSANDRA-9926 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9926 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Assignee: Benedict >Priority: Trivial > Fix For: 3.0 beta 1 > > > This method performs not-insignificant duplicate work, so we should both > implement it more efficiently where possible, and avoid using it if we can -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)
[ https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653263#comment-14653263 ] Benedict commented on CASSANDRA-8630: - Yes, spot on (at a high level - specifics may yet have more to discuss). [~aweisberg] may have comments / thoughts, since this is the direct mirror of CASSANDRA-9500 > Faster sequential IO (on compaction, streaming, etc) > > > Key: CASSANDRA-8630 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8630 > Project: Cassandra > Issue Type: Improvement > Components: Core, Tools >Reporter: Oleg Anastasyev >Assignee: Stefania > Labels: compaction, performance > Fix For: 3.x > > Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, > flight_recorder_001_files.tar.gz > > > When node is doing a lot of sequencial IO (streaming, compacting, etc) a lot > of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int). > This is because default implementations of readShort,readLong, etc as well as > their matching write* are implemented with numerous calls of byte by byte > read and write. > This makes a lot of syscalls as well. > A quick microbench shows than just reimplementation of these methods in > either way gives 8x speed increase. > A patch attached implements RandomAccessReader.read and > SequencialWriter.write methods in more efficient way. > I also eliminated some extra byte copies in CompositeType.split and > ColumnNameHelper.maxComponents, which were on my profiler's hotspot method > list during tests. > A stress tests on my laptop show that this patch makes compaction 25-30% > faster on uncompressed sstables and 15% faster for compressed ones. > A deployment to production shows much less CPU load for compaction. > (I attached a cpu load graph from one of our production, orange is niced CPU > load - i.e. compaction; yellow is user - i.e. not compaction related tasks) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)
[ https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653676#comment-14653676 ] Benedict commented on CASSANDRA-8630: - We already have a branch in every case? This won't change the typical behavior at all. That branch is also highly likely to be predicted correctly on just about every execution. > Faster sequential IO (on compaction, streaming, etc) > > > Key: CASSANDRA-8630 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8630 > Project: Cassandra > Issue Type: Improvement > Components: Core, Tools >Reporter: Oleg Anastasyev >Assignee: Stefania > Labels: compaction, performance > Fix For: 3.x > > Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, > flight_recorder_001_files.tar.gz > > > When node is doing a lot of sequencial IO (streaming, compacting, etc) a lot > of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int). > This is because default implementations of readShort,readLong, etc as well as > their matching write* are implemented with numerous calls of byte by byte > read and write. > This makes a lot of syscalls as well. > A quick microbench shows than just reimplementation of these methods in > either way gives 8x speed increase. > A patch attached implements RandomAccessReader.read and > SequencialWriter.write methods in more efficient way. > I also eliminated some extra byte copies in CompositeType.split and > ColumnNameHelper.maxComponents, which were on my profiler's hotspot method > list during tests. > A stress tests on my laptop show that this patch makes compaction 25-30% > faster on uncompressed sstables and 15% faster for compressed ones. > A deployment to production shows much less CPU load for compaction. > (I attached a cpu load graph from one of our production, orange is niced CPU > load - i.e. compaction; yellow is user - i.e. 
not compaction related tasks) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)
[ https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653738#comment-14653738 ] Benedict commented on CASSANDRA-8630: - My expectation was that we would extract the existing "is sufficient remaining?" check, and use this to drive a decision. It doesn't really matter what we do inside this block, because that is executed infrequently. I would have a {{long readBytesSlowly(int count)}} method, whose result we downcast to whatever width we actually read. That method would call {{readNext()}} as necessary, and would be prevented from being inlined. But that's just how I would do it, since I prefer not to incur complexity when we know the cost will be amortized away, nor inline that method and inflate our code size for the same reason. I don't think it matters _terribly_, though, and if you or [~stefania] have a strong opinion I won't stand in the way (unless it happens to trigger a strong and unexpected anti-opinion) > Faster sequential IO (on compaction, streaming, etc) > > > Key: CASSANDRA-8630 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8630 > Project: Cassandra > Issue Type: Improvement > Components: Core, Tools >Reporter: Oleg Anastasyev >Assignee: Stefania > Labels: compaction, performance > Fix For: 3.x > > Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, > flight_recorder_001_files.tar.gz > > > When node is doing a lot of sequencial IO (streaming, compacting, etc) a lot > of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int). > This is because default implementations of readShort,readLong, etc as well as > their matching write* are implemented with numerous calls of byte by byte > read and write. > This makes a lot of syscalls as well. > A quick microbench shows than just reimplementation of these methods in > either way gives 8x speed increase. 
> A patch attached implements RandomAccessReader.read and > SequencialWriter.write methods in more efficient way. > I also eliminated some extra byte copies in CompositeType.split and > ColumnNameHelper.maxComponents, which were on my profiler's hotspot method > list during tests. > A stress tests on my laptop show that this patch makes compaction 25-30% > faster on uncompressed sstables and 15% faster for compressed ones. > A deployment to production shows much less CPU load for compaction. > (I attached a cpu load graph from one of our production, orange is niced CPU > load - i.e. compaction; yellow is user - i.e. not compaction related tasks) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9962) WaitQueueTest is flakey
[ https://issues.apache.org/jira/browse/CASSANDRA-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654957#comment-14654957 ] Benedict commented on CASSANDRA-9962: - bq. A semaphore would be an easier way to wait for the thread(s) to be initialized? Define easier :) The main benefit of a {{Semaphore}} is that it does not busy spin; otherwise spinning on an atomic counter is just as easy. However stylistically I'd have no issue switching. The main benefit of busy spinning is it increases the likelihood of interesting thread interleavings, which is useful for testing this kind of functionality. Really, though, we should have a burn test, and I'm kind of surprised we do not. I thought we used to. Really I would like to merge [my concurrent utils|https://github.com/belliottsmith/bes-utils] with the tree, or link them, as we've had a few bugs and behavioural issues that could have been avoided with them, and they have pretty comprehensive burn tests, including for its variant on {{WaitQueue}}. bq. Maybe we should define a few constants in Util.java for how long to wait for this kind of condition? I'm not sure there's a universally good large constant, but I've introduced a {{Util.joinThread()}} method with a 10s timeout. Which may be too large, but will probably suffice for now. bq. There is also no atomicity between signaling ready and getting into the wait queue. Sure there is. We signal only after we add ourselves to the queue. > WaitQueueTest is flakey > --- > > Key: CASSANDRA-9962 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9962 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Minor > Fix For: 3.x > > > While the test is a little noddy, and superfluous, it shouldn't fail even > vanishingly infrequently. 
[~aweisberg] has spotted it doing so, and I have > also encountered it once, so I suspect that a change in hardware/OS may have > made vanishingly unlikely just pretty unlikely, which is even less good. > Right now it depends on {{Thread.start()}} completing before the new thread > starts; this isn't guaranteed. This patch fixes that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
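The {{Util.joinThread()}} helper mentioned above might look something like the following sketch: join with a bounded timeout and fail loudly if the thread is still alive afterwards. The 10s bound and method name follow the comment; the rest is an assumption, not the committed code:

```java
// Hypothetical sketch of a Util.joinThread-style test helper: a bounded
// join that turns "thread never finished" into an explicit test failure
// instead of a hang.
final class ThreadUtil
{
    static void joinThread(Thread t)
    {
        try
        {
            t.join(10_000); // wait at most 10 seconds
        }
        catch (InterruptedException e)
        {
            throw new AssertionError("interrupted while joining " + t.getName(), e);
        }
        if (t.isAlive())
            throw new AssertionError("thread did not terminate in time: " + t.getName());
    }
}
```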
[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)
[ https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654962#comment-14654962 ] Benedict commented on CASSANDRA-8630: - I think we can do something much simpler, without any intermediate objects. e.g. {code} @Override public int readInt() throws IOException { if (buffer.remaining() >= 4) return buffer.getInt(); else return (int) readPrimitiveSlowly(4); } @DontInline private long readPrimitiveSlowly(int bytes) throws IOException { long result = 0; for (int i = 0 ; i < bytes ; i++) result = (result << 8) | (readByte() & 0xff); return result; } {code} > Faster sequential IO (on compaction, streaming, etc) > > > Key: CASSANDRA-8630 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8630 > Project: Cassandra > Issue Type: Improvement > Components: Core, Tools >Reporter: Oleg Anastasyev >Assignee: Stefania > Labels: compaction, performance > Fix For: 3.x > > Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, > flight_recorder_001_files.tar.gz > > > When node is doing a lot of sequencial IO (streaming, compacting, etc) a lot > of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int). > This is because default implementations of readShort,readLong, etc as well as > their matching write* are implemented with numerous calls of byte by byte > read and write. > This makes a lot of syscalls as well. > A quick microbench shows than just reimplementation of these methods in > either way gives 8x speed increase. > A patch attached implements RandomAccessReader.read and > SequencialWriter.write methods in more efficient way. > I also eliminated some extra byte copies in CompositeType.split and > ColumnNameHelper.maxComponents, which were on my profiler's hotspot method > list during tests. > A stress tests on my laptop show that this patch makes compaction 25-30% > faster on uncompressed sstables and 15% faster for compressed ones. 
> A deployment to production shows much less CPU load for compaction. > (I attached a CPU load graph from one of our production nodes: orange is niced CPU > load, i.e. compaction; yellow is user, i.e. tasks not related to compaction) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
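The fast-path/slow-path split in Benedict's snippet can be sketched as a self-contained example. Everything here is hypothetical scaffolding (the class name, the wrapped byte[] source): the real RandomAccessReader refills its buffer from disk, which this sketch omits.

```java
import java.nio.ByteBuffer;

// Sketch of the fast-path/slow-path read pattern from the comment above.
// PrimitiveReadSketch is a hypothetical stand-in for RandomAccessReader.
public class PrimitiveReadSketch {
    private final ByteBuffer buffer;

    public PrimitiveReadSketch(byte[] data) {
        this.buffer = ByteBuffer.wrap(data); // big-endian by default
    }

    public int readByte() {
        // A real reader would refill the buffer from disk here when empty.
        return buffer.get() & 0xFF;
    }

    public int readInt() {
        if (buffer.remaining() >= 4)
            return buffer.getInt();              // fast path: one bulk get
        else
            return (int) readPrimitiveSlowly(4); // slow path: buffer boundary
    }

    // Kept out of line (@DontInline in the proposal) so the fast path stays
    // small enough for the JIT to inline readInt() at its call sites.
    public long readPrimitiveSlowly(int bytes) {
        long result = 0;
        for (int i = 0; i < bytes; i++)
            result = (result << 8) | readByte(); // big-endian, byte by byte
        return result;
    }

    public static void main(String[] args) {
        PrimitiveReadSketch r = new PrimitiveReadSketch(new byte[]{0, 0, 1, 2});
        System.out.println(r.readInt()); // 258 (0x00000102)
    }
}
```

The point of the pattern is that the common case costs one bounds check and one intrinsic get, while the rare buffer-straddling case pays the per-byte price.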
[jira] [Created] (CASSANDRA-9985) Introduce our own AbstractIterator
Benedict created CASSANDRA-9985: --- Summary: Introduce our own AbstractIterator Key: CASSANDRA-9985 URL: https://issues.apache.org/jira/browse/CASSANDRA-9985 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Benedict Assignee: Benedict Priority: Trivial Fix For: 3.0.0 rc1 The Guava AbstractIterator not only has unnecessary method call depth, it is difficult to debug without attaching source. Since it's absolutely trivial to write our own, and it's used widely within the codebase, I think we should do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9533) Make batch commitlog mode easier to tune
[ https://issues.apache.org/jira/browse/CASSANDRA-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654985#comment-14654985 ] Benedict commented on CASSANDRA-9533: - Really trivial patch [here|https://github.com/belliottsmith/cassandra/commits/9533-2.1] > Make batch commitlog mode easier to tune > > > Key: CASSANDRA-9533 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9533 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Benedict > Fix For: 3.x > > > As discussed in CASSANDRA-9504, 2.1 changed commitlog_sync_batch_window_in_ms > from a maximum time to wait between fsync to the minimum time, so one must be > very careful to keep it small enough that most writers aren't kept waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9926) Remove AbstractRow.isEmpty
[ https://issues.apache.org/jira/browse/CASSANDRA-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9926: Fix Version/s: (was: 3.0 beta 1) 3.0.0 rc1 > Remove AbstractRow.isEmpty > -- > > Key: CASSANDRA-9926 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9926 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Assignee: Benedict >Priority: Trivial > Fix For: 3.0.0 rc1 > > > This method performs not-insignificant duplicate work, so we should both > implement it more efficiently where possible, and avoid using it if we can -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9986) Remove SliceableUnfilteredRowIterator
Benedict created CASSANDRA-9986: --- Summary: Remove SliceableUnfilteredRowIterator Key: CASSANDRA-9986 URL: https://issues.apache.org/jira/browse/CASSANDRA-9986 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Benedict Assignee: Benedict Priority: Minor Fix For: 3.0.x After CASSANDRA-9932, there is only one instance where this class is needed, and that is {{SSTableIterator}}. It would be much simpler if, like in {{AbstractBTreePartition}}, the slices were passed into the {{SSTableIterator}} on construction. This would make the control flow easier to follow both: * logically, because: ** memtables and sstables would have the same access pattern; and ** our SSTableIterator would not have two modes of (parallel) operation, which is kind of confusing (the slices are independent of the non-sliced iteration) * in the debugger, because we would have one fewer wrapping iterator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (CASSANDRA-9974) Improve debuggability
[ https://issues.apache.org/jira/browse/CASSANDRA-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-9974: --- Assignee: Benedict > Improve debuggability > - > > Key: CASSANDRA-9974 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9974 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Fix For: 3.0.0 rc1 > > > While 8099 has brought a number of improvements, currently it is making > debugging a bit of a nightmare (for me at least). This slows down development > and test resolution, and so we should fix it sooner than later. This ticket > is intended to aggregate tickets that will improve this situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9974) Improve debuggability
[ https://issues.apache.org/jira/browse/CASSANDRA-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655000#comment-14655000 ] Benedict commented on CASSANDRA-9974: - Assigning to myself so that this encapsulating issue is tracked. > Improve debuggability > - > > Key: CASSANDRA-9974 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9974 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Fix For: 3.0.0 rc1 > > > While 8099 has brought a number of improvements, currently it is making > debugging a bit of a nightmare (for me at least). This slows down development > and test resolution, and so we should fix it sooner than later. This ticket > is intended to aggregate tickets that will improve this situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9978) Split/Scrub tools no longer remove original sstable
[ https://issues.apache.org/jira/browse/CASSANDRA-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9978: Reviewer: Benedict > Split/Scrub tools no longer remove original sstable > --- > > Key: CASSANDRA-9978 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9978 > Project: Cassandra > Issue Type: Bug >Reporter: T Jake Luciani >Assignee: Stefania >Priority: Minor > Fix For: 3.0 beta 1 > > > Looks like CASSANDRA-7066 broke the scrub and split tools. The orig sstable > is no longer removed. > I fixed the sstablesplit dtest so you should see the issue now. The max > sstable size doesn't match expected because it's the original -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9978) Split/Scrub tools no longer remove original sstable
[ https://issues.apache.org/jira/browse/CASSANDRA-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655003#comment-14655003 ] Benedict commented on CASSANDRA-9978: - Yep, no problem. Also: SGTM. > Split/Scrub tools no longer remove original sstable > --- > > Key: CASSANDRA-9978 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9978 > Project: Cassandra > Issue Type: Bug >Reporter: T Jake Luciani >Assignee: Stefania >Priority: Minor > Fix For: 3.0 beta 1 > > > Looks like CASSANDRA-7066 broke the scrub and split tools. The orig sstable > is no longer removed. > I fixed the sstablesplit dtest so you should see the issue now. The max > sstable size doesn't match expected because it's the original -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9986) Remove SliceableUnfilteredRowIterator
[ https://issues.apache.org/jira/browse/CASSANDRA-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655007#comment-14655007 ] Benedict commented on CASSANDRA-9986: - Unassigning, in case somebody else wants to take this before I have time. > Remove SliceableUnfilteredRowIterator > - > > Key: CASSANDRA-9986 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9986 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Benedict >Priority: Minor > Fix For: 3.0.x > > > After CASSANDRA-9932, there is only one instance where this class is needed, > and that is {{SSTableIterator}}. It would be much simpler if, like in > {{AbstractBTreePartition}}, the slices were passed into the > {{SSTableIterator}} on construction. This would make the control flow easier > to follow both: > * logically, because: > ** memtables and sstables would have the same access pattern; and > ** our SSTableIterator would not have two modes of (parallel) operation, > which is kind of confusing (the slices are independent of the non-sliced > iteration) > * in the debugger, because we would have one fewer wrapping iterator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9986) Remove SliceableUnfilteredRowIterator
[ https://issues.apache.org/jira/browse/CASSANDRA-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9986: Assignee: (was: Benedict) > Remove SliceableUnfilteredRowIterator > - > > Key: CASSANDRA-9986 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9986 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Benedict >Priority: Minor > Fix For: 3.0.x > > > After CASSANDRA-9932, there is only one instance where this class is needed, > and that is {{SSTableIterator}}. It would be much simpler if, like in > {{AbstractBTreePartition}}, the slices were passed into the > {{SSTableIterator}} on construction. This would make the control flow easier > to follow both: > * logically, because: > ** memtables and sstables would have the same access pattern; and > ** our SSTableIterator would not have two modes of (parallel) operation, > which is kind of confusing (the slices are independent of the non-sliced > iteration) > * in the debugger, because we would have one fewer wrapping iterator -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator
[ https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655039#comment-14655039 ] Benedict commented on CASSANDRA-9985: - Trivial patch available [here|https://github.com/belliottsmith/cassandra/tree/abstract-iterator] I realise this may (or may not) be contentious, so it is just a suggestion. Opinions solicited. > Introduce our own AbstractIterator > -- > > Key: CASSANDRA-9985 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9985 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Trivial > Fix For: 3.0.0 rc1 > > > The Guava AbstractIterator not only has unnecessary method call depth, it is > difficult to debug without attaching source. Since it's absolutely trivial to > write our own, and it's used widely within the codebase, I think we should do > so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9987) BTree Related Optimisations
Benedict created CASSANDRA-9987: --- Summary: BTree Related Optimisations Key: CASSANDRA-9987 URL: https://issues.apache.org/jira/browse/CASSANDRA-9987 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Fix For: 3.x This ticket is to track a number of small improvements to BTree related functionality. Assigning to myself for tracking purposes only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9988) Introduce leaf-only iterator
Benedict created CASSANDRA-9988: --- Summary: Introduce leaf-only iterator Key: CASSANDRA-9988 URL: https://issues.apache.org/jira/browse/CASSANDRA-9988 Project: Cassandra Issue Type: Sub-task Reporter: Benedict Priority: Minor Fix For: 3.x In many cases we have small btrees, small enough to fit in a single leaf page. In this case it _may_ be more efficient to specialise our iterator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
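As a rough illustration of the ticket's idea, iterating a btree that fits in a single leaf page degenerates to walking an array, with no tree descent or parent-stack bookkeeping. This is a hypothetical sketch, not the codebase's BTree classes:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical leaf-only iterator: when the whole btree is one leaf
// (a single Object[]), iteration is just an index walk over that array.
public class LeafIterator<V> implements Iterator<V> {
    private final Object[] leaf; // values of the single leaf node
    private int i;

    public LeafIterator(Object[] leaf) {
        this.leaf = leaf;
    }

    public boolean hasNext() {
        return i < leaf.length;
    }

    @SuppressWarnings("unchecked")
    public V next() {
        if (!hasNext())
            throw new NoSuchElementException();
        return (V) leaf[i++];
    }

    public static void main(String[] args) {
        Iterator<String> it = new LeafIterator<>(new Object[]{"a", "b", "c"});
        while (it.hasNext())
            System.out.print(it.next()); // abc
        System.out.println();
    }
}
```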
[jira] [Created] (CASSANDRA-9989) Optimise BTree.Builder
Benedict created CASSANDRA-9989: --- Summary: Optimise BTree.Builder Key: CASSANDRA-9989 URL: https://issues.apache.org/jira/browse/CASSANDRA-9989 Project: Cassandra Issue Type: Sub-task Reporter: Benedict Priority: Minor Fix For: 3.x BTree.Builder could reduce its copying, and exploit toArray more efficiently, with some work. It's not very important right now because we don't make as much use of its bulk-add methods as we otherwise might, however over time this work will become more useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9990) Use TreeBuilder directly where possible
Benedict created CASSANDRA-9990: --- Summary: Use TreeBuilder directly where possible Key: CASSANDRA-9990 URL: https://issues.apache.org/jira/browse/CASSANDRA-9990 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Benedict Priority: Minor Fix For: 3.x In cases where we iteratively build a btree with sorted and unique values, we can use a TreeBuilder directly and avoid unnecessary copies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9991) Implement efficient btree removal
Benedict created CASSANDRA-9991: --- Summary: Implement efficient btree removal Key: CASSANDRA-9991 URL: https://issues.apache.org/jira/browse/CASSANDRA-9991 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Benedict Fix For: 3.x Currently removal is implemented as a reconstruction, via a filtering iterator over the original btree. This could be much more efficient, editing just the necessary nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)
[ https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655149#comment-14655149 ] Benedict commented on CASSANDRA-8630: - bq. For the fast path, the built-in BB methods should still be faster, right? Right. bq. readByte() would result in one unsafe get call per byte. The unsafe calls here are all intrinsics, but still - even for fully inlined method calls and unrolled loop we're talking something like 24x the work, but then we have the virtual invocation costs involved (the behaviour of which for a sequence of 8 identical calls I'm not certain - I would hope there is some sharing of the method call burden through loop unrolling, but I don't count on it), and we are probably 100x+ using rigorous finger-in-air maths. > Faster sequential IO (on compaction, streaming, etc) > > > Key: CASSANDRA-8630 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8630 > Project: Cassandra > Issue Type: Improvement > Components: Core, Tools >Reporter: Oleg Anastasyev >Assignee: Stefania > Labels: compaction, performance > Fix For: 3.x > > Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, > flight_recorder_001_files.tar.gz > > > When a node is doing a lot of sequential IO (streaming, compacting, etc) a lot > of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int). > This is because the default implementations of readShort, readLong, etc, as well as > their matching write* counterparts, are implemented as numerous byte-by-byte > reads and writes. > This makes a lot of syscalls as well. > A quick microbench shows that just reimplementing these methods either way > gives an 8x speed increase. > The attached patch implements the RandomAccessReader.read and > SequentialWriter.write methods in a more efficient way. 
> I also eliminated some extra byte copies in CompositeType.split and > ColumnNameHelper.maxComponents, which were on my profiler's hotspot method > list during tests. > Stress tests on my laptop show that this patch makes compaction 25-30% > faster on uncompressed sstables and 15% faster for compressed ones. > A deployment to production shows much less CPU load for compaction. > (I attached a CPU load graph from one of our production nodes: orange is niced CPU > load, i.e. compaction; yellow is user, i.e. tasks not related to compaction) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)
[ https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655149#comment-14655149 ] Benedict edited comment on CASSANDRA-8630 at 8/5/15 10:33 AM: -- bq. For the fast path, the built-in BB methods should still be faster, right? Right. bq. readByte() would result in one unsafe get call per byte. The unsafe calls here are all intrinsics, but still - even for fully inlined method calls and unrolled loop we're talking something like 24x the work, but then we have the virtual invocation costs involved (the behaviour of which for a sequence of 8 identical calls I'm not certain - I would hope there is some sharing of the method call burden through loop unrolling, but I don't count on it), and we are probably 100x+ using rigorous finger-in-air maths. bq. Do we care about little-endian ordering as well Good point. I think we may actually depend on this in MemoryInputStream for some classes that were persisted in weird byte order. However I'm tempted to say we should start pushing this upstream to the places that care, as there are few, and we basically consider them all broken. It's not the first time this has caused annoyance. I'd be tempted to just forbid it, and patch up the few places that need it. was (Author: benedict): bq. For the fast path, the built-in BB methods should still be faster, right? Right. bq. readByte() would result in one unsafe get call per byte. The unsafe calls here are all intrinsics, but still - even for fully inlined method calls and unrolled loop we're talking something like 24x the work, but then we have the virtual invocation costs involved (the behaviour of which for a sequence of 8 identical calls I'm not certain - I would hope there is some sharing of the method call burden through loop unrolling, but I don't count on it), and we are probably 100x+ using rigorous finger-in-air maths. 
> Faster sequential IO (on compaction, streaming, etc) > > > Key: CASSANDRA-8630 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8630 > Project: Cassandra > Issue Type: Improvement > Components: Core, Tools >Reporter: Oleg Anastasyev >Assignee: Stefania > Labels: compaction, performance > Fix For: 3.x > > Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, > flight_recorder_001_files.tar.gz > > > When a node is doing a lot of sequential IO (streaming, compacting, etc) a lot > of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int). > This is because the default implementations of readShort, readLong, etc, as well as > their matching write* counterparts, are implemented as numerous byte-by-byte > reads and writes. > This makes a lot of syscalls as well. > A quick microbench shows that just reimplementing these methods either way > gives an 8x speed increase. > The attached patch implements the RandomAccessReader.read and > SequentialWriter.write methods in a more efficient way. > I also eliminated some extra byte copies in CompositeType.split and > ColumnNameHelper.maxComponents, which were on my profiler's hotspot method > list during tests. > Stress tests on my laptop show that this patch makes compaction 25-30% > faster on uncompressed sstables and 15% faster for compressed ones. > A deployment to production shows much less CPU load for compaction. > (I attached a CPU load graph from one of our production nodes: orange is niced CPU > load, i.e. compaction; yellow is user, i.e. tasks not related to compaction) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)
[ https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655149#comment-14655149 ] Benedict edited comment on CASSANDRA-8630 at 8/5/15 10:34 AM: -- bq. For the fast path, the built-in BB methods should still be faster, right? Right. bq. readByte() would result in one unsafe get call per byte. The unsafe calls here are all intrinsics, but still - even for fully inlined method calls and unrolled loop we're talking something like 24x the work, but then we have the virtual invocation costs involved (the behaviour of which for a sequence of 8 identical calls I'm not certain - I would hope there is some sharing of the method call burden through loop unrolling, but I don't count on it), and we are probably 100x+ using rigorous finger-in-air maths. bq. Do we care about little-endian ordering as well Good point. I think we may actually depend on this in MemoryInputStream for some classes that were persisted in weird byte order. However I'm tempted to say we should start pushing this upstream to the places that care, as there are few, and we basically consider them all broken. It's not the first time this has caused annoyance. I'd be tempted to just forbid it, and patch up the few places that need it. (My opinion, only, and if it looks like a hassle we can just add support here) was (Author: benedict): bq. For the fast path, the built-in BB methods should still be faster, right? Right. bq. readByte() would result in one unsafe get call per byte. 
The unsafe calls here are all intrinsics, but still - even for fully inlined method calls and unrolled loop we're talking something like 24x the work, but then we have the virtual invocation costs involved (the behaviour of which for a sequence of 8 identical calls I'm not certain - I would hope there is some sharing of the method call burden through loop unrolling, but I don't count on it), and we are probably 100x+ using rigorous finger-in-air maths. bq. Do we care about little-endian ordering as well Good point. I think we may actually depend on this in MemoryInputStream for some classes that were persisted in weird byte order. However I'm tempted to say we should start pushing this upstream to the places that care, as there are few, and we basically consider them all broken. It's not the first time this has caused annoyance. I'd be tempted to just forbid it, and patch up the few places that need it. > Faster sequential IO (on compaction, streaming, etc) > > > Key: CASSANDRA-8630 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8630 > Project: Cassandra > Issue Type: Improvement > Components: Core, Tools >Reporter: Oleg Anastasyev >Assignee: Stefania > Labels: compaction, performance > Fix For: 3.x > > Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, > flight_recorder_001_files.tar.gz > > > When a node is doing a lot of sequential IO (streaming, compacting, etc) a lot > of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int). > This is because the default implementations of readShort, readLong, etc, as well as > their matching write* counterparts, are implemented as numerous byte-by-byte > reads and writes. > This makes a lot of syscalls as well. > A quick microbench shows that just reimplementing these methods either way > gives an 8x speed increase. > The attached patch implements the RandomAccessReader.read and > SequentialWriter.write methods in a more efficient way. 
> I also eliminated some extra byte copies in CompositeType.split and > ColumnNameHelper.maxComponents, which were on my profiler's hotspot method > list during tests. > Stress tests on my laptop show that this patch makes compaction 25-30% > faster on uncompressed sstables and 15% faster for compressed ones. > A deployment to production shows much less CPU load for compaction. > (I attached a CPU load graph from one of our production nodes: orange is niced CPU > load, i.e. compaction; yellow is user, i.e. tasks not related to compaction) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
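The "push endianness upstream to the places that care" suggestion could look like the following sketch: the reader decodes big-endian only, and the rare legacy call site flips bytes itself. Class and method names here are illustrative, not from the patch.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: keep the reader big-endian only, and let the few
// legacy call sites that persisted values little-endian convert locally.
public class EndianSketch {
    // The reader always decodes big-endian (network order).
    public static int readIntBE(ByteBuffer in) {
        return in.getInt();
    }

    // A legacy call site that stored its ints little-endian flips the
    // bytes itself, instead of the reader supporting a second byte order.
    public static int readLegacyIntLE(ByteBuffer in) {
        return Integer.reverseBytes(readIntBE(in));
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[]{1, 0, 0, 0});
        System.out.println(readLegacyIntLE(buf.duplicate())); // 1
        System.out.println(readIntBE(buf.duplicate()));       // 16777216
    }
}
```

This keeps the hot read path to a single code shape, at the cost of a cheap `Integer.reverseBytes` intrinsic in the handful of places that need the old order.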
[jira] [Resolved] (CASSANDRA-9925) Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder directly
[ https://issues.apache.org/jira/browse/CASSANDRA-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-9925. - Resolution: Duplicate > Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder > directly > --- > > Key: CASSANDRA-9925 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9925 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Minor > Fix For: 3.0.x > > > There are now a number of situations where we use a BTree.Builder that could > be made to go directly to the TreeBuilder, since they only perform in-order > unique additions to build an initial tree. This would potentially avoid a > number of array allocations/copies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-9600) DescriptorTypeTidy and GlobalTypeTidy do not benefit from being full fledged Ref instances
[ https://issues.apache.org/jira/browse/CASSANDRA-9600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-9600. - Resolution: Won't Fix We've already removed DescriptorTypeTidy, and I don't think it's worth worrying about GlobalTypeTidy any time soon. > DescriptorTypeTidy and GlobalTypeTidy do not benefit from being full fledged > Ref instances > -- > > Key: CASSANDRA-9600 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9600 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Minor > > These inner SSTableReader tidying classes do not benefit from being a full > fledged Ref because they are managed in such a small scope. This increases > our surface area to problems such as CASSANDRA-9549 (these were the affected > instances, ftr). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator
[ https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658273#comment-14658273 ] Benedict commented on CASSANDRA-9985: - I wouldn't care if it weren't for the sheer number of these calls that we chain together now, so that 75% of your code stepping is through a linked jar. With multiple versions it is hard to link the source, and so the majority of your debug stepping is through a morass of unknown. This patch also removes one method call from that chain, which shrinks the height of the call tree significantly when they're added together. But, as I say, I figured this little patch might split opinion, so I only propose it for inclusion at the consensus of the community. It's worth noting that CASSANDRA-9975 also mitigates the problem; however, it leaves a lot of these iterators in the call tree. > Introduce our own AbstractIterator > -- > > Key: CASSANDRA-9985 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9985 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Trivial > Fix For: 3.0.0 rc1 > > > The Guava AbstractIterator not only has unnecessary method call depth, it is > difficult to debug without attaching source. Since it's absolutely trivial to > write our own, and it's used widely within the codebase, I think we should do > so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator
[ https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658310#comment-14658310 ] Benedict commented on CASSANDRA-9985: - FTR, I've had to update the patch to bring it closer to Guava, because we do in fact use null returns (I had thought we did not). To be honest, I would be totally comfortable just porting the Guava code wholesale (or almost wholesale; the removal of tryComputeNext is still nice). > Introduce our own AbstractIterator > -- > > Key: CASSANDRA-9985 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9985 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Trivial > Fix For: 3.0.0 rc1 > > > The Guava AbstractIterator not only has unnecessary method call depth, it is > difficult to debug without attaching source. Since it's absolutely trivial to > write our own, and it's used widely within the codebase, I think we should do > so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator
[ https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659823#comment-14659823 ] Benedict commented on CASSANDRA-9985: - bq. Can I bring up the icache boogyman? You can. Since I'm its main champion, it's a powerful weapon against me. However in this case a little analysis shows that Guava doesn't even use its own {{AbstractIterator}} (either of them!) very much at all. In fact nowhere that we do, from what I can tell. I believe CASSANDRA-9471 uses {{Iterables.mergeSorted}} which does - but we can perform a more efficient merge since it's only 2-way, and otherwise should prefer our own MergeIterator, which has a more efficient heap implementation. So I will update it to avoid using this function, which is a relief, since I am not in favour of overriding external packages if we can avoid it, as it plays havoc with anyone wanting to use signed jars. The only other place we use Guava {{AbstractIterator}} is in OHC. Since this is the "HotKey" iterator, it is used infrequently and probably not to be troubled over. Of course, we do run the risk of introducing the occasional usage in future, but those risks are small given how narrowly it is deployed in Guava. We have a lot of minor issues around duplication, and we should probably file a ticket to specifically track and fix it en masse. Especially regarding duplicate functionality between Guava and java.util.function, but no harm setting up some analysis for everything. It would be great if we had some automated analysis to tell us of duplicate code candidates, in general. I'm certain such tools exist. I'll take a look at purloining any Guava tests. 
> Introduce our own AbstractIterator > -- > > Key: CASSANDRA-9985 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9985 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Trivial > Fix For: 3.0.0 rc1 > > > The Guava AbstractIterator not only has unnecessary method call depth, it is > difficult to debug without attaching source. Since it's absolutely trivial to > write our own, and it's used widely within the codebase, I think we should do > so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator
[ https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659837#comment-14659837 ] Benedict commented on CASSANDRA-9985: - So, I've pushed a version with Guava's test case included, but with two failing tests removed because IMO they aren't things we should worry about: 1) Reentrancy in {{hasNext()}}, i.e. if you create a {{computeNext()}} method that recursively calls {{hasNext()}} then it will throw ISE 2) Continuing to use an iterator that threw an exception during {{computeNext()}}, which will also throw ISE My view is that neither check buys us much, and I prefer the code clarity of the current implementation. However support is easily added at the expense of a bit of ugliness. > Introduce our own AbstractIterator > -- > > Key: CASSANDRA-9985 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9985 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Trivial > Fix For: 3.0.0 rc1 > > > The Guava AbstractIterator not only has unnecessary method call depth, it is > difficult to debug without attaching source. Since it's absolutely trivial to > write our own, and it's used widely within the codebase, I think we should do > so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
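A minimal home-grown AbstractIterator along the lines discussed (a single state field, no tryComputeNext indirection, and deliberately without the two removed Guava checks for reentrancy and failed state) might look like this sketch; the class name is illustrative:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical minimal AbstractIterator: one state field, null elements
// allowed, and none of Guava's reentrancy/failed-state guards.
public abstract class SimpleAbstractIterator<V> implements Iterator<V> {
    private enum State { READY, NOT_READY, DONE }
    private State state = State.NOT_READY;
    private V next;

    // Subclasses return the next element, or call endOfData() and return
    // its (null) result to signal exhaustion.
    protected abstract V computeNext();

    protected V endOfData() {
        state = State.DONE;
        return null;
    }

    public boolean hasNext() {
        switch (state) {
            case READY: return true;
            case DONE:  return false;
            default:
                next = computeNext();   // may itself call endOfData()
                if (state == State.DONE)
                    return false;
                state = State.READY;
                return true;
        }
    }

    public V next() {
        if (!hasNext())
            throw new NoSuchElementException();
        state = State.NOT_READY;
        return next;
    }

    public static void main(String[] args) {
        // Example subclass: yields 0, 1, 2, then stops.
        Iterator<Integer> it = new SimpleAbstractIterator<Integer>() {
            int i = 0;
            protected Integer computeNext() {
                if (i < 3) return i++;
                return endOfData();
            }
        };
        while (it.hasNext())
            System.out.print(it.next() + " "); // 0 1 2
        System.out.println();
    }
}
```

If a computeNext() implementation reentrantly calls hasNext(), or throws and is then reused, this sketch's behaviour is simply undefined, which is the trade-off the comment above accepts for code clarity.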
[jira] [Commented] (CASSANDRA-9894) Serialize the header only once per message
[ https://issues.apache.org/jira/browse/CASSANDRA-9894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660099#comment-14660099 ] Benedict commented on CASSANDRA-9894: - bq. I changed it to return the assembled list and then the test didn't pass. Thanks. I've added two commits, one of which fixes this, and another which switches to delta encoding, which naturally reduces both huge serialization paths to one, making it much cleaner. bq. There are a couple of other methods like digest, selectOrderIterator, getComplex, and getSimple that don't run. I get surprised pretty regularly so I would test those as well. {{selectOrderIterator}} was already being tested, and I've added coverage for {{getComplex}} and {{getSimple}}. However, {{digest}} and {{getDroppedColumnDefinition}} I'm not so sure about: I kind of feel they only make sense in a larger picture, i.e. in a dtest. I don't mind doing a noddy "check it does what the code already says it does" kind of test, but I'm not convinced it adds value by itself. bq. Can you use a random with a known or logged seed for the test? This turned out to be much more annoying: {{ThreadLocalRandom}} does not support setting (or fetching) its seed, and the plain {{Random}} API is much more limited - it lacks a lot of useful methods. We should file a ticket for exposing an easily reseeded/logged Random with the full TLR API, I think. > Serialize the header only once per message > -- > > Key: CASSANDRA-9894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9894 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Sylvain Lebresne >Assignee: Benedict > Fix For: 3.0 beta 1 > > > One last improvement I'd like to do on the serialization side is that we > currently serialize the {{SerializationHeader}} for each partition. 
That > header contains the serialized columns in particular, and for range queries > serializing it for every partition is wasteful (note that this is only a problem > for the messaging protocol, as for sstables we only write the header once per > sstable). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
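As a concrete illustration of the seeding problem raised above: since {{ThreadLocalRandom}} cannot be seeded, a test can instead pick a seed, log it, and hand back a plain {{java.util.Random}}, so any failure can be replayed. This is a hedged sketch of the idea only — the class and method names are mine, not from the patch:

```java
import java.util.Random;

// Test helper sketch: choose a seed, log it, return a reproducible Random.
// A failing run prints its seed, which can then be fed to fromSeed() to
// replay exactly the same sequence of random values.
public final class SeededRandom
{
    public static Random withLoggedSeed()
    {
        long seed = System.nanoTime();
        System.err.println("test seed: " + seed + "L (pass it to fromSeed() to reproduce)");
        return new Random(seed);
    }

    // Replay path: reconstruct the generator from a logged seed.
    public static Random fromSeed(long seed)
    {
        return new Random(seed);
    }
}
```

The remaining annoyance the comment alludes to is real: {{Random}} lacks the range-bounded convenience methods ({{nextInt(origin, bound)}} etc.) that {{ThreadLocalRandom}} grew, so a full replacement would need to add those.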
[jira] [Updated] (CASSANDRA-9925) Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder directly
[ https://issues.apache.org/jira/browse/CASSANDRA-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9925: Fix Version/s: (was: 3.0.x) > Sorted + Unique BTree.Builder usage can use a thread-local TreeBuilder > directly > --- > > Key: CASSANDRA-9925 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9925 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Minor > > There are now a number of situations where we use a BTree.Builder that could > be made to go directly to the TreeBuilder, since they only perform in-order > unique additions to build an initial tree. This would potentially avoid a > number of array allocations/copies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9962) WaitQueueTest is flakey
[ https://issues.apache.org/jira/browse/CASSANDRA-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9962: Fix Version/s: (was: 3.x) 3.0 beta 1 2.2.1 2.1.9 > WaitQueueTest is flakey > --- > > Key: CASSANDRA-9962 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9962 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Minor > Fix For: 2.1.9, 2.2.1, 3.0 beta 1 > > > While the test is a little noddy, and superfluous, it shouldn't fail, even > vanishingly infrequently. [~aweisberg] has spotted it doing so, and I have > also encountered it once, so I suspect that a change in hardware/OS may have > made a vanishingly unlikely failure merely pretty unlikely, which is even less good. > Right now the test depends on {{Thread.start()}} completing before the new thread > starts running; this isn't guaranteed. This patch fixes that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
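The race described above — assuming the new thread is already running just because {{Thread.start()}} returned — is commonly fixed by having the thread announce itself before the test proceeds. A hedged sketch of that handshake pattern (not the actual WaitQueueTest patch; names are illustrative):

```java
import java.util.concurrent.CountDownLatch;

// Sketch of the fix pattern: don't assume the new thread has begun executing
// just because start() returned; make the thread signal via a latch.
public class StartBarrier
{
    public static Thread startAndAwaitRunning(Runnable task) throws InterruptedException
    {
        CountDownLatch running = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            running.countDown(); // signal: we are actually executing now
            task.run();
        });
        t.start();
        running.await();         // block until the thread has really begun
        return t;
    }
}
```

After {{startAndAwaitRunning}} returns, the test may safely rely on the thread having started, which removes the scheduling-dependent flakiness.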
[jira] [Commented] (CASSANDRA-9998) LEAK DETECTED with snapshot/sequential repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-9998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661382#comment-14661382 ] Benedict commented on CASSANDRA-9998: - +1 > LEAK DETECTED with snapshot/sequential repairs > -- > > Key: CASSANDRA-9998 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9998 > Project: Cassandra > Issue Type: Bug >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson > Fix For: 3.x, 2.1.x, 2.2.x > > > http://cassci.datastax.com/job/cassandra-2.1_dtest/lastCompletedBuild/testReport/repair_test/TestRepair/simple_sequential_repair_test/ > does not happen if I add -par to the test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9985) Introduce our own AbstractIterator
[ https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661392#comment-14661392 ] Benedict commented on CASSANDRA-9985: - Right, it seems exceedingly difficult to get into, and is a worse misuse of the iterator than the iterator can handle. My main reason to not want to address it, though, is that we will be stepping through this {{AbstractIterator.hasNext()}} often, and I'd like it to be easy and attractive to do so. I had a quick go at making it remain easy to read and, hopefully, step through, while still catching this edge case. I've pushed a version that I think meets the criteria. > Introduce our own AbstractIterator > -- > > Key: CASSANDRA-9985 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9985 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Trivial > Fix For: 3.0.0 rc1 > > > The Guava AbstractIterator not only has unnecessary method call depth, it is > difficult to debug without attaching source. Since it's absolutely trivial to > write our own, and it's used widely within the codebase, I think we should do > so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8630) Faster sequential IO (on compaction, streaming, etc)
[ https://issues.apache.org/jira/browse/CASSANDRA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661560#comment-14661560 ] Benedict commented on CASSANDRA-8630: - It would be better, from that perspective, to use native order. However: # Swapping byte order is a single cycle instruction, so is probably not a big deal, given how inefficient we are otherwise right now # We would have to require that users have nodes all sharing the same architecture (or at least endianness), which may or may not be limiting. Or we could pick the most common platform endianness and go little endian # We would have to support both formats, at least during upgrade All told, I think it's easier to stick with the big endian format, as is the Java standard and standard across our codebase. But I'm not opposed to the alternative of providing a patch to support both, followed by an upgrade period and dropping support for big endian in 4.0. > Faster sequential IO (on compaction, streaming, etc) > > > Key: CASSANDRA-8630 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8630 > Project: Cassandra > Issue Type: Improvement > Components: Core, Tools >Reporter: Oleg Anastasyev >Assignee: Stefania > Labels: compaction, performance > Fix For: 3.x > > Attachments: 8630-FasterSequencialReadsAndWrites.txt, cpu_load.png, > flight_recorder_001_files.tar.gz > > > When a node is doing a lot of sequential IO (streaming, compacting, etc) a lot > of CPU is lost in calls to RAF's int read() and DataOutputStream's write(int). > This is because the default implementations of readShort, readLong, etc, as well as > their matching write* methods, are implemented as numerous byte-by-byte > reads and writes. > This makes a lot of syscalls as well. > A quick microbenchmark shows that just reimplementing these methods either way > gives an 8x speed increase. > The attached patch implements the RandomAccessReader.read and > SequencialWriter.write methods in a more efficient way. 
> I also eliminated some extra byte copies in CompositeType.split and > ColumnNameHelper.maxComponents, which were on my profiler's hotspot method > list during tests. > Stress tests on my laptop show that this patch makes compaction 25-30% > faster on uncompressed sstables and 15% faster for compressed ones. > A deployment to production shows much less CPU load for compaction. > (I attached a cpu load graph from one of our production clusters; orange is niced CPU > load - i.e. compaction; yellow is user - i.e. non-compaction-related tasks) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
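The speedup described above comes from replacing the JDK's byte-at-a-time default implementations with bulk accesses against a pre-filled buffer. An illustrative sketch — not the attached patch — of what the optimised read path looks like:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch: the JDK's default readLong() performs eight
// single-byte read() calls (each potentially a virtual dispatch); reading
// from a pre-filled buffer does one bounds check and one 8-byte load.
final class BufferedLongReader
{
    private final ByteBuffer buffer;

    BufferedLongReader(byte[] bytes)
    {
        // Big endian: the Java standard, and the format discussed above.
        this.buffer = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN);
    }

    long readLong()
    {
        // If native little endian order were ever adopted instead,
        // Long.reverseBytes(v) performs the swap in a single instruction
        // on modern CPUs, as the comment above notes.
        return buffer.getLong();
    }
}
```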
[jira] [Commented] (CASSANDRA-8921) Experiment with a probabilistic tree of membership for maxPurgeableTimestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661802#comment-14661802 ] Benedict commented on CASSANDRA-8921: - Hi Daniel, No one is working on this, no, and we'd love you to have a go. Since it's a really non-trivial ticket, though, I feel like I should issue a brief warning that it will be more difficult than your average patch to reach inclusion. It's some fairly fundamental algorithmic work, and effectively a research piece, so it will receive a lot of scrutiny. It's also not *100%* certain that it will pay dividends. I expect it will, but it may have costs for some workloads, and there are open questions: * do we rebuild this tree each time we cross an sstable boundary? (this may lead to high CPU cost in some scenarios) * do we build one tree up-front, and then post-filter? (this may lead to a significant increase in memory requirements) * do we hybridise, chaining new BFs on wholesale and rebalancing the tree once we have enough that the cost will be sufficiently amortized? This all makes it a really fun piece of work, but I would recommend you think through how the algorithm would work and post a high-level design here, preferably with a little blurb about your expectation of the tradeoffs. That way we can discuss it up-front, and hopefully ease the gradual process of inclusion. > Experiment with a probabilistic tree of membership for maxPurgeableTimestamp > > > Key: CASSANDRA-8921 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8921 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict > > maxPurgeableTimestamp appears to be a significant cost for some workloads, > the majority of which stems from the cost of membership tests across the > overlapping tables. 
It would be possible to construct a tree of bloom filters > from the existing filters that could answer queries for the possible > membership of a given key with logarithmic performance, and it appears there > is a research paper (that I haven't dived into yet) that outlines something > like this: http://www.usna.edu/Users/cs/adina/research/Bloofi%20_CloudI2013.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
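The Bloofi-style structure referenced above can be sketched as follows. Each inner node stores the bitwise OR of its children's filters, so a negative test at an inner node prunes the whole subtree. This is purely illustrative — it uses {{java.util.BitSet}} in place of Cassandra's bloom filter classes, and the names are mine:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hedged sketch of a bloom filter tree: inner nodes hold the union of all
// descendant filters; querying prunes subtrees on the first unset bit.
final class BloomTreeNode
{
    final BitSet filter;                  // union of all descendant filters
    final List<BloomTreeNode> children = new ArrayList<>();
    final Object leafValue;               // e.g. an sstable; null for inner nodes

    BloomTreeNode(BitSet filter, Object leafValue)
    {
        this.filter = filter;
        this.leafValue = leafValue;
    }

    static BloomTreeNode inner(List<BloomTreeNode> kids)
    {
        BitSet union = new BitSet();
        for (BloomTreeNode k : kids)
            union.or(k.filter);
        BloomTreeNode node = new BloomTreeNode(union, null);
        node.children.addAll(kids);
        return node;
    }

    // Collect the leaves that may contain the key (given its hash bits).
    void query(int[] hashBits, List<Object> maybe)
    {
        for (int bit : hashBits)
            if (!filter.get(bit))
                return;                   // definite miss: prune this subtree
        if (leafValue != null)
            maybe.add(leafValue);
        for (BloomTreeNode child : children)
            child.query(hashBits, maybe);
    }
}
```

A query thus touches only the subtrees whose union filter matches, which is where the hoped-for logarithmic behaviour comes from; the open questions in the comment above concern when and how this tree gets rebuilt.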
[jira] [Commented] (CASSANDRA-10002) Repeated slices on RowSearchers are incorrect
[ https://issues.apache.org/jira/browse/CASSANDRA-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662156#comment-14662156 ] Benedict commented on CASSANDRA-10002: -- It's worth pointing out this code will be removed by CASSANDRA-9932 > Repeated slices on RowSearchers are incorrect > - > > Key: CASSANDRA-10002 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10002 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Tyler Hobbs >Assignee: Tyler Hobbs > Fix For: 3.0 beta 1 > > > In {{AbstractThreadUnsafePartition}}, repeated {{slice()}} calls on a > {{RowSearcher}} can produce incorrect results. This is caused by only > performing a binary search over a sublist (based on {{nextIdx}}), but not > taking {{nextIdx}} into account when using the search result index. > I made a quick fix in [this > commit|https://github.com/thobbs/cassandra/commit/73725ea6825c9c0da1fa4986b01f39ae08130e10] > on one of my branches, but the full fix also needs to cover > {{ReverseRowSearcher}} and include a test to reproduce the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
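The bug class described above is the standard sublist-offset mistake: a binary search over a sublist starting at {{nextIdx}} returns an index relative to the sublist, which must be shifted back before indexing the full list. A simplified sketch, not the actual RowSearcher code (names illustrative):

```java
import java.util.Collections;
import java.util.List;

// Illustrative reproduction of the bug class: Collections.binarySearch over
// rows.subList(nextIdx, ...) returns a sublist-relative index. Using it
// directly against the full list is the bug; the offset must be re-added
// on both the found and not-found (negative insertion-point) paths.
final class SublistSearch
{
    static int searchFrom(List<Integer> rows, int nextIdx, int key)
    {
        int rel = Collections.binarySearch(rows.subList(nextIdx, rows.size()), key);
        // Buggy version: return rel;
        // Correct: shift found indexes by nextIdx, and rebuild the
        // -(insertionPoint) - 1 encoding around the shifted insertion point.
        return rel >= 0 ? rel + nextIdx : -(-rel - 1 + nextIdx) - 1;
    }
}
```

In the first call to {{slice()}} the offset is zero, which is why the bug only shows up on repeated slices.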
[jira] [Updated] (CASSANDRA-9500) SequentialWriter should extend BufferedDataOutputStreamPlus
[ https://issues.apache.org/jira/browse/CASSANDRA-9500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9500: Reviewer: Benedict > SequentialWriter should extend BufferedDataOutputStreamPlus > --- > > Key: CASSANDRA-9500 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9500 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict >Assignee: Ariel Weisberg >Priority: Minor > > SequentialWriter is the last piece left not implementing > DataOutputStreamPlus. It should not be too tricky to extend it, and it should > result in improved write throughput. Further, it will permit it to exploit > the more efficient implementation of writeVInt being delivered in > CASSANDRA-9499 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9985) Introduce our own AbstractIterator
[ https://issues.apache.org/jira/browse/CASSANDRA-9985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-9985: Fix Version/s: (was: 3.0.0 rc1) 3.0 beta 1 > Introduce our own AbstractIterator > -- > > Key: CASSANDRA-9985 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9985 > Project: Cassandra > Issue Type: Sub-task > Components: Core >Reporter: Benedict >Assignee: Benedict >Priority: Trivial > Fix For: 3.0 beta 1 > > > The Guava AbstractIterator not only has unnecessary method call depth, it is > difficult to debug without attaching source. Since it's absolutely trivial to > write our own, and it's used widely within the codebase, I think we should do > so. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9533) Make batch commitlog mode easier to tune
[ https://issues.apache.org/jira/browse/CASSANDRA-9533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662908#comment-14662908 ] Benedict commented on CASSANDRA-9533: - If you're trickling one erase block every 10ms, I think most consumer grade SSDs will still last years. So even that's probably not worth worrying about. It's also been shown in some workloads that having a delay may help spinning disks, but my theoretical understanding is that this was unique to our tool: it caused the work producer (stress) to phase its work production so that there were fewer writes. This might be beneficial to some users as well, but it's hard to say, and probably only for a minority. I've committed to 2.1, 2.2, 3.0 and trunk. > Make batch commitlog mode easier to tune > > > Key: CASSANDRA-9533 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9533 > Project: Cassandra > Issue Type: Improvement >Reporter: Jonathan Ellis >Assignee: Benedict > Fix For: 3.x > > > As discussed in CASSANDRA-9504, 2.1 changed commitlog_sync_batch_window_in_ms > from the maximum time to wait between fsyncs to the minimum time, so one must be > very careful to keep it small enough that most writers aren't kept waiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9932) Make all partitions btree backed
[ https://issues.apache.org/jira/browse/CASSANDRA-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662973#comment-14662973 ] Benedict commented on CASSANDRA-9932: - bq. This change to a large extent looks like a refactor and not new code so I am trying not to wade too deep into the existing code This ticket is, in essence, a part of my review of CASSANDRA-8099, but filed as its own ticket to parallelize and make it digestible. As such, I'm not sure how strongly it counts as "existing code", and nothing should be taken totally for granted here. bq. With default methods it really seems to me that we always want @Override annotations I agree, but rather than wait for a PMC adjudication on a change of practice, I've just opted to roll back the use of a default method in that particular change, since it wasn't really necessary. bq. I ran code coverage for some tests and there seems to be untested stuff in BTree such as transformAndFilter. reverse() which is new for this patch also has no coverage. {{transformAndFilter}} is already covered by {{LongBTreeTest}}, and is not new to this patch. I've added {{reverse()}} coverage. bq. I just want to focus on how convincing the tests are bq. I don't see unit tests for the various partition types in isolation AtomicBTreePartition from PartitionTest doesn't cover stuff like waste tracking or pessimistic locking. Is that coverage supposed to come indirectly via Memtable tests? AbstractBTreePartition also has stuff that isn't tested. Test coverage is still a bit of a wasteland here. I would rather not bottleneck on this, however, since this patch is explicitly designed to reduce our exposure to poorly tested new code. We certainly can and should file a follow-up ticket to improve all of these areas, and I will try to address that pre-3.0 release, once the more pressing 8099 review-generated work is done. 
> Make all partitions btree backed > > > Key: CASSANDRA-9932 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9932 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Benedict >Assignee: Benedict > Fix For: 3.0.0 rc1 > > > Following on from the other btree related refactors, this patch makes all > partition (and partition-like) objects backed by the same basic structure, > {{AbstractBTreePartition}}, with two main offshoots: > {{ImmutableBTreePartition}} and {{AtomicBTreePartition}}. > The main upshot is a 30% net code reduction, meaning better exercise of btree > code paths and fewer new code paths to go wrong. A secondary upshot is that, > by funnelling all our comparisons through a btree, there is a higher > likelihood of icache occupancy, and we have only one area on which to focus > delivery of improvements, for the benefit of all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)