[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI
[ https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593803#comment-16593803 ] Alex Petrov edited comment on CASSANDRA-11990 at 8/27/18 3:33 PM:
--

[~jrwest] thank you for the patch! I haven't reviewed it fully just yet, but I have several comments:
* Since we now store clusterings in the index, index sizes have grown considerably: in the prefix case we get 1.5G of index for 175M of data (the old version yields 660M).
* I realize there might be no hard limit on the number of clusterings, but a) using {{Integer.BYTES}} seems like a lot of bytes even though it's written only once per partition, and b) we should be able to infer the clustering count from the schema, so writing it into the index might be redundant. I've tried naively discarding the value and replacing {{buffer.getInt}} with {{clusteringComparator.size()}} and it seems to work (however, cases with static clusterings, static tables, etc. should be double-checked).
* I might be mistaken, but we can infer whether or not to read an entire partition from the query itself, rather than going through all clusterings as is done [here|https://github.com/jrwest/cassandra/compare/deec448d8c1adee8663281708dc5e0060d0fa250...5d63aae5ffba7569811fad6a818163fcf3d26c6f#diff-df913f108e1290500a9afb8cd4638189R114].
* I've tried ingesting a relatively small SSTable (~300Mb): old SASI didn't seem to have any issues, while the new version struggles with GC. I'd have to do a lot more testing and verification to say what the cause might be.
* It'd be great to have at least a rough description of the index file format. We didn't have one before, I know, but if we could write one while at it, it might be very useful for future work.
* Given that we're fetching clusterings (and can therefore say which rows are "the same" across partitions), and data reconciliation is done by a controller that goes through the "normal" read path, do we still need to re-filter the data in the case of non-static clusterings?
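To make the second point concrete, here is a minimal sketch of inferring the clustering count from the schema's comparator instead of trusting a per-partition {{Integer.BYTES}} value in the index. The names ({{readClusterings}}, the nested {{ClusteringComparator}} stand-in) and the 16-bit length-prefix layout are illustrative assumptions, not the actual SASI classes or on-disk format:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class ClusteringReader {
    // Stand-in for o.a.c.db.ClusteringComparator#size() — hypothetical.
    interface ClusteringComparator { int size(); }

    static List<ByteBuffer> readClusterings(ByteBuffer buffer, ClusteringComparator comparator) {
        // Old format would do: int count = buffer.getInt();
        int count = comparator.size(); // inferred from the schema instead
        List<ByteBuffer> components = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int len = buffer.getShort() & 0xFFFF; // assumed 16-bit length prefix
            ByteBuffer component = buffer.slice();
            component.limit(len);                 // view over just this component
            components.add(component);
            buffer.position(buffer.position() + len); // advance past the payload
        }
        return components;
    }
}
```

As the comment notes, static rows/tables would need special-casing here, since their clustering shape does not match {{clusteringComparator.size()}}.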
Could you also describe the motivation for embedding rows rather than row offsets in the index? I thought it could improve some range-iteration cases and eliminate additional filtering, but since we're already seeking into the data file to materialize the key, we could fetch the clusterings from there as well.
> Address rows rather than partitions in SASI
> ---
>
> Key: CASSANDRA-11990
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11990
> Project: Cassandra
> Issue Type: Improvement
> Components: CQL, sasi
> Reporter: Alex Petrov
> Assignee: Jordan West
> Priority: Major
> Fix For: 4.x
>
> Attachments: perf.pdf, size_comparison.png
>
> Currently, the lookup in the SASI index returns the key position of the partition. After the partition lookup, the rows are iterated and the operators are applied in order to filter out the ones that do not match.
> bq. TokenTree which accepts variable size keys (such would enable different partitioners, collections support, primary key indexing etc.),
[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI
[ https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412775#comment-15412775 ] Pavel Yaskevich edited comment on CASSANDRA-11990 at 8/9/16 1:12 AM:
-

I looked at the code again, and I think we might be better off keeping only Murmur3Partitioner (tokens as longs) for now, instead of changing so much just to support RP. That would also make the upgrade path much easier for us: a) add clustering positions, b) extend token sizes to variable size — instead of having an intermediate format of multi-size fixed tokens.

*Code remarks:*
- https://github.com/apache/cassandra/compare/trunk...ifesdjeen:11990-trunk#diff-2143dfad950ab134c0a25bd903c2875aR143 what is the point of logger.info if it's still wrapped in logger.isDebugEnabled()?
- Can you move the drop-data and rebuild related changes to a separate branch, so we can keep the token tree changes and the bug fixes separate? They are separate tickets.
- https://github.com/apache/cassandra/compare/trunk...ifesdjeen:11990-trunk#diff-afe204f22033543e3ae0185240523a9bR113 I'm not sure what this `// ignore` is about.
- SasiToken -> SASIToken, since SASI is already an abbreviation.
- Back to the point about not supporting RP: the way TokenTreeSerializationHelper is currently done is not optimal, since it has to be explicitly carried around, which suggests it's not the right abstraction to have. Higher levels like QueryPlan/Operation/Expression/ColumnIndex should never care about TokenTree details, especially about how its tokens are serialized; we should either use the IPartitioner interface or nothing at all.
- An upgrade test is missing for TokenTree.
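The logging remark above is worth making concrete: a debug-level guard around an INFO statement means the message is silently dropped whenever debug logging is off. A small self-contained demonstration with java.util.logging (Cassandra itself uses slf4j, but the guard semantics are the same; the method and message are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class GuardDemo {
    // Returns how many records actually reach the handler when an INFO
    // statement is wrapped in a debug-level (FINE) guard.
    static int logWithDebugGuard(Logger logger) {
        AtomicInteger published = new AtomicInteger();
        Handler counter = new Handler() {
            @Override public void publish(LogRecord r) { published.incrementAndGet(); }
            @Override public void flush() {}
            @Override public void close() {}
        };
        logger.setUseParentHandlers(false);
        logger.addHandler(counter);
        logger.setLevel(Level.INFO); // typical production level: debug disabled

        if (logger.isLoggable(Level.FINE))   // debug guard...
            logger.info("rebuilding index"); // ...never executes: the INFO message is lost

        logger.removeHandler(counter);
        return published.get(); // 0 — the level mismatch drops the message
    }
}
```

The guard level should match the statement's level; mismatched guards either suppress output that was meant to be visible or defeat the purpose of guarding.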
[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI
[ https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408322#comment-15408322 ] Alex Petrov edited comment on CASSANDRA-11990 at 8/5/16 1:29 PM:
-

I've tested the random partitioner and it (mostly) works, although, as suggested, I'd highly advise bringing in support for different partitioners in the next patch. The reason for leaving the random partitioner out is that the test suite will require quite some refactoring, and the current patch is already quite big. I've implemented a serialization helper that abstracts all the logic required for fixed-size token reading. For variable-size token reading we'll have to invest some more work and implement skipping in a reasonable way.

I've implemented support for reading the old format (ab). To make the testing more complete, I'm going to work on upgrade tests for SASI. dtests are also missing for SASI, so I'll create a ticket for that.

|[trunk|https://github.com/ifesdjeen/cassandra/tree/11990-trunk]|[testall|https://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-11182-trunk-testall/]|[dtest|https://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-11182-trunk-dtest/]|

Created [CASSANDRA-12389] and [CASSANDRA-12390] to track the follow-up work on supporting different partitioners.
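The fixed- vs. variable-size distinction above can be sketched as follows: fixed-size tokens allow O(1) offset arithmetic, while variable-size tokens must be skipped entry by entry. The length-prefixed layout here is an illustrative assumption, not the actual SASI on-disk format:

```java
import java.nio.ByteBuffer;

final class TokenSeek {
    // Fixed-size tokens: the i-th token starts at a computed position.
    static int fixedOffset(int index, int tokenSize) {
        return index * tokenSize; // simple multiplication, O(1)
    }

    // Variable-size tokens (assumed short length prefix per token):
    // reaching the i-th token requires skipping all preceding entries.
    static int variableOffset(ByteBuffer tokens, int index) {
        int pos = 0;
        for (int i = 0; i < index; i++) {
            int len = tokens.getShort(pos) & 0xFFFF; // read the length prefix
            pos += Short.BYTES + len;                // skip prefix + payload
        }
        return pos; // O(n) scan up to the requested entry
    }
}
```

This is why the comment calls for "skipping in some reasonable way": without auxiliary structure, variable-size entries lose the random access that the fixed-size layout gets for free.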
[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI
[ https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399120#comment-15399120 ] Alex Petrov edited comment on CASSANDRA-11990 at 8/1/16 6:24 PM:
-

During several discussions with [~xedin], we came up with the idea of evaluating support for different partitioners, since it would help with wider SASI adoption and remove the current limitation to Long tokens. I've evaluated it and can conclude that support for constant-size tokens can be included in this patch without large overhead; the patch was adjusted accordingly. There are still several failing tests, although they'll be fixed shortly.

Support for variable-size tokens (for partitioners such as {{ByteOrderedPartitioner}}) requires a much larger time investment. My personal suggestion is to encode them with their size and avoid on-disk format changes. This will result in a more complex iteration process for variable-size tokens, since we'll have to skip tokens depending on their size and won't be able to use simple multiplication for offset calculation. I've made a small proof-of-concept patch for variable-size tokens by adding a {{serializedSize}} method to the token tree nodes; currently (for the sake of the POC and to save some time) it was done by reusing the {{serialize}} function with a throwaway byte buffer, and calculating offsets by iterating and reading integers holding the token size. It worked just fine for simple cases. I'll mention that the SASI code is written very well and the offset calculation methods are well isolated.

That said, I'd suggest leaving the "algorithmic" heavy lifting (variable token offset calculation) for a separate ticket, to reduce the scope of the current one. Since it's not going to require on-disk format changes, we can safely postpone this work. Another thing that's been mentioned is to include the column offset in the clustering offset long.
I'll be evaluating this proposal in terms of performance today. It seems we can avoid increasing the size of the {{long[]}} array that holds offsets, and this change can help avoid post-filtering altogether. An additional optimisation (which, once again, could be left for a follow-up patch) is to avoid the second seek into the data file when we are only querying columns that are indexed. This could be a significant performance improvement, although it'd be good to discuss whether such queries are widely used. cc [~slebresne] [~iamaleksey] [~jbellis] [~beobal]
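The packing idea mentioned above — keeping the {{long[]}} offsets array the same size by encoding two values in one long — can be sketched like this. The 48/16 bit split and the class name are illustrative assumptions, not SASI's actual layout:

```java
final class PackedOffset {
    private static final int ROW_BITS = 16;                 // assumed split: 48 high / 16 low
    private static final long ROW_MASK = (1L << ROW_BITS) - 1;

    // Pack a partition offset (high bits) and a row offset (low bits)
    // into a single long, so the offsets array does not grow.
    static long pack(long partitionOffset, int rowOffset) {
        if (partitionOffset >= (1L << (64 - ROW_BITS)) || rowOffset < 0 || rowOffset > ROW_MASK)
            throw new IllegalArgumentException("offset out of packed range");
        return (partitionOffset << ROW_BITS) | rowOffset;
    }

    static long partitionOffset(long packed) { return packed >>> ROW_BITS; }
    static int rowOffset(long packed)        { return (int) (packed & ROW_MASK); }
}
```

The trade-off is the range limit on each component: whatever bit budget goes to the row offset is taken away from the partition offset, so the split has to match realistic file and partition sizes.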
[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI
[ https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393773#comment-15393773 ] Alex Petrov edited comment on CASSANDRA-11990 at 7/26/16 1:20 PM:
--

Attached is the size comparison graph between trunk and this branch (smaller is better). As in the performance test, I'm inserting 100 partitions containing 10, 100 and 1000 clusterings, with different-sized partition keys. cc [~xedin]
[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI
[ https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350152#comment-15350152 ] Alex Petrov edited comment on CASSANDRA-11990 at 6/26/16 3:21 PM:
--

I've done a small investigation of what it would take and talked to several people about potential scenarios. First of all, I'd note that this will be a rather big change, requiring a new on-disk format for the SASI {{TokenTree}}. Several steps would need to be taken:
* tl;dr version: we have to extend TokenTree to fit a row offset alongside the partition offset. More elaborate version: SASI is currently a highly optimised tree that encodes {{long token}}/{{short + int}} entries; since the maximum offset does not exceed 48 bits, further optimisations are possible, and several optimisation steps improve read performance and storage overhead. A short description of the current format can be found [here|https://gist.github.com/ifesdjeen/0436faf9a66b401ace0ad0947d256317]. Since we'll have to hold two offsets (a partition offset and a row offset; the partition offset is required to read the partition key, static rows, etc.), as a first step, for the proof of concept, we'll reduce the distinctions to the simpler cases (single and multiple offsets). The remaining combinations of optimisations can follow, the most obvious being when both items fit into a single long, possibly adding more distinctions if they're flexibly skippable/addressable.
* Currently we can only read the decorated partition key, so we need to extend storage to address a single row as well.
* We need to extend TokenTree to support other partitioners (whether or not that's done in the scope of this ticket, we'll have to make sure we're not making it harder to extend it this way).
* There might be a need to store an order-preserving hash of the clustering keys for queries where a row is split across multiple SSTables, although I have to gather more information on that one, as we might be able to resolve rows after reading them from the SSTables.
* We'll need to find migration/upgrade paths from the current format, which may involve re-indexing (failing queries while the upgrade is in progress) or supporting two format versions at read time, to serve reads from the old format while indexes are rebuilt.

cc [~xedin] [~beobal] [~jrwest]
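The "single and multiple offsets" simplification described in the first bullet above can be sketched as a token entry that either carries one (partition, row) offset pair or a list of pairs when the same token occurs in several rows. This is entirely illustrative — a stand-in for the idea, not the TokenTree format:

```java
import java.util.ArrayList;
import java.util.List;

final class TokenEntry {
    final long token;
    // Each element is {partitionOffset, rowOffset}; the partition offset is
    // still needed to read the partition key and static rows.
    private final List<long[]> offsets = new ArrayList<>();

    TokenEntry(long token) { this.token = token; }

    void add(long partitionOffset, long rowOffset) {
        offsets.add(new long[]{ partitionOffset, rowOffset });
    }

    // The two cases the comment proposes to distinguish first: a single
    // pair (could be stored inline) vs. multiple pairs (stored out of line).
    boolean isSingle() { return offsets.size() == 1; }

    List<long[]> offsets() { return offsets; }
}
```

Collapsing the format to these two cases first, and only later re-introducing packed variants (e.g. both offsets in one long), keeps the proof of concept small while leaving room for the optimisations the comment defers.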