[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI

2018-08-27 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593803#comment-16593803
 ] 

Alex Petrov edited comment on CASSANDRA-11990 at 8/27/18 3:33 PM:
--

[~jrwest] thank you for the patch! 

I haven't fully reviewed the patch just yet, but I have several comments: 

  * since we now store clusterings in the index, index sizes got much bigger: 
in the prefix case we get 1.5G of index for 175M of data (the old version 
yields 660M). 
  * I realize there might be no hard limit on the number of clusterings, but a) 
using {{Integer.BYTES}} seems like a lot of bytes even though it's written only 
once per partition, and b) we should be able to infer the clustering count from 
the schema, so writing it into the index might be redundant. I've tried naively 
discarding the value and replacing {{buffer.getInt}} with 
{{clusteringComparator.size()}}, and it seems to be fine (however, cases with 
static clusterings, static tables etc. should be double-checked).
  * I might be mistaken, but we can infer whether or not to read an entire 
partition from the query itself rather than going through all clusterings like 
[here|https://github.com/jrwest/cassandra/compare/deec448d8c1adee8663281708dc5e0060d0fa250...5d63aae5ffba7569811fad6a818163fcf3d26c6f#diff-df913f108e1290500a9afb8cd4638189R114].
  * I've tried ingesting a relatively small SSTable (~300MB); old SASI didn't 
seem to have any issues, while the new version struggles with GC. I'd have to 
do a lot more testing and verification to say what the cause might be.
  * it'd be great to have at least a rough description of the index file 
format. We didn't have one before, I know, but if we could write one while 
we're at it, it might be very useful for future work.
  * given we're fetching clusterings (and can therefore say which rows are the 
"same" across partitions), and data reconciliation is done by a controller that 
goes through the "normal" read path, do we need to re-filter the data in the 
case of non-static clusterings?
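
To illustrate the schema-inference point from the second bullet, a minimal 
sketch (illustrative names only, not the actual SASI classes) of replacing the 
persisted per-partition count with the comparator's size:

```java
import java.nio.ByteBuffer;

public class ClusteringCountSketch {
    // Stand-in for ClusteringComparator.size(): the number of clustering
    // columns is fixed by the table schema.
    static final int CLUSTERING_COLUMNS = 2;

    // Old path: the count is read back from the index file
    // (Integer.BYTES per partition).
    static int countFromIndex(ByteBuffer buffer) {
        return buffer.getInt();
    }

    // Proposed path: the count comes from the schema; nothing is read.
    static int countFromSchema() {
        return CLUSTERING_COLUMNS;
    }

    public static void main(String[] args) {
        ByteBuffer index = ByteBuffer.allocate(Integer.BYTES).putInt(CLUSTERING_COLUMNS);
        index.flip();
        // Both paths agree for regular tables; static columns / static rows
        // would need to be double-checked, as noted above.
        System.out.println(countFromIndex(index) == countFromSchema()); // true
    }
}
```

The caveat stands: for tables where the row layout diverges from the comparator 
(static rows, static tables), the two paths may not agree.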

Could you also describe the motivation for embedding rows rather than row 
offsets into the index? I thought it could improve some cases of range 
iteration and eliminate additional filtering, but since we're already seeking 
into the data file to materialize the key, we could fetch the clusterings from 
there as well. 



> Address rows rather than partitions in SASI
> ---
>
> Key: CASSANDRA-11990
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11990
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL, sasi
>Reporter: Alex Petrov
>Assignee: Jordan West
>Priority: Major
> Fix For: 4.x
>
> Attachments: perf.pdf, size_comparison.png
>
>
> Currently, the lookup in SASI index would return the key position of the 
> partition. After the partition lookup, the rows are iterated and the 
> operators are applied in order to filter out ones that do not match.
> bq. TokenTree which accepts variable size keys (such would enable different 
> partitioners, collections support, primary key indexing etc.), 

[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI

2016-08-08 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412775#comment-15412775
 ] 

Pavel Yaskevich edited comment on CASSANDRA-11990 at 8/9/16 1:12 AM:
-

I looked at the code again, and I think we might be better off keeping only 
Murmur3Partitioner (tokens as longs) for now instead of changing so much just 
to support RP...
That would also make the update path much easier for us: a. add clustering 
positions, b. extend tokens to variable size, instead of going through an 
intermediate format of multi-size fixed tokens...

*Code remarks:*

- https://github.com/apache/cassandra/compare/trunk...ifesdjeen:11990-trunk#diff-2143dfad950ab134c0a25bd903c2875aR143
   what is the point of the {{logger.info}} if it's still wrapped in 
{{logger.isDebugEnabled()}}?
- Can you move the drop-data and rebuild related changes to a separate branch, 
so we can keep the token tree changes and the bug fixes apart, since they are 
separate tickets?
- https://github.com/apache/cassandra/compare/trunk...ifesdjeen:11990-trunk#diff-afe204f22033543e3ae0185240523a9bR113
   I'm not sure what this `// ignore` is about...
- SasiToken -> SASIToken, since SASI is an abbreviation already.
- Back to the point about not supporting RP: the way 
TokenTreeSerializationHelper is currently done is not optimal, since it has to 
be explicitly carried around, which means it's most likely not a proper 
abstraction to have; higher levels like 
QueryPlan/Operation/Expression/ColumnIndex should never care about any details 
of the TokenTree, especially about how its tokens are serialized. We should 
either use the IPartitioner interface or nothing at all.
- An upgrade test is missing for TokenTree.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI

2016-08-05 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408322#comment-15408322
 ] 

Alex Petrov edited comment on CASSANDRA-11990 at 8/5/16 1:29 PM:
-

I've tested the random partitioner and it (mostly) works, although as suggested 
I'd highly advise bringing in support for different partitioners in the next 
patch. The reason for leaving the random partitioner out is that the test suite 
would require quite some refactoring, and the current patch is already quite 
big. I've implemented a serialization helper that abstracts all the logic 
required for fixed-size token reading. For variable-size token reading we'll 
have to invest some more work and implement skipping in some reasonable way.

I've implemented support for reading from the old format (ab). To make the 
testing more complete, I'm going to work on the upgrade tests for SASI. Also, 
dtests are missing for SASI, so I'll create a ticket for that.
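
The serialization-helper split can be sketched roughly as follows (interface 
and names are hypothetical, not the actual helper): fixed-size tokens get 
trivial random access, while a variable-size implementation would need to skip 
by per-token length instead.

```java
import java.nio.ByteBuffer;

// Hypothetical abstraction over token reading; the fixed-size case is the
// one implemented here, as described in the comment above.
interface TokenReader {
    long tokenAt(ByteBuffer tree, int index);   // random access into the tree
    int entrySize();                            // bytes consumed per token
}

// Fixed-size tokens (e.g. Murmur3 longs): offset is a simple multiplication.
class FixedSizeTokenReader implements TokenReader {
    public long tokenAt(ByteBuffer tree, int index) {
        return tree.getLong(index * entrySize());
    }
    public int entrySize() { return Long.BYTES; }
}

public class TokenReaderSketch {
    public static void main(String[] args) {
        ByteBuffer tree = ByteBuffer.allocate(3 * Long.BYTES);
        for (long t : new long[] {-42L, 0L, 42L}) tree.putLong(t);
        TokenReader reader = new FixedSizeTokenReader();
        System.out.println(reader.tokenAt(tree, 2)); // 42
    }
}
```

A variable-size implementation could not satisfy {{entrySize()}} with a 
constant, which is where the extra skipping work comes in.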

|[trunk|https://github.com/ifesdjeen/cassandra/tree/11990-trunk]|[testall|https://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-11182-trunk-testall/]|[dtest|https://cassci.datastax.com/view/Dev/view/ifesdjeen/job/ifesdjeen-11182-trunk-dtest/]|

I've created [CASSANDRA-12389] and [CASSANDRA-12390] to track the follow-up 
work on supporting different partitioners.





[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI

2016-08-01 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399120#comment-15399120
 ] 

Alex Petrov edited comment on CASSANDRA-11990 at 8/1/16 6:24 PM:
-

During several discussions with [~xedin] we came up with the idea of evaluating 
support for different partitioners, since it'd help wider SASI adoption and 
remove the current limitation to Long tokens. I've evaluated it and can 
conclude that support for constant-size tokens can be included in this patch 
without large overhead; the patch was adjusted accordingly. There are still 
several failing tests, although they'll be fixed shortly. 

Support for variable-size tokens (for partitioners such as 
{{ByteOrderedPartitioner}}) requires a much larger time investment. My personal 
suggestion is to encode them with their size and avoid on-disk format changes. 
This will result in a more complex iteration process for variable-size tokens, 
since we'll have to skip tokens depending on their size and won't be able to 
use simple multiplication for offset calculation. I've made a small patch / 
proof of concept for variable-size tokens by adding a {{serializedSize}} method 
to the token tree nodes; currently (for the sake of the POC, and to save some 
time) it's done by reusing the {{serialize}} function with a throwaway byte 
buffer, and offsets are calculated by iterating and reading the integers 
holding the token size. It worked just fine for simple cases. I'll mention that 
the SASI code is written very well and the offset calculation methods are well 
isolated. 

That said, I'd suggest leaving the "algorithmic" heavy lifting (variable-size 
token offset calculation) for a separate ticket to reduce the scope of the 
current one. Since it's not going to require on-disk format changes, we can 
safely postpone this work. 
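
A rough reconstruction of the proof of concept described above, under the 
assumption of a simple length-prefixed entry layout (all names illustrative, 
not the actual token tree code):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class VariableSizeTokenSketch {
    // Assumed entry layout: [int length][token bytes].
    static void serialize(byte[] token, DataOutputStream out) throws IOException {
        out.writeInt(token.length);
        out.write(token);
    }

    // POC-style serializedSize: reuse serialize() against a throwaway buffer.
    static int serializedSize(byte[] token) {
        ByteArrayOutputStream throwaway = new ByteArrayOutputStream();
        try {
            serialize(token, new DataOutputStream(throwaway));
        } catch (IOException e) {
            throw new AssertionError(e); // in-memory buffer cannot fail
        }
        return throwaway.size();
    }

    // No simple multiplication: to reach entry N we must walk and skip
    // every preceding entry by its length prefix.
    static byte[] tokenAt(ByteBuffer tree, int index) {
        tree.rewind();
        for (int i = 0; i < index; i++) {
            int length = tree.getInt();               // read the size prefix
            tree.position(tree.position() + length);  // skip the payload
        }
        byte[] token = new byte[tree.getInt()];
        tree.get(token);
        return token;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(out);
        for (String token : new String[] {"a", "bbb", "cc"})
            serialize(token.getBytes(), dos);
        ByteBuffer tree = ByteBuffer.wrap(out.toByteArray());
        System.out.println(new String(tokenAt(tree, 1))); // bbb
    }
}
```

The walk makes random access linear in the number of preceding tokens, which is 
exactly why skipping needs a more reasonable scheme before this work lands.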


Another thing that's been mentioned was is to include the column offset into 
clustering offset long. I'll be evaluating this proposal in terms of 
performance today. It seems that we can avoid increasing the size of {{long[]}} 
array that hold offsets and this change can help to avoid post-filtering 
alltogether. Additional optimisation (which, once again, could be left for the 
follow-up patch) is to avoid the second seek within the data file for cases 
when we are only querying columns that are indexed. This can be a significant 
performance improvement, although it'd be good to discuss whether such queries 
are widely used.
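
The packing idea could look roughly like this; the 48/16 bit split is an 
assumption for illustration, not the actual SASI layout:

```java
// Sketch: two offsets packed into one long so the long[] offset arrays
// need not grow. Bit widths are illustrative assumptions.
public class PackedOffsetSketch {
    static final int ROW_BITS = 16;
    static final long ROW_MASK = (1L << ROW_BITS) - 1;

    static long pack(long partitionOffset, long rowOffset) {
        assert rowOffset <= ROW_MASK : "row offset exceeds reserved bits";
        return (partitionOffset << ROW_BITS) | rowOffset;
    }

    static long partitionOffset(long packed) { return packed >>> ROW_BITS; }
    static long rowOffset(long packed)       { return packed & ROW_MASK; }

    public static void main(String[] args) {
        long packed = pack(123456L, 42L);
        System.out.println(partitionOffset(packed)); // 123456
        System.out.println(rowOffset(packed));       // 42
    }
}
```

The trade-off is a hard cap on whichever offset gets the narrower field, so the 
bit budget would need to match real data file sizes.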

cc [~slebresne] [~iamaleksey] [~jbellis] [~beobal]



[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI

2016-07-26 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393773#comment-15393773
 ] 

Alex Petrov edited comment on CASSANDRA-11990 at 7/26/16 1:20 PM:
--

Attached is the size comparison graph between trunk and this branch. 

Smaller is better. As with the performance test, I'm inserting 100 partitions, 
containing 10, 100 and 1000 clusterings, with different-sized partition keys.

cc [~xedin]





[jira] [Comment Edited] (CASSANDRA-11990) Address rows rather than partitions in SASI

2016-06-26 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350152#comment-15350152
 ] 

Alex Petrov edited comment on CASSANDRA-11990 at 6/26/16 3:21 PM:
--

I've done a small investigation of what it'd take and talked to several people 
about potential scenarios. First of all, I'd note that it'll be a rather big 
change, which will require a new format for writing the SASI {{TokenTree}}. 
I'll list several steps that would need to be taken: 

  * tl;dr version: we have to extend the TokenTree to fit a row offset along 
with the partition key. More elaborate version: currently SASI is a highly 
optimised tree that encodes {{long token}}/{{short + int}} entries. Since the 
max offset size does not exceed 48 bits, further optimisations are involved; it 
takes several optimisation steps to improve read performance and storage 
overhead. A short description of the current format can be found 
[here|https://gist.github.com/ifesdjeen/0436faf9a66b401ace0ad0947d256317]. 
Since we'll have to hold two offsets (partition and row offset; the partition 
offset is required to read the partition key, static rows etc.), in the first 
step, for the proof of concept, we'll reduce the distinctions to the simpler 
cases (single and multiple offsets). The rest of the possible optimisation 
combinations (the most obvious being when both items fit into a single long) 
can follow, possibly adding more distinctions if they're flexibly 
skippable/addressable. 
  * currently we can only read the decorated partition key, so we need to 
extend the storage to address a single row as well
  * we need to extend the TokenTree to support other partitioners (whether or 
not it's going to be done in the scope of this ticket, we'll have to make sure 
we're not making it harder to extend it this way)
  * there might be a need to store an order-preserving hash of the clustering 
keys for queries where a row is split across multiple SSTables, although I have 
to gather more information on that one, as we might be able to resolve rows 
after reading them from the SSTables
  * we'll need to find migration/upgrade paths from the current format, which 
may involve re-indexing (and failing queries while the upgrade is in progress) 
or supporting two format versions at read time, so that reads from the old 
format work while indexes are rebuilt
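
The single-vs-multiple-offsets distinction from the first step might be 
sketched as follows (hypothetical names and layout, not the actual TokenTree 
structures):

```java
import java.util.Arrays;
import java.util.List;

public class TokenEntrySketch {
    // Partition offset locates the partition key (and static rows); row
    // offset locates the matching row inside that partition.
    static class OffsetPair {
        final long partitionOffset, rowOffset;
        OffsetPair(long p, long r) { partitionOffset = p; rowOffset = r; }
    }

    // A token maps either to a single pair (stored inline) or to a block of
    // pairs; further packing (both offsets in one long) is a later
    // optimisation, per the tl;dr bullet above.
    static class Entry {
        final List<OffsetPair> offsets;
        Entry(OffsetPair... pairs) { offsets = Arrays.asList(pairs); }
        boolean isSingle() { return offsets.size() == 1; }
    }

    public static void main(String[] args) {
        Entry single = new Entry(new OffsetPair(1024, 96));
        Entry multi = new Entry(new OffsetPair(1024, 96),
                                new OffsetPair(4096, 0));
        System.out.println(single.isSingle()); // true
        System.out.println(multi.isSingle());  // false
    }
}
```

On disk the two cases would of course be discriminated by a flag rather than a 
list, but the read path keeps the same shape.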

cc [~xedin] [~beobal] [~jrwest] 


