Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
Yes, I agree. I would say it cannot skip those cells because it doesn’t check 
the max timestamp of the cells of the sstable and therefore scans them one by 
one.

Hannu
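
As a rough illustration of the cell-by-cell scan described above, here is a minimal Python sketch. It is not Cassandra code: the function `scan_asc`, the `(key, timestamp)` row model, and the rule that a row is shadowed only when its timestamp is strictly lower than the tombstone's are all assumptions made for the example.

```python
def scan_asc(rows, tombstone_end, tombstone_ts, limit):
    """Scan rows in ascending key order, checking each one against the tombstone.

    rows: list of (key, timestamp) tuples sorted by key (hypothetical model).
    tombstone_end: exclusive upper bound of the deleted range (timeid < y).
    tombstone_ts: write timestamp of the range tombstone.
    Returns (live_keys, rows_examined).
    """
    live = []
    examined = 0
    for key, ts in rows:
        examined += 1
        # A row is shadowed only if it is inside the range AND older than
        # the tombstone, so every row has to be looked at individually.
        shadowed = key < tombstone_end and ts < tombstone_ts
        if not shadowed:
            live.append(key)
            if len(live) == limit:
                break
    return live, examined


if __name__ == "__main__":
    rows = [(k, 1) for k in range(100)]
    print(scan_asc(rows, tombstone_end=50, tombstone_ts=2, limit=1))
```

With half of the partition deleted, the sketch has to examine every deleted row before it reaches the first live one, which matches the slow ASC read reported in this thread.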
 
> On 16 May 2017, at 19:48, Stefano Ortolani  wrote:
> 
> But it should skip those records since they are sorted. My understanding 
> would be something like:
> 
> 1) read sstable 2
> 2) read the range tombstone
> 3) skip records from sstable2 and sstable1 within the range boundaries
> 4) read remaining records from sstable1
> 5) no records, return
> 
> On Tue, May 16, 2017 at 5:43 PM, Hannu Kröger wrote:
> This is a bit of guessing but it probably reads sstables in some sort of 
> sequence, so even if sstable 2 contains the tombstone, it still scans through 
> the sstable 1 for possible data to be read.
> 
> BR,
> Hannu
> 
>> On 16 May 2017, at 19:40, Stefano Ortolani wrote:
>> 
>> Little update: also the following query timeouts, which is weird since the 
>> range tombstone should have been read by then...
>> 
>> SELECT * 
>> FROM test_cql.test_cf 
>> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
>> AND timeid < the_oldest_deleted_timeid
>> ORDER BY timeid DESC;
>> 
>> 
>> 
>> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani wrote:
>> Yes, that was my intention but I wanted to cross-check with the ML and the 
>> devs keeping an eye on it first.
>> 
>> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger wrote:
>> Well,
>> 
>> sstables contain some statistics about the cell timestamps and using that 
>> information and the tombstone timestamp it might be possible to skip some 
>> data but I’m not sure that Cassandra currently does that. Maybe it would be 
>> worth a JIRA ticket and see what the devs think about it. If optimizing this 
>> case would make sense.
>> 
>> Hannu
>> 
>>> On 16 May 2017, at 18:03, Stefano Ortolani wrote:
>>> 
>>> Hi Hannu,
>>> 
>>> the piece of data in question is older. In my example the tombstone is the 
>>> newest piece of data.
>>> Since a range tombstone has information re the clustering key ranges, and 
>>> the data is clustering key sorted, I would expect a linear scan not to be 
>>> necessary.
>>> 
>>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger wrote:
>>> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip 
>>> bigger regions of deleted data based on range tombstone. If some piece of 
>>> data in a partition is newer than the tombstone, then it cannot be skipped. 
>>> Therefore some partition level statistics of cell ages would need to be 
>>> kept in the column index for the skipping and that is probably not there.
>>> 
>>> Hannu 
>>> 
>>>> On 16 May 2017, at 17:33, Stefano Ortolani wrote:
>>>>
>>>> That is another way to see the question: are reverse iterators range 
>>>> tombstone aware? Yes.
>>>> That is why I am puzzled by this afore-mentioned behavior. 
>>>> I would expect them to handle this case more gracefully.
>>>>
>>>> Cheers,
>>>> Stefano
>>>>
>>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth wrote:
>>>> Hannu,
>>>>
>>>> How can you read a partition in reverse?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger wrote:
>>>> >
>>>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>>>> > tombstone is useful for this or not.
>>>> >
>>>> > In many cases it might be that the partition contains data that is 
>>>> > within the range of the tombstone but is newer than the tombstone and 
>>>> > therefore it might be still be returned. Scanning through deleted data 
>>>> > can be avoided by reading the partition in reverse (if all the deleted 
>>>> > data is in the beginning of the partition). Eventually you will still 
>>>> > end up reading a lot of tombstones but you will get a lot of live data 
>>>> > first and the implicit query limit of 1 probably is reached before 
>>>> > you get to the tombstones. Therefore you will get an immediate answer.
>>>> >
>>>> > Does it make sense?
>>>> >
>>>> > Hannu
>>>> >
>>>> >> On 16 May 2017, at 16:33, Stefano Ortolani wrote:
>>>> >>
>>>> >> Hi all,
>>>> >>
>>>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>>>> >> partitions, and reverse iterators.
>>>> >> I still have to understand if the behaviour is to be expected hence the 
>>>> >> message on the mailing list.
>>>> >>
>>>> >> The situation is conceptually simple. I am using a table defined as 
>>>> >> follows:
>>>> >>
>>>> >> CREATE TABLE 

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
But it should skip those records since they are sorted. My understanding
would be something like:

1) read sstable 2
2) read the range tombstone
3) skip records from sstable2 and sstable1 within the range boundaries
4) read remaining records from sstable1
5) no records, return
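
The read path expected in the steps above can be sketched in Python as follows. This is only a model of the expectation, not of what Cassandra actually does: `read_with_range_skip` and the list-of-sorted-keys representation of an sstable are hypothetical, and the sketch assumes every row inside the tombstone range is older than the tombstone (the situation in this example), which is exactly the condition that cannot be taken for granted in general.

```python
import heapq
from bisect import bisect_left
from itertools import islice


def read_with_range_skip(sstables, tombstone_end, limit):
    """Return up to `limit` live keys in ascending order.

    sstables: list of key lists, each sorted ascending (hypothetical model).
    tombstone_end: exclusive upper bound of the deleted range (timeid < y).
    Assumes all rows below tombstone_end are older than the tombstone.
    """
    tails = []
    for keys in sstables:
        # Because keys are sorted, jump straight to the first key that is
        # outside the deleted range instead of scanning the deleted rows.
        first_live = bisect_left(keys, tombstone_end)
        tails.append(keys[first_live:])
    # Merge the remaining sorted runs and stop as soon as the limit is hit.
    return list(islice(heapq.merge(*tails), limit))


if __name__ == "__main__":
    sstable1 = list(range(0, 100, 2))   # even keys
    sstable2 = list(range(1, 100, 2))   # odd keys
    print(read_with_range_skip([sstable1, sstable2], tombstone_end=50, limit=3))
```

The binary search is what makes the deleted half of the partition cheap to skip; the replies in this thread suggest the real read path does not have enough information to do this safely.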

On Tue, May 16, 2017 at 5:43 PM, Hannu Kröger  wrote:

> This is a bit of guessing but it probably reads sstables in some sort of
> sequence, so even if sstable 2 contains the tombstone, it still scans
> through the sstable 1 for possible data to be read.
>
> BR,
> Hannu
>
> On 16 May 2017, at 19:40, Stefano Ortolani  wrote:
>
> Little update: also the following query timeouts, which is weird since the
> range tombstone should have been read by then...
>
> SELECT *
> FROM test_cql.test_cf
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> AND timeid < the_oldest_deleted_timeid
> ORDER BY timeid DESC;
>
>
>
> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani 
> wrote:
>
>> Yes, that was my intention but I wanted to cross-check with the ML and
>> the devs keeping an eye on it first.
>>
>> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger  wrote:
>>
>>> Well,
>>>
>>> sstables contain some statistics about the cell timestamps and using
>>> that information and the tombstone timestamp it might be possible to skip
>>> some data but I’m not sure that Cassandra currently does that. Maybe it
>>> would be worth a JIRA ticket and see what the devs think about it. If
>>> optimizing this case would make sense.
>>>
>>> Hannu
>>>
>>> On 16 May 2017, at 18:03, Stefano Ortolani  wrote:
>>>
>>> Hi Hannu,
>>>
>>> the piece of data in question is older. In my example the tombstone is
>>> the newest piece of data.
>>> Since a range tombstone has information re the clustering key ranges,
>>> and the data is clustering key sorted, I would expect a linear scan not to
>>> be necessary.
>>>
>>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  wrote:
>>>
>>>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>>>> skip bigger regions of deleted data based on range tombstone. If some piece
>>>> of data in a partition is newer than the tombstone, then it cannot be
>>>> skipped. Therefore some partition level statistics of cell ages would need
>>>> to be kept in the column index for the skipping and that is probably not
>>>> there.
>>>>
>>>> Hannu
>>>>
>>>> On 16 May 2017, at 17:33, Stefano Ortolani wrote:
>>>>
>>>> That is another way to see the question: are reverse iterators range
>>>> tombstone aware? Yes.
>>>> That is why I am puzzled by this afore-mentioned behavior.
>>>> I would expect them to handle this case more gracefully.
>>>>
>>>> Cheers,
>>>> Stefano
>>>>
>>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth wrote:

> Hannu,
>
> How can you read a partition in reverse?
>
> Sent from my iPhone
>
> > On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
> >
> > Well, I’m guessing that Cassandra doesn't really know if the range
> tombstone is useful for this or not.
> >
> > In many cases it might be that the partition contains data that is
> within the range of the tombstone but is newer than the tombstone and
> therefore it might be still be returned. Scanning through deleted data can
> be avoided by reading the partition in reverse (if all the deleted data is
> in the beginning of the partition). Eventually you will still end up
> reading a lot of tombstones but you will get a lot of live data first and
> the implicit query limit of 1 probably is reached before you get to 
> the
> tombstones. Therefore you will get an immediate answer.
> >
> > Does it make sense?
> >
> > Hannu
> >
> >> On 16 May 2017, at 16:33, Stefano Ortolani 
> wrote:
> >>
> >> Hi all,
> >>
> >> I am seeing inconsistencies when mixing range tombstones, wide
> partitions, and reverse iterators.
> >> I still have to understand if the behaviour is to be expected hence
> the message on the mailing list.
> >>
> >> The situation is conceptually simple. I am using a table defined as
> follows:
> >>
> >> CREATE TABLE test_cql.test_cf (
> >>  hash blob,
> >>  timeid timeuuid,
> >>  PRIMARY KEY (hash, timeid)
> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
> >>
> >> I then proceed by loading 2/3GB from 3 sstables which I know
> contain a really wide partition (> 512 MB) for `hash = x`. I then delete
> the oldest _half_ of that partition by executing the query below, and
> restart the node:
> >>
> >> DELETE
> >> FROM test_cql.test_cf
> >> WHERE hash = x AND timeid < y;
> >>
> >> If I keep 

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
This is a bit of a guess, but it probably reads sstables in some sort of 
sequence, so even if sstable 2 contains the tombstone, it still scans through 
sstable 1 for possible data to be read.

BR,
Hannu

> On 16 May 2017, at 19:40, Stefano Ortolani  wrote:
> 
> Little update: also the following query timeouts, which is weird since the 
> range tombstone should have been read by then...
> 
> SELECT * 
> FROM test_cql.test_cf 
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
> AND timeid < the_oldest_deleted_timeid
> ORDER BY timeid DESC;
> 
> 
> 
> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani wrote:
> Yes, that was my intention but I wanted to cross-check with the ML and the 
> devs keeping an eye on it first.
> 
> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger wrote:
> Well,
> 
> sstables contain some statistics about the cell timestamps and using that 
> information and the tombstone timestamp it might be possible to skip some 
> data but I’m not sure that Cassandra currently does that. Maybe it would be 
> worth a JIRA ticket and see what the devs think about it. If optimizing this 
> case would make sense.
> 
> Hannu
> 
>> On 16 May 2017, at 18:03, Stefano Ortolani wrote:
>> 
>> Hi Hannu,
>> 
>> the piece of data in question is older. In my example the tombstone is the 
>> newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and 
>> the data is clustering key sorted, I would expect a linear scan not to be 
>> necessary.
>> 
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger wrote:
>> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip 
>> bigger regions of deleted data based on range tombstone. If some piece of 
>> data in a partition is newer than the tombstone, then it cannot be skipped. 
>> Therefore some partition level statistics of cell ages would need to be kept 
>> in the column index for the skipping and that is probably not there.
>> 
>> Hannu 
>> 
>>> On 16 May 2017, at 17:33, Stefano Ortolani wrote:
>>> 
>>> That is another way to see the question: are reverse iterators range 
>>> tombstone aware? Yes.
>>> That is why I am puzzled by this afore-mentioned behavior. 
>>> I would expect them to handle this case more gracefully.
>>> 
>>> Cheers,
>>> Stefano
>>> 
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth wrote:
>>> Hannu,
>>> 
>>> How can you read a partition in reverse?
>>> 
>>> Sent from my iPhone
>>> 
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>>> > tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is within 
>>> > the range of the tombstone but is newer than the tombstone and therefore 
>>> > it might be still be returned. Scanning through deleted data can be 
>>> > avoided by reading the partition in reverse (if all the deleted data is 
>>> > in the beginning of the partition). Eventually you will still end up 
>>> > reading a lot of tombstones but you will get a lot of live data first and 
>>> > the implicit query limit of 1 probably is reached before you get to 
>>> > the tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>>> >> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence the 
>>> >> message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as 
>>> >> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the 
>>> >> oldest _half_ of that partition by executing the query below, and 
>>> >> restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query timeouts (takes more 
>>> >> than 10 seconds to
>>> >> succeed):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> 

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Little update: the following query also times out, which is odd since the
range tombstone should have been read by then...

SELECT *
FROM test_cql.test_cf
WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
AND timeid < the_oldest_deleted_timeid
ORDER BY timeid DESC;



On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani 
wrote:

> Yes, that was my intention but I wanted to cross-check with the ML and the
> devs keeping an eye on it first.
>
> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger  wrote:
>
>> Well,
>>
>> sstables contain some statistics about the cell timestamps and using that
>> information and the tombstone timestamp it might be possible to skip some
>> data but I’m not sure that Cassandra currently does that. Maybe it would be
>> worth a JIRA ticket and see what the devs think about it. If optimizing
>> this case would make sense.
>>
>> Hannu
>>
>> On 16 May 2017, at 18:03, Stefano Ortolani  wrote:
>>
>> Hi Hannu,
>>
>> the piece of data in question is older. In my example the tombstone is
>> the newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and
>> the data is clustering key sorted, I would expect a linear scan not to be
>> necessary.
>>
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  wrote:
>>
>>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>>> skip bigger regions of deleted data based on range tombstone. If some piece
>>> of data in a partition is newer than the tombstone, then it cannot be
>>> skipped. Therefore some partition level statistics of cell ages would need
>>> to be kept in the column index for the skipping and that is probably not
>>> there.
>>>
>>> Hannu
>>>
>>> On 16 May 2017, at 17:33, Stefano Ortolani  wrote:
>>>
>>> That is another way to see the question: are reverse iterators range
>>> tombstone aware? Yes.
>>> That is why I am puzzled by this afore-mentioned behavior.
>>> I would expect them to handle this case more gracefully.
>>>
>>> Cheers,
>>> Stefano
>>>
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth  wrote:
>>>
>>>> Hannu,
>>>>
>>>> How can you read a partition in reverse?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger wrote:
>>>> >
>>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>>> > tombstone is useful for this or not.
>>>> >
>>>> > In many cases it might be that the partition contains data that is
>>>> > within the range of the tombstone but is newer than the tombstone and
>>>> > therefore it might be still be returned. Scanning through deleted data can
>>>> > be avoided by reading the partition in reverse (if all the deleted data is
>>>> > in the beginning of the partition). Eventually you will still end up
>>>> > reading a lot of tombstones but you will get a lot of live data first and
>>>> > the implicit query limit of 1 probably is reached before you get to the
>>>> > tombstones. Therefore you will get an immediate answer.
>>>> >
>>>> > Does it make sense?
>>>> >
>>>> > Hannu
>>>> >
>>>> >> On 16 May 2017, at 16:33, Stefano Ortolani wrote:
>>>> >>
>>>> >> Hi all,
>>>> >>
>>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>>> >> partitions, and reverse iterators.
>>>> >> I still have to understand if the behaviour is to be expected hence
>>>> >> the message on the mailing list.
>>>> >>
>>>> >> The situation is conceptually simple. I am using a table defined as
>>>> >> follows:
>>>> >>
>>>> >> CREATE TABLE test_cql.test_cf (
>>>> >>  hash blob,
>>>> >>  timeid timeuuid,
>>>> >>  PRIMARY KEY (hash, timeid)
>>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>>> >>
>>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain
>>>> >> a really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>>>> >> _half_ of that partition by executing the query below, and restart the
>>>> >> node:
>>>> >>
>>>> >> DELETE
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = x AND timeid < y;
>>>> >>
>>>> >> If I keep compactions disabled the following query timeouts (takes
>>>> >> more than 10 seconds to
>>>> >> succeed):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid ASC;
>>>> >>
>>>> >> While the following returns immediately (obviously because no
>>>> >> deleted data is ever read):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid DESC;
>>>> >>
>>>> >> If I force a compaction the problem is gone, but I presume just
>>>> >> because the data is rearranged.
>>>> >>
>>>> >> It seems to me that reading by ASC does not make use of the range
>>>> >> tombstone until C* reads the
>>>> >> last sstables (which actually 

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Yes, that was my intention but I wanted to cross-check with the ML and the
devs keeping an eye on it first.

On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger  wrote:

> Well,
>
> sstables contain some statistics about the cell timestamps and using that
> information and the tombstone timestamp it might be possible to skip some
> data but I’m not sure that Cassandra currently does that. Maybe it would be
> worth a JIRA ticket and see what the devs think about it. If optimizing
> this case would make sense.
>
> Hannu
>
> On 16 May 2017, at 18:03, Stefano Ortolani  wrote:
>
> Hi Hannu,
>
> the piece of data in question is older. In my example the tombstone is the
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and
> the data is clustering key sorted, I would expect a linear scan not to be
> necessary.
>
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  wrote:
>
>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>> skip bigger regions of deleted data based on range tombstone. If some piece
>> of data in a partition is newer than the tombstone, then it cannot be
>> skipped. Therefore some partition level statistics of cell ages would need
>> to be kept in the column index for the skipping and that is probably not
>> there.
>>
>> Hannu
>>
>> On 16 May 2017, at 17:33, Stefano Ortolani  wrote:
>>
>> That is another way to see the question: are reverse iterators range
>> tombstone aware? Yes.
>> That is why I am puzzled by this afore-mentioned behavior.
>> I would expect them to handle this case more gracefully.
>>
>> Cheers,
>> Stefano
>>
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth  wrote:
>>
>>> Hannu,
>>>
>>> How can you read a partition in reverse?
>>>
>>> Sent from my iPhone
>>>
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>> tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is
>>> within the range of the tombstone but is newer than the tombstone and
>>> therefore it might be still be returned. Scanning through deleted data can
>>> be avoided by reading the partition in reverse (if all the deleted data is
>>> in the beginning of the partition). Eventually you will still end up
>>> reading a lot of tombstones but you will get a lot of live data first and
>>> the implicit query limit of 1 probably is reached before you get to the
>>> tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani 
>>> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence
>>> the message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as
>>> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain
>>> a really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>>> _half_ of that partition by executing the query below, and restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query timeouts (takes
>>> more than 10 seconds to
>>> >> succeed):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid ASC;
>>> >>
>>> >> While the following returns immediately (obviously because no deleted
>>> data is ever read):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid DESC;
>>> >>
>>> >> If I force a compaction the problem is gone, but I presume just
>>> because the data is rearranged.
>>> >>
>>> >> It seems to me that reading by ASC does not make use of the range
>>> tombstone until C* reads the
>>> >> last sstables (which actually contains the range tombstone and is
>>> flushed at node restart), and it wastes time reading all rows that are
>>> actually not live anymore.
>>> >>
>>> >> Is this expected? Should the range tombstone actually help in these
>>> cases?
>>> >>
>>> >> Thanks a lot!
>>> >> Stefano
>>> >
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> > For additional commands, e-mail: 

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
Well,

sstables contain some statistics about the cell timestamps, and using that 
information together with the tombstone timestamp it might be possible to skip 
some data, but I’m not sure that Cassandra currently does that. Maybe it would 
be worth opening a JIRA ticket to see what the devs think about it and whether 
optimizing this case would make sense.

Hannu
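
The kind of statistics-based check suggested above could look roughly like the Python sketch below. Everything here is hypothetical: `SSTableStats`, `RangeTombstone` and `fully_shadowed` are names invented for the example, keys and timestamps are plain integers, and ties between a cell timestamp and the deletion time are treated as live, which is a conservative choice for the sketch rather than a statement about Cassandra's rules.

```python
from dataclasses import dataclass


@dataclass
class SSTableStats:
    """Per-sstable summary for one partition (hypothetical model)."""
    min_key: int        # smallest clustering key present
    max_key: int        # largest clustering key present
    max_timestamp: int  # newest cell timestamp present


@dataclass
class RangeTombstone:
    start: int          # inclusive lower bound of the deleted range
    end: int            # exclusive upper bound of the deleted range
    deletion_time: int  # write timestamp of the tombstone


def fully_shadowed(stats: SSTableStats, tombstone: RangeTombstone) -> bool:
    """True if every row summarised by `stats` is deleted by `tombstone`."""
    covers_keys = tombstone.start <= stats.min_key and stats.max_key < tombstone.end
    all_older = stats.max_timestamp < tombstone.deletion_time
    return covers_keys and all_older


if __name__ == "__main__":
    tombstone = RangeTombstone(start=0, end=20, deletion_time=9)
    print(fully_shadowed(SSTableStats(0, 10, 5), tombstone))
    print(fully_shadowed(SSTableStats(0, 10, 12), tombstone))
```

If the check returns true, the whole block could be skipped without looking at individual cells; if any cell might be newer than the tombstone, the reader has to fall back to the row-by-row comparison.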

> On 16 May 2017, at 18:03, Stefano Ortolani  wrote:
> 
> Hi Hannu,
> 
> the piece of data in question is older. In my example the tombstone is the 
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and the 
> data is clustering key sorted, I would expect a linear scan not to be 
> necessary.
> 
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger wrote:
> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip 
> bigger regions of deleted data based on range tombstone. If some piece of 
> data in a partition is newer than the tombstone, then it cannot be skipped. 
> Therefore some partition level statistics of cell ages would need to be kept 
> in the column index for the skipping and that is probably not there.
> 
> Hannu 
> 
>> On 16 May 2017, at 17:33, Stefano Ortolani wrote:
>> 
>> That is another way to see the question: are reverse iterators range 
>> tombstone aware? Yes.
>> That is why I am puzzled by this afore-mentioned behavior. 
>> I would expect them to handle this case more gracefully.
>> 
>> Cheers,
>> Stefano
>> 
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth wrote:
>> Hannu,
>> 
>> How can you read a partition in reverse?
>> 
>> Sent from my iPhone
>> 
>> > On May 16, 2017, at 9:20 AM, Hannu Kröger wrote:
>> >
>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>> > tombstone is useful for this or not.
>> >
>> > In many cases it might be that the partition contains data that is within 
>> > the range of the tombstone but is newer than the tombstone and therefore 
>> > it might be still be returned. Scanning through deleted data can be 
>> > avoided by reading the partition in reverse (if all the deleted data is in 
>> > the beginning of the partition). Eventually you will still end up reading 
>> > a lot of tombstones but you will get a lot of live data first and the 
>> > implicit query limit of 1 probably is reached before you get to the 
>> > tombstones. Therefore you will get an immediate answer.
>> >
>> > Does it make sense?
>> >
>> > Hannu
>> >
>> >> On 16 May 2017, at 16:33, Stefano Ortolani wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>> >> partitions, and reverse iterators.
>> >> I still have to understand if the behaviour is to be expected hence the 
>> >> message on the mailing list.
>> >>
>> >> The situation is conceptually simple. I am using a table defined as 
>> >> follows:
>> >>
>> >> CREATE TABLE test_cql.test_cf (
>> >>  hash blob,
>> >>  timeid timeuuid,
>> >>  PRIMARY KEY (hash, timeid)
>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>> >>
>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest 
>> >> _half_ of that partition by executing the query below, and restart the 
>> >> node:
>> >>
>> >> DELETE
>> >> FROM test_cql.test_cf
>> >> WHERE hash = x AND timeid < y;
>> >>
>> >> If I keep compactions disabled the following query timeouts (takes more 
>> >> than 10 seconds to
>> >> succeed):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid ASC;
>> >>
>> >> While the following returns immediately (obviously because no deleted 
>> >> data is ever read):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid DESC;
>> >>
>> >> If I force a compaction the problem is gone, but I presume just because 
>> >> the data is rearranged.
>> >>
>> >> It seems to me that reading by ASC does not make use of the range 
>> >> tombstone until C* reads the
>> >> last sstables (which actually contains the range tombstone and is flushed 
>> >> at node restart), and it wastes time reading all rows that are actually 
>> >> not live anymore.
>> >>
>> >> Is this expected? Should the range tombstone actually help in these cases?
>> >>
>> >> Thanks a lot!
>> >> Stefano
>> >
>> >

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Nitan Kainth
Thank you Stefano
> On May 16, 2017, at 10:56 AM, Stefano Ortolani  wrote:
> 
> No, because C* has reverse iterators.
> 
> On Tue, May 16, 2017 at 4:47 PM, Nitan Kainth wrote:
> If the data is stored in ASC order and the query asks for DESC, wouldn’t it 
> read the whole partition first and then pick the data in reverse order?
> 
> 
>> On May 16, 2017, at 10:03 AM, Stefano Ortolani wrote:
>> 
>> Hi Hannu,
>> 
>> the piece of data in question is older. In my example the tombstone is the 
>> newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and 
>> the data is clustering key sorted, I would expect a linear scan not to be 
>> necessary.
>> 
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger wrote:
>> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip 
>> bigger regions of deleted data based on range tombstone. If some piece of 
>> data in a partition is newer than the tombstone, then it cannot be skipped. 
>> Therefore some partition level statistics of cell ages would need to be kept 
>> in the column index for the skipping and that is probably not there.
>> 
>> Hannu 
>> 
>>> On 16 May 2017, at 17:33, Stefano Ortolani wrote:
>>> 
>>> That is another way to see the question: are reverse iterators range 
>>> tombstone aware? Yes.
>>> That is why I am puzzled by this afore-mentioned behavior. 
>>> I would expect them to handle this case more gracefully.
>>> 
>>> Cheers,
>>> Stefano
>>> 
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth wrote:
>>> Hannu,
>>> 
>>> How can you read a partition in reverse?
>>> 
>>> Sent from my iPhone
>>> 
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>>> > tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is within 
>>> > the range of the tombstone but is newer than the tombstone and therefore 
>>> > it might be still be returned. Scanning through deleted data can be 
>>> > avoided by reading the partition in reverse (if all the deleted data is 
>>> > in the beginning of the partition). Eventually you will still end up 
>>> > reading a lot of tombstones but you will get a lot of live data first and 
>>> > the implicit query limit of 1 probably is reached before you get to 
>>> > the tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>>> >> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence the 
>>> >> message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as 
>>> >> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the 
>>> >> oldest _half_ of that partition by executing the query below, and 
>>> >> restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query timeouts (takes more 
>>> >> than 10 seconds to
>>> >> succeed):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid ASC;
>>> >>
>>> >> While the following returns immediately (obviously because no deleted 
>>> >> data is ever read):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid DESC;
>>> >>
>>> >> If I force a compaction the problem is gone, but I presume just because 
>>> >> the data is rearranged.
>>> >>
>>> >> It seems to me that reading by ASC does not make use of the range 
>>> >> tombstone until C* reads the
>>> >> last sstables (which actually contains the range tombstone and is 
>>> >> flushed at node restart), and it wastes time reading all rows that are 
>>> >> actually not live anymore.
>>> >>
>>> >> Is this expected? Should the range tombstone actually help in these 
>>> >> cases?
>>> >>
>>> >> Thanks a lot!
>>> >> Stefano
>>> >
>>> >
>>> > 

Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
No, because C* has reverse iterators.

On Tue, May 16, 2017 at 4:47 PM, Nitan Kainth  wrote:

> If the data is stored in ASC order and query asks for DESC, then wouldn’t
> it read whole partition in first and then pick data from reverse order?


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Nitan Kainth
If the data is stored in ASC order and the query asks for DESC, then wouldn’t
it read the whole partition first and then pick the data in reverse order?


> On May 16, 2017, at 10:03 AM, Stefano Ortolani  wrote:
> 
> Hi Hannu,
> 
> the piece of data in question is older. In my example the tombstone is the 
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and the 
> data is clustering key sorted, I would expect a linear scan not to be 
> necessary.



Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
Hi Hannu,

the piece of data in question is older. In my example the tombstone is the
newest piece of data.
Since a range tombstone has information re the clustering key ranges, and
the data is clustering key sorted, I would expect a linear scan not to be
necessary.
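
To make the expectation concrete, here is a minimal sketch of the jump a reader could make over sorted clustering keys. It is a toy model, not Cassandra's read path: integer keys stand in for timeuuids, and `first_index_past_tombstone` is a hypothetical helper. It is only valid if nothing inside the range is newer than the tombstone.

```python
from bisect import bisect_left
from typing import List


def first_index_past_tombstone(sorted_keys: List[int], tombstone_end: int) -> int:
    """Index of the first key not covered by a `timeid < tombstone_end` delete."""
    # Binary search: O(log n) instead of walking every deleted row.
    return bisect_left(sorted_keys, tombstone_end)


keys = list(range(0, 1000, 2))  # clustering keys, sorted ASC
start = first_index_past_tombstone(keys, tombstone_end=501)
print(start, keys[start])  # 251 502
```

An ASC read could then start at `keys[start]` rather than at the first row of the partition.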

On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger  wrote:

> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip
> bigger regions of deleted data based on range tombstone. If some piece of
> data in a partition is newer than the tombstone, then it cannot be skipped.
> Therefore some partition level statistics of cell ages would need to be
> kept in the column index for the skipping and that is probably not there.
>
> Hannu


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
Well, as mentioned, Cassandra probably doesn’t have the logic and data to skip
bigger regions of deleted data based on a range tombstone. If some piece of
data in a partition is newer than the tombstone, then it cannot be skipped.
Therefore some partition-level statistics of cell ages would need to be kept in
the column index for the skipping, and that is probably not there.
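
A rough sketch of what such a statistic would allow, purely as an illustration: `IndexBlock.max_write_ts` is a hypothetical field, not something the column index is known to store, and the rule that a tombstone wins a timestamp tie is an assumption of this model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexBlock:
    first_key: int     # smallest clustering key in the block
    last_key: int      # largest clustering key in the block
    max_write_ts: int  # hypothetical statistic: newest write timestamp in the block


def can_skip(block: IndexBlock, tombstone_end: int, tombstone_ts: int) -> bool:
    """True when a `timeid < tombstone_end` delete certainly covers the whole block."""
    fully_covered = block.last_key < tombstone_end
    nothing_newer = block.max_write_ts <= tombstone_ts
    return fully_covered and nothing_newer


blocks = [
    IndexBlock(first_key=0, last_key=99, max_write_ts=100),     # old data, fully covered
    IndexBlock(first_key=100, last_key=199, max_write_ts=300),  # covered, but has a newer write
    IndexBlock(first_key=200, last_key=299, max_write_ts=100),  # straddles the tombstone end
]
to_read = [b.first_key for b in blocks if not can_skip(b, tombstone_end=250, tombstone_ts=200)]
print(to_read)  # [100, 200]
```

Without the per-block timestamp, the second block cannot be told apart from the first, so both would have to be scanned.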

Hannu 

> On 16 May 2017, at 17:33, Stefano Ortolani  wrote:
> 
> That is another way to see the question: are reverse iterators range 
> tombstone aware? Yes.
> That is why I am puzzled by this afore-mentioned behavior. 
> I would expect them to handle this case more gracefully.
> 
> Cheers,
> Stefano



Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
Hello,

If you mean how to construct a query like that: you use an ORDER BY clause with
SELECT that is the reverse of the table's default, just like in the example
below. If the table is created with “CLUSTERING ORDER BY (timeid ASC)” and you
query “SELECT ... ORDER BY timeid DESC”, then the partition is read backwards.
I don’t know how it is technically done, but it is apparently slightly slower
than reading the partition normally.
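
One way to picture a backwards read without loading the whole partition first is the toy model below. It is only a sketch under assumptions: `BlockReader` is a hypothetical class, integers stand in for timeuuids, and nothing here is taken from Cassandra's actual implementation. Rows are stored ASC in indexed blocks, and a reverse read walks the blocks from last to first.

```python
from itertools import islice
from typing import Iterator, List


class BlockReader:
    """Toy on-disk partition: rows stored ASC, grouped into indexed blocks."""

    def __init__(self, keys: List[int], block_size: int) -> None:
        self.blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
        self.blocks_read = 0

    def rows(self, reverse: bool = False) -> Iterator[int]:
        order = reversed(self.blocks) if reverse else iter(self.blocks)
        for block in order:
            self.blocks_read += 1  # one block fetched via the index
            # Within a block, rows are walked back to front for a reverse read.
            yield from (reversed(block) if reverse else block)


reader = BlockReader(keys=list(range(100)), block_size=10)
newest_three = list(islice(reader.rows(reverse=True), 3))
print(newest_three, reader.blocks_read)  # [99, 98, 97] 1
```

In this model the extra cost of a reverse read is the per-block reversal, which would be consistent with it being slightly slower than a forward read.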

Hannu 

> On 16 May 2017, at 17:29, Nitan Kainth  wrote:
> 
> Hannu,
> 
> How can you read a partition in reverse? 
> 
> Sent from my iPhone


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Stefano Ortolani
That is another way to see the question: are reverse iterators range
tombstone aware? Yes.
That is why I am puzzled by this afore-mentioned behavior.
I would expect them to handle this case more gracefully.

Cheers,
Stefano

On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth  wrote:

> Hannu,
>
> How can you read a partition in reverse?
>
> Sent from my iPhone


Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Nitan Kainth
Hannu,

How can you read a partition in reverse? 

Sent from my iPhone

> On May 16, 2017, at 9:20 AM, Hannu Kröger  wrote:
> 
> Well, I’m guessing that Cassandra doesn't really know if the range tombstone 
> is useful for this or not. 
> 
> In many cases it might be that the partition contains data that is within the 
> range of the tombstone but is newer than the tombstone and therefore it might 
> be still be returned. Scanning through deleted data can be avoided by reading 
> the partition in reverse (if all the deleted data is in the beginning of the 
> partition). Eventually you will still end up reading a lot of tombstones but 
> you will get a lot of live data first and the implicit query limit of 1 
> probably is reached before you get to the tombstones. Therefore you will get 
> an immediate answer.
> 
> Does it make sense?
> 
> Hannu



Re: Range deletes, wide partitions, and reverse iterators

2017-05-16 Thread Hannu Kröger
Well, I’m guessing that Cassandra doesn't really know whether the range
tombstone is useful for this or not.

In many cases the partition may contain data that is within the range of the
tombstone but is newer than the tombstone, and therefore it might still be
returned. Scanning through deleted data can be avoided by reading the partition
in reverse (if all the deleted data is at the beginning of the partition).
Eventually you will still end up reading a lot of tombstones, but you will get
a lot of live data first and the implicit query limit of 1 is probably reached
before you get to the tombstones. Therefore you will get an immediate answer.
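
The reasoning above can be sketched with a small model. This is an illustration under assumptions, not Cassandra code: `Row`, `RangeTombstone`, `is_shadowed` and `scan` are hypothetical names, integers stand in for timeuuids, and the limit of 10 is arbitrary. The model also assumes the tombstone wins when timestamps are equal.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Row:
    timeid: int    # stand-in for the timeuuid clustering key
    write_ts: int  # write timestamp of the row


@dataclass(frozen=True)
class RangeTombstone:
    end: int        # deletes every row with timeid < end ...
    delete_ts: int  # ... that was written at or before this timestamp


def is_shadowed(row: Row, rt: RangeTombstone) -> bool:
    # Deleted only if inside the range AND not newer than the tombstone.
    return row.timeid < rt.end and row.write_ts <= rt.delete_ts


def scan(rows: Iterable[Row], rt: RangeTombstone, limit: int) -> Tuple[List[Row], int]:
    """Return (live rows, rows examined), stopping once `limit` live rows are found."""
    live: List[Row] = []
    examined = 0
    for row in rows:
        examined += 1
        if not is_shadowed(row, rt):
            live.append(row)
            if len(live) >= limit:
                break
    return live, examined


partition = [Row(timeid=i, write_ts=100) for i in range(1000)]  # sorted ASC
rt = RangeTombstone(end=500, delete_ts=200)                      # oldest half deleted

asc_live, asc_examined = scan(partition, rt, limit=10)
desc_live, desc_examined = scan(reversed(partition), rt, limit=10)
print(asc_examined, desc_examined)  # 510 10
```

A row such as `Row(timeid=3, write_ts=300)` falls inside the range but is newer than the tombstone, so `is_shadowed` returns `False` for it. That is why the range alone does not prove a region can be skipped.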

Does it make sense?

Hannu

> On 16 May 2017, at 16:33, Stefano Ortolani  wrote:
> 
> Hi all,
> 
> I am seeing inconsistencies when mixing range tombstones, wide partitions, 
> and reverse iterators.
> I still have to understand if the behaviour is to be expected hence the 
> message on the mailing list.
> 
> The situation is conceptually simple. I am using a table defined as follows:
> 
> CREATE TABLE test_cql.test_cf (
>   hash blob,
>   timeid timeuuid,
>   PRIMARY KEY (hash, timeid)
> ) WITH CLUSTERING ORDER BY (timeid ASC)
>   AND compaction = {'class' : 'LeveledCompactionStrategy'};
> 
> I then proceed by loading 2/3GB from 3 sstables which I know contain a really 
> wide partition (> 512 MB) for `hash = x`. I then delete the oldest _half_ of 
> that partition by executing the query below, and restart the node:
> 
> DELETE 
> FROM test_cql.test_cf 
> WHERE hash = x AND timeid < y;
> 
> If I keep compactions disabled the following query timeouts (takes more than 
> 10 seconds to 
> succeed):
> 
> SELECT * 
> FROM test_cql.test_cf 
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
> ORDER BY timeid ASC;
> 
> While the following returns immediately (obviously because no deleted data is 
> ever read):
> 
> SELECT * 
> FROM test_cql.test_cf 
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
> ORDER BY timeid DESC;
> 
> If I force a compaction the problem is gone, but I presume just because the 
> data is rearranged.
> 
> It seems to me that reading by ASC does not make use of the range tombstone 
> until C* reads the
> last sstables (which actually contains the range tombstone and is flushed at 
> node restart), and it wastes time reading all rows that are actually not live 
> anymore. 
> 
> Is this expected? Should the range tombstone actually help in these cases?
> 
> Thanks a lot!
> Stefano

