Re: Range deletes, wide partitions, and reverse iterators

Hannu Kröger Tue, 16 May 2017 11:05:23 -0700

Yes, I agree. I would say it cannot skip those cells because it doesn’t check 
the max timestamp of the cells of the sstable and therefore scans them one by 
one.
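
To make the idea concrete, here is a minimal sketch of the check I have in mind. It is a toy model, not Cassandra's code: the `SSTable` and `RangeTombstone` classes, the integer clustering keys, and the function names are all made up for illustration. It only shows that an sstable-level max timestamp older than the tombstone would allow jumping over the covered rows, while anything else forces a row-by-row check.

```python
import bisect
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RangeTombstone:
    start: int      # inclusive clustering bound (hypothetical integer key)
    end: int        # exclusive clustering bound
    timestamp: int  # write time of the deletion


@dataclass
class SSTable:
    rows: List[Tuple[int, int]]  # (clustering_key, write_timestamp), sorted by key
    max_timestamp: int           # newest write time of any cell in this sstable


def can_skip_covered_rows(sstable: SSTable, tombstone: RangeTombstone) -> bool:
    # If every cell is older than the tombstone, nothing inside the
    # tombstone's range can be live, so no per-row check is needed.
    return sstable.max_timestamp < tombstone.timestamp


def live_rows(sstable: SSTable, tombstone: RangeTombstone):
    """Return (live rows, number of covered rows examined one by one)."""
    # Building the key list is itself linear; a real column index would
    # make this lookup cheap. This is only a model of the decision.
    keys = [key for key, _ in sstable.rows]
    lo = bisect.bisect_left(keys, tombstone.start)
    hi = bisect.bisect_left(keys, tombstone.end)
    if can_skip_covered_rows(sstable, tombstone):
        # Jump straight over the covered slice.
        return sstable.rows[:lo] + sstable.rows[hi:], 0
    # Otherwise every covered row has to be compared with the tombstone.
    kept = [row for row in sstable.rows[lo:hi] if row[1] > tombstone.timestamp]
    return sstable.rows[:lo] + kept + sstable.rows[hi:], hi - lo
```

With a tombstone newer than everything in the sstable the covered rows are never touched; a single newer cell in the sstable is enough to force the full scan of the covered range.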


Hannu
 
> On 16 May 2017, at 19:48, Stefano Ortolani <ostef...@gmail.com> wrote:
> 
> But it should skip those records since they are sorted. My understanding 
> would be something like:
> 
> 1) read sstable 2
> 2) read the range tombstone
> 3) skip records from sstable2 and sstable1 within the range boundaries
> 4) read remaining records from sstable1
> 5) no records, return
> 
> On Tue, May 16, 2017 at 5:43 PM, Hannu Kröger <hkro...@gmail.com 
> <mailto:hkro...@gmail.com>> wrote:
> This is a bit of a guess, but it probably reads the sstables in some sort of 
> sequence, so even if sstable 2 contains the tombstone, it still scans through 
> sstable 1 for possible data to be read.
> 
> BR,
> Hannu
> 
>> On 16 May 2017, at 19:40, Stefano Ortolani <ostef...@gmail.com 
>> <mailto:ostef...@gmail.com>> wrote:
>> 
>> A small update: the following query also times out, which is odd since the 
>> range tombstone should have been read by then...
>> 
>> SELECT * 
>> FROM test_cql.test_cf 
>> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
>> AND timeid < the_oldest_deleted_timeid
>> ORDER BY timeid DESC;
>> 
>> 
>> 
>> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani <ostef...@gmail.com 
>> <mailto:ostef...@gmail.com>> wrote:
>> Yes, that was my intention but I wanted to cross-check with the ML and the 
>> devs keeping an eye on it first.
>> 
>> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger <hkro...@gmail.com 
>> <mailto:hkro...@gmail.com>> wrote:
>> Well,
>> 
>> sstables contain some statistics about the cell timestamps, and using that 
>> information together with the tombstone timestamp it might be possible to 
>> skip some data, but I’m not sure that Cassandra currently does that. It might 
>> be worth opening a JIRA ticket to see what the devs think and whether 
>> optimizing this case would make sense.
>> 
>> Hannu
>> 
>>> On 16 May 2017, at 18:03, Stefano Ortolani <ostef...@gmail.com 
>>> <mailto:ostef...@gmail.com>> wrote:
>>> 
>>> Hi Hannu,
>>> 
>>> the piece of data in question is older. In my example the tombstone is the 
>>> newest piece of data.
>>> Since a range tombstone carries the clustering key range it covers, and 
>>> the data is sorted by clustering key, I would expect a linear scan not to be 
>>> necessary.
>>> 
>>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com 
>>> <mailto:hkro...@gmail.com>> wrote:
>>> Well, as mentioned, Cassandra probably doesn’t have the logic and data to 
>>> skip bigger regions of deleted data based on a range tombstone. If some 
>>> piece of data in a partition is newer than the tombstone, then it cannot be 
>>> skipped. Therefore some partition-level statistics about cell ages would 
>>> need to be kept in the column index to allow the skipping, and those are 
>>> probably not there.
>>> 
>>> Hannu 
>>> 
>>>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com 
>>>> <mailto:ostef...@gmail.com>> wrote:
>>>> 
>>>> That is another way to put the question: are reverse iterators range 
>>>> tombstone aware? Yes.
>>>> That is why I am puzzled by the aforementioned behavior. 
>>>> I would expect them to handle this case more gracefully.
>>>> 
>>>> Cheers,
>>>> Stefano
>>>> 
>>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com 
>>>> <mailto:ni...@bamlabs.com>> wrote:
>>>> Hannu,
>>>> 
>>>> How can you read a partition in reverse?
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com 
>>>> > <mailto:hkro...@gmail.com>> wrote:
>>>> >
>>>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>>>> > tombstone is useful for this or not.
>>>> >
>>>> > In many cases the partition might contain data that is within the 
>>>> > range of the tombstone but newer than the tombstone, and that data 
>>>> > might therefore still be returned. Scanning through deleted data 
>>>> > can be avoided by reading the partition in reverse (if all the deleted 
>>>> > data is at the beginning of the partition). Eventually you will still 
>>>> > end up reading a lot of tombstones, but you will get a lot of live data 
>>>> > first, and the implicit query limit of 10000 is probably reached before 
>>>> > you get to the tombstones. Therefore you will get an immediate answer.
>>>> >
>>>> > Does it make sense?
>>>> >
>>>> > Hannu
>>>> >
>>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com 
>>>> >> <mailto:ostef...@gmail.com>> wrote:
>>>> >>
>>>> >> Hi all,
>>>> >>
>>>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>>>> >> partitions, and reverse iterators.
>>>> >> I have yet to work out whether this behaviour is expected, hence the 
>>>> >> message on the mailing list.
>>>> >>
>>>> >> The situation is conceptually simple. I am using a table defined as 
>>>> >> follows:
>>>> >>
>>>> >> CREATE TABLE test_cql.test_cf (
>>>> >>  hash blob,
>>>> >>  timeid timeuuid,
>>>> >>  PRIMARY KEY (hash, timeid)
>>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>>> >>
>>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>>>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the 
>>>> >> oldest _half_ of that partition by executing the query below, and 
>>>> >> restart the node:
>>>> >>
>>>> >> DELETE
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = x AND timeid < y;
>>>> >>
>>>> >> If I keep compactions disabled, the following query times out (it 
>>>> >> takes more than 10 seconds to complete):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid ASC;
>>>> >>
>>>> >> While the following returns immediately (obviously because no deleted 
>>>> >> data is ever read):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid DESC;
>>>> >>
>>>> >> If I force a compaction the problem is gone, but I presume just because 
>>>> >> the data is rearranged.
>>>> >>
>>>> >> It seems to me that reading in ASC order does not make use of the range 
>>>> >> tombstone until C* reads the last sstable (which actually contains the 
>>>> >> range tombstone and is flushed at node restart), and it wastes time 
>>>> >> reading all the rows that are no longer live.
>>>> >>
>>>> >> Is this expected? Should the range tombstone actually help in these 
>>>> >> cases?
>>>> >>
>>>> >> Thanks a lot!
>>>> >> Stefano
>>>> >
>>>> >
>>>> >
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
>
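
For reference, the read path Stefano expected earlier in the thread (read the range tombstone, skip the covered records because they are sorted, read what remains) can be sketched as below. This is a toy model under stated assumptions, not how Cassandra reads a partition: `read_partition` is a hypothetical helper, integers stand in for `timeid`, and the tombstone is assumed to be newer than every row it covers, as in the original example.

```python
import bisect
from typing import List, Optional, Tuple

Row = Tuple[int, str]  # (timeid, value); an integer stands in for the timeuuid


def read_partition(
    sstables: List[List[Row]],
    deleted_before: int,
    limit: Optional[int] = None,
    reverse: bool = False,
) -> List[Row]:
    """Return live rows of one partition spread over several sstables.

    `deleted_before` models `DELETE ... WHERE timeid < y`. Each sstable's
    rows are sorted by timeid, so the covered prefix is skipped with a
    binary search instead of a row-by-row scan.
    """
    live: List[Row] = []
    for rows in sstables:
        keys = [timeid for timeid, _ in rows]
        first_live = bisect.bisect_left(keys, deleted_before)
        live.extend(rows[first_live:])  # rows before first_live are deleted
    live.sort(reverse=reverse)  # ASC by default, DESC when reverse=True
    return live if limit is None else live[:limit]


sstable1 = [(t, "a") for t in range(0, 10)]   # older half of the partition
sstable2 = [(t, "b") for t in range(10, 20)]  # newer half of the partition
```

In this model `read_partition([sstable1, sstable2], deleted_before=10)` returns only the rows of `sstable2`, and a tombstone covering the whole partition yields an empty result, which corresponds to the "no records, return" step.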
