Yes, I agree. I would say it cannot skip those cells because it doesn’t check the max timestamp of the cells of the sstable and therefore scans them one by one.
Hannu

> On 16 May 2017, at 19:48, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> But it should skip those records since they are sorted. My understanding
> would be something like:
>
> 1) read sstable 2
> 2) read the range tombstone
> 3) skip records from sstable2 and sstable1 within the range boundaries
> 4) read remaining records from sstable1
> 5) no records, return
>
> On Tue, May 16, 2017 at 5:43 PM, Hannu Kröger <hkro...@gmail.com> wrote:
> This is a bit of guessing, but it probably reads sstables in some sort of
> sequence, so even if sstable 2 contains the tombstone, it still scans through
> sstable 1 for possible data to be read.
>
> BR,
> Hannu
>
>> On 16 May 2017, at 19:40, Stefano Ortolani <ostef...@gmail.com> wrote:
>>
>> Little update: the following query also times out, which is weird since the
>> range tombstone should have been read by then...
>>
>> SELECT *
>> FROM test_cql.test_cf
>> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> AND timeid < the_oldest_deleted_timeid
>> ORDER BY timeid DESC;
>>
>> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani <ostef...@gmail.com> wrote:
>> Yes, that was my intention but I wanted to cross-check with the ML and the
>> devs keeping an eye on it first.
>>
>> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>> Well,
>>
>> sstables contain some statistics about the cell timestamps and using that
>> information and the tombstone timestamp it might be possible to skip some
>> data, but I’m not sure that Cassandra currently does that. Maybe it would be
>> worth opening a JIRA ticket to see what the devs think about it and whether
>> optimizing this case would make sense.
>>
>> Hannu
>>
>>> On 16 May 2017, at 18:03, Stefano Ortolani <ostef...@gmail.com> wrote:
>>>
>>> Hi Hannu,
>>>
>>> the piece of data in question is older. In my example the tombstone is the
>>> newest piece of data.
>>> Since a range tombstone has information re the clustering key ranges, and
>>> the data is clustering key sorted, I would expect a linear scan not to be
>>> necessary.
>>>
>>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>>> Well, as mentioned, Cassandra probably doesn’t have the logic and data to
>>> skip bigger regions of deleted data based on a range tombstone. If some
>>> piece of data in a partition is newer than the tombstone, then it cannot be
>>> skipped. Therefore some partition-level statistics of cell ages would need
>>> to be kept in the column index for the skipping, and that is probably not
>>> there.
>>>
>>> Hannu
>>>
>>>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>>>
>>>> That is another way to see the question: are reverse iterators range
>>>> tombstone aware? Yes.
>>>> That is why I am puzzled by this afore-mentioned behavior.
>>>> I would expect them to handle this case more gracefully.
>>>>
>>>> Cheers,
>>>> Stefano
>>>>
>>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>>>> Hannu,
>>>>
>>>> How can you read a partition in reverse?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>>>> >
>>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>>> > tombstone is useful for this or not.
>>>> >
>>>> > In many cases it might be that the partition contains data that is
>>>> > within the range of the tombstone but is newer than the tombstone and
>>>> > therefore it might still be returned. Scanning through deleted data
>>>> > can be avoided by reading the partition in reverse (if all the deleted
>>>> > data is in the beginning of the partition). Eventually you will still
>>>> > end up reading a lot of tombstones but you will get a lot of live data
>>>> > first and the implicit query limit of 10000 probably is reached before
>>>> > you get to the tombstones. Therefore you will get an immediate answer.
>>>> >
>>>> > Does it make sense?
>>>> >
>>>> > Hannu
>>>> >
>>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>>> >>
>>>> >> Hi all,
>>>> >>
>>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>>> >> partitions, and reverse iterators.
>>>> >> I still have to understand if the behaviour is to be expected, hence the
>>>> >> message on the mailing list.
>>>> >>
>>>> >> The situation is conceptually simple. I am using a table defined as
>>>> >> follows:
>>>> >>
>>>> >> CREATE TABLE test_cql.test_cf (
>>>> >>   hash blob,
>>>> >>   timeid timeuuid,
>>>> >>   PRIMARY KEY (hash, timeid)
>>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>>> >>   AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>>> >>
>>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a
>>>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the
>>>> >> oldest _half_ of that partition by executing the query below, and
>>>> >> restart the node:
>>>> >>
>>>> >> DELETE
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = x AND timeid < y;
>>>> >>
>>>> >> If I keep compactions disabled the following query times out (takes more
>>>> >> than 10 seconds to succeed):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid ASC;
>>>> >>
>>>> >> While the following returns immediately (obviously because no deleted
>>>> >> data is ever read):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid DESC;
>>>> >>
>>>> >> If I force a compaction the problem is gone, but I presume just because
>>>> >> the data is rearranged.
>>>> >>
>>>> >> It seems to me that reading by ASC does not make use of the range
>>>> >> tombstone until C* reads the last sstable (which actually contains the
>>>> >> range tombstone and is flushed at node restart), and it wastes time
>>>> >> reading all rows that are actually not live anymore.
>>>> >>
>>>> >> Is this expected? Should the range tombstone actually help in these
>>>> >> cases?
>>>> >>
>>>> >> Thanks a lot!
>>>> >> Stefano
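
The skipping idea discussed in this thread (compare an sstable's maximum cell timestamp with the range tombstone's timestamp, and jump over the covered clustering range when nothing in that sstable can be newer) can be illustrated with a toy model. The following is a minimal Python sketch under stated assumptions, not Cassandra's read path: `SSTable`, `RangeTombstone`, `scan_naive` and `scan_with_skip` are hypothetical names, clustering keys are plain integers standing in for `timeid`, and ties between a row timestamp and the deletion timestamp are assumed to go to the deletion.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class RangeTombstone:
    start: int       # inclusive lower clustering bound (stand-in for timeid)
    end: int         # exclusive upper clustering bound
    deleted_at: int  # write timestamp of the DELETE


@dataclass
class SSTable:
    rows: list  # non-empty list of (clustering_key, write_timestamp), sorted by key

    @property
    def max_timestamp(self):
        # Stand-in for the per-sstable timestamp statistics mentioned in the thread.
        return max(ts for _, ts in self.rows)


def scan_naive(sstable, tombstone):
    """Check every row against the tombstone, one by one."""
    live, examined = [], 0
    for key, ts in sstable.rows:
        examined += 1
        covered = tombstone.start <= key < tombstone.end
        if not (covered and ts <= tombstone.deleted_at):
            live.append(key)
    return live, examined


def scan_with_skip(sstable, tombstone):
    """Jump over the covered range when no row can be newer than the tombstone."""
    if sstable.max_timestamp > tombstone.deleted_at:
        # Some row may have been written after the delete: fall back to a full scan.
        return scan_naive(sstable, tombstone)
    # Toy stand-in for an index lookup; a real store would not rebuild the key list.
    keys = [k for k, _ in sstable.rows]
    lo = bisect_left(keys, tombstone.start)
    hi = bisect_left(keys, tombstone.end)
    live = keys[:lo] + keys[hi:]
    return live, len(live)


old = SSTable([(k, 100 + k) for k in range(10)])
rt = RangeTombstone(start=0, end=5, deleted_at=1000)
print(scan_naive(old, rt))      # ([5, 6, 7, 8, 9], 10)
print(scan_with_skip(old, rt))  # ([5, 6, 7, 8, 9], 5)
```

In this model both scans return the same live keys, but the second touches only the rows outside the deleted range. The fallback branch reflects the caveat raised in the thread: if any row in the sstable might be newer than the tombstone, the range cannot be skipped wholesale.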