Re: row tombstones as a separate sstable citizen

2018-02-16 Thread Carl Mueller
Re: the tombstone sstables being read-only inputs to compaction: there
would be one case where the non-tombstone sstables would be inputs to the
compaction of the row tombstones, namely when the row no longer exists in
any of the data sstables with respect to the row tombstone timestamp.
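
A hand-wavy sketch of that check, with hypothetical interfaces (none of
this is real Cassandra code):

    import java.util.List;

    // Hypothetical view of a data sstable; not a real Cassandra class.
    interface DataSSTable {
        boolean mightContain(byte[] rowKey); // bloom filter membership check
        long minTimestamp();                 // oldest write in this sstable
    }

    final class TombstoneDropCheck {
        // A row tombstone can be dropped during a tombstone-sstable
        // compaction only when no data sstable can still hold data older
        // than the tombstone for that key. This is the one case where the
        // data sstables feed (read-only) into the tombstone compaction.
        static boolean canDrop(byte[] rowKey, long tombstoneTimestamp,
                               List<DataSSTable> dataSSTables) {
            for (DataSSTable sstable : dataSSTables) {
                if (sstable.mightContain(rowKey)
                        && sstable.minTimestamp() <= tombstoneTimestamp)
                    return false; // may still shadow live data; keep it
            }
            return true; // shadows nothing anywhere; safe to purge
        }
    }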

There may be other opportunities for simplified processing of the row
tombstone sstables, since they are pure key-value (row key : deletion flag)
rather than columnar data. We may be able to offer the option of a memory
map if the row tombstones fit in a sufficiently small space. A "row cache"
for these could be far simpler than the general row-cache difficulties for
Cassandra data. Those caches could also be loaded only during compaction
operations.
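
To make the "pure key-value" point concrete, a sketch of how simple the
structure could be (hypothetical names, nothing like the real row cache):

    import java.nio.charset.StandardCharsets;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Sketch of an in-memory image of a row-tombstone sstable:
    // just row key -> deletion timestamp, no columnar structure at all.
    final class RowTombstoneTable {
        private final SortedMap<String, Long> deletions = new TreeMap<>();

        void add(String rowKey, long deletionTimestampMicros) {
            // Keep the newest deletion for a key.
            deletions.merge(rowKey, deletionTimestampMicros, Math::max);
        }

        // Read-path short circuit: anything written at or before the
        // deletion timestamp is shadowed, so other sstables need not
        // be consulted for it.
        boolean shadows(String rowKey, long writeTimestampMicros) {
            Long deletedAt = deletions.get(rowKey);
            return deletedAt != null && deletedAt >= writeTimestampMicros;
        }

        // Rough size: key bytes plus one long per entry, which is why
        // even millions of row tombstones could plausibly be memory-mapped.
        long approximateSizeBytes() {
            long bytes = 0;
            for (String key : deletions.keySet())
                bytes += key.getBytes(StandardCharsets.UTF_8).length + Long.BYTES;
            return bytes;
        }
    }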


Re: row tombstones as a separate sstable citizen

2018-02-15 Thread Jeff Jirsa
Worth a JIRA, yes


Re: row tombstones as a separate sstable citizen

2018-02-14 Thread Carl Mueller
So is this at least a decent candidate for a feature request ticket?


Re: row tombstones as a separate sstable citizen

2018-02-13 Thread Carl Mueller
I'm particularly interested in getting the tombstones to "promote" up the
levels of LCS more quickly. Currently they get attached at the low level
and don't propagate up to higher levels until enough activity at a lower
level promotes the data. Meanwhile, LCS means compactions can occur in
parallel at each level. So row tombstones in their own sstable could be
promoted up the LCS levels preferentially, before normal processes would
move them up.

So if the delete-only sstables could move up more quickly, the
tombstone-driven compaction at the higher levels would happen sooner.
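
Roughly what I mean by preferential promotion, with a made-up
isTombstoneOnly flag rather than the real LeveledManifest logic:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical sstable view for this sketch.
    interface LeveledSSTable {
        boolean isTombstoneOnly(); // contains only row tombstones
        long sizeBytes();
    }

    final class PromotionPolicy {
        // When picking sstables to compact from level N into level N+1,
        // delete-only sstables jump the queue so their tombstones reach
        // the high levels (where most data sits) ahead of normal churn.
        static List<LeveledSSTable> promotionOrder(List<LeveledSSTable> level) {
            List<LeveledSSTable> ordered = new ArrayList<>(level);
            ordered.sort(Comparator
                    .comparing((LeveledSSTable s) -> !s.isTombstoneOnly())
                    .thenComparingLong(LeveledSSTable::sizeBytes));
            return ordered;
        }
    }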

The threshold stuff is nice if I read 7019 correctly, but what is the %
there? % of rows? % of columns? Or % of the size of the sstable? Row
tombstones are pretty compact, being just the rowkey and the tombstone
marker. So if 7019 is triggered at 10% of the sstable size, even a crapton
of tombstones deleting practically the entire database would only be a
small % of the sstable's size.
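
For reference, the knobs I'm aware of are compaction subproperties:
tombstone_threshold reads as a ratio of droppable tombstones to estimated
columns (default 0.2), not a byte fraction; tombstone_compaction_interval
is in seconds; and provide_overlapping_tombstones (ROW or CELL) is, if I
have my tickets straight, the 7019 addition in 3.10+. A sketch, with a
made-up table name:

    -- Illustrative only (hypothetical keyspace/table); option
    -- availability depends on the Cassandra version in use.
    ALTER TABLE my_ks.my_table WITH compaction = {
      'class': 'LeveledCompactionStrategy',
      'tombstone_threshold': '0.2',
      'tombstone_compaction_interval': '86400',
      'provide_overlapping_tombstones': 'ROW'
    };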

Since the row tombstones are so compact, I think they are good candidates
for special handling.


Re: row tombstones as a separate sstable citizen

2018-02-13 Thread J. D. Jordan
Have you taken a look at the new stuff introduced by 
https://issues.apache.org/jira/browse/CASSANDRA-7019 ?  I think it may go a 
ways to reducing the need for something complicated like this.
Though it is an interesting idea as special handling for bulk deletes.  If they 
were truly just sstables that only contained deletes, the logic from 7019 would 
probably go a long way. Though if you are bulk inserting deletes, that is what 
you would end up with, so maybe it already works.

-Jeremiah


Re: row tombstones as a separate sstable citizen

2018-02-13 Thread Jeff Jirsa
On Tue, Feb 13, 2018 at 2:38 PM, Carl Mueller 
wrote:

> In the process of doing my second major data purge from a Cassandra system.
>
> Almost all of my purging is done via row tombstones. While performing this
> for the second time, and trying to cajole compaction (in 2.1.x,
> LeveledCompactionStrategy) to goddamn actually compact the data, I've been
> wondering why there isn't a separate set of sstable infrastructure
> set up for row deletion tombstones.
>
> I'm imagining that row tombstones are written to separate sstables from
> mainline data updates/appends and range/column tombstones.
>
> By writing them to separate sstables, the compaction systems can
> preferentially merge / process them when compacting sstables.
>
> This would create an additional sstable for lookup in the bloom filters,
> granted. I had visions of short-circuiting the lookups to other sstables if
> a row tombstone was present in one of the special row tombstone sstables.
>
>
All of the above sounds really interesting to me, but I suspect it's a LOT
of work to make it happen correctly.

You'd almost end up with 2 sets of logs for the LSM - a tombstone
log/generation, and a data log/generation, and the tombstone logs would be
read-only inputs to data compactions.
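
Something like this shape, hand-waving with hypothetical types:

    import java.util.Iterator;

    // Hypothetical shapes for the sketch; none of these are real classes.
    interface Row { byte[] key(); long timestamp(); }
    interface TombstoneLog { boolean shadows(byte[] key, long writeTimestamp); }
    interface DataWriter { void append(Row row); }

    final class DataCompaction {
        // The tombstone log is consulted read-only: rows it shadows are
        // dropped from the output, but the tombstone sstables themselves
        // are never rewritten here - they'd compact on their own schedule.
        static void compact(Iterator<Row> mergedDataInput,
                            TombstoneLog tombstones, DataWriter out) {
            while (mergedDataInput.hasNext()) {
                Row row = mergedDataInput.next();
                if (!tombstones.shadows(row.key(), row.timestamp()))
                    out.append(row);
            }
        }
    }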


> But that would only be possible if there were the notion of a "super row
> tombstone" that permanently deleted a rowkey and invalidated all future
> writes. Kind of like how a tombstone with a mistakenly huge timestamp
> becomes a sneaky permanent tombstone, but intentional. There could be a
> special operation / statement to undo this permanent tombstone, and since
> the row tombstones would be in their own dedicated sstables, they could
> process and compact more quickly, with prioritization by the compactor.
>
>
This part sounds way less interesting to me (other than the fact you can
already do this with a timestamp in the future, but it'll gc away at gcgs).
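
i.e., something like this, with made-up names (the timestamp is in
microseconds):

    -- Row tombstone with a far-future writetime: it shadows any later
    -- write made with an ordinary current-time timestamp.
    DELETE FROM my_ks.my_table
    USING TIMESTAMP 9999999999999999
    WHERE row_key = 'never-coming-back';
    -- But it's still a normal tombstone: once gc_grace_seconds pass,
    -- compaction is allowed to purge it away.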


> I'm thinking there must be something I am forgetting in the
> read/write/compaction paths that invalidates this.
>

There are a lot of places where we do "smart" things to make sure we don't
accidentally resurrect data. The read path includes old sstables for
tombstones, for example. Those all need to be concretely identified and
handled (and tested).