Re: Iceberg old partition gc

2023-06-04 Thread Ryan Blue
Let me paraphrase the use case to make sure I'm getting it right: The idea
is to be able to remove expired data and delete the data files associated
with it, but without losing the history of other changes to the table.
Because new data and old data are modified in the same linear history,
physically removing old data (via snapshot expiration) prevents you from
keeping history for the new data.

There are a few ways I can think of to work around this. I think what most
people do is remove data a few days ahead of time so that it doesn't need
to be physically removed immediately. That's the default behavior, which I
think isn't what you want in this case.

Another option is to just delete the expired data files immediately. You'd
still have metadata references to them, but those won't cause issues as
long as no one tries to read the files. Of course, that runs into issues
with full table operations, like `select * limit 10` where you could
accidentally try to access a deleted file that's still referenced.

Last, I think you could solve this with branching, while also keeping
overhead down. The idea is to create a branch for each version you actually
want to keep. That's probably like a daily branch so you don't keep every
version of the table. Then you can apply deletes to all of the historical
branches and keep just the latest snapshot for each branch. That allows you
to select the table states you want to keep and still delete within that
set of states. Deleting data would be a bit more difficult, but you would
probably be able to reuse the same metadata changes for all the deletes.
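
To make the branching idea a bit more concrete, here is a toy Python model. It is only a sketch: the `Snapshot` class, the branch names, and the file names are made up for illustration and are not Iceberg's API. Each branch pins one table state, the delete is applied to every branch, and only the new head of each branch is kept. Any file that no retained snapshot references is then safe to remove physically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset  # data file names, written as "<partition>/<file>"


def delete_partition(snap, partition, new_id):
    """Return a new snapshot without the given partition's files."""
    kept = frozenset(f for f in snap.files if not f.startswith(partition + "/"))
    return Snapshot(new_id, kept)


def gc_partition(branches, partition, next_id):
    """Apply the delete to every branch and keep only each branch's new head."""
    new_branches = {}
    for name, head in sorted(branches.items()):
        new_branches[name] = delete_partition(head, partition, next_id)
        next_id += 1
    return new_branches


def unreferenced_files(before, after):
    """Files referenced before the GC but by no retained snapshot afterwards."""
    old = set().union(*(s.files for s in before.values()))
    live = set().union(*(s.files for s in after.values()))
    return old - live


# Three branches, each pinning one table state we want to keep.
branches = {
    "day-06-01": Snapshot(1, frozenset({"dt=05-01/a.parquet", "dt=06-01/b.parquet"})),
    "day-06-02": Snapshot(2, frozenset({"dt=05-01/a.parquet", "dt=06-01/b.parquet",
                                        "dt=06-02/c.parquet"})),
    "main": Snapshot(3, frozenset({"dt=05-01/a.parquet", "dt=06-01/b.parquet",
                                   "dt=06-02/c.parquet", "dt=06-03/d.parquet"})),
}

after = gc_partition(branches, "dt=05-01", next_id=4)
removable = unreferenced_files(branches, after)
print(sorted(removable))  # ['dt=05-01/a.parquet']
print(sorted(after))      # ['day-06-01', 'day-06-02', 'main']
```

In a real table the per-branch delete and the expiration would go through the engine's own branch and snapshot-expiration operations. The model only shows why the expired partition's files stop being referenced while several table states stay readable.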

It sounds like the last option is probably the one that makes the most
sense for you. Customizing history is a great use for tagging and branching.

Ryan

On Sat, Jun 3, 2023 at 5:03 AM Szehon Ho  wrote:

>> @Szehon, I am wondering if we can create materialized views for metadata
>> tables to support infinite history on metadata tables (like snapshots or
>> partitions). Obviously, materialized views can't be used for time travel or
>> rollback. They are only meant for maintaining long/infinite histories.
>
>
> Yea, that's a good idea, there's definitely options like building a tool
> outside Iceberg (dumped it from time to time to materialized view), or
> build a history-preserving catalog layer that saves old snapshot metadata,
> rather than building it in Iceberg spec itself to keep expired metadata
> files.
>
> Thanks
> Szehon
>
> On Sat, Jun 3, 2023 at 10:06 AM Steven Wu  wrote:
>
>> > the main use case I had was table historical analysis (last update time
>> for each partitions, how many snapshots did this table ever have, for
>> example),
>>
>> Partition level stats can probably help with questions like "last update
>> time for each partition".
>>
>> @Szehon, I am wondering if we can create materialized views for metadata
>> tables to support infinite history on metadata tables (like snapshots or
>> partitions). Obviously, materialized views can't be used for time travel or
>> rollback. They are only meant for maintaining long/infinite histories.
>>
>> > One use case is the user might need to time travel to a certain
>> snapshot. However, such a snapshot is expired due to the snapshot
>> expiration that only retains the latest snapshot operation, and this
>> operation's only intent is to remove the gc partition. It seems a little
>> overkill to me.
>>
>> @Pucheng, usually people keep Iceberg snapshot history (for time travel
>> or rollback) for a few days (like 7). Very long history can burden the
>> metadata system. tagging can extend the history with selective snapshots.
>>
>> It seems that you are saying that purging actions of old partitions are
>> creating new snapshots, which are taking up some space in the snapshot
>> history. But if snapshot expiration is time based (like 7 days), this
>> shouldn't be a problem, right?
>>
>> On Fri, Jun 2, 2023 at 6:17 PM Szehon Ho  wrote:
>>
>>> Yea, for the original use case in this thread, agree it's delete (soft)
>>> + expire (physical, permanent).
>>>
>>> I guess I should have phrased my thought better, I was replying to
>>> Ryan's question above
>>>
>>>> We don't often have people ask to keep snapshots that can't be read
>>>
>>>
>>> and had thought it'd be nice to have a ExpireSnapshot mode where we
>>> keep older metadata for longer periods of time beyond physical expiration.
>>>
>>> But the main use case I had was table historical analysis (last update
>>> time for each partitions, how many snapshots did this table ever have, for
>>> example), it's more a nice-to-have and definitely not sure it is a very
>>> compelling use-case.  Another option I guess, is custom catalog can keep
>>> around these historical information.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Fri, Jun 2, 2023 at 10:28 PM Russell Spitzer wrote:
>>>
>>>> I think "soft-mode" is really just doing the delete. You can then
>>>> recover the snapshot if you happen to have accidental

Re: Iceberg old partition gc

2023-06-03 Thread Szehon Ho
>
> @Szehon, I am wondering if we can create materialized views for metadata
> tables to support infinite history on metadata tables (like snapshots or
> partitions). Obviously, materialized views can't be used for time travel or
> rollback. They are only meant for maintaining long/infinite histories.


Yea, that's a good idea. There are definitely options like building a tool
outside Iceberg (dumping the metadata tables to a materialized view from time
to time), or building a history-preserving catalog layer that saves old
snapshot metadata, rather than building it into the Iceberg spec itself to
keep expired metadata files.

Thanks
Szehon

On Sat, Jun 3, 2023 at 10:06 AM Steven Wu  wrote:

> > the main use case I had was table historical analysis (last update time
> for each partitions, how many snapshots did this table ever have, for
> example),
>
> Partition level stats can probably help with questions like "last update
> time for each partition".
>
> @Szehon, I am wondering if we can create materialized views for metadata
> tables to support infinite history on metadata tables (like snapshots or
> partitions). Obviously, materialized views can't be used for time travel or
> rollback. They are only meant for maintaining long/infinite histories.
>
> > One use case is the user might need to time travel to a certain
> snapshot. However, such a snapshot is expired due to the snapshot
> expiration that only retains the latest snapshot operation, and this
> operation's only intent is to remove the gc partition. It seems a little
> overkill to me.
>
> @Pucheng, usually people keep Iceberg snapshot history (for time travel or
> rollback) for a few days (like 7). Very long history can burden the
> metadata system. tagging can extend the history with selective snapshots.
>
> It seems that you are saying that purging actions of old partitions are
> creating new snapshots, which are taking up some space in the snapshot
> history. But if snapshot expiration is time based (like 7 days), this
> shouldn't be a problem, right?
>
> On Fri, Jun 2, 2023 at 6:17 PM Szehon Ho  wrote:
>
>> Yea, for the original use case in this thread, agree it's delete (soft) +
>> expire (physical, permanent).
>>
>> I guess I should have phrased my thought better, I was replying to Ryan's
>> question above
>>
>>>  We don't often have people ask to keep snapshots that can't be read
>>
>>
>> and had thought it'd be nice to have a ExpireSnapshot mode where we
>> keep older metadata for longer periods of time beyond physical expiration.
>>
>> But the main use case I had was table historical analysis (last update
>> time for each partitions, how many snapshots did this table ever have, for
>> example), it's more a nice-to-have and definitely not sure it is a very
>> compelling use-case.  Another option I guess, is custom catalog can keep
>> around these historical information.
>>
>> Thanks
>> Szehon
>>
>> On Fri, Jun 2, 2023 at 10:28 PM Russell Spitzer wrote:
>>
>>> I think "soft-mode" is really just doing the delete. You can then
>>> recover the snapshot if you happen to have accidentally TTL'd a partition.
>>>
>>> On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho 
>>> wrote:
>>>
>>>> I think this violates Iceberg’s assumption of immutable snapshots.
>>>> That would require modifying the old snapshot to no longer point to those
>>>> gc’ed data files, else not sure how you can time-travel to read from that
>>>> snapshot, if some of its files are deleted?
>>>>
>>>> That being said, I also had this thought at some point, to keep
>>>> snapshot info around longer.  I expect most organizations operate in a mode
>>>> where they expire snapshots after a few days, and reasonably expect any
>>>> time-travel or snapshot-related operation (like CDC) to happen within this
>>>> timeframe.   And of course, use tags to keep the snapshot from expiration.
>>>>
>>>> But there are some use-cases where keeping more snapshot metadata for a
>>>> period longer than when it could be read could be interesting.  For
>>>> example, if I want to know info about the snapshot that added each data
>>>> file, we probably have lost most of those snapshot metadata as they were
>>>> added long ago.  Example, the frequent ask to find each partition's last
>>>> modified time, (in an earlier email thread).
>>>>
>>>> I haven't thought it completely through, but it crossed my mind that a
>>>> ‘Soft’-mode of ExpireSnapshot may be useful, where we can delete data files
>>>> but just mark snapshot’s metadata files as expired without physically
>>>> deleting them, and so retain the ability to answer these questions.  It
>>>> could be done by adding ‘expired-snapshots’ list to metadata.json.  That
>>>> being said, its a singular use case and not sure if anyone also has
>>>> interest or other use-case?  It would add a bit of complexity.
>>>>
>>>> Thanks
>>>> Szehon
>>>> Szehon
>>>>
>>>> On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang
>>>> wrote:
>>>>
>>>>> Ryan,
>>>>>
>>>>> One use case is the user mig

Re: Iceberg old partition gc

2023-06-02 Thread Steven Wu
> the main use case I had was table historical analysis (last update time
> for each partitions, how many snapshots did this table ever have, for
> example),

Partition level stats can probably help with questions like "last update
time for each partition".

@Szehon, I am wondering if we can create materialized views for metadata
tables to support infinite history on metadata tables (like snapshots or
partitions). Obviously, materialized views can't be used for time travel or
rollback. They are only meant for maintaining long/infinite histories.

> One use case is the user might need to time travel to a certain snapshot.
> However, such a snapshot is expired due to the snapshot expiration
> that only retains the latest snapshot operation, and this operation's only
> intent is to remove the gc partition. It seems a little overkill to me.

@Pucheng, usually people keep Iceberg snapshot history (for time travel or
rollback) for a few days (like 7). A very long history can burden the
metadata system. Tagging can extend the history for selected snapshots.

It seems that you are saying that purging actions of old partitions are
creating new snapshots, which are taking up some space in the snapshot
history. But if snapshot expiration is time based (like 7 days), this
shouldn't be a problem, right?
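
A small Python sketch of why time-based expiration keeps recent history while still letting the purged partition's files go away. This is a toy model under assumptions, not Iceberg's implementation: the `expire` helper, the dict layout, and the file names are invented for illustration. Snapshots older than the cutoff are dropped (the current one is always kept), and a file becomes removable once no retained snapshot references it.

```python
from datetime import datetime, timedelta


def expire(snapshots, now, max_age):
    """Keep snapshots newer than now - max_age, plus the current (last) one.

    Returns the retained snapshots and the files no retained snapshot uses.
    """
    cutoff = now - max_age
    retained = [s for s in snapshots[:-1] if s["ts"] >= cutoff] + [snapshots[-1]]
    live = set().union(*(s["files"] for s in retained))
    referenced = set().union(*(s["files"] for s in snapshots))
    return retained, referenced - live


t0 = datetime(2023, 6, 1)
snapshots = [
    {"id": 1, "ts": t0, "files": {"dt=05-01/a", "dt=06-01/b"}},
    # Snapshot 2 stands in for the "delete from tbl where dt = ..." commit.
    {"id": 2, "ts": t0 + timedelta(days=1), "files": {"dt=06-01/b"}},
    {"id": 3, "ts": t0 + timedelta(days=2), "files": {"dt=06-01/b", "dt=06-02/c"}},
]
week = timedelta(days=7)

# Shortly after the delete: everything is retained, nothing is removable yet.
kept_early, removable_early = expire(snapshots, t0 + timedelta(days=3), week)
print([s["id"] for s in kept_early], sorted(removable_early))

# Once snapshot 1 ages out, the old partition's file is removable, while the
# delete commit and the later history are still available for time travel.
kept_late, removable_late = expire(snapshots, t0 + timedelta(days=7, hours=12), week)
print([s["id"] for s in kept_late], sorted(removable_late))
```

The trade-off is that physical removal lags the delete by the retention window, which is fine unless the data has to be gone immediately.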

On Fri, Jun 2, 2023 at 6:17 PM Szehon Ho  wrote:

> Yea, for the original use case in this thread, agree it's delete (soft) +
> expire (physical, permanent).
>
> I guess I should have phrased my thought better, I was replying to Ryan's
> question above
>
>>  We don't often have people ask to keep snapshots that can't be read
>
>
> and had thought it'd be nice to have a ExpireSnapshot mode where we
> keep older metadata for longer periods of time beyond physical expiration.
>
> But the main use case I had was table historical analysis (last update
> time for each partitions, how many snapshots did this table ever have, for
> example), it's more a nice-to-have and definitely not sure it is a very
> compelling use-case.  Another option I guess, is custom catalog can keep
> around these historical information.
>
> Thanks
> Szehon
>
> On Fri, Jun 2, 2023 at 10:28 PM Russell Spitzer 
> wrote:
>
>> I think "soft-mode" is really just doing the delete. You can then recover
>> the snapshot if you happen to have accidentally TTL'd a partition.
>>
>> On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho  wrote:
>>
>>> I think this violates Iceberg’s assumption of immutable snapshots.  That
>>> would require modifying the old snapshot to no longer point to those gc’ed
>>> data files, else not sure how you can time-travel to read from that
>>> snapshot, if some of its files are deleted?
>>>
>>> That being said, I also had this thought at some point, to keep snapshot
>>> info around longer.  I expect most organizations operate in a mode where
>>> they expire snapshots after a few days, and reasonably expect any
>>> time-travel or snapshot-related operation (like CDC) to happen within this
>>> timeframe.   And of course, use tags to keep the snapshot from expiration.
>>>
>>> But there are some use-cases where keeping more snapshot metadata for a
>>> period longer than when it could be read could be interesting.  For
>>> example, if I want to know info about the snapshot that added each data
>>> file, we probably have lost most of those snapshot metadata as they were
>>> added long ago.  Example, the frequent ask to find each partition's last
>>> modified time, (in an earlier email thread).
>>>
>>> I haven't thought it completely through, but it crossed my mind that a
>>> ‘Soft’-mode of ExpireSnapshot may be useful, where we can delete data files
>>> but just mark snapshot’s metadata files as expired without physically
>>> deleting them, and so retain the ability to answer these questions.  It
>>> could be done by adding ‘expired-snapshots’ list to metadata.json.  That
>>> being said, its a singular use case and not sure if anyone also has
>>> interest or other use-case?  It would add a bit of complexity.
>>>
>>> Thanks
>>> Szehon
>>> Szehon
>>>
>>> On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang 
>>> wrote:
>>>
>>>> Ryan,
>>>>
>>>> One use case is the user might need to time travel to a certain
>>>> snapshot. However, such a snapshot is expired due to the snapshot
>>>> expiration that only retains the latest snapshot operation, and this
>>>> operation's only intent is to remove the gc partition. It seems a little
>>>> overkill to me.
>>>>
>>>> I hope my explanation makes sense to you.
>>>>
>>>> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue wrote:
>>>>
>>>>> Pucheng,
>>>>>
>>>>> What is the use case around keeping the snapshot longer? We don't
>>>>> often have people ask to keep snapshots that can't be read, so it sounds
>>>>> like you might have something specific in mind?
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang
>>>>> wrote:
>>>>>
>>>>>> Hi community,
>>>>>>
>>>>>> In my organization, a big portion of the datasets are partitioned

Re: Iceberg old partition gc

2023-06-02 Thread Szehon Ho
Yea, for the original use case in this thread, agree it's delete (soft) +
expire (physical, permanent).

I guess I should have phrased my thought better, I was replying to Ryan's
question above

>  We don't often have people ask to keep snapshots that can't be read


and had thought it'd be nice to have an ExpireSnapshot mode where we
keep older metadata for longer periods of time beyond physical expiration.

But the main use case I had was table historical analysis (last update time
for each partition, how many snapshots this table ever had, for
example). It's more of a nice-to-have, and I'm definitely not sure it is a
very compelling use case. Another option, I guess, is that a custom catalog
can keep this historical information around.

Thanks
Szehon

On Fri, Jun 2, 2023 at 10:28 PM Russell Spitzer 
wrote:

> I think "soft-mode" is really just doing the delete. You can then recover
> the snapshot if you happen to have accidentally TTL'd a partition.
>
> On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho  wrote:
>
>> I think this violates Iceberg’s assumption of immutable snapshots.  That
>> would require modifying the old snapshot to no longer point to those gc’ed
>> data files, else not sure how you can time-travel to read from that
>> snapshot, if some of its files are deleted?
>>
>> That being said, I also had this thought at some point, to keep snapshot
>> info around longer.  I expect most organizations operate in a mode where
>> they expire snapshots after a few days, and reasonably expect any
>> time-travel or snapshot-related operation (like CDC) to happen within this
>> timeframe.   And of course, use tags to keep the snapshot from expiration.
>>
>> But there are some use-cases where keeping more snapshot metadata for a
>> period longer than when it could be read could be interesting.  For
>> example, if I want to know info about the snapshot that added each data
>> file, we probably have lost most of those snapshot metadata as they were
>> added long ago.  Example, the frequent ask to find each partition's last
>> modified time, (in an earlier email thread).
>>
>> I haven't thought it completely through, but it crossed my mind that a
>> ‘Soft’-mode of ExpireSnapshot may be useful, where we can delete data files
>> but just mark snapshot’s metadata files as expired without physically
>> deleting them, and so retain the ability to answer these questions.  It
>> could be done by adding ‘expired-snapshots’ list to metadata.json.  That
>> being said, its a singular use case and not sure if anyone also has
>> interest or other use-case?  It would add a bit of complexity.
>>
>> Thanks
>> Szehon
>> Szehon
>>
>> On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang 
>> wrote:
>>
>>> Ryan,
>>>
>>> One use case is the user might need to time travel to a certain
>>> snapshot. However, such a snapshot is expired due to the snapshot
>>> expiration that only retains the latest snapshot operation, and this
>>> operation's only intent is to remove the gc partition. It seems a little
>>> overkill to me.
>>>
>>> I hope my explanation makes sense to you.
>>>
>>> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue  wrote:
>>>
>>>> Pucheng,
>>>>
>>>> What is the use case around keeping the snapshot longer? We don't often
>>>> have people ask to keep snapshots that can't be read, so it sounds like you
>>>> might have something specific in mind?
>>>>
>>>> Ryan
>>>>
>>>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang
>>>> wrote:
>>>>
>>>>> Hi community,
>>>>>
>>>>> In my organization, a big portion of the datasets are partitioned by
>>>>> date, normally we keep the latest X dates of partition for a given
>>>>> dataset.
>>>>>
>>>>> One issue that always bothers me is if I want to delete a partition
>>>>> that should be GC, I will run SQL query "delete from tbl where dt = ..."
>>>>> and do snapshot expiration to keep the latest snapshot to make sure that
>>>>> partition data is physically removed. However, the downside of this
>>>>> approach is the table snapshot history will be completely lost..
>>>>>
>>>>> I wonder if anyone else in the community has the same pain point? How
>>>>> do you solve this? I would love to understand if there is a solution to
>>>>> this otherwise we can brainstorm if there is a way to solve this.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Pucheng
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>


Re: Iceberg old partition gc

2023-06-02 Thread Russell Spitzer
I think "soft-mode" is really just doing the delete. You can then recover
the snapshot if you happen to have accidentally TTL'd a partition.

On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho  wrote:

> I think this violates Iceberg’s assumption of immutable snapshots.  That
> would require modifying the old snapshot to no longer point to those gc’ed
> data files, else not sure how you can time-travel to read from that
> snapshot, if some of its files are deleted?
>
> That being said, I also had this thought at some point, to keep snapshot
> info around longer.  I expect most organizations operate in a mode where
> they expire snapshots after a few days, and reasonably expect any
> time-travel or snapshot-related operation (like CDC) to happen within this
> timeframe.   And of course, use tags to keep the snapshot from expiration.
>
> But there are some use-cases where keeping more snapshot metadata for a
> period longer than when it could be read could be interesting.  For
> example, if I want to know info about the snapshot that added each data
> file, we probably have lost most of those snapshot metadata as they were
> added long ago.  Example, the frequent ask to find each partition's last
> modified time, (in an earlier email thread).
>
> I haven't thought it completely through, but it crossed my mind that a
> ‘Soft’-mode of ExpireSnapshot may be useful, where we can delete data files
> but just mark snapshot’s metadata files as expired without physically
> deleting them, and so retain the ability to answer these questions.  It
> could be done by adding ‘expired-snapshots’ list to metadata.json.  That
> being said, its a singular use case and not sure if anyone also has
> interest or other use-case?  It would add a bit of complexity.
>
> Thanks
> Szehon
> Szehon
>
> On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang 
> wrote:
>
>> Ryan,
>>
>> One use case is the user might need to time travel to a certain snapshot.
>> However, such a snapshot is expired due to the snapshot expiration
>> that only retains the latest snapshot operation, and this operation's only
>> intent is to remove the gc partition. It seems a little overkill to me.
>>
>> I hope my explanation makes sense to you.
>>
>> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue  wrote:
>>
>>> Pucheng,
>>>
>>> What is the use case around keeping the snapshot longer? We don't often
>>> have people ask to keep snapshots that can't be read, so it sounds like you
>>> might have something specific in mind?
>>>
>>> Ryan
>>>
>>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang 
>>> wrote:
>>>
>>>> Hi community,
>>>>
>>>> In my organization, a big portion of the datasets are partitioned by
>>>> date, normally we keep the latest X dates of partition for a given dataset.
>>>>
>>>> One issue that always bothers me is if I want to delete a partition
>>>> that should be GC, I will run SQL query "delete from tbl where dt = ..."
>>>> and do snapshot expiration to keep the latest snapshot to make sure that
>>>> partition data is physically removed. However, the downside of this
>>>> approach is the table snapshot history will be completely lost..
>>>>
>>>> I wonder if anyone else in the community has the same pain point? How
>>>> do you solve this? I would love to understand if there is a solution to
>>>> this otherwise we can brainstorm if there is a way to solve this.
>>>>
>>>> Thanks!
>>>>
>>>> Pucheng
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>


Re: Iceberg old partition gc

2023-06-02 Thread Szehon Ho
I think this violates Iceberg’s assumption of immutable snapshots.  That
would require modifying the old snapshot to no longer point to those gc’ed
data files, else not sure how you can time-travel to read from that
snapshot, if some of its files are deleted?

That being said, I also had this thought at some point, to keep snapshot
info around longer.  I expect most organizations operate in a mode where
they expire snapshots after a few days, and reasonably expect any
time-travel or snapshot-related operation (like CDC) to happen within this
timeframe.   And of course, use tags to keep the snapshot from expiration.

But there are some use cases where keeping snapshot metadata for a
period longer than the snapshot can be read could be interesting.  For
example, if I want to know info about the snapshot that added each data
file, we have probably lost most of that snapshot metadata, as the files were
added long ago.  Another example is the frequent ask to find each partition's
last modified time (in an earlier email thread).

I haven't thought it completely through, but it crossed my mind that a
‘Soft’-mode of ExpireSnapshot may be useful, where we can delete data files
but just mark the snapshot’s metadata files as expired without physically
deleting them, and so retain the ability to answer these questions.  It
could be done by adding an ‘expired-snapshots’ list to metadata.json.  That
being said, it's a singular use case and I'm not sure if anyone else has
interest or another use case?  It would add a bit of complexity.
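
A rough Python sketch of what such a soft expiration could look like on a heavily simplified metadata document. To be clear about the assumptions: `expired-snapshots` is the hypothetical field floated above and is not part of the Iceberg spec, `soft_expire` is an illustrative helper, and the dict only mimics a few metadata.json fields. Expired entries are moved rather than dropped, so questions about past snapshots can still be answered from metadata.

```python
import json


def soft_expire(metadata, keep_ids):
    """Move snapshots not in keep_ids to a hypothetical 'expired-snapshots' list.

    The entries are kept whole, since the idea is to retain the snapshot's
    metadata files and only delete the data files they point to.
    """
    live = [s for s in metadata["snapshots"] if s["snapshot-id"] in keep_ids]
    gone = [s for s in metadata["snapshots"] if s["snapshot-id"] not in keep_ids]
    expired = list(metadata.get("expired-snapshots", [])) + gone
    return {**metadata, "snapshots": live, "expired-snapshots": expired}


# Simplified stand-in for metadata.json; real files carry many more fields.
metadata = {
    "current-snapshot-id": 3,
    "snapshots": [
        {"snapshot-id": 1, "timestamp-ms": 1685577600000,
         "summary": {"operation": "append"}, "manifest-list": "snap-1.avro"},
        {"snapshot-id": 2, "timestamp-ms": 1685664000000,
         "summary": {"operation": "append"}, "manifest-list": "snap-2.avro"},
        {"snapshot-id": 3, "timestamp-ms": 1685750400000,
         "summary": {"operation": "delete"}, "manifest-list": "snap-3.avro"},
    ],
}

updated = soft_expire(metadata, keep_ids={3})
print(json.dumps(updated["expired-snapshots"], indent=2))

# "How many snapshots did this table ever have?" is still answerable.
total_snapshots = len(updated["snapshots"]) + len(updated["expired-snapshots"])
print(total_snapshots)
```

Readers would need to treat entries in the expired list as not readable for time travel, which is part of the extra complexity.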

Thanks
Szehon
Szehon

On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang 
wrote:

> Ryan,
>
> One use case is the user might need to time travel to a certain snapshot.
> However, such a snapshot is expired due to the snapshot expiration
> that only retains the latest snapshot operation, and this operation's only
> intent is to remove the gc partition. It seems a little overkill to me.
>
> I hope my explanation makes sense to you.
>
> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue  wrote:
>
>> Pucheng,
>>
>> What is the use case around keeping the snapshot longer? We don't often
>> have people ask to keep snapshots that can't be read, so it sounds like you
>> might have something specific in mind?
>>
>> Ryan
>>
>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang 
>> wrote:
>>
>>> Hi community,
>>>
>>> In my organization, a big portion of the datasets are partitioned by
>>> date, normally we keep the latest X dates of partition for a given dataset.
>>>
>>> One issue that always bothers me is if I want to delete a partition
>>> that should be GC, I will run SQL query "delete from tbl where dt = ..."
>>> and do snapshot expiration to keep the latest snapshot to make sure that
>>> partition data is physically removed. However, the downside of this
>>> approach is the table snapshot history will be completely lost..
>>>
>>> I wonder if anyone else in the community has the same pain point? How do
>>> you solve this? I would love to understand if there is a solution to this
>>> otherwise we can brainstorm if there is a way to solve this.
>>>
>>> Thanks!
>>>
>>> Pucheng
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: Iceberg old partition gc

2023-06-01 Thread Pucheng Yang
Ryan,

One use case is that the user might need to time travel to a certain snapshot.
However, that snapshot has already been expired by a snapshot expiration
that retains only the latest snapshot, and that expiration's only
intent was to physically remove the GC'ed partition. It seems a little
overkill to me.

I hope my explanation makes sense to you.

On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue  wrote:

> Pucheng,
>
> What is the use case around keeping the snapshot longer? We don't often
> have people ask to keep snapshots that can't be read, so it sounds like you
> might have something specific in mind?
>
> Ryan
>
> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang 
> wrote:
>
>> Hi community,
>>
>> In my organization, a big portion of the datasets are partitioned by
>> date, normally we keep the latest X dates of partition for a given dataset.
>>
>> One issue that always bothers me is if I want to delete a partition
>> that should be GC, I will run SQL query "delete from tbl where dt = ..."
>> and do snapshot expiration to keep the latest snapshot to make sure that
>> partition data is physically removed. However, the downside of this
>> approach is the table snapshot history will be completely lost..
>>
>> I wonder if anyone else in the community has the same pain point? How do
>> you solve this? I would love to understand if there is a solution to this
>> otherwise we can brainstorm if there is a way to solve this.
>>
>> Thanks!
>>
>> Pucheng
>>
>
>
> --
> Ryan Blue
> Tabular
>


Re: Iceberg old partition gc

2023-06-01 Thread Ryan Blue
Pucheng,

What is the use case around keeping the snapshot longer? We don't often
have people ask to keep snapshots that can't be read, so it sounds like you
might have something specific in mind?

Ryan

On Wed, May 31, 2023 at 8:19 PM Pucheng Yang 
wrote:

> Hi community,
>
> In my organization, a big portion of the datasets are partitioned by date;
> normally we keep the latest X dates of partitions for a given dataset.
>
> One issue that always bothers me is that if I want to delete a partition
> that should be GC'ed, I run the SQL query "delete from tbl where dt = ..."
> and then do a snapshot expiration that keeps only the latest snapshot to
> make sure that partition's data is physically removed. However, the downside
> of this approach is that the table snapshot history is completely lost.
>
> I wonder if anyone else in the community has the same pain point? How do
> you solve this? I would love to understand if there is a solution to this
> otherwise we can brainstorm if there is a way to solve this.
>
> Thanks!
>
> Pucheng
>


-- 
Ryan Blue
Tabular