Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-08-06 Thread Yufei Gu
> Yes, it won't impact reads/writes but it may become a bottleneck for
> other operations that need that information.

We can set a limit on the number of retained snapshot entries and purge the
oldest ones on each commit, just as we do for metadata logs. I admit that it
doesn't seem like an elegant solution, but it may solve most of the problems.
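
For illustration, this could mirror how metadata-log retention is configured
today. A minimal sketch, assuming an existing org.apache.iceberg.Table handle
named `table`; the metadata-log properties exist today, while the last one is
purely hypothetical:

    table.updateProperties()
        // existing knobs that cap/prune metadata-log entries on each commit
        .set("write.metadata.previous-versions-max", "100")
        .set("write.metadata.delete-after-commit.enabled", "true")
        // hypothetical analogue capping retained expired-snapshot entries
        .set("history.expired-snapshots-max", "1000")
        .commit();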

> I'd be curious to hear more from people who have experience
> implementing the REST catalog API.

The REST catalog can definitely preserve snapshot entries for a long period,
but we still need an interface/spec that allows metadata table queries (on
the client side) to reference these expired entries.
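
Something like this is the kind of client-side query I mean (a sketch only;
the `expired_snapshots` metadata table is hypothetical, and is exactly what
would need that interface/spec):

    import org.apache.spark.sql.SparkSession;

    // sketch: querying retained expired-snapshot entries as a metadata table
    SparkSession spark = SparkSession.builder().getOrCreate();
    spark.sql("SELECT snapshot_id, committed_at FROM db.tbl.expired_snapshots")
        .show();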

Yufei


Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-08-06 Thread Anton Okolnychyi
I agree it is unfortunate to not be able to find the snapshot information
from a manifest entry when the original snapshot is expired even though we
still know the snapshot ID that added the file. I am not sure about a
separate JSON file, though. It is still JSON and I bet people will store
the snapshot history forever, so the size of that file will gradually
increase. Yes, it won't impact reads/writes but it may become a bottleneck
for other operations that need that information. Using Parquet may help but
I am not sure that's the right approach overall.

I'd be curious to hear more from people who have experience
implementing the REST catalog API. It seems like most implementations have
addressed that or at least have a way to do that.

- Anton


Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-08-05 Thread Yufei Gu
Thanks Szehon for the new proposal. I think it is a useful feature with the
least spec change. A candidate for the v3 spec?

Yufei



Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-16 Thread Szehon Ho
Hi,

Thanks for reading through the proposal and the good feedback. I was
thinking about the mentioned concerns:

   - The motivation for the change
   - Too much additional metadata (storage overhead, namenode pressure on
   HDFS)
   - Performance impact for read/writing TableMetadata
   - Some impact to existing Table APIs and maintenance procedures, which
   would have to check for these snapshots

I chatted a bit offline with Yufei to brainstorm, and I wrote a V2 of the
proposal at the same link:
https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit.
I also tried to clarify the motivation in the doc with actual metadata
table queries that would be possible.

This version now simply adds an optional 'expired-snapshots-path' that
points to a file containing the metadata of expired Snapshots (sketched
after the list below).  I think this should address the above concerns:

   - Minimal storage overhead for just snapshot references (capped).  I no
   longer propose keeping old snapshot manifest-list/manifest files; the
   snapshot reference to the expired snapshot should be a good start.
   - Minimal perf overhead for reading/writing TableMetadata.  The
   additional file is only written by ExpireSnapshots if the feature is
   enabled, and only read on demand (via a metadata table query, for example).
   - No impact to other Table APIs or maintenance procedures (as these
   snapshots don't show up in the regular table.snapshots() list anymore).
   - Only an additive, optional spec change (backwards compatible)
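
For illustration, the new pointer in the table metadata JSON might look like
this (purely illustrative; the design doc defines the actual field name and
file layout):

    {
      "format-version": 3,
      "current-snapshot-id": 3055729675574597004,
      "snapshots": [ ... ],
      "expired-snapshots-path": "s3://bucket/tbl/metadata/expired-snapshots.json"
    }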

Of course, again, this feature is possible outside Iceberg, but the
advantage of doing it in Iceberg is that it could be integrated into
ExpireSnapshots and Metadata Table frameworks.
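
A minimal sketch of how that integration might surface (expireOlderThan and
commit are the existing ExpireSnapshots calls; preserveExpiredMetadata is a
hypothetical option for this feature):

    // assuming an existing org.apache.iceberg.Table handle named `table`
    long fiveDaysAgo = System.currentTimeMillis() - 5 * 24 * 60 * 60 * 1000L;
    table.expireSnapshots()
        .expireOlderThan(fiveDaysAgo)
        .preserveExpiredMetadata(true) // hypothetical: keep snapshot refs
        .commit();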

Curious what people think?

Thanks
Szehon


Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-10 Thread Péter Váry
> I believe DeleteOrphanFiles may be ok as is, because currently the
> logic walks down the reachable graph and marks those metadata files as
> 'not-orphan', so it should naturally walk these 'expired' snapshots as
> well.

We need to keep the metadata files, but remove data files if they are not
removed for whatever reason. Doable, but a logic change; roughly the
classification sketched below.
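
A sketch of what I mean (all helper methods here are hypothetical):

    // metadata files reachable only via expired snapshots are kept,
    // while data files they reference may still be treated as orphans
    boolean isOrphan(String file) {
      if (reachableFromLiveSnapshots(file)) {
        return false; // referenced by live history, never an orphan
      }
      // hypothetical helpers: keep expired-snapshot metadata, drop its data
      return !(isMetadataFile(file) && reachableFromExpiredSnapshots(file));
    }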

> You mean purging expired snapshots in the middle of the history, right?
> I think the current mechanism for this is 'tagging' and 'branching'.

I think for most users the compaction commits are technical details which
they would like to avoid / don't want to see. The real table history is
only the changes initiated by the user, and it would be good to hide the
technical/compaction commits.



Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread himadri pal
Hi Szehon,

This is a good idea considering the use case it intends to solve. I added a
few questions and comments in the design doc.

IMO, the alternate options specified in the design doc look cleaner to me.

I think it might add to the maintenance burden, now that we need to remember
to remove these metadata-only snapshots.

Also, I wonder whether some of the use cases it intends to address are
solvable by metadata alone, e.g. how much data was added in a given time
range? Maybe to answer these kinds of questions a user would prefer to
create KPIs using columns in the dataset, as in the sketch below.
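
For example, something like this (a sketch; `ingest_ts` is a hypothetical
ingestion-timestamp column in the dataset, not snapshot metadata):

    import org.apache.spark.sql.SparkSession;

    // sketch: answering "how much data was added in a range" from the data
    SparkSession spark = SparkSession.builder().getOrCreate();
    spark.sql("SELECT count(*) FROM db.tbl "
        + "WHERE ingest_ts BETWEEN '2024-06-01' AND '2024-07-01'").show();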


Regards,
Himadri Pal



Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread Steven Wu
I am not totally convinced of the motivation yet.

I thought the snapshot retention window is primarily meant for time travel
and troubleshooting table changes that happened recently (like a few days
or weeks).

Is it valuable enough to keep expired snapshots for as long as months or
years? While metadata files are typically smaller than data files in total
size, it can still be significant considering the default amount of column
stats written today (especially for wide tables with many columns).

How long are we going to keep the expired snapshot references by default?
If it is months/years, it can have major implications on the query
performance of metadata tables (like snapshots, all_*).
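
For context, this is the kind of metadata-table query whose cost grows with
every retained snapshot entry (a sketch; the snapshots metadata table and
these columns exist today):

    import org.apache.spark.sql.SparkSession;

    // sketch: scans one row per retained snapshot entry
    SparkSession spark = SparkSession.builder().getOrCreate();
    spark.sql("SELECT committed_at, snapshot_id, operation "
        + "FROM db.tbl.snapshots").show();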

I assume it will also have some performance impact on table loading as a
lot more expired snapshots are still referenced.




Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread Szehon Ho
Thanks Peter and Yufei.

Yes, in terms of implementation, I noted in the doc we need to add error
checks to prevent time-travel / rollback / cherry-pick operations to
'expired' snapshots.  I'll make it clearer in the doc which operations we
need to check against.

I believe DeleteOrphanFiles may be ok as is, because currently the logic
walks down the reachable graph and marks those metadata files as
'not-orphan', so it should naturally walk these 'expired' snapshots as well.

So, I think the main changes in terms of implementation are going to be
adding error checks in those Table APIs, and updating the ExpireSnapshots API.

> Do we want to consider expiring snapshots in the middle of the history of
> the table?
You mean purging expired snapshots in the middle of the history, right?  I
think the current mechanism for this is 'tagging' and 'branching'.  So
interestingly, I was thinking it's related to your other question: if we
don't add an error-check for 'tagging' and 'branching' on 'expired'
snapshots, they could be handled just as other snapshots are handled today.
It's one option.  We could also support it subsequently, after the first
version and once there's some usage of this.

One thing that comes up in this thread and the google doc is the question of
the size of preserved metadata.  I had put in the Alternatives section that
we could potentially make the ExpireSnapshots purge boolean argument more
nuanced, like PURGE, PRESERVE_REFS (snapshot refs are preserved), and
PRESERVE_METADATA (snapshot refs and all metadata files are preserved), as
sketched below, though I am still debating if it's worth it, as users could
choose not to use this feature.
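
A sketch of that more nuanced argument (hypothetical, not in the API today):

    enum ExpiredSnapshotRetention {
      PURGE,             // current behavior: drop refs and metadata together
      PRESERVE_REFS,     // keep snapshot refs only
      PRESERVE_METADATA  // keep refs plus all metadata files
    }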

Thanks
Szehon




Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread Yufei Gu
Thank you for the interesting proposal. With a minor specification change,
it could indeed enable different retention periods for data files and
metadata files. This differentiation is useful for two reasons:

   1. More metadata helps us better understand the table history, providing
   valuable insights.
   2. Users often prioritize data file deletion as it frees up significant
   storage space and removes potentially sensitive data.

However, adding a boolean property to the specification isn't necessarily a
lightweight solution. As Peter mentioned, implementing this change requires
modifications in several places. In this context, external systems like
LakeChime or a REST catalog implementation could offer effective solutions
to manage extended metadata retention periods, without spec changes.

I am neutral on this proposal (+0) and look forward to seeing more input
from people.
Yufei



Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread Péter Váry
We need to handle expired snapshots in several places differently in
Iceberg core as well.
- We need to add checks to prevent scans from reading these snapshots, and
to throw a meaningful error (see the sketch after this list).
- We need to add checks to prevent tagging/branching these snapshots.
- We need to update DeleteOrphanFiles in Spark/Flink to not consider files
only referenced by the expired snapshots.
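
A sketch of the first check (Snapshot#isExpired() is the hypothetical new
flag; ValidationException is Iceberg's existing exception type):

    import org.apache.iceberg.Snapshot;
    import org.apache.iceberg.exceptions.ValidationException;

    // sketch of a scan-time guard against reading expired snapshots
    void checkNotExpired(Snapshot snapshot) {
      if (snapshot.isExpired()) { // hypothetical accessor on the new field
        throw new ValidationException(
            "Cannot read expired snapshot %s", snapshot.snapshotId());
      }
    }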

Some Flink jobs do frequent commits, and in these cases the size of the
metadata file becomes a constraining factor too. In that case, we could just
tell users not to use this feature and expire the metadata as we do now, but
I thought it was worth mentioning.

Do we want to consider expiring snapshots in the middle of the history of
the table?
When we compact the table, then the compaction commits litter the real
history of the table. Consider the following:
- S1 writes some data
- S2 writes some more data
- S3 compacts the previous 2 commits
- S4 writes even more data
From the query engine user's perspective, S3 is a commit which does nothing,
was not initiated by the user, and which they most probably don't even want
to know about. If one can expire a snapshot from the middle of the history,
that would be nice, so users would see only S1/S2/S4. The only downside is
that reading S2 is less performant than reading S3, but IMHO this could be
acceptable for having only user-driven changes in the table history.



Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread Szehon Ho
Thanks for the comments so far.  I also thought previously that this
functionality would be in an external system, like LakeChime, or a custom
catalog extension.  But after doing an initial analysis (please double
check), I thought it's a small enough change that it would be worth putting
in the Iceberg spec/APIs for all users:

   - Table Spec, only one optional boolean field (on Snapshot, only set if
   the functionality is used).
   - API, only one boolean parameter (on ExpireSnapshots).

> I do wonder, will keeping expired snapshots as is slow down manifest/scan
> planning though (REST catalog approaches could probably mitigate this)?

I think it should not slow down manifest/scan planning, because we plan
using the current snapshot (or the one we specify via time travel), and we
wouldn't read expired snapshots in this case.

Thanks
Szehon


Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread John Greene
I do agree with the need that this proposal solves, to decouple the
snapshot history from the data deletion. I do wonder, will keeping expired
snapshots as is slow down manifest/scan planning though (REST catalog
approaches could probably mitigate this)?

On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen 
wrote:

> Hi Szehon, Walaa
>
> Thanks Szehon for bringing this up. And thank you Walaa for providing more
> context from a similar existing solution to the problem.
> The choices that LakeChime seems to have made -- to keep information in a
> separate RDBMS, and which particular metadata information to retain -- do
> indeed look use-case specific, until we observe repeating patterns.
> The idea to bake lifecycle changes into the table format spec was proposed
> as an alternative to the idea to bake lifecycle changes into the REST
> catalog spec. It was brought into the discussion based on the intuition
> that the REST catalog is a first-class citizen in the Iceberg world, just
> like other catalogs, and so solutions to table-centric problems do not
> need to be limited to the REST catalog. What information we retain, and
> how/whether this is configurable, are open questions applicable to both
> avenues.
>
> As a third alternative, we could focus on REST catalog *extensions*,
> without naming snapshot metadata lifecycle, and leave the problem up to
> REST's implementors. That would mean the Iceberg project doesn't address
> the snapshot metadata lifecycle topic directly, but instead gives users
> tools to build solutions around it. At this point I am not trying to judge
> whether it's a good idea or not. It probably depends how important it is
> to solve the problem and have a common solution.
>
> Best,
> Piotr
>
>
>
>
> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa 
> wrote:
>
>> Hi Szehon,
>>
>> Thanks for sharing this proposal. We have thought along the same lines
>> and implemented an external system (LakeChime [1]) that retains snapshot +
>> partition metadata for longer (actual internal implementation keeps data
>> for 13 months, but that can be tuned). For efficient analysis, we have kept
>> this data in an RDBMS. My opinion is that this may be a better fit for an
>> external system (similar to LakeChime) since it could potentially
>> complicate the Iceberg spec, APIs, or their implementations. Also, the type
>> of metadata tracked can differ depending on the use case. For example,
>> while LakeChime retains partition and operation type metadata, it does not
>> track file-level metadata as there was no specific use case for that.
>>
>> [1]
>> https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>
>> Thanks,
>> Walaa.
>>
>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho 
>> wrote:
>>
>>> Hi folks,
>>>
>>> I would like to discuss an idea for an optional extension of Iceberg's
>>> Snapshot metadata lifecycle.  Thanks Piotr for replying on the other thread
>>> that this should be a fuller Iceberg format change.
>>>
>>> *Proposal Summary*
>>>
>>> Currently, ExpireSnapshots(long olderThan) purges metadata and deleted
>>> data of a Snapshot together.  Purging deleted data often requires a shorter
>>> timeline, due to strict requirements to claw back unused disk space,
>>> fulfill data lifecycle compliance, etc.  In many deployments, this means
>>> the 'olderThan' timestamp is set to just a few days before the current time
>>> (the default is 5 days).
>>>
>>> On the other hand, purging metadata could ideally be done on a more
>>> relaxed timeline, such as months or more, to allow for meaningful
>>> historical table analysis.
>>>
>>> We should have an optional way to purge Snapshot metadata separately
>>> from purging deleted data.  This would allow us to get history of the
>>> table, and answer questions like:
>>>
>>>- When was a file/partition added
>>>- When was a file/partition deleted
>>>- How much data was added or removed in time X
>>>
>>> that are currently only possible for data operations within a few days.
>>>
>>> *Github Proposal*:  https://github.com/apache/iceberg/issues/10646
>>> *Google Design Doc*:
>>> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>> 
>>>
>>> Curious if anyone has thought along these lines and/or sees obvious
>>> issues.  Would appreciate any feedback on the proposal.
>>>
>>> Thanks
>>> Szehon
>>>
>>


Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread Piotr Findeisen
Hi Szehon, Walaa,

Thanks Szehon for bringing this up. And thank you Walaa for providing more
context from a similar existing solution to the problem.
The choices that LakeChime seems to have made -- to keep information in a
separate RDBMS, and which particular metadata to retain -- indeed look
use-case specific, until we observe repeating patterns.
The idea to bake lifecycle changes into the table format spec was proposed
as an alternative to baking them into the REST catalog spec. It was brought
into the discussion based on the intuition that the REST catalog is a
first-class citizen in the Iceberg world, just like other catalogs, and so
solutions to table-centric problems do not need to be limited to the REST
catalog. What information we retain, and how/whether this is configurable,
are open questions applicable to both avenues.

As a third alternative, we could focus on REST catalog *extensions*,
without naming the snapshot metadata lifecycle explicitly, and leave the
problem up to REST implementors. That would mean the Iceberg project
doesn't address the snapshot metadata lifecycle topic directly, but instead
gives users tools to build solutions around it. At this point I am not
trying to judge whether it's a good idea or not. It probably depends on how
important it is to solve the problem and have a common solution.
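
To make this concrete, a purely hypothetical sketch of what such a
catalog-side extension could look like; the endpoint path, interface, and
fields below are invented for illustration and are not part of the REST
catalog spec:

    import java.util.List;
    import java.util.Map;

    // Hypothetical extension point, e.g. served at
    // GET /v1/{prefix}/namespaces/{ns}/tables/{table}/expired-snapshots
    public interface ExpiredSnapshotsApi {

      // Snapshot metadata the catalog chose to retain past expiration.
      record ExpiredSnapshot(
          long snapshotId,
          long timestampMillis,
          String operation,
          Map<String, String> summary) {}

      List<ExpiredSnapshot> listExpiredSnapshots(String namespace, String table);
    }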

Best,
Piotr




On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa 
wrote:

> Hi Szehon,
>
> Thanks for sharing this proposal. We have thought along the same lines and
> implemented an external system (LakeChime [1]) that retains snapshot +
> partition metadata for longer (actual internal implementation keeps data
> for 13 months, but that can be tuned). For efficient analysis, we have kept
> this data in an RDBMS. My opinion is that this may be a better fit for an
> external system (similar to LakeChime) since it could potentially
> complicate the Iceberg spec, APIs, or their implementations. Also, the type
> of metadata tracked can differ depending on the use case. For example,
> while LakeChime retains partition and operation type metadata, it does not
> track file-level metadata as there was no specific use case for that.
>
> [1]
> https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>
> Thanks,
> Walaa.
>
> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho  wrote:
>
>> Hi folks,
>>
>> I would like to discuss an idea for an optional extension of Iceberg's
>> Snapshot metadata lifecycle.  Thanks Piotr for replying on the other thread
>> that this should be a fuller Iceberg format change.
>>
>> *Proposal Summary*
>>
>> Currently, ExpireSnapshots(long olderThan) purges metadata and deleted
>> data of a Snapshot together.  Purging deleted data often requires a shorter
>> timeline, due to strict requirements to claw back unused disk space,
>> fulfill data lifecycle compliance, etc.  In many deployments, this means
>> the 'olderThan' timestamp is set to just a few days before the current time
>> (the default is 5 days).
>>
>> On the other hand, purging metadata could ideally be done on a more
>> relaxed timeline, such as months or more, to allow for meaningful
>> historical table analysis.
>>
>> We should have an optional way to purge Snapshot metadata separately from
>> purging deleted data.  This would allow us to get history of the table, and
>> answer questions like:
>>
>>- When was a file/partition added
>>- When was a file/partition deleted
>>- How much data was added or removed in time X
>>
>> that are currently only possible for data operations within a few days.
>>
>> *Github Proposal*:  https://github.com/apache/iceberg/issues/10646
>> *Google Design Doc*:
>> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>> 
>>
>> Curious if anyone has thought along these lines and/or sees obvious
>> issues.  Would appreciate any feedback on the proposal.
>>
>> Thanks
>> Szehon
>>
>


Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-06 Thread Walaa Eldin Moustafa
Hi Szehon,

Thanks for sharing this proposal. We have thought along the same lines and
implemented an external system (LakeChime [1]) that retains snapshot +
partition metadata for longer (actual internal implementation keeps data
for 13 months, but that can be tuned). For efficient analysis, we have kept
this data in an RDBMS. My opinion is that this may be a better fit for an
external system (similar to LakeChime) since it could potentially
complicate the Iceberg spec, APIs, or their implementations. Also, the type
of metadata tracked can differ depending on the use case. For example,
while LakeChime retains partition and operation type metadata, it does not
track file-level metadata as there was no specific use case for that.

[1]
https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes

Thanks,
Walaa.

On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho  wrote:

> Hi folks,
>
> I would like to discuss an idea for an optional extension of Iceberg's
> Snapshot metadata lifecycle.  Thanks Piotr for replying on the other thread
> that this should be a fuller Iceberg format change.
>
> *Proposal Summary*
>
> Currently, ExpireSnapshots(long olderThan) purges metadata and deleted
> data of a Snapshot together.  Purging deleted data often requires a shorter
> timeline, due to strict requirements to claw back unused disk space,
> fulfill data lifecycle compliance, etc.  In many deployments, this means
> the 'olderThan' timestamp is set to just a few days before the current time
> (the default is 5 days).
>
> On the other hand, purging metadata could ideally be done on a more
> relaxed timeline, such as months or more, to allow for meaningful
> historical table analysis.
>
> We should have an optional way to purge Snapshot metadata separately from
> purging deleted data.  This would allow us to get history of the table, and
> answer questions like:
>
>- When was a file/partition added
>- When was a file/partition deleted
>- How much data was added or removed in time X
>
> that are currently only possible for data operations within a few days.
>
> *Github Proposal*:  https://github.com/apache/iceberg/issues/10646
> *Google Design Doc*:
> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
> 
>
> Curious if anyone has thought along these lines and/or sees obvious
> issues.  Would appreciate any feedback on the proposal.
>
> Thanks
> Szehon
>