Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-21 Thread Walaa Eldin Moustafa
Hi Jan, I do not think this is feasible because it assumes the catalog
identifiers do not collide across catalogs. Anyway, let us not over-engineer
this use case. As I mentioned, it was for illustration purposes.

Since the discussion moved from “UUIDs vs Sequence numbers” to “UUIDs vs
catalog table identifiers”, I can help summarize the discussion on the new
options in a doc and we can iterate on it so we are on the same page on the
pros and cons. It might help others catch up on this discussion too.

Thanks,
Walaa.

On Tue, Aug 20, 2024 at 11:31 PM Jan Kaul 
wrote:

> Hi Walaa,
>
> Aside from the fact that this is not a desired setup, this is not an
> argument against using a refresh-state with identifiers as the key. In the
> case you're describing, you would just have two entries with different
> identifiers in the refresh-state referencing the same snapshot-id. You
> would still be able to determine the freshness of the precomputed data.
>
> Thanks,
>
> Jan
> On 20.08.24 18:53, Walaa Eldin Moustafa wrote:
>
> Theoretically, we could have multiple catalogs each with different table
> name entries but referring to the same Iceberg table metadata, and hence
> same UUIDs (view metadata cannot be shared since they are strongly bound to
> the catalog identifiers). I understand this is not an everyday scenario but
> it illustrates the semantic gap.
>
> Thanks,
> Walaa.
>
>
> On Mon, Aug 19, 2024 at 10:54 PM Benny Chow  wrote:
>
>> Hi Walaa, I personally don't see a semantic issue with putting the table
>> identifiers in the refresh state.  The purpose of the refresh state is to
>> basically take a snapshot of the table and view versions at the time of
>> materialization.  Directly using table identifiers seems pretty natural to
>> me.   So, I'm +1 for:
>>
>> no lineage
>> + refresh-state key = identifier
>>
>> Thanks
>> Benny
>>
>>
>>
>> On Mon, Aug 19, 2024 at 9:19 PM Walaa Eldin Moustafa wrote:
>>
>>> Hi Micah, it is mostly about the typical results of denormalization such
>>> as data consistency, management complexity, integrity, etc. However, as
>>> mentioned earlier, the main reason would be the semantic gap around using
>>> catalog table identifiers as a concept in the table (more specifically
>>> snapshot summary) spec. Denormalization in this case is a minor issue.
>>>
>>> Thanks,
>>> Walaa.

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-20 Thread Walaa Eldin Moustafa
Theoretically, we could have multiple catalogs each with different table
name entries but referring to the same Iceberg table metadata, and hence
same UUIDs (view metadata cannot be shared since they are strongly bound to
the catalog identifiers). I understand this is not an everyday scenario but
it illustrates the semantic gap.

Thanks,
Walaa.
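
A minimal sketch of the scenario described above, assuming two catalogs whose
entries resolve to the same table metadata. The entry names, the "uuid-1"
placeholder, the snapshot ID, and both helper functions are hypothetical and
only illustrate how the two candidate refresh-state keys would behave; none of
this is taken from the Iceberg spec or API.

```python
# Hypothetical data: two catalog entries that point at the same table
# metadata, and therefore share one table UUID and one current snapshot.
catalog_entries = {
    "catalog_a.db.events": {"table-uuid": "uuid-1", "snapshot-id": 42},
    "catalog_b.db.events_alias": {"table-uuid": "uuid-1", "snapshot-id": 42},
}


def refresh_state_by_identifier(entries):
    """Key the refresh state by catalog identifier."""
    return {ident: meta["snapshot-id"] for ident, meta in entries.items()}


def refresh_state_by_uuid(entries):
    """Key the refresh state by table UUID."""
    return {meta["table-uuid"]: meta["snapshot-id"] for meta in entries.values()}


by_identifier = refresh_state_by_identifier(catalog_entries)
by_uuid = refresh_state_by_uuid(catalog_entries)

# Identifier keys record the same physical table twice; UUID keys record it once.
print(len(by_identifier), len(by_uuid))
```

With identifier keys the same table appears as two entries that reference the
same snapshot; with UUID keys it appears once. Either form still allows a
freshness check, so the sketch only shows where the two models differ.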


On Mon, Aug 19, 2024 at 10:54 PM Benny Chow  wrote:

> Hi Walaa, I personally don't see a semantic issue with putting the table
> identifiers in the refresh state.  The purpose of the refresh state is to
> basically take a snapshot of the table and view versions at the time of
> materialization.  Directly using table identifiers seems pretty natural to
> me.   So, I'm +1 for:
>
> no lineage
> + refresh-state key = identifier
>
> Thanks
> Benny
>
>
>
> On Mon, Aug 19, 2024 at 9:19 PM Walaa Eldin Moustafa wrote:
>
>> Hi Micah, it is mostly about the typical results of denormalization such
>> as data consistency, management complexity, integrity, etc. However, as
>> mentioned earlier, the main reason would be the semantic gap around using
>> catalog table identifiers as a concept in the table (more specifically
>> snapshot summary) spec. Denormalization in this case is a minor issue.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Mon, Aug 19, 2024 at 4:44 PM Micah Kornfield 
>> wrote:
>>
>>>> Thanks Micah, for the latter, I meant the type of denormalization of
>>>> repeating a 3-part name as opposed to using an ID.
>>>
>>>
>>> Is the concern here just metadata size or something else?  For size I
>>> think if this is really anticipated to be a problem that it is likely for
>>> the state map in general, and we could investigate some more sophisticated
>>> encodings for the State map even without the overlap. I think maybe this
>>> could be handled if it proves to be a problem but hopefully engines are
>>> placing a reasonable cap on view depth + number of tables per view which
>>> puts an upper bound on overall size.
>>>
>>> Thanks,
>>> Micah

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-19 Thread Benny Chow
Hi Walaa, I personally don't see a semantic issue with putting the table
identifiers in the refresh state.  The purpose of the refresh state is to
basically take a snapshot of the table and view versions at the time of
materialization.  Directly using table identifiers seems pretty natural to
me.   So, I'm +1 for:

no lineage
+ refresh-state key = identifier

Thanks
Benny
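
A minimal sketch of the option above, assuming a refresh state that maps each
source identifier to the snapshot ID (or view version) recorded at
materialization time. The `Identifier` class, the `is_fresh` helper, and the
sample values are hypothetical illustrations, not part of the Iceberg spec or
any API.

```python
from dataclasses import dataclass


# Hypothetical shape for a 3-part catalog identifier (illustration only).
@dataclass(frozen=True)
class Identifier:
    catalog: str
    namespace: str
    name: str


def is_fresh(refresh_state, current_versions):
    """Return True if every source is still at the version recorded at refresh.

    refresh_state:    identifier -> snapshot ID / view version at materialization
    current_versions: identifier -> snapshot ID / view version right now
    """
    return all(
        current_versions.get(ident) == recorded
        for ident, recorded in refresh_state.items()
    )


orders = Identifier("prod", "sales", "orders")
customers = Identifier("prod", "sales", "customers")

# State captured when the materialized view was last refreshed.
state = {orders: 101, customers: 57}

print(is_fresh(state, {orders: 101, customers: 57}))  # nothing changed
print(is_fresh(state, {orders: 102, customers: 57}))  # orders has a new snapshot
```

The first call reports the precomputed data as fresh; the second reports it as
stale because one source moved to a newer snapshot.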



On Mon, Aug 19, 2024 at 9:19 PM Walaa Eldin Moustafa 
wrote:

> Hi Micah, it is mostly about the typical results of denormalization such
> as data consistency, management complexity, integrity, etc. However, as
> mentioned earlier, the main reason would be the semantic gap around using
> catalog table identifiers as a concept in the table (more specifically
> snapshot summary) spec. Denormalization in this case is a minor issue.
>
> Thanks,
> Walaa.
>
>
> On Mon, Aug 19, 2024 at 4:44 PM Micah Kornfield 
> wrote:
>
>>> Thanks Micah, for the latter, I meant the type of denormalization of
>>> repeating a 3-part name as opposed to using an ID.
>>
>>
>> Is the concern here just metadata size or something else?  For size I
>> think if this is really anticipated to be a problem that it is likely for
>> the state map in general, and we could investigate some more sophisticated
>> encodings for the State map even without the overlap. I think maybe this
>> could be handled if it proves to be a problem but hopefully engines are
>> placing a reasonable cap on view depth + number of tables per view which
>> puts an upper bound on overall size.
>>
>> Thanks,
>> Micah

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-19 Thread Walaa Eldin Moustafa
Hi Micah, it is mostly about the typical results of denormalization such as
data consistency, management complexity, integrity, etc. However, as
mentioned earlier, the main reason would be the semantic gap around using
catalog table identifiers as a concept in the table (more specifically
snapshot summary) spec. Denormalization in this case is a minor issue.

Thanks,
Walaa.


On Mon, Aug 19, 2024 at 4:44 PM Micah Kornfield 
wrote:

>> Thanks Micah, for the latter, I meant the type of denormalization of
>> repeating a 3-part name as opposed to using an ID.
>
>
> Is the concern here just metadata size or something else?  For size I
> think if this is really anticipated to be a problem that it is likely for
> the state map in general, and we could investigate some more sophisticated
> encodings for the State map even without the overlap. I think maybe this
> could be handled if it proves to be a problem but hopefully engines are
> placing a reasonable cap on view depth + number of tables per view which
> puts an upper bound on overall size.
>
> Thanks,
> Micah

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-19 Thread Micah Kornfield
>
> Thanks Micah, for the latter, I meant the type of denormalization of
> repeating a 3-part name as opposed to using an ID.


Is the concern here just metadata size or something else?  For size I think
if this is really anticipated to be a problem that it is likely for the
state map in general, and we could investigate some more sophisticated
encodings for the State map even without the overlap. I think maybe this
could be handled if it proves to be a problem but hopefully engines are
placing a reasonable cap on view depth + number of tables per view which
puts an upper bound on overall size.

Thanks,
Micah

On Fri, Aug 16, 2024 at 4:56 PM Walaa Eldin Moustafa 
wrote:

> Thanks Micah, for the latter, I meant the type of denormalization of
> repeating a 3-part name as opposed to using an ID.
>
> On Fri, Aug 16, 2024 at 4:52 PM Micah Kornfield 
> wrote:
>
>>> However, this still does not address the semantic issue which is more
>>> fundamental in my opinion. The Iceberg table spec is not aware of catalog
>>> table identifiers and this use will be the first break of this abstraction.
>>
>>
>> IIUC, based on Jan's comments, we are not going to modify the table
>> specification.  I thought the state map was effectively opaque metadata
>> from the table specification perspective?  If this is the case I feel like
>> that is OK and not a blocker, I think by their nature as already discussed
>> MVs need catalog information to function properly and the choice to put
>> catalog information into the table metadata is pragmatic and preserves
>> other desirable properties.  It might be a more important point if we want
>> to update the table specification (IMO, I still think it would probably be
>> OK).
>>
>>> On a side note, it does not address the denormalization issue either if
>>> we ever want to introduce the lineage in the view as a nice-to-have.
>>
>>
>> I think if lineage is introduced to the View metadata, it should only
>> hold direct dependencies for the reasons already discussed. IMO, I think
>> the potential overlap is OK as they serve two different purposes.
>>
>> Cheers,
>> Micah
>>
>>
>>
>>
>>
>> On Fri, Aug 16, 2024 at 4:27 PM Walaa Eldin Moustafa wrote:
>>
>>> That is right. I agree that in the case of using catalog identifiers in
>>> state information, using them in lineage information would be a
>>> nice-to-have and not a requirement.
>>>
>>> However, this still does not address the semantic issue which is more
>>> fundamental in my opinion. The Iceberg table spec is not aware of catalog
>>> table identifiers and this use will be the first break of this abstraction.
>>>
>>> On a side note, it does not address the denormalization issue either if
>>> we ever want to introduce the lineage in the view as a nice-to-have.
>>>
>>> Thanks,
>>> Walaa.

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
Thanks Micah, for the latter, I meant the type of denormalization of
repeating a 3-part name as opposed to using an ID.

On Fri, Aug 16, 2024 at 4:52 PM Micah Kornfield 
wrote:

>> However, this still does not address the semantic issue which is more
>> fundamental in my opinion. The Iceberg table spec is not aware of catalog
>> table identifiers and this use will be the first break of this abstraction.
>
>
> IIUC, based on Jan's comments, we are not going to modify the table
> specification.  I thought the state map was effectively opaque metadata
> from the table specification perspective?  If this is the case I feel like
> that is OK and not a blocker, I think by their nature as already discussed
> MVs need catalog information to function properly and the choice to put
> catalog information into the table metadata is pragmatic and preserves
> other desirable properties.  It might be a more important point if we want
> to update the table specification (IMO, I still think it would probably be
> OK).
>
>> On a side note, it does not address the denormalization issue either if we
>> ever want to introduce the lineage in the view as a nice-to-have.
>
>
> I think if lineage is introduced to the View metadata, it should only hold
> direct dependencies for the reasons already discussed. IMO, I think the
> potential overlap is OK as they serve two different purposes.
>
> Cheers,
> Micah
>
>
>
>
>
> On Fri, Aug 16, 2024 at 4:27 PM Walaa Eldin Moustafa wrote:
>
>> That is right. I agree that in the case of using catalog identifiers in
>> state information, using them in lineage information would be a
>> nice-to-have and not a requirement.
>>
>> However, this still does not address the semantic issue which is more
>> fundamental in my opinion. The Iceberg table spec is not aware of catalog
>> table identifiers and this use will be the first break of this abstraction.
>>
>> On a side note, it does not address the denormalization issue either if
>> we ever want to introduce the lineage in the view as a nice-to-have.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Fri, Aug 16, 2024 at 10:09 AM Jan Kaul 
>> wrote:
>>
>>> Hi Walaa,
>>>
>>> I would argue that for the refresh operation the query engine has to
>>> parse the query and then somehow execute it. For a full refresh it will
>>> directly execute the query and for an incremental refresh it will execute a
>>> modified version. Therefore it has to fully expand the query tree.
>>>
>>> Best wishes,
>>>
>>> Jan
>>>
>>> On 16.08.2024 at 18:13, Walaa Eldin Moustafa wrote:
>>>
>>> Thanks Jan for the summary.
>>>
>>> For this point:
>>>
>>> > For a refresh operation the query engine has to parse the SQL and
>>> > fully expand the lineage with its children anyway.  So the lineage is not
>>> > strictly required.
>>>
>>> If the lineage is provided at creation time by the respective engine,
>>> the refresh operation does not need to parse the SQL, correct?
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul 
>>> wrote:
>>>
>>> As the table I created is not properly shown in the mailing list I'll
>>> reformat the summary of the different drawbacks again:
>>>
>>> Drawbacks of (no lineage, refresh-state key = identifier):
>>>
>>> - introduces catalog identifiers into table metadata (#4)
>>> - query engine has to expand lineage at refresh time (happens anyway)
>>>
>>> Drawbacks of (normalized lineage, refresh-state key = uuid):
>>>
>>> - recursive calls to catalog to expand lineage at read time (#2)
>>> - fragile by requiring child views to have lineage field
>>>
>>> Drawbacks of (denormalized lineage, refresh-state key = uuid):
>>>
>>> - update of materialized view version required if child view is updated
>>> (#5)
>>> On 16.08.24 09:17, Jan Kaul wrote:
>>>
>>> Hi,
>>>
>>> Thanks Micah for clearly stating the requirements. I think this gives
>>> better clarity for the discussion.
>>>
>>> It seems like we don't have a solution that satisfies all requirements
>>> at once. So we would need to choose which has the fewest drawbacks.
>>>
>>> I would like to summarize the different drawbacks that came up in the
>>> discussion.
>>> no lineage + refresh-state key = identifier:
>>> - introduces catalog identifiers into table metadata (#4)
>>> - query engine has to expand lineage at refresh time (happens anyway)
>>>
>>> normalized lineage + refresh-state key = uuid:
>>> - recursive calls to catalog to expand lineage at read time (#2)
>>> - fragile by requiring child views to have lineage field
>>>
>>> denormalized lineage + refresh-state key = uuid:
>>> - update of materialized view version required if child view is updated
>>>   (#5)
>>>
>>> With identifiers as the refresh-state keys, the lineage is not strictly
>>> required and becomes an orthogonal proposal. That's why I left it out of
>>> the comparison.
>>>
>>> In my opinion introducing catalog identifiers into the table metadata
>>> (requirement #4) is the least significant drawback as it is not a 

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Micah Kornfield
>
> However, this still does not address the semantic issue which is more
> fundamental in my opinion. The Iceberg table spec is not aware of catalog
> table identifiers and this use will be the first break of this abstraction.


IIUC, based on Jan's comments, we are not going to modify the table
specification.  I thought the state map was effectively opaque metadata
from the table specification perspective?  If this is the case I feel like
that is OK and not a blocker, I think by their nature as already discussed
MVs need catalog information to function properly and the choice to put
catalog information into the table metadata is pragmatic and preserves
other desirable properties.  It might be a more important point if we want
to update the table specification (IMO, I still think it would probably be
OK).

> On a side note, it does not address the denormalization issue either if we
> ever want to introduce the lineage in the view as a nice-to-have.


I think if lineage is introduced to the View metadata, it should only hold
direct dependencies for the reasons already discussed. IMO, I think the
potential overlap is OK as they serve two different purposes.

Cheers,
Micah
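
A minimal sketch of what "only direct dependencies" could look like, assuming
each view stores just the names it references directly and the full set of
base tables is derived by walking those entries. The `direct_deps` map, the
view and table names, and the `expand` helper are hypothetical; in a real
engine each lookup would likely be a catalog call rather than a dictionary
access.

```python
# Hypothetical lineage: each view lists only its direct dependencies.
direct_deps = {
    "mv_sales_summary": ["v_orders_enriched", "customers"],
    "v_orders_enriched": ["orders", "products"],
}


def expand(name, deps, seen=None):
    """Return every base table reachable from `name` via direct dependencies."""
    seen = set() if seen is None else seen
    if name in seen:
        # Already visited on another path; nothing new to add.
        return set()
    seen.add(name)

    children = deps.get(name)
    if children is None:
        # No lineage entry: treat the name as a base table.
        return {name}

    tables = set()
    for child in children:
        tables |= expand(child, deps, seen)
    return tables


base_tables = sorted(expand("mv_sales_summary", direct_deps))
print(base_tables)
```

Storing only direct dependencies keeps each view's lineage independent of
changes in its child views, at the cost of a recursive walk whenever the full
set of base tables is needed.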





On Fri, Aug 16, 2024 at 4:27 PM Walaa Eldin Moustafa 
wrote:

> That is right. I agree that in the case of using catalog identifiers in
> state information, using them in lineage information would be a
> nice-to-have and not a requirement.
>
> However, this still does not address the semantic issue which is more
> fundamental in my opinion. The Iceberg table spec is not aware of catalog
> table identifiers and this use will be the first break of this abstraction.
>
> On a side note, it does not address the denormalization issue either if we
> ever want to introduce the lineage in the view as a nice-to-have.
>
> Thanks,
> Walaa.
>
>
> On Fri, Aug 16, 2024 at 10:09 AM Jan Kaul 
> wrote:
>
>> Hi Walaa,
>>
>> I would argue that for the refresh operation the query engine has to
>> parse the query and then somehow execute it. For a full refresh it will
>> directly execute the query and for an incremental refresh it will execute a
>> modified version. Therefore it has to fully expand the query tree.
>>
>> Best wishes,
>>
>> Jan
>>
>> On 16.08.2024 at 18:13, Walaa Eldin Moustafa wrote:
>>
>> Thanks Jan for the summary.
>>
>> For this point:
>>
>> > For a refresh operation the query engine has to parse the SQL and fully
>> > expand the lineage with its children anyway.  So the lineage is not
>> > strictly required.
>>
>> If the lineage is provided at creation time by the respective engine, the
>> refresh operation does not need to parse the SQL, correct?
>>
>> Thanks,
>> Walaa.
>>
>> On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul 
>> wrote:
>>
>> As the table I created is not properly shown in the mailing list I'll
>> reformat the summary of the different drawbacks again:
>>
>> Drawbacks of (no lineage, refresh-state key = identifier):
>>
>> - introduces catalog identifiers into table metadata (#4)
>> - query engine has to expand lineage at refresh time (happens anyway)
>>
>> Drawbacks of (normalized lineage, refresh-state key = uuid):
>>
>> - recursive calls to catalog to expand lineage at read time (#2)
>> - fragile by requiring child views to have lineage field
>>
>> Drawbacks of (denormalized lineage, refresh-state key = uuid):
>>
>> - update of materialized view version required if child view is updated
>> (#5)
>> On 16.08.24 09:17, Jan Kaul wrote:
>>
>> Hi,
>>
>> Thanks Micah for clearly stating the requirements. I think this gives
>> better clarity for the discussion.
>>
>> It seems like we don't have a solution that satisfies all requirements at
>> once. So we would need to choose which has the fewest drawbacks.
>>
>> I would like to summarize the different drawbacks that came up in the
>> discussion.
>> no lineage + refresh-state key = identifier:
>> - introduces catalog identifiers into table metadata (#4)
>> - query engine has to expand lineage at refresh time (happens anyway)
>>
>> normalized lineage + refresh-state key = uuid:
>> - recursive calls to catalog to expand lineage at read time (#2)
>> - fragile by requiring child views to have lineage field
>>
>> denormalized lineage + refresh-state key = uuid:
>> - update of materialized view version required if child view is updated
>> (#5)
>>
>> With identifiers as the refresh-state keys, the lineage is not strictly
>> required and becomes an orthogonal proposal. That's why I left it out of
>> the comparison.
>>
>> In my opinion introducing catalog identifiers into the table metadata
>> (requirement #4) is the least significant drawback as it is not a technical
>> reason but more about semantics. Especially as the identifiers are not
>> introduced into the table spec but are rather stored in the snapshot
>> summary. That's why I'm in favor of using the catalog identifiers as the
>> refresh-state keys.
>>
>> Regarding your last point Walaa:
>>
>> The option of using catalog identifiers in the st

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
That is right. I agree that in the case of using catalog identifiers in
state information, using them in lineage information would be a
nice-to-have and not a requirement.

However, this still does not address the semantic issue which is more
fundamental in my opinion. The Iceberg table spec is not aware of catalog
table identifiers and this use will be the first break of this abstraction.

On a side note, it does not address the denormalization issue either if we
ever want to introduce the lineage in the view as a nice-to-have.

Thanks,
Walaa.


On Fri, Aug 16, 2024 at 10:09 AM Jan Kaul 
wrote:

> Hi Walaa,
>
> I would argue that for the refresh operation the query engine has to parse
> the query and then somehow execute it. For a full refresh it will directly
> execute the query and for an incremental refresh it will execute a modified
> version. Therefore it has to fully expand the query tree.
>
> Best wishes,
>
> Jan
>
> On 16.08.2024 at 18:13, Walaa Eldin Moustafa wrote:
>
> Thanks Jan for the summary.
>
> For this point:
>
> > For a refresh operation the query engine has to parse the SQL and fully
> expand the lineage with its children anyway.  So the lineage is not
> strictly required.
>
> If the lineage is provided at creation time by the respective engine, the
> refresh operation does not need to parse the SQL, correct?
>
> Thanks,
> Walaa.
>
> On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul 
> wrote:
>
> As the table I created is not properly shown in the mailing list I'll
> reformat the summary of the different drawbacks again:
>
> Drawbacks of (no lineage, refresh-state key = identifier):
>
> - introduces catalog identifiers into table metadata (#4)
> - query engine has to expand lineage at refresh time (happens anyway)
>
> Drawbacks of (normalized lineage, refresh-state key = uuid):
>
> - recursive calls to catalog to expand lineage at read time (#2)
> - fragile by requiring child views to have lineage field
>
> Drawbacks of (denormalized lineage, refresh-state key = uuid):
>
> - update of materialized view version required if child view is updated
> (#5)
> On 16.08.24 09:17, Jan Kaul wrote:
>
> Hi,
>
> Thanks Micah for clearly stating the requirements. I think this gives
> better clarity for the discussion.
>
> It seems like we don't have a solution that satisfies all requirements at
> once. So we would need to choose which has the fewest drawbacks.
>
> I would like to summarize the different drawbacks that came up in the
> discussion.
> no lineage + refresh-state key = identifier:
> - introduces catalog identifiers into table metadata (#4)
> - query engine has to expand lineage at refresh time (happens anyway)
>
> normalized lineage + refresh-state key = uuid:
> - recursive calls to catalog to expand lineage at read time (#2)
> - fragile by requiring child views to have lineage field
>
> denormalized lineage + refresh-state key = uuid:
> - update of materialized view version required if child view is updated
> (#5)
>
> With identifiers as the refresh-state keys, the lineage is not strictly
> required and becomes an orthogonal proposal. That's why I left it out of
> the comparison.
>
> In my opinion introducing catalog identifiers into the table metadata
> (requirement #4) is the least significant drawback as it is not a technical
> reason but more about semantics. Especially as the identifiers are not
> introduced into the table spec but are rather stored in the snapshot
> summary. That's why I'm in favor of using the catalog identifiers as the
> refresh-state keys.
>
> Regarding your last point Walaa:
>
> The option of using catalog identifiers in the state map still requires
> keeping lineage information in the view because REFRESH MV needs the latest
> fully expanded children (which could have changed from the set of children
> currently in the state map), without reparsing the view tree.
>
> For a refresh operation the query engine has to parse the SQL and fully
> expand the lineage with its children anyway.  So the lineage is not
> strictly required.
>
> If I understand correctly, most of you are also in favor of using catalog
> identifiers + ref as the refresh-state keys and postponing the lineage
> proposal.
>
> I hope that we can move the discussion forward.
>
> Jan
> On 16.08.24 08:07, Walaa Eldin Moustafa wrote:
>
> The option of using catalog identifiers in the state map still requires
> keeping lineage information in the view because REFRESH MV needs the latest
> fully expanded children (which could have changed from the set of children
> currently in the state map), without reparsing the view tree. Therefore,
> catalog identifiers in the state map do not eliminate the need for
> tracking children in the form of catalog identifiers in the lineage side
> (but in this case lineage will be a set instead of just a map).
>
> Hence, my concerns with using catalog identifiers (as opposed to UUIDs)
> are:
> * The fundamental issue where the table spec depends on/refers to the view
> spec (be

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul
Hi Walaa,

I would argue that for the refresh operation the query engine has to parse
the query and then somehow execute it. For a full refresh it will directly
execute the query and for an incremental refresh it will execute a modified
version. Therefore it has to fully expand the query tree.

Best wishes,

Jan

On 16.08.2024 at 18:13, Walaa Eldin Moustafa wrote:

Thanks Jan for the summary.

For this point:

> For a refresh operation the query engine has to parse the SQL and fully
> expand the lineage with its children anyway.  So the lineage is not
> strictly required.

If the lineage is provided at creation time by the respective engine, the
refresh operation does not need to parse the SQL, correct?

Thanks,
Walaa.

On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul  wrote:

As the table I created is not properly shown in the mailing list I'll
reformat the summary of the different drawbacks again:

Drawbacks of (no lineage, refresh-state key = identifier):

- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

Drawbacks of (normalized lineage, refresh-state key = uuid):

- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

Drawbacks of (denormalized lineage, refresh-state key = uuid):

- update of materialized view version required if child view is updated (#5)

On 16.08.24 09:17, Jan Kaul wrote:


  
Hi,

Thanks Micah for clearly stating the requirements. I think this gives
better clarity for the discussion.

It seems like we don't have a solution that satisfies all requirements at
once. So we would need to choose which has the fewest drawbacks.

I would like to summarize the different drawbacks that came up in the
discussion.

no lineage + refresh-state key = identifier:
- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

normalized lineage + refresh-state key = uuid:
- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

denormalized lineage + refresh-state key = uuid:
- update of materialized view version required if child view is updated (#5)

With identifiers as the refresh-state keys, the lineage is not strictly
required and becomes an orthogonal proposal. That's why I left it out of
the comparison.

  In my opinion introducing catalog identifiers into the table
metadata (requirement #4) is the least significant drawback as
it is not a technical reason but more about semantics.
Especially as the identifiers are not introduced into the table
spec but are rather stored in the snapshot summary. That's why
I'm in favor of using the catalog identifiers as the
refresh-state keys.
  
  Regarding your last point Walaa:
  
  The option of using catalog identifiers in
the state map still requires keeping lineage information in the
view because REFRESH MV needs the latest fully expanded children
(which could have changed from the set of children currently in
the state map), without reparsing the view tree.
  
  For a refresh operation the query engine has to parse the SQL
and fully expand the lineage with its children anyway.  So the
lineage is not strictly required.
  If I understand correctly, most of you are also in favor of
using catalog identifiers + ref as the refresh-state keys and
postponing the lineage proposal.
  I hope that we can move the discussion forward.
  Jan
  
  On 16.08.24 08:07, Walaa Eldin
Moustafa wrote:
  
  

The option of using catalog identifiers in the
  state map still requires keeping lineage information in the
  view because REFRESH MV needs the latest fully expanded
  children (which could have changed from the set of children
  currently in the state map), without reparsing the view tree.
  Therefore, catalog identifiers in the state map do not
  eliminate the need for tracking children in the form of
  catalog identifiers in the lineage side (but in this case
  lineage will be a set instead of just a map).
  
  
  Hence, my concerns with using catalog identifiers (as
opposed to UUIDs) are:
* The fundamental issue where the table spec depends
  on/refers to the view spec

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Walaa Eldin Moustafa
Thanks Jan for the summary.

For this point:

> For a refresh operation the query engine has to parse the SQL and fully
expand the lineage with its children anyway.  So the lineage is not
strictly required.

If the lineage is provided at creation time by the respective engine, the
refresh operation does not need to parse the SQL, correct?

Thanks,
Walaa.

On Fri, Aug 16, 2024 at 12:24 AM Jan Kaul 
wrote:

> As the table I created is not properly shown in the mailing list I'll
> reformat the summary of the different drawbacks again:
>
> Drawbacks of (no lineage, refresh-state key = identifier):
>
> - introduces catalog identifiers into table metadata (#4)
> - query engine has to expand lineage at refresh time (happens anyway)
>
> Drawbacks of (normalized lineage, refresh-state key = uuid):
>
> - recursive calls to catalog to expand lineage at read time (#2)
> - fragile by requiring child views to have lineage field
>
> Drawbacks of (denormalized lineage, refresh-state key = uuid):
>
> - update of materialized view version required if child view is updated
> (#5)
> On 16.08.24 09:17, Jan Kaul wrote:
>
> Hi,
>
> Thanks Micah for clearly stating the requirements. I think this gives
> better clarity for the discussion.
>
> It seems like we don't have a solution that satisfies all requirements at
> once. So we would need to choose which has the fewest drawbacks.
>
> I would like to summarize the different drawbacks that came up in the
> discussion.
> no lineage + refresh-state key = identifier:
> - introduces catalog identifiers into table metadata (#4)
> - query engine has to expand lineage at refresh time (happens anyway)
>
> normalized lineage + refresh-state key = uuid:
> - recursive calls to catalog to expand lineage at read time (#2)
> - fragile by requiring child views to have lineage field
>
> denormalized lineage + refresh-state key = uuid:
> - update of materialized view version required if child view is updated
> (#5)
>
> With identifiers as the refresh-state keys, the lineage is not strictly
> required and becomes an orthogonal proposal. That's why I left it out of
> the comparison.
>
> In my opinion introducing catalog identifiers into the table metadata
> (requirement #4) is the least significant drawback as it is not a technical
> reason but more about semantics. Especially as the identifiers are not
> introduced into the table spec but are rather stored in the snapshot
> summary. That's why I'm in favor of using the catalog identifiers as the
> refresh-state keys.
>
> Regarding your last point Walaa:
>
> The option of using catalog identifiers in the state map still requires
> keeping lineage information in the view because REFRESH MV needs the latest
> fully expanded children (which could have changed from the set of children
> currently in the state map), without reparsing the view tree.
>
> For a refresh operation the query engine has to parse the SQL and fully
> expand the lineage with its children anyway.  So the lineage is not
> strictly required.
>
> If I understand correctly, most of you are also in favor of using catalog
> identifiers + ref as the refresh-state keys and postponing the lineage
> proposal.
>
> I hope that we can move the discussion forward.
>
> Jan
> On 16.08.24 08:07, Walaa Eldin Moustafa wrote:
>
> The option of using catalog identifiers in the state map still requires
> keeping lineage information in the view because REFRESH MV needs the latest
> fully expanded children (which could have changed from the set of children
> currently in the state map), without reparsing the view tree. Therefore,
> catalog identifiers in the state map do not eliminate the need for
> tracking children in the form of catalog identifiers in the lineage side
> (but in this case lineage will be a set instead of just a map).
>
> Hence, my concerns with using catalog identifiers (as opposed to UUIDs)
> are:
> * The fundamental issue where the table spec depends on/refers to the view
> spec (because such catalog identifiers are not defined in the table spec
> and the only place they have a meaning is in the view spec lineage
> information).
> * (less fundamental) The denormalization introduced by this arrangement,
> where each identifier is 3-parts and all of them repeat in both lineage
> info and state map.
>
> I am not very concerned with recursive expansion (through multiple calls),
> as it is always the case with views.
>
> On a positive note, looks like we agree to move past sequence numbers :)
>
> Thanks,
> Walaa.
>
>
>
>
> On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield 
> wrote:
>
>>> I think given the constraint that catalog lookup has to be by identifier
>>> and not UUID, I'd prefer using identifier in the refresh state.  If we use
>>> identifiers, we can directly parallelize the catalog calls to fetch the
>>> latest state.  If we use UUID, the engine has to go back to the MV and
>>> possibly additional views to reconstruct the lineage map.  It's just a lot
>>> slower and more work fo

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul
As the table I created is not properly shown in the mailing list I'll 
reformat the summary of the different drawbacks again:


Drawbacks of (no lineage, refresh-state key = identifier):

- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

Drawbacks of (normalized lineage, refresh-state key = uuid):

- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

Drawbacks of (denormalized lineage, refresh-state key = uuid):

- update of materialized view version required if child view is updated (#5)
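
To make the options above concrete, here is a minimal Python sketch of the metadata shapes they imply. Every identifier, UUID, snapshot ID, and variable name is hypothetical and chosen only for illustration; nothing here is taken from the Iceberg spec. The sketch also does not distinguish normalized from denormalized lineage: as I read the thread, those differ only in whether the lineage map lists direct children or the fully expanded set.

```python
# Hypothetical metadata shapes; names and values are illustrative only.

# No lineage, refresh-state keyed by catalog identifier.
state_by_identifier = {
    "prod.db.orders": 101,      # identifier -> snapshot ID at refresh time
    "prod.db.customers": 202,
}

# Lineage plus refresh-state keyed by UUID. The lineage map is what lets
# the engine go from a catalog identifier to the UUID used as the state key.
lineage = {
    "prod.db.orders": "uuid-orders",
    "prod.db.customers": "uuid-customers",
}
state_by_uuid = {
    "uuid-orders": 101,
    "uuid-customers": 202,
}


def snapshot_for(identifier: str) -> int:
    """Resolve an identifier to its recorded snapshot ID via the lineage map."""
    return state_by_uuid[lineage[identifier]]
```

With identifier keys the recorded snapshot is one lookup away; with UUID keys the same answer needs the extra hop through the lineage map.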

On 16.08.24 09:17, Jan Kaul wrote:


Hi,

Thanks Micah for clearly stating the requirements. I think this gives 
better clarity for the discussion.


It seems like we don't have a solution that satisfies all requirements 
at once. So we would need to choose which has the fewest drawbacks.


I would like to summarize the different drawbacks that came up in the 
discussion.


no lineage + refresh-state key = identifier:
- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

normalized lineage + refresh-state key = uuid:
- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

denormalized lineage + refresh-state key = uuid:
- update of materialized view version required if child view is updated (#5)


With identifiers as the refresh-state keys, the lineage is not 
strictly required and becomes an orthogonal proposal. That's why I 
left it out of the comparison.


In my opinion introducing catalog identifiers into the table metadata 
(requirement #4) is the least significant drawback as it is not a 
technical reason but more about semantics. Especially as the 
identifiers are not introduced into the table spec but are rather 
stored in the snapshot summary. That's why I'm in favor of using the 
catalog identifiers as the refresh-state keys.


Regarding your last point Walaa:

The option of using catalog identifiers in the state map still 
requires keeping lineage information in the view because REFRESH MV 
needs the latest fully expanded children (which could have changed 
from the set of children currently in the state map), 
without reparsing the view tree.


For a refresh operation the query engine has to parse the SQL and 
fully expand the lineage with its children anyway.  So the lineage is
not strictly required.


If I understand correctly, most of you are also in favor of using 
catalog identifiers + ref as the refresh-state keys and postponing the 
lineage proposal.


I hope that we can move the discussion forward.

Jan

On 16.08.24 08:07, Walaa Eldin Moustafa wrote:
The option of using catalog identifiers in the state map still 
requires keeping lineage information in the view because REFRESH MV 
needs the latest fully expanded children (which could have changed 
from the set of children currently in the state map), 
without reparsing the view tree. Therefore, catalog identifiers in 
the state map do not eliminate the need for tracking children in
the form of catalog identifiers in the lineage side (but in this case 
lineage will be a set instead of just a map).


Hence, my concerns with using catalog identifiers (as opposed to 
UUIDs) are:
* The fundamental issue where the table spec depends on/refers to the 
view spec (because such catalog identifiers are not defined in the 
table spec and the only place they have a meaning is in the view spec 
lineage information).
* (less fundamental) The denormalization introduced by this 
arrangement, where each identifier is 3-parts and all of them repeat 
in both lineage info and state map.


I am not very concerned with recursive expansion (through multiple 
calls), as it is always the case with views.


On a positive note, looks like we agree to move past sequence numbers :)

Thanks,
Walaa.




On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield 
 wrote:


I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer using identifier in the
refresh state.  If we use identifiers, we can directly
parallelize the catalog calls to fetch the latest state.  If
we use UUID, the engine has to go back to the MV and possibly
additional views to reconstruct the lineage map.  It's just a
lot slower and more work for the engine when there is a MV
that references a lot of views (and those views reference
additional views).


I'm +1 on using catalog identifiers as the key.  As you point out
this is in line with #2 (try to minimize serial catalog lookups)
in addition to supporting requirement #3.

On Thu, Aug 15, 2024 at 3:27 PM Benny Chow  wrote:

I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer 

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-16 Thread Jan Kaul

Hi,

Thanks Micah for clearly stating the requirements. I think this gives 
better clarity for the discussion.


It seems like we don't have a solution that satisfies all requirements 
at once. So we would need to choose which has the fewest drawbacks.


I would like to summarize the different drawbacks that came up in the 
discussion.


no lineage + refresh-state key = identifier:
- introduces catalog identifiers into table metadata (#4)
- query engine has to expand lineage at refresh time (happens anyway)

normalized lineage + refresh-state key = uuid:
- recursive calls to catalog to expand lineage at read time (#2)
- fragile by requiring child views to have lineage field

denormalized lineage + refresh-state key = uuid:
- update of materialized view version required if child view is updated (#5)


With identifiers as the refresh-state keys, the lineage is not strictly 
required and becomes an orthogonal proposal. That's why I left it out of
the comparison.


In my opinion introducing catalog identifiers into the table metadata 
(requirement #4) is the least significant drawback as it is not a 
technical reason but more about semantics. Especially as the identifiers 
are not introduced into the table spec but are rather stored in the 
snapshot summary. That's why I'm in favor of using the catalog 
identifiers as the refresh-state keys.


Regarding your last point Walaa:

The option of using catalog identifiers in the state map still 
requires keeping lineage information in the view because REFRESH MV 
needs the latest fully expanded children (which could have changed 
from the set of children currently in the state map), 
without reparsing the view tree.


For a refresh operation the query engine has to parse the SQL and fully 
expand the lineage with its children anyway.  So the lineage is not
strictly required.
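
As a rough illustration of what "fully expand the lineage" means at refresh time, here is a minimal Python sketch. It assumes a hypothetical in-memory catalog in which each view lists the identifiers it queries; a real engine would get those by parsing the view SQL, which the sketch does not attempt.

```python
# Hypothetical in-memory catalog: views map to the identifiers they query,
# tables map to None. All names are illustrative only.
CATALOG = {
    "db.mv_source_view": ["db.orders", "db.inner_view"],
    "db.inner_view": ["db.customers"],
    "db.orders": None,
    "db.customers": None,
}


def expand(identifier, seen=None):
    """Return every table and view reachable from identifier, depth-first."""
    seen = set() if seen is None else seen
    if identifier in seen:
        return seen  # already visited; also guards against cycles
    seen.add(identifier)
    for child in CATALOG[identifier] or []:
        expand(child, seen)
    return seen
```

Each nested view costs one more catalog lookup before its children are known, which is the serial expansion the thread refers to.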


If I understand correctly, most of you are also in favor of using 
catalog identifiers + ref as the refresh-state keys and postponing the 
lineage proposal.


I hope that we can move the discussion forward.

Jan

On 16.08.24 08:07, Walaa Eldin Moustafa wrote:
The option of using catalog identifiers in the state map still 
requires keeping lineage information in the view because REFRESH MV 
needs the latest fully expanded children (which could have changed 
from the set of children currently in the state map), 
without reparsing the view tree. Therefore, catalog identifiers in the 
state map do not eliminate the need for tracking children in the
form of catalog identifiers in the lineage side (but in this case 
lineage will be a set instead of just a map).


Hence, my concerns with using catalog identifiers (as opposed to 
UUIDs) are:
* The fundamental issue where the table spec depends on/refers to the 
view spec (because such catalog identifiers are not defined in the 
table spec and the only place they have a meaning is in the view spec 
lineage information).
* (less fundamental) The denormalization introduced by this 
arrangement, where each identifier is 3-parts and all of them repeat 
in both lineage info and state map.


I am not very concerned with recursive expansion (through multiple 
calls), as it is always the case with views.


On a positive note, looks like we agree to move past sequence numbers :)

Thanks,
Walaa.




On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield 
 wrote:


I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer using identifier in the
refresh state.  If we use identifiers, we can directly
parallelize the catalog calls to fetch the latest state. If we
use UUID, the engine has to go back to the MV and possibly
additional views to reconstruct the lineage map. It's just a
lot slower and more work for the engine when there is a MV
that references a lot of views (and those views reference
additional views).


I'm +1 on using catalog identifiers as the key.  As you point out
this is in line with #2 (try to minimize serial catalog lookups) in
addition to supporting requirement #3.

On Thu, Aug 15, 2024 at 3:27 PM Benny Chow  wrote:

I think given the constraint that catalog lookup has to be by
identifier and not UUID, I'd prefer using identifier in the
refresh state.  If we use identifiers, we can directly
parallelize the catalog calls to fetch the latest state.  If
we use UUID, the engine has to go back to the MV and possibly
additional views to reconstruct the lineage map.  It's just a
lot slower and more work for the engine when there is a MV
that references a lot of views (and those views reference
additional views).

Thanks
Benny


On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa
 wrote:

Thanks Jan, Micah, and Karuppayya for chiming in.

I do not think 3 and 4 are at odds with e

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Walaa Eldin Moustafa
The option of using catalog identifiers in the state map still requires
keeping lineage information in the view because REFRESH MV needs the latest
fully expanded children (which could have changed from the set of children
currently in the state map), without reparsing the view tree. Therefore,
catalog identifiers in the state map do not eliminate the need for
tracking children in the form of catalog identifiers in the lineage side
(but in this case lineage will be a set instead of just a map).

Hence, my concerns with using catalog identifiers (as opposed to UUIDs) are:
* The fundamental issue where the table spec depends on/refers to the view
spec (because such catalog identifiers are not defined in the table spec
and the only place they have a meaning is in the view spec lineage
information).
* (less fundamental) The denormalization introduced by this arrangement,
where each identifier has three parts and all of them repeat in both lineage
info and state map.
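
The earlier point, that the set of children may have changed since the last refresh, can be illustrated with a minimal Python sketch. The state-map shape and all identifiers are hypothetical; how the current set of children is obtained (from a lineage field or by reparsing the view tree) is exactly the open question in this thread and is not modelled here.

```python
def children_drift(state_map, expanded_children):
    """Compare identifiers recorded at the last refresh with the current set.

    state_map: dict of identifier -> snapshot ID (hypothetical shape).
    expanded_children: set of identifiers from the current view tree.
    Returns (added, removed) as two sets.
    """
    recorded = set(state_map)
    return expanded_children - recorded, recorded - expanded_children


# Hypothetical example: the view now reads db.returns instead of db.customers.
state_map = {"db.orders": 101, "db.customers": 202}
current_children = {"db.orders", "db.returns"}
added, removed = children_drift(state_map, current_children)
```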

I am not very concerned with recursive expansion (through multiple calls),
as it is always the case with views.

On a positive note, looks like we agree to move past sequence numbers :)

Thanks,
Walaa.




On Thu, Aug 15, 2024 at 4:07 PM Micah Kornfield 
wrote:

>> I think given the constraint that catalog lookup has to be by identifier
>> and not UUID, I'd prefer using identifier in the refresh state.  If we use
>> identifiers, we can directly parallelize the catalog calls to fetch the
>> latest state.  If we use UUID, the engine has to go back to the MV and
>> possibly additional views to reconstruct the lineage map.  It's just a lot
>> slower and more work for the engine when there is a MV that references a
>> lot of views (and those views reference additional views).
>
>
> I'm +1 on using catalog identifiers as the key.  As you point out this is
> in line with #2 (try to minimize serial catalog lookups) in addition to
> supporting requirement #3.
>
> On Thu, Aug 15, 2024 at 3:27 PM Benny Chow  wrote:
>
>> I think given the constraint that catalog lookup has to be by identifier
>> and not UUID, I'd prefer using identifier in the refresh state.  If we use
>> identifiers, we can directly parallelize the catalog calls to fetch the
>> latest state.  If we use UUID, the engine has to go back to the MV and
>> possibly additional views to reconstruct the lineage map.  It's just a lot
>> slower and more work for the engine when there is a MV that references a
>> lot of views (and those views reference additional views).
>>
>> Thanks
>> Benny
>>
>>
>> On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa wrote:
>>
>>> Thanks Jan, Micah, and Karuppayya for chiming in.
>>>
>>> I do not think 3 and 4 are at odds with each other (for example
>>> maintaining both lineage map and state map through UUID can achieve both).
>>> Also, I do not think we can drop the lineage map since in many catalogs,
>>> the only lookup method is by the catalog identifier, and not the UUID.
>>>
>>> I think if we go with UUIDs in the state, we should have a lineage map
>>> (from identifiers to UUIDs) to go with it.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Thu, Aug 15, 2024 at 1:45 PM karuppayya 
>>> wrote:
>>>
>>>> +1 to storing the refresh state as a map of UUIDs to snapshot IDs, and
>>>> deferring the inclusion of lineage to a future iteration (like Micah
>>>> mentioned).
>>>> This would greatly simplify the current design.
>>>>
>>>> Also in terms of identifiers to use (UUID or catalog identifier) for the
>>>> refresh state:
>>>> We will not be able to fetch the table/view using the UUID alone, for
>>>> example from a Hive-based catalog.
>>>> We do not have the direct mapping between UUID and table/view.
>>>> Which leaves us only with the catalog identifiers?
>>>>
>>>> Thanks & Regards
>>>> Karuppayya
>>>>
>>>>
>>>> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield
>>>> wrote:
>>>>
>>>>> I think it might be worth restating perceived requirements and making
>>>>> sure there is alignment on them.
>>>>>
>>>>> If I am reading correctly, I think the following are perceived
>>>>> requirements:
>>>>> 1. An engine must be able to unambiguously detect whether an underlying
>>>>> queried entity has changed via metadata to decide if materialized
>>>>> table data can be used.
>>>>> 2. The number of sequential catalog reads an engine needs to make to
>>>>> make use of a materialized table state at read time is minimized.
>>>>> 3. Engines that don't understand a SQL dialect can still use MV
>>>>> information if it is not stale.
>>>>> 4. Table refs (catalog identifiers) should not appear in the
>>>>> materialized table metadata (i.e. state).
>>>>> 5. The view part of the MV definition should not need a new revision
>>>>> for any changes to objects it queries as long as their schemas stay
>>>>> compatible (only state information on the materialized table needs to
>>>>> change).
>>>>>
>>>>> In my mind, requirement 1 is the only true requirement.  I think this
>>>>> necessit

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Micah Kornfield
>
> I think given the constraint that catalog lookup has to be by identifier
> and not UUID, I'd prefer using identifier in the refresh state.  If we use
> identifiers, we can directly parallelize the catalog calls to fetch the
> latest state.  If we use UUID, the engine has to go back to the MV and
> possibly additional views to reconstruct the lineage map.  It's just a lot
> slower and more work for the engine when there is a MV that references a
> lot of views (and those views reference additional views).


I'm +1 on using catalog identifiers as the key.  As you point out, this is
in line with #2 (try to minimize serial catalog lookups) in addition to
supporting requirement #3.

On Thu, Aug 15, 2024 at 3:27 PM Benny Chow  wrote:

> I think given the constraint that catalog lookup has to be by identifier
> and not UUID, I'd prefer using identifier in the refresh state.  If we use
> identifiers, we can directly parallelize the catalog calls to fetch the
> latest state.  If we use UUID, the engine has to go back to the MV and
> possibly additional views to reconstruct the lineage map.  It's just a lot
> slower and more work for the engine when there is a MV that references a
> lot of views (and those views reference additional views).
>
> Thanks
> Benny
>
>
> On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa
> wrote:
>
>> Thanks Jan, Micah, and Karuppayya for chiming in.
>>
>> I do not think 3 and 4 are at odds with each other (for example
>> maintaining both lineage map and state map through UUID can achieve both).
>> Also, I do not think we can drop the lineage map since in many catalogs,
>> the only lookup method is by the catalog identifier, and not the UUID.
>>
>> I think if we go with UUIDs in the state, we should have a lineage map
>> (from identifiers to UUIDs) to go with it.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Thu, Aug 15, 2024 at 1:45 PM karuppayya 
>> wrote:
>>
>>> +1 to storing the refresh state as a map of UUIDs to snapshot IDs, and
>>> deferring the inclusion of lineage to a future iteration.(like Micha
>>> mentioned)
>>> This would greatly simplify the current design.
>>>
>>> Also in terms of identifiers to use(UUID or catalog identifier) for the
>>> refresh state
>>> We will not be able to fetch the table/View using the UUID alone, for
>>> example from Hive based catalog.
>>> We do not have the direct mapping between UUID and table/view.
>>> Which leaves us only with the catalog identifiers?
>>>
>>> Thanks & Regards
>>> Karuppayya
>>>
>>>
>>> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield 
>>> wrote:
>>>
>>>> I think it might be worth restating perceived requirements and making
>>>> sure there is alignment on them.
>>>>
>>>> If I am reading correctly, I think the following are perceived
>>>> requirements:
>>>> 1. An engine must be able to unambiguously detect that an underlying
>>>> queried entity has changed or not via metadata to decide if materialized
>>>> table data can be used.
>>>> 2. The number of sequential catalog reads an engine needs to make to
>>>> make use of a materialized table state at read time is minimized.
>>>> 3. Engines that don't understand a SQL dialect can still use MV
>>>> information if it is not stale.
>>>> 4. Table refs (catalog identifiers) should not appear in the
>>>> materialized table metadata (i.e. state).
>>>> 5. The view part of the MV definition should not need a new revision
>>>> for any changes to objects it queries as long as their schemas stay
>>>> compatible (only state information on the materialized table need to
>>>> change).
>>>>
>>>> In my mind, requirement 1, is the only true requirement.  I think this
>>>> necessitates having UUID + snapshot ID as part of the state information
>>>> (not necessarily part of the Lineage).  I think it also necessitates having
>>>> a denormalized view of all entities that are inputs into the MV in the
>>>> state information (a view object might not change but its underlying tables
>>>> or views could change and that must be detected).
>>>>
>>>> Requirements 2 and 5 are somewhat at odds with each other.  If
>>>> information is denormalized (fully expanded) in Lineage, it means if table
>>>> information is somehow dropped from an intermediate view, one would need to
>>>> update the view (or make excess calls to the catalog). In my mind, this
>>>> argues for normalization of the lineage stored on the view (with the cost
>>>> of potentially 1 additional serial catalog lookup once the state
>>>> information is retrieved).
>>>>
>>>> I think #3 is at odds with #4.  I think #3 is more worthwhile, then
>>>> keeping #4 (and as Jan noted #4 adds complexity).
>>>>
>>>> I think the last remaining question is if lineage serves any purpose.
>>>> I think it is useful for the following reasons:
>>>> a)  When there are no intermediate views queried, it allows for fully
>>>> parallelized lookup calls to the catalog without having to parse the SQL
>>>> statement first
>>>> b)  Allows tools that don't need

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Micah Kornfield
>
> I do not think 3 and 4 are at odds with each other (for example
> maintaining both lineage map and state map through UUID can achieve both).


I agree, I should have been more clear that #5 (limiting new view versions)
also comes into play.  If UUID is used in lineage as part of the view spec,
a "create or replace" of any table or view the MV references would force a
new version of the view part of the MV (I believe this is undesirable but
there are arguments to be made both ways).  If one wants to support allowing
engines that can't parse the SQL to still validate the materialized
table's usability (requirement #3), then denormalization (full expansion)
is necessary (we can't assume any intermediate views would be parseable
either) [1].  Denormalization stored in the view metadata (lineage map),
again, would go against requirement #5.  Therefore, in order to support
requirement #3, we are left with denormalization in the materialized view
state (and catalog identifiers would need to be used for all tables/views
the MV depends on), which would violate #4.

> Also, I do not think we can drop the lineage map since in many catalogs,
> the only lookup method is by the catalog identifier, and not the UUID.


I think this depends on the keys that are used in the state map? If
the keys are already catalog identifiers, then the lineage map seems
superfluous?

> I think if we go with UUIDs in the state, we should have a lineage map
> (from identifiers to UUIDs) to go with it.


I'm not sure if this is referring to the current proposal of lineage map in
view metadata (which I think is covered above), or having a separate
lineage map in the materialized table's metadata.  If the latter, then this
could work but it seems cleaner to have a map of { catalog reference :
(UUID, sequence number)} [1] instead of 2 separate maps.

[1] Technically one would probably need a key of (catalog reference, tag
ID) or some other metadata to disambiguate if the underlying  view query is
doing time travel or referencing a specific branch of the table, or this
can be called out as unsupported.
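
For illustration only, the single-map shape could be sketched in Python
roughly as follows. Every name here (`SourceKey`, `SourceState`, the example
identifiers and UUID strings) is hypothetical and not part of any Iceberg
spec; the sketch only shows how putting the ref into the key keeps two
branches of the same table apart.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SourceKey:
    """Hypothetical map key: a catalog reference plus an optional branch/tag."""
    identifier: str            # illustrative, e.g. "prod.db.orders"
    ref: Optional[str] = None  # None stands for the default branch


@dataclass(frozen=True)
class SourceState:
    """Hypothetical map value: what was observed at the last refresh."""
    uuid: str
    sequence_number: int


# One map instead of a lineage map plus a separate state map.
refresh_state: Dict[SourceKey, SourceState] = {
    SourceKey("prod.db.orders"): SourceState("uuid-orders", 12),
    SourceKey("prod.db.orders", "audit"): SourceState("uuid-orders", 9),
    SourceKey("prod.db.customers"): SourceState("uuid-customers", 4),
}

# Same table, two refs: both entries are kept because the ref is in the key.
main = refresh_state[SourceKey("prod.db.orders")]
audit = refresh_state[SourceKey("prod.db.orders", "audit")]
print(main.uuid == audit.uuid, main.sequence_number, audit.sequence_number)
```

The frozen dataclasses are hashable, so the composite key can be used in a
plain dict without any extra encoding of the ref into the identifier string.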

Thanks,
Micah






On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa 
wrote:

> Thanks Jan, Micah, and Karuppayya for chiming in.
>
> I do not think 3 and 4 are at odds with each other (for example
> maintaining both lineage map and state map through UUID can achieve both).
> Also, I do not think we can drop the lineage map since in many catalogs,
> the only lookup method is by the catalog identifier, and not the UUID.
>
> I think if we go with UUIDs in the state, we should have a lineage map
> (from identifiers to UUIDs) to go with it.
>
> Thanks,
> Walaa.
>
>
> On Thu, Aug 15, 2024 at 1:45 PM karuppayya 
> wrote:
>
>> +1 to storing the refresh state as a map of UUIDs to snapshot IDs, and
>> deferring the inclusion of lineage to a future iteration.(like Micha
>> mentioned)
>> This would greatly simplify the current design.
>>
>> Also in terms of identifiers to use(UUID or catalog identifier) for the
>> refresh state
>> We will not be able to fetch the table/View using the UUID alone, for
>> example from Hive based catalog.
>> We do not have the direct mapping between UUID and table/view.
>> Which leaves us only with the catalog identifiers?
>>
>> Thanks & Regards
>> Karuppayya
>>
>>
>> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield 
>> wrote:
>>
>>> I think it might be worth restating perceived requirements and making
>>> sure there is alignment on them.
>>>
>>> If I am reading correctly, I think the following are perceived
>>> requirements:
>>> 1. An engine must be able to unambiguously detect that an underlying
>>> queried entity has changed or not via metadata to decide if materialized
>>> table data can be used.
>>> 2. The number of sequential catalog reads an engine needs to make to
>>> make use of a materialized table state at read time is minimized.
>>> 3. Engines that don't understand a SQL dialect can still use MV
>>> information if it is not stale.
>>> 4. Table refs (catalog identifiers) should not appear in the
>>> materialized table metadata (i.e. state).
>>> 5. The view part of the MV definition should not need a new revision for
>>> any changes to objects it queries as long as their schemas stay compatible
>>> (only state information on the materialized table need to change).
>>>
>>> In my mind, requirement 1, is the only true requirement.  I think this
>>> necessitates having UUID + snapshot ID as part of the state information
>>> (not necessarily part of the Lineage).  I think it also necessitates having
>>> a denormalized view of all entities that are inputs into the MV in the
>>> state information (a view object might not change but its underlying tables
>>> or views could change and that must be detected).
>>>
>>> Requirements 2 and 5 are somewhat at odds with each other.  If
>>> information is denormalized (fully expanded) in Lineage, it means if table
>>> information is somehow dropped from an intermediate view, one would n

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Benny Chow
I think given the constraint that catalog lookup has to be by identifier
and not UUID, I'd prefer using identifier in the refresh state.  If we use
identifiers, we can directly parallelize the catalog calls to fetch the
latest state.  If we use UUID, the engine has to go back to the MV and
possibly additional views to reconstruct the lineage map.  It's just a lot
slower and more work for the engine when there is an MV that references a
lot of views (and those views reference additional views).
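
A rough sketch of the parallel lookup, assuming the refresh state is keyed
by identifier. The `CATALOG` dict and `load_current_snapshot` are
hypothetical stand-ins for a real catalog call, not an actual Iceberg API,
and views would carry a version id rather than a snapshot id.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

# Hypothetical stand-in for a catalog: identifier -> current snapshot/version id.
CATALOG: Dict[str, int] = {"db.orders": 101, "db.customers": 57, "db.v_sales": 3}


def load_current_snapshot(identifier: str) -> int:
    """Pretend catalog call; a real one would be a network round trip."""
    return CATALOG[identifier]


def fetch_all(refresh_state: Dict[str, int]) -> Dict[str, int]:
    """Issue one lookup per identifier in the refresh state, all in parallel."""
    identifiers = list(refresh_state)
    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map returns results in the same order as the inputs.
        snapshots = list(pool.map(load_current_snapshot, identifiers))
    return dict(zip(identifiers, snapshots))


# Refresh state keyed by identifier: ids recorded at the last refresh.
refresh_state = {"db.orders": 101, "db.customers": 56, "db.v_sales": 3}
current = fetch_all(refresh_state)
changed = sorted(k for k, v in refresh_state.items() if current[k] != v)
print(changed)
```

Because every key is already something the catalog can resolve, no lookup
has to wait for another one to finish. With UUID keys the engine would first
need a serial step to turn each UUID back into an identifier.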

Thanks
Benny


On Thu, Aug 15, 2024 at 2:14 PM Walaa Eldin Moustafa 
wrote:

> Thanks Jan, Micah, and Karuppayya for chiming in.
>
> I do not think 3 and 4 are at odds with each other (for example
> maintaining both lineage map and state map through UUID can achieve both).
> Also, I do not think we can drop the lineage map since in many catalogs,
> the only lookup method is by the catalog identifier, and not the UUID.
>
> I think if we go with UUIDs in the state, we should have a lineage map
> (from identifiers to UUIDs) to go with it.
>
> Thanks,
> Walaa.
>
>
> On Thu, Aug 15, 2024 at 1:45 PM karuppayya 
> wrote:
>
>> +1 to storing the refresh state as a map of UUIDs to snapshot IDs, and
>> deferring the inclusion of lineage to a future iteration.(like Micha
>> mentioned)
>> This would greatly simplify the current design.
>>
>> Also in terms of identifiers to use(UUID or catalog identifier) for the
>> refresh state
>> We will not be able to fetch the table/View using the UUID alone, for
>> example from Hive based catalog.
>> We do not have the direct mapping between UUID and table/view.
>> Which leaves us only with the catalog identifiers?
>>
>> Thanks & Regards
>> Karuppayya
>>
>>
>> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield 
>> wrote:
>>
>>> I think it might be worth restating perceived requirements and making
>>> sure there is alignment on them.
>>>
>>> If I am reading correctly, I think the following are perceived
>>> requirements:
>>> 1. An engine must be able to unambiguously detect that an underlying
>>> queried entity has changed or not via metadata to decide if materialized
>>> table data can be used.
>>> 2. The number of sequential catalog reads an engine needs to make to
>>> make use of a materialized table state at read time is minimized.
>>> 3. Engines that don't understand a SQL dialect can still use MV
>>> information if it is not stale.
>>> 4. Table refs (catalog identifiers) should not appear in the
>>> materialized table metadata (i.e. state).
>>> 5. The view part of the MV definition should not need a new revision for
>>> any changes to objects it queries as long as their schemas stay compatible
>>> (only state information on the materialized table need to change).
>>>
>>> In my mind, requirement 1, is the only true requirement.  I think this
>>> necessitates having UUID + snapshot ID as part of the state information
>>> (not necessarily part of the Lineage).  I think it also necessitates having
>>> a denormalized view of all entities that are inputs into the MV in the
>>> state information (a view object might not change but its underlying tables
>>> or views could change and that must be detected).
>>>
>>> Requirements 2 and 5 are somewhat at odds with each other.  If
>>> information is denormalized (fully expanded) in Lineage, it means if table
>>> information is somehow dropped from an intermediate view, one would need to
>>> update the view (or make excess calls to the catalog). In my mind, this
>>> argues for normalization of the lineage stored on the view (with the cost
>>> of potentially 1 additional serial catalog lookup once the state
>>> information is retrieved).
>>>
>>> I think #3 is at odds with #4.  I think #3 is more worthwhile, then
>>> keeping #4 (and as Jan noted #4 adds complexity).
>>>
>>> I think the last remaining question is if lineage serves any purpose.  I
>>> think it is useful for the following reasons:
>>> a)  When there are no intermediate views queried, it allows for fully
>>> parallelized lookup calls to the catalog without having to parse the SQL
>>> statement first
>>> b)  Allows tools that don't need to lookup state information  or parse
>>> SQL but still navigate MV/view trees.
>>>
>>> Both of these seem relatively minor, so lineage could perhaps be left
>>> out in the first iteration.
>>>
>>> As it applies to Jan's questions:
>>>
>>>> 1. Should we move the identifiers out of the refresh-state into a new
>>>> lineage record that is stored as part of the view metadata?
>>>
>>> No, I don't think so, I think #5 is a reasonable requirement and I think
>>> this violates it.
>>>
>>>
>>>> 2. If yes, should the lineage in the view be fully expanded?
>>>
>>> No, I think only the state should be fully expanded (for reasons
>>> mentioned above, it potentially requires more updates to the view then
>>> necessary).
>>>
>>>
>>>> 3. What should be used as an identifier in the lineage to reference
>>>> entries in the refresh-state?
>>>
>>>
>>> Catalog identifiers make sens

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Walaa Eldin Moustafa
Thanks Jan, Micah, and Karuppayya for chiming in.

I do not think 3 and 4 are at odds with each other (for example
maintaining both lineage map and state map through UUID can achieve both).
Also, I do not think we can drop the lineage map since in many catalogs,
the only lookup method is by the catalog identifier, and not the UUID.

I think if we go with UUIDs in the state, we should have a lineage map
(from identifiers to UUIDs) to go with it.

Thanks,
Walaa.


On Thu, Aug 15, 2024 at 1:45 PM karuppayya  wrote:

> +1 to storing the refresh state as a map of UUIDs to snapshot IDs, and
> deferring the inclusion of lineage to a future iteration.(like Micha
> mentioned)
> This would greatly simplify the current design.
>
> Also in terms of identifiers to use(UUID or catalog identifier) for the
> refresh state
> We will not be able to fetch the table/View using the UUID alone, for
> example from Hive based catalog.
> We do not have the direct mapping between UUID and table/view.
> Which leaves us only with the catalog identifiers?
>
> Thanks & Regards
> Karuppayya
>
>
> On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield 
> wrote:
>
>> I think it might be worth restating perceived requirements and making
>> sure there is alignment on them.
>>
>> If I am reading correctly, I think the following are perceived
>> requirements:
>> 1. An engine must be able to unambiguously detect that an underlying
>> queried entity has changed or not via metadata to decide if materialized
>> table data can be used.
>> 2. The number of sequential catalog reads an engine needs to make to make
>> use of a materialized table state at read time is minimized.
>> 3. Engines that don't understand a SQL dialect can still use MV
>> information if it is not stale.
>> 4. Table refs (catalog identifiers) should not appear in the materialized
>> table metadata (i.e. state).
>> 5. The view part of the MV definition should not need a new revision for
>> any changes to objects it queries as long as their schemas stay compatible
>> (only state information on the materialized table need to change).
>>
>> In my mind, requirement 1, is the only true requirement.  I think this
>> necessitates having UUID + snapshot ID as part of the state information
>> (not necessarily part of the Lineage).  I think it also necessitates having
>> a denormalized view of all entities that are inputs into the MV in the
>> state information (a view object might not change but its underlying tables
>> or views could change and that must be detected).
>>
>> Requirements 2 and 5 are somewhat at odds with each other.  If
>> information is denormalized (fully expanded) in Lineage, it means if table
>> information is somehow dropped from an intermediate view, one would need to
>> update the view (or make excess calls to the catalog). In my mind, this
>> argues for normalization of the lineage stored on the view (with the cost
>> of potentially 1 additional serial catalog lookup once the state
>> information is retrieved).
>>
>> I think #3 is at odds with #4.  I think #3 is more worthwhile, then
>> keeping #4 (and as Jan noted #4 adds complexity).
>>
>> I think the last remaining question is if lineage serves any purpose.  I
>> think it is useful for the following reasons:
>> a)  When there are no intermediate views queried, it allows for fully
>> parallelized lookup calls to the catalog without having to parse the SQL
>> statement first
>> b)  Allows tools that don't need to lookup state information  or parse
>> SQL but still navigate MV/view trees.
>>
>> Both of these seem relatively minor, so lineage could perhaps be left out
>> in the first iteration.
>>
>> As it applies to Jan's questions:
>>
>>> 1. Should we move the identifiers out of the refresh-state into a new
>>> lineage record that is stored as part of the view metadata?
>>
>> No, I don't think so, I think #5 is a reasonable requirement and I think
>> this violates it.
>>
>>
>>> 2. If yes, should the lineage in the view be fully expanded?
>>
>> No, I think only the state should be fully expanded (for reasons
>> mentioned above, it potentially requires more updates to the view then
>> necessary).
>>
>>
>>> 3. What should be used as an identifier in the lineage to reference
>>> entries in the refresh-state?
>>
>>
>> Catalog identifiers make sense to me.  If we agree requirement #3 is not
>> a requirement then it seems like this could also be UUIDs.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Aug 15, 2024 at 7:57 AM Benny Chow  wrote:
>>
>>> If we go with either UUID or Table Identifier + VersionID/SnapshotId in
>>> the refresh state, then this list is fully expanded already.  So, to
>>> validate the freshness of a materialization, the engine doesn't even need
>>> to look at the view lineage.  IMO, the view lineage is nice to have but not
>>> a necessary requirement for MVs.  The view lineage makes sharing of views
>>> between engines without common SQL dialects possible.
>>>
>>> Benny
>>>
>>> On Thu, Aug 15, 2024 at 12:22 AM Jan 

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread karuppayya
+1 to storing the refresh state as a map of UUIDs to snapshot IDs, and
deferring the inclusion of lineage to a future iteration (like Micah
mentioned).
This would greatly simplify the current design.

Also, in terms of which identifier to use (UUID or catalog identifier) for
the refresh state:
We will not be able to fetch the table/view using the UUID alone, for
example from a Hive-based catalog.
We do not have a direct mapping between UUID and table/view.
Which leaves us only with the catalog identifiers?

Thanks & Regards
Karuppayya


On Thu, Aug 15, 2024 at 9:16 AM Micah Kornfield 
wrote:

> I think it might be worth restating perceived requirements and making sure
> there is alignment on them.
>
> If I am reading correctly, I think the following are perceived
> requirements:
> 1. An engine must be able to unambiguously detect that an underlying
> queried entity has changed or not via metadata to decide if materialized
> table data can be used.
> 2. The number of sequential catalog reads an engine needs to make to make
> use of a materialized table state at read time is minimized.
> 3. Engines that don't understand a SQL dialect can still use MV
> information if it is not stale.
> 4. Table refs (catalog identifiers) should not appear in the materialized
> table metadata (i.e. state).
> 5. The view part of the MV definition should not need a new revision for
> any changes to objects it queries as long as their schemas stay compatible
> (only state information on the materialized table need to change).
>
> In my mind, requirement 1, is the only true requirement.  I think this
> necessitates having UUID + snapshot ID as part of the state information
> (not necessarily part of the Lineage).  I think it also necessitates having
> a denormalized view of all entities that are inputs into the MV in the
> state information (a view object might not change but its underlying tables
> or views could change and that must be detected).
>
> Requirements 2 and 5 are somewhat at odds with each other.  If information
> is denormalized (fully expanded) in Lineage, it means if table information
> is somehow dropped from an intermediate view, one would need to update the
> view (or make excess calls to the catalog). In my mind, this argues for
> normalization of the lineage stored on the view (with the cost of
> potentially 1 additional serial catalog lookup once the state information
> is retrieved).
>
> I think #3 is at odds with #4.  I think #3 is more worthwhile, then
> keeping #4 (and as Jan noted #4 adds complexity).
>
> I think the last remaining question is if lineage serves any purpose.  I
> think it is useful for the following reasons:
> a)  When there are no intermediate views queried, it allows for fully
> parallelized lookup calls to the catalog without having to parse the SQL
> statement first
> b)  Allows tools that don't need to lookup state information  or parse SQL
> but still navigate MV/view trees.
>
> Both of these seem relatively minor, so lineage could perhaps be left out
> in the first iteration.
>
> As it applies to Jan's questions:
>
>> 1. Should we move the identifiers out of the refresh-state into a new
>> lineage record that is stored as part of the view metadata?
>
> No, I don't think so, I think #5 is a reasonable requirement and I think
> this violates it.
>
>
>> 2. If yes, should the lineage in the view be fully expanded?
>
> No, I think only the state should be fully expanded (for reasons mentioned
> above, it potentially requires more updates to the view then necessary).
>
>
>> 3. What should be used as an identifier in the lineage to reference
>> entries in the refresh-state?
>
>
> Catalog identifiers make sense to me.  If we agree requirement #3 is not a
> requirement then it seems like this could also be UUIDs.
>
> Thanks,
> Micah
>
> On Thu, Aug 15, 2024 at 7:57 AM Benny Chow  wrote:
>
>> If we go with either UUID or Table Identifier + VersionID/SnapshotId in
>> the refresh state, then this list is fully expanded already.  So, to
>> validate the freshness of a materialization, the engine doesn't even need
>> to look at the view lineage.  IMO, the view lineage is nice to have but not
>> a necessary requirement for MVs.  The view lineage makes sharing of views
>> between engines without common SQL dialects possible.
>>
>> Benny
>>
>> On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul 
>> wrote:
>>
>>> Hi all,
>>>
>>> I would like to reemphasize the purpose of the refresh-state for
>>> materialized views. The purpose is to determine if the precomputed data is
>>> fresh, stale or invalid. For that the current snapshot-id of every table in
>>> the query tree has to be fetched from the catalog by using its full
>>> identifier and ref. Additionally the refresh state stores the snapshot-id
>>> of the last refresh.
>>>
>>> To summarize: *To determine the freshness of the precomputed data we
>>> require the full identifier + ref and snapshot-id of the last refresh for
>>> every table in the fully expanded query tr

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Micah Kornfield
I think it might be worth restating perceived requirements and making sure
there is alignment on them.

If I am reading correctly, I think the following are perceived requirements:
1. An engine must be able to unambiguously detect whether an underlying
queried entity has changed via metadata to decide if materialized
table data can be used.
2. The number of sequential catalog reads an engine needs to make to make
use of a materialized table state at read time is minimized.
3. Engines that don't understand a SQL dialect can still use MV information
if it is not stale.
4. Table refs (catalog identifiers) should not appear in the materialized
table metadata (i.e. state).
5. The view part of the MV definition should not need a new revision for
any changes to objects it queries as long as their schemas stay compatible
(only state information on the materialized table needs to change).

In my mind, requirement 1 is the only true requirement.  I think this
necessitates having UUID + snapshot ID as part of the state information
(not necessarily part of the Lineage).  I think it also necessitates having
a denormalized view of all entities that are inputs into the MV in the
state information (a view object might not change but its underlying tables
or views could change and that must be detected).
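
As a small illustration of why the state has to be denormalized, here is a
Python sketch. The dependency graph, entity names, UUID strings and snapshot
numbers are all made up for the example; this is a sketch of the idea, not a
proposed format.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical dependency graph: view -> entities it queries directly.
# Plain tables do not appear as keys.
DIRECT_DEPS: Dict[str, List[str]] = {
    "mv_sales": ["v_recent_orders", "customers"],
    "v_recent_orders": ["orders"],
}


def expand(entity: str, seen: Set[str]) -> Set[str]:
    """Collect every view and table reachable from `entity` (fully expanded)."""
    for dep in DIRECT_DEPS.get(entity, []):
        if dep not in seen:
            seen.add(dep)
            expand(dep, seen)
    return seen


inputs = sorted(expand("mv_sales", set()))

# State recorded at refresh time: entity -> (UUID, snapshot or version id).
state_at_refresh: Dict[str, Tuple[str, int]] = {
    "customers": ("u-c", 4),
    "orders": ("u-o", 10),
    "v_recent_orders": ("u-v", 1),
}

# Later, only the table underneath the intermediate view has changed.
current = dict(state_at_refresh, orders=("u-o", 11))

stale_inputs = [name for name in inputs if current[name] != state_at_refresh[name]]
# A state holding only the direct dependencies would have missed this change.
missed_by_direct_only = [n for n in stale_inputs if n not in DIRECT_DEPS["mv_sales"]]
print(inputs, stale_inputs, missed_by_direct_only)
```

The view `v_recent_orders` itself is unchanged here, so only the fully
expanded state notices that `orders` moved to a new snapshot.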

Requirements 2 and 5 are somewhat at odds with each other.  If information
is denormalized (fully expanded) in Lineage, it means if table information
is somehow dropped from an intermediate view, one would need to update the
view (or make excess calls to the catalog). In my mind, this argues for
normalization of the lineage stored on the view (with the cost of
potentially 1 additional serial catalog lookup once the state information
is retrieved).

I think #3 is at odds with #4.  I think #3 is more worthwhile than keeping
#4 (and as Jan noted #4 adds complexity).

I think the last remaining question is if lineage serves any purpose.  I
think it is useful for the following reasons:
a)  When there are no intermediate views queried, it allows for fully
parallelized lookup calls to the catalog without having to parse the SQL
statement first
b)  Allows tools that don't need to look up state information or parse SQL
to still navigate MV/view trees.

Both of these seem relatively minor, so lineage could perhaps be left out
in the first iteration.

As it applies to Jan's questions:

> 1. Should we move the identifiers out of the refresh-state into a new
> lineage record that is stored as part of the view metadata?

No, I don't think so, I think #5 is a reasonable requirement and I think
this violates it.


> 2. If yes, should the lineage in the view be fully expanded?

No, I think only the state should be fully expanded (for reasons mentioned
above, it potentially requires more updates to the view than necessary).


> 3. What should be used as an identifier in the lineage to reference
> entries in the refresh-state?


Catalog identifiers make sense to me.  If we agree requirement #3 is not a
requirement then it seems like this could also be UUIDs.

Thanks,
Micah

On Thu, Aug 15, 2024 at 7:57 AM Benny Chow  wrote:

> If we go with either UUID or Table Identifier + VersionID/SnapshotId in
> the refresh state, then this list is fully expanded already.  So, to
> validate the freshness of a materialization, the engine doesn't even need
> to look at the view lineage.  IMO, the view lineage is nice to have but not
> a necessary requirement for MVs.  The view lineage makes sharing of views
> between engines without common SQL dialects possible.
>
> Benny
>
> On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul 
> wrote:
>
>> Hi all,
>>
>> I would like to reemphasize the purpose of the refresh-state for
>> materialized views. The purpose is to determine if the precomputed data is
>> fresh, stale or invalid. For that the current snapshot-id of every table in
>> the query tree has to be fetched from the catalog by using its full
>> identifier and ref. Additionally the refresh state stores the snapshot-id
>> of the last refresh.
>>
>> To summarize: *To determine the freshness of the precomputed data we
>> require the full identifier + ref and snapshot-id of the last refresh for
>> every table in the fully expanded query tree*
>>
>> This is a requirement from how the catalog works and independent from how
>> we design the lineage/refresh state. Additionally we previously agreed that
>> we should be able to obtain the full list of identifiers without needing to
>> parse the SQL definition.
>>
>> Now we are having a discussion in how to store and obtain the fully
>> expanded list of table identifiers and snapshot-ids. To move the discussion
>> forward I think it would be valuable to answer the following 3 questions:
>>
>> 1. Should we move the identifiers out of the refresh-state into a new
>> lineage record that is stored as part of the view metadata?
>>
>> 2. If yes, should the lineage in the view be fully expanded?
>>
>> 3. What should be used as a

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Benny Chow
If we go with either UUID or Table Identifier + VersionID/SnapshotId in the
refresh state, then this list is fully expanded already.  So, to validate
the freshness of a materialization, the engine doesn't even need to look at
the view lineage.  IMO, the view lineage is nice to have but not a
necessary requirement for MVs.  The view lineage makes sharing of views
between engines without common SQL dialects possible.

Benny

On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul 
wrote:

> Hi all,
>
> I would like to reemphasize the purpose of the refresh-state for
> materialized views. The purpose is to determine if the precomputed data is
> fresh, stale or invalid. For that the current snapshot-id of every table in
> the query tree has to be fetched from the catalog by using its full
> identifier and ref. Additionally the refresh state stores the snapshot-id
> of the last refresh.
>
> To summarize: *To determine the freshness of the precomputed data we
> require the full identifier + ref and snapshot-id of the last refresh for
> every table in the fully expanded query tree*
>
> This is a requirement from how the catalog works and independent from how
> we design the lineage/refresh state. Additionally we previously agreed that
> we should be able to obtain the full list of identifiers without needing to
> parse the SQL definition.
>
> Now we are having a discussion in how to store and obtain the fully
> expanded list of table identifiers and snapshot-ids. To move the discussion
> forward I think it would be valuable to answer the following 3 questions:
>
> 1. Should we move the identifiers out of the refresh-state into a new
> lineage record that is stored as part of the view metadata?
>
> 2. If yes, should the lineage in the view be fully expanded?
>
> 3. What should be used as an identifier in the lineage to reference
> entries in the refresh-state?
>
> 1. Question:
>
> We already agreed that this would be a good idea because we wouldn't
> introduce the identifier concept to the table metadata. However, looking at
> the complexity that comes with the alternatives, I would like to keep this
> question open.
>
> 2. Question:
>
> I'm against using a not fully expanded lineage in the view struct. To
> recall we require every identifier in the fully expanded query tree to
> determine the freshness. Not storing all identifiers in the lineage would
> mean to recursively call the catalog and expand the query tree at read
> time. This can lead to a large overhead for determining the refresh state
> compared to expanding the query tree once at creation time and then storing
> the fully expanded lineage.
>
> 3. Question:
>
> This depends on Question 2.
>
> For a not fully expanded lineage, the only options would be uuids or
> catalog identifiers.
>
> For a fully expanded lineage the question isn't all that relevant. The
> current design specifies that the lineage is a map from an identifier to an
> id and the refresh-state is a map from such id to a snapshot-id. For this
> to work we don't have to specify which kind of identifier has to be used.
> One query engine could use uuids, the other engine sequence-ids. The
> important assumption we are making is that every id that is used in the
> refresh-state has to be defined in the lineage.
> So the question about using uuids is rather, can the query engine trust
> that the id defined in the lineage is the uuid of the table.
>
>
> Regarding the complexity that comes from introducing the lineage in the
> view I would like to revisit question 1. Introducing the lineage in the
> view metadata opens up the question of when should the lineage be fully
> expanded. We see that we have 3 options:
>
> 1. Not fully expanded lineage -> Expansion at read time
>
> 2. Fully expanded lineage -> Expansion at creation time
>
> 3. No lineage (use identifiers in refresh-state) -> Expansion at refresh
> time
>
> As reading is expected to be the most frequent operation I see option 1 as
> not favorable. As the query engine has to fully expand the query tree for a
> refresh anyway, I see option 3 as the most natural. For a refresh operation
> the query engine must understand the SQL dialects of all views in the query
> tree and therefore is guaranteed to successfully expand the lineage. This
> might not be the case at creation time, which makes option 2 less favorable.
>
> As can be seen, I'm in favor of just storing the refresh-state as a map
> from identifier to snapshot-id and not using the lineage. I know that this
> introduces the concept of catalog identifiers to the table metadata spec,
> but in my opinion it is by far the simplest option.
>
> I'm interested in your opinions.
>
> Best wishes,
>
> Jan
> On 14.08.24 22:24, Walaa Eldin Moustafa wrote:
>
> Thanks Benny. For refs, I am +1 to represent them as UUID + optional ref,
> although we can iterate on the exact JSON structure (e.g., another option is
> splitting the (UUID) state from the (UUID + ref) state into two separate
> higher-level fields).
>

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-15 Thread Jan Kaul

Hi all,

I would like to reemphasize the purpose of the refresh-state for 
materialized views. The purpose is to determine if the precomputed data 
is fresh, stale or invalid. For that the current snapshot-id of every 
table in the query tree has to be fetched from the catalog by using its 
full identifier and ref. Additionally the refresh state stores the 
snapshot-id of the last refresh.


To summarize: *To determine the freshness of the precomputed data we 
require the full identifier + ref and snapshot-id of the last refresh 
for every table in the fully expanded query tree*
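A minimal sketch of the freshness check summarized above, in Python. The names `TableKey`, `Freshness`, and `check_freshness` are hypothetical, the dict `current_snapshots` stands in for a catalog lookup by identifier + ref, and treating an unresolvable table as "invalid" is an assumption rather than something this thread defines.

```python
from enum import Enum
from typing import Dict, Optional, Tuple

# Hypothetical key: (full table identifier, ref).
TableKey = Tuple[str, str]


class Freshness(Enum):
    FRESH = "fresh"
    STALE = "stale"
    INVALID = "invalid"


def check_freshness(
    refresh_state: Dict[TableKey, int],
    current_snapshots: Dict[TableKey, Optional[int]],
) -> Freshness:
    """Compare snapshot-ids recorded at the last refresh with the current ones.

    Assumption: a table that can no longer be resolved makes the
    precomputed data invalid; any snapshot-id mismatch makes it stale.
    """
    stale = False
    for key, refreshed_snapshot_id in refresh_state.items():
        current = current_snapshots.get(key)
        if current is None:
            return Freshness.INVALID
        if current != refreshed_snapshot_id:
            stale = True
    return Freshness.STALE if stale else Freshness.FRESH


# Example: one table advanced since the last refresh.
state = {("db.orders", "main"): 10, ("db.customers", "main"): 7}
catalog_now = {("db.orders", "main"): 11, ("db.customers", "main"): 7}
result = check_freshness(state, catalog_now)  # Freshness.STALE
```

Note that the check only needs the refresh-state entries and a catalog lookup; it never parses the SQL definition.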


This is a requirement from how the catalog works and independent from 
how we design the lineage/refresh state. Additionally we previously 
agreed that we should be able to obtain the full list of identifiers 
without needing to parse the SQL definition.


Now we are having a discussion on how to store and obtain the fully 
expanded list of table identifiers and snapshot-ids. To move the 
discussion forward I think it would be valuable to answer the following 
3 questions:


1. Should we move the identifiers out of the refresh-state into a new 
lineage record that is stored as part of the view metadata?


2. If yes, should the lineage in the view be fully expanded?

3. What should be used as an identifier in the lineage to reference 
entries in the refresh-state?


1. Question:

We already agreed that this would be a good idea because we wouldn't 
introduce the identifier concept to the table metadata. However, looking 
at the complexity that comes with the alternatives, I would like to keep 
this question open.


2. Question:

I'm against using a not fully expanded lineage in the view struct. To 
recall we require every identifier in the fully expanded query tree to 
determine the freshness. Not storing all identifiers in the lineage 
would mean to recursively call the catalog and expand the query tree at 
read time. This can lead to a large overhead for determining the refresh 
state compared to expanding the query tree once at creation time and 
then storing the fully expanded lineage.


3. Question:

This depends on Question 2.

For a not fully expanded lineage, the only options would be uuids or 
catalog identifiers.


For a fully expanded lineage the question isn't all that relevant. The 
current design specifies that the lineage is a map from an identifier to 
an id and the refresh-state is a map from such id to a snapshot-id. For 
this to work we don't have to specify which kind of identifier has to be 
used. One query engine could use uuids, the other engine sequence-ids. 
The important assumption we are making is that every id that is used in 
the refresh-state has to be defined in the lineage.
So the question about using uuids is rather, can the query engine trust 
that the id defined in the lineage is the uuid of the table.
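The assumption that every id used in the refresh-state is defined in the lineage could be checked roughly as follows. This is only a sketch: the function name `undefined_state_ids` and the example identifiers and ids are made up for illustration, and the id strings could equally be uuids or sequence-ids.

```python
from typing import Dict, Set


def undefined_state_ids(
    lineage: Dict[str, str], refresh_state: Dict[str, int]
) -> Set[str]:
    """Return ids used in the refresh-state that the lineage does not define."""
    defined_ids = set(lineage.values())
    return set(refresh_state) - defined_ids


# lineage: identifier -> id, refresh-state: id -> snapshot-id
lineage = {"db.orders": "id-1", "db.customers": "id-2"}
refresh_state = {"id-1": 10, "id-2": 7}
missing = undefined_state_ids(lineage, refresh_state)  # empty set: consistent
```

The check is agnostic to the kind of id; whether an id really is the table's uuid is the separate trust question raised above.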



Regarding the complexity that comes from introducing the lineage in the 
view I would like to revisit question 1. Introducing the lineage in the 
view metadata opens up the question of when should the lineage be fully 
expanded. We see that we have 3 options:


1. Not fully expanded lineage -> Expansion at read time

2. Fully expanded lineage -> Expansion at creation time

3. No lineage (use identifiers in refresh-state) -> Expansion at refresh 
time


As reading is expected to be the most frequent operation I see option 1 
as not favorable. As the query engine has to fully expand the query tree 
for a refresh anyway, I see option 3 as the most natural. For a refresh 
operation the query engine must understand the SQL dialects of all views 
in the query tree and therefore is guaranteed to successfully expand the 
lineage. This might not be the case at creation time, which makes option 
2 less favorable.
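Option 3 (expansion at refresh time) can be pictured with the following sketch. The in-memory `catalog` dict is a stand-in for real catalog calls and SQL parsing: a view maps to the identifiers it references and a table maps to its current snapshot-id. All names here are hypothetical, and refs and view versions are left out for brevity.

```python
from typing import Dict, List, Set, Union

# Hypothetical catalog: view -> referenced identifiers, table -> snapshot-id.
Catalog = Dict[str, Union[List[str], int]]


def expand_refresh_state(catalog: Catalog, root: str) -> Dict[str, int]:
    """Walk the query tree below `root` and record each table's snapshot-id."""
    state: Dict[str, int] = {}
    visiting: Set[str] = set()

    def visit(identifier: str) -> None:
        entry = catalog[identifier]
        if isinstance(entry, int):
            # A table: store identifier -> current snapshot-id.
            state[identifier] = entry
            return
        if identifier in visiting:
            raise ValueError(f"cyclic view definition at {identifier}")
        visiting.add(identifier)
        for child in entry:
            visit(child)
        visiting.discard(identifier)

    visit(root)
    return state


catalog: Catalog = {
    "db.mv_query": ["db.v1", "db.t1"],
    "db.v1": ["db.t2", "db.t1"],
    "db.t1": 5,
    "db.t2": 9,
}
refresh_state = expand_refresh_state(catalog, "db.mv_query")
```

The result is already the fully expanded map from identifier to snapshot-id, so nothing extra has to be stored in the view metadata for the freshness check.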


As can be seen, I'm in favor of just storing the refresh-state as a map 
from identifier to snapshot-id and not using the lineage. I know that 
this introduces the concept of catalog identifiers to the table 
metadata spec, but in my opinion it is by far the simplest option.


I'm interested in your opinions.

Best wishes,

Jan

On 14.08.24 22:24, Walaa Eldin Moustafa wrote:
Thanks Benny. For refs, I am +1 to represent them as UUID + optional 
ref, although we can iterate on the exact JSON structure (e.g., another 
option is splitting the (UUID) state from the (UUID + ref) state into two 
separate higher-level fields).


Generally agree that the REFRESH VIEW strategy could be up to the engine, 
but it seems like an area where Iceberg could have an opinion/spec. 
I will start a separate thread for that.


Thanks,
Walaa.


Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-14 Thread Walaa Eldin Moustafa
Thanks Benny. For refs, I am +1 to represent them as UUID + optional ref,
although we can iterate on the exact JSON structure (e.g., another option is
splitting the (UUID) state from the (UUID + ref) state into two separate
higher-level fields).

Generally agree that the REFRESH VIEW strategy could be up to the engine, but
it seems like an area where Iceberg could have an opinion/spec. I will
start a separate thread for that.

Thanks,
Walaa.


Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-14 Thread Benny Chow
I'd like to hear Jan's feedback on using UUID and normalizing the view
lineage.  I'm on board with this change.

I updated the fully spec'd out example using UUID and a normalized view
lineage:

https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit#heading=h.o6yn2lnpxowg

Thanks a lot Walaa for driving this discussion.

We need to revisit the issue about identifying the refresh table for
snapshot references.  The refresh table needs to include both UUID and the
"ref".  Because of this, I think we should still include the fully
qualified identifier in the refresh state.  If we do this, then the state
looks a lot like the lineage except that it includes the version property
and is fully denormalized.  Alternatively, the refresh table just has UUID
and an optional "ref" and no identifier.  What do you guys think of this?
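To make the second alternative concrete, here is a rough Python sketch of a refresh state keyed by UUID plus an optional ref. The class name `StateKey`, the field names, and the placeholder uuid strings are assumptions for illustration only, not the structure from the spec doc.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class StateKey:
    """Hypothetical refresh-state key: table UUID plus an optional ref."""

    uuid: str
    ref: Optional[str] = None


# The same table read through two refs is tracked as two separate entries.
refresh_state: Dict[StateKey, int] = {
    StateKey("uuid-orders"): 10,
    StateKey("uuid-orders", ref="audit"): 8,
    StateKey("uuid-customers"): 7,
}

snapshot_on_audit = refresh_state[StateKey("uuid-orders", ref="audit")]
```

Because the key carries no catalog identifier, resolving an entry back to a table in the catalog would still need either the lineage or a lookup by UUID.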

Regarding dropping and re-creating a dependent view in the MV, I think we
should keep it flexible for the engine to decide what that means for the
materialization validity.  Once the view is re-created, the version-ids
inside are all new and don't match up with the refresh state.  For REFRESH
VIEW, if the view schema changes in an additive way such as new columns
projected from a SELECT *, then MVs built on top could still be valid.
Again, this would be up to the engine to decide on the behavior.

Thanks
Benny

On Tue, Aug 13, 2024 at 11:47 AM Walaa Eldin Moustafa 
wrote:

> Hi Everyone,
>
> Just a follow up on this thread. Thanks Benny and Micah for the discussion
> on the doc [1]. We have been converging more on using UUIDs from the
> discussion. The only open question was related to UUIDs (of underlying
> views/tables) being stale upon a REPLACE (or DROP and CREATE) operation on
> those tables/views. The response to that question is:
>
> (1) While stale, those UUIDs are still good for freshness calculations.
> (2) If we really want to refresh them, that might be in scope of a REFRESH
> VIEW operation (not to be confused with REFRESH MV).
>
> (2) seems appropriate for a separate discussion since REFRESH VIEW
> operation behavior can be concerned with other things as well like
> refreshing the schema.
>
> Detailed discussion on this question can be found in this thread in the
> doc [2]. If people are okay with the stance above, we can move forward
> towards a vote on using UUIDs. I will leave a couple of days for the
> discussion to wrap up in this thread/the doc.
>
> [1]
> https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit
> [2]
> https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit?disco=AAABTXtgBFw
>
> Thanks,
> Walaa.
>
>
> On Thu, Aug 8, 2024 at 4:43 PM Walaa Eldin Moustafa 
> wrote:
>
>> Thanks Benny! We discussed this option during the meeting but we did not
>> prefer it because we did not want to leak the SQL identifiers to the
>> storage table since SQL identifiers are view concepts and fit better with
>> the view.
>>
>> Thanks,
>> Walaa.
>>
>> On Thu, Aug 8, 2024 at 4:12 PM Benny Chow  wrote:
>>
>>> Maybe a third option is to decouple the view lineage and materialization
>>> state.
>>>
>>> The view lineage can just list out the SQL identifiers+ref... we can
>>> still decide whether this is just direct children or fully expanded.
>>> The materialization state doesn't have to depend on the view lineage
>>> (through either sequence ID or UUIDs).  It's a direct mapping of SQL
>>> identifiers (possibly with refs) to their corresponding snapshot id or view
>>> versions.
>>>
>>> The materialization state was our original design until Dan brought up
>>> the idea to also capture the view lineage.  So, what I am suggesting is
>>> that we do both but not couple them together through any sort of sequence
>>> ID or UUID.
>>>
>>> Thanks
>>> Benny
>>>
>>> On Thu, Aug 8, 2024 at 2:04 PM Walaa Eldin Moustafa <
>>> [email protected]> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> In the last community sync on Materialized Views [1], we agreed to
>>>> split the information that is used to determine the materialized view
>>>> staleness to two parts: Lineage Information and State Information. We have
>>>> made a lot of progress on representing both but one issue remains open:
>>>>
>>>> Both lineage information and state information express facts about the
>>>> child objects of the MV, and to correlate those facts we need some sort of
>>>> an ID for those objects.
>>>> The open question is whether the ID should be of type sequence ID or
>>>> UUID (i.e., the Iceberg UUID of the object).
>>>>
>>>> I have described the issue in more detail in this doc [2]. I have also
>>>> added some thoughts on the advantages of using UUIDs in the same doc.
>>>> Further, there is a discussion thread on that topic in the MV spec doc [3].
>>>>
>>>> Please feel free to comment on the docs or share your thoughts in the
>>>> thread. After the discussion we can move forward to a vote.
>>>>
>>>> Thanks,
>>>> Walaa.
>>>>
>>>> [1] https://

Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-13 Thread Walaa Eldin Moustafa
Hi Everyone,

Just a follow up on this thread. Thanks Benny and Micah for the discussion
on the doc [1]. We have been converging more on using UUIDs from the
discussion. The only open question was related to UUIDs (of underlying
views/tables) being stale upon a REPLACE (or DROP and CREATE) operation on
those tables/views. The response to that question is:

(1) While stale, those UUIDs are still good for freshness calculations.
(2) If we really want to refresh them, that might be in scope of a REFRESH
VIEW operation (not to be confused with REFRESH MV).

(2) seems appropriate for a separate discussion since REFRESH VIEW
operation behavior can be concerned with other things as well like
refreshing the schema.

Detailed discussion on this question can be found in this thread in the doc
[2]. If people are okay with the stance above, we can move forward towards
a vote on using UUIDs. I will leave a couple of days for the discussion to
wrap up in this thread/the doc.

[1]
https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit
[2]
https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit?disco=AAABTXtgBFw

Thanks,
Walaa.


On Thu, Aug 8, 2024 at 4:43 PM Walaa Eldin Moustafa 
wrote:

> Thanks Benny! We discussed this option during the meeting but we did not
> prefer it because we did not want to leak the SQL identifiers to the
> storage table since SQL identifiers are view concepts and fit better with
> the view.
>
> Thanks,
> Walaa.
>
> On Thu, Aug 8, 2024 at 4:12 PM Benny Chow  wrote:
>
>> Maybe a third option is to decouple the view lineage and materialization
>> state.
>>
>> The view lineage can just list out the SQL identifiers+ref... we can
>> still decide whether this is just direct children or fully expanded.
>> The materialization state doesn't have to depend on the view lineage
>> (through either sequence ID or UUIDs).  It's a direct mapping of SQL
>> identifiers (possibly with refs) to their corresponding snapshot id or view
>> versions.
>>
>> The materialization state was our original design until Dan brought up
>> the idea to also capture the view lineage.  So, what I am suggesting is
>> that we do both but not couple them together through any sort of sequence
>> ID or UUID.
>>
>> Thanks
>> Benny
>>
>> On Thu, Aug 8, 2024 at 2:04 PM Walaa Eldin Moustafa <
>> [email protected]> wrote:
>>
>>> Hi Everyone,
>>>
>>> In the last community sync on Materialized Views [1], we agreed to split
>>> the information that is used to determine the materialized view staleness
>>> to two parts: Lineage Information and State Information. We have made a lot
>>> of progress on representing both but one issue remains open:
>>>
>>> Both lineage information and state information express facts about the
>>> child objects of the MV, and to correlate those facts we need some sort of
>>> an ID for those objects.
>>> The open question is whether the ID should be of type sequence ID or
>>> UUID (i.e., the Iceberg UUID of the object).
>>>
>>> I have described the issue in more detail in this doc [2]. I have also
>>> added some thoughts on the advantages of using UUIDs in the same doc.
>>> Further, there is a discussion thread on that topic in the MV spec doc [3].
>>>
>>> Please feel free to comment on the docs or share your thoughts in the
>>> thread. After the discussion we can move forward to a vote.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>> [1] https://lists.apache.org/thread/bhmxo1w1bdp1p2hh842kpm2gy1g5rscp
>>> [2]
>>> https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit
>>> [3]
>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABPUgvras
>>>
>>


Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-08 Thread Walaa Eldin Moustafa
Thanks Benny! We discussed this option during the meeting but we did not
prefer it because we did not want to leak the SQL identifiers to the
storage table since SQL identifiers are view concepts and fit better with
the view.

Thanks,
Walaa.

On Thu, Aug 8, 2024 at 4:12 PM Benny Chow  wrote:

> Maybe a third option is to decouple the view lineage and materialization
> state.
>
> The view lineage can just list out the SQL identifiers+ref... we can still
> decide whether this is just direct children or fully expanded.
> The materialization state doesn't have to depend on the view lineage
> (through either sequence ID or UUIDs).  It's a direct mapping of SQL
> identifiers (possibly with refs) to their corresponding snapshot id or view
> versions.
>
> The materialization state was our original design until Dan brought up the
> idea to also capture the view lineage.  So, what I am suggesting is that we
> do both but not couple them together through any sort of sequence ID or
> UUID.
>
> Thanks
> Benny
>
> On Thu, Aug 8, 2024 at 2:04 PM Walaa Eldin Moustafa 
> wrote:
>
>> Hi Everyone,
>>
>> In the last community sync on Materialized Views [1], we agreed to split
>> the information that is used to determine the materialized view staleness
>> to two parts: Lineage Information and State Information. We have made a lot
>> of progress on representing both but one issue remains open:
>>
>> Both lineage information and state information express facts about the
>> child objects of the MV, and to correlate those facts we need some sort of
>> an ID for those objects.
>> The open question is whether the ID should be of type sequence ID or UUID
>> (i.e., the Iceberg UUID of the object).
>>
>> I have described the issue in more detail in this doc [2]. I have also
>> added some thoughts on the advantages of using UUIDs in the same doc.
>> Further, there is a discussion thread on that topic in the MV spec doc [3].
>>
>> Please feel free to comment on the docs or share your thoughts in the
>> thread. After the discussion we can move forward to a vote.
>>
>> Thanks,
>> Walaa.
>>
>> [1] https://lists.apache.org/thread/bhmxo1w1bdp1p2hh842kpm2gy1g5rscp
>> [2]
>> https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit
>> [3]
>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABPUgvras
>>
>


Re: [DISCUSS] Materialized Views: Lineage and State information

2024-08-08 Thread Benny Chow
Maybe a third option is to decouple the view lineage and materialization
state.

The view lineage can just list out the SQL identifiers+ref... we can still
decide whether this is just direct children or fully expanded.
The materialization state doesn't have to depend on the view lineage
(through either sequence ID or UUIDs).  It's a direct mapping of SQL
identifiers (possibly with refs) to their corresponding snapshot id or view
versions.

The materialization state was our original design until Dan brought up the
idea to also capture the view lineage.  So, what I am suggesting is that we
do both but not couple them together through any sort of sequence ID or
UUID.

Thanks
Benny

On Thu, Aug 8, 2024 at 2:04 PM Walaa Eldin Moustafa 
wrote:

> Hi Everyone,
>
> In the last community sync on Materialized Views [1], we agreed to split
> the information that is used to determine the materialized view staleness
> to two parts: Lineage Information and State Information. We have made a lot
> of progress on representing both but one issue remains open:
>
> Both lineage information and state information express facts about the
> child objects of the MV, and to correlate those facts we need some sort of
> an ID for those objects.
> The open question is whether the ID should be of type sequence ID or UUID
> (i.e., the Iceberg UUID of the object).
>
> I have described the issue in more detail in this doc [2]. I have also
> added some thoughts on the advantages of using UUIDs in the same doc.
> Further, there is a discussion thread on that topic in the MV spec doc [3].
>
> Please feel free to comment on the docs or share your thoughts in the
> thread. After the discussion we can move forward to a vote.
>
> Thanks,
> Walaa.
>
> [1] https://lists.apache.org/thread/bhmxo1w1bdp1p2hh842kpm2gy1g5rscp
> [2]
> https://docs.google.com/document/d/1-OaPqm8ahVT3_OCbVdAPQ_wZ8I3ToeqU3RLUjcyKQM0/edit
> [3]
> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABPUgvras
>