Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-12-02 Thread Steve Zhang
Thank you all for your valuable input on this topic. I apologize for not being 
able to keep up with the discussion last week, as I was traveling during the 
U.S. Thanksgiving holiday.

To summarize, I think we have the agreement on following points:
- tableExists API shall indicate whether a table identifier exists in the 
catalog, regardless of metadata corruption
- Currently, we attempt to load the table first to determine table existence, 
but this is not necessary
- For Hive catalog, API can return true if the table identifier exists in HMS 
and it is an iceberg table.
- There are no plans to change other existing behaviors.

I've addressed the feedback from reviewers and also added explicit tests 
coverage, PR is ready for another look in 
https://github.com/apache/iceberg/pull/11597.

Thanks,
Steve Zhang



> On Nov 27, 2024, at 10:15 PM, Péter Váry  wrote:
> 
> +1 from my side too.
> 
> I wanted to make sure that the community is aware of this change which will 
> bring behavioral difference compared to other catalogs. This is why I have 
> asked Steve to start this thread.
> 
> 
> On Thu, Nov 28, 2024, 02:10 Szehon Ho  > wrote:
>> Yea, I think that part is definitely kept.
>> 
>> Thanks
>> Szehon
>> 
>> On Wed, Nov 27, 2024 at 12:02 PM [email protected]  
>> mailto:[email protected]>> wrote:
>>> I'd support changing the behavior if we still have a way to match the 
>>> intent, which is to return true if the table exists in Hive and is an 
>>> Iceberg table.
>>> 
>>> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho >> > wrote:
 Hm I think the thread got a bit sidetracked by the other question.
 
 The initial proposal by Steve is a performance improvement for 
 HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg 
 table metadata, and if successful returns true.  The proposal is to load 
 from Hive only, and return true if Hive metadata identifies that an 
 Iceberg table exists with this name.  
 
 Checking corruption of Iceberg's table metadata.json is a side-effect of 
 the current behavior, but would not anymore with the proposed change.  
 That's the question of the original thread, and so far there's agreement 
 that it is not necessarily part of this scope of HiveCatalog's 
 tableExists().  
 
 At least this is my understanding.
 Thanks,
 Szehon
 
 On Wed, Nov 27, 2024 at 10:56 AM [email protected] 
  mailto:[email protected]>> 
 wrote:
> What kind of corruption are you referring to? I would expect corruption 
> to result in an exception when loading the table, but that the table 
> should still exist. The problem is likely that we determine if a table 
> exists by attempting to load it. We could fix that by not attempting to 
> load the table. I think that's a reasonable solution.
> 
> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang  > wrote:
>>> The current behavior's intent is not to check whether the metadata is 
>>> valid, it is to detect whether the table is an Iceberg table.
>> 
>> Is there a way to detect this from HiveCatalog without loading the 
>> table? 
>> 
>> 
>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry > > wrote:
>>> I think we have an agreement, not to change the behavior wrt existing 
>>> non-Iceberg tables, and throw an exception.
>>> 
>>> Are we also in agreement with the original proposal to return true when 
>>> the table exists but the metadata is somehow corrupted? Note: this is 
>>> the proposed change of behavior why the thread was originally started.
>>> 
>>> On Tue, Nov 26, 2024, 21:30 [email protected]  
>>> mailto:[email protected]>> wrote:
 I'd argue against changing this. The current behavior's intent is not 
 to check whether the metadata is valid, it is to detect whether the 
 table is an Iceberg table. It ignores non-Iceberg tables. Changing 
 that behavior would be surprising, especially if we started throwing 
 exceptions.
 
 On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu >>> > wrote:
> > Should add, my personal preference is probably not to change the 
> > existing behavior for this part
> 
> +1. I realized that this is not a new behavior. The `loadTable` 
> implementation has this problem too. 
> It would be good to have a test case specifically for this edge case 
> and maybe call this out in the documentation. 
> 
> Thanks,
> Kevin Liu
> 
> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho  > wrote:
>> Should add, my personal preference is proba

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Péter Váry
+1 from my side too.

I wanted to make sure that the community is aware of this change which will
bring behavioral difference compared to other catalogs. This is why I have
asked Steve to start this thread.


On Thu, Nov 28, 2024, 02:10 Szehon Ho  wrote:

> Yea, I think that part is definitely kept.
>
> Thanks
> Szehon
>
> On Wed, Nov 27, 2024 at 12:02 PM [email protected] 
> wrote:
>
>> I'd support changing the behavior if we still have a way to match the
>> intent, which is to return true if the table exists in Hive and is an
>> Iceberg table.
>>
>> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho 
>> wrote:
>>
>>> Hm I think the thread got a bit sidetracked by the other question.
>>>
>>> The initial proposal by Steve is a performance improvement for
>>> HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
>>> table metadata, and if successful returns true.  The proposal is to load
>>> from Hive only, and return true if Hive metadata identifies that an Iceberg
>>> table exists with this name.
>>>
>>> Checking corruption of Iceberg's table metadata.json is a side-effect of
>>> the current behavior, but would not anymore with the proposed change.
>>> That's the question of the original thread, and so far there's agreement
>>> that it is not necessarily part of this scope of HiveCatalog's
>>> tableExists().
>>>
>>> At least this is my understanding.
>>> Thanks,
>>> Szehon
>>>
>>> On Wed, Nov 27, 2024 at 10:56 AM [email protected] 
>>> wrote:
>>>
 What kind of corruption are you referring to? I would expect corruption
 to result in an exception when loading the table, but that the table should
 still exist. The problem is likely that we determine if a table exists by
 attempting to load it. We could fix that by not attempting to load the
 table. I think that's a reasonable solution.

 On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang 
 wrote:

> The current behavior's intent is not to check whether the metadata is
>> valid, it is to detect whether the table is an Iceberg table.
>
>
> Is there a way to detect this from HiveCatalog without loading the
> table?
>
>
> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry <
> [email protected]> wrote:
>
>> I think we have an agreement, not to change the behavior wrt existing
>> non-Iceberg tables, and throw an exception.
>>
>> Are we also in agreement with the original proposal to return true
>> when the table exists but the metadata is somehow corrupted? Note: this 
>> is
>> the proposed change of behavior why the thread was originally started.
>>
>> On Tue, Nov 26, 2024, 21:30 [email protected] 
>> wrote:
>>
>>> I'd argue against changing this. The current behavior's intent is
>>> not to check whether the metadata is valid, it is to detect whether the
>>> table is an Iceberg table. It ignores non-Iceberg tables. Changing that
>>> behavior would be surprising, especially if we started throwing 
>>> exceptions.
>>>
>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu 
>>> wrote:
>>>
 > Should add, my personal preference is probably not to change the
 existing behavior for this part

 +1. I realized that this is not a new behavior. The `loadTable`
 implementation has this problem too.
 It would be good to have a test case specifically for this edge
 case and maybe call this out in the documentation.

 Thanks,
 Kevin Liu

 On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
 wrote:

> Should add, my personal preference is probably not to change the
> existing behavior for this part (false, if exists a Hive table with 
> same
> name) at the moment, just adding another possibility for 
> consideration.
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho 
> wrote:
>
>> Thanks Kevin and Gabor, this is an interesting discussion.  I
>> guess a third option instead of returning true/false in this case, 
>> is to
>> change it to throw an NoSuchIcebergTableException if its a 
>> non-Iceberg
>> table, which I think is actually what this pr does?
>>
>> Thanks
>> Szehon
>>
>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>  wrote:
>>
>>> Hey,
>>>
>>> I think what Kevin says makes sense. However, it would then
>>> confuse the opposite use case of this function. Let's assume that 
>>> we change
>>> the implementation of tableExists() to not load the table 
>>> internally:
>>>
>>> if (tableExists(table_name)) {
>>> table = loadTable(table_name);
>>> }
>>>
>>> Here, you find that the table exists but when you t

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Szehon Ho
Yea, I think that part is definitely kept.

Thanks
Szehon

On Wed, Nov 27, 2024 at 12:02 PM [email protected]  wrote:

> I'd support changing the behavior if we still have a way to match the
> intent, which is to return true if the table exists in Hive and is an
> Iceberg table.
>
> On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho 
> wrote:
>
>> Hm I think the thread got a bit sidetracked by the other question.
>>
>> The initial proposal by Steve is a performance improvement for
>> HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
>> table metadata, and if successful returns true.  The proposal is to load
>> from Hive only, and return true if Hive metadata identifies that an Iceberg
>> table exists with this name.
>>
>> Checking corruption of Iceberg's table metadata.json is a side-effect of
>> the current behavior, but would not anymore with the proposed change.
>> That's the question of the original thread, and so far there's agreement
>> that it is not necessarily part of this scope of HiveCatalog's
>> tableExists().
>>
>> At least this is my understanding.
>> Thanks,
>> Szehon
>>
>> On Wed, Nov 27, 2024 at 10:56 AM [email protected] 
>> wrote:
>>
>>> What kind of corruption are you referring to? I would expect corruption
>>> to result in an exception when loading the table, but that the table should
>>> still exist. The problem is likely that we determine if a table exists by
>>> attempting to load it. We could fix that by not attempting to load the
>>> table. I think that's a reasonable solution.
>>>
>>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang 
>>> wrote:
>>>
 The current behavior's intent is not to check whether the metadata is
> valid, it is to detect whether the table is an Iceberg table.


 Is there a way to detect this from HiveCatalog without loading the
 table?


 On Wed, Nov 27, 2024 at 2:01 PM Péter Váry 
 wrote:

> I think we have an agreement, not to change the behavior wrt existing
> non-Iceberg tables, and throw an exception.
>
> Are we also in agreement with the original proposal to return true
> when the table exists but the metadata is somehow corrupted? Note: this is
> the proposed change of behavior why the thread was originally started.
>
> On Tue, Nov 26, 2024, 21:30 [email protected]  wrote:
>
>> I'd argue against changing this. The current behavior's intent is not
>> to check whether the metadata is valid, it is to detect whether the table
>> is an Iceberg table. It ignores non-Iceberg tables. Changing that 
>> behavior
>> would be surprising, especially if we started throwing exceptions.
>>
>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu 
>> wrote:
>>
>>> > Should add, my personal preference is probably not to change the
>>> existing behavior for this part
>>>
>>> +1. I realized that this is not a new behavior. The `loadTable`
>>> implementation has this problem too.
>>> It would be good to have a test case specifically for this edge case
>>> and maybe call this out in the documentation.
>>>
>>> Thanks,
>>> Kevin Liu
>>>
>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
>>> wrote:
>>>
 Should add, my personal preference is probably not to change the
 existing behavior for this part (false, if exists a Hive table with 
 same
 name) at the moment, just adding another possibility for consideration.

 Thanks
 Szehon

 On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho 
 wrote:

> Thanks Kevin and Gabor, this is an interesting discussion.  I
> guess a third option instead of returning true/false in this case, is 
> to
> change it to throw an NoSuchIcebergTableException if its a non-Iceberg
> table, which I think is actually what this pr does?
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>  wrote:
>
>> Hey,
>>
>> I think what Kevin says makes sense. However, it would then
>> confuse the opposite use case of this function. Let's assume that we 
>> change
>> the implementation of tableExists() to not load the table internally:
>>
>> if (tableExists(table_name)) {
>> table = loadTable(table_name);
>> }
>>
>> Here, you find that the table exists but when you try to load it
>> it fails because it's not an Iceberg table. I don't think that any 
>> of these
>> 2 are intuitive. I think the question here is how much an API of the
>> Iceberg table format should know about the existence of tables in 
>> other
>> formats.
>>
>> If `tableExists` is meant to check for conflicting entries in the
>>> HMS
>>
>> Another interpretation of calli

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread [email protected]
I'd support changing the behavior if we still have a way to match the
intent, which is to return true if the table exists in Hive and is an
Iceberg table.

On Wed, Nov 27, 2024 at 11:26 AM Szehon Ho  wrote:

> Hm I think the thread got a bit sidetracked by the other question.
>
> The initial proposal by Steve is a performance improvement for
> HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
> table metadata, and if successful returns true.  The proposal is to load
> from Hive only, and return true if Hive metadata identifies that an Iceberg
> table exists with this name.
>
> Checking corruption of Iceberg's table metadata.json is a side-effect of
> the current behavior, but would not anymore with the proposed change.
> That's the question of the original thread, and so far there's agreement
> that it is not necessarily part of this scope of HiveCatalog's
> tableExists().
>
> At least this is my understanding.
> Thanks,
> Szehon
>
> On Wed, Nov 27, 2024 at 10:56 AM [email protected] 
> wrote:
>
>> What kind of corruption are you referring to? I would expect corruption
>> to result in an exception when loading the table, but that the table should
>> still exist. The problem is likely that we determine if a table exists by
>> attempting to load it. We could fix that by not attempting to load the
>> table. I think that's a reasonable solution.
>>
>> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang 
>> wrote:
>>
>>> The current behavior's intent is not to check whether the metadata is
 valid, it is to detect whether the table is an Iceberg table.
>>>
>>>
>>> Is there a way to detect this from HiveCatalog without loading the
>>> table?
>>>
>>>
>>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry 
>>> wrote:
>>>
 I think we have an agreement, not to change the behavior wrt existing
 non-Iceberg tables, and throw an exception.

 Are we also in agreement with the original proposal to return true when
 the table exists but the metadata is somehow corrupted? Note: this is the
 proposed change of behavior why the thread was originally started.

 On Tue, Nov 26, 2024, 21:30 [email protected]  wrote:

> I'd argue against changing this. The current behavior's intent is not
> to check whether the metadata is valid, it is to detect whether the table
> is an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
> would be surprising, especially if we started throwing exceptions.
>
> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu 
> wrote:
>
>> > Should add, my personal preference is probably not to change the
>> existing behavior for this part
>>
>> +1. I realized that this is not a new behavior. The `loadTable`
>> implementation has this problem too.
>> It would be good to have a test case specifically for this edge case
>> and maybe call this out in the documentation.
>>
>> Thanks,
>> Kevin Liu
>>
>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
>> wrote:
>>
>>> Should add, my personal preference is probably not to change the
>>> existing behavior for this part (false, if exists a Hive table with same
>>> name) at the moment, just adding another possibility for consideration.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho 
>>> wrote:
>>>
 Thanks Kevin and Gabor, this is an interesting discussion.  I guess
 a third option instead of returning true/false in this case, is to 
 change
 it to throw an NoSuchIcebergTableException if its a non-Iceberg table,
 which I think is actually what this pr does?

 Thanks
 Szehon

 On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
  wrote:

> Hey,
>
> I think what Kevin says makes sense. However, it would then
> confuse the opposite use case of this function. Let's assume that we 
> change
> the implementation of tableExists() to not load the table internally:
>
> if (tableExists(table_name)) {
> table = loadTable(table_name);
> }
>
> Here, you find that the table exists but when you try to load it
> it fails because it's not an Iceberg table. I don't think that any of 
> these
> 2 are intuitive. I think the question here is how much an API of the
> Iceberg table format should know about the existence of tables in 
> other
> formats.
>
> If `tableExists` is meant to check for conflicting entries in the
>> HMS
>
> Another interpretation of calling Catalog.tableExists() on an
> Iceberg API is instead "is there such an Iceberg table". TBH, not 
> sure if
> any of the 2 approaches are better than the other, I just wanted to 
> show
> that there is another side of the c

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Szehon Ho
Hm I think the thread got a bit sidetracked by the other question.

The initial proposal by Steve is a performance improvement for
HiveCatalog's tableExists().  Currently it loads both Hive and Iceberg
table metadata, and if successful returns true.  The proposal is to load
from Hive only, and return true if Hive metadata identifies that an Iceberg
table exists with this name.

Checking corruption of Iceberg's table metadata.json is a side-effect of
the current behavior, but would not anymore with the proposed change.
That's the question of the original thread, and so far there's agreement
that it is not necessarily part of this scope of HiveCatalog's
tableExists().

At least this is my understanding.
Thanks,
Szehon

On Wed, Nov 27, 2024 at 10:56 AM [email protected]  wrote:

> What kind of corruption are you referring to? I would expect corruption to
> result in an exception when loading the table, but that the table should
> still exist. The problem is likely that we determine if a table exists by
> attempting to load it. We could fix that by not attempting to load the
> table. I think that's a reasonable solution.
>
> On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang 
> wrote:
>
>> The current behavior's intent is not to check whether the metadata is
>>> valid, it is to detect whether the table is an Iceberg table.
>>
>>
>> Is there a way to detect this from HiveCatalog without loading the table?
>>
>>
>> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry 
>> wrote:
>>
>>> I think we have an agreement, not to change the behavior wrt existing
>>> non-Iceberg tables, and throw an exception.
>>>
>>> Are we also in agreement with the original proposal to return true when
>>> the table exists but the metadata is somehow corrupted? Note: this is the
>>> proposed change of behavior why the thread was originally started.
>>>
>>> On Tue, Nov 26, 2024, 21:30 [email protected]  wrote:
>>>
 I'd argue against changing this. The current behavior's intent is not
 to check whether the metadata is valid, it is to detect whether the table
 is an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
 would be surprising, especially if we started throwing exceptions.

 On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu 
 wrote:

> > Should add, my personal preference is probably not to change the
> existing behavior for this part
>
> +1. I realized that this is not a new behavior. The `loadTable`
> implementation has this problem too.
> It would be good to have a test case specifically for this edge case
> and maybe call this out in the documentation.
>
> Thanks,
> Kevin Liu
>
> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
> wrote:
>
>> Should add, my personal preference is probably not to change the
>> existing behavior for this part (false, if exists a Hive table with same
>> name) at the moment, just adding another possibility for consideration.
>>
>> Thanks
>> Szehon
>>
>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho 
>> wrote:
>>
>>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess
>>> a third option instead of returning true/false in this case, is to 
>>> change
>>> it to throw an NoSuchIcebergTableException if its a non-Iceberg table,
>>> which I think is actually what this pr does?
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>>  wrote:
>>>
 Hey,

 I think what Kevin says makes sense. However, it would then confuse
 the opposite use case of this function. Let's assume that we change the
 implementation of tableExists() to not load the table internally:

 if (tableExists(table_name)) {
 table = loadTable(table_name);
 }

 Here, you find that the table exists but when you try to load it it
 fails because it's not an Iceberg table. I don't think that any of 
 these 2
 are intuitive. I think the question here is how much an API of the 
 Iceberg
 table format should know about the existence of tables in other 
 formats.

 If `tableExists` is meant to check for conflicting entries in the
> HMS

 Another interpretation of calling Catalog.tableExists() on an
 Iceberg API is instead "is there such an Iceberg table". TBH, not sure 
 if
 any of the 2 approaches are better than the other, I just wanted to 
 show
 that there is another side of the coin :)

 Regards,
 Gabor

 On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu 
 wrote:

> Hi Steve,
>
> This makes sense to me. The semantics of `tableExists` focus on
> whether a table's name exists in the catalog. For the Hive catalog,
> checking the HMS entry should b

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread [email protected]
What kind of corruption are you referring to? I would expect corruption to
result in an exception when loading the table, but that the table should
still exist. The problem is likely that we determine if a table exists by
attempting to load it. We could fix that by not attempting to load the
table. I think that's a reasonable solution.

On Wed, Nov 27, 2024 at 12:45 AM Manu Zhang  wrote:

> The current behavior's intent is not to check whether the metadata is
>> valid, it is to detect whether the table is an Iceberg table.
>
>
> Is there a way to detect this from HiveCatalog without loading the table?
>
>
> On Wed, Nov 27, 2024 at 2:01 PM Péter Váry 
> wrote:
>
>> I think we have an agreement, not to change the behavior wrt existing
>> non-Iceberg tables, and throw an exception.
>>
>> Are we also in agreement with the original proposal to return true when
>> the table exists but the metadata is somehow corrupted? Note: this is the
>> proposed change of behavior why the thread was originally started.
>>
>> On Tue, Nov 26, 2024, 21:30 [email protected]  wrote:
>>
>>> I'd argue against changing this. The current behavior's intent is not to
>>> check whether the metadata is valid, it is to detect whether the table is
>>> an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
>>> would be surprising, especially if we started throwing exceptions.
>>>
>>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu  wrote:
>>>
 > Should add, my personal preference is probably not to change the
 existing behavior for this part

 +1. I realized that this is not a new behavior. The `loadTable`
 implementation has this problem too.
 It would be good to have a test case specifically for this edge case
 and maybe call this out in the documentation.

 Thanks,
 Kevin Liu

 On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
 wrote:

> Should add, my personal preference is probably not to change the
> existing behavior for this part (false, if exists a Hive table with same
> name) at the moment, just adding another possibility for consideration.
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho 
> wrote:
>
>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
>> third option instead of returning true/false in this case, is to change 
>> it
>> to throw an NoSuchIcebergTableException if its a non-Iceberg table, 
>> which I
>> think is actually what this pr does?
>>
>> Thanks
>> Szehon
>>
>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>  wrote:
>>
>>> Hey,
>>>
>>> I think what Kevin says makes sense. However, it would then confuse
>>> the opposite use case of this function. Let's assume that we change the
>>> implementation of tableExists() to not load the table internally:
>>>
>>> if (tableExists(table_name)) {
>>> table = loadTable(table_name);
>>> }
>>>
>>> Here, you find that the table exists but when you try to load it it
>>> fails because it's not an Iceberg table. I don't think that any of 
>>> these 2
>>> are intuitive. I think the question here is how much an API of the 
>>> Iceberg
>>> table format should know about the existence of tables in other formats.
>>>
>>> If `tableExists` is meant to check for conflicting entries in the HMS
>>>
>>> Another interpretation of calling Catalog.tableExists() on an
>>> Iceberg API is instead "is there such an Iceberg table". TBH, not sure 
>>> if
>>> any of the 2 approaches are better than the other, I just wanted to show
>>> that there is another side of the coin :)
>>>
>>> Regards,
>>> Gabor
>>>
>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu 
>>> wrote:
>>>
 Hi Steve,

 This makes sense to me. The semantics of `tableExists` focus on
 whether a table's name exists in the catalog. For the Hive catalog,
 checking the HMS entry should be sufficient.

 I do have a question about usage, though. Typically, I would use `
 tableExists` like this:

 ```
 if (!tableExists(table_name)) {
 table = createTable(table_name);
 }
 ```
 What happens when a Hive table with the same name already exists in
 the catalog? In the current implementation, `tableExists` would return
 `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
 `NoSuchTableException`.
 This would cause the code above to attempt to create the table,
 only to fail since the name already exists in the HMS.
 If `tableExists` is meant to check for conflicting entries in the
 HMS, perhaps it should return true even when a Hive table with the same
 name exists.

 I’d love to hear your thoughts on this.

 Bes

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-27 Thread Manu Zhang
>
> The current behavior's intent is not to check whether the metadata is
> valid, it is to detect whether the table is an Iceberg table.


Is there a way to detect this from HiveCatalog without loading the table?


On Wed, Nov 27, 2024 at 2:01 PM Péter Váry 
wrote:

> I think we have an agreement, not to change the behavior wrt existing
> non-Iceberg tables, and throw an exception.
>
> Are we also in agreement with the original proposal to return true when
> the table exists but the metadata is somehow corrupted? Note: this is the
> proposed change of behavior why the thread was originally started.
>
> On Tue, Nov 26, 2024, 21:30 [email protected]  wrote:
>
>> I'd argue against changing this. The current behavior's intent is not to
>> check whether the metadata is valid, it is to detect whether the table is
>> an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
>> would be surprising, especially if we started throwing exceptions.
>>
>> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu  wrote:
>>
>>> > Should add, my personal preference is probably not to change the
>>> existing behavior for this part
>>>
>>> +1. I realized that this is not a new behavior. The `loadTable`
>>> implementation has this problem too.
>>> It would be good to have a test case specifically for this edge case and
>>> maybe call this out in the documentation.
>>>
>>> Thanks,
>>> Kevin Liu
>>>
>>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
>>> wrote:
>>>
 Should add, my personal preference is probably not to change the
 existing behavior for this part (false, if exists a Hive table with same
 name) at the moment, just adding another possibility for consideration.

 Thanks
 Szehon

 On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho 
 wrote:

> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
> third option instead of returning true/false in this case, is to change it
> to throw an NoSuchIcebergTableException if its a non-Iceberg table, which 
> I
> think is actually what this pr does?
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>  wrote:
>
>> Hey,
>>
>> I think what Kevin says makes sense. However, it would then confuse
>> the opposite use case of this function. Let's assume that we change the
>> implementation of tableExists() to not load the table internally:
>>
>> if (tableExists(table_name)) {
>> table = loadTable(table_name);
>> }
>>
>> Here, you find that the table exists but when you try to load it it
>> fails because it's not an Iceberg table. I don't think that any of these 
>> 2
>> are intuitive. I think the question here is how much an API of the 
>> Iceberg
>> table format should know about the existence of tables in other formats.
>>
>> If `tableExists` is meant to check for conflicting entries in the HMS
>>
>> Another interpretation of calling Catalog.tableExists() on an Iceberg
>> API is instead "is there such an Iceberg table". TBH, not sure if any of
>> the 2 approaches are better than the other, I just wanted to show that
>> there is another side of the coin :)
>>
>> Regards,
>> Gabor
>>
>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu 
>> wrote:
>>
>>> Hi Steve,
>>>
>>> This makes sense to me. The semantics of `tableExists` focus on
>>> whether a table's name exists in the catalog. For the Hive catalog,
>>> checking the HMS entry should be sufficient.
>>>
>>> I do have a question about usage, though. Typically, I would use `
>>> tableExists` like this:
>>>
>>> ```
>>> if (!tableExists(table_name)) {
>>> table = createTable(table_name);
>>> }
>>> ```
>>> What happens when a Hive table with the same name already exists in
>>> the catalog? In the current implementation, `tableExists` would return
>>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
>>> `NoSuchTableException`.
>>> This would cause the code above to attempt to create the table, only
>>> to fail since the name already exists in the HMS.
>>> If `tableExists` is meant to check for conflicting entries in the
>>> HMS, perhaps it should return true even when a Hive table with the same
>>> name exists.
>>>
>>> I’d love to hear your thoughts on this.
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho 
>>> wrote:
>>>
 Hi,

 It's a good performance find and improvement.   Left some comment
 on the PR.

 IMO, the behavior actually more matches the API javadoc ("Check
 whether table exists"), not whether it is corrupted or not, so I'm
 supportive of it.

 Thanks
 Szehon

 On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
  wrote:
>>

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-26 Thread Péter Váry
I think we have an agreement, not to change the behavior wrt existing
non-Iceberg tables, and throw an exception.

Are we also in agreement with the original proposal to return true when the
table exists but the metadata is somehow corrupted? Note: this is the
proposed change of behavior why the thread was originally started.

On Tue, Nov 26, 2024, 21:30 [email protected]  wrote:

> I'd argue against changing this. The current behavior's intent is not to
> check whether the metadata is valid, it is to detect whether the table is
> an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
> would be surprising, especially if we started throwing exceptions.
>
> On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu  wrote:
>
>> > Should add, my personal preference is probably not to change the
>> existing behavior for this part
>>
>> +1. I realized that this is not a new behavior. The `loadTable`
>> implementation has this problem too.
>> It would be good to have a test case specifically for this edge case and
>> maybe call this out in the documentation.
>>
>> Thanks,
>> Kevin Liu
>>
>> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
>> wrote:
>>
>>> Should add, my personal preference is probably not to change the
>>> existing behavior for this part (false, if exists a Hive table with same
>>> name) at the moment, just adding another possibility for consideration.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho 
>>> wrote:
>>>
 Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
 third option instead of returning true/false in this case, is to change it
 to throw an NoSuchIcebergTableException if its a non-Iceberg table, which I
 think is actually what this pr does?

 Thanks
 Szehon

 On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
  wrote:

> Hey,
>
> I think what Kevin says makes sense. However, it would then confuse
> the opposite use case of this function. Let's assume that we change the
> implementation of tableExists() to not load the table internally:
>
> if (tableExists(table_name)) {
> table = loadTable(table_name);
> }
>
> Here, you find that the table exists but when you try to load it it
> fails because it's not an Iceberg table. I don't think that any of these 2
> are intuitive. I think the question here is how much an API of the Iceberg
> table format should know about the existence of tables in other formats.
>
> If `tableExists` is meant to check for conflicting entries in the HMS
>
> Another interpretation of calling Catalog.tableExists() on an Iceberg
> API is instead "is there such an Iceberg table". TBH, not sure if any of
> the 2 approaches are better than the other, I just wanted to show that
> there is another side of the coin :)
>
> Regards,
> Gabor
>
> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu 
> wrote:
>
>> Hi Steve,
>>
>> This makes sense to me. The semantics of `tableExists` focus on
>> whether a table's name exists in the catalog. For the Hive catalog,
>> checking the HMS entry should be sufficient.
>>
>> I do have a question about usage, though. Typically, I would use `
>> tableExists` like this:
>>
>> ```
>> if (!tableExists(table_name)) {
>> table = createTable(table_name);
>> }
>> ```
>> What happens when a Hive table with the same name already exists in
>> the catalog? In the current implementation, `tableExists` would return
>> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
>> `NoSuchTableException`.
>> This would cause the code above to attempt to create the table, only
>> to fail since the name already exists in the HMS.
>> If `tableExists` is meant to check for conflicting entries in the
>> HMS, perhaps it should return true even when a Hive table with the same
>> name exists.
>>
>> I’d love to hear your thoughts on this.
>>
>> Best,
>> Kevin Liu
>>
>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho 
>> wrote:
>>
>>> Hi,
>>>
>>> It's a good performance find and improvement.   Left some comment on
>>> the PR.
>>>
>>> IMO, the behavior actually more matches the API javadoc ("Check
>>> whether table exists"), not whether it is corrupted or not, so I'm
>>> supportive of it.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>  wrote:
>>>
 Hi Iceberger,

   I have a proposal to simplify the tableExists API in the Hive
 catalog, which involves a behavior change, and I’d like to hear your
 thoughts.

   Currently, in our catalog interface[1], the tableExists method
 is implemented as a default API by invoking the loadTable method.
 It returns true if the table can be loaded with

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-26 Thread [email protected]
I'd argue against changing this. The current behavior's intent is not to
check whether the metadata is valid, it is to detect whether the table is
an Iceberg table. It ignores non-Iceberg tables. Changing that behavior
would be surprising, especially if we started throwing exceptions.

On Fri, Nov 22, 2024 at 2:01 PM Kevin Liu  wrote:

> > Should add, my personal preference is probably not to change the
> existing behavior for this part
>
> +1. I realized that this is not a new behavior. The `loadTable`
> implementation has this problem too.
> It would be good to have a test case specifically for this edge case and
> maybe call this out in the documentation.
>
> Thanks,
> Kevin Liu
>
> On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho 
> wrote:
>
>> Should add, my personal preference is probably not to change the existing
>> behavior for this part (false, if exists a Hive table with same name) at
>> the moment, just adding another possibility for consideration.
>>
>> Thanks
>> Szehon
>>
>> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho 
>> wrote:
>>
>>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
>>> third option instead of returning true/false in this case, is to change it
>>> to throw an NoSuchIcebergTableException if its a non-Iceberg table, which I
>>> think is actually what this pr does?
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>>  wrote:
>>>
 Hey,

 I think what Kevin says makes sense. However, it would then confuse the
 opposite use case of this function. Let's assume that we change the
 implementation of tableExists() to not load the table internally:

 if (tableExists(table_name)) {
 table = loadTable(table_name);
 }

 Here, you find that the table exists but when you try to load it it
 fails because it's not an Iceberg table. I don't think that any of these 2
 are intuitive. I think the question here is how much an API of the Iceberg
 table format should know about the existence of tables in other formats.

 If `tableExists` is meant to check for conflicting entries in the HMS

 Another interpretation of calling Catalog.tableExists() on an Iceberg
 API is instead "is there such an Iceberg table". TBH, not sure if any of
 the 2 approaches are better than the other, I just wanted to show that
 there is another side of the coin :)

 Regards,
 Gabor

 On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu 
 wrote:

> Hi Steve,
>
> This makes sense to me. The semantics of `tableExists` focus on
> whether a table's name exists in the catalog. For the Hive catalog,
> checking the HMS entry should be sufficient.
>
> I do have a question about usage, though. Typically, I would use `
> tableExists` like this:
>
> ```
> if (!tableExists(table_name)) {
> table = createTable(table_name);
> }
> ```
> What happens when a Hive table with the same name already exists in
> the catalog? In the current implementation, `tableExists` would return
> `false` because `HiveOperationsBase.validateTableIsIceberg` throws a
> `NoSuchTableException`.
> This would cause the code above to attempt to create the table, only
> to fail since the name already exists in the HMS.
> If `tableExists` is meant to check for conflicting entries in the HMS,
> perhaps it should return true even when a Hive table with the same name
> exists.
>
> I’d love to hear your thoughts on this.
>
> Best,
> Kevin Liu
>
> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho 
> wrote:
>
>> Hi,
>>
>> It's a good performance find and improvement.   Left some comment on
>> the PR.
>>
>> IMO, the behavior actually more matches the API javadoc ("Check
>> whether table exists"), not whether it is corrupted or not, so I'm
>> supportive of it.
>>
>> Thanks
>> Szehon
>>
>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>  wrote:
>>
>>> Hi Iceberger,
>>>
>>>   I have a proposal to simplify the tableExists API in the Hive
>>> catalog, which involves a behavior change, and I’d like to hear your
>>> thoughts.
>>>
>>>   Currently, in our catalog interface[1], the tableExists method is
>>> implemented as a default API by invoking the loadTable method. It
>>> returns true if the table can be loaded without exceptions. This
>>> behavior implies two checks:
>>>
>>>1. The table entry exists in the catalog.
>>>2. The latest metadata.json for the table is not corrupted.
>>>
>>>   The behavior change I’m proposing focuses only on the first
>>> condition—checking if the table entry exists in the catalog. This 
>>> separates
>>> the concerns of table existence and table health (e.g., metadata not
>>> corrupted). Such a change could improve the performance of existence

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Kevin Liu
> Should add, my personal preference is probably not to change the existing
behavior for this part

+1. I realized that this is not a new behavior. The `loadTable`
implementation has this problem too.
It would be good to have a test case specifically for this edge case and
maybe call this out in the documentation.

Thanks,
Kevin Liu

On Fri, Nov 22, 2024 at 11:57 AM Szehon Ho  wrote:

> Should add, my personal preference is probably not to change the existing
> behavior for this part (false, if exists a Hive table with same name) at
> the moment, just adding another possibility for consideration.
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho  wrote:
>
>> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
>> third option instead of returning true/false in this case, is to change it
>> to throw an NoSuchIcebergTableException if its a non-Iceberg table, which I
>> think is actually what this pr does?
>>
>> Thanks
>> Szehon
>>
>> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>>  wrote:
>>
>>> Hey,
>>>
>>> I think what Kevin says makes sense. However, it would then confuse the
>>> opposite use case of this function. Let's assume that we change the
>>> implementation of tableExists() to not load the table internally:
>>>
>>> if (tableExists(table_name)) {
>>> table = loadTable(table_name);
>>> }
>>>
>>> Here, you find that the table exists but when you try to load it it
>>> fails because it's not an Iceberg table. I don't think that any of these 2
>>> are intuitive. I think the question here is how much an API of the Iceberg
>>> table format should know about the existence of tables in other formats.
>>>
>>> If `tableExists` is meant to check for conflicting entries in the HMS
>>>
>>> Another interpretation of calling Catalog.tableExists() on an Iceberg
>>> API is instead "is there such an Iceberg table". TBH, not sure if any of
>>> the 2 approaches are better than the other, I just wanted to show that
>>> there is another side of the coin :)
>>>
>>> Regards,
>>> Gabor
>>>
>>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu 
>>> wrote:
>>>
 Hi Steve,

 This makes sense to me. The semantics of `tableExists` focus on whether
 a table's name exists in the catalog. For the Hive catalog, checking the
 HMS entry should be sufficient.

 I do have a question about usage, though. Typically, I would use `
 tableExists` like this:

 ```
 if (!tableExists(table_name)) {
 table = createTable(table_name);
 }
 ```
 What happens when a Hive table with the same name already exists in the
 catalog? In the current implementation, `tableExists` would return `false`
 because `HiveOperationsBase.validateTableIsIceberg` throws a
 `NoSuchTableException`.
 This would cause the code above to attempt to create the table, only to
 fail since the name already exists in the HMS.
 If `tableExists` is meant to check for conflicting entries in the HMS,
 perhaps it should return true even when a Hive table with the same name
 exists.

 I’d love to hear your thoughts on this.

 Best,
 Kevin Liu

 On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho 
 wrote:

> Hi,
>
> It's a good performance find and improvement.   Left some comment on
> the PR.
>
> IMO, the behavior actually more matches the API javadoc ("Check
> whether table exists"), not whether it is corrupted or not, so I'm
> supportive of it.
>
> Thanks
> Szehon
>
> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>  wrote:
>
>> Hi Iceberger,
>>
>>   I have a proposal to simplify the tableExists API in the Hive
>> catalog, which involves a behavior change, and I’d like to hear your
>> thoughts.
>>
>>   Currently, in our catalog interface[1], the tableExists method is
>> implemented as a default API by invoking the loadTable method. It
>> returns true if the table can be loaded without exceptions. This
>> behavior implies two checks:
>>
>>1. The table entry exists in the catalog.
>>2. The latest metadata.json for the table is not corrupted.
>>
>>   The behavior change I’m proposing focuses only on the first
>> condition—checking if the table entry exists in the catalog. This 
>> separates
>> the concerns of table existence and table health (e.g., metadata not
>> corrupted). Such a change could improve the performance of existence
>> checks, especially for RESTcatalog where table existence is abstracted as
>> an HTTP HEAD request [2].
>>
>> I also reviewed the current usage of the tableExists API in the
>> Iceberg codebase to ensure that this optimization would not have any
>> negative impact.
>>
>> I’d love to hear everyone’s feedback on this! If there’s consensus, I
>> can follow up with a similar optimization for the viewExists method
>> in the Hive catalog.

Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Szehon Ho
Should add, my personal preference is probably not to change the existing
behavior for this part (false, if exists a Hive table with same name) at
the moment, just adding another possibility for consideration.

Thanks
Szehon

On Fri, Nov 22, 2024 at 2:00 AM Szehon Ho  wrote:

> Thanks Kevin and Gabor, this is an interesting discussion.  I guess a
> third option instead of returning true/false in this case, is to change it
> to throw an NoSuchIcebergTableException if its a non-Iceberg table, which I
> think is actually what this pr does?
>
> Thanks
> Szehon
>
> On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
>  wrote:
>
>> Hey,
>>
>> I think what Kevin says makes sense. However, it would then confuse the
>> opposite use case of this function. Let's assume that we change the
>> implementation of tableExists() to not load the table internally:
>>
>> if (tableExists(table_name)) {
>> table = loadTable(table_name);
>> }
>>
>> Here, you find that the table exists but when you try to load it it fails
>> because it's not an Iceberg table. I don't think that any of these 2 are
>> intuitive. I think the question here is how much an API of the Iceberg
>> table format should know about the existence of tables in other formats.
>>
>> If `tableExists` is meant to check for conflicting entries in the HMS
>>
>> Another interpretation of calling Catalog.tableExists() on an Iceberg API
>> is instead "is there such an Iceberg table". TBH, not sure if any of the 2
>> approaches are better than the other, I just wanted to show that there is
>> another side of the coin :)
>>
>> Regards,
>> Gabor
>>
>> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu  wrote:
>>
>>> Hi Steve,
>>>
>>> This makes sense to me. The semantics of `tableExists` focus on whether
>>> a table's name exists in the catalog. For the Hive catalog, checking the
>>> HMS entry should be sufficient.
>>>
>>> I do have a question about usage, though. Typically, I would use `
>>> tableExists` like this:
>>>
>>> ```
>>> if (!tableExists(table_name)) {
>>> table = createTable(table_name);
>>> }
>>> ```
>>> What happens when a Hive table with the same name already exists in the
>>> catalog? In the current implementation, `tableExists` would return `false`
>>> because `HiveOperationsBase.validateTableIsIceberg` throws a
>>> `NoSuchTableException`.
>>> This would cause the code above to attempt to create the table, only to
>>> fail since the name already exists in the HMS.
>>> If `tableExists` is meant to check for conflicting entries in the HMS,
>>> perhaps it should return true even when a Hive table with the same name
>>> exists.
>>>
>>> I’d love to hear your thoughts on this.
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho 
>>> wrote:
>>>
 Hi,

 It's a good performance find and improvement.   Left some comment on
 the PR.

 IMO, the behavior actually more matches the API javadoc ("Check whether
 table exists"), not whether it is corrupted or not, so I'm supportive of 
 it.

 Thanks
 Szehon

 On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
  wrote:

> Hi Iceberger,
>
>   I have a proposal to simplify the tableExists API in the Hive
> catalog, which involves a behavior change, and I’d like to hear your
> thoughts.
>
>   Currently, in our catalog interface[1], the tableExists method is
> implemented as a default API by invoking the loadTable method. It
> returns true if the table can be loaded without exceptions. This
> behavior implies two checks:
>
>1. The table entry exists in the catalog.
>2. The latest metadata.json for the table is not corrupted.
>
>   The behavior change I’m proposing focuses only on the first
> condition—checking if the table entry exists in the catalog. This 
> separates
> the concerns of table existence and table health (e.g., metadata not
> corrupted). Such a change could improve the performance of existence
> checks, especially for RESTcatalog where table existence is abstracted as
> an HTTP HEAD request [2].
>
> I also reviewed the current usage of the tableExists API in the
> Iceberg codebase to ensure that this optimization would not have any
> negative impact.
>
> I’d love to hear everyone’s feedback on this! If there’s consensus, I
> can follow up with a similar optimization for the viewExists method
> in the Hive catalog.
>
> [1]: https://github.com/apache/iceberg/pull/11597
>
> [2]:
> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>
>
>
> Best regards,
> Steve Zhang
>
>
>
>


Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Szehon Ho
Thanks Kevin and Gabor, this is an interesting discussion.  I guess a third
option instead of returning true/false in this case, is to change it to
throw an NoSuchIcebergTableException if its a non-Iceberg table, which I
think is actually what this pr does?

Thanks
Szehon

On Fri, Nov 22, 2024 at 1:08 AM Gabor Kaszab
 wrote:

> Hey,
>
> I think what Kevin says makes sense. However, it would then confuse the
> opposite use case of this function. Let's assume that we change the
> implementation of tableExists() to not load the table internally:
>
> if (tableExists(table_name)) {
> table = loadTable(table_name);
> }
>
> Here, you find that the table exists but when you try to load it it fails
> because it's not an Iceberg table. I don't think that any of these 2 are
> intuitive. I think the question here is how much an API of the Iceberg
> table format should know about the existence of tables in other formats.
>
> If `tableExists` is meant to check for conflicting entries in the HMS
>
> Another interpretation of calling Catalog.tableExists() on an Iceberg API
> is instead "is there such an Iceberg table". TBH, not sure if any of the 2
> approaches are better than the other, I just wanted to show that there is
> another side of the coin :)
>
> Regards,
> Gabor
>
> On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu  wrote:
>
>> Hi Steve,
>>
>> This makes sense to me. The semantics of `tableExists` focus on whether a
>> table's name exists in the catalog. For the Hive catalog, checking the HMS
>> entry should be sufficient.
>>
>> I do have a question about usage, though. Typically, I would use `
>> tableExists` like this:
>>
>> ```
>> if (!tableExists(table_name)) {
>> table = createTable(table_name);
>> }
>> ```
>> What happens when a Hive table with the same name already exists in the
>> catalog? In the current implementation, `tableExists` would return `false`
>> because `HiveOperationsBase.validateTableIsIceberg` throws a
>> `NoSuchTableException`.
>> This would cause the code above to attempt to create the table, only to
>> fail since the name already exists in the HMS.
>> If `tableExists` is meant to check for conflicting entries in the HMS,
>> perhaps it should return true even when a Hive table with the same name
>> exists.
>>
>> I’d love to hear your thoughts on this.
>>
>> Best,
>> Kevin Liu
>>
>> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho 
>> wrote:
>>
>>> Hi,
>>>
>>> It's a good performance find and improvement.   Left some comment on the
>>> PR.
>>>
>>> IMO, the behavior actually more matches the API javadoc ("Check whether
>>> table exists"), not whether it is corrupted or not, so I'm supportive of it.
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>>  wrote:
>>>
 Hi Iceberger,

   I have a proposal to simplify the tableExists API in the Hive
 catalog, which involves a behavior change, and I’d like to hear your
 thoughts.

   Currently, in our catalog interface[1], the tableExists method is
 implemented as a default API by invoking the loadTable method. It
 returns true if the table can be loaded without exceptions. This
 behavior implies two checks:

1. The table entry exists in the catalog.
2. The latest metadata.json for the table is not corrupted.

   The behavior change I’m proposing focuses only on the first
 condition—checking if the table entry exists in the catalog. This separates
 the concerns of table existence and table health (e.g., metadata not
 corrupted). Such a change could improve the performance of existence
 checks, especially for RESTcatalog where table existence is abstracted as
 an HTTP HEAD request [2].

 I also reviewed the current usage of the tableExists API in the
 Iceberg codebase to ensure that this optimization would not have any
 negative impact.

 I’d love to hear everyone’s feedback on this! If there’s consensus, I
 can follow up with a similar optimization for the viewExists method in
 the Hive catalog.

 [1]: https://github.com/apache/iceberg/pull/11597

 [2]:
 https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133



 Best regards,
 Steve Zhang






Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-22 Thread Gabor Kaszab
Hey,

I think what Kevin says makes sense. However, it would then confuse the
opposite use case of this function. Let's assume that we change the
implementation of tableExists() to not load the table internally:

if (tableExists(table_name)) {
table = loadTable(table_name);
}

Here, you find that the table exists but when you try to load it it fails
because it's not an Iceberg table. I don't think that any of these 2 are
intuitive. I think the question here is how much an API of the Iceberg
table format should know about the existence of tables in other formats.

If `tableExists` is meant to check for conflicting entries in the HMS

Another interpretation of calling Catalog.tableExists() on an Iceberg API
is instead "is there such an Iceberg table". TBH, not sure if any of the 2
approaches are better than the other, I just wanted to show that there is
another side of the coin :)

Regards,
Gabor

On Fri, Nov 22, 2024 at 3:13 AM Kevin Liu  wrote:

> Hi Steve,
>
> This makes sense to me. The semantics of `tableExists` focus on whether a
> table's name exists in the catalog. For the Hive catalog, checking the HMS
> entry should be sufficient.
>
> I do have a question about usage, though. Typically, I would use `
> tableExists` like this:
>
> ```
> if (!tableExists(table_name)) {
> table = createTable(table_name);
> }
> ```
> What happens when a Hive table with the same name already exists in the
> catalog? In the current implementation, `tableExists` would return `false`
> because `HiveOperationsBase.validateTableIsIceberg` throws a
> `NoSuchTableException`.
> This would cause the code above to attempt to create the table, only to
> fail since the name already exists in the HMS.
> If `tableExists` is meant to check for conflicting entries in the HMS,
> perhaps it should return true even when a Hive table with the same name
> exists.
>
> I’d love to hear your thoughts on this.
>
> Best,
> Kevin Liu
>
> On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho  wrote:
>
>> Hi,
>>
>> It's a good performance find and improvement.   Left some comment on the
>> PR.
>>
>> IMO, the behavior actually more matches the API javadoc ("Check whether
>> table exists"), not whether it is corrupted or not, so I'm supportive of it.
>>
>> Thanks
>> Szehon
>>
>> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>>  wrote:
>>
>>> Hi Iceberger,
>>>
>>>   I have a proposal to simplify the tableExists API in the Hive
>>> catalog, which involves a behavior change, and I’d like to hear your
>>> thoughts.
>>>
>>>   Currently, in our catalog interface[1], the tableExists method is
>>> implemented as a default API by invoking the loadTable method. It
>>> returns true if the table can be loaded without exceptions. This
>>> behavior implies two checks:
>>>
>>>1. The table entry exists in the catalog.
>>>2. The latest metadata.json for the table is not corrupted.
>>>
>>>   The behavior change I’m proposing focuses only on the first
>>> condition—checking if the table entry exists in the catalog. This separates
>>> the concerns of table existence and table health (e.g., metadata not
>>> corrupted). Such a change could improve the performance of existence
>>> checks, especially for RESTcatalog where table existence is abstracted as
>>> an HTTP HEAD request [2].
>>>
>>> I also reviewed the current usage of the tableExists API in the Iceberg
>>> codebase to ensure that this optimization would not have any negative
>>> impact.
>>>
>>> I’d love to hear everyone’s feedback on this! If there’s consensus, I
>>> can follow up with a similar optimization for the viewExists method in
>>> the Hive catalog.
>>>
>>> [1]: https://github.com/apache/iceberg/pull/11597
>>>
>>> [2]:
>>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>>
>>>
>>>
>>> Best regards,
>>> Steve Zhang
>>>
>>>
>>>
>>>


Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-21 Thread Kevin Liu
Hi Steve,

This makes sense to me. The semantics of `tableExists` focus on whether a
table's name exists in the catalog. For the Hive catalog, checking the HMS
entry should be sufficient.

I do have a question about usage, though. Typically, I would use `
tableExists` like this:

```
if (!tableExists(table_name)) {
table = createTable(table_name);
}
```
What happens when a Hive table with the same name already exists in the
catalog? In the current implementation, `tableExists` would return `false`
because `HiveOperationsBase.validateTableIsIceberg` throws a
`NoSuchTableException`.
This would cause the code above to attempt to create the table, only to
fail since the name already exists in the HMS.
If `tableExists` is meant to check for conflicting entries in the HMS,
perhaps it should return true even when a Hive table with the same name
exists.

I’d love to hear your thoughts on this.

Best,
Kevin Liu

On Thu, Nov 21, 2024 at 5:22 PM Szehon Ho  wrote:

> Hi,
>
> It's a good performance find and improvement.   Left some comment on the
> PR.
>
> IMO, the behavior actually more matches the API javadoc ("Check whether
> table exists"), not whether it is corrupted or not, so I'm supportive of it.
>
> Thanks
> Szehon
>
> On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
>  wrote:
>
>> Hi Iceberger,
>>
>>   I have a proposal to simplify the tableExists API in the Hive catalog,
>> which involves a behavior change, and I’d like to hear your thoughts.
>>
>>   Currently, in our catalog interface[1], the tableExists method is
>> implemented as a default API by invoking the loadTable method. It
>> returns true if the table can be loaded without exceptions. This
>> behavior implies two checks:
>>
>>1. The table entry exists in the catalog.
>>2. The latest metadata.json for the table is not corrupted.
>>
>>   The behavior change I’m proposing focuses only on the first
>> condition—checking if the table entry exists in the catalog. This separates
>> the concerns of table existence and table health (e.g., metadata not
>> corrupted). Such a change could improve the performance of existence
>> checks, especially for RESTcatalog where table existence is abstracted as
>> an HTTP HEAD request [2].
>>
>> I also reviewed the current usage of the tableExists API in the Iceberg
>> codebase to ensure that this optimization would not have any negative
>> impact.
>>
>> I’d love to hear everyone’s feedback on this! If there’s consensus, I can
>> follow up with a similar optimization for the viewExists method in the
>> Hive catalog.
>>
>> [1]: https://github.com/apache/iceberg/pull/11597
>>
>> [2]:
>> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>>
>>
>>
>> Best regards,
>> Steve Zhang
>>
>>
>>
>>


Re: [Discuss] Simplify tableExists API in HiveCatalog

2024-11-21 Thread Szehon Ho
Hi,

It's a good performance find and improvement.   Left some comment on the PR.

IMO, the behavior actually more matches the API javadoc ("Check whether
table exists"), not whether it is corrupted or not, so I'm supportive of it.

Thanks
Szehon

On Thu, Nov 21, 2024 at 10:57 AM Steve Zhang
 wrote:

> Hi Iceberger,
>
>   I have a proposal to simplify the tableExists API in the Hive catalog,
> which involves a behavior change, and I’d like to hear your thoughts.
>
>   Currently, in our catalog interface[1], the tableExists method is
> implemented as a default API by invoking the loadTable method. It returns
> true if the table can be loaded without exceptions. This behavior implies
> two checks:
>
>1. The table entry exists in the catalog.
>2. The latest metadata.json for the table is not corrupted.
>
>   The behavior change I’m proposing focuses only on the first
> condition—checking if the table entry exists in the catalog. This separates
> the concerns of table existence and table health (e.g., metadata not
> corrupted). Such a change could improve the performance of existence
> checks, especially for RESTcatalog where table existence is abstracted as
> an HTTP HEAD request [2].
>
> I also reviewed the current usage of the tableExists API in the Iceberg
> codebase to ensure that this optimization would not have any negative
> impact.
>
> I’d love to hear everyone’s feedback on this! If there’s consensus, I can
> follow up with a similar optimization for the viewExists method in the
> Hive catalog.
>
> [1]: https://github.com/apache/iceberg/pull/11597
>
> [2]:
> https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
>
>
>
> Best regards,
> Steve Zhang
>
>
>
>