Re: Hive 4 support

2025-05-22 Thread Wing Yew Poon
Cheng,
You are right that Spark's IsolatedClientLoader is only used by its
HiveExternalCatalog.
In any case, that only adds to the argument to continue supporting
iceberg-hive-metastore on Hive 2.
In yesterday's community sync, there seemed to be agreement to support
publishing iceberg-hive-metastore jars that are built against different
versions of Hive, and to remove the iceberg-hive-metastore classes from the
runtime jars (for engines). However, Ryan Blue suggested that the latter
step should only be done in an Iceberg 2.0 release, as it would be a
breaking change for users.
- Wing Yew



On Tue, May 20, 2025 at 7:19 PM Cheng Pan  wrote:

> > Spark can be configured to use a different version
>
> That’s not true: Spark's IsolatedClientLoader only supports the Spark Hive
> session catalog, and Iceberg always uses Spark's built-in Hive jars, namely
> Hive 2.3.9 for Spark 3.x and Hive 2.3.10 for Spark 4.0.
>
> Thanks,
> Cheng Pan
>
>
>
> On May 21, 2025, at 03:06, Denys Kuzmenko  wrote:
>
> I don't believe continuing to build iceberg-hive-metastore jars for Hive 2
> is well justified.
> Hive 2 is deprecated and has known security vulnerabilities, which should
> encourage users to upgrade.
> As you pointed out, Spark can be configured to use a different version,
> but it's ultimately up to the Iceberg community to make that decision.
>
>
>


Re: Hive 4 support

2025-05-20 Thread Cheng Pan
> Spark can be configured to use a different version

That’s not true: Spark's IsolatedClientLoader only supports the Spark Hive
session catalog, and Iceberg always uses Spark's built-in Hive jars, namely
Hive 2.3.9 for Spark 3.x and Hive 2.3.10 for Spark 4.0.

Thanks,
Cheng Pan



> On May 21, 2025, at 03:06, Denys Kuzmenko  wrote:
> 
> I don't believe continuing to build iceberg-hive-metastore jars for Hive 2 is 
> well justified. 
> Hive 2 is deprecated and has known security vulnerabilities, which should 
> encourage users to upgrade.
> As you pointed out, Spark can be configured to use a different version, but 
> it's ultimately up to the Iceberg community to make that decision.



Re: Hive 4 support

2025-05-20 Thread Denys Kuzmenko
I don't believe continuing to build iceberg-hive-metastore jars for Hive 2 is 
well justified. 
Hive 2 is deprecated and has known security vulnerabilities, which should 
encourage users to upgrade.
As you pointed out, Spark can be configured to use a different version, but 
it's ultimately up to the Iceberg community to make that decision.


Re: Hive 4 support

2025-05-20 Thread Wing Yew Poon
Denys,

As I have mentioned before, Spark uses a built-in version of the Hive
metastore (currently 2.3.10), but can be configured to use a different
version and be pointed to a path containing Hive metastore jars for the
different version. However, the highest Hive version that can be configured
for Spark 3.5 is 3.1.3.
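
Concretely, that configuration looks something like the following (a sketch
using the standard spark.sql.hive.metastore.* settings; the jar path and
application jar are placeholders):

```shell
# Sketch: run Spark 3.5 against Hive 3.1.3 metastore jars instead of the
# built-in 2.3.10 client. The jar directory below is a placeholder path.
spark-submit \
  --conf spark.sql.hive.metastore.version=3.1.3 \
  --conf spark.sql.hive.metastore.jars=path \
  --conf "spark.sql.hive.metastore.jars.path=file:///opt/hive-3.1.3/lib/*.jar" \
  my-app.jar   # placeholder application jar
```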

Many Spark users rely on Spark's built-in Hive metastore support.
For that reason, I think Iceberg should still support Hive 2 for now. The
proposal here is to provide iceberg-hive-metastore jars built against Hive
2.3.10, 3.1.3, and 4.0. (Note that we can build them against a single
source set.) We can remove the iceberg-hive-metastore classes from engine
runtime jars, and engine users are free to use the version of
iceberg-hive-metastore jar corresponding to their HMS, if they are using
the HiveCatalog.
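
Under that proposal, usage from an engine would look roughly like this (a
sketch; the jar file names are illustrative, not published artifact names):

```shell
# Sketch: add the iceberg-hive-metastore jar matching your HMS version
# alongside the engine runtime jar. File names here are illustrative only.
spark-submit \
  --jars iceberg-spark-runtime-3.5_2.12-1.9.0.jar,iceberg-hive-metastore-1.9.0.jar \
  my-app.jar   # placeholder application jar
```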

- Wing Yew






On Tue, May 20, 2025 at 1:10 AM Denys Kuzmenko  wrote:

> Wing Yew, how about Hive 3.x HMS? Is it compatible with Spark 3.x? If yes,
> why keep support for v2?
> FYI Hive 2.x has many unresolved security vulnerabilities.
>


Re: Hive 4 support

2025-05-20 Thread Denys Kuzmenko
Wing Yew, how about Hive 3.x HMS? Is it compatible with Spark 3.x? If yes, why 
keep support for v2?
FYI Hive 2.x has many unresolved security vulnerabilities.


Re: Hive 4 support

2025-05-19 Thread Wing Yew Poon
Peter and Denys, thank you for your responses.

First, I'd like to get agreement on producing separate
iceberg-hive-metastore jars for Hive 2, 3 and 4.
Peter, do you agree with that?
Second, if we do that, I listed three options for the runtime jars of
engines, and it appears that you prefer option a. (remove
iceberg-hive-metastore classes from the runtime jars).

Denys, I do know that Hive 2 and 3 are EOL. However, we support Spark 3 in
Iceberg, and Spark 3 is not able to work with a Hive 4 HMS. Spark 4.0 is
about to be released, and I'm happy to see efforts for Hive 4 support in
Spark. I think there is still a lot of work to be done before Spark can use
Hive 4 in its built-in classes (this won't be in Spark 4.0). The bottom
line is that for now, we should support older versions of HMS.

- Wing Yew

ps. I want to thank Peter for all his help in the work so far!


On Mon, May 19, 2025 at 1:54 AM Denys Kuzmenko  wrote:

> Hi Wing Yew,
>
> Hive-2.x and Hive 3.x are EOL. We encourage everyone to upgrade instead of
> keeping compatibility with the legacy versions.
>
> Note: Spark started migration to Hive-4.0, and we are helping here:
> https://github.com/apache/spark/pull/50213
>
> Regards,
> Denys
>


Re: Hive 4 support

2025-05-19 Thread Péter Váry
Thanks for working on this @Wing Yew Poon !

As I have mentioned on the PR, I would prefer unbundling the hive-metastore
dependency from the runtime jars.
This would help in multiple ways:
- Decrease the size of the runtime jars - nowadays more and more people use
Iceberg with a REST catalog, so bundling the hive-metastore jar is
unnecessary for them
- Hive 4 is the only officially supported Hive version - with the current
solution we don't support it
- Testing against a specific Hive version will be clearer

Thanks,
Peter

Wing Yew Poon  wrote (on Fri, May 16, 2025, 23:52):

> Hi,
>
> I'd like to bring attention to this again.
> As you know, the hive-metastore module (which provides the HiveCatalog) is
> built against Hive 2.
> I have a PR, https://github.com/apache/iceberg/pull/12721, for building
> and testing the hive-metastore module against Hive 2, 3 and 4. Daniel Weeks
> looked into building the module against a single version of Hive and
> testing that against all three Hive versions, but this approach turned out
> to be infeasible because of binary incompatibility introduced by Hive 4 (so
> e.g., a hive-metastore jar built against Hive 2 would be incompatible if we
> use Hive 4 jars for our Hive dependencies). Note that Hive 4 also breaks
> thrift compatibility, so talking to a Hive 4 HMS requires using a Hive 4
> client.
>
> Given this, I propose that we have three versions of the module,
> hive-metastore, hive3-metastore and hive4-metastore, built against Hive 2,
> 3 and 4 respectively. (This is currently what is implemented in the PR.)
> This means three versions of the artifact. First, I'd like to get agreement
> from the community on doing this.
> Second, the hive-metastore classes are currently bundled in runtime jars,
> e.g., the Spark runtime jar and the Flink runtime jar. If we have agreement
> on the first point, I'd like to hear opinions on options for this second
> point:
> a. Remove hive-metastore classes from these runtime jars. Users will pick
> the hive-metastore jar appropriate to the Hive version they are using and
> add it to the classpath of their engine.
> b. Keep the runtime jars unchanged, which means bundling the
> hive-metastore classes built against Hive 2.
> c. Build multiple versions of the runtime jars, for Hive versions
> supported by those engines. (This does not mean all combinations, as e.g.,
> Spark 3 does not support Hive 4.)
>
> a. provides the most flexibility at the expense of inconvenience to the
> user (to have to add another jar to the classpath).
> c. is the most complex.
>
> There is a separate question of testing engines such as Spark and Flink
> against different versions of HMS. At this time, I do not wish to get into
> this question until the above is resolved. Regardless of the option chosen
> for the second point, I think it is possible to test engines against
> different versions of HMS.
>
> Thanks,
> Wing Yew
>
>
> On Fri, Apr 11, 2025 at 12:03 PM Daniel Weeks  wrote:
>
>> Hey Wing,
>>
>> There are a few concerning issues I have with the current PR:
>>
>> 1) We need to update LICENSE/NOTICE for the hive3/hive4 dependencies
>> because I believe we only looked at the referenced version
>> 2) We're producing artifacts for hive3 and hive4 modules which I think we
>> want to exclude (we shouldn't triplicate the artifacts that are cross
>> compatible)
>> 3) I don't think we should be copying the hive DDL files into the Iceberg
>> project
>>
>> I'd really like to find a workaround for #3.  I've added comments to the
>> PR as well.
>>
>> Thanks,
>> Dan
>>
>> On Fri, Apr 11, 2025 at 9:28 AM Wing Yew Poon 
>> wrote:
>>
>>> Hi,
>>>
>>> I created https://github.com/apache/iceberg/pull/12721 as an
>>> alternative to my earlier PR,
>>> https://github.com/apache/iceberg/pull/12681. With very helpful
>>> pointers from Peter Vary and his persistent prodding, we have an
>>> implementation where a single source set can be used to build and test with
>>> Hive 2, 3 and 4.
>>>
>>> In the new PR, the existing hive-metastore module is still built with
>>> Hive 2, and additional modules, hive3-metastore and hive4-metastore are
>>> introduced. Currently no other modules depend on the latter two. All three
>>> are built and tested in the existing java-ci workflow (without any change).
>>>
>>> The (non-test) classes in the hive-metastore module are packaged in the
>>> Flink and Spark runtime JARs. As a single source is used to support Hive 2,
>>> 3 and 4, they can be used with a HiveCatalog backed by a Hive 2, 3 or 4
>>> HMS. In the PR, we do not make any changes to Flink or Spark modules.
>>> Testing them with a different version of Hive is the next step. I'd like to
>>> hear your opinion on how to approach that.
>>>
>>> Thanks,
>>> Wing Yew
>>>
>>>
>>>
>>>
>>> On Thu, Apr 3, 2025 at 5:31 PM Wing Yew Poon 
>>> wrote:
>>>
 Hi Peter,

 Thank you for your thoughts on this.

 My PR, https://github.com/apache/iceberg/pull/12681, achieves the
 objectives of building and testing the hive-metastore module against Hive
 2, 3, and 4 while keeping a single code source for the implementation.

Re: Hive 4 support

2025-05-19 Thread Denys Kuzmenko
Hi Wing Yew,

Hive-2.x and Hive 3.x are EOL. We encourage everyone to upgrade instead of 
keeping compatibility with the legacy versions.

Note: Spark started migration to Hive-4.0, and we are helping here:
https://github.com/apache/spark/pull/50213

Regards,
Denys


Re: Hive 4 support

2025-05-16 Thread Wing Yew Poon
Hi,

I'd like to bring attention to this again.
As you know, the hive-metastore module (which provides the HiveCatalog) is
built against Hive 2.
I have a PR, https://github.com/apache/iceberg/pull/12721, for building and
testing the hive-metastore module against Hive 2, 3 and 4. Daniel Weeks
looked into building the module against a single version of Hive and
testing that against all three Hive versions, but this approach turned out
to be infeasible because of binary incompatibility introduced by Hive 4 (so
e.g., a hive-metastore jar built against Hive 2 would be incompatible if we
use Hive 4 jars for our Hive dependencies). Note that Hive 4 also breaks
thrift compatibility, so talking to a Hive 4 HMS requires using a Hive 4
client.

Given this, I propose that we have three versions of the module,
hive-metastore, hive3-metastore and hive4-metastore, built against Hive 2,
3 and 4 respectively. (This is currently what is implemented in the PR.)
This means three versions of the artifact. First, I'd like to get agreement
from the community on doing this.
Second, the hive-metastore classes are currently bundled in runtime jars,
e.g., the Spark runtime jar and the Flink runtime jar. If we have agreement
on the first point, I'd like to hear opinions on options for this second
point:
a. Remove hive-metastore classes from these runtime jars. Users will pick
the hive-metastore jar appropriate to the Hive version they are using and
add it to the classpath of their engine.
b. Keep the runtime jars unchanged, which means bundling the hive-metastore
classes built against Hive 2.
c. Build multiple versions of the runtime jars, for Hive versions supported
by those engines. (This does not mean all combinations, as e.g., Spark 3
does not support Hive 4.)

a. provides the most flexibility at the expense of inconvenience to the
user (to have to add another jar to the classpath).
c. is the most complex.

There is a separate question of testing engines such as Spark and Flink
against different versions of HMS. At this time, I do not wish to get into
this question until the above is resolved. Regardless of the option chosen
for the second point, I think it is possible to test engines against
different versions of HMS.

Thanks,
Wing Yew


On Fri, Apr 11, 2025 at 12:03 PM Daniel Weeks  wrote:

> Hey Wing,
>
> There are a few concerning issues I have with the current PR:
>
> 1) We need to update LICENSE/NOTICE for the hive3/hive4 dependencies
> because I believe we only looked at the referenced version
> 2) We're producing artifacts for hive3 and hive4 modules which I think we
> want to exclude (we shouldn't triplicate the artifacts that are cross
> compatible)
> 3) I don't think we should be copying the hive DDL files into the Iceberg
> project
>
> I'd really like to find a workaround for #3.  I've added comments to the
> PR as well.
>
> Thanks,
> Dan
>
> On Fri, Apr 11, 2025 at 9:28 AM Wing Yew Poon 
> wrote:
>
>> Hi,
>>
>> I created https://github.com/apache/iceberg/pull/12721 as an alternative
>> to my earlier PR, https://github.com/apache/iceberg/pull/12681. With
>> very helpful pointers from Peter Vary and his persistent prodding, we have
>> an implementation where a single source set can be used to build and test
>> with Hive 2, 3 and 4.
>>
>> In the new PR, the existing hive-metastore module is still built with
>> Hive 2, and additional modules, hive3-metastore and hive4-metastore are
>> introduced. Currently no other modules depend on the latter two. All three
>> are built and tested in the existing java-ci workflow (without any change).
>>
>> The (non-test) classes in the hive-metastore module are packaged in the
>> Flink and Spark runtime JARs. As a single source is used to support Hive 2,
>> 3 and 4, they can be used with a HiveCatalog backed by a Hive 2, 3 or 4
>> HMS. In the PR, we do not make any changes to Flink or Spark modules.
>> Testing them with a different version of Hive is the next step. I'd like to
>> hear your opinion on how to approach that.
>>
>> Thanks,
>> Wing Yew
>>
>>
>>
>>
>> On Thu, Apr 3, 2025 at 5:31 PM Wing Yew Poon  wrote:
>>
>>> Hi Peter,
>>>
>>> Thank you for your thoughts on this.
>>>
>>> My PR, https://github.com/apache/iceberg/pull/12681, achieves the
>>> following objectives:
>>> (1) Build and test the hive-metastore module against Hive 2, 3, and 4.
>>> (As you suggested in the earlier thread, we build and test against 2.3.10,
>>> 3.1.3, and 4.0.1.)
>>> (2) Keep a single code source for the hive-metastore implementation (the
>>> test code needed to be different, unfortunately).
>>>
>>> In order to achieve (2), code changes were in fact needed (due to
>>> HIVE-27925  breaking
>>> backward compatibility). So, even if HMS API is backwards compatible (and
>>> HiveConf is not considered to be a part of HMS API), this shows the
>>> importance of actually doing (1).
>>> As a result of (2), even though in this PR, Flink and Spark modules
>>> continue to be built with Hive 2 dependencies, the hive-metastore classes
>>> packaged in the Flink and Spark runtime jars are usable with Hive 2, 3, or
>>> 4 HMS (provided the engine supported that version of Hive).

Re: Hive 4 support

2025-04-11 Thread Daniel Weeks
Hey Wing,

There are a few concerning issues I have with the current PR:

1) We need to update LICENSE/NOTICE for the hive3/hive4 dependencies
because I believe we only looked at the referenced version
2) We're producing artifacts for hive3 and hive4 modules which I think we
want to exclude (we shouldn't triplicate the artifacts that are cross
compatible)
3) I don't think we should be copying the hive DDL files into the Iceberg
project

I'd really like to find a workaround for #3.  I've added comments to the PR
as well.

Thanks,
Dan

On Fri, Apr 11, 2025 at 9:28 AM Wing Yew Poon 
wrote:

> Hi,
>
> I created https://github.com/apache/iceberg/pull/12721 as an alternative
> to my earlier PR, https://github.com/apache/iceberg/pull/12681. With very
> helpful pointers from Peter Vary and his persistent prodding, we have an
> implementation where a single source set can be used to build and test with
> Hive 2, 3 and 4.
>
> In the new PR, the existing hive-metastore module is still built with Hive
> 2, and additional modules, hive3-metastore and hive4-metastore are
> introduced. Currently no other modules depend on the latter two. All three
> are built and tested in the existing java-ci workflow (without any change).
>
> The (non-test) classes in the hive-metastore module are packaged in the
> Flink and Spark runtime JARs. As a single source is used to support Hive 2,
> 3 and 4, they can be used with a HiveCatalog backed by a Hive 2, 3 or 4
> HMS. In the PR, we do not make any changes to Flink or Spark modules.
> Testing them with a different version of Hive is the next step. I'd like to
> hear your opinion on how to approach that.
>
> Thanks,
> Wing Yew
>
>
>
>
> On Thu, Apr 3, 2025 at 5:31 PM Wing Yew Poon  wrote:
>
>> Hi Peter,
>>
>> Thank you for your thoughts on this.
>>
>> My PR, https://github.com/apache/iceberg/pull/12681, achieves the
>> following objectives:
>> (1) Build and test the hive-metastore module against Hive 2, 3, and 4.
>> (As you suggested in the earlier thread, we build and test against 2.3.10,
>> 3.1.3, and 4.0.1.)
>> (2) Keep a single code source for the hive-metastore implementation (the
>> test code needed to be different, unfortunately).
>>
>> In order to achieve (2), code changes were in fact needed (due to
>> HIVE-27925  breaking
>> backward compatibility). So, even if HMS API is backwards compatible (and
>> HiveConf is not considered to be a part of HMS API), this shows the
>> importance of actually doing (1).
>> As a result of (2), even though in this PR, Flink and Spark modules
>> continue to be built with Hive 2 dependencies, the hive-metastore classes
>> packaged in the Flink and Spark runtime jars are usable with Hive 2, 3, or
>> 4 HMS (provided the engine supported that version of Hive).
>>
>> The use of separate Gradle modules (hive3-metastore, hive4-metastore) is
>> an implementation detail. I don't think we disagree on the objectives. If
>> you have suggestions for implementing the objectives differently, I'm happy
>> to hear them. To avoid getting into the weeds, we can discuss them in the
>> PR.
>>
>> I understand the desire to have a single version of `TestHiveMetastore`,
>> a test class that is used by the Flink and Spark modules as well.
>> Unfortunately, I did not see a way of doing this even with
>> `DynConstructors` and `DynMethods`. In Hive 4, `HMSHandler` became a
>> top-level class instead of being a nested class in `HiveMetaStore` (and I
>> don't see how to import one class from one version of Hive and a different
>> class from another version of Hive). Even with `DynConstructors` you have
>> to give it the instance of the class. Again, if you have suggestions to
>> overcome this, we can discuss them in the PR. However, I feel that keeping
>> a single version of test code that works for all Hive versions, while
>> desirable, should not be a blocker.
>>
>> The scope of the PR is deliberately limited to the hive-metastore module.
>> I feel that what versions of Hive to use for testing Flink and Spark
>> modules against the HiveCatalog and how to do so should be a second (and
>> possibly even third) step. As an experiment/proof of concept, in
>> https://github.com/apache/iceberg/pull/12693, I show that Flink modules
>> run successfully with hive-metastore built with Hive 4. Additionally, Spark
>> modules run successfully with hive-metastore built with Hive 3 (they fail
>> with Hive 4). I would say though that while Spark 4.0 will support using a
>> Hive 4 HMS, I do not expect we will see Spark 3.4 or 3.5 supporting using a
>> Hive 4 HMS. Thus, I do not think we should block this work on what Spark
>> supports.
>>
>> Thanks,
>> Wing Yew
>>
>>
>>
>>
>> On Thu, Apr 3, 2025 at 3:01 AM Péter Váry 
>> wrote:
>>
>>> Hi Wing Yew,
>>>
>>> Thanks for taking a look at this.
>>> After the removal of the Hive runtime code, we only depend on HMS in the
>>> HiveCatalog module in the production code. Since HMS API is supposed to be

Re: Hive 4 support

2025-04-11 Thread Wing Yew Poon
Hi,

I created https://github.com/apache/iceberg/pull/12721 as an alternative to
my earlier PR, https://github.com/apache/iceberg/pull/12681. With very
helpful pointers from Peter Vary and his persistent prodding, we have an
implementation where a single source set can be used to build and test with
Hive 2, 3 and 4.

In the new PR, the existing hive-metastore module is still built with Hive
2, and additional modules, hive3-metastore and hive4-metastore are
introduced. Currently no other modules depend on the latter two. All three
are built and tested in the existing java-ci workflow (without any change).

The (non-test) classes in the hive-metastore module are packaged in the
Flink and Spark runtime JARs. As a single source is used to support Hive 2,
3 and 4, they can be used with a HiveCatalog backed by a Hive 2, 3 or 4
HMS. In the PR, we do not make any changes to Flink or Spark modules.
Testing them with a different version of Hive is the next step. I'd like to
hear your opinion on how to approach that.

Thanks,
Wing Yew




On Thu, Apr 3, 2025 at 5:31 PM Wing Yew Poon  wrote:

> Hi Peter,
>
> Thank you for your thoughts on this.
>
> My PR, https://github.com/apache/iceberg/pull/12681, achieves the
> following objectives:
> (1) Build and test the hive-metastore module against Hive 2, 3, and 4. (As
> you suggested in the earlier thread, we build and test against 2.3.10,
> 3.1.3, and 4.0.1.)
> (2) Keep a single code source for the hive-metastore implementation (the
> test code needed to be different, unfortunately).
>
> In order to achieve (2), code changes were in fact needed (due to
> HIVE-27925  breaking
> backward compatibility). So, even if HMS API is backwards compatible (and
> HiveConf is not considered to be a part of HMS API), this shows the
> importance of actually doing (1).
> As a result of (2), even though in this PR, Flink and Spark modules
> continue to be built with Hive 2 dependencies, the hive-metastore classes
> packaged in the Flink and Spark runtime jars are usable with Hive 2, 3, or
> 4 HMS (provided the engine supported that version of Hive).
>
> The use of separate Gradle modules (hive3-metastore, hive4-metastore) is
> an implementation detail. I don't think we disagree on the objectives. If
> you have suggestions for implementing the objectives differently, I'm happy
> to hear them. To avoid getting into the weeds, we can discuss them in the
> PR.
>
> I understand the desire to have a single version of `TestHiveMetastore`, a
> test class that is used by the Flink and Spark modules as well.
> Unfortunately, I did not see a way of doing this even with
> `DynConstructors` and `DynMethods`. In Hive 4, `HMSHandler` became a
> top-level class instead of being a nested class in `HiveMetaStore` (and I
> don't see how to import one class from one version of Hive and a different
> class from another version of Hive). Even with `DynConstructors` you have
> to give it the instance of the class. Again, if you have suggestions to
> overcome this, we can discuss them in the PR. However, I feel that keeping
> a single version of test code that works for all Hive versions, while
> desirable, should not be a blocker.
>
> The scope of the PR is deliberately limited to the hive-metastore module.
> I feel that what versions of Hive to use for testing Flink and Spark
> modules against the HiveCatalog and how to do so should be a second (and
> possibly even third) step. As an experiment/proof of concept, in
> https://github.com/apache/iceberg/pull/12693, I show that Flink modules
> run successfully with hive-metastore built with Hive 4. Additionally, Spark
> modules run successfully with hive-metastore built with Hive 3 (they fail
> with Hive 4). I would say though that while Spark 4.0 will support using a
> Hive 4 HMS, I do not expect we will see Spark 3.4 or 3.5 supporting using a
> Hive 4 HMS. Thus, I do not think we should block this work on what Spark
> supports.
>
> Thanks,
> Wing Yew
>
>
>
>
> On Thu, Apr 3, 2025 at 3:01 AM Péter Váry 
> wrote:
>
>> Hi Wing Yew,
>>
>> Thanks for taking a look at this.
>> After the removal of the Hive runtime code, we only depend on HMS in the
>> HiveCatalog module in the production code. Since HMS API is supposed to be
>> backward compatible, I would prefer to keep a single hive-metastore module
>> and a single source dir for accessing HMS. When there is need we can use
>> the DynMethods as we do now.
>>
>> I understand that for testing we need to start a HiveMetastore. It is
>> used by Flink and Spark as well. Here things become harder as we need to
>> instantiate a test Hive Metastore where we don't have backward
>> compatibility guarantees. If possible, I would still prefer to keep a
>> single codebase for testing purposes where we use DynMethods and
>> HiveVersion to decide which version of HMS we instantiate. This could
>> easily become a mess, but I would try to check if we can get away with it
>> somehow.
>

Re: Hive 4 support

2025-04-05 Thread Péter Váry
Hi Wing Yew,

Thanks for taking a look at this.
After the removal of the Hive runtime code, we only depend on HMS in the
HiveCatalog module in the production code. Since HMS API is supposed to be
backward compatible, I would prefer to keep a single hive-metastore module
and a single source dir for accessing HMS. When there is need we can use
the DynMethods as we do now.

I understand that for testing we need to start a HiveMetastore. It is used
by Flink and Spark as well. Here things become harder as we need to
instantiate a test Hive Metastore where we don't have backward
compatibility guarantees. If possible, I would still prefer to keep a
single codebase for testing purposes where we use DynMethods and
HiveVersion to decide which version of HMS we instantiate. This could
easily become a mess, but I would try to check if we can get away with it
somehow.

I would definitely run the hive-metastore tests with all of the supported
Hive versions. OTOH, I would defer to the Spark/Flink maintainers to decide
how confident they are in their product's ability to abstract away the
differences between the Hive versions. For Flink, I would suggest running the
Flink tests only with the latest supported Hive version.

Thanks,
Peter


Wing Yew Poon  wrote (on Mon, Mar 31, 2025, 4:27):

> Hi,
>
> I want to follow up on the earlier discussion about Hive support, where it
> was decided to remove the Hive runtime support from the Iceberg repo, and
> leave the Hive metastore support (where the HiveCatalog is implemented) on
> Hive 2 for the time being.
>
> In that earlier thread, Peter Vary proposed that we support Hive 2.3.10,
> 3.1.3 and 4.0.1 for the hive-metastore module and ensure that it gets built
> and tested against those versions.
>
> I have implemented this in https://github.com/apache/iceberg/pull/12681.
> I have left the existing hive-metastore module depending on Hive 2.3.10,
> and added new hive3-metastore and hive4-metastore modules that depend on
> Hive 3.1.3 and 4.0.1 respectively. I have followed the approach used by the
> mr and hive3 modules previously and kept all common code in one directory
> (the existing hive-metastore directory) and avoided code duplication. In
> order to work around https://issues.apache.org/jira/browse/HIVE-27925,
> which introduced a backward incompatibility in Hive 4, I have avoided the
> use of HiveConf.ConfVars enums and used the conf property names (which have
> not changed) instead. (This is also the approach used by Spark in
> https://issues.apache.org/jira/browse/SPARK-47679.) Please see the PR for
> more details.
>
> The Flink and Spark modules (along with the delta-lake module) have test
> code that depend on the hive-metastore module as well as on Hive metastore
> jars. Having those modules test against Hive 3 and Hive 4 metastore
> versions is not in the scope of the above PR. I plan to work on that
> separately as follow up, and I want to hear opinions on the approach. As a
> proof of concept, I have put up
> https://github.com/apache/iceberg/pull/12693 with follow on changes to
> test the Flink modules against hive4-metastore. This is straightforward and
> there are no issues.
>
> For Spark though, as I have mentioned in the earlier thread, Spark uses a
> built-in version of the Hive metastore (currently 2.3.10), but can be
> configured to use a different version and be pointed to a path containing
> Hive metastore jars for the different version. However, the highest Hive
> version that can be configured for Spark 3.5 is 3.1.3 (Spark 4 will support
> 4.0.x), as changes in Spark code are needed to work around
> HIVE-27925.
>
> What I'm interested in hearing is: For testing Flink and Spark against
> Hive versions, do we want to test against
> (1) just one version, e.g., the highest version supportable by that
> Flink/Spark version (or alternatively just 2.3.10).
> (2) multiple versions from 2.3.10, 3.1.3 and 4.0.1, as long as they are
> supportable by that Flink/Spark version.
> And if (2), how do we want to do that, e.g., full matrix, or some kind of
> sampling.
>
> Thanks,
> Wing Yew
>
>
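
The HiveConf.ConfVars workaround quoted above can be sketched as follows. A
plain Map stands in for Hive's Configuration here; the point is only that the
string property key, unlike the HiveConf.ConfVars enum constant, is
binary-stable across Hive 2, 3, and 4:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the HIVE-27925 workaround: Hive 4 changed HiveConf.ConfVars in a
// binary-incompatible way, but the underlying property names are unchanged.
// Reading configuration by the string key lets one source set work against
// all three Hive versions. A Map stands in for the real Configuration object.
public class ConfVarsWorkaround {
    // stable property key, identical in Hive 2.3.x, 3.1.x, and 4.0.x
    static final String METASTORE_URIS = "hive.metastore.uris";

    // instead of conf.getVar(HiveConf.ConfVars.METASTOREURIS)
    static String metastoreUris(Map<String, String> conf) {
        return conf.getOrDefault(METASTORE_URIS, "");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(METASTORE_URIS, "thrift://localhost:9083");
        System.out.println(metastoreUris(conf));
    }
}
```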


Re: Hive 4 support

2025-04-03 Thread Wing Yew Poon
Hi Peter,

Thank you for your thoughts on this.

My PR, https://github.com/apache/iceberg/pull/12681, achieves the following
objectives:
(1) Build and test the hive-metastore module against Hive 2, 3, and 4. (As
you suggested in the earlier thread, we build and test against 2.3.10,
3.1.3, and 4.0.1.)
(2) Keep a single code source for the hive-metastore implementation (the
test code needed to be different, unfortunately).

In order to achieve (2), code changes were in fact needed (due to HIVE-27925
 breaking backward
compatibility). So, even if HMS API is backwards compatible (and HiveConf
is not considered to be a part of HMS API), this shows the importance of
actually doing (1).
As a result of (2), even though in this PR, Flink and Spark modules
continue to be built with Hive 2 dependencies, the hive-metastore classes
packaged in the Flink and Spark runtime jars are usable with Hive 2, 3, or
4 HMS (provided the engine supported that version of Hive).

The use of separate Gradle modules (hive3-metastore, hive4-metastore) is an
implementation detail. I don't think we disagree on the objectives. If you
have suggestions for implementing the objectives differently, I'm happy to
hear them. To avoid getting into the weeds, we can discuss them in the PR.

I understand the desire to have a single version of `TestHiveMetastore`, a
test class that is used by the Flink and Spark modules as well.
Unfortunately, I did not see a way of doing this even with
`DynConstructors` and `DynMethods`. In Hive 4, `HMSHandler` became a
top-level class instead of being a nested class in `HiveMetaStore` (and I
don't see how to import one class from one version of Hive and a different
class from another version of Hive). Even with `DynConstructors` you have
to give it the instance of the class. Again, if you have suggestions to
overcome this, we can discuss them in the PR. However, I feel that keeping
a single version of test code that works for all Hive versions, while
desirable, should not be a blocker.
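
For illustration only, the kind of reflective fallback that utilities like
DynMethods rely on looks like the sketch below; the fully qualified class
names are assumptions based on the description above, and, as noted, this
does not help code that must import the type directly:

```java
// Sketch: resolve HMSHandler by binary name at runtime, trying the Hive 4
// top-level class first and falling back to the Hive 2/3 nested class.
// The class names below are assumptions based on the discussion in this
// thread; this pattern cannot replace a compile-time import of the type.
public class HmsHandlerResolver {
    static Class<?> hmsHandlerClass() {
        String[] candidates = {
            "org.apache.hadoop.hive.metastore.HMSHandler",               // Hive 4 (top-level)
            "org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler"  // Hive 2/3 (nested)
        };
        for (String name : candidates) {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException e) {
                // not on the classpath; try the next candidate
            }
        }
        throw new IllegalStateException("No HMSHandler implementation on the classpath");
    }
}
```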

The scope of the PR is deliberately limited to the hive-metastore module. I
feel that what versions of Hive to use for testing Flink and Spark modules
against the HiveCatalog and how to do so should be a second (and possibly
even third) step. As an experiment/proof of concept, in
https://github.com/apache/iceberg/pull/12693, I show that Flink modules run
successfully with hive-metastore built with Hive 4. Additionally, Spark
modules run successfully with hive-metastore built with Hive 3 (they fail
with Hive 4). I would say though that while Spark 4.0 will support using a
Hive 4 HMS, I do not expect we will see Spark 3.4 or 3.5 supporting using a
Hive 4 HMS. Thus, I do not think we should block this work on what Spark
supports.

Thanks,
Wing Yew




On Thu, Apr 3, 2025 at 3:01 AM Péter Váry 
wrote:

> Hi Wing Yew,
>
> Thanks for taking a look at this.
> After the removal of the Hive runtime code, we only depend on HMS in the
> HiveCatalog module in the production code. Since HMS API is supposed to be
> backward compatible, I would prefer to keep a single hive-metastore module
> and a single source dir for accessing HMS. When there is need we can use
> the DynMethods as we do now.
>
> I understand that for testing we need to start a HiveMetastore. It is used
> by Flink and Spark as well. Here things become harder as we need to
> instantiate a test Hive Metastore where we don't have backward
> compatibility guarantees. If possible, I would still prefer to keep a
> single codebase for testing purposes where we use DynMethods and
> HiveVersion to decide which version of HMS we instantiate. This could
> easily become a mess, but I would try to check if we can get away with it
> somehow.
>
> I would definitely run the hive-metastore tests with all of the supported
> Hive versions. OTOH, I would defer to the Spark/Flink maintainers to decide
> how confident they are in their product's ability to abstract away the
> differences between the Hive versions. For Flink, I would suggest running
> the Flink tests only with the latest supported Hive version.
>
> Thanks,
> Peter
>
>
> Wing Yew Poon  wrote (on Mon, Mar 31, 2025, 4:27):
>
>> Hi,
>>
>> I want to follow up on the earlier discussion about Hive support, where
>> it was decided to remove the Hive runtime support from the Iceberg repo,
>> and leave the Hive metastore support (where the HiveCatalog is implemented)
>> on Hive 2 for the time being.
>>
>> In that earlier thread, Peter Vary proposed that we support Hive 2.3.10,
>> 3.1.3 and 4.0.1 for the hive-metastore module and ensure that it gets built
>> and tested against those versions.
>>
>> I have implemented this in https://github.com/apache/iceberg/pull/12681.
>> I have left the existing hive-metastore module depending on Hive 2.3.10,
>> and added new hive3-metastore and hive4-metastore modules that depend on
>> Hive 3.1.3 and 4.0.1 respectively. I have followed the approach used by
>> the mr and hive3 modules previously and kept all common code in one
>> directory (the existing hive-metastore directory) and avoided code
>> duplication.