Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Dongjoon Hyun
Thank you all. I'll try to make a JIRA and a PR for that. Bests, Dongjoon. On Wed, Nov 20, 2019 at 4:08 PM Cheng Lian wrote: > Sean, thanks for the corner cases you listed. They make a lot of sense. > Now I'm inclined to have Hive 2.3 as the default version. > > Dongjoon, apologies if I didn't

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Cheng Lian
Sean, thanks for the corner cases you listed. They make a lot of sense. Now I'm inclined to have Hive 2.3 as the default version. Dongjoon, apologies if I didn't make it clear before. What made me concerned initially was only the following part: > can we remove the usage of forked `hive` in

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Dongjoon Hyun
Yes. Right. That's the situation we are hitting and the result I expected. We need to change our default to Hive 2 in the POM. Dongjoon. On Wed, Nov 20, 2019 at 5:20 AM Sean Owen wrote: > Yes, good point. A user would get whatever the POM says without > profiles enabled so it matters. > >
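
A minimal sbt sketch of the point above: a downstream project that depends on spark-hive resolves whichever Hive artifacts the published Spark POM declares by default, because Maven profiles like the hive profiles discussed in this thread are a build-time switch and do not propagate to consumers. Coordinates and version strings below are illustrative; the forked Hive 1.2 artifacts are, as far as I know, published under the org.spark-project.hive group.

  // build.sbt -- illustrative sketch only
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-sql"  % "3.0.0",
    "org.apache.spark" %% "spark-hive" % "3.0.0"
    // Whether this pulls in the forked Hive 1.2.1 artifacts or Apache Hive 2.3.x
    // transitively is decided by the default declared in Spark's parent POM,
    // not by anything in this build file.
  )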

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Sean Owen
Yes, good point. A user would get whatever the POM says without profiles enabled so it matters. Playing it out, an app _should_ compile with the Spark dependency marked 'provided'. In that case the app that is spark-submit-ted is agnostic to the Hive dependency as the only one that matters is
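
A minimal sbt sketch of the 'provided' pattern described above (version strings are illustrative): the application compiles against Spark but does not bundle it, so the Hive client that actually runs is whichever one ships with the Spark distribution that spark-submit targets.

  // build.sbt -- illustrative sketch only
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-sql"  % "3.0.0" % Provided,
    "org.apache.spark" %% "spark-hive" % "3.0.0" % Provided
  )
  // The packaged application jar carries no Spark or Hive classes of its own,
  // so it is agnostic to whether the cluster's Spark was built against the
  // Hive 1.2 fork or Apache Hive 2.3.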

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Cheng, could you elaborate on your criteria, `Hive 2.3 code paths are proven to be stable`? For me, it's difficult to imagine that we can reach any stable situation when we don't use it at all by ourselves. > The Hive 1.2 code paths can only be removed once the Hive 2.3 code paths are proven to

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Hyukjin Kwon
> Should Hadoop 2 + Hive 2 be considered to work on JDK 11? This seems to be under investigation in Yuming's PR ( https://github.com/apache/spark/pull/26533) if I am not mistaken. Oh, yes, what I meant by (default) was the default profiles we will use in Spark. On Wed, Nov 20, 2019 at 10:14 AM, Sean Owen

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Sean Owen
Should Hadoop 2 + Hive 2 be considered to work on JDK 11? I wasn't sure if 2.7 did, but honestly I've lost track. Anyway, it doesn't matter much as the JDK doesn't cause another build permutation. All are built targeting Java 8. I also don't know if we have to declare a binary release a default.

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Hyukjin Kwon
So, are we able to conclude our plans as below? 1. In Spark 3, we release as below: - Hadoop 3.2 + Hive 2.3 + JDK8 build that also works with JDK 11 - Hadoop 2.7 + Hive 2.3 + JDK8 build that also works with JDK 11 - Hadoop 2.7 + Hive 1.2.1 (fork) + JDK8 (default) 2. In Spark 3.1, we target: -

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Thanks for taking care of this, Dongjoon! We can target SPARK-20202 to 3.1.0, but I don't think we should do it immediately after cutting the branch-3.0. The Hive 1.2 code paths can only be removed once the Hive 2.3 code paths are proven to be stable. If it turned out to be buggy in Spark 3.1, we

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Sean Owen
Same idea? Support this combo in 3.0 and then remove Hadoop 2 support in 3.1 or something? Or at least make them non-default, not necessarily publish special builds? On Tue, Nov 19, 2019 at 4:53 PM Dongjoon Hyun wrote: > For additional `hadoop-2.7 with Hive 2.3` pre-built distribution, how do

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Yes. It does. I meant SPARK-20202. Thanks. I understand that it can be considered like a Scala version issue. So, that's the reason why I put this as a `policy` issue from the beginning. > First of all, I want to put this as a policy issue instead of a technical issue. From the policy perspective,

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
It's kinda like a Scala version upgrade. Historically, we only remove support for an older Scala version when the newer version is proven to be stable after one or more Spark minor versions. On Tue, Nov 19, 2019 at 2:07 PM Cheng Lian wrote: > Hmm, what exactly did you mean by "remove the usage

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Hmm, what exactly did you mean by "remove the usage of forked `hive` in Apache Spark 3.0 completely officially"? I thought you wanted to remove the forked Hive 1.2 dependencies completely, no? As long as we still keep Hive 1.2 in Spark 3.0, I'm fine with that. I personally don't have a

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
BTW, `hive.version.short` is a directory name. We are using 2.3.6 only. For the directory names, we use '1.2.1' and '2.3.5' because we just delayed renaming the directories until the 3.0.0 deadline to minimize the diff. We can rename them right now if we want. On Tue, Nov 19, 2019 at

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Hi, Cheng. This is irrelevant to JDK11 and Hadoop 3. I'm talking about the JDK8 world. If we consider them, it could be the following table (columns: Hive 1.2.1 fork vs. Apache Hive 2.3.6):

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Cheng Lian
Dongjoon, I'm with Hyukjin. There should be at least one Spark 3.x minor release to stabilize Hive 2.3 code paths before retiring the Hive 1.2 fork. Even today, the Hive 2.3.6 version bundled in Spark 3.0 is still buggy in terms of JDK 11 support. (BTW, I just found that our root POM is referring

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Dongjoon Hyun
Thank you for the feedback, Hyukjin and Sean. I proposed `preview-2` for that purpose, but I'm also +1 for doing that in 3.1 if we can make a decision to eliminate the illegitimate Hive fork reference immediately after the `branch-3.0` cut. Sean, I'm referencing Cheng Lian's email for the status of

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-19 Thread Sean Owen
Just to clarify, as even I have lost the details over time: hadoop-2.7 works with hive-2.3? It isn't tied to hadoop-3.2? Roughly how much risk is there in using the Hive 1.x fork over Hive 2.x, for end users using Hive via Spark? I don't have a strong opinion, other than sharing the view that we
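
For context on "using Hive via Spark": end-user code reaches the bundled Hive client only through enableHiveSupport(), so the fork-vs-2.3 choice is invisible at the API level. A small illustrative sketch (the built-in client can also be swapped at runtime via the existing spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars settings):

  import org.apache.spark.sql.SparkSession

  // Whether the forked Hive 1.2.1 or Apache Hive 2.3.x client backs these calls
  // depends on how the Spark distribution itself was built, not on user code.
  val spark = SparkSession.builder()
    .appName("hive-via-spark")
    .enableHiveSupport()
    .getOrCreate()

  spark.sql("SHOW TABLES").show()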

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-18 Thread Hyukjin Kwon
I struggled hard to deal with this issue multiple times over a year, and thankfully we finally decided to use the official version of Hive 2.3.x too (thank you, Yuming, Alan, and everyone). I think it is already huge progress that we started to use the official version of Hive. I think we should at

Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-18 Thread Dongjoon Hyun
Hi, All. First of all, I want to put this as a policy issue instead of a technical issue. Also, this is orthogonal to the `hadoop` version discussion. The Apache Spark community has kept (not maintained) the forked Apache Hive 1.2.1 because there were no other options before. As we see at SPARK-20202,