Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread Aniket Mokashi
Felix - please add me to this event. Ben - should we move this proposal to a doc and open it up for edits/comments? On Wed, Nov 20, 2019 at 5:37 PM Felix Cheung wrote: > Great! > > Due to a number of constraints I won’t be sending the link directly here but > please reply to me and I will add you.

Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread Felix Cheung
Great! Due to a number of constraints I won’t be sending the link directly here but please reply to me and I will add you. From: Ben Sidhom Sent: Wednesday, November 20, 2019 9:10:01 AM To: John Zhuge Cc: bo yang ; Amogh Margoor ; Ryan Blue ; Ben Sidhom ; Spark Dev List

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Dongjoon Hyun
Thank you all. I'll try to make a JIRA and PR for that. Bests, Dongjoon. On Wed, Nov 20, 2019 at 4:08 PM Cheng Lian wrote: > Sean, thanks for the corner cases you listed. They make a lot of sense. > Now I'm inclined to have Hive 2.3 as the default version. > > Dongjoon, apologies if I didn't

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Thank you for the thoughtful clarification. I agree with all your options. Especially for the Hive metastore connection, the `Hive isolated client loader` is also important with Hive 2.3, because the Hive 2.3 client cannot talk to Hive 2.1 and lower. The `Hive isolated client loader` is one of the good design
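For concreteness, the isolated client loader is driven by the `spark.sql.hive.metastore.*` settings: Spark's built-in Hive classes handle execution while a sandboxed client of the configured version talks to the metastore. A minimal sketch (the metastore version, jar source, and application jar here are illustrative):

    # Run a Hive 2.3-built Spark against an older Hive 2.1 metastore; the
    # isolated client loader downloads and sandboxes the 2.1 client jars.
    ./bin/spark-submit \
      --conf spark.sql.hive.metastore.version=2.1 \
      --conf spark.sql.hive.metastore.jars=maven \
      my-app.jar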

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Oh, actually, in order to decouple the Hadoop 3.2 and Hive 2.3 upgrades, we will need a hive-2.3 profile anyway, whether we have the hive-1.2 profile or not. On Wed, Nov 20, 2019 at 3:33 PM Cheng Lian wrote: > Just to summarize my points: > >1. Let's still keep the Hive 1.2 dependency in Spark
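To make the decoupling concrete, the resulting build matrix would look roughly like this, assuming the profile names discussed in the thread:

    # Hive and Hadoop versions selected independently at build time:
    ./build/mvn -Phadoop-2.7 -Phive-2.3 -DskipTests clean package
    ./build/mvn -Phadoop-3.2 -Phive-2.3 -DskipTests clean package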

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Cheng Lian
Sean, thanks for the corner cases you listed. They make a lot of sense. Now I'm inclined to have Hive 2.3 as the default version. Dongjoon, apologies if I didn't make it clear before. What made me concerned initially was only the following part: > can we remove the usage of forked `hive` in

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Cheng Lian
Hey Nicholas, Thanks for pointing this out. I just realized that I misread the spark-hadoop-cloud POM. Previously, in Spark 2.4, two profiles, "hadoop-2.7" and "hadoop-3.1", were referenced in the spark-hadoop-cloud POM (here
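For context, the Spark 2.4 arrangement being described pairs the cloud module's own profile with one of the Hadoop profiles, roughly:

    # Spark 2.4-era sketch: -Phadoop-cloud enables spark-hadoop-cloud;
    # "hadoop-3.1" is the 2.4 profile name, distinct from 3.0's "hadoop-3.2".
    ./build/mvn -Phadoop-3.1 -Phadoop-cloud -DskipTests clean package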

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Just to summarize my points: 1. Let's still keep the Hive 1.2 dependency in Spark 3.0, but make it optional. End-users may choose between Hive 1.2/2.3 via a new profile (either adding a hive-1.2 profile or adding a hive-2.3 profile works for me, depending on which Hive version we pick
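Under the first option (a new hive-1.2 profile, with Hive 2.3 as the unflagged default), the end-user choice would look something like this sketch:

    ./build/mvn -DskipTests clean package              # default: Hive 2.3
    ./build/mvn -Phive-1.2 -DskipTests clean package   # opt in to the forked 1.2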

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Dongjoon, I don't think we have any conflicts here. As stated in other threads multiple times, as long as the Hive 2.3 and Hadoop 3.2 version upgrades can be decoupled, I have no preference as to which Hive/Hadoop version we pick as the default. So the following two plans both work for me:

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Nice. That's progress. Let's narrow down the path. We need to clarify the criteria we can agree on. 1. What does `battle-tested for years` mean exactly? How and when can we start the `battle-tested` stage for Hive 2.3? 2. What is the new "Hive integration in Spark"? During

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Cheng Lian
Hey Dongjoon and Felix, I totally agree that Hive 2.3 is more stable than Hive 1.2. Otherwise, we wouldn't even consider integrating with Hive 2.3 in Spark 3.0. However, *"Hive" and "Hive integration in Spark" are two quite different things*, and I don't think anybody has ever mentioned "the

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Felix Cheung
Just to add - the Hive 1.2 fork is definitely not more stable. We know of a few critical bug fixes that we cherry-picked into a fork of that fork to maintain ourselves. From: Dongjoon Hyun Sent: Wednesday, November 20, 2019 11:07:47 AM To: Sean Owen Cc: dev

Re: The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Thanks. That will be a giant step forward, Sean! > I'd prefer making it the default in the POM for 3.0. Bests, Dongjoon. On Wed, Nov 20, 2019 at 11:02 AM Sean Owen wrote: > Yeah 'stable' is ambiguous. It's old and buggy, but at least it's the > same old and buggy that's been there a while.

The Myth: the forked Hive 1.2.1 is stabler than XXX

2019-11-20 Thread Dongjoon Hyun
Hi, All. I'm sending this email because it's important to discuss this topic narrowly and reach a clear conclusion. `The forked Hive 1.2.1 is stable`? It sounds like a myth we created by ignoring the existing bugs. If you want to say the forked Hive 1.2.1 is stabler than XXX, please give us the

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Dongjoon Hyun
Yes. Right. That's the situation we are hitting and the result I expected. We need to change our default to Hive 2 in the POM. Dongjoon. On Wed, Nov 20, 2019 at 5:20 AM Sean Owen wrote: > Yes, good point. A user would get whatever the POM says without > profiles enabled, so it matters.
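The reason the POM default matters: profiles only change how Spark itself is built, while the published artifacts resolve to whatever the unflagged default says. A hypothetical way to check what downstream users would inherit:

    # Inspect which Hive artifacts the sql/hive module pulls in by default
    # (no -P flags); this is what a plain Maven/sbt user gets transitively.
    ./build/mvn -pl sql/hive -am dependency:tree \
      -Dincludes=org.apache.hive,org.spark-project.hive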

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Mridul Muralidharan
Just for completeness' sake, Spark is not version-neutral to Hadoop; particularly in YARN mode, there is a minimum version requirement (though fairly generous, I believe). I agree with Steve, it is a long-standing pain that we are bundling a positively ancient version of Hive. Having said that, we

Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread Ben Sidhom
That sounds great! On Wed, Nov 20, 2019 at 9:02 AM John Zhuge wrote: > That will be great. Please send us the invite. > > On Wed, Nov 20, 2019 at 8:56 AM bo yang wrote: > >> Cool, thanks Ryan, John, Amogh for the reply! Great to see you >> interested! Felix will have a Spark Scalability &

Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread John Zhuge
That will be great. Please send us the invite. On Wed, Nov 20, 2019 at 8:56 AM bo yang wrote: > Cool, thanks Ryan, John, Amogh for the reply! Great to see you interested! > Felix will have a Spark Scalability & Reliability Sync meeting on Dec 4 1pm > PST. We could discuss more details there. Do

Re: Enabling fully disaggregated shuffle on Spark

2019-11-20 Thread bo yang
Cool, thanks Ryan, John, Amogh for the reply! Great to see you interested! Felix will have a Spark Scalability & Reliability Sync meeting on Dec 4 1pm PST. We could discuss more details there. Do you want to join? On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor wrote: > We at Qubole are also

Re: Removing the usage of forked `hive` in Apache Spark 3.0 (SPARK-20202)

2019-11-20 Thread Sean Owen
Yes, good point. A user would get whatever the POM says without profiles enabled so it matters. Playing it out, an app _should_ compile with the Spark dependency marked 'provided'. In that case the app that is spark-submit-ted is agnostic to the Hive dependency as the only one that matters is
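A sketch of the point: because the application bundles no Spark (and hence no Hive) classes of its own, the Hive version in effect is decided entirely by the Spark installation the app is submitted to (class and jar names here are illustrative):

    # The assembly was built with Spark marked 'provided', so the same jar
    # runs unchanged against a Hive 1.2-built or Hive 2.3-built Spark.
    ./bin/spark-submit --class com.example.MyApp my-app-assembly.jar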