Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-07-21 Thread Steve Loughran
On Sun, 12 Jul 2020 at 01:45, gpongracz wrote: > As someone who mainly operates in AWS it would be very welcome to have the > option to use an updated version of Hadoop using PySpark sourced from PyPI. > > Acknowledging the issues of backwards compatibility... > > The most vexing issue is the

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-07-11 Thread gpongracz
As someone who mainly operates in AWS it would be very welcome to have the option to use an updated version of Hadoop using PySpark sourced from PyPI. Acknowledging the issues of backwards compatibility... The most vexing issue is the lack of ability to use s3a STS, i.e.
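
For context, the s3a STS gap is about credential settings that only exist in newer hadoop-aws releases. Below is a minimal sketch of launching pip-installed PySpark with STS session credentials, assuming a build whose bundled Hadoop client is 3.2.x; `TemporaryAWSCredentialsProvider` and `fs.s3a.session.token` exist in hadoop-aws 2.8+ but not in 2.7.x, and the version numbers here are illustrative, not a recommendation.

```
# Launch pip-installed PySpark against S3 using STS session credentials.
# Requires hadoop-aws >= 2.8 (here 3.2.0, matching a Hadoop 3.2 build);
# neither the provider class nor fs.s3a.session.token exists in 2.7.x.
pyspark \
  --packages org.apache.hadoop:hadoop-aws:3.2.0 \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  --conf spark.hadoop.fs.s3a.session.token="$AWS_SESSION_TOKEN"
```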

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-25 Thread Hyukjin Kwon
I don't have a strong opinion on changing the default either, but I slightly prefer having the option to switch the Hadoop version first, just to stay safer. To be clear, we're now discussing the timing of when to set Hadoop 3 as the default, and which change has to come first,

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Andrew Melo
Hello, On Wed, Jun 24, 2020 at 2:13 PM Holden Karau wrote: > > So I thought our theory for the pypi packages was it was for local > developers, they really shouldn't care about the Hadoop version. If you're > running on a production cluster you ideally pip install from the same release >

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Holden Karau
So I thought our theory for the PyPI packages was that they were for local developers, who really shouldn't care about the Hadoop version. If you're running on a production cluster you ideally pip install from the same release artifacts as your production cluster to match. On Wed, Jun 24, 2020 at 12:11

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Wenchen Fan
Shall we start a new thread to discuss the bundled Hadoop version in PySpark? I don't have a strong opinion on changing the default, as users can still download the Hadoop 2.7 version. On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun wrote: > To Xiao. > Why Apache project releases should be

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Dongjoon Hyun
To Xiao. Why should Apache project releases be blocked by PyPI / CRAN? It's completely optional, isn't it? > let me repeat my opinion: the top priority is to provide two options for PyPI distribution IIRC, Apache Spark 3.0.0 failed to upload to CRAN and this is not the first incident. Apache

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this:
```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
spark.read.parquet('s3a://...')
```
I agree that Hadoop 3 would
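
The point of that workflow is that the hadoop-aws package has to line up with the Hadoop client bundled in the PyPI wheel, which is exactly why such users care about the default. A minimal sketch of the same workflow against a hypothetical Hadoop-3.2-based PyPI package (version numbers illustrative):

```
# Same workflow, assuming the PyPI wheel bundles a Hadoop 3.2 client;
# hadoop-aws must then come from the matching 3.2.x line.
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.0
spark.read.parquet('s3a://...')
```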

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Sean Owen
I'm also genuinely curious when PyPI users would care about the bundled Hadoop jars - do we even need two versions? That itself is extra complexity for end users. I do think Hadoop 3 is the better choice for the user who doesn't care, and better long term. OK but let's at least move ahead with

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Xiao Li
Hi, Dongjoon, Please do not misinterpret my point. I already clearly said "I do not know how to track the popularity of Hadoop 2 vs Hadoop 3." Also, let me repeat my opinion: the top priority is to provide two options for PyPI distribution and let the end users choose the ones they need. Hadoop

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Dongjoon Hyun
Thanks, Xiao, Sean, Nicholas. To Xiao, > it sounds like Hadoop 3.x is not as popular as Hadoop 2.7. If you say so: Apache Hadoop 2.6.0 is the most popular one with 156 dependencies, and Apache Spark 2.2.0 is the most popular one with 264 dependencies. As we know, it doesn't make sense. Are we

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
The team I'm on currently uses pip-installed PySpark for local development, and we regularly access S3 directly from our laptops/workstations. One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is being able to use a recent version of hadoop-aws that has mature support for s3a.

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Sean Owen
Will PySpark users care much about the Hadoop version? They won't if running locally. They will if connecting to a Hadoop cluster. Then again, in that context, they're probably using a distro anyway that harmonizes it. Hadoop 3's installed base can't be that large yet; it's been around far less time.

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Xiao Li
I think we just need to provide two options and let end users choose the ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is a high priority task for Spark 3.1 release to me. I do not know how to track the popularity of Hadoop 2 vs
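
If SPARK-32017 lands, the selection could be as simple as an install-time switch. A purely illustrative sketch of what that could look like; the environment-variable name is an assumption of mine, not a committed interface:

```
# Hypothetical install-time selector for the bundled Hadoop version
# (variable name illustrative; SPARK-32017 tracks the real mechanism).
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
PYSPARK_HADOOP_VERSION=3.2 pip install pyspark
```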

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Dongjoon Hyun
I fully understand your concern, but we cannot live with Hadoop 2.7.4 forever, Xiao. Like Hadoop 2.6, we should let it go. So, are you saying that CRAN/PyPI should have all combinations of Apache Spark, including the Hive 1.2 distribution? What is your suggestion as a PMC on the Hadoop 3.2 migration path?

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Sean Owen
So, we also release Spark binary distros with Hadoop 2.7, 3.2, and no Hadoop -- all of the options. Picking one profile or the other to release with pypi etc isn't more or less consistent with those releases, as all exist. Is this change only about the source code default, with no effect on

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Xiao Li
Then, it will be a little complex after this PR. It might make the community more confused. In PyPI and CRAN, we are using Hadoop 2.7 as the default profile; however, in the other distributions, we are using Hadoop 3.2 as the default? How do we explain this to the community? I would not change the

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Dongjoon Hyun
Thanks. Uploading PySpark to PyPI is a simple manual step and our release script can still build PySpark with Hadoop 2.7 if we want. So, `No` for the following question. I updated my PR according to your comment. > If we change the default, will it impact them? If YES,... From the
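
As a rough illustration of that point, the distribution tooling takes the Hadoop profile as an argument, so a Hadoop 2.7 PySpark package can still be produced regardless of the source-tree default. A minimal sketch, with flags following dev/make-distribution.sh; the exact release-script invocation may differ:

```
# Build a PySpark-enabled distribution pinned to the Hadoop 2.7 profile
# (illustrative; the real release scripts wrap this with more options).
./dev/make-distribution.sh --name hadoop2.7 --pip --tgz \
  -Phadoop-2.7 -Phive -Phive-thriftserver
```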

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Xiao Li
Our monthly PyPI downloads of PySpark have reached 5.4 million. We should avoid forcing the current PySpark users to upgrade their Hadoop versions. If we change the default, will it impact them? If YES, I think we should not do it until it is ready and they have a workaround. So far, our PyPI

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-23 Thread Dongjoon Hyun
Hi, All. I am bumping up this thread again with the title "Use Hadoop-3.2 as a default Hadoop profile in 3.1.0?" There has been some recent discussion on the following PR. Please let us know your thoughts. https://github.com/apache/spark/pull/28897 Bests, Dongjoon. On Fri, Nov 1, 2019 at 9:41 AM

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-22 Thread Steve Loughran
On Tue, Nov 19, 2019 at 10:40 PM Cheng Lian wrote: > Hey Steve, > > In terms of Maven artifact, I don't think the default Hadoop version > matters except for the spark-hadoop-cloud module, which is only meaningful > under the hadoop-3.2 profile. All the other spark-* artifacts published to >

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Cheng Lian
> the org.spark-project hive 1.2 will need a solution. It is old and rather buggy; and it's been *years*. > I think we should decouple the hive change from everything else if people are concerned?

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-20 Thread Mridul Muralidharan
> I think we should decouple the hive change from everything else if people are concerned? > From: Steve Loughran; Sent: Sunday, November 17, 2019 9:22:09 AM; To: Cheng Lian; Cc

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Nicholas Chammas
> It is old and rather buggy; and it's been *years*. > I think we should decouple the hive change from everything else if people are concerned? > From: Steve Loughran; Sent: Sunda

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Hyukjin Kwon
> It is old and rather buggy; and it's been *years*. > I think we should decouple the hive change from everything else if people are concerned? > From: Steve Loughran; Sent: S

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-19 Thread Cheng Lian
> From: Steve Loughran; Sent: Sunday, November 17, 2019 9:22:09 AM; To: Cheng Lian; Cc: Sean Owen, Wenchen Fan, Dongjoon Hyun, dev, Yuming Wang; Subject: Re: Use Hadoop-3.2 as a

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-18 Thread Dongjoon Hyun
> From: Steve Loughran; Sent: Sunday, November 17, 2019 9:22:09 AM; To: Cheng Lian; Cc: Sean Owen, Wenchen Fan, Dongjoon Hyun, dev, Yuming Wang; Subject: Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0? > Can I take this moment t

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-18 Thread Felix Cheung
> 9:22:09 AM; To: Cheng Lian; Cc: Sean Owen, Wenchen Fan, Dongjoon Hyun, dev, Yuming Wang; Subject: Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0? > Can I take this moment to remind everyone that the version of hive which spark has historically bundled (the org.spark-project one

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-17 Thread Steve Loughran
Can I take this moment to remind everyone that the version of hive which spark has historically bundled (the org.spark-project one) is an orphan project put together to deal with Hive's shading issues and a source of unhappiness in the Hive project. Whatever gets shipped should do its best to

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Cheng Lian
Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I thought the original proposal was to replace Hive 1.2 with Hive 2.3, which seemed risky, and therefore we only introduced Hive 2.3 under the hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong here...

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Sean Owen
I'd prefer simply not making Hadoop 3 the default until 3.1+, rather than introduce yet another build combination. Does Hadoop 2 + Hive 2 work and is there demand for it? On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan wrote: > > Do we have a limitation on the number of pre-built distributions?

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Wenchen Fan
Do we have a limitation on the number of pre-built distributions? Seems this time we need: 1. Hadoop 2.7 + Hive 1.2, 2. Hadoop 2.7 + Hive 2.3, 3. Hadoop 3 + Hive 2.3. AFAIK we always build with JDK 8 (but make it JDK 11 compatible), so we don't need to add the JDK version to the combination. On Sat, Nov
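
As a rough sketch of what those three combinations mean in build terms (profile names follow the thread's hadoop-x.y / hive-x.y naming; the exact profiles and defaults may differ per branch):

```
# Three pre-built combinations, expressed as Maven profile sets (illustrative).
./build/mvn -Phadoop-2.7 -Phive-1.2 -Phive -Phive-thriftserver -DskipTests package
./build/mvn -Phadoop-2.7 -Phive-2.3 -Phive -Phive-thriftserver -DskipTests package
./build/mvn -Phadoop-3.2 -Phive-2.3 -Phive -Phive-thriftserver -DskipTests package
```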

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-16 Thread Dongjoon Hyun
Thank you for the suggestion. Having a `hive-2.3` profile sounds good to me because it's orthogonal to Hadoop 3. IIRC, originally, it was proposed in that way, but we put it under `hadoop-3.2` to avoid adding new profiles at that time. And, I'm wondering if you are considering additional pre-built

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-15 Thread Cheng Lian
Cc Yuming, Steve, and Dongjoon On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian wrote: > Similar to Xiao, my major concern about making Hadoop 3.2 the default > Hadoop version is quality control. The current hadoop-3.2 profile covers > too many major component upgrades, i.e.: > >- Hadoop 3.2 >

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-15 Thread Cheng Lian
Similar to Xiao, my major concern about making Hadoop 3.2 the default Hadoop version is quality control. The current hadoop-3.2 profile covers too many major component upgrades, i.e. Hadoop 3.2, Hive 2.3, and JDK 11. We have already found and fixed some feature and performance

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-04 Thread Koert Kuipers
I get that CDH and HDP backport a lot and in that way left 2.7 behind, but they kept the public APIs stable at the 2.7 level, because that's kind of the point. Aren't those the Hadoop APIs Spark uses? On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran wrote: > > > On Mon, Nov 4, 2019 at 12:39 AM

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-04 Thread Steve Loughran
On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas wrote: > On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran > wrote: > >> It would be really good if the spark distributions shipped with later >> versions of the hadoop artifacts. >> > > I second this. If we need to keep a Hadoop 2.x profile around,

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-04 Thread Steve Loughran
I'd move Spark's branch-2 line to 2.9.x as (a) Spark's version of httpclient hits a bug in the AWS SDK used in hadoop-2.8 unless you revert that patch https://issues.apache.org/jira/browse/SPARK-22919 (b) there's only one future version of 2.8.x planned, which is expected once myself or someone

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-03 Thread Nicholas Chammas
On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran wrote: > It would be really good if the spark distributions shipped with later > versions of the hadoop artifacts. > I second this. If we need to keep a Hadoop 2.x profile around, why not make it Hadoop 2.8 or something newer? Koert Kuipers wrote:

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-02 Thread Xiao Li
The changes for JDK 11 support do not increase the risk of the Hadoop 3.2 profile. Hive 1.2.1 execution JARs are much more stable than Hive 2.3.6 execution JARs. The changes to the thrift-server are massive. We need more evidence to prove the quality and stability before we switch the default to

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-02 Thread Koert Kuipers
Yes, I am not against Hadoop 3 becoming the default. I was just questioning the statement that we are close to dropping support for Hadoop 2. We build our own Spark releases that we deploy on the clusters of our clients. These clusters are HDP 2.x, CDH 5, EMR, Dataproc, etc. I am aware that

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-02 Thread Dongjoon Hyun
Hi, Koert. Could you be more specific about your Hadoop version requirement? Although we will have a Hadoop 2.7 profile, support for Hadoop 2.6 and older is already officially dropped in Apache Spark 3.0.0. We cannot give you an answer for Hadoop 2.6 and older clusters because we are not

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-02 Thread Koert Kuipers
I don't see how we can be close to the point where we don't need to support Hadoop 2.x. This does not agree with the reality from my perspective, which is that all our clients are on Hadoop 2.x. Not a single one is on Hadoop 3.x currently. This includes deployments of Cloudera distros, Hortonworks

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-01 Thread Dongjoon Hyun
Hi, Xiao. How can JDK 11 support make the `Hadoop-3.2` profile risky? We build and publish with JDK 8. > In this release, Hive execution module upgrade [from 1.2 to 2.3], Hive thrift-server upgrade, and JDK 11 support are added to the Hadoop 3.2 profile only. Since we build and publish with JDK 8 and the

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-01 Thread Jiaxin Shan
+1 for Hadoop 3.2. It seems a lot of the cloud integration work Steve did is only available in 3.2. We see lots of users asking for better S3A support in Spark. On Fri, Nov 1, 2019 at 9:46 AM Xiao Li wrote: > Hi, Steve, > > Thanks for your comments! My major quality concern is not against Hadoop >

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-01 Thread Xiao Li
Hi, Steve, Thanks for your comments! My major quality concern is not against Hadoop 3.2. In this release, the Hive execution module upgrade [from 1.2 to 2.3], the Hive thrift-server upgrade, and JDK 11 support are added to the Hadoop 3.2 profile only. Compared with the Hadoop 2.x profile, the Hadoop 3.2 profile

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-11-01 Thread Steve Loughran
What is the current default value? The 2.x releases are becoming EOL: 2.7 is dead, there might be a 2.8.x, and for now 2.9 is the branch-2 release getting attention. 2.10.0 shipped yesterday, but the ".0" means there will inevitably be surprises. One issue with using older versions is that any
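
For anyone wanting a newer 2.x client without waiting on a new profile, the Hadoop version can also be overridden at build time. A minimal sketch, with 2.9.2 purely as an illustrative version:

```
# Build against a newer Hadoop 2.x by overriding hadoop.version
# (the hadoop-2.7 profile only sets the default; the override picks the actual artifacts).
./build/mvn -Phadoop-2.7 -Dhadoop.version=2.9.2 -DskipTests package
```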

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-10-28 Thread Dongjoon Hyun
Thank you for the feedback, Sean and Xiao. Bests, Dongjoon. On Mon, Oct 28, 2019 at 12:52 PM Xiao Li wrote: > The stability and quality of Hadoop 3.2 profile are unknown. The changes > are massive, including Hive execution and a new version of Hive > thriftserver. > > To reduce the risk, I

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-10-28 Thread Xiao Li
The stability and quality of the Hadoop 3.2 profile are unknown. The changes are massive, including the Hive execution module and a new version of the Hive thriftserver. To reduce the risk, I would like to keep the current default version unchanged. When it becomes stable, we can change the default profile to

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-10-28 Thread Sean Owen
I'm OK with that, but don't have a strong opinion nor info about the implications. That said my guess is we're close to the point where we don't need to support Hadoop 2.x anyway, so, yeah. On Mon, Oct 28, 2019 at 2:33 PM Dongjoon Hyun wrote: > > Hi, All. > > There was a discussion on publishing

Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2019-10-28 Thread Dongjoon Hyun
Hi, All. There was a discussion on publishing artifacts built with Hadoop 3. But we are still publishing with Hadoop 2.7.3 and `3.0-preview` will be the same because we didn't change anything yet. Technically, we need to change two places for publishing: 1. Jenkins Snapshot Publishing