Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-24 Thread Prashant Sharma
+1 for 3.0.1 release.
I too can help out as release manager.

On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:

> I volunteer to be a release manager of 3.0.1, if nobody is working on this.
>
>
> -- Original Message --
> *From:* "Gengliang Wang"
> *Sent:* Wednesday, June 24, 2020, 4:15 PM
> *To:* "Hyukjin Kwon"
> *Cc:* "Dongjoon Hyun"; "Jungtaek Lim" <kabhwan.opensou...@gmail.com>;
> "Jules Damji"; "Holden Karau"; "Reynold Xin"; "Shivaram Venkataraman";
> "Yuanjian Li" <xyliyuanj...@gmail.com>; "Spark dev list"; "Takeshi Yamamuro"
> *Subject:* Re: [DISCUSS] Apache Spark 3.0.1 Release
>
> +1, the issues mentioned are really serious.
>
> On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon  wrote:
>
>> +1.
>>
>> Just as a note,
>> - SPARK-31918 is fixed now, and there's no blocker.
>> - When we build SparkR, we should use the latest R version, at least 4.0.0+.
>>
>> On Wed, Jun 24, 2020 at 11:20 AM, Dongjoon Hyun wrote:
>>
>>> +1
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 +1 on a 3.0.1 soon.

 Probably it would be nice if some Scala experts could take a look at
 https://issues.apache.org/jira/browse/SPARK-32051 and include the fix
 in 3.0.1 if possible.
 It looks like APIs designed to work with Scala 2.11 & Java introduce
 ambiguity in Scala 2.12 & Java.

 On Wed, Jun 24, 2020 at 4:52 AM Jules Damji 
 wrote:

> +1 (non-binding)
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jun 23, 2020, at 11:36 AM, Holden Karau 
> wrote:
>
> 
> +1 on a patch release soon
>
> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin 
> wrote:
>
>> +1 on doing a new patch release soon. I saw some of these issues when
>> preparing the 3.0 release, and some of them are very serious.
>>
>>
>> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release
>>> soon.
>>>
>>> Shivaram
>>>
>>> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro <
>>> linguin@gmail.com> wrote:
>>>
>>> Thanks for the heads-up, Yuanjian!
>>>
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>>
>>> wow, the updates are so quick. Anyway, +1 for the release.
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li 
>>> wrote:
>>>
>>> Hi dev-list,
>>>
>>> I’m writing this to raise the discussion about Spark 3.0.1
>>> feasibility since 4 blocker issues were found after Spark 3.0.0:
>>>
>>> [SPARK-31990] The broken state store compatibility will cause a
>>> correctness issue when a streaming query with `dropDuplicates` uses a
>>> checkpoint written by an older Spark version.
>>>
>>> [SPARK-32038] The regression bug in handling NaN values in
>>> COUNT(DISTINCT)
>>>
>>> [SPARK-31918][WIP] CRAN requires SparkR to work with the latest
>>> R 4.0. This makes the 3.0 release unavailable on CRAN, since it only
>>> supports R [3.5, 4.0).
>>>
>>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time
>>> regression
>>>
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>> I think it would be great if we have Spark 3.0.1 to deliver the critical
>>> fixes.
>>>
>>> Any comments are appreciated.
>>>
>>> Best,
>>>
>>> Yuanjian
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-24 Thread 郑瑞峰
I volunteer to be a release manager of 3.0.1, if nobody is working on this.





Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

2020-06-24 Thread Bart Samwel
The relevant earlier discussion is here:
https://github.com/apache/spark/pull/25678#issuecomment-531585556.

(FWIW, a recent PR tried adding this again:
https://github.com/apache/spark/pull/28858.)

On Wed, Jun 24, 2020 at 10:01 PM Rylan Dmello  wrote:

> Hello,
>
>
> Tahsin and I are trying to use the Apache Parquet file format with Spark
> SQL, but are running into errors when reading Parquet files that contain
> TimeType columns. We're wondering whether this is unsupported in Spark SQL
> due to an architectural limitation, or due to lack of resources?
>
>
> Context: When reading some Parquet files with Spark, we get an error
> message like the following:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 186.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 186.0 (TID 1970, 10.155.249.249, executor 1): java.io.IOException: Could
> not read or convert schema for file:
> dbfs:/test/randomdata/sample001.parquet
> ...
> Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type:
> INT64 (TIME_MICROS);
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:106)
>
>
> This only seems to occur with Parquet files that have a column with the
> "TimeType" (or the deprecated "TIME_MILLIS"/"TIME_MICROS") types in the
> Parquet file. After digging into this a bit, we think that the error
> message is coming from "ParquetSchemaConverter.scala".
>
>
>
>
> This seems to imply that the Spark SQL engine does not support reading
> Parquet files with TimeType columns.
>
> We are wondering if anyone on the mailing list could shed some more light
> on this: are there architectural/datatype limitations in Spark that are
> resulting in this error, or is TimeType support for Parquet files something
> that hasn't been implemented yet due to lack of resources/interest?
>
>
> Thanks,
> Rylan
>


-- 
Bart Samwel
bart.sam...@databricks.com


[Spark SQL] Question about support for TimeType columns in Apache Parquet files

2020-06-24 Thread Rylan Dmello
Hello,


Tahsin and I are trying to use the Apache Parquet file format with Spark SQL, 
but are running into errors when reading Parquet files that contain TimeType 
columns. We're wondering whether this is unsupported in Spark SQL due to an 
architectural limitation, or due to lack of resources?


Context: When reading some Parquet files with Spark, we get an error message 
like the following:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 186.0 failed 4 times, most recent failure: Lost task 0.3 in stage 186.0 
(TID 1970, 10.155.249.249, executor 1): java.io.IOException: Could not read or 
convert schema for file: dbfs:/test/randomdata/sample001.parquet
...
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 
(TIME_MICROS);
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:106)


This only seems to occur with Parquet files that have a column with the 
"TimeType" (or the deprecated "TIME_MILLIS"/"TIME_MICROS") types in the Parquet 
file. After digging into this a bit, we think that the error message is coming 
from "ParquetSchemaConverter.scala" here: 
link.
 



This seems to imply that the Spark SQL engine does not support reading Parquet 
files with TimeType columns.

We are wondering if anyone on the mailing list could shed some more light on 
this: are there architectural/datatype limitations in Spark that are
resulting in this error, or is TimeType support for Parquet files something 
that hasn't been implemented yet due to lack of resources/interest?


Thanks,

Rylan
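
A minimal local reproduction of the error above can be sketched as follows (an
illustrative sketch only: it assumes pyarrow is installed alongside PySpark, and
the file path and column name are made up for the example):

```
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import time
from pyspark.sql import SparkSession

# Write a tiny Parquet file whose only column uses the TIME_MICROS (time64) logical type.
table = pa.table({"event_time": pa.array([time(12, 0), time(13, 30)], type=pa.time64("us"))})
pq.write_table(table, "/tmp/time_column.parquet")

# Reading it back through Spark SQL fails in ParquetToSparkSchemaConverter with the
# "Illegal Parquet type: INT64 (TIME_MICROS)" error quoted above, since there is no
# Catalyst type that Parquet TIME columns are mapped onto.
spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.read.parquet("/tmp/time_column.parquet").show()
```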


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Andrew Melo
Hello,

On Wed, Jun 24, 2020 at 2:13 PM Holden Karau  wrote:
>
> So I thought our theory for the pypi packages was that they were for local
> developers, who really shouldn't care about the Hadoop version. If you're
> running on a production cluster you ideally pip install from the same release 
> artifacts as your production cluster to match.

That's certainly one use of pypi packages, but not the only one. In
our case, we provide clusters for our users, with SPARK_CONF pre-configured
with (e.g.) the master connection URL. But the analyses
they're doing are their own and unique, so they work in their own
personal python virtual environments. There are no "release artifacts"
to publish, per se, since each user is working independently and can
install whatever they'd like into their virtual environments.

Cheers
Andrew

>
> On Wed, Jun 24, 2020 at 12:11 PM Wenchen Fan  wrote:
>>
>> Shall we start a new thread to discuss the bundled Hadoop version in 
>> PySpark? I don't have a strong opinion on changing the default, as users can 
>> still download the Hadoop 2.7 version.
>>
>> On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun  
>> wrote:
>>>
>>> To Xiao.
>>> Why should Apache project releases be blocked by PyPi / CRAN? It's
>>> completely optional, isn't it?
>>>
>>> > let me repeat my opinion:  the top priority is to provide two options 
>>> for PyPi distribution
>>>
>>> IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the first 
>>> incident. Apache Spark already has a history of missing SparkR uploading. 
>>> We don't say Spark 3.0.0 fails due to CRAN uploading or other non-Apache 
>>> distribution channels. In short, non-Apache distribution channels cannot be 
>>> a `blocker` for Apache project releases. We only do our best for the 
>>> community.
>>>
>>> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really 
>>> irrelevant to this PR. If someone wants to do that and the PR is ready, why 
>>> don't we do it in `Apache Spark 3.0.1 timeline`? Why do we wait for 
>>> December? Is there a reason why we need to wait?
>>>
>>> To Sean
>>> Thanks!
>>>
>>> To Nicholas.
>>> Do you think `pip install pyspark` is version-agnostic? In the Python 
>>> world, `pip install somepackage` fails frequently. In production, you 
>>> should use `pip install somepackage==specificversion`. I don't think the 
>>> production pipeline has non-versioned Python package installation.
>>>
>>> The bottom line is that the PR doesn't change PyPi uploading, the AS-IS PR 
>>> keeps Hadoop 2.7 on PySpark due to Xiao's comments. I don't think there is 
>>> a blocker for that PR.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas 
>>>  wrote:

 To rephrase my earlier email, PyPI users would care about the bundled 
 Hadoop version if they have a workflow that, in effect, looks something 
 like this:

 ```
 pip install pyspark
 pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
 spark.read.parquet('s3a://...')
 ```

 I agree that Hadoop 3 would be a better default (again, the s3a support is 
 just much better). But to Xiao's point, if you are expecting Spark to work 
 with some package like hadoop-aws that assumes an older version of Hadoop 
 bundled with Spark, then changing the default may break your workflow.

 In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7 to 
 hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that 
 would be more difficult to repair. 🤷‍♂️

 On Wed, Jun 24, 2020 at 1:44 PM Sean Owen  wrote:
>
> I'm also genuinely curious when PyPI users would care about the
> bundled Hadoop jars - do we even need two versions? that itself is
> extra complexity for end users.
> I do think Hadoop 3 is the better choice for the user who doesn't
> care, and better long term.
> OK but let's at least move ahead with changing defaults.
>
> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li  wrote:
> >
> > Hi, Dongjoon,
> >
> > Please do not misinterpret my point. I already clearly said "I do not 
> > know how to track the popularity of Hadoop 2 vs Hadoop 3."
> >
> > Also, let me repeat my opinion:  the top priority is to provide two 
> > options for PyPi distribution and let the end users choose the ones 
> > they need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make 
> > any breaking change, let us follow our protocol documented in 
> > https://spark.apache.org/versioning-policy.html.
> >
> > If you just want to change the Jenkins setup, I am OK about it. If you 
> > want to change the default distribution, we need more discussions in 
> > the community for getting an agreement.
> >
> >  Thanks,
> >
> > Xiao
> >
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): 

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Holden Karau
So I thought our theory for the pypi packages was that they were for local
developers, who really shouldn't care about the Hadoop version. If you're
running on a production cluster you ideally pip install from the same
release artifacts as your production cluster to match.

On Wed, Jun 24, 2020 at 12:11 PM Wenchen Fan  wrote:

> Shall we start a new thread to discuss the bundled Hadoop version in
> PySpark? I don't have a strong opinion on changing the default, as users
> can still download the Hadoop 2.7 version.
>
> On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun 
> wrote:
>
>> To Xiao.
>> Why should Apache project releases be blocked by PyPi / CRAN? It's
>> completely optional, isn't it?
>>
>> > let me repeat my opinion:  the top priority is to provide two
>> options for PyPi distribution
>>
>> IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the
>> first incident. Apache Spark already has a history of missing SparkR
>> uploading. We don't say Spark 3.0.0 fails due to CRAN uploading or other
>> non-Apache distribution channels. In short, non-Apache distribution
>> channels cannot be a `blocker` for Apache project releases. We only do our
>> best for the community.
>>
>> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is
>> really irrelevant to this PR. If someone wants to do that and the PR is
>> ready, why don't we do it in `Apache Spark 3.0.1 timeline`? Why do we wait
>> for December? Is there a reason why we need to wait?
>>
>> To Sean
>> Thanks!
>>
>> To Nicholas.
>> Do you think `pip install pyspark` is version-agnostic? In the Python
>> world, `pip install somepackage` fails frequently. In production, you
>> should use `pip install somepackage==specificversion`. I don't think the
>> production pipeline has non-versioned Python package installation.
>>
>> The bottom line is that the PR doesn't change PyPi uploading, the AS-IS
>> PR keeps Hadoop 2.7 on PySpark due to Xiao's comments. I don't think there
>> is a blocker for that PR.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> To rephrase my earlier email, PyPI users would care about the bundled
>>> Hadoop version if they have a workflow that, in effect, looks something
>>> like this:
>>>
>>> ```
>>> pip install pyspark
>>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
>>> spark.read.parquet('s3a://...')
>>> ```
>>>
>>> I agree that Hadoop 3 would be a better default (again, the s3a support
>>> is just much better). But to Xiao's point, if you are expecting Spark to
>>> work with some package like hadoop-aws that assumes an older version of
>>> Hadoop bundled with Spark, then changing the default may break your
>>> workflow.
>>>
>>> In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7
>>> to hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that
>>> would be more difficult to repair. 🤷‍♂️
>>>
>>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen  wrote:
>>>
 I'm also genuinely curious when PyPI users would care about the
 bundled Hadoop jars - do we even need two versions? that itself is
 extra complexity for end users.
 I do think Hadoop 3 is the better choice for the user who doesn't
 care, and better long term.
 OK but let's at least move ahead with changing defaults.

 On Wed, Jun 24, 2020 at 12:38 PM Xiao Li  wrote:
 >
 > Hi, Dongjoon,
 >
 > Please do not misinterpret my point. I already clearly said "I do not
 know how to track the popularity of Hadoop 2 vs Hadoop 3."
 >
 > Also, let me repeat my opinion:  the top priority is to provide two
 options for PyPi distribution and let the end users choose the ones they
 need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
 breaking change, let us follow our protocol documented in
 https://spark.apache.org/versioning-policy.html.
 >
 > If you just want to change the Jenkins setup, I am OK about it. If
 you want to change the default distribution, we need more discussions in
 the community for getting an agreement.
 >
 >  Thanks,
 >
 > Xiao
 >

>>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Wenchen Fan
Shall we start a new thread to discuss the bundled Hadoop version in
PySpark? I don't have a strong opinion on changing the default, as users
can still download the Hadoop 2.7 version.

On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun 
wrote:

> To Xiao.
> Why should Apache project releases be blocked by PyPi / CRAN? It's
> completely optional, isn't it?
>
> > let me repeat my opinion:  the top priority is to provide two
> options for PyPi distribution
>
> IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the first
> incident. Apache Spark already has a history of missing SparkR uploading.
> We don't say Spark 3.0.0 fails due to CRAN uploading or other non-Apache
> distribution channels. In short, non-Apache distribution channels cannot be
> a `blocker` for Apache project releases. We only do our best for the
> community.
>
> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really
> irrelevant to this PR. If someone wants to do that and the PR is ready, why
> don't we do it in `Apache Spark 3.0.1 timeline`? Why do we wait for
> December? Is there a reason why we need to wait?
>
> To Sean
> Thanks!
>
> To Nicholas.
> Do you think `pip install pyspark` is version-agnostic? In the Python
> world, `pip install somepackage` fails frequently. In production, you
> should use `pip install somepackage==specificversion`. I don't think the
> production pipeline has non-versioned Python package installation.
>
> The bottom line is that the PR doesn't change PyPi uploading, the AS-IS PR
> keeps Hadoop 2.7 on PySpark due to Xiao's comments. I don't think there is
> a blocker for that PR.
>
> Bests,
> Dongjoon.
>
> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> To rephrase my earlier email, PyPI users would care about the bundled
>> Hadoop version if they have a workflow that, in effect, looks something
>> like this:
>>
>> ```
>> pip install pyspark
>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
>> spark.read.parquet('s3a://...')
>> ```
>>
>> I agree that Hadoop 3 would be a better default (again, the s3a support
>> is just much better). But to Xiao's point, if you are expecting Spark to
>> work with some package like hadoop-aws that assumes an older version of
>> Hadoop bundled with Spark, then changing the default may break your
>> workflow.
>>
>> In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7
>> to hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that
>> would be more difficult to repair. 🤷‍♂️
>>
>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen  wrote:
>>
>>> I'm also genuinely curious when PyPI users would care about the
>>> bundled Hadoop jars - do we even need two versions? that itself is
>>> extra complexity for end users.
>>> I do think Hadoop 3 is the better choice for the user who doesn't
>>> care, and better long term.
>>> OK but let's at least move ahead with changing defaults.
>>>
>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li  wrote:
>>> >
>>> > Hi, Dongjoon,
>>> >
>>> > Please do not misinterpret my point. I already clearly said "I do not
>>> know how to track the popularity of Hadoop 2 vs Hadoop 3."
>>> >
>>> > Also, let me repeat my opinion:  the top priority is to provide two
>>> options for PyPi distribution and let the end users choose the ones they
>>> need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
>>> breaking change, let us follow our protocol documented in
>>> https://spark.apache.org/versioning-policy.html.
>>> >
>>> > If you just want to change the Jenkins setup, I am OK about it. If you
>>> want to change the default distribution, we need more discussions in the
>>> community for getting an agreement.
>>> >
>>> >  Thanks,
>>> >
>>> > Xiao
>>> >
>>>
>>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Dongjoon Hyun
To Xiao.
Why should Apache project releases be blocked by PyPi / CRAN? It's
completely optional, isn't it?

> let me repeat my opinion:  the top priority is to provide two options
for PyPi distribution

IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the first
incident. Apache Spark already has a history of missing SparkR uploading.
We don't say Spark 3.0.0 fails due to CRAN uploading or other non-Apache
distribution channels. In short, non-Apache distribution channels cannot be
a `blocker` for Apache project releases. We only do our best for the
community.

SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really
irrelevant to this PR. If someone wants to do that and the PR is ready, why
don't we do it in `Apache Spark 3.0.1 timeline`? Why do we wait for
December? Is there a reason why we need to wait?

To Sean
Thanks!

To Nicholas.
Do you think `pip install pyspark` is version-agnostic? In the Python
world, `pip install somepackage` fails frequently. In production, you
should use `pip install somepackage==specificversion`. I don't think the
production pipeline has non-versioned Python package installation.
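
For illustration only (the package and version below are just an example of the
pinned form):

```
pip install pyspark              # non-versioned: resolves to whatever is newest today
pip install pyspark==3.0.0       # pinned: reproducible across environments
```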

The bottom line is that the PR doesn't change PyPi uploading, the AS-IS PR
keeps Hadoop 2.7 on PySpark due to Xiao's comments. I don't think there is
a blocker for that PR.

Bests,
Dongjoon.

On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> To rephrase my earlier email, PyPI users would care about the bundled
> Hadoop version if they have a workflow that, in effect, looks something
> like this:
>
> ```
> pip install pyspark
> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
> spark.read.parquet('s3a://...')
> ```
>
> I agree that Hadoop 3 would be a better default (again, the s3a support is
> just much better). But to Xiao's point, if you are expecting Spark to work
> with some package like hadoop-aws that assumes an older version of Hadoop
> bundled with Spark, then changing the default may break your workflow.
>
> In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7 to
> hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that
> would be more difficult to repair. 🤷‍♂️
>
> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen  wrote:
>
>> I'm also genuinely curious when PyPI users would care about the
>> bundled Hadoop jars - do we even need two versions? that itself is
>> extra complexity for end users.
>> I do think Hadoop 3 is the better choice for the user who doesn't
>> care, and better long term.
>> OK but let's at least move ahead with changing defaults.
>>
>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li  wrote:
>> >
>> > Hi, Dongjoon,
>> >
>> > Please do not misinterpret my point. I already clearly said "I do not
>> know how to track the popularity of Hadoop 2 vs Hadoop 3."
>> >
>> > Also, let me repeat my opinion:  the top priority is to provide two
>> options for PyPi distribution and let the end users choose the ones they
>> need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
>> breaking change, let us follow our protocol documented in
>> https://spark.apache.org/versioning-policy.html.
>> >
>> > If you just want to change the Jenkins setup, I am OK about it. If you
>> want to change the default distribution, we need more discussions in the
>> community for getting an agreement.
>> >
>> >  Thanks,
>> >
>> > Xiao
>> >
>>
>


Re: m2 cache issues in Jenkins?

2020-06-24 Thread shane knapp ☠
done:
-bash-4.1$ cd .m2
-bash-4.1$ ls
repository
-bash-4.1$ time rm -rf *

real    17m4.607s
user    0m0.950s
sys     0m18.816s
-bash-4.1$

On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠  wrote:

> ok, i've taken that worker offline and once the job running on it
> finishes, i'll wipe the cache.
>
> in the future, please file a JIRA and assign it to me so i don't have to
> track my work through emails to the dev@ list.  ;)
>
> thanks!
>
> shane
>
> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau 
> wrote:
>
>> The most recent one I noticed was
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>>  which
>> was run on  amp-jenkins-worker-04.
>>
>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ 
>> wrote:
>>
>>> for those weird failures, it's super helpful to provide which workers
>>> are showing these issues.  :)
>>>
>>> i'd rather not wipe all of the m2 caches on all of the workers, as we'll
>>> then potentially get blacklisted again if we download too many packages
>>> from apache.org.
>>>
>>> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau 
>>> wrote:
>>>
 Hi Folks,

 I've been seeing some weird failures on Jenkins and it looks like it might
 be from the m2 cache. Would it be OK to clean it out? Or is it important?

 Cheers,

 Holden

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
To rephrase my earlier email, PyPI users would care about the bundled
Hadoop version if they have a workflow that, in effect, looks something
like this:

```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
spark.read.parquet('s3a://...')
```

I agree that Hadoop 3 would be a better default (again, the s3a support is
just much better). But to Xiao's point, if you are expecting Spark to work
with some package like hadoop-aws that assumes an older version of Hadoop
bundled with Spark, then changing the default may break your workflow.

In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7 to
hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that
would be more difficult to repair. 🤷‍♂️
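
For illustration, the adjusted workflow would look roughly like this (assuming a
Spark build bundling Hadoop 3.2 and that hadoop-aws:3.2.1 is the matching artifact;
adjust versions as appropriate):

```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1
spark.read.parquet('s3a://...')
```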

On Wed, Jun 24, 2020 at 1:44 PM Sean Owen  wrote:

> I'm also genuinely curious when PyPI users would care about the
> bundled Hadoop jars - do we even need two versions? that itself is
> extra complexity for end users.
> I do think Hadoop 3 is the better choice for the user who doesn't
> care, and better long term.
> OK but let's at least move ahead with changing defaults.
>
> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li  wrote:
> >
> > Hi, Dongjoon,
> >
> > Please do not misinterpret my point. I already clearly said "I do not
> know how to track the popularity of Hadoop 2 vs Hadoop 3."
> >
> > Also, let me repeat my opinion:  the top priority is to provide two
> options for PyPi distribution and let the end users choose the ones they
> need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
> breaking change, let us follow our protocol documented in
> https://spark.apache.org/versioning-policy.html.
> >
> > If you just want to change the Jenkins setup, I am OK about it. If you
> want to change the default distribution, we need more discussions in the
> community for getting an agreement.
> >
> >  Thanks,
> >
> > Xiao
> >
>


Re: m2 cache issues in Jenkins?

2020-06-24 Thread Holden Karau
Will do :) Thanks for keeping the build system running smoothly :)

On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠  wrote:

> ok, i've taken that worker offline and once the job running on it
> finishes, i'll wipe the cache.
>
> in the future, please file a JIRA and assign it to me so i don't have to
> track my work through emails to the dev@ list.  ;)
>
> thanks!
>
> shane
>
> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau 
> wrote:
>
>> The most recent one I noticed was
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>>  which
>> was run on  amp-jenkins-worker-04.
>>
>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ 
>> wrote:
>>
>>> for those weird failures, it's super helpful to provide which workers
>>> are showing these issues.  :)
>>>
>>> i'd rather not wipe all of the m2 caches on all of the workers, as we'll
>>> then potentially get blacklisted again if we download too many packages
>>> from apache.org.
>>>
>>> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau 
>>> wrote:
>>>
 Hi Folks,

 I've been seeing some weird failures on Jenkins and it looks like it might
 be from the m2 cache. Would it be OK to clean it out? Or is it important?

 Cheers,

 Holden

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: m2 cache issues in Jenkins?

2020-06-24 Thread shane knapp ☠
ok, i've taken that worker offline and once the job running on it finishes,
i'll wipe the cache.

in the future, please file a JIRA and assign it to me so i don't have to
track my work through emails to the dev@ list.  ;)

thanks!

shane

On Wed, Jun 24, 2020 at 10:48 AM Holden Karau  wrote:

> The most recent one I noticed was
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>  which
> was run on  amp-jenkins-worker-04.
>
> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ 
> wrote:
>
>> for those weird failures, it's super helpful to provide which workers are
>> showing these issues.  :)
>>
>> i'd rather not wipe all of the m2 caches on all of the workers, as we'll
>> then potentially get blacklisted again if we download too many packages
>> from apache.org.
>>
>> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau 
>> wrote:
>>
>>> Hi Folks,
>>>
>>> I've been seeing some weird failures on Jenkins and it looks like it might
>>> be from the m2 cache. Would it be OK to clean it out? Or is it important?
>>>
>>> Cheers,
>>>
>>> Holden
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: m2 cache issues in Jenkins?

2020-06-24 Thread Holden Karau
The most recent one I noticed was
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
which
was run on  amp-jenkins-worker-04.

On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠  wrote:

> for those weird failures, it's super helpful to provide which workers are
> showing these issues.  :)
>
> i'd rather not wipe all of the m2 caches on all of the workers, as we'll
> then potentially get blacklisted again if we download too many packages
> from apache.org.
>
> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau  wrote:
>
>> Hi Folks,
>>
>> I've been seeing some weird failures on Jenkins and it looks like it might
>> be from the m2 cache. Would it be OK to clean it out? Or is it important?
>>
>> Cheers,
>>
>> Holden
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: m2 cache issues in Jenkins?

2020-06-24 Thread shane knapp ☠
for those weird failures, it's super helpful to provide which workers are
showing these issues.  :)

i'd rather not wipe all of the m2 caches on all of the workers, as we'll
then potentially get blacklisted again if we download too many packages
from apache.org.

On Tue, Jun 23, 2020 at 5:58 PM Holden Karau  wrote:

> Hi Folks,
>
> I've been seeing some weird failures on Jenkins and it looks like it might be
> from the m2 cache. Would it be OK to clean it out? Or is it important?
>
> Cheers,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Sean Owen
I'm also genuinely curious when PyPI users would care about the
bundled Hadoop jars - do we even need two versions? that itself is
extra complexity for end users.
I do think Hadoop 3 is the better choice for the user who doesn't
care, and better long term.
OK but let's at least move ahead with changing defaults.

On Wed, Jun 24, 2020 at 12:38 PM Xiao Li  wrote:
>
> Hi, Dongjoon,
>
> Please do not misinterpret my point. I already clearly said "I do not know 
> how to track the popularity of Hadoop 2 vs Hadoop 3."
>
> Also, let me repeat my opinion:  the top priority is to provide two options 
> for PyPi distribution and let the end users choose the ones they need. Hadoop 
> 3.2 or Hadoop 2.7. In general, when we want to make any breaking change, let 
> us follow our protocol documented in 
> https://spark.apache.org/versioning-policy.html.
>
> If you just want to change the Jenkins setup, I am OK about it. If you want 
> to change the default distribution, we need more discussions in the community 
> for getting an agreement.
>
>  Thanks,
>
> Xiao
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Xiao Li
Hi, Dongjoon,

Please do not misinterpret my point. I already clearly said "I do not know
how to track the popularity of Hadoop 2 vs Hadoop 3."

Also, let me repeat my opinion:  the top priority is to provide two options
for PyPi distribution and let the end users choose the ones they need.
Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any breaking
change, let us follow our protocol documented in
https://spark.apache.org/versioning-policy.html.

If you just want to change the Jenkins setup, I am OK with it. If you want
to change the default distribution, we need more discussion in the
community to reach an agreement.

 Thanks,

Xiao


On Wed, Jun 24, 2020 at 10:07 AM Dongjoon Hyun 
wrote:

> Thanks, Xiao, Sean, Nicholas.
>
> To Xiao,
>
> >  it sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>
> If you say so,
> - Apache Hadoop 2.6.0 is the most popular one with 156 dependencies.
> - Apache Spark 2.2.0 is the most popular one with 264 dependencies.
>
> As we know, it doesn't make sense. Are we recommending Apache Spark 2.2.0
> over Apache Spark 3.0.0?
>
> There is a reason why Apache Spark dropped the Hadoop 2.6 profile. Hadoop
> 2.7.4 has many limitations in the cloud environment. Apache Hadoop 3.2 will
> unleash Apache Spark 3.1 in the cloud environment (as Nicholas also pointed
> out).
>
> For Sean's comment, yes. We can focus on that later in a different thread.
>
> > The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
> eventually, not now.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Jun 24, 2020 at 7:26 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> The team I'm on currently uses pip-installed PySpark for local
>> development, and we regularly access S3 directly from our
>> laptops/workstations.
>>
>> One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
>> being able to use a recent version of hadoop-aws that has mature support
>> for s3a. With Hadoop 2.7 the support for s3a is buggy and incomplete, and
>> there are incompatibilities that prevent you from using Spark built against
>> Hadoop 2.7 with hadoop-aws version 2.8 or newer.
>>
>> On Wed, Jun 24, 2020 at 10:15 AM Sean Owen  wrote:
>>
>>> Will pyspark users care much about Hadoop version? they won't if running
>>> locally. They will if connecting to a Hadoop cluster. Then again in that
>>> context, they're probably using a distro anyway that harmonizes it.
>>> Hadoop 3's installed base can't be that large yet; it's been around far
>>> less time.
>>>
>>> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
>>> eventually, not now.
>>> But if the question now is build defaults, is it a big deal either way?
>>>
>>> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li  wrote:
>>>
 I think we just need to provide two options and let end users choose
 the ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make
 Pyspark Hadoop 3.2+ Variant available in PyPI) is a high priority task for
 Spark 3.1 release to me.

 I do not know how to track the popularity of Hadoop 2 vs Hadoop 3.
 Based on this link
 https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
 sounds like Hadoop 3.x is not as popular as Hadoop 2.7.




-- 



Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Dongjoon Hyun
Thanks, Xiao, Sean, Nicholas.

To Xiao,

>  it sounds like Hadoop 3.x is not as popular as Hadoop 2.7.

If you say so,
- Apache Hadoop 2.6.0 is the most popular one with 156 dependencies.
- Apache Spark 2.2.0 is the most popular one with 264 dependencies.

As we know, it doesn't make sense. Are we recommending Apache Spark 2.2.0
over Apache Spark 3.0.0?

There is a reason why Apache Spark dropped the Hadoop 2.6 profile. Hadoop 2.7.4
has many limitations in the cloud environment. Apache Hadoop 3.2 will
unleash Apache Spark 3.1 in the cloud environment (as Nicholas also pointed
out).

For Sean's comment, yes. We can focus on that later in a different thread.

> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
eventually, not now.

Bests,
Dongjoon.


On Wed, Jun 24, 2020 at 7:26 AM Nicholas Chammas 
wrote:

> The team I'm on currently uses pip-installed PySpark for local
> development, and we regularly access S3 directly from our
> laptops/workstations.
>
> One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
> being able to use a recent version of hadoop-aws that has mature support
> for s3a. With Hadoop 2.7 the support for s3a is buggy and incomplete, and
> there are incompatibilities that prevent you from using Spark built against
> Hadoop 2.7 with hadoop-aws version 2.8 or newer.
>
> On Wed, Jun 24, 2020 at 10:15 AM Sean Owen  wrote:
>
>> Will pyspark users care much about Hadoop version? they won't if running
>> locally. They will if connecting to a Hadoop cluster. Then again in that
>> context, they're probably using a distro anyway that harmonizes it.
>> Hadoop 3's installed base can't be that large yet; it's been around far
>> less time.
>>
>> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
>> eventually, not now.
>> But if the question now is build defaults, is it a big deal either way?
>>
>> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li  wrote:
>>
>>> I think we just need to provide two options and let end users choose the
>>> ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
>>> Hadoop 3.2+ Variant available in PyPI) is a high priority task for Spark
>>> 3.1 release to me.
>>>
>>> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based
>>> on this link
>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
>>> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>>>
>>>
>>>


Re: Enabling push-based shuffle in Spark

2020-06-24 Thread mshen
Our paper summarizing this work on push-based shuffle was recently accepted
by VLDB 2020.
We have uploaded a preprint version of the paper to the JIRA ticket, along with
the production results we have so far.



-
Min Shen
Staff Software Engineer
LinkedIn
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



High Availability for spark streaming application running in kubernetes

2020-06-24 Thread Shenson Joseph
Hello,

I have a spark streaming application running in kubernetes and we use spark
operator to submit spark jobs. Any suggestions on:

1. How to handle high availability for spark streaming applications.
2. What would be the best approach to handle high availability of
checkpoint data if we don't use HDFS?

Thanks
Shenson


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
The team I'm on currently uses pip-installed PySpark for local development,
and we regularly access S3 directly from our laptops/workstations.

One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is
being able to use a recent version of hadoop-aws that has mature support
for s3a. With Hadoop 2.7 the support for s3a is buggy and incomplete, and
there are incompatibilities that prevent you from using Spark built against
Hadoop 2.7 with hadoop-aws version 2.8 or newer.

On Wed, Jun 24, 2020 at 10:15 AM Sean Owen  wrote:

> Will pyspark users care much about Hadoop version? they won't if running
> locally. They will if connecting to a Hadoop cluster. Then again in that
> context, they're probably using a distro anyway that harmonizes it.
> Hadoop 3's installed base can't be that large yet; it's been around far
> less time.
>
> The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
> eventually, not now.
> But if the question now is build defaults, is it a big deal either way?
>
> On Tue, Jun 23, 2020 at 11:03 PM Xiao Li  wrote:
>
>> I think we just need to provide two options and let end users choose the
>> ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
>> Hadoop 3.2+ Variant available in PyPI) is a high priority task for Spark
>> 3.1 release to me.
>>
>> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based
>> on this link
>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
>> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>>
>>
>>


Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Sean Owen
Will pyspark users care much about Hadoop version? they won't if running
locally. They will if connecting to a Hadoop cluster. Then again in that
context, they're probably using a distro anyway that harmonizes it.
Hadoop 3's installed base can't be that large yet; it's been around far
less time.

The bigger question indeed is dropping Hadoop 2.x / Hive 1.x etc
eventually, not now.
But if the question now is build defaults, is it a big deal either way?

On Tue, Jun 23, 2020 at 11:03 PM Xiao Li  wrote:

> I think we just need to provide two options and let end users choose the
> ones they need. Hadoop 3.2 or Hadoop 2.7. Thus, SPARK-32017 (Make Pyspark
> Hadoop 3.2+ Variant available in PyPI) is a high priority task for Spark
> 3.1 release to me.
>
> I do not know how to track the popularity of Hadoop 2 vs Hadoop 3. Based
> on this link
> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs , it
> sounds like Hadoop 3.x is not as popular as Hadoop 2.7.
>
>
>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-24 Thread Gengliang Wang
+1, the issues mentioned are really serious.

On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon  wrote:

> +1.
>
> Just as a note,
> - SPARK-31918 is fixed now, and there's no blocker.
> - When we build SparkR, we should use the latest R version, at least 4.0.0+.
>
> On Wed, Jun 24, 2020 at 11:20 AM, Dongjoon Hyun wrote:
>
>> +1
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> +1 on a 3.0.1 soon.
>>>
>>> Probably it would be nice if some Scala experts could take a look at
>>> https://issues.apache.org/jira/browse/SPARK-32051 and include the fix
>>> in 3.0.1 if possible.
>>> It looks like APIs designed to work with Scala 2.11 & Java introduce
>>> ambiguity in Scala 2.12 & Java.
>>>
>>> On Wed, Jun 24, 2020 at 4:52 AM Jules Damji  wrote:
>>>
 +1 (non-binding)

 Sent from my iPhone
 Pardon the dumb thumb typos :)

 On Jun 23, 2020, at 11:36 AM, Holden Karau 
 wrote:

 
 +1 on a patch release soon

 On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin 
 wrote:

> +1 on doing a new patch release soon. I saw some of these issues when
> preparing the 3.0 release, and some of them are very serious.
>
>
> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release
>> soon.
>>
>> Shivaram
>>
>> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro <
>> linguin@gmail.com> wrote:
>>
>> Thanks for the heads-up, Yuanjian!
>>
>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>
>> wow, the updates are so quick. Anyway, +1 for the release.
>>
>> Bests,
>> Takeshi
>>
>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li 
>> wrote:
>>
>> Hi dev-list,
>>
>> I’m writing this to raise the discussion about Spark 3.0.1
>> feasibility since 4 blocker issues were found after Spark 3.0.0:
>>
>> [SPARK-31990] The broken state store compatibility will cause a
>> correctness issue when a streaming query with `dropDuplicates` uses a
>> checkpoint written by an older Spark version.
>>
>> [SPARK-32038] The regression bug in handling NaN values in
>> COUNT(DISTINCT)
>>
>> [SPARK-31918][WIP] CRAN requires SparkR to work with the latest R
>> 4.0. This makes the 3.0 release unavailable on CRAN, since it only
>> supports R [3.5, 4.0).
>>
>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>>
>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I
>> think it would be great if we have Spark 3.0.1 to deliver the critical
>> fixes.
>>
>> Any comments are appreciated.
>>
>> Best,
>>
>> Yuanjian
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau