Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Sean Owen
Will PySpark users care much about the Hadoop version? They won't if running locally. They will if connecting to a Hadoop cluster. Then again, in that context they're probably using a distro anyway that harmonizes it. Hadoop 3's install base can't be that large yet; it's been around far less time.

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-24 Thread Gengliang Wang
+1, the issues mentioned are really serious. On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon wrote: > +1. Just as a note: - SPARK-31918 is fixed now, and there's no blocker. - When we build SparkR, we should use the latest R version at

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Sean Owen
I'm also genuinely curious when PyPI users would care about the bundled Hadoop jars. Do we even need two versions? That itself is extra complexity for end users. I do think Hadoop 3 is the better choice for the user who doesn't care, and better long term. OK, but let's at least move ahead with

Re: m2 cache issues in Jenkins?

2020-06-24 Thread shane knapp ☠
for those weird failures, it's super helpful to provide which workers are showing these issues. :) i'd rather not wipe all of the m2 caches on all of the workers, as we'll then potentially get blacklisted again if we download too many packages from apache.org. On Tue, Jun 23, 2020 at 5:58 PM

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Dongjoon Hyun
Thanks, Xiao, Sean, Nicholas. To Xiao, > it sounds like Hadoop 3.x is not as popular as Hadoop 2.7. If you say so, - Apache Hadoop 2.6.0 is the most popular one with 156 dependencies. - Apache Spark 2.2.0 is the most popular one with 264 dependencies. As we know, it doesn't make sense. Are we

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Xiao Li
Hi, Dongjoon, Please do not misinterpret my point. I already clearly said "I do not know how to track the popularity of Hadoop 2 vs Hadoop 3." Also, let me repeat my opinion: the top priority is to provide two options for PyPi distribution and let the end users choose the ones they need. Hadoop

Re: m2 cache issues in Jenkins?

2020-06-24 Thread shane knapp ☠
ok, i've taken that worker offline and once the job running on it finishes, i'll wipe the cache. in the future, please file a JIRA and assign it to me so i don't have to track my work through emails to the dev@ list. ;) thanks! shane On Wed, Jun 24, 2020 at 10:48 AM Holden Karau wrote: >

Re: m2 cache issues in Jenkins?

2020-06-24 Thread Holden Karau
The most recent one I noticed was https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console which was run on amp-jenkins-worker-04. On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ wrote: > for those weird failures, it's super helpful to provide which workers are >

High Availability for spark streaming application running in kubernetes

2020-06-24 Thread Shenson Joseph
Hello, I have a Spark streaming application running in Kubernetes, and we use the Spark operator to submit Spark jobs. Any suggestions on: 1. How to handle high availability for Spark streaming applications. 2. What would be the best approach to handle high availability of checkpoint data if we don't
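The thread doesn't include an answer, but for context, a minimal configuration sketch of the usual building block: a Structured Streaming query whose `checkpointLocation` points at shared storage that survives driver-pod restarts, so a restarted driver can resume from the last committed offsets. The Kafka source, topic, and all paths here are hypothetical, not taken from the original message.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-ha-sketch").getOrCreate()

# Assumed source: a Kafka topic (hypothetical broker address and topic name).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "events")
          .load())

# The key HA piece: keep the checkpoint on storage outside the pod
# (HDFS, S3, or a persistent volume), never on the pod's local disk.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/output/")                  # hypothetical
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/app1/")
         .start())
query.awaitTermination()
```

With this in place, restart policy for the driver pod (e.g. via the Spark operator's restart settings) handles the "restart" half; the durable checkpoint handles the "resume" half.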

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
The team I'm on currently uses pip-installed PySpark for local development, and we regularly access S3 directly from our laptops/workstations. One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is being able to use a recent version of hadoop-aws that has mature support for s3a.
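For illustration, a configuration sketch of the workflow described above: a pip-installed PySpark session reading from S3 via s3a. The bucket name and credentials provider shown are assumptions, and the `hadoop-aws` version must match the Hadoop jars bundled with the PySpark build (e.g. `2.7.x` for a Hadoop 2.7 build, `3.2.x` for a Hadoop 3.2 build), which is exactly the coupling under discussion in this thread.

```python
from pyspark.sql import SparkSession

# Hypothetical sketch: version pinned to match the bundled Hadoop jars.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    # Picks up credentials from env vars, ~/.aws, or instance profile.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/some/path/")  # hypothetical bucket
```

A mismatched `hadoop-aws` version typically fails at runtime with class-loading errors, which is why the bundled Hadoop version matters even to laptop-only users.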

Re: Enabling push-based shuffle in Spark

2020-06-24 Thread mshen
Our paper summarizing this work of push-based shuffle was recently accepted by VLDB 2020. We have uploaded a preprint version of the paper to the JIRA ticket, along with the production results we have so far. - Min Shen Staff Software

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Dongjoon Hyun
To Xiao: Why should Apache project releases be blocked by PyPI / CRAN? It's completely optional, isn't it? > let me repeat my opinion: the top priority is to provide two options for PyPi distribution IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the first incident. Apache

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Wenchen Fan
Shall we start a new thread to discuss the bundled Hadoop version in PySpark? I don't have a strong opinion on changing the default, as users can still download the Hadoop 2.7 version. On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun wrote: > To Xiao. > Why Apache project releases should be

[Spark SQL] Question about support for TimeType columns in Apache Parquet files

2020-06-24 Thread Rylan Dmello
Hello, Tahsin and I are trying to use the Apache Parquet file format with Spark SQL, but are running into errors when reading Parquet files that contain TimeType columns. We're wondering whether this is unsupported in Spark SQL due to an architectural limitation, or due to lack of resources?
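The thread doesn't offer a workaround, but for background: Parquet's TIME logical type annotates an integer column holding the elapsed time since midnight (millis for TIME(MILLIS), micros for TIME(MICROS)). A stdlib-only sketch of that encoding, which one could use to carry time-of-day data through Spark as a plain long column instead of an unsupported TIME column; the function names are mine, not from any Spark or Parquet API.

```python
from datetime import time

def time_to_micros(t: time) -> int:
    """Encode a time-of-day the way Parquet's TIME(MICROS) type does:
    microseconds elapsed since midnight."""
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond

def micros_to_time(us: int) -> time:
    """Decode microseconds-since-midnight back into a datetime.time."""
    us, microsecond = divmod(us, 1_000_000)
    minutes, second = divmod(us, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, microsecond)

# Round trip: 01:02:03.000004 -> 3723000004 microseconds -> 01:02:03.000004
assert micros_to_time(time_to_micros(time(1, 2, 3, 4))) == time(1, 2, 3, 4)
```

Writing the long column and decoding it on read sidesteps the TIME logical type entirely, at the cost of losing the type annotation in the file schema.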

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this: ``` pip install pyspark pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7 spark.read.parquet('s3a://...') ``` I agree that Hadoop 3 would

Re: m2 cache issues in Jenkins?

2020-06-24 Thread Holden Karau
Will do :) Thanks for keeping the build system running smoothly :) On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ wrote: > ok, i've taken that worker offline and once the job running on it > finishes, i'll wipe the cache. > > in the future, please file a JIRA and assign it to me so i don't have

Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

2020-06-24 Thread Bart Samwel
The relevant earlier discussion is here: https://github.com/apache/spark/pull/25678#issuecomment-531585556. (FWIW, a recent PR tried adding this again: https://github.com/apache/spark/pull/28858.) On Wed, Jun 24, 2020 at 10:01 PM Rylan Dmello wrote: > Hello, > > > Tahsin and I are trying to

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Andrew Melo
Hello, On Wed, Jun 24, 2020 at 2:13 PM Holden Karau wrote: > > So I thought our theory for the pypi packages was it was for local > developers, they really shouldn't care about the Hadoop version. If you're > running on a production cluster you ideally pip install from the same release >

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Holden Karau
So I thought our theory for the pypi packages was it was for local developers, they really shouldn't care about the Hadoop version. If you're running on a production cluster you ideally pip install from the same release artifacts as your production cluster to match. On Wed, Jun 24, 2020 at 12:11

Re: m2 cache issues in Jenkins?

2020-06-24 Thread shane knapp ☠
done: -bash-4.1$ cd .m2 -bash-4.1$ ls repository -bash-4.1$ time rm -rf * real 17m4.607s user 0m0.950s sys 0m18.816s -bash-4.1$ On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ wrote: > ok, i've taken that worker offline and once the job running on it > finishes, i'll wipe the cache. >

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-24 Thread Prashant Sharma
+1 for the 3.0.1 release. I too can help out as release manager. On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰 wrote: > I volunteer to be a release manager of 3.0.1, if nobody is working on this. > > > -- Original message -- > *From:* "Gengliang Wang"; > *Sent:* Wednesday, June 24, 2020, 4:15 PM >

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-24 Thread 郑瑞峰
I volunteer to be a release manager of 3.0.1, if nobody is working on this. -- Original message -- From: "Gengliang Wang" https://issues.apache.org/jira/browse/SPARK-32051 and include the fix into 3.0.1 if possible. Looks like APIs designed to work with Scala 2.11 Java bring