Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Sean Owen
Will PySpark users care much about the Hadoop version? They won't if running locally. They will if connecting to a Hadoop cluster. Then again, in that context they're probably using a distro anyway that harmonizes it. Hadoop 3's install base can't be that large yet; it's been around far less time.

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-24 Thread Gengliang Wang
+1, the issues mentioned are really serious. On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon wrote: > +1. Just as a note: - SPARK-31918 is fixed now, and there's no blocker. - When we build SparkR, we should use the latest R version at

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Sean Owen
I'm also genuinely curious when PyPI users would care about the bundled Hadoop jars. Do we even need two versions? That itself is extra complexity for end users. I do think Hadoop 3 is the better choice for the user who doesn't care, and better long term. OK, but let's at least move ahead with

Re: m2 cache issues in Jenkins?

2020-06-24 Thread shane knapp ☠
for those weird failures, it's super helpful to provide which workers are showing these issues. :) i'd rather not wipe all of the m2 caches on all of the workers, as we'll then potentially get blacklisted again if we download too many packages from apache.org. On Tue, Jun 23, 2020 at 5:58 PM

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Dongjoon Hyun
Thanks, Xiao, Sean, Nicholas. To Xiao, > it sounds like Hadoop 3.x is not as popular as Hadoop 2.7. If you say so, - Apache Hadoop 2.6.0 is the most popular one with 156 dependencies. - Apache Spark 2.2.0 is the most popular one with 264 dependencies. As we know, it doesn't make sense. Are we

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Xiao Li
Hi, Dongjoon, Please do not misinterpret my point. I already clearly said "I do not know how to track the popularity of Hadoop 2 vs Hadoop 3." Also, let me repeat my opinion: the top priority is to provide two options for PyPi distribution and let the end users choose the ones they need. Hadoop

Re: m2 cache issues in Jenkins?

2020-06-24 Thread shane knapp ☠
ok, i've taken that worker offline and once the job running on it finishes, i'll wipe the cache. in the future, please file a JIRA and assign it to me so i don't have to track my work through emails to the dev@ list. ;) thanks! shane On Wed, Jun 24, 2020 at 10:48 AM Holden Karau wrote: >

Re: m2 cache issues in Jenkins?

2020-06-24 Thread Holden Karau
The most recent one I noticed was https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console which was run on amp-jenkins-worker-04. On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ wrote: > for those weird failures, it's super helpful to provide which workers are >

High Availability for spark streaming application running in kubernetes

2020-06-24 Thread Shenson Joseph
Hello, I have a Spark streaming application running in Kubernetes, and we use the Spark operator to submit Spark jobs. Any suggestions on: 1. How to handle high availability for Spark streaming applications. 2. What would be the best approach to handle high availability of checkpoint data if we don't
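The thread doesn't include an answer, but for context, a minimal configuration sketch of the usual building block: a Structured Streaming query whose `checkpointLocation` points at shared storage that survives driver-pod restarts, so a restarted driver can resume from the last committed offsets. The Kafka source, topic, and all paths here are hypothetical, not taken from the original message.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-ha-sketch").getOrCreate()

# Assumed source: a Kafka topic (hypothetical broker address and topic name).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "events")
          .load())

# The key HA piece: keep the checkpoint on storage outside the pod
# (HDFS, S3, or a persistent volume), never on the pod's local disk.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/output/")                  # hypothetical
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/app1/")
         .start())
query.awaitTermination()
```

With this in place, restart policy for the driver pod (e.g. via the Spark operator's restart settings) handles the "restart" half; the durable checkpoint handles the "resume" half.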

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
The team I'm on currently uses pip-installed PySpark for local development, and we regularly access S3 directly from our laptops/workstations. One of the benefits of having Spark built against Hadoop 3.2 vs. 2.7 is being able to use a recent version of hadoop-aws that has mature support for s3a.
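For illustration, a configuration sketch of the workflow described above: a pip-installed PySpark session reading from S3 via s3a. The bucket name and credentials provider shown are assumptions, and the `hadoop-aws` version must match the Hadoop jars bundled with the PySpark build (e.g. `2.7.x` for a Hadoop 2.7 build, `3.2.x` for a Hadoop 3.2 build), which is exactly the coupling under discussion in this thread.

```python
from pyspark.sql import SparkSession

# Hypothetical sketch: version pinned to match the bundled Hadoop jars.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    # Picks up credentials from env vars, ~/.aws, or instance profile.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/some/path/")  # hypothetical bucket
```

A mismatched `hadoop-aws` version typically fails at runtime with class-loading errors, which is why the bundled Hadoop version matters even to laptop-only users.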

Re: Enabling push-based shuffle in Spark

2020-06-24 Thread mshen
Our paper summarizing this work of push-based shuffle was recently accepted by VLDB 2020. We have uploaded a preprint version of the paper to the JIRA ticket, along with the production results we have so far. - Min Shen Staff Software

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Dongjoon Hyun
To Xiao: Why should Apache project releases be blocked by PyPI / CRAN? It's completely optional, isn't it? > let me repeat my opinion: the top priority is to provide two options for PyPi distribution IIRC, Apache Spark 3.0.0 fails to upload to CRAN and this is not the first incident. Apache

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Wenchen Fan
Shall we start a new thread to discuss the bundled Hadoop version in PySpark? I don't have a strong opinion on changing the default, as users can still download the Hadoop 2.7 version. On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun wrote: > To Xiao. > Why Apache project releases should be

[Spark SQL] Question about support for TimeType columns in Apache Parquet files

2020-06-24 Thread Rylan Dmello
Hello, Tahsin and I are trying to use the Apache Parquet file format with Spark SQL, but are running into errors when reading Parquet files that contain TimeType columns. We're wondering whether this is unsupported in Spark SQL due to an architectural limitation, or due to lack of resources?
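The thread doesn't offer a workaround, but for background: Parquet's TIME logical type annotates an integer column holding the elapsed time since midnight (millis for TIME(MILLIS), micros for TIME(MICROS)). A stdlib-only sketch of that encoding, which one could use to carry time-of-day data through Spark as a plain long column instead of an unsupported TIME column; the function names are mine, not from any Spark or Parquet API.

```python
from datetime import time

def time_to_micros(t: time) -> int:
    """Encode a time-of-day the way Parquet's TIME(MICROS) type does:
    microseconds elapsed since midnight."""
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond

def micros_to_time(us: int) -> time:
    """Decode microseconds-since-midnight back into a datetime.time."""
    us, microsecond = divmod(us, 1_000_000)
    minutes, second = divmod(us, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, microsecond)

# Round trip: 01:02:03.000004 -> 3723000004 microseconds -> 01:02:03.000004
assert micros_to_time(time_to_micros(time(1, 2, 3, 4))) == time(1, 2, 3, 4)
```

Writing the long column and decoding it on read sidesteps the TIME logical type entirely, at the cost of losing the type annotation in the file schema.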

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Nicholas Chammas
To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this: ``` pip install pyspark pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7 spark.read.parquet('s3a://...') ``` I agree that Hadoop 3 would

Re: m2 cache issues in Jenkins?

2020-06-24 Thread Holden Karau
Will do :) Thanks for keeping the build system running smoothly :) On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ wrote: > ok, i've taken that worker offline and once the job running on it > finishes, i'll wipe the cache. > > in the future, please file a JIRA and assign it to me so i don't have

Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

2020-06-24 Thread Bart Samwel
The relevant earlier discussion is here: https://github.com/apache/spark/pull/25678#issuecomment-531585556. (FWIW, a recent PR tried adding this again: https://github.com/apache/spark/pull/28858.) On Wed, Jun 24, 2020 at 10:01 PM Rylan Dmello wrote: > Hello, > > > Tahsin and I are trying to

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Andrew Melo
Hello, On Wed, Jun 24, 2020 at 2:13 PM Holden Karau wrote: > > So I thought our theory for the pypi packages was it was for local > developers, they really shouldn't care about the Hadoop version. If you're > running on a production cluster you ideally pip install from the same release >

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Holden Karau
So I thought our theory for the pypi packages was it was for local developers, they really shouldn't care about the Hadoop version. If you're running on a production cluster you ideally pip install from the same release artifacts as your production cluster to match. On Wed, Jun 24, 2020 at 12:11

Re: m2 cache issues in Jenkins?

2020-06-24 Thread shane knapp ☠
done: -bash-4.1$ cd .m2 -bash-4.1$ ls repository -bash-4.1$ time rm -rf * real 17m4.607s user 0m0.950s sys 0m18.816s -bash-4.1$ On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ wrote: > ok, i've taken that worker offline and once the job running on it > finishes, i'll wipe the cache. >

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-24 Thread Prashant Sharma
+1 for the 3.0.1 release. I too can help out as release manager. On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰 wrote: > I volunteer to be a release manager of 3.0.1, if nobody is working on this. > > > -- Original message -- > *From:* "Gengliang Wang"; > *Sent:* Wednesday, June 24, 2020, 4:15 PM >

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-24 Thread 郑瑞峰
I volunteer to be a release manager of 3.0.1, if nobody is working on this. -- Original message -- From: "Gengliang Wang" https://issues.apache.org/jira/browse/SPARK-32051 and include the fix into 3.0.1 if possible. Looks like APIs designed to work with Scala 2.11 Java bring