Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread shane knapp
ugh... R unit tests failed on both of these builds. https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94583/artifact/R/target/ https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94584/artifact/R/target/ On Fri, Aug 10, 2018 at 1:58 PM, Shivaram Venkataraman <

Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread shane knapp
/agreemsg On Fri, Aug 10, 2018 at 4:02 PM, Sean Owen wrote: > Seems OK to proceed with shutting off lintr, as it was masking those. > > On Fri, Aug 10, 2018 at 6:01 PM shane knapp wrote: > >> ugh... R unit tests failed on both of these builds. >> https://amplab.cs.berkeley.edu/jenkins//job/

Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread Sean Owen
Seems OK to proceed with shutting off lintr, as it was masking those. On Fri, Aug 10, 2018 at 6:01 PM shane knapp wrote: > ugh... R unit tests failed on both of these builds. > > https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94583/artifact/R/target/ > >

Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread Shivaram Venkataraman
Sounds good to me as well. Thanks Shane. Shivaram On Fri, Aug 10, 2018 at 1:40 PM Reynold Xin wrote: > > SGTM > > On Fri, Aug 10, 2018 at 1:39 PM shane knapp wrote: >> >> https://issues.apache.org/jira/browse/SPARK-25089 >> >> basically since these branches are old, and there will be a greater

Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread Reynold Xin
SGTM On Fri, Aug 10, 2018 at 1:39 PM shane knapp wrote: > https://issues.apache.org/jira/browse/SPARK-25089 > > basically since these branches are old, and there will be a greater than > zero amount of work to get lint-r to pass (on the new ubuntu workers), sean > and i are proposing to remove

[R] discuss: removing lint-r checks for old branches

2018-08-10 Thread shane knapp
https://issues.apache.org/jira/browse/SPARK-25089 basically since these branches are old, and there will be a greater than zero amount of work to get lint-r to pass (on the new ubuntu workers), sean and i are proposing to remove the lint-r checks for the builds. this is super not important for
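
For illustration only, here is a hypothetical sketch (not Spark's actual dev/run-tests.py logic, and the branch names are placeholders) of the idea being proposed: gate the lint-r step on which branch is being built, so old maintenance branches skip it while current branches keep it.

# Hypothetical illustration of branch-gated linting; not Spark's real test harness.
import subprocess

# Branches where lint-r would still run; older maintenance branches are skipped.
LINT_R_BRANCHES = {"master"}

def current_branch():
    """Return the name of the branch currently checked out."""
    return subprocess.check_output(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"]).decode().strip()

def maybe_run_lint_r():
    branch = current_branch()
    if branch in LINT_R_BRANCHES:
        # dev/lint-r is the script in the Spark repo that wraps the R lint checks.
        subprocess.check_call(["./dev/lint-r"])
    else:
        print("Skipping lint-r on maintenance branch %s" % branch)

if __name__ == "__main__":
    maybe_run_lint_r()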

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
> > > I also think it's a good idea to test against newer Python versions. But I > don't know how difficult it is and whether or not it's feasible to resolve > that between branch cut and RC cut. > > unless someone pops in to this thread and tells me w/o a doubt that all spark branches will

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Li Jin
I agree with Bryan. If it's acceptable to have another job to test with Python 3.5 and pyarrow 0.10.0, I am leaning towards upgrading arrow. Arrow 0.10.0 has tons of bug fixes and improvements over 0.8.0, including important memory leak fixes such as https://issues.apache.org/jira/browse/ARROW-1973.
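
As a rough illustration of the kind of guard such a test job could use (a standalone sketch, not PySpark's own version check), one can compare the installed pyarrow against the minimum version under discussion:

# Minimal sketch: verify the installed pyarrow meets a required minimum version.
from distutils.version import LooseVersion

import pyarrow

REQUIRED_PYARROW = "0.10.0"  # version under discussion in this thread

def require_minimum_pyarrow(required=REQUIRED_PYARROW):
    installed = pyarrow.__version__
    if LooseVersion(installed) < LooseVersion(required):
        raise ImportError(
            "pyarrow >= %s is required; found %s" % (required, installed))

require_minimum_pyarrow()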

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
python 3.5/pyarrow 0.10.0 build: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.6-python-3.5-arrow-0.10.0-ubuntu-testing/ On Fri, Aug 10, 2018 at 10:44 AM, shane knapp wrote: > see:

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
see: https://github.com/apache/spark/pull/21939#issuecomment-412154343 yes, i can set up a build. have some Qs in the PR about building the spark package before running the python tests. On Fri, Aug 10, 2018 at 10:41 AM, Bryan Cutler wrote: > I agree that we should hold off on the Arrow

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Bryan Cutler
I agree that we should hold off on the Arrow upgrade if it requires major changes to our testing. I did have another thought that maybe we could just add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all current testing the same? I'm not sure how doable that is right now and

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan wrote: > It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave it > to Spark 3.0, so that we have more time to test. Any objections? > none here. -- Shane Knapp UC Berkeley EECS Research / RISELab Staff Technical Lead

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Wenchen Fan
It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave it to Spark 3.0, so that we have more time to test. Any objections? On Fri, Aug 10, 2018 at 11:53 PM shane knapp wrote: > quick update from my end: > > SPARK-24433 (SparkR/k8s) depends on SPARK-25087 (move builds to ubuntu)

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-10 Thread Marco Gaido
Hi Makatun, I think your problem has been solved by https://issues.apache.org/jira/browse/SPARK-16406, which is going to be in Spark 2.4. Please try on the current master, where the problem should no longer appear. Thanks, Marco 2018-08-09 12:56 GMT+02:00 makatun : > Here are the images
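
For anyone who wants to try this on master, here is a minimal sketch, assuming a local SparkSession, that times a projection over an increasingly wide DataFrame; the column counts and names are illustrative.

# Minimal sketch: measure how planning/analysis time grows with column count.
import time

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("wide-df-timing").getOrCreate()

for n_cols in (100, 500, 1000):
    df = spark.range(10).select(
        *[F.lit(i).alias("c%d" % i) for i in range(n_cols)])
    start = time.time()
    # A simple projection over every column; planning/analysis dominates here.
    df.select(*[(F.col("c%d" % i) + 1).alias("c%d" % i) for i in range(n_cols)]).explain()
    print("%d columns: %.2fs" % (n_cols, time.time() - start))

spark.stop()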

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
quick update from my end: SPARK-24433 (SparkR/k8s) depends on SPARK-25087 (move builds to ubuntu) SPARK-23874 (arrow -> 0.10.0) now depends on SPARK-25079 (python 3.5 upgrade) both SPARK-25087 and SPARK-25079 are in progress and i'm very very hesitant to do these upgrades before the code

Re: [DISCUSS][SQL] Control the number of output files

2018-08-10 Thread Koert Kuipers
we have found that to make shuffles reliable without OOMs we need to have spark.sql.shuffle.partitions at a high number, bigger than 2000 at least. yet this leads to a large number of part files, which puts big pressure on spark driver programs. i tried to mitigate this with dataframe.coalesce to
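
A minimal sketch of the pattern being discussed, with illustrative paths, column names, and partition counts: run the shuffle with many partitions for stability, then reduce the number of output files just before the write. repartition() is shown here as the alternative to the coalesce() mentioned above, since coalesce can propagate a lower partition count upstream into the shuffle.

# Minimal sketch: many shuffle partitions for stability, fewer output files on write.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("output-files-sketch").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "4000")  # keep shuffles reliable

df = spark.read.parquet("/path/to/input")               # hypothetical input path
aggregated = df.groupBy("key").agg(F.sum("value").alias("total"))

# repartition() adds an extra shuffle but does not reduce the parallelism of the
# aggregation above; coalesce() avoids the extra shuffle but can lower the
# partition count of the upstream stages, which is the trade-off in this thread.
aggregated.repartition(200).write.mode("overwrite").parquet("/path/to/output")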