Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread shane knapp
ugh...  R unit tests failed on both of these builds.
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94583/artifact/R/target/
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94584/artifact/R/target/




-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread shane knapp
/agreemsg



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread Sean Owen
Seems OK to proceed with shutting off lintr, as it was masking those.



Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread Shivaram Venkataraman
Sounds good to me as well. Thanks Shane.

Shivaram



Re: [R] discuss: removing lint-r checks for old branches

2018-08-10 Thread Reynold Xin
SGTM



[R] discuss: removing lint-r checks for old branches

2018-08-10 Thread shane knapp
https://issues.apache.org/jira/browse/SPARK-25089

basically since these branches are old, and there will be a greater than
zero amount of work to get lint-r to pass (on the new ubuntu workers), sean
and i are proposing to remove the lint-r checks for the builds.

this is super not important for the 2.4 cut/code freeze, but i wanted to
get this done before it gets pushed down my queue and before we revisit the
ubuntu port.

thanks in advance,

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
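
In practice the proposed change amounts to gating the lint-r step on the
branch being built. Below is a minimal Python sketch of such a gate; the
names SKIP_LINT_R_BRANCHES and run_lint_r are hypothetical and not Spark's
actual dev/run-tests code, though ./dev/lint-r is the real SparkR
style-check entry point and GIT_BRANCH is a standard Jenkins variable.

import os
import subprocess

# Branches proposed for exclusion; illustrative list only.
SKIP_LINT_R_BRANCHES = {"branch-2.0", "branch-2.1", "branch-2.2"}

def run_lint_r():
    # ./dev/lint-r is the SparkR style-check entry point.
    subprocess.check_call(["./dev/lint-r"])

def main():
    # GIT_BRANCH is set by the Jenkins git plugin.
    branch = os.environ.get("GIT_BRANCH", "master")
    if branch in SKIP_LINT_R_BRANCHES:
        print("skipping lint-r on old branch %s" % branch)
    else:
        run_lint_r()

if __name__ == "__main__":
    main()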


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
> I also think it's a good idea to test against newer Python versions. But I
> don't know how difficult it is and whether or not it's feasible to resolve
> that between branch cut and RC cut.

unless someone pops in to this thread and tells me w/o a doubt that all
spark branches will happily pass against 3.5, it will not happen until
after the 2.4 cut.  :)

however, from my (limited) testing, it does look like that's the case.
still not gonna pull the trigger on it until after the cut.

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Li Jin
I agree with Bryan. If it's acceptable to have another job to test with
Python 3.5 and pyarrow 0.10.0, I am leaning towards upgrading arrow.

Arrow 0.10.0 has tons of bug fixes and improvements over 0.8.0, including
important memory leak fixes such as
https://issues.apache.org/jira/browse/ARROW-1973. I think releasing with
0.10.0 will improve the overall experience of arrow-related features quite
a bit.

I also think it's a good idea to test against newer Python versions. But I
don't know how difficult it is and whether or not it's feasible to resolve
that between branch cut and RC cut.
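
One low-risk way to cover both environments is to run the same suite in
both jobs and gate version-sensitive tests at runtime. A minimal pytest
sketch follows; the marker and test names are made up for illustration and
are not Spark's actual test code.

import pytest
from distutils.version import LooseVersion

import pyarrow

# Skip tests that depend on post-0.8.0 behavior (e.g. the ARROW-1973
# memory leak fix) unless the job runs pyarrow >= 0.10.0.
requires_arrow_0_10 = pytest.mark.skipif(
    LooseVersion(pyarrow.__version__) < LooseVersion("0.10.0"),
    reason="requires pyarrow >= 0.10.0",
)

@requires_arrow_0_10
def test_no_leak_on_ipc_roundtrip():
    # Placeholder body; a real test would exercise the Arrow IPC path.
    assert pyarrow.__version__ is not None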



Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
python 3.5/pyarrow 0.10.0 build:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.6-python-3.5-arrow-0.10.0-ubuntu-testing/




-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
see:  https://github.com/apache/spark/pull/21939#issuecomment-412154343

yes, i can set up a build.  have some Qs in the PR about building the spark
package before running the python tests.



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Bryan Cutler
I agree that we should hold off on the Arrow upgrade if it requires major
changes to our testing. I did have another thought that maybe we could just
add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all
current testing the same? I'm not sure how doable that is right now and
don't want to make a ton of extra work, so no objections from me to hold
off on things for now.



Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan  wrote:

> It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave it
> to Spark 3.0, so that we have more time to test. Any objections?
>

none here.

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Wenchen Fan
It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave it
to Spark 3.0, so that we have more time to test. Any objections?



Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-10 Thread Marco Gaido
Hi Makatun,

I think your problem has been solved by
https://issues.apache.org/jira/browse/SPARK-16406, which is going to be in
Spark 2.4.
Please try the current master, where you should see the problem disappear.

Thanks,
Marco
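
For anyone who wants to verify this on master, here is a minimal sketch for
timing the wide-column case; the column counts and the trivial projection
used here are arbitrary choices for illustration, not the original
benchmark.

import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()

for ncols in (100, 400, 1600):
    # Build a DataFrame with ncols constant columns.
    df = spark.range(10).select(
        *[F.lit(i).alias("c%d" % i) for i in range(ncols)])
    start = time.time()
    # A trivial per-column projection; with SPARK-16406 the planning
    # cost should grow roughly linearly rather than polynomially.
    df.select(*[(F.col(c) + 1).alias(c) for c in df.columns]).count()
    print(ncols, time.time() - start)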

2018-08-09 12:56 GMT+02:00 makatun :

> Here are the images missing in the previous mail. My apologies.
>  timeline.png>
>  readFormat_visualVM_Sampler.jpg>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
quick update from my end:

SPARK-24433 (SparkR/k8s) depends on SPARK-25087 (move builds to ubuntu)

SPARK-23874 (arrow -> 0.10.0) now depends on SPARK-25079 (python 3.5
upgrade)

both SPARK-25087 and SPARK-25079 are in progress and i'm very very hesitant
to do these upgrades before the code freeze/branch cut.  i've done a TON of
testing, but even as of yesterday afternoon, i'm still uncovering bugs and
things that need fixing both on the infrastructure side and spark itself.

h/t sean owen for helping out on SPARK-24950

On Wed, Aug 8, 2018 at 10:51 AM, Mark Hamstra wrote:

> I'm inclined to agree. Just saying that it is not a regression doesn't
> really cut it when it is a now known data correctness issue. We need
> something a lot more than nothing before releasing 2.4.0. At a barest
> minimum, that has to be much more complete and publicly highlighted
> documentation of the issue so that users are less likely to stumble into
> this unaware; but really we need to fix at least the most common cases of
> this bug. Backports to maintenance branches are also probably in order.
>
> On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid wrote:
>
>> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>>>
>>> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect
>>> answers. It turns out to be a very complicated issue; there is no
>>> consensus about what the right fix is yet. Likely to miss Spark 2.4
>>> because it's a long-standing issue, not a regression.
>>>
>>
>> This is a really serious data loss bug. Yes, it's very complex, but we
>> absolutely have to fix this; I really think it should be in 2.4.
>> Has work on it stopped?
>>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS][SQL] Control the number of output files

2018-08-10 Thread Koert Kuipers
we have found that to make shuffles reliable without OOMs we need to have
spark.sql.shuffle.partitions at a high number, bigger than 2000 at least.
yet this leads to a large amount of part files, which puts big pressure on
spark driver programs.

i tried to mitigate this with dataframe.coalesce to reduce the number of
files, but this is not acceptable. coalesce changes the tasks for the last
shuffle before it, bringing back the issues we tried to mitigate with a
high number for spark.sql.shuffle.partitions in the first place. doing a
dataframe.repartition before every write is also not an acceptable
approach; it is too high a price to pay just to bring down the number of
files.

so i am very excited about any approach that efficiently merges files when
writing.
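
To make the trade-off concrete, here is a small pyspark sketch contrasting
the two workarounds described above; the paths, key name, and partition
counts are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Many shuffle partitions keep individual reduce tasks small enough
# to avoid OOMs on large shuffles.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

df = spark.read.parquet("/data/in").groupBy("key").count()

# coalesce(16) avoids a shuffle but propagates upstream: the groupBy
# above now effectively runs with only 16 tasks, defeating the high
# shuffle-partition setting.
df.coalesce(16).write.parquet("/data/out_coalesce")

# repartition(16) keeps the 2000-task shuffle intact but pays for a
# full extra shuffle just to reduce the number of output files.
df.repartition(16).write.parquet("/data/out_repartition")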



On Mon, Aug 6, 2018 at 5:28 PM, lukas nalezenec  wrote:

> Hi Koert,
> There is no such Jira yet. We need SPARK-23889 first. You can find some
> mentions in the design document inside SPARK-23889.
> Best regards
> Lukas
>
> 2018-08-06 18:34 GMT+02:00 Koert Kuipers :
>
>> i went through the jiras targeting 2.4.0 trying to find a feature where
>> spark would coalesce/repartition by size (so merge small files
>> automatically), but didn't find it.
>> can someone point me to it?
>> thank you.
>> best,
>> koert
>>
>> On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers  wrote:
>>
>>> lukas,
>>> what is the jira ticket for this? i would like to follow its activity.
>>> thanks!
>>> koert
>>>
>>> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec 
>>> wrote:
>>>
>>>> Hi,
>>>> Yes, this feature is planned - Spark should soon be able to repartition
>>>> output by size.
>>>> Lukas
>>>>
>>>> On Wed, Jul 25, 2018 at 11:26 PM Forest Fang wrote:

>>>>> Has there been any discussion to simply support Hive's merge small
>>>>> files configuration? It simply adds one additional stage to inspect
>>>>> size of each output file, recompute the desired parallelism to reach a
>>>>> target size, and runs a map-only coalesce before committing the final
>>>>> files. Since AFAIK SparkSQL already stages the final output commit, it
>>>>> seems feasible to respect this Hive config.
>>>>>
>>>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>>>
>>>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra wrote:
>>>>>
>>>>>> See some of the related discussion under
>>>>>> https://github.com/apache/spark/pull/21589
>>>>>>
>>>>>> It feels to me like we need some kind of user code mechanism to signal
>>>>>> policy preferences to Spark. This could also include ways to signal
>>>>>> scheduling policy, which could include things like scheduling pool
>>>>>> and/or barrier scheduling. Some of those scheduling policies operate
>>>>>> at inherently different levels currently -- e.g. scheduling pools at
>>>>>> the Job level (really, the thread-local level in the current
>>>>>> implementation) and barrier scheduling at the Stage level -- so it is
>>>>>> not completely obvious how to unify all of these policy
>>>>>> options/preferences/mechanisms, or whether it is possible, but I think
>>>>>> it is worth considering such things at a fairly high level of
>>>>>> abstraction and try to unify and simplify before making things more
>>>>>> complex with multiple policy mechanisms.
>>>>>>
>>>>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin wrote:
>>>>>>
>>>>>>> Seems like a good idea in general. Do other systems have similar
>>>>>>> concepts? In general it'd be easier if we can follow existing
>>>>>>> convention if there is any.
>>>>>>>
>>>>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Many Spark users in my company are asking for a way to control the
>>>>>>>> number of output files in Spark SQL. There are use cases to either
>>>>>>>> reduce or increase the number. The users prefer not to use the
>>>>>>>> functions *repartition*(n) or *coalesce*(n, shuffle), which require
>>>>>>>> them to write and deploy Scala/Java/Python code.
>>>>>>>>
>>>>>>>> Could we introduce a query hint for this purpose (similar to
>>>>>>>> Broadcast Join Hints)?
>>>>>>>>
>>>>>>>> /*+ *COALESCE*(n, shuffle) */
>>>>>>>>
>>>>>>>> In general, is a query hint the best way to bring DF functionality
>>>>>>>> to SQL without extending SQL syntax? Any suggestion is highly
>>>>>>>> appreciated.
>>>>>>>>
>>>>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>>>>> auto-merging output files.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> John Zhuge
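
John Zhuge's proposal piggybacks on the hint mechanism Spark SQL already
ships for broadcast joins. A minimal pyspark illustration of that existing
mechanism follows; the table names are made up, and the COALESCE hint shown
in the comment was only a proposal as of this thread, not a shipped
feature.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(1000).createOrReplaceTempView("big")
spark.range(10).createOrReplaceTempView("small")

# The existing hint syntax (Spark >= 2.2): force a broadcast join.
df = spark.sql("""
    SELECT /*+ BROADCAST(small) */ big.id
    FROM big JOIN small ON big.id = small.id
""")
df.explain()  # the plan should show a BroadcastHashJoin

# The proposal would reuse the same /*+ ... */ syntax, e.g.
#   SELECT /*+ COALESCE(16) */ ...
# to control output file counts without repartition()/coalesce() calls.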