Re: code freeze and branch cut for Apache Spark 2.4

2018-08-29 Thread Wenchen Fan
A few updates on this thread:

We still have one blocking issue, the repartition correctness bug:
https://github.com/apache/spark/pull/22112
It's close to being merged.

There are a few PRs to fix Scala 2.12 issues. I think they will keep coming
up, and we don't need to block Spark 2.4 on them.

All other features/issues mentioned in this thread are either finished or
retargeted to the next release; hopefully we can cut the branch this week.

Thanks to everyone for your contributions! Please reply to this email if
you think something should be done before Spark 2.4.

Thanks,
Wenchen

On Tue, Aug 14, 2018 at 12:57 AM Xingbo Jiang  wrote:

> I'm working on the fix for SPARK-23243 and should be able to
> push another commit in 1~2 days. More detailed discussions can go to the PR.
> Thanks for pushing this issue forward! I really appreciate the efforts of
> everyone who submitted PRs or joined the discussions actively!
>
> 2018-08-13 22:50 GMT+08:00 Tom Graves :
>
>> I agree with Imran, we need to fix SPARK-23243 and, for that matter, any
>> correctness issues.
>>
>> Tom
>>
>> On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid
>>  wrote:
>>
>>
>> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>>
>> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers.
>> It turns out to be a very complicated issue; there is no consensus yet about
>> the right fix. It is likely to miss Spark 2.4 because it's a long-standing
>> issue, not a regression.
>>
>>
>> This is a really serious data loss bug.  Yes, it's very complex, but we
>> absolutely have to fix it; I really think it should be in 2.4.
>> Has work on it stopped?
>>
>
>
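The failure mode behind SPARK-23243 can be sketched with a toy model (pure Python, not Spark code; the two-partition setup and row values are invented for illustration): round-robin repartitioning assigns rows to output partitions by their position in the input, so if the input order is nondeterministic (as it can be after a shuffle fetch), recomputing a single task after a failure can place rows differently, dropping some and duplicating others.

```python
def round_robin(rows, num_partitions):
    # Assign each row to an output partition by its input position,
    # as round-robin repartitioning does.
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

first_order = ["a", "b", "c", "d"]
retry_order = ["b", "a", "c", "d"]  # same rows, nondeterministic fetch order

before = round_robin(first_order, 2)  # [["a", "c"], ["b", "d"]]
after = round_robin(retry_order, 2)   # [["b", "c"], ["a", "d"]]

# Suppose output partition 0 was computed from the first input order, then
# the task for partition 1 failed and was recomputed from the retried order:
# row "a" now appears twice and row "b" is lost.
combined = before[0] + after[1]
print(sorted(combined))  # ['a', 'a', 'c', 'd']
```

The PR above discusses how Spark should handle this; the sketch only shows why recomputing some, but not all, output partitions is unsafe here.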


Re: [DISCUSS] move away from python doctests

2018-08-29 Thread Maciej Szymkiewicz
Hi Imran,

On Wed, 29 Aug 2018 at 22:26, Imran Rashid 
wrote:

> Hi Li,
>
> yes that makes perfect sense.  That more-or-less is the same as my view,
> though I framed it differently.  I guess in that case, I'm really asking:
>
> Can pyspark changes please be accompanied by more unit tests, and not
> assume we're getting coverage from doctests?
>

I don't think such assumptions are made, or at least I haven't seen any
evidence of that.

However, we often assume that particular components are already tested in
the Scala API (SQL, ML), and intentionally don't repeat those tests.


>
> Imran
>
> On Wed, Aug 29, 2018 at 2:02 PM Li Jin  wrote:
>
>> Hi Imran,
>>
>> My understanding is that doctests and unittests are orthogonal - doctests
>> are used to make sure docstring examples are correct and are not meant to
>> replace unittests.
>> Functionality is covered by unit tests to ensure correctness, and
>> doctests are used to test the docstrings, not the functionality itself.
>>
>> There are issues with doctests; for example, we cannot test Arrow-related
>> functions in doctests because pyarrow is an optional dependency, but I
>> think that's a separate issue.
>>
>> Does this make sense?
>>
>> Li
>>
>> On Wed, Aug 29, 2018 at 6:35 PM Imran Rashid 
>> wrote:
>>
>>> Hi,
>>>
>>> I'd like to propose that we move away from such heavy reliance on
>>> doctests in Python, and move towards more traditional unit tests.  The
>>> main reason is that it's hard to share test code in doctests.  For
>>> example, I was just looking at
>>>
>>> https://github.com/apache/spark/commit/82c18c240a6913a917df3b55cc5e22649561c4dd
>>>  and wondering if we had any tests for some of the pyspark changes.
>>> SparkSession.createDataFrame has doctests, but those are just run with one
>>> standard Spark configuration, which does not enable Arrow.  It's hard to
>>> reuse that test with another Spark context and a different conf.
>>> Similarly, I've wondered about reusing test cases with local-cluster
>>> instead of local mode.  I feel like doctests also discourage writing
>>> tests that try to get more exhaustive coverage of corner cases.
>>>
>>> I'm not saying we should stop using doctests -- I see why they're nice.
>>> I just think they should really only be used when you want that code
>>> snippet in the doc anyway, so you might as well test it.
>>>
>>> Admittedly, I'm not really a Python developer, so I could be totally
>>> wrong about the right way to author doctests -- pushback welcome!
>>>
>>> Thoughts?
>>>
>>> thanks,
>>> Imran
>>>
>>


Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-29 Thread Mridul Muralidharan
+1
I left a couple of comments in NiharS's PR, but this is very useful to
have in Spark!

Regards,
Mridul
On Fri, Aug 3, 2018 at 10:00 AM Imran Rashid
 wrote:
>
> I'd like to propose adding a plugin API for executors, primarily for
> instrumentation and debugging
> (https://issues.apache.org/jira/browse/SPARK-24918).  The changes are small,
> but as it's adding a new API, it might be SPIP-worthy.  I mentioned it as
> well in a recent email I sent about memory monitoring.
>
> The spip proposal is here (and attached to the jira as well): 
> https://docs.google.com/document/d/1a20gHGMyRbCM8aicvq4LhWfQmoA5cbHBQtyqIA2hgtc/edit?usp=sharing
>
> There are already some comments on the jira and pr, and I hope to get more 
> thoughts and opinions on it.
>
> thanks,
> Imran
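For context, a rough Python sketch of the mechanism the proposal describes: the executor reads a list of plugin class names from configuration, instantiates each one, and calls init() at startup and shutdown() at exit. The names here (ExecutorPlugin, init, shutdown, the registry) follow the SPIP draft but are assumptions, not the final Spark API; Spark itself would do this on the JVM via reflection.

```python
class ExecutorPlugin:
    """Hypothetical plugin interface: override the hooks you need."""
    def init(self):
        pass

    def shutdown(self):
        pass

class MemoryMonitorPlugin(ExecutorPlugin):
    """Example user plugin (invented): tracks whether it is running."""
    def __init__(self):
        self.running = False

    def init(self):
        self.running = True   # e.g. start a polling thread here

    def shutdown(self):
        self.running = False  # e.g. stop the thread, flush metrics

def load_plugins(conf_value, registry):
    # Instantiate and initialize each configured plugin, as an executor
    # might at startup; `registry` stands in for classloader reflection.
    plugins = []
    for name in filter(None, conf_value.split(",")):
        plugin = registry[name]()
        plugin.init()
        plugins.append(plugin)
    return plugins

registry = {"MemoryMonitorPlugin": MemoryMonitorPlugin}
plugins = load_plugins("MemoryMonitorPlugin", registry)
```

A user would then only need to ship a jar with their plugin class and name it in the executor conf; the hooks give a place to start instrumentation threads without patching Spark.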

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] move away from python doctests

2018-08-29 Thread Imran Rashid
Hi Li,

yes that makes perfect sense.  That more-or-less is the same as my view,
though I framed it differently.  I guess in that case, I'm really asking:

Can pyspark changes please be accompanied by more unit tests, and not
assume we're getting coverage from doctests?

Imran

On Wed, Aug 29, 2018 at 2:02 PM Li Jin  wrote:

> Hi Imran,
>
> My understanding is that doctests and unittests are orthogonal - doctests
> are used to make sure docstring examples are correct and are not meant to
> replace unittests.
> Functionality is covered by unit tests to ensure correctness, and
> doctests are used to test the docstrings, not the functionality itself.
>
> There are issues with doctests; for example, we cannot test Arrow-related
> functions in doctests because pyarrow is an optional dependency, but I
> think that's a separate issue.
>
> Does this make sense?
>
> Li
>
> On Wed, Aug 29, 2018 at 6:35 PM Imran Rashid 
> wrote:
>
>> Hi,
>>
>> I'd like to propose that we move away from such heavy reliance on
>> doctests in Python, and move towards more traditional unit tests.  The
>> main reason is that it's hard to share test code in doctests.  For
>> example, I was just looking at
>>
>> https://github.com/apache/spark/commit/82c18c240a6913a917df3b55cc5e22649561c4dd
>>  and wondering if we had any tests for some of the pyspark changes.
>> SparkSession.createDataFrame has doctests, but those are just run with one
>> standard Spark configuration, which does not enable Arrow.  It's hard to
>> reuse that test with another Spark context and a different conf.
>> Similarly, I've wondered about reusing test cases with local-cluster
>> instead of local mode.  I feel like doctests also discourage writing
>> tests that try to get more exhaustive coverage of corner cases.
>>
>> I'm not saying we should stop using doctests -- I see why they're nice.
>> I just think they should really only be used when you want that code
>> snippet in the doc anyway, so you might as well test it.
>>
>> Admittedly, I'm not really a Python developer, so I could be totally
>> wrong about the right way to author doctests -- pushback welcome!
>>
>> Thoughts?
>>
>> thanks,
>> Imran
>>
>


Re: Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-29 Thread Tomasz Gawęda
Hi,

The tweet linked on the issue suggests some Spark error, but I didn't dig
into it to find the root cause. At the very least, it's quite confusing
behaviour.

Pozdrawiam/Best regards,
Tomek

On 29.08.2018 at 6:44 PM, Nicholas Chammas wrote:
Dunno if I made a silly mistake, but I wanted to bring some attention to this 
issue in case there was something serious going on here that might affect the 
upcoming release.

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25150

Nick




Re: [DISCUSS] move away from python doctests

2018-08-29 Thread Li Jin
Hi Imran,

My understanding is that doctests and unittests are orthogonal - doctests
are used to make sure docstring examples are correct and are not meant to
replace unittests.
Functionality is covered by unit tests to ensure correctness, and
doctests are used to test the docstrings, not the functionality itself.

There are issues with doctests; for example, we cannot test Arrow-related
functions in doctests because pyarrow is an optional dependency, but I
think that's a separate issue.

Does this make sense?

Li

On Wed, Aug 29, 2018 at 6:35 PM Imran Rashid 
wrote:

> Hi,
>
> I'd like to propose that we move away from such heavy reliance on
> doctests in Python, and move towards more traditional unit tests.  The
> main reason is that it's hard to share test code in doctests.  For
> example, I was just looking at
>
> https://github.com/apache/spark/commit/82c18c240a6913a917df3b55cc5e22649561c4dd
>  and wondering if we had any tests for some of the pyspark changes.
> SparkSession.createDataFrame has doctests, but those are just run with one
> standard Spark configuration, which does not enable Arrow.  It's hard to
> reuse that test with another Spark context and a different conf.
> Similarly, I've wondered about reusing test cases with local-cluster
> instead of local mode.  I feel like doctests also discourage writing
> tests that try to get more exhaustive coverage of corner cases.
>
> I'm not saying we should stop using doctests -- I see why they're nice.
> I just think they should really only be used when you want that code
> snippet in the doc anyway, so you might as well test it.
>
> Admittedly, I'm not really a Python developer, so I could be totally
> wrong about the right way to author doctests -- pushback welcome!
>
> Thoughts?
>
> thanks,
> Imran
>
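One concrete advantage of unit tests for the optional-dependency problem Li mentions: a unittest can skip cleanly when pyarrow is absent, which a doctest cannot easily do. A minimal sketch (the test body is hypothetical; a real pyspark test would exercise the actual Arrow conversion path):

```python
import importlib.util
import unittest

# Detect the optional dependency without importing it unconditionally.
have_pyarrow = importlib.util.find_spec("pyarrow") is not None

@unittest.skipIf(not have_pyarrow, "pyarrow is not installed")
class ArrowRelatedTest(unittest.TestCase):
    def test_roundtrip(self):
        # Hypothetical body; only runs when pyarrow is available.
        import pyarrow as pa
        table = pa.table({"x": [1, 2, 3]})
        self.assertEqual(table.num_rows, 3)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArrowRelatedTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With a doctest, the best one can do is guard the example at module level, which is clumsier and hides the skip from the test report.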


[DISCUSS] move away from python doctests

2018-08-29 Thread Imran Rashid
Hi,

I'd like to propose that we move away from such heavy reliance on
doctests in Python, and move towards more traditional unit tests.  The
main reason is that it's hard to share test code in doctests.  For
example, I was just looking at
https://github.com/apache/spark/commit/82c18c240a6913a917df3b55cc5e22649561c4dd
 and wondering if we had any tests for some of the pyspark changes.
SparkSession.createDataFrame has doctests, but those are just run with one
standard Spark configuration, which does not enable Arrow.  It's hard to
reuse that test with another Spark context and a different conf.
Similarly, I've wondered about reusing test cases with local-cluster
instead of local mode.  I feel like doctests also discourage writing
tests that try to get more exhaustive coverage of corner cases.

I'm not saying we should stop using doctests -- I see why they're nice.  I
just think they should really only be used when you want that code snippet
in the doc anyway, so you might as well test it.

Admittedly, I'm not really a Python developer, so I could be totally wrong
about the right way to author doctests -- pushback welcome!

Thoughts?

thanks,
Imran
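To make the trade-off concrete, here is a pure-Python sketch (no Spark; the function and the "configurations" are stand-ins): the doctest pins one example to a single fixed setup, while a unit test can loop the same check over several setups, the way one might want to rerun a createDataFrame test with Arrow enabled or on local-cluster.

```python
import doctest
import unittest

def normalize(values):
    """Scale values so they sum to 1.

    The doctest below documents one happy-path example under one fixed
    setup; it cannot easily be rerun under a different configuration.

    >>> normalize([1, 1, 2])
    [0.25, 0.25, 0.5]
    """
    total = sum(values)
    return [v / total for v in values]

class NormalizeTest(unittest.TestCase):
    # A unit test can reuse the same check across several setups,
    # analogous to rerunning a pyspark test under a different SparkConf.
    def test_across_configs(self):
        for values, expected in [([2, 2], [0.5, 0.5]),
                                 ([1, 3], [0.25, 0.75])]:
            with self.subTest(values=values):
                self.assertEqual(normalize(values), expected)

doc_failures = doctest.testmod().failed  # runs the docstring example
```

Both mechanisms coexist fine; the point of the thread is just which one should carry the coverage burden.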


Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-29 Thread Nicholas Chammas
Dunno if I made a silly mistake, but I wanted to bring some attention to
this issue in case there was something serious going on here that might
affect the upcoming release.

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25150

Nick


Re: [VOTE] SPARK 2.3.2 (RC5)

2018-08-29 Thread chenliang613
Hi

Any new progress? Will RC6 start soon?

Regards
Liang


Saisai Shao wrote
> There's still another one, SPARK-25114.
> 
> I will wait for several days in case some other blockers come up.
> 
> Thanks
> Saisai
> 
> 
> 
> Wenchen Fan wrote on Wed, Aug 15, 2018 at 10:19 AM:
> 
>> SPARK-25051 is resolved; can we start a new RC?
>>
>> SPARK-16406 is an improvement, which we generally should not backport.
>>
>> On Wed, Aug 15, 2018 at 5:16 AM Sean Owen wrote:
>>
>>> (We wouldn't consider the lack of an improvement to block a maintenance
>>> release. It's reasonable to raise this elsewhere as a big nice-to-have
>>> on 2.3.x in general.)
>>>
>>> On Tue, Aug 14, 2018, 4:13 PM antonkulaga wrote:
>>>
 -1 as https://issues.apache.org/jira/browse/SPARK-16406 does not seem
 to be
 back-ported to 2.3.1 and it causes a lot of pain



 --
 Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/








