Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread 吴晓菊
And it should be generic for HashJoin, not only broadcast join, right?
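
To sketch what that generalization might look like: a rough illustration
against Spark 2.3-era join internals (the HashJoin trait, streamedPlan, and
the exact join-type cases are recalled from memory and are assumptions to
verify against the source, not a finished patch):

import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.plans.{InnerLike, LeftAnti, LeftOuter, LeftSemi}

// Sketch: placed in the HashJoin trait so both BroadcastHashJoinExec and
// ShuffledHashJoinExec would inherit it. The streamed (big) side is scanned
// in order and each streamed row emits its matches consecutively, so the
// streamed side's ordering survives for join types where the streamed side
// drives the output.
override def outputOrdering: Seq[SortOrder] = joinType match {
  case _: InnerLike => streamedPlan.outputOrdering
  case LeftOuter | LeftSemi | LeftAnti => streamedPlan.outputOrdering
  case _ => Nil // other join types may interleave or append null rows
}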


Chrysan Wu
吴晓菊
Phone:+86 17717640807


2018-06-29 10:42 GMT+08:00 吴晓菊 :

> Sorry for the mistake. You are right that the output ordering of a broadcast
> join can be the order of the big table for some join types. I will prepare a
> PR and let you review it later. Thanks a lot!
>
>
> Chrysan Wu
> 吴晓菊
> Phone:+86 17717640807
>
>
> 2018-06-29 0:00 GMT+08:00 Wenchen Fan :
>
>> SortMergeJoin sorts its children by join key, but broadcast join does
>> not. I think the output ordering of broadcast join has nothing to do with
>> the join keys.
>>
>> On Thu, Jun 28, 2018 at 11:28 PM Marco Gaido 
>> wrote:
>>
>>> I think the outputOrdering would be that of the big table (if any),
>>> and it wouldn't matter whether it involves the join keys or not. Am I wrong?
>>>
>>> 2018-06-28 17:01 GMT+02:00 吴晓菊 :
>>>
 Thanks for the reply.
 By looking into SortMergeJoinExec, I think we can follow what
 SortMergeJoin does: for some join types, if the children are ordered on the
 join keys, we can report the join keys' ordering as the output ordering.


 Chrysan Wu
 吴晓菊
 Phone:+86 17717640807


 2018-06-28 22:53 GMT+08:00 Wenchen Fan :

> SortMergeJoin only reports the ordering of the join keys, not the output
> ordering of any child.
>
> It seems reasonable to me that broadcast join should respect the
> output ordering of the children. Feel free to submit a PR to fix it, 
> thanks!
>
> On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊  wrote:
>
>> Why can't we use the output order of the big table?
>>
>>
>> Chrysan Wu
>> Phone:+86 17717640807
>>
>>
>> 2018-06-28 21:48 GMT+08:00 Marco Gaido :
>>
>>> The easy answer to this is that SortMergeJoin ensures an
>>> outputOrdering, while BroadcastHashJoin doesn't, i.e. after running a
>>> BroadcastHashJoin you don't know what the order of the
>>> output will be, since nothing enforces it.
>>>
>>> Hope this helps.
>>> Thanks.
>>> Marco
>>>
>>> 2018-06-28 15:46 GMT+02:00 吴晓菊 :
>>>

 We see SortMergeJoinExec is implemented with
 outputOrdering, while BroadcastHashJoinExec is
 only implemented with outputPartitioning. Why is it designed this way?

 Chrysan Wu
 Phone:+86 17717640807


>>>
>>

>>>
>


Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread 吴晓菊
Sorry for the mistake. You are right that the output ordering of a broadcast
join can be the order of the big table for some join types. I will prepare a
PR and let you review it later. Thanks a lot!


Chrysan Wu
吴晓菊
Phone:+86 17717640807


2018-06-29 0:00 GMT+08:00 Wenchen Fan :

> SortMergeJoin sorts its children by join key, but broadcast join does not.
> I think the output ordering of broadcast join has nothing to do with the
> join keys.
>
> On Thu, Jun 28, 2018 at 11:28 PM Marco Gaido 
> wrote:
>
>> I think the outputOrdering would be that of the big table (if any), and
>> it wouldn't matter whether it involves the join keys or not. Am I wrong?
>>
>> 2018-06-28 17:01 GMT+02:00 吴晓菊 :
>>
>>> Thanks for the reply.
>>> By looking into SortMergeJoinExec, I think we can follow what
>>> SortMergeJoin does: for some join types, if the children are ordered on the
>>> join keys, we can report the join keys' ordering as the output ordering.
>>>
>>>
>>> Chrysan Wu
>>> 吴晓菊
>>> Phone:+86 17717640807
>>>
>>>
>>> 2018-06-28 22:53 GMT+08:00 Wenchen Fan :
>>>
 SortMergeJoin only reports the ordering of the join keys, not the output
 ordering of any child.

 It seems reasonable to me that broadcast join should respect the output
 ordering of the children. Feel free to submit a PR to fix it, thanks!

 On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊  wrote:

> Why can't we use the output order of the big table?
>
>
> Chrysan Wu
> Phone:+86 17717640807
>
>
> 2018-06-28 21:48 GMT+08:00 Marco Gaido :
>
>> The easy answer to this is that SortMergeJoin ensures an
>> outputOrdering, while BroadcastHashJoin doesn't, i.e. after running a
>> BroadcastHashJoin you don't know what the order of the
>> output will be, since nothing enforces it.
>>
>> Hope this helps.
>> Thanks.
>> Marco
>>
>> 2018-06-28 15:46 GMT+02:00 吴晓菊 :
>>
>>>
>>> We see SortMergeJoinExec is implemented with outputOrdering,
>>> while BroadcastHashJoinExec is only implemented with outputPartitioning.
>>> Why is it designed this way?
>>>
>>> Chrysan Wu
>>> Phone:+86 17717640807
>>>
>>>
>>
>
>>>
>>


Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Marcelo Vanzin
Yep, that's right. There were a bunch of things that were removed from
those scripts that made it tricky to build 2.1 (like Scala 2.10
support). I think it's good to keep the scripts working for older
releases since that allows us to fix things / add features to them
without having to backport to older branches.

On Thu, Jun 28, 2018 at 11:30 AM, Felix Cheung
 wrote:
> If I recall, we stopped releasing Hadoop 2.3 or 2.4 builds in newer releases
> (2.2+?) - that might be why they are not in the release script.
>
>
> 
> From: Marcelo Vanzin 
> Sent: Thursday, June 28, 2018 11:12:45 AM
> To: Sean Owen
> Cc: Marcelo Vanzin; dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> Alright, uploaded the missing packages.
>
> I'll send a PR to update the release scripts just in case...
>
> On Thu, Jun 28, 2018 at 10:08 AM, Sean Owen  wrote:
>> If it's easy enough to produce them, I agree you can just add them to the
>> RC
>> dir.
>>
>> On Thu, Jun 28, 2018 at 11:56 AM Marcelo Vanzin
>>  wrote:
>>>
>>> I just noticed this RC is missing builds for hadoop 2.3 and 2.4, which
>>> existed in the previous version:
>>> https://dist.apache.org/repos/dist/release/spark/spark-2.1.2/
>>>
>>> How important do we think those are? I think I can just build them and
>>> publish them to the RC directory without having to create a new RC.
>>>
>>> On Tue, Jun 26, 2018 at 1:25 PM, Marcelo Vanzin 
>>> wrote:
>>> > Please vote on releasing the following candidate as Apache Spark
>>> > version
>>> > 2.1.3.
>>> >
>>> > The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if
>>> > a
>>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 2.1.3
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>>> > https://github.com/apache/spark/tree/v2.1.3-rc2
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> > https://repository.apache.org/content/repositories/orgapachespark-1275/
>>> >
>>> > The documentation corresponding to this release can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>>> >
>>> > The list of bug fixes going into 2.1.3 can be found at the following
>>> > URL:
>>> > https://issues.apache.org/jira/projects/SPARK/versions/12341660
>>> >
>>> > Notes:
>>> >
>>> > - RC1 was not sent for a vote. I had trouble building it, and by the
>>> > time I got
>>> >   things fixed, there was a blocker bug filed. It was already tagged in
>>> > git
>>> >   at that time.
>>> >
>>> > - If testing the source package, I recommend using Java 8, even though
>>> > 2.1
>>> >   supports Java 7 (and the RC was built with JDK 7). This is because
>>> > Maven
>>> >   Central has updated some configuration that makes the default Java 7
>>> > SSL
>>> >   config not work.
>>> >
>>> > - There are Maven artifacts published for Scala 2.10, but binary
>>> > releases are only
>>> >   available for Scala 2.11. This matches the previous release (2.1.2),
>>> > but if there's
>>> >   a need / desire to have pre-built distributions for Scala 2.10, I can
>>> > probably
>>> >   amend the RC without having to create a new one.
>>> >
>>> > FAQ
>>> >
>>> > =
>>> > How can I help test this release?
>>> > =
>>> >
>>> > If you are a Spark user, you can help us test this release by taking
>>> > an existing Spark workload and running on this release candidate, then
>>> > reporting any regressions.
>>> >
>>> > If you're working in PySpark you can set up a virtual env and install
>>> > the current RC and see if anything important breaks; in Java/Scala
>>> > you can add the staging repository to your project's resolvers and test
>>> > with the RC (make sure to clean up the artifact cache before/after so
>>> > you don't end up building with an out-of-date RC going forward).
>>> >
>>> > ===
>>> > What should happen to JIRA tickets still targeting 2.1.3?
>>> > ===
>>> >
>>> > The current list of open tickets targeted at 2.1.3 can be found at:
>>> > https://s.apache.org/spark-2.1.3
>>> >
>>> > Committers should look at those and triage. Extremely important bug
>>> > fixes, documentation, and API tweaks that impact compatibility should
>>> > be worked on immediately. Everything else please retarget to an
>>> > appropriate release.
>>> >
>>> > ==
>>> > But my bug isn't fixed?
>>> > ==
>>> >
>>> > In order to make timely releases, we will typically not hold the
>>> > release unless the bug in question is a regression from the previous
>>> > release.

Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Felix Cheung
If I recall, we stopped releasing Hadoop 2.3 or 2.4 builds in newer releases
(2.2+?) - that might be why they are not in the release script.



From: Marcelo Vanzin 
Sent: Thursday, June 28, 2018 11:12:45 AM
To: Sean Owen
Cc: Marcelo Vanzin; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Alright, uploaded the missing packages.

I'll send a PR to update the release scripts just in case...

On Thu, Jun 28, 2018 at 10:08 AM, Sean Owen  wrote:
> If it's easy enough to produce them, I agree you can just add them to the RC
> dir.
>
> On Thu, Jun 28, 2018 at 11:56 AM Marcelo Vanzin
>  wrote:
>>
>> I just noticed this RC is missing builds for hadoop 2.3 and 2.4, which
>> existed in the previous version:
>> https://dist.apache.org/repos/dist/release/spark/spark-2.1.2/
>>
>> How important do we think those are? I think I can just build them and
>> publish them to the RC directory without having to create a new RC.
>>
>> On Tue, Jun 26, 2018 at 1:25 PM, Marcelo Vanzin 
>> wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 2.1.3.
>> >
>> > The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if
>> > a
>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.1.3
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> > https://github.com/apache/spark/tree/v2.1.3-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1275/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>> >
>> > The list of bug fixes going into 2.1.3 can be found at the following
>> > URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12341660
>> >
>> > Notes:
>> >
>> > - RC1 was not sent for a vote. I had trouble building it, and by the
>> > time I got
>> >   things fixed, there was a blocker bug filed. It was already tagged in
>> > git
>> >   at that time.
>> >
>> > - If testing the source package, I recommend using Java 8, even though
>> > 2.1
>> >   supports Java 7 (and the RC was built with JDK 7). This is because
>> > Maven
>> >   Central has updated some configuration that makes the default Java 7
>> > SSL
>> >   config not work.
>> >
>> > - There are Maven artifacts published for Scala 2.10, but binary
>> > releases are only
>> >   available for Scala 2.11. This matches the previous release (2.1.2),
>> > but if there's
>> >   a need / desire to have pre-built distributions for Scala 2.10, I can
>> > probably
>> >   amend the RC without having to create a new one.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks; in Java/Scala
>> > you can add the staging repository to your project's resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.1.3?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.1.3 can be found at:
>> > https://s.apache.org/spark-2.1.3
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> >
>> > --
>> > Marcelo
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>



--
Marcelo


Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Marcelo Vanzin
Alright, uploaded the missing packages.

I'll send a PR to update the release scripts just in case...

On Thu, Jun 28, 2018 at 10:08 AM, Sean Owen  wrote:
> If it's easy enough to produce them, I agree you can just add them to the RC
> dir.
>
> On Thu, Jun 28, 2018 at 11:56 AM Marcelo Vanzin
>  wrote:
>>
>> I just noticed this RC is missing builds for hadoop 2.3 and 2.4, which
>> existed in the previous version:
>> https://dist.apache.org/repos/dist/release/spark/spark-2.1.2/
>>
>> How important do we think those are? I think I can just build them and
>> publish them to the RC directory without having to create a new RC.
>>
>> On Tue, Jun 26, 2018 at 1:25 PM, Marcelo Vanzin 
>> wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 2.1.3.
>> >
>> > The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if
>> > a
>> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.1.3
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> > https://github.com/apache/spark/tree/v2.1.3-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1275/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>> >
>> > The list of bug fixes going into 2.1.3 can be found at the following
>> > URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12341660
>> >
>> > Notes:
>> >
>> > - RC1 was not sent for a vote. I had trouble building it, and by the
>> > time I got
>> >   things fixed, there was a blocker bug filed. It was already tagged in
>> > git
>> >   at that time.
>> >
>> > - If testing the source package, I recommend using Java 8, even though
>> > 2.1
>> >   supports Java 7 (and the RC was built with JDK 7). This is because
>> > Maven
>> >   Central has updated some configuration that makes the default Java 7
>> > SSL
>> >   config not work.
>> >
>> > - There are Maven artifacts published for Scala 2.10, but binary
>> > releases are only
>> >   available for Scala 2.11. This matches the previous release (2.1.2),
>> > but if there's
>> >   a need / desire to have pre-built distributions for Scala 2.10, I can
>> > probably
>> >   amend the RC without having to create a new one.
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks; in Java/Scala
>> > you can add the staging repository to your project's resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.1.3?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.1.3 can be found at:
>> > https://s.apache.org/spark-2.1.3
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> >
>> > --
>> > Marcelo
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Spark 2.2.2 (RC2)

2018-06-28 Thread Dongjoon Hyun
+1

Tested on CentOS 7.4 and Oracle JDK 1.8.0_171.

Bests,
Dongjoon.

On Thu, Jun 28, 2018 at 7:24 AM Takeshi Yamamuro 
wrote:

> +1
>
> I ran tests on an EC2 m4.2xlarge instance;
> [ec2-user]$ java -version
> openjdk version "1.8.0_171"
> OpenJDK Runtime Environment (build 1.8.0_171-b10)
> OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)
>
>
>
>
> On Thu, Jun 28, 2018 at 11:38 AM Wenchen Fan  wrote:
>
>> +1
>>
>> On Thu, Jun 28, 2018 at 10:19 AM zhenya Sun  wrote:
>>
>>> +1
>>>
>>> On Jun 28, 2018, at 10:15 AM, Hyukjin Kwon  wrote:
>>>
>>> +1
>>>
>>> On Thu, Jun 28, 2018 at 8:42 AM, Sean Owen  wrote:
>>>
 +1 from me too.

 On Wed, Jun 27, 2018 at 3:31 PM Tom Graves <
 tgraves...@yahoo.com.invalid> wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.2.2.
>
> The vote is open until Mon, July 2nd @ 9PM UTC (2PM PDT) and passes if
> a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.2.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.2-rc2 (commit
> fc28ba3db7185e84b6dbd02ad8ef8f1d06b9e3c6):
> https://github.com/apache/spark/tree/v2.2.2-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.2.2-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1276/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.2.2-rc2-docs/
>
> The list of bug fixes going into 2.2.2 can be found at the following
> URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342171
>
>
> Notes:
>
> - RC1 was not sent for a vote. I had trouble building it, and by the
> time I got
>   things fixed, there was a blocker bug filed. It was already tagged
> in git
>   at that time.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.2.2?
> ===
>
> The current list of open tickets targeted at 2.2.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.2.2
>
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Tom Graves
>

>>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Sean Owen
If it's easy enough to produce them, I agree you can just add them to the
RC dir.

On Thu, Jun 28, 2018 at 11:56 AM Marcelo Vanzin 
wrote:

> I just noticed this RC is missing builds for hadoop 2.3 and 2.4, which
> existed in the previous version:
> https://dist.apache.org/repos/dist/release/spark/spark-2.1.2/
>
> How important do we think those are? I think I can just build them and
> publish them to the RC directory without having to create a new RC.
>
> On Tue, Jun 26, 2018 at 1:25 PM, Marcelo Vanzin 
> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> 2.1.3.
> >
> > The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
> > majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.1.3
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
> > https://github.com/apache/spark/tree/v2.1.3-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1275/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
> >
> > The list of bug fixes going into 2.1.3 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12341660
> >
> > Notes:
> >
> > - RC1 was not sent for a vote. I had trouble building it, and by the
> time I got
> >   things fixed, there was a blocker bug filed. It was already tagged in
> git
> >   at that time.
> >
> > - If testing the source package, I recommend using Java 8, even though
> 2.1
> >   supports Java 7 (and the RC was built with JDK 7). This is because
> Maven
> >   Central has updated some configuration that makes the default Java 7
> SSL
> >   config not work.
> >
> > - There are Maven artifacts published for Scala 2.10, but binary
> > releases are only
> >   available for Scala 2.11. This matches the previous release (2.1.2),
> > but if there's
> >   a need / desire to have pre-built distributions for Scala 2.10, I can
> probably
> >   amend the RC without having to create a new one.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks; in Java/Scala
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.1.3?
> > ===
> >
> > The current list of open tickets targeted at 2.1.3 can be found at:
> > https://s.apache.org/spark-2.1.3
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> >
> > --
> > Marcelo
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Marcelo Vanzin
I just noticed this RC is missing builds for hadoop 2.3 and 2.4, which
existed in the previous version:
https://dist.apache.org/repos/dist/release/spark/spark-2.1.2/

How important do we think those are? I think I can just build them and
publish them to the RC directory without having to create a new RC.

On Tue, Jun 26, 2018 at 1:25 PM, Marcelo Vanzin  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 2.1.3.
>
> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.1.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
> https://github.com/apache/spark/tree/v2.1.3-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1275/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>
> The list of bug fixes going into 2.1.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12341660
>
> Notes:
>
> - RC1 was not sent for a vote. I had trouble building it, and by the time I 
> got
>   things fixed, there was a blocker bug filed. It was already tagged in git
>   at that time.
>
> - If testing the source package, I recommend using Java 8, even though 2.1
>   supports Java 7 (and the RC was built with JDK 7). This is because Maven
>   Central has updated some configuration that makes the default Java 7 SSL
>   config not work.
>
> - There are Maven artifacts published for Scala 2.10, but binary
> releases are only
>   available for Scala 2.11. This matches the previous release (2.1.2),
> but if there's
>   a need / desire to have pre-built distributions for Scala 2.10, I can 
> probably
>   amend the RC without having to create a new one.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
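
For the Java/Scala check described above, a minimal sketch of what that
looks like in an sbt project (the resolver URL and version are the ones
from this RC; the choice of spark-sql as the module is just an example):

// build.sbt -- point a throwaway test project at the RC staging repository
resolvers += "Spark 2.1.3 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1275/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.3" % "provided"

// As the instructions say, clear the Spark artifacts from your local
// ivy/maven caches before and after, so later builds don't reuse the RC.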
>
> ===
> What should happen to JIRA tickets still targeting 2.1.3?
> ===
>
> The current list of open tickets targeted at 2.1.3 can be found at:
> https://s.apache.org/spark-2.1.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Time for 2.3.2?

2018-06-28 Thread Ryan Blue
+1

On Thu, Jun 28, 2018 at 9:34 AM Xiao Li  wrote:

> +1. Thanks, Saisai!
>
> The impact of SPARK-24495 is large. We should release Spark 2.3.2 ASAP.
>
> Thanks,
>
> Xiao
>
> 2018-06-27 23:28 GMT-07:00 Takeshi Yamamuro :
>
>> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>>
>> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
>> wrote:
>>
>>> +1
>>>
>>> On Thu, Jun 28, 2018 at 2:06 PM, Wenchen Fan wrote:
>>>
 Hi Saisai, that's great! please go ahead!

 On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
 wrote:

> +1, as mentioned by Marcelo, these issues seem quite severe.
>
> I can work on the release if we're short of hands :).
>
> Thanks
> Jerry
>
>
> On Thu, Jun 28, 2018 at 11:40 AM, Marcelo Vanzin  wrote:
>
>> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
>> for those out.
>>
>> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>>
>> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > Spark 2.3.1 was released just a while ago, but unfortunately we
>> > discovered and fixed some critical issues afterward.
>> >
>> > SPARK-24495: SortMergeJoin may produce wrong result.
>> > This is a serious correctness bug, and it is easy to hit: a duplicated
>> > join key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a = t2.c`,
>> > and the join is a sort merge join. This bug is only present in Spark 2.3.
>> >
>> > SPARK-24588: stream-stream join may produce wrong result
>> > This is a correctness bug in a new feature of Spark 2.3: the
>> > stream-stream join. Users can hit this bug if one of the join sides is
>> > partitioned by a subset of the join keys.
>> >
>> > SPARK-24552: Task attempt numbers are reused when stages are retried
>> > This is a long-standing bug in the output committer that may introduce
>> > data corruption.
>> >
>> > SPARK-24542: UDFXPath allows users to pass carefully crafted XML to
>> > access arbitrary files
>> > This is a potential security issue if users build an access control
>> > module on top of Spark.
>> >
>> > I think we need a Spark 2.3.2 to address these issues (especially the
>> > correctness bugs) ASAP. Any thoughts?
>> >
>> > Thanks,
>> > Wenchen
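
To make the SPARK-24495 trigger shape described above concrete, a hedged
Scala sketch (it assumes a SparkSession named spark with tables t1 and t2
registered; the names are invented, and this illustrates the reported
pattern rather than a guaranteed reproduction):

// The same left-side key (t1.a) appears in two equi-join conditions, and
// both sides are large enough that the planner picks a sort merge join.
val joined = spark.sql(
  """
    |SELECT *
    |FROM t1 JOIN t2
    |  ON t1.a = t2.b AND t1.a = t2.c
  """.stripMargin)
joined.explain() // expect SortMergeJoin in the physical plan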
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Time for 2.3.2?

2018-06-28 Thread Xiao Li
+1. Thanks, Saisai!

The impact of SPARK-24495 is large. We should release Spark 2.3.2 ASAP.

Thanks,

Xiao

2018-06-27 23:28 GMT-07:00 Takeshi Yamamuro :

> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>
> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
> wrote:
>
>> +1
>>
>> On Thu, Jun 28, 2018 at 2:06 PM, Wenchen Fan wrote:
>>
>>> Hi Saisai, that's great! please go ahead!
>>>
>>> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
>>> wrote:
>>>
 +1, as mentioned by Marcelo, these issues seem quite severe.

 I can work on the release if we're short of hands :).

 Thanks
 Jerry


 On Thu, Jun 28, 2018 at 11:40 AM, Marcelo Vanzin  wrote:

> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
> for those out.
>
> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>
> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
> wrote:
> > Hi all,
> >
> > Spark 2.3.1 was released just a while ago, but unfortunately we
> > discovered and fixed some critical issues afterward.
> >
> > SPARK-24495: SortMergeJoin may produce wrong result.
> > This is a serious correctness bug, and it is easy to hit: a duplicated
> > join key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a = t2.c`,
> > and the join is a sort merge join. This bug is only present in Spark 2.3.
> >
> > SPARK-24588: stream-stream join may produce wrong result
> > This is a correctness bug in a new feature of Spark 2.3: the
> > stream-stream join. Users can hit this bug if one of the join sides is
> > partitioned by a subset of the join keys.
> >
> > SPARK-24552: Task attempt numbers are reused when stages are retried
> > This is a long-standing bug in the output committer that may introduce
> > data corruption.
> >
> > SPARK-24542: UDFXPath allows users to pass carefully crafted XML to
> > access arbitrary files
> > This is a potential security issue if users build an access control
> > module on top of Spark.
> >
> > I think we need a Spark 2.3.2 to address these issues (especially the
> > correctness bugs) ASAP. Any thoughts?
> >
> > Thanks,
> > Wenchen
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Marcelo Vanzin
BTW that would be a great fix in the docs now that we'll have a 2.3.2
being prepared.

On Thu, Jun 28, 2018 at 9:17 AM, Felix Cheung  wrote:
> Exactly...
>
> 
> From: Marcelo Vanzin 
> Sent: Thursday, June 28, 2018 9:16:08 AM
> To: Tom Graves
> Cc: Felix Cheung; dev
>
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> Yeah, we should be more careful with that in general. Like we state
> that "Spark runs on Java 8+"...
>
> On Thu, Jun 28, 2018 at 9:13 AM, Tom Graves  wrote:
>> Right, we say we support R 3.1+ but we never actually did, so I agree it's
>> a bug, but it's not a regression since we never really supported or tested
>> with those versions, and it's not a logic or security bug that ends in
>> corruption or bad behavior, so in my opinion it's not a blocker. Again, I'm
>> fine with adding it if others agree. Maybe we should really change our
>> documentation to state more clearly which versions we know it works with
>> and have tested with, since someone could read R 3.1+ as meaning it works
>> with R 4 (once released), which very well might not be the case.
>>
>>
>> I'm +1 on the release.
>>
>> Tom
>>
>> On Thursday, June 28, 2018, 10:28:21 AM CDT, Felix Cheung
>>  wrote:
>>
>>
>> Not pushing back, but our support message has always been R 3.1+, so it's a
>> bit off to say we don't support newer releases.
>>
>> https://spark.apache.org/docs/2.1.2/
>>
>> But looking back, this was found during 2.1.2 RC2 and wasn't fixed (in time)
>> for 2.1.2?
>>
>>
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555
>>
>> Since it isn’t a regression I’d say +1 from me.
>>
>>
>> 
>> From: Tom Graves 
>> Sent: Thursday, June 28, 2018 6:56:16 AM
>> To: Marcelo Vanzin; Felix Cheung
>> Cc: dev
>> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>>
>> If this is just about supporting newer versions of R that 2.1 never
>> supported, then I would say it's not a blocker. But if you feel it's useful
>> enough, then I would say it's up to Marcelo whether he wants to pull it in
>> and spin another RC.
>>
>> Tom
>>
>> On Wednesday, June 27, 2018, 8:57:25 PM CDT, Felix Cheung
>>  wrote:
>>
>>
>> Yes, this is broken with newer versions of R.
>>
>> We check explicitly for warnings in the R check, which should fail the test
>> run.
>>
>> 
>> From: Marcelo Vanzin 
>> Sent: Wednesday, June 27, 2018 6:55 PM
>> To: Felix Cheung
>> Cc: Marcelo Vanzin; Tom Graves; dev
>> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>>
>> Not sure I understand that bug. Is it a compatibility issue with new
>> versions of R?
>>
>> It's at least marked as fixed in 2.2(.1).
>>
>> We do run jenkins on these branches, but that seems like just a
>> warning, which would not fail those builds...
>>
>> On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung 
>> wrote:
>>> (I don’t want to block the release(s) per se...)
>>>
>>> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>>>
>>> This is fixed in 2.3 back in Nov 2017
>>>
>>>
>>> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>>>
>>> Perhaps we don't get Jenkins run on these branches? It should have been
>>> detected.
>>>
>>> * checking for code/documentation mismatches ... WARNING
>>> Codoc mismatches from documentation object 'attach':
>>> attach
>>> Code: function(what, pos = 2L, name = deparse(substitute(what),
>>> backtick = FALSE), warn.conflicts = TRUE)
>>> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
>>> warn.conflicts = TRUE)
>>> Mismatches in argument default values:
>>> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
>>> deparse(substitute(what))
>>>
>>> Codoc mismatches from documentation object 'glm':
>>> glm
>>> Code: function(formula, family = gaussian, data, weights, subset,
>>> na.action, start = NULL, etastart, mustart, offset,
>>> control = list(...), model = TRUE, method = "glm.fit",
>>> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
>>> NULL, ...)
>>> Docs: function(formula, family = gaussian, data, weights, subset,
>>> na.action, start = NULL, etastart, mustart, offset,
>>> control = list(...), model = TRUE, method = "glm.fit",
>>> x = FALSE, y = TRUE, contrasts = NULL, ...)
>>> Argument names in code not in docs:
>>> singular.ok
>>> Mismatches in argument names:
>>> Position: 16 Code: singular.ok Docs: contrasts
>>> Position: 17 Code: contrasts Docs: ...
>>>
>>> 
>>> From: Sean Owen 
>>> Sent: Wednesday, June 27, 2018 5:02:37 AM
>>> To: Marcelo Vanzin
>>> Cc: dev
>>> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>>>
>>> +1 from me too for the usual reasons.
>>>
>>> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin
>>> 
>>> wrote:

 Please vote on releasing the following candidate as Apache Spark version
 2.1.3.

 The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Felix Cheung
Exactly...


From: Marcelo Vanzin 
Sent: Thursday, June 28, 2018 9:16:08 AM
To: Tom Graves
Cc: Felix Cheung; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Yeah, we should be more careful with that in general. Like we state
that "Spark runs on Java 8+"...

On Thu, Jun 28, 2018 at 9:13 AM, Tom Graves  wrote:
> Right, we say we support R 3.1+ but we never actually did, so I agree it's a
> bug, but it's not a regression since we never really supported or tested with
> those versions, and it's not a logic or security bug that ends in corruption
> or bad behavior, so in my opinion it's not a blocker. Again, I'm fine with
> adding it if others agree. Maybe we should really change our documentation to
> state more clearly which versions we know it works with and have tested with,
> since someone could read R 3.1+ as meaning it works with R 4 (once released),
> which very well might not be the case.
>
>
> I'm +1 on the release.
>
> Tom
>
> On Thursday, June 28, 2018, 10:28:21 AM CDT, Felix Cheung
>  wrote:
>
>
> Not pushing back, but our support message has always been R 3.1+, so it's a
> bit off to say we don't support newer releases.
>
> https://spark.apache.org/docs/2.1.2/
>
> But looking back, this was found during 2.1.2 RC2 and wasn't fixed (in time)
> for 2.1.2?
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555
>
> Since it isn’t a regression I’d say +1 from me.
>
>
> 
> From: Tom Graves 
> Sent: Thursday, June 28, 2018 6:56:16 AM
> To: Marcelo Vanzin; Felix Cheung
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> If this is just about supporting newer versions of R that 2.1 never
> supported, then I would say it's not a blocker. But if you feel it's useful
> enough, then I would say it's up to Marcelo whether he wants to pull it in
> and spin another RC.
>
> Tom
>
> On Wednesday, June 27, 2018, 8:57:25 PM CDT, Felix Cheung
>  wrote:
>
>
> Yes, this is broken with newer versions of R.
>
> We check explicitly for warnings in the R check, which should fail the test
> run.
>
> 
> From: Marcelo Vanzin 
> Sent: Wednesday, June 27, 2018 6:55 PM
> To: Felix Cheung
> Cc: Marcelo Vanzin; Tom Graves; dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> Not sure I understand that bug. Is it a compatibility issue with new
> versions of R?
>
> It's at least marked as fixed in 2.2(.1).
>
> We do run jenkins on these branches, but that seems like just a
> warning, which would not fail those builds...
>
> On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung 
> wrote:
>> (I don’t want to block the release(s) per se...)
>>
>> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>>
>> This is fixed in 2.3 back in Nov 2017
>>
>> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>>
>> Perhaps we don't get Jenkins run on these branches? It should have been
>> detected.
>>
>> * checking for code/documentation mismatches ... WARNING
>> Codoc mismatches from documentation object 'attach':
>> attach
>> Code: function(what, pos = 2L, name = deparse(substitute(what),
>> backtick = FALSE), warn.conflicts = TRUE)
>> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
>> warn.conflicts = TRUE)
>> Mismatches in argument default values:
>> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
>> deparse(substitute(what))
>>
>> Codoc mismatches from documentation object 'glm':
>> glm
>> Code: function(formula, family = gaussian, data, weights, subset,
>> na.action, start = NULL, etastart, mustart, offset,
>> control = list(...), model = TRUE, method = "glm.fit",
>> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
>> NULL, ...)
>> Docs: function(formula, family = gaussian, data, weights, subset,
>> na.action, start = NULL, etastart, mustart, offset,
>> control = list(...), model = TRUE, method = "glm.fit",
>> x = FALSE, y = TRUE, contrasts = NULL, ...)
>> Argument names in code not in docs:
>> singular.ok
>> Mismatches in argument names:
>> Position: 16 Code: singular.ok Docs: contrasts
>> Position: 17 Code: contrasts Docs: ...
>>
>> 
>> From: Sean Owen 
>> Sent: Wednesday, June 27, 2018 5:02:37 AM
>> To: Marcelo Vanzin
>> Cc: dev
>> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>>
>> +1 from me too for the usual reasons.
>>
>> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin
>> 
>> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.1.3.
>>>
>>> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.3
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>>> https://github.com/apache/spark/tree/v2.1.3-rc2

Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Marcelo Vanzin
Yeah, we should be more careful with that in general. Like we state
that "Spark runs on Java 8+"...

On Thu, Jun 28, 2018 at 9:13 AM, Tom Graves  wrote:
> Right, we say we support R 3.1+ but we never actually did, so I agree it's a
> bug, but it's not a regression since we never really supported or tested with
> those versions, and it's not a logic or security bug that ends in corruption
> or bad behavior, so in my opinion it's not a blocker. Again, I'm fine with
> adding it if others agree. Maybe we should really change our documentation to
> state more clearly which versions we know it works with and have tested with,
> since someone could read R 3.1+ as meaning it works with R 4 (once released),
> which very well might not be the case.
>
>
> I'm +1 on the release.
>
> Tom
>
> On Thursday, June 28, 2018, 10:28:21 AM CDT, Felix Cheung
>  wrote:
>
>
> Not pushing back, but our support message has always been R 3.1+, so it's a
> bit off to say we don't support newer releases.
>
> https://spark.apache.org/docs/2.1.2/
>
> But looking back, this was found during 2.1.2 RC2 and wasn't fixed (in time)
> for 2.1.2?
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555
>
> Since it isn’t a regression I’d say +1 from me.
>
>
> 
> From: Tom Graves 
> Sent: Thursday, June 28, 2018 6:56:16 AM
> To: Marcelo Vanzin; Felix Cheung
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> If this is just about supporting newer versions of R that 2.1 never
> supported, then I would say it's not a blocker. But if you feel it's useful
> enough, then I would say it's up to Marcelo whether he wants to pull it in
> and spin another RC.
>
> Tom
>
> On Wednesday, June 27, 2018, 8:57:25 PM CDT, Felix Cheung
>  wrote:
>
>
> Yes, this is broken with newer versions of R.
>
> We check explicitly for warnings in the R check, which should fail the test
> run.
>
> 
> From: Marcelo Vanzin 
> Sent: Wednesday, June 27, 2018 6:55 PM
> To: Felix Cheung
> Cc: Marcelo Vanzin; Tom Graves; dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> Not sure I understand that bug. Is it a compatibility issue with new
> versions of R?
>
> It's at least marked as fixed in 2.2(.1).
>
> We do run jenkins on these branches, but that seems like just a
> warning, which would not fail those builds...
>
> On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung 
> wrote:
>> (I don’t want to block the release(s) per se...)
>>
>> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>>
>> This is fixed in 2.3 back in Nov 2017
>>
>> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>>
>> Perhaps we don't get Jenkins run on these branches? It should have been
>> detected.
>>
>> * checking for code/documentation mismatches ... WARNING
>> Codoc mismatches from documentation object 'attach':
>> attach
>> Code: function(what, pos = 2L, name = deparse(substitute(what),
>> backtick = FALSE), warn.conflicts = TRUE)
>> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
>> warn.conflicts = TRUE)
>> Mismatches in argument default values:
>> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
>> deparse(substitute(what))
>>
>> Codoc mismatches from documentation object 'glm':
>> glm
>> Code: function(formula, family = gaussian, data, weights, subset,
>> na.action, start = NULL, etastart, mustart, offset,
>> control = list(...), model = TRUE, method = "glm.fit",
>> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
>> NULL, ...)
>> Docs: function(formula, family = gaussian, data, weights, subset,
>> na.action, start = NULL, etastart, mustart, offset,
>> control = list(...), model = TRUE, method = "glm.fit",
>> x = FALSE, y = TRUE, contrasts = NULL, ...)
>> Argument names in code not in docs:
>> singular.ok
>> Mismatches in argument names:
>> Position: 16 Code: singular.ok Docs: contrasts
>> Position: 17 Code: contrasts Docs: ...
>>
>> 
>> From: Sean Owen 
>> Sent: Wednesday, June 27, 2018 5:02:37 AM
>> To: Marcelo Vanzin
>> Cc: dev
>> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>>
>> +1 from me too for the usual reasons.
>>
>> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin
>> 
>> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.1.3.
>>>
>>> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.3
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>>> https://github.com/apache/spark/tree/v2.1.3-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS

Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Tom Graves
Right, we say we support R 3.1+ but we never actually did, so I agree it's a
bug, but it's not a regression since we never really supported or tested with
those versions, and it's not a logic or security bug that ends in corruption
or bad behavior, so in my opinion it's not a blocker. Again, I'm fine with
adding it if others agree. Maybe we should really change our documentation to
state more clearly which versions we know it works with and have tested with,
since someone could read R 3.1+ as meaning it works with R 4 (once released),
which very well might not be the case.

I'm +1 on the release.

Tom
On Thursday, June 28, 2018, 10:28:21 AM CDT, Felix Cheung
 wrote:

Not pushing back, but our support message has always been R 3.1+, so it's a
bit off to say we don't support newer releases.

https://spark.apache.org/docs/2.1.2/

But looking back, this was found during 2.1.2 RC2 and wasn't fixed (in time)
for 2.1.2?

http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555

Since it isn't a regression, I'd say +1 from me.

From: Tom Graves 
Sent: Thursday, June 28, 2018 6:56:16 AM
To: Marcelo Vanzin; Felix Cheung
Cc: dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

If this is just about supporting newer versions of R that 2.1 never
supported, then I would say it's not a blocker. But if you feel it's useful
enough, then I would say it's up to Marcelo whether he wants to pull it in
and spin another RC.

Tom
On Wednesday, June 27, 2018, 8:57:25 PM CDT, Felix Cheung 
 wrote:

Yes, this is broken with newer versions of R.

We check explicitly for warnings in the R check, which should fail the test
run.
From: Marcelo Vanzin 
Sent: Wednesday, June 27, 2018 6:55 PM
To: Felix Cheung
Cc: Marcelo Vanzin; Tom Graves; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Not sure I understand that bug. Is it a compatibility issue with new
versions of R?

It's at least marked as fixed in 2.2(.1).

We do run jenkins on these branches, but that seems like just a
warning, which would not fail those builds...

On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung  wrote:
> (I don’t want to block the release(s) per se...)
>
> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>
> This is fixed in 2.3 back in Nov 2017
> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>
> Perhaps we don't get Jenkins run on these branches? It should have been
> detected.
>
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
> Code: function(what, pos = 2L, name = deparse(substitute(what),
> backtick = FALSE), warn.conflicts = TRUE)
> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
> warn.conflicts = TRUE)
> Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
> deparse(substitute(what))
>
> Codoc mismatches from documentation object 'glm':
> glm
> Code: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
> NULL, ...)
> Docs: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, contrasts = NULL, ...)
> Argument names in code not in docs:
> singular.ok
> Mismatches in argument names:
> Position: 16 Code: singular.ok Docs: contrasts
> Position: 17 Code: contrasts Docs: ...
>
> 
> From: Sean Owen 
> Sent: Wednesday, June 27, 2018 5:02:37 AM
> To: Marcelo Vanzin
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> +1 from me too for the usual reasons.
>
> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin 
> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.3.
>>
>> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> https://github.com/apache/spark/tree/v2.1.3-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1275/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>>
>> The list of bug fixes going into 2.1.3 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12341660

Re: Time for 2.3.2?

2018-06-28 Thread Felix Cheung
Yap will do


From: Marcelo Vanzin 
Sent: Thursday, June 28, 2018 9:04:41 AM
To: Felix Cheung
Cc: Spark dev list
Subject: Re: Time for 2.3.2?

Could you mark that bug as blocker and set the target version, in that case?

On Thu, Jun 28, 2018 at 8:46 AM, Felix Cheung
<felixcheun...@hotmail.com> wrote:
+1

I’d like to fix SPARK-24535 first though


From: Stavros Kontopoulos <stavros.kontopou...@lightbend.com>
Sent: Thursday, June 28, 2018 3:50:34 AM
To: Marco Gaido
Cc: Takeshi Yamamuro; Xingbo Jiang; Wenchen Fan; Spark dev list; Saisai Shao; 
van...@cloudera.com.invalid
Subject: Re: Time for 2.3.2?

+1 makes sense.

On Thu, Jun 28, 2018 at 12:07 PM, Marco Gaido <marcogaid...@gmail.com> wrote:
+1 too, I'd also consider including SPARK-24208 if we can solve it in time...

2018-06-28 8:28 GMT+02:00 Takeshi Yamamuro <linguin@gmail.com>:
+1, I heard some Spark users have skipped v2.3.1 because of these bugs.

On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:
+1

On Thu, Jun 28, 2018 at 2:06 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
Hi Saisai, that's great! please go ahead!

On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:
+1, as mentioned by Marcelo, these issues seem quite severe.

I can work on the release if we're short of hands :).

Thanks
Jerry


On Thu, Jun 28, 2018 at 11:40 AM, Marcelo Vanzin  wrote:
+1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
for those out.

(Those are what delayed 2.2.2 and 2.1.3 for those watching...)

On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
> Hi all,
>
> Spark 2.3.1 was released just a while ago, but unfortunately we discovered
> and fixed some critical issues afterward.
>
> SPARK-24495: SortMergeJoin may produce wrong result.
> This is a serious correctness bug, and it is easy to hit: a duplicated join
> key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a = t2.c`, and the
> join is a sort merge join. This bug is only present in Spark 2.3.
>
> SPARK-24588: stream-stream join may produce wrong result
> This is a correctness bug in a new feature of Spark 2.3: the stream-stream
> join. Users can hit this bug if one of the join sides is partitioned by a
> subset of the join keys.
>
> SPARK-24552: Task attempt numbers are reused when stages are retried
> This is a long-standing bug in the output committer that may introduce data
> corruption.
>
> SPARK-24542: UDFXPath allows users to pass carefully crafted XML to
> access arbitrary files
> This is a potential security issue if users build an access control module
> on top of Spark.
>
> I think we need a Spark 2.3.2 to address these issues (especially the
> correctness bugs) ASAP. Any thoughts?
>
> Thanks,
> Wenchen



--
Marcelo

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



--
---
Takeshi Yamamuro




--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p:  +30 6977967274

e: stavros.kontopou...@lightbend.com




--
Marcelo
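
To make the SPARK-24495 shape quoted above concrete, here is a hedged Scala
sketch of the query pattern being described: duplicated keys on the left side
and a condition that compares one left key against two right-side columns of a
sort-merge join. The table names, data, and session config are made up for
illustration; this sets up the shape on a Spark 2.3.0/2.3.1 build but is not a
verified reproduction of the wrong result.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("spark-24495-shape")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1") // disable broadcast so the equi-join plans as sort-merge
  .getOrCreate()
import spark.implicits._

// Duplicated join key (a = 1) in the left table, as the description calls out.
Seq((1, "x"), (1, "y"), (2, "z")).toDF("a", "p").createOrReplaceTempView("t1")
Seq((1, 1, "u"), (2, 2, "v")).toDF("b", "c", "q").createOrReplaceTempView("t2")

// The same left key is compared against two right-side columns, as in the
// quoted example `t1.a = t2.b AND t1.a = t2.c`.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.a = t2.b AND t1.a = t2.c").show()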


Re: Time for 2.3.2?

2018-06-28 Thread Marcelo Vanzin
Could you mark that bug as blocker and set the target version, in that case?

On Thu, Jun 28, 2018 at 8:46 AM, Felix Cheung 
wrote:

> +1
>
> I’d like to fix SPARK-24535 first though
>
> --
> *From:* Stavros Kontopoulos 
> *Sent:* Thursday, June 28, 2018 3:50:34 AM
> *To:* Marco Gaido
> *Cc:* Takeshi Yamamuro; Xingbo Jiang; Wenchen Fan; Spark dev list; Saisai
> Shao; van...@cloudera.com.invalid
> *Subject:* Re: Time for 2.3.2?
>
> +1 makes sense.
>
> On Thu, Jun 28, 2018 at 12:07 PM, Marco Gaido 
> wrote:
>
>> +1 too, I'd also consider including SPARK-24208 if we can solve it in
>> time...
>>
>> 2018-06-28 8:28 GMT+02:00 Takeshi Yamamuro :
>>
>>> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>>>
>>> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
>>> wrote:
>>>
 +1

 Wenchen Fan wrote on Thursday, June 28, 2018 at 2:06 PM:

> Hi Saisai, that's great! please go ahead!
>
> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
> wrote:
>
>> +1, as mentioned by Marcelo, these issues seem quite severe.
>>
>> I can work on the release if we're short of hands :).
>>
>> Thanks
>> Jerry
>>
>>
>> Marcelo Vanzin wrote on Thursday, June 28, 2018 at 11:40 AM:
>>
>>> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
>>> for those out.
>>>
>>> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>>>
>>> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > Spark 2.3.1 was released just a while ago, but unfortunately we
>>> > discovered and fixed some critical issues afterward.
>>> >
>>> > SPARK-24495: SortMergeJoin may produce wrong result.
>>> > This is a serious correctness bug, and it is easy to hit: have
>>> > duplicated join keys in the left table, e.g. `WHERE t1.a = t2.b AND
>>> > t1.a = t2.c`, and the join is a sort merge join. This bug is only
>>> > present in Spark 2.3.
>>> >
>>> > SPARK-24588: stream-stream join may produce wrong result
>>> > This is a correctness bug in a new feature of Spark 2.3: the
>>> > stream-stream join. Users can hit this bug if one side of the join is
>>> > partitioned by a subset of the join keys.
>>> >
>>> > SPARK-24552: Task attempt numbers are reused when stages are retried
>>> > This is a long-standing bug in the output committer that may
>>> > introduce data corruption.
>>> >
>>> > SPARK-24542: UDFXPath allows users to pass carefully crafted
>>> > XML to access arbitrary files
>>> > This is a potential security issue if users build an access control
>>> > module on top of Spark.
>>> >
>>> > I think we need a Spark 2.3.2 to address these issues (especially
>>> > the correctness bugs) ASAP. Any thoughts?
>>> >
>>> > Thanks,
>>> > Wenchen
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> Stavros Kontopoulos
>
> *Senior Software Engineer *
> *Lightbend, Inc. *
>
> *p: +30 6977967274*
> *e: stavros.kontopou...@lightbend.com* 
>
>
>


-- 
Marcelo
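
For the stream-stream join issue (SPARK-24588), the condition described above
is a join where one child is already partitioned by a subset of the join keys.
Below is a hedged Scala sketch of one way to end up in that setup, using the
built-in rate source; the column names are invented, and this is illustrative
only, not a verified reproduction of the wrong result.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("spark-24588-shape").getOrCreate()
import spark.implicits._

val left = spark.readStream.format("rate").load()
  .selectExpr("value AS k1", "value % 10 AS k2", "value AS lv")
val right = spark.readStream.format("rate").load()
  .selectExpr("value AS k1", "value % 10 AS k2", "value AS rv")

// Repartition the left side by k1 only -- a subset of the join keys (k1, k2),
// which is the kind of child partitioning the bug report describes.
val joined = left.repartition($"k1").join(right, Seq("k1", "k2"))

val query = joined.writeStream.format("console").start()
query.awaitTermination()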


Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread Wenchen Fan
SortMergeJoin sorts its children by join key, but broadcast join does not.
I think the output ordering of broadcast join has nothing to do with the
join keys.

On Thu, Jun 28, 2018 at 11:28 PM Marco Gaido  wrote:

> I think the outputOrdering would be the one of the big table (if any) and
> it wouldn't matter if this involves the join keys or not. Am I wrong?
>
> 2018-06-28 17:01 GMT+02:00 吴晓菊 :
>
>> Thanks for the reply.
>> By looking into SortMergeJoinExec, I think we can follow what
>> SortMergeJoin does: for some types of join, if the children are ordered on
>> the join keys, we can report the join keys' ordering as the output ordering.
>>
>>
>> Chrysan Wu
>> 吴晓菊
>> Phone:+86 17717640807
>>
>>
>> 2018-06-28 22:53 GMT+08:00 Wenchen Fan :
>>
>>> SortMergeJoin only reports ordering of the join keys, not the output
>>> ordering of any child.
>>>
>>> It seems reasonable to me that broadcast join should respect the output
>>> ordering of the children. Feel free to submit a PR to fix it, thanks!
>>>
>>> On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊  wrote:
>>>
 Why can't we use the output order of the big table?


 Chrysan Wu
 Phone:+86 17717640807


 2018-06-28 21:48 GMT+08:00 Marco Gaido :

> The easy answer to this is that SortMergeJoin ensures an
> outputOrdering, while BroadcastHashJoin doesn't, i.e., after running a
> BroadcastHashJoin you don't know what the order of the
> output is going to be, since nothing enforces it.
>
> Hope this helps.
> Thanks.
> Marco
>
> 2018-06-28 15:46 GMT+02:00 吴晓菊 :
>
>>
>> We see SortMergeJoinExec is implemented with both outputPartitioning and
>> outputOrdering, while BroadcastHashJoinExec is only implemented with
>> outputPartitioning. Why is it designed this way?
>>
>> Chrysan Wu
>> Phone:+86 17717640807
>>
>>
>

>>
>
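
The behavior discussed in this thread can be seen from the public API alone.
A small runnable Scala sketch (names and data invented for illustration): the
broadcast hash join streams the big side and probes a hash table built from
the broadcast side, so rows come out in the big side's order, yet the plan
reports no outputOrdering, and a later sort on the same column is therefore
not eliminated -- which is what the proposed PR would change.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("bhj-ordering").getOrCreate()
import spark.implicits._

val big = spark.range(0, 1000).toDF("a").sort("a") // big side with a known ordering
val small = spark.range(0, 10).toDF("b")           // small side, broadcast below

val joined = big.join(broadcast(small), $"a" === $"b")
joined.explain(true)           // BroadcastHashJoinExec reports no ordering
joined.sort("a").explain(true) // so this extra Sort node survives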


Re: Time for 2.3.2?

2018-06-28 Thread Felix Cheung
+1

I’d like to fix SPARK-24535 first though


From: Stavros Kontopoulos 
Sent: Thursday, June 28, 2018 3:50:34 AM
To: Marco Gaido
Cc: Takeshi Yamamuro; Xingbo Jiang; Wenchen Fan; Spark dev list; Saisai Shao; 
van...@cloudera.com.invalid
Subject: Re: Time for 2.3.2?

+1 makes sense.

On Thu, Jun 28, 2018 at 12:07 PM, Marco Gaido <marcogaid...@gmail.com> wrote:
+1 too, I'd also consider including SPARK-24208 if we can solve it in time...

2018-06-28 8:28 GMT+02:00 Takeshi Yamamuro <linguin@gmail.com>:
+1, I heard some Spark users have skipped v2.3.1 because of these bugs.

On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:
+1

Wenchen Fan <cloud0...@gmail.com> wrote on Thursday, June 28, 2018 at 2:06 PM:
Hi Saisai, that's great! please go ahead!

On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:
+1, as mentioned by Marcelo, these issues seem quite severe.

I can work on the release if we're short of hands :).

Thanks
Jerry


Marcelo Vanzin wrote on Thursday, June 28, 2018 at 11:40 AM:
+1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
for those out.

(Those are what delayed 2.2.2 and 2.1.3 for those watching...)

On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
> Hi all,
>
> Spark 2.3.1 was released just a while ago, but unfortunately we discovered
> and fixed some critical issues afterward.
>
> SPARK-24495: SortMergeJoin may produce wrong result.
> This is a serious correctness bug, and it is easy to hit: have duplicated
> join keys in the left table, e.g. `WHERE t1.a = t2.b AND t1.a = t2.c`, and
> the join is a sort merge join. This bug is only present in Spark 2.3.
>
> SPARK-24588: stream-stream join may produce wrong result
> This is a correctness bug in a new feature of Spark 2.3: the stream-stream
> join. Users can hit this bug if one side of the join is partitioned by a
> subset of the join keys.
>
> SPARK-24552: Task attempt numbers are reused when stages are retried
> This is a long-standing bug in the output committer that may introduce data
> corruption.
>
> SPARK-24542: UDFXPath allows users to pass carefully crafted XML to
> access arbitrary files
> This is a potential security issue if users build an access control module
> on top of Spark.
>
> I think we need a Spark 2.3.2 to address these issues (especially the
> correctness bugs) ASAP. Any thoughts?
>
> Thanks,
> Wenchen



--
Marcelo

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



--
---
Takeshi Yamamuro




--
Stavros Kontopoulos
Senior Software Engineer
Lightbend, Inc.
p:  +30 6977967274

e: stavros.kontopou...@lightbend.com



Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread Marco Gaido
I think the outputOrdering would be that of the big table (if any), and
it wouldn't matter whether it involves the join keys or not. Am I wrong?

2018-06-28 17:01 GMT+02:00 吴晓菊 :

> Thanks for the reply.
> By looking into SortMergeJoinExec, I think we can follow what
> SortMergeJoin does: for some types of join, if the children are ordered on
> the join keys, we can report the join keys' ordering as the output ordering.
>
>
> Chrysan Wu
> 吴晓菊
> Phone:+86 17717640807
>
>
> 2018-06-28 22:53 GMT+08:00 Wenchen Fan :
>
>> SortMergeJoin only reports ordering of the join keys, not the output
>> ordering of any child.
>>
>> It seems reasonable to me that broadcast join should respect the output
>> ordering of the children. Feel free to submit a PR to fix it, thanks!
>>
>> On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊  wrote:
>>
>>> Why can't we use the output order of the big table?
>>>
>>>
>>> Chrysan Wu
>>> Phone:+86 17717640807
>>>
>>>
>>> 2018-06-28 21:48 GMT+08:00 Marco Gaido :
>>>
 The easy answer to this is that SortMergeJoin ensures an outputOrdering,
 while BroadcastHashJoin doesn't, i.e., after running a BroadcastHashJoin you
 don't know what the order of the output is going to be, since nothing
 enforces it.

 Hope this helps.
 Thanks.
 Marco

 2018-06-28 15:46 GMT+02:00 吴晓菊 :

>
> We see SortMergeJoinExec is implemented with both outputPartitioning and
> outputOrdering, while BroadcastHashJoinExec is only implemented with
> outputPartitioning. Why is it designed this way?
>
> Chrysan Wu
> Phone:+86 17717640807
>
>

>>>
>


Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Felix Cheung
Not pushing back, but our support message has always been R 3.1+, so it's a bit
off to say we don't support newer releases.

https://spark.apache.org/docs/2.1.2/

But looking back, this was found during 2.1.2 RC2 and wasn't fixed (in time)
for 2.1.2?

http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555

Since it isn’t a regression I’d say +1 from me.



From: Tom Graves 
Sent: Thursday, June 28, 2018 6:56:16 AM
To: Marcelo Vanzin; Felix Cheung
Cc: dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

If this is just about supporting newer versions of R that 2.1 never supported, then
I would say it's not a blocker. But if you feel it's useful enough, then I would say
it's up to Marcelo whether he wants to pull it in and spin another RC.

Tom

On Wednesday, June 27, 2018, 8:57:25 PM CDT, Felix Cheung 
 wrote:


Yes, this is broken with newer versions of R.

We check explicitly for warnings in the R check, which should fail the test run.


From: Marcelo Vanzin 
Sent: Wednesday, June 27, 2018 6:55 PM
To: Felix Cheung
Cc: Marcelo Vanzin; Tom Graves; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Not sure I understand that bug. Is it a compatibility issue with new
versions of R?

It's at least marked as fixed in 2.2(.1).

We do run jenkins on these branches, but that seems like just a
warning, which would not fail those builds...

On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung  wrote:
> (I don’t want to block the release(s) per se...)
>
> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>
> This is fixed in 2.3 back in Nov 2017
> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>
> Perhaps we don't get Jenkins run on these branches? It should have been
> detected.
>
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
> Code: function(what, pos = 2L, name = deparse(substitute(what),
> backtick = FALSE), warn.conflicts = TRUE)
> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
> warn.conflicts = TRUE)
> Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
> deparse(substitute(what))
>
> Codoc mismatches from documentation object 'glm':
> glm
> Code: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
> NULL, ...)
> Docs: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, contrasts = NULL, ...)
> Argument names in code not in docs:
> singular.ok
> Mismatches in argument names:
> Position: 16 Code: singular.ok Docs: contrasts
> Position: 17 Code: contrasts Docs: ...
>
> 
> From: Sean Owen 
> Sent: Wednesday, June 27, 2018 5:02:37 AM
> To: Marcelo Vanzin
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> +1 from me too for the usual reasons.
>
> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin 
> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.3.
>>
>> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> https://github.com/apache/spark/tree/v2.1.3-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1275/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>>
>> The list of bug fixes going into 2.1.3 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12341660
>>
>> Notes:
>>
>> - RC1 was not sent for a vote. I had trouble building it, and by the time
>> I got
>> things fixed, there was a blocker bug filed. It was already tagged in
>> git
>> at that time.
>>
>> - If testing the source package, I recommend using Java 8, even though 2.1
>> supports Java 7 (and the RC was built with JDK 7). This is because Maven
>> Central has updated some configuration that makes the default Java 7 SSL
>> config not work.
>>
>> - There are Maven artifacts published for Scala 2.10, but binary releases
>> are only available for Scala 2.11. This matches the previous release
>> (2.1.2), but if there's a need / desire to have pre-built distributions
>> for Scala 2.10, I can probably amend the RC without having to create a
>> new one.

Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread 吴晓菊
Thanks for the reply.
By looking into SortMergeJoinExec, I think we can follow what
SortMergeJoin does: for some types of join, if the children are ordered on
the join keys, we can report the join keys' ordering as the output ordering.


Chrysan Wu
吴晓菊
Phone:+86 17717640807


2018-06-28 22:53 GMT+08:00 Wenchen Fan :

> SortMergeJoin only reports ordering of the join keys, not the output
> ordering of any child.
>
> It seems reasonable to me that broadcast join should respect the output
> ordering of the children. Feel free to submit a PR to fix it, thanks!
>
> On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊  wrote:
>
>> Why can't we use the output order of the big table?
>>
>>
>> Chrysan Wu
>> Phone:+86 17717640807
>>
>>
>> 2018-06-28 21:48 GMT+08:00 Marco Gaido :
>>
>>> The easy answer to this is that SortMergeJoin ensures an outputOrdering,
>>> while BroadcastHashJoin doesn't, i.e., after running a BroadcastHashJoin you
>>> don't know what the order of the output is going to be, since nothing
>>> enforces it.
>>>
>>> Hope this helps.
>>> Thanks.
>>> Marco
>>>
>>> 2018-06-28 15:46 GMT+02:00 吴晓菊 :
>>>

 We see SortMergeJoinExec is implemented with both outputPartitioning and
 outputOrdering, while BroadcastHashJoinExec is only implemented with
 outputPartitioning. Why is it designed this way?

 Chrysan Wu
 Phone:+86 17717640807


>>>
>>


Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread Wenchen Fan
SortMergeJoin only reports ordering of the join keys, not the output
ordering of any child.

It seems reasonable to me that broadcast join should respect the output
ordering of the children. Feel free to submit a PR to fix it, thanks!

On Thu, Jun 28, 2018 at 10:07 PM 吴晓菊  wrote:

> Why can't we use the output order of the big table?
>
>
> Chrysan Wu
> Phone:+86 17717640807
>
>
> 2018-06-28 21:48 GMT+08:00 Marco Gaido :
>
>> The easy answer to this is that SortMergeJoin ensures an outputOrdering,
>> while BroadcastHashJoin doesn't, i.e., after running a BroadcastHashJoin you
>> don't know what the order of the output is going to be, since nothing
>> enforces it.
>>
>> Hope this helps.
>> Thanks.
>> Marco
>>
>> 2018-06-28 15:46 GMT+02:00 吴晓菊 :
>>
>>>
>>> We see SortMergeJoinExec is implemented with both outputPartitioning and
>>> outputOrdering, while BroadcastHashJoinExec is only implemented with
>>> outputPartitioning. Why is it designed this way?
>>>
>>> Chrysan Wu
>>> Phone:+86 17717640807
>>>
>>>
>>
>


Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread 吴晓菊
Why can't we use the output order of the big table?


Chrysan Wu
Phone:+86 17717640807


2018-06-28 21:48 GMT+08:00 Marco Gaido :

> The easy answer to this is that SortMergeJoin ensures an outputOrdering,
> while BroadcastHashJoin doesn't, i.e., after running a BroadcastHashJoin you
> don't know what the order of the output is going to be, since nothing
> enforces it.
>
> Hope this helps.
> Thanks.
> Marco
>
> 2018-06-28 15:46 GMT+02:00 吴晓菊 :
>
>>
>> We see SortMergeJoinExec is implemented with both outputPartitioning and
>> outputOrdering, while BroadcastHashJoinExec is only implemented with
>> outputPartitioning. Why is it designed this way?
>>
>> Chrysan Wu
>> Phone:+86 17717640807
>>
>>
>


Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Tom Graves
 If this is just about supporting newer versions of R that 2.1 never supported, then
I would say it's not a blocker. But if you feel it's useful enough, then I would say
it's up to Marcelo whether he wants to pull it in and spin another RC.
Tom 
On Wednesday, June 27, 2018, 8:57:25 PM CDT, Felix Cheung 
 wrote:  
 
Yes, this is broken with newer versions of R.
We check explicitly for warnings in the R check, which should fail the test run.

From: Marcelo Vanzin 
Sent: Wednesday, June 27, 2018 6:55 PM
To: Felix Cheung
Cc: Marcelo Vanzin; Tom Graves; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Not sure I understand that bug. Is it a compatibility issue with new
versions of R?

It's at least marked as fixed in 2.2(.1).

We do run jenkins on these branches, but that seems like just a
warning, which would not fail those builds...

On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung  wrote:
> (I don’t want to block the release(s) per se...)
>
> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>
> This is fixed in 2.3 back in Nov 2017
> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>
> Perhaps we don't get Jenkins run on these branches? It should have been
> detected.
>
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
> Code: function(what, pos = 2L, name = deparse(substitute(what),
> backtick = FALSE), warn.conflicts = TRUE)
> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
> warn.conflicts = TRUE)
> Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
> deparse(substitute(what))
>
> Codoc mismatches from documentation object 'glm':
> glm
> Code: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
> NULL, ...)
> Docs: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, contrasts = NULL, ...)
> Argument names in code not in docs:
> singular.ok
> Mismatches in argument names:
> Position: 16 Code: singular.ok Docs: contrasts
> Position: 17 Code: contrasts Docs: ...
>
> 
> From: Sean Owen 
> Sent: Wednesday, June 27, 2018 5:02:37 AM
> To: Marcelo Vanzin
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> +1 from me too for the usual reasons.
>
> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin 
> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.3.
>>
>> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> https://github.com/apache/spark/tree/v2.1.3-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1275/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>>
>> The list of bug fixes going into 2.1.3 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12341660
>>
>> Notes:
>>
>> - RC1 was not sent for a vote. I had trouble building it, and by the time
>> I got
>> things fixed, there was a blocker bug filed. It was already tagged in
>> git
>> at that time.
>>
>> - If testing the source package, I recommend using Java 8, even though 2.1
>> supports Java 7 (and the RC was built with JDK 7). This is because Maven
>> Central has updated some configuration that makes the default Java 7 SSL
>> config not work.
>>
>> - There are Maven artifacts published for Scala 2.10, but binary
>> releases are only
>> available for Scala 2.11. This matches the previous release (2.1.2),
>> but if there's
>> a need / desire to have pre-built distributions for Scala 2.10, I can
>> probably
>> amend the RC without having to create a new one.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you 

Re: why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread Marco Gaido
The easy answer to this is that SortMergeJoin ensures an outputOrdering,
while BroadcastHashJoin doesn't, i.e., after running a BroadcastHashJoin you
don't know what the order of the output is going to be, since nothing
enforces it.

Hope this helps.
Thanks.
Marco

2018-06-28 15:46 GMT+02:00 吴晓菊 :

>
> We see SortMergeJoinExec is implemented with both outputPartitioning and
> outputOrdering, while BroadcastHashJoinExec is only implemented with
> outputPartitioning. Why is it designed this way?
>
> Chrysan Wu
> Phone:+86 17717640807
>
>


why BroadcastHashJoinExec is not implemented with outputOrdering?

2018-06-28 Thread 吴晓菊
We see SortMergeJoinExec is implemented with both outputPartitioning and
outputOrdering, while BroadcastHashJoinExec is only implemented with
outputPartitioning. Why is it designed this way?

Chrysan Wu
Phone:+86 17717640807
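
As a sketch of the fix being discussed in this thread, the ordering such a
join could report can be computed from the join type and the streamed
(non-broadcast) child. This is NOT the actual patch -- `streamedPlan` is
assumed to be the big side, as in Spark's internal HashJoin trait -- just the
shape the override could take:

import org.apache.spark.sql.catalyst.expressions.SortOrder
import org.apache.spark.sql.catalyst.plans.{InnerLike, JoinType, LeftAnti, LeftOuter, LeftSemi}
import org.apache.spark.sql.execution.SparkPlan

def broadcastJoinOutputOrdering(joinType: JoinType, streamedPlan: SparkPlan): Seq[SortOrder] =
  joinType match {
    // These join types emit streamed-side rows in probe order, so the
    // streamed child's ordering survives the join.
    case (_: InnerLike) | LeftOuter | LeftSemi | LeftAnti => streamedPlan.outputOrdering
    // For anything else, promise nothing.
    case _ => Nil
  }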


Re: Support SqlStreaming in spark

2018-06-28 Thread JackyLee
Spark JIRA:
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-24630

Benefits:

Firstly, users who are unfamiliar with streaming can easily use SQL to run
StructStreaming, especially when migrating offline tasks to real-time
processing tasks.
Secondly, supporting a SQL API in StructStreaming also makes it possible to
combine StructStreaming with Hive. Users can store the source/sink metadata
in a table and use the Hive metastore to manage it. Users who want to read
this data can easily create a stream by accessing the table, which can
greatly reduce the development and maintenance costs of StructStreaming.
Finally, it makes it easy to achieve unified management and access control
of sources and sinks, and gives more control over the management of private
data, especially in financial or security areas.
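
For context, here is roughly the pipeline the proposal wants to express in
pure SQL; today the source and sink setup is programmatic, and only the query
itself can be SQL, via a temp view. A hedged Scala sketch -- assumptions for
illustration only: Spark 2.3+, the spark-sql-kafka-0-10 source on the
classpath, and a Kafka broker at localhost:9092 with a topic named "events".

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sql-on-streams").getOrCreate()

// Source definition still requires the programmatic API...
val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// ...but once registered as a view, the query itself is plain SQL.
source.selectExpr("CAST(value AS STRING) AS value").createOrReplaceTempView("events")
val counts = spark.sql("SELECT value, count(*) AS cnt FROM events GROUP BY value")

// Sink definition is programmatic as well -- the part SqlStreaming would move into SQL.
val query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()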



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Time for 2.3.2?

2018-06-28 Thread Stavros Kontopoulos
+1 makes sense.

On Thu, Jun 28, 2018 at 12:07 PM, Marco Gaido 
wrote:

> +1 too, I'd also consider including SPARK-24208 if we can solve it in
> time...
>
> 2018-06-28 8:28 GMT+02:00 Takeshi Yamamuro :
>
>> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>>
>> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
>> wrote:
>>
>>> +1
>>>
>>> Wenchen Fan wrote on Thursday, June 28, 2018 at 2:06 PM:
>>>
 Hi Saisai, that's great! please go ahead!

 On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
 wrote:

> +1, as mentioned by Marcelo, these issues seem quite severe.
>
> I can work on the release if we're short of hands :).
>
> Thanks
> Jerry
>
>
> Marcelo Vanzin wrote on Thursday, June 28, 2018 at 11:40 AM:
>
>> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
>> for those out.
>>
>> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>>
>> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > Spark 2.3.1 was released just a while ago, but unfortunately we
>> > discovered and fixed some critical issues afterward.
>> >
>> > SPARK-24495: SortMergeJoin may produce wrong result.
>> > This is a serious correctness bug, and it is easy to hit: have
>> > duplicated join keys in the left table, e.g. `WHERE t1.a = t2.b AND
>> > t1.a = t2.c`, and the join is a sort merge join. This bug is only
>> > present in Spark 2.3.
>> >
>> > SPARK-24588: stream-stream join may produce wrong result
>> > This is a correctness bug in a new feature of Spark 2.3: the
>> > stream-stream join. Users can hit this bug if one side of the join is
>> > partitioned by a subset of the join keys.
>> >
>> > SPARK-24552: Task attempt numbers are reused when stages are retried
>> > This is a long-standing bug in the output committer that may
>> > introduce data corruption.
>> >
>> > SPARK-24542: UDFXPath allows users to pass carefully crafted XML
>> > to access arbitrary files
>> > This is a potential security issue if users build an access control
>> > module on top of Spark.
>> >
>> > I think we need a Spark 2.3.2 to address these issues (especially the
>> > correctness bugs) ASAP. Any thoughts?
>> >
>> > Thanks,
>> > Wenchen
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


-- 
Stavros Kontopoulos

*Senior Software Engineer*
*Lightbend, Inc.*

*p: +30 6977967274*
*e: stavros.kontopou...@lightbend.com* 


Re: Time for 2.3.2?

2018-06-28 Thread Marco Gaido
+1 too, I'd also consider including SPARK-24208 if we can solve it in
time...

2018-06-28 8:28 GMT+02:00 Takeshi Yamamuro :

> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>
> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
> wrote:
>
>> +1
>>
>> Wenchen Fan wrote on Thursday, June 28, 2018 at 2:06 PM:
>>
>>> Hi Saisai, that's great! please go ahead!
>>>
>>> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
>>> wrote:
>>>
 +1, as mentioned by Marcelo, these issues seem quite severe.

 I can work on the release if we're short of hands :).

 Thanks
 Jerry


 Marcelo Vanzin wrote on Thursday, June 28, 2018 at 11:40 AM:

> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
> for those out.
>
> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>
> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
> wrote:
> > Hi all,
> >
> > Spark 2.3.1 was released just a while ago, but unfortunately we
> > discovered and fixed some critical issues afterward.
> >
> > SPARK-24495: SortMergeJoin may produce wrong result.
> > This is a serious correctness bug, and it is easy to hit: have
> > duplicated join keys in the left table, e.g. `WHERE t1.a = t2.b AND
> > t1.a = t2.c`, and the join is a sort merge join. This bug is only
> > present in Spark 2.3.
> >
> > SPARK-24588: stream-stream join may produce wrong result
> > This is a correctness bug in a new feature of Spark 2.3: the
> > stream-stream join. Users can hit this bug if one side of the join is
> > partitioned by a subset of the join keys.
> >
> > SPARK-24552: Task attempt numbers are reused when stages are retried
> > This is a long-standing bug in the output committer that may
> > introduce data corruption.
> >
> > SPARK-24542: UDFXPath allows users to pass carefully crafted XML
> > to access arbitrary files
> > This is a potential security issue if users build an access control
> > module on top of Spark.
> >
> > I think we need a Spark 2.3.2 to address these issues (especially the
> > correctness bugs) ASAP. Any thoughts?
> >
> > Thanks,
> > Wenchen
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Time for 2.3.2?

2018-06-28 Thread Takeshi Yamamuro
+1, I heard some Spark users have skipped v2.3.1 because of these bugs.

On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang  wrote:

> +1
>
> Wenchen Fan wrote on Thursday, June 28, 2018 at 2:06 PM:
>
>> Hi Saisai, that's great! please go ahead!
>>
>> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
>> wrote:
>>
>>> +1, as mentioned by Marcelo, these issues seem quite severe.
>>>
>>> I can work on the release if we're short of hands :).
>>>
>>> Thanks
>>> Jerry
>>>
>>>
>>> Marcelo Vanzin wrote on Thursday, June 28, 2018 at 11:40 AM:
>>>
 +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
 for those out.

 (Those are what delayed 2.2.2 and 2.1.3 for those watching...)

 On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
 wrote:
 > Hi all,
 >
 > Spark 2.3.1 was released just a while ago, but unfortunately we
 > discovered and fixed some critical issues afterward.
 >
 > SPARK-24495: SortMergeJoin may produce wrong result.
 > This is a serious correctness bug, and it is easy to hit: have
 > duplicated join keys in the left table, e.g. `WHERE t1.a = t2.b AND
 > t1.a = t2.c`, and the join is a sort merge join. This bug is only
 > present in Spark 2.3.
 >
 > SPARK-24588: stream-stream join may produce wrong result
 > This is a correctness bug in a new feature of Spark 2.3: the
 > stream-stream join. Users can hit this bug if one side of the join is
 > partitioned by a subset of the join keys.
 >
 > SPARK-24552: Task attempt numbers are reused when stages are retried
 > This is a long-standing bug in the output committer that may
 > introduce data corruption.
 >
 > SPARK-24542: UDFXPath allows users to pass carefully crafted XML to
 > access arbitrary files
 > This is a potential security issue if users build an access control
 > module on top of Spark.
 >
 > I think we need a Spark 2.3.2 to address these issues (especially the
 > correctness bugs) ASAP. Any thoughts?
 >
 > Thanks,
 > Wenchen



 --
 Marcelo

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



-- 
---
Takeshi Yamamuro


Re: Time for 2.3.2?

2018-06-28 Thread Xingbo Jiang
+1

Wenchen Fan wrote on Thursday, June 28, 2018 at 2:06 PM:

> Hi Saisai, that's great! please go ahead!
>
> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
> wrote:
>
>> +1, as mentioned by Marcelo, these issues seem quite severe.
>>
>> I can work on the release if we're short of hands :).
>>
>> Thanks
>> Jerry
>>
>>
>> Marcelo Vanzin wrote on Thursday, June 28, 2018 at 11:40 AM:
>>
>>> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
>>> for those out.
>>>
>>> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>>>
>>> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > Spark 2.3.1 was released just a while ago, but unfortunately we
>>> > discovered and fixed some critical issues afterward.
>>> >
>>> > SPARK-24495: SortMergeJoin may produce wrong result.
>>> > This is a serious correctness bug, and it is easy to hit: have
>>> > duplicated join keys in the left table, e.g. `WHERE t1.a = t2.b AND
>>> > t1.a = t2.c`, and the join is a sort merge join. This bug is only
>>> > present in Spark 2.3.
>>> >
>>> > SPARK-24588: stream-stream join may produce wrong result
>>> > This is a correctness bug in a new feature of Spark 2.3: the
>>> > stream-stream join. Users can hit this bug if one side of the join is
>>> > partitioned by a subset of the join keys.
>>> >
>>> > SPARK-24552: Task attempt numbers are reused when stages are retried
>>> > This is a long-standing bug in the output committer that may
>>> > introduce data corruption.
>>> >
>>> > SPARK-24542: UDFXPath allows users to pass carefully crafted XML to
>>> > access arbitrary files
>>> > This is a potential security issue if users build an access control
>>> > module on top of Spark.
>>> >
>>> > I think we need a Spark 2.3.2 to address these issues (especially the
>>> > correctness bugs) ASAP. Any thoughts?
>>> >
>>> > Thanks,
>>> > Wenchen
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: Time for 2.3.2?

2018-06-28 Thread Wenchen Fan
Hi Saisai, that's great! please go ahead!

On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao  wrote:

> +1, as mentioned by Marcelo, these issues seem quite severe.
>
> I can work on the release if we're short of hands :).
>
> Thanks
> Jerry
>
>
> Marcelo Vanzin wrote on Thursday, June 28, 2018 at 11:40 AM:
>
>> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
>> for those out.
>>
>> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>>
>> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan  wrote:
>> > Hi all,
>> >
>> > Spark 2.3.1 was released just a while ago, but unfortunately we
>> > discovered and fixed some critical issues afterward.
>> >
>> > SPARK-24495: SortMergeJoin may produce wrong result.
>> > This is a serious correctness bug, and it is easy to hit: have
>> > duplicated join keys in the left table, e.g. `WHERE t1.a = t2.b AND
>> > t1.a = t2.c`, and the join is a sort merge join. This bug is only
>> > present in Spark 2.3.
>> >
>> > SPARK-24588: stream-stream join may produce wrong result
>> > This is a correctness bug in a new feature of Spark 2.3: the
>> > stream-stream join. Users can hit this bug if one side of the join is
>> > partitioned by a subset of the join keys.
>> >
>> > SPARK-24552: Task attempt numbers are reused when stages are retried
>> > This is a long-standing bug in the output committer that may
>> > introduce data corruption.
>> >
>> > SPARK-24542: UDFXPath allows users to pass carefully crafted XML to
>> > access arbitrary files
>> > This is a potential security issue if users build an access control
>> > module on top of Spark.
>> >
>> > I think we need a Spark 2.3.2 to address these issues (especially the
>> > correctness bugs) ASAP. Any thoughts?
>> >
>> > Thanks,
>> > Wenchen
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>