[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2017-05-26 Thread jaceklaskowski
Github user jaceklaskowski commented on the issue: https://github.com/apache/spark/pull/15575 I'm late with this, but just leaving it for future code reviewers... I think the change took the most extreme path where even such simple `outputPartitioning` as the one in

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-23 Thread hvanhovell
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15575 LGTM - merging to master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-23 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 yeah, LGTM, it doesn't change current outputPartitioning of operators. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-23 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 LGTM too. Unforunately my internet sucks (on a plane) and I can't merge this right now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15575 LGTM, since the scope of this PR is just refactoring. Let me first post the existing code for `outputPartitioning` in `ExpandExec`: ```Scala // The GroupExpressions can output

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 In practice, setting the `outputPartitioning` of a physical plan like `ExpandExec` to `child.outputPartitioning` doesn't cause any real problem, even this physical plan doesn't keep the same row

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @tejasapatil yeah, that is correct. however, I am wondering if we can say this `ExpandExec` have the same distribution of rows as its child...because it even doesn't have the `col`... --- If your

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread tejasapatil
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @viirya : As per my understanding, if the child operator emits `col`, after applying `ExpandExec`, the output is `col'`. The original child partitioning being over `col`, `ExpandExec` does not

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @rxin yeah, I am curious why `ExpandExec` and `GenerateExec` have different `outputPartitioning`... --- If your project is set up for it, you can reply to this email and have your reply appear on

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @tejasapatil I see there is 1:1 mapping among output partition of child operator and output partition of `ExpandExec`. For example we have an Expand applying on a data set like col: [1, 2,

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread tejasapatil
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @viirya >> However, if its child has certain partition such as HashPartition, after ExpandExec it becomes a UnknownPartitioning The notion of `Partitioning` in Spark is the

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 The current thing LGTM. cc @yhuai do you have any other feedback? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread tejasapatil
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @viirya >> In the table in the description, CoalesceExec output UnknownPartitioning Yes. Since partitions == 1 is a corner case, I did not put that in the table. If you look

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 @viirya of course if you say coalesce(1) it is a single partition -- any operator that changes partition to 1 partition is single partition. For Expand isn't it just the same as Generate?

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @rxin In the table in the description, `CoalesceExec` output `UnknownPartitioning`, actually it can be `SinglePartition` if what you do is `coalesce(1)`. `ExpandExec` doesn't actually move

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 @tejasapatil you were linking to a fb internal branch :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 CoalesceExec is not SinglePartition. Coalesce can lead to an arbitrary number of partitions. ExpandExec doesn't really change the partitioning either. --- If your project is set up for it,

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 But `ExpandExec`'s `GroupExpressions` could generate data with arbitrary partitioning, so even it doesn't move rows across input partitions, doesn't it still can change child' output partition?

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 `CoalesceExec` can also have `SinglePartition` as `outputPartitioning` too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread tejasapatil
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @rxin : Yes. There was a comment over there [0] but based on code it does not look like its moving rows across input partitions (cc @chenghao-intel who added that). [0] :

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 ExpandExec should follow child's output partitioning and ordering, shouldn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67320/ Test PASSed. ---

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67320/consoleFull)** for PR 15575 at commit

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67319/ Test PASSed. ---

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67319/consoleFull)** for PR 15575 at commit

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67318/ Test PASSed. ---

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67318 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67318/consoleFull)** for PR 15575 at commit

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread tejasapatil
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @yhuai : please see the table in this PRs description. I have added a `comment` (last column) for each entry to point out those cases. --- If your project is set up for it, you can reply to

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread yhuai
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15575 > I felt that there are numerous places where child's output ordering could be used but the operators don't set it Can you list them at here? --- If your project is set up for it, you can

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67320/consoleFull)** for PR 15575 at commit

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread tejasapatil
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 Agree with the planner behavior described in the last few comments (relevant code :

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67319/consoleFull)** for PR 15575 at commit

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 Yeah. So I don't see any exception that an unary node, if it is not `ShuffleExchange`, can have an `outputPartitioning` other than `child.outputPartitioning`. --- If your project is set up for it,

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread yhuai
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15575 Our planner decides if to add an `ShuffleExchange` by consider `outputPartitioning` and `requiredDistribution` together. If the `outputPartitioning` of the child does not satisfy the

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67311 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67311/consoleFull)** for PR 15575 at commit

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 The query planner will inject exchange based on the required distribution for the child of `UnaryNodeExec`. Looks like all unary nodes have `child.outputPartitioning` as its output partition. I

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67311/ Test PASSed. ---

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 @yhuai let me be more clear. It is actually fairly confusing to say "this only changes if the operator doesn't shuffle data". The thing is that we are relying on each operator's output

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67318 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67318/consoleFull)** for PR 15575 at commit

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread yhuai
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15575 `outputPartitioning` of a node will only be changed if this node shuffles data. Right now, only `ShuffleExchange` shuffles data. --- If your project is set up for it, you can reply to this email

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67311/consoleFull)** for PR 15575 at commit

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67309/consoleFull)** for PR 15575 at commit

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67309/ Test FAILed. ---

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67309/consoleFull)** for PR 15575 at commit

[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread tejasapatil
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 Jenkins test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and