[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2017-05-26 Thread jaceklaskowski
Github user jaceklaskowski commented on the issue:

https://github.com/apache/spark/pull/15575
  
I'm late with this, but just leaving it for future code reviewers...

I think the change took the most extreme path where even such simple 
`outputPartitioning` as the one in `WindowExec` (see below) is currently part 
of the extension of `UnaryExecNode` while I think it should stay with 
`UnaryExecNode` and be overridden when the output partitioning is different. 
It's a kind of code duplication.

```
  override def outputPartitioning: Partitioning = child.outputPartitioning
```

The above serves nothing but says _"I simply don't care what happens 
upstream"_.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-23 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/15575
  
LGTM - merging to master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-23 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15575
  
yeah, LGTM, it doesn't change current outputPartitioning of operators.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15575
  
LGTM too. Unforunately my internet sucks (on a plane) and I can't merge 
this right now.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15575
  
LGTM, since the scope of this PR is just refactoring. 

Let me first post the existing code for `outputPartitioning` in 
`ExpandExec`:
```Scala
  // The GroupExpressions can output data with arbitrary partitioning, so 
set it
  // as UNKNOWN partitioning
  override def outputPartitioning: Partitioning = UnknownPartitioning(0)
```

It makes sense to set it either `UnknownPartitioning` or 
`child.outputPartitioning`. However, the above code is setting to a wrong 
number of partitions. We need to correct it no matter whether we are using the 
number or not. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15575
  
In practice, setting the `outputPartitioning` of a physical plan like 
`ExpandExec` to `child.outputPartitioning` doesn't cause any real problem, even 
this physical plan doesn't keep the same row distribution of its child. That is 
because if the physical plan changes output, it will have different output 
attributes, e.g., `col` to `col'` as @tejasapatil pointed out.

If its parent plan requires a distribution, says `HashPartition`, this 
distribution will bound to the physical plan's output `col'`, instead of its 
child plan's `col`. So even the physical plan uses `child.outputPartitioning`, 
`EnsureRequirements ` will step in and inject an extra shuffle exchange of 
`HashPartition(col')` to satisfy the requirement.

It works like that as per my understanding. However it doesn't mean the 
physical plan's output partitioning is exactly as same as its child's, i.e., 
`HashPartition(col)`, because it doesn't have the output `col`. This part might 
be confusing to some people, so I think it might be better to explain it more. 
That is what I understood about this, if I am wrong please kindly point out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15575
  
@tejasapatil yeah, that is correct. however, I am wondering if we can say 
this `ExpandExec` have the same distribution of rows as its child...because it 
even doesn't have the `col`...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/15575
  
@viirya : As per my understanding, if the child operator emits `col`, after 
applying `ExpandExec`, the output is `col'`. The original child partitioning 
being over `col`, `ExpandExec` does not seem to alter that. 

The table above was to summarise the state of things before this PR. I did 
not change any semantics in this PR as its pure refactoring.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15575
  
@rxin yeah, I am curious why `ExpandExec` and `GenerateExec` have different 
`outputPartitioning`...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15575
  
@tejasapatil I see there is 1:1 mapping among output partition of child 
operator and output partition of `ExpandExec`. 

For example we have an Expand applying on a data set like col: [1, 2, 3]. 
If the projections are col, col + 1, col + 2.

Assume the partition of the data set is HashPartition(col). We have three 
partitions:
p1: [1]
p2: [2]
p3: [3]

After the Expand, the data set becomes:

p1: [1, 2, 3]
p2: [2, 3, 4]
p3: [3, 4, 5]

Is it still valid for HashPartition(col)? Looks like it doesn't. I think It 
is why there is a comment on ExpandExec in the code position you links to.

BTW, in your table `ExpandExec`'s `outputPartitioning` is 
`UnknownPartitioning`, right? If it doesn't change child's partition, why we 
don't set it to child's outputPartitioning?

  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/15575
  
@viirya

>> However, if its child has certain partition such as HashPartition, after 
ExpandExec it becomes a UnknownPartitioning

The notion of `Partitioning` in Spark is the distribution of rows across 
tasks. Even if the child's output has `HashPartitioning`, there is a 1:1 
mapping among output partition of child operator and output partition of 
`ExpandExec`. So, applying `ExpandExec` does not alter the partitioning of 
child outputs.






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15575
  
The current thing LGTM. 

cc @yhuai do you have any other feedback?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/15575
  
@viirya 

>> In the table in the description, CoalesceExec output UnknownPartitioning

Yes. Since partitions == 1 is a corner case, I did not put that in the 
table. If you look at the code, its doing the right thing: 

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L490


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15575
  
@viirya of course if you say coalesce(1) it is a single partition -- any 
operator that changes partition to 1 partition is single partition.

For Expand isn't it just the same as Generate?




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15575
  
@rxin In the table in the description, `CoalesceExec` output 
`UnknownPartitioning`, actually it can be `SinglePartition` if what you do is 
`coalesce(1)`.

`ExpandExec` doesn't actually move rows across partitions as @tejasapatil 
pointed out. However, if its child has certain partition such as 
`HashPartition`, after `ExpandExec` it becomes a `UnknownPartitioning`. I am 
not sure if it does change the partitioning or not. From the view of output 
partition of the physical plan, it is changed indeed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15575
  
@tejasapatil you were linking to a fb internal branch :)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-22 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15575
  
CoalesceExec is not SinglePartition. Coalesce can lead to an arbitrary 
number of partitions.

ExpandExec doesn't really change the partitioning either.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15575
  
But `ExpandExec`'s `GroupExpressions` could generate data with arbitrary 
partitioning, so even it doesn't move rows across input partitions, doesn't it 
still can change child' output partition?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15575
  
`CoalesceExec` can also have `SinglePartition` as `outputPartitioning` too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/15575
  
@rxin : Yes. There was a comment over there [0] but based on code it does 
not look like its moving rows across input partitions (cc @chenghao-intel who 
added that).

[0] : 
https://github.com/facebook/FB-Spark/blob/d26c42982c18da8fb1b21c9eb75aef9364d1b992/sql/core/src/main/scala/org/apache/spark/sql/execution/Expand.scala#L45


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15575
  
ExpandExec should follow child's output partitioning and ordering, 
shouldn't it?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15575
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15575
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67320/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15575
  
**[Test build #67320 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67320/consoleFull)**
 for PR 15575 at commit 
[`cfcd2a7`](https://github.com/apache/spark/commit/cfcd2a79231ed16c2a3e82551e87a6de888c2af6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15575
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15575
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67319/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15575
  
**[Test build #67319 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67319/consoleFull)**
 for PR 15575 at commit 
[`2eee2ee`](https://github.com/apache/spark/commit/2eee2eea81d665706a76a5614d26b4401a07f127).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15575
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15575
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67318/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15575
  
**[Test build #67318 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67318/consoleFull)**
 for PR 15575 at commit 
[`c5b6626`](https://github.com/apache/spark/commit/c5b6626c144a97054f74adc3158ceea6de3cea58).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/15575
  
@yhuai : please see the table in this PRs description. I have added a 
`comment` (last column) for each entry to point out those cases.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/15575
  
> I felt that there are numerous places where child's output ordering could 
be used but the operators don't set it

Can you list them at here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15575
  
**[Test build #67320 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67320/consoleFull)**
 for PR 15575 at commit 
[`cfcd2a7`](https://github.com/apache/spark/commit/cfcd2a79231ed16c2a3e82551e87a6de888c2af6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/15575
  
Agree with the planner behavior described in the last few comments 
(relevant code : 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L165)

I have updated the description of the PR with a table which shows the 
output partitioning and ordering for each implementation of `UnaryExecNode`. It 
will help with the review in case we are missing something / setting something 
wrong.

`BroadcastExchangeExec `, `CoalesceExec `, `CollectLimitExec `, 
`ExpandExec`, `TakeOrderedAndProjectExec` and `ShuffleExchange` do not use 
child's output partitioning.

I felt that there are numerous places where child's output ordering could 
be used but the operators don't set it (thus using default `Nil`). I won't made 
those modifications in this PR as I want to only do the refac in this diff.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15575
  
**[Test build #67319 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67319/consoleFull)**
 for PR 15575 at commit 
[`2eee2ee`](https://github.com/apache/spark/commit/2eee2eea81d665706a76a5614d26b4401a07f127).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15575
  
Yeah. So I don't see any exception that an unary node, if it is not 
`ShuffleExchange`, can have an `outputPartitioning` other than 
`child.outputPartitioning`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/15575
  
Our planner decides if to add an `ShuffleExchange` by consider 
`outputPartitioning` and `requiredDistribution` together. If the 
`outputPartitioning` of the child does not satisfy the `requiredDistribution` 
of the parent, we will add a `ShuffleExchange` operator.

When we say `child`, it is literally the child of a node. It does not 
matter if the child is a `ShuffleExchange` or not. For a node, when its 
`outputPartitioning` is `child. outputPartitioning`, this node does not shuffle 
data. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15575
  
**[Test build #67311 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67311/consoleFull)**
 for PR 15575 at commit 
[`f9a4420`](https://github.com/apache/spark/commit/f9a442080d73dd42682a20f0a1ee6e5844abb938).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15575
  
The query planner will inject exchange based on the required distribution 
for the child of `UnaryNodeExec`. Looks like all unary nodes have 
`child.outputPartitioning` as its output partition. I can't see an exception 
based on the fact that only `ShuffleExchange` will change `outputPartitioning`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15575
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67311/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15575
  
@yhuai let me be more clear. It is actually fairly confusing to say "this 
only changes if the operator doesn't shuffle data". 

The thing is that we are relying on each operator's output partitioning to 
determine injecting exchanges. So is it safe to depend on "child", which is 
ambiguous in this context? Child could mean the current child before injecting 
exchange, or it could mean the injected exchange operator. Can the query 
planner incorrectly not injecting exchange because a plan node is deemed 
already having the correct output partitioning? Or is this always guaranteed to 
not happen due to certain ordering we enforce in planning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15575
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15575
  
**[Test build #67318 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67318/consoleFull)**
 for PR 15575 at commit 
[`c5b6626`](https://github.com/apache/spark/commit/c5b6626c144a97054f74adc3158ceea6de3cea58).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread yhuai
Github user yhuai commented on the issue:

https://github.com/apache/spark/pull/15575
  
`outputPartitioning` of a node will only be changed if this node shuffles 
data. Right now, only `ShuffleExchange` shuffles data. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15575
  
**[Test build #67311 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67311/consoleFull)**
 for PR 15575 at commit 
[`f9a4420`](https://github.com/apache/spark/commit/f9a442080d73dd42682a20f0a1ee6e5844abb938).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15575
  
**[Test build #67309 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67309/consoleFull)**
 for PR 15575 at commit 
[`85fb132`](https://github.com/apache/spark/commit/85fb132196a700072e5a8e70ebf17ae24d7958e4).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15575
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67309/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15575
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15575
  
**[Test build #67309 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67309/consoleFull)**
 for PR 15575 at commit 
[`85fb132`](https://github.com/apache/spark/commit/85fb132196a700072e5a8e70ebf17ae24d7958e4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...

2016-10-20 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/15575
  
Jenkins test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org