Github user jaceklaskowski commented on the issue:
https://github.com/apache/spark/pull/15575
I'm late with this, but just leaving it for future code reviewers...
I think the change took the most extreme path where even such simple
`outputPartitioning` as the one in
Github user hvanhovell commented on the issue:
https://github.com/apache/spark/pull/15575
LGTM - merging to master. Thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/15575
yeah, LGTM, it doesn't change current outputPartitioning of operators.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15575
LGTM too. Unforunately my internet sucks (on a plane) and I can't merge
this right now.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/15575
LGTM, since the scope of this PR is just refactoring.
Let me first post the existing code for `outputPartitioning` in
`ExpandExec`:
```Scala
// The GroupExpressions can output
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/15575
In practice, setting the `outputPartitioning` of a physical plan like
`ExpandExec` to `child.outputPartitioning` doesn't cause any real problem, even
this physical plan doesn't keep the same row
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/15575
@tejasapatil yeah, that is correct. however, I am wondering if we can say
this `ExpandExec` have the same distribution of rows as its child...because it
even doesn't have the `col`...
---
If your
Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/15575
@viirya : As per my understanding, if the child operator emits `col`, after
applying `ExpandExec`, the output is `col'`. The original child partitioning
being over `col`, `ExpandExec` does not
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/15575
@rxin yeah, I am curious why `ExpandExec` and `GenerateExec` have different
`outputPartitioning`...
---
If your project is set up for it, you can reply to this email and have your
reply appear on
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/15575
@tejasapatil I see there is 1:1 mapping among output partition of child
operator and output partition of `ExpandExec`.
For example we have an Expand applying on a data set like col: [1, 2,
Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/15575
@viirya
>> However, if its child has certain partition such as HashPartition, after
ExpandExec it becomes a UnknownPartitioning
The notion of `Partitioning` in Spark is the
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15575
The current thing LGTM.
cc @yhuai do you have any other feedback?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your
Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/15575
@viirya
>> In the table in the description, CoalesceExec output UnknownPartitioning
Yes. Since partitions == 1 is a corner case, I did not put that in the
table. If you look
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15575
@viirya of course if you say coalesce(1) it is a single partition -- any
operator that changes partition to 1 partition is single partition.
For Expand isn't it just the same as Generate?
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/15575
@rxin In the table in the description, `CoalesceExec` output
`UnknownPartitioning`, actually it can be `SinglePartition` if what you do is
`coalesce(1)`.
`ExpandExec` doesn't actually move
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15575
@tejasapatil you were linking to a fb internal branch :)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15575
CoalesceExec is not SinglePartition. Coalesce can lead to an arbitrary
number of partitions.
ExpandExec doesn't really change the partitioning either.
---
If your project is set up for it,
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/15575
But `ExpandExec`'s `GroupExpressions` could generate data with arbitrary
partitioning, so even it doesn't move rows across input partitions, doesn't it
still can change child' output partition?
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/15575
`CoalesceExec` can also have `SinglePartition` as `outputPartitioning` too.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your
Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/15575
@rxin : Yes. There was a comment over there [0] but based on code it does
not look like its moving rows across input partitions (cc @chenghao-intel who
added that).
[0] :
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15575
ExpandExec should follow child's output partitioning and ordering,
shouldn't it?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15575
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15575
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67320/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15575
**[Test build #67320 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67320/consoleFull)**
for PR 15575 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15575
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15575
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67319/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15575
**[Test build #67319 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67319/consoleFull)**
for PR 15575 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15575
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15575
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67318/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15575
**[Test build #67318 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67318/consoleFull)**
for PR 15575 at commit
Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/15575
@yhuai : please see the table in this PRs description. I have added a
`comment` (last column) for each entry to point out those cases.
---
If your project is set up for it, you can reply to
Github user yhuai commented on the issue:
https://github.com/apache/spark/pull/15575
> I felt that there are numerous places where child's output ordering could
be used but the operators don't set it
Can you list them at here?
---
If your project is set up for it, you can
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15575
**[Test build #67320 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67320/consoleFull)**
for PR 15575 at commit
Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/15575
Agree with the planner behavior described in the last few comments
(relevant code :
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15575
**[Test build #67319 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67319/consoleFull)**
for PR 15575 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/15575
Yeah. So I don't see any exception that an unary node, if it is not
`ShuffleExchange`, can have an `outputPartitioning` other than
`child.outputPartitioning`.
---
If your project is set up for it,
Github user yhuai commented on the issue:
https://github.com/apache/spark/pull/15575
Our planner decides if to add an `ShuffleExchange` by consider
`outputPartitioning` and `requiredDistribution` together. If the
`outputPartitioning` of the child does not satisfy the
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15575
**[Test build #67311 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67311/consoleFull)**
for PR 15575 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/15575
The query planner will inject exchange based on the required distribution
for the child of `UnaryNodeExec`. Looks like all unary nodes have
`child.outputPartitioning` as its output partition. I
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15575
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67311/
Test PASSed.
---
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15575
@yhuai let me be more clear. It is actually fairly confusing to say "this
only changes if the operator doesn't shuffle data".
The thing is that we are relying on each operator's output
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15575
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15575
**[Test build #67318 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67318/consoleFull)**
for PR 15575 at commit
Github user yhuai commented on the issue:
https://github.com/apache/spark/pull/15575
`outputPartitioning` of a node will only be changed if this node shuffles
data. Right now, only `ShuffleExchange` shuffles data.
---
If your project is set up for it, you can reply to this email
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15575
**[Test build #67311 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67311/consoleFull)**
for PR 15575 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15575
**[Test build #67309 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67309/consoleFull)**
for PR 15575 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15575
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67309/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15575
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15575
**[Test build #67309 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67309/consoleFull)**
for PR 15575 at commit
Github user tejasapatil commented on the issue:
https://github.com/apache/spark/pull/15575
Jenkins test this please
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and
50 matches
Mail list logo