[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-10-03 Thread maropu
Github user maropu closed the pull request at:

https://github.com/apache/spark/pull/18861


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-14 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r132881614
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -596,14 +596,17 @@ object CollapseProject extends Rule[LogicalPlan] {
 object CollapseRepartition extends Rule[LogicalPlan] {
--- End diff --

ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r132881491
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -596,14 +596,17 @@ object CollapseProject extends Rule[LogicalPlan] {
 object CollapseRepartition extends Rule[LogicalPlan] {
--- End diff --

Please also add new test cases to `CollapseRepartitionSuite` for the 
changes in this rule.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131809065
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -596,14 +596,15 @@ object CollapseProject extends Rule[LogicalPlan] {
 object CollapseRepartition extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
 // Case 1: When a Repartition has a child of Repartition or 
RepartitionByExpression,
-// 1) When the top node does not enable the shuffle (i.e., coalesce 
API), but the child
-//   enables the shuffle. Returns the child node if the last 
numPartitions is bigger;
-//   otherwise, keep unchanged.
+// 1) When the top node does not enable the shuffle (i.e., coalesce 
with no user-specified
+//   strategy), but the child enables the shuffle. Returns the child 
node if the last
+//   numPartitions is bigger; otherwise, keep unchanged.
 // 2) In the other cases, returns the top node with the child's child
-case r @ Repartition(_, _, child: RepartitionOperation) => (r.shuffle, 
child.shuffle) match {
-  case (false, true) => if (r.numPartitions >= child.numPartitions) 
child else r
-  case _ => r.copy(child = child.child)
-}
+case r @ Repartition(_, _, child: RepartitionOperation, None) =>
--- End diff --

sorry, my bad. Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131808426
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -746,8 +746,20 @@ abstract class RepartitionOperation extends UnaryNode {
  * [[RepartitionByExpression]] as this method is called directly by 
DataFrame's, because the user
  * asked for `coalesce` or `repartition`. [[RepartitionByExpression]] is 
used when the consumer
  * of the output requires some specific ordering or distribution of the 
data.
+ *
+ * If `shuffle` = false (`coalesce` cases), this logical plan can have an 
user-specified strategy
+ * to coalesce input partitions.
+ *
+ * @param numPartitions How many partitions to use in the output RDD
+ * @param shuffle Whether to shuffle when repartitioning
+ * @param child the LogicalPlan
+ * @param coalescer Optional coalescer that an user specifies
  */
-case class Repartition(numPartitions: Int, shuffle: Boolean, child: 
LogicalPlan)
+case class Repartition(
+numPartitions: Int,
+shuffle: Boolean,
+child: LogicalPlan,
+coalescer: Option[PartitionCoalescer] = None)
   extends RepartitionOperation {
   require(numPartitions > 0, s"Number of partitions ($numPartitions) must 
be positive.")
--- End diff --

ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131768761
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -596,14 +596,15 @@ object CollapseProject extends Rule[LogicalPlan] {
 object CollapseRepartition extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
 // Case 1: When a Repartition has a child of Repartition or 
RepartitionByExpression,
-// 1) When the top node does not enable the shuffle (i.e., coalesce 
API), but the child
-//   enables the shuffle. Returns the child node if the last 
numPartitions is bigger;
-//   otherwise, keep unchanged.
+// 1) When the top node does not enable the shuffle (i.e., coalesce 
with no user-specified
+//   strategy), but the child enables the shuffle. Returns the child 
node if the last
+//   numPartitions is bigger; otherwise, keep unchanged.
 // 2) In the other cases, returns the top node with the child's child
-case r @ Repartition(_, _, child: RepartitionOperation) => (r.shuffle, 
child.shuffle) match {
-  case (false, true) => if (r.numPartitions >= child.numPartitions) 
child else r
-  case _ => r.copy(child = child.child)
-}
+case r @ Repartition(_, _, child: RepartitionOperation, None) =>
--- End diff --

?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131766797
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2662,6 +2662,30 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * Returns a new Dataset that an user-defined `PartitionCoalescer` 
reduces into fewer partitions.
+   * `userDefinedCoalescer` is the same with a coalescer used in the `RDD` 
coalesce function.
+   *
+   * If a larger number of partitions is requested, it will stay at the 
current
+   * number of partitions. Similar to coalesce defined on an `RDD`, this 
operation results in
+   * a narrow dependency, e.g. if you go from 1000 partitions to 100 
partitions, there will not
+   * be a shuffle, instead each of the 100 new partitions will claim 10 of 
the current partitions.
+   *
+   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 
1,
+   * this may result in your computation taking place on fewer nodes than
+   * you like (e.g. one node in the case of numPartitions = 1). To avoid 
this,
+   * you can call repartition. This will add a shuffle step, but means the
+   * current upstream partitions will be executed in parallel (per whatever
+   * the current partitioning is).
+   *
+   * @group typedrel
+   * @since 2.3.0
+   */
+  def coalesce(numPartitions: Int, userDefinedCoalescer: 
Option[PartitionCoalescer])
+: Dataset[T] = withTypedPlan {
--- End diff --

```Scala
def coalesce(
numPartitions: Int,
userDefinedCoalescer: Option[PartitionCoalescer]): Dataset[T] = 
withTypedPlan {
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131766593
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -746,8 +746,20 @@ abstract class RepartitionOperation extends UnaryNode {
  * [[RepartitionByExpression]] as this method is called directly by 
DataFrame's, because the user
  * asked for `coalesce` or `repartition`. [[RepartitionByExpression]] is 
used when the consumer
  * of the output requires some specific ordering or distribution of the 
data.
+ *
+ * If `shuffle` = false (`coalesce` cases), this logical plan can have an 
user-specified strategy
+ * to coalesce input partitions.
+ *
+ * @param numPartitions How many partitions to use in the output RDD
+ * @param shuffle Whether to shuffle when repartitioning
+ * @param child the LogicalPlan
+ * @param coalescer Optional coalescer that an user specifies
  */
-case class Repartition(numPartitions: Int, shuffle: Boolean, child: 
LogicalPlan)
+case class Repartition(
+numPartitions: Int,
+shuffle: Boolean,
+child: LogicalPlan,
+coalescer: Option[PartitionCoalescer] = None)
   extends RepartitionOperation {
   require(numPartitions > 0, s"Number of partitions ($numPartitions) must 
be positive.")
--- End diff --

Add a new require here? 
```
require(!shuffle || coalescer.isEmpty, "Custom coalescer is not allowed for 
repartition(shuffle=true)")
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131571620
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -753,6 +753,16 @@ case class Repartition(numPartitions: Int, shuffle: 
Boolean, child: LogicalPlan)
 }
 
 /**
+ * Returns a new RDD that has at most `numPartitions` partitions. This 
behavior can be modified by
+ * supplying a `PartitionCoalescer` to control the behavior of the 
partitioning.
+ */
+case class PartitionCoalesce(numPartitions: Int, coalescer: 
PartitionCoalescer, child: LogicalPlan)
+  extends UnaryNode {
--- End diff --

yea, I think so. I'll try and plz give me days to do so.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131571547
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends 
PartitionCoalescer with Seria
   totalSum += splitSize
 }
 
-while (index < partitions.size) {
+while (index < partitions.length) {
   val partition = partitions(index)
-  val fileSplit =
-
partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
-  val splitSize = fileSplit.getLength
+  val splitSize = getPartitionSize(partition)
   if (currentSum + splitSize < maxSize) {
 addPartition(partition, splitSize)
 index += 1
-if (index == partitions.size) {
-  updateGroups
+if (index == partitions.length) {
+  updateGroups()
 }
   } else {
-if (currentGroup.partitions.size == 0) {
+if (currentGroup.partitions.isEmpty) {
   addPartition(partition, splitSize)
   index += 1
 } else {
-  updateGroups
+  updateGroups()
--- End diff --

ok, I'll drop these from this pr.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131571449
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends 
PartitionCoalescer with Seria
   totalSum += splitSize
 }
 
-while (index < partitions.size) {
+while (index < partitions.length) {
   val partition = partitions(index)
-  val fileSplit =
-
partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
-  val splitSize = fileSplit.getLength
+  val splitSize = getPartitionSize(partition)
   if (currentSum + splitSize < maxSize) {
 addPartition(partition, splitSize)
 index += 1
-if (index == partitions.size) {
-  updateGroups
+if (index == partitions.length) {
+  updateGroups()
 }
   } else {
-if (currentGroup.partitions.size == 0) {
+if (currentGroup.partitions.isEmpty) {
   addPartition(partition, splitSize)
   index += 1
 } else {
-  updateGroups
+  updateGroups()
--- End diff --

I am fine about this, but it might confuse the others. Maybe just remove 
them in this PR? You can submit a separate PR later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131571248
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends 
PartitionCoalescer with Seria
   totalSum += splitSize
 }
 
-while (index < partitions.size) {
+while (index < partitions.length) {
   val partition = partitions(index)
-  val fileSplit =
-
partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
-  val splitSize = fileSplit.getLength
+  val splitSize = getPartitionSize(partition)
   if (currentSum + splitSize < maxSize) {
 addPartition(partition, splitSize)
 index += 1
-if (index == partitions.size) {
-  updateGroups
+if (index == partitions.length) {
+  updateGroups()
 }
   } else {
-if (currentGroup.partitions.size == 0) {
+if (currentGroup.partitions.isEmpty) {
   addPartition(partition, splitSize)
   index += 1
 } else {
-  updateGroups
+  updateGroups()
--- End diff --

Yea, I just left the changes of the original author (probably refactoring 
stuffs?) ..., better remove this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131570851
  
--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1185,23 +1194,21 @@ class SizeBasedCoalescer(val maxSize: Int) extends 
PartitionCoalescer with Seria
   totalSum += splitSize
 }
 
-while (index < partitions.size) {
+while (index < partitions.length) {
   val partition = partitions(index)
-  val fileSplit =
-
partition.asInstanceOf[HadoopPartition].inputSplit.value.asInstanceOf[FileSplit]
-  val splitSize = fileSplit.getLength
+  val splitSize = getPartitionSize(partition)
   if (currentSum + splitSize < maxSize) {
 addPartition(partition, splitSize)
 index += 1
-if (index == partitions.size) {
-  updateGroups
+if (index == partitions.length) {
+  updateGroups()
 }
   } else {
-if (currentGroup.partitions.size == 0) {
+if (currentGroup.partitions.isEmpty) {
   addPartition(partition, splitSize)
   index += 1
 } else {
-  updateGroups
+  updateGroups()
--- End diff --

All the above changes are not related to this PR, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131570879
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -753,6 +753,16 @@ case class Repartition(numPartitions: Int, shuffle: 
Boolean, child: LogicalPlan)
 }
 
 /**
+ * Returns a new RDD that has at most `numPartitions` partitions. This 
behavior can be modified by
+ * supplying a `PartitionCoalescer` to control the behavior of the 
partitioning.
+ */
+case class PartitionCoalesce(numPartitions: Int, coalescer: 
PartitionCoalescer, child: LogicalPlan)
+  extends UnaryNode {
--- End diff --

Adding new logical nodes also needs the updates in multiple different 
components. (e.g., Optimizer). 

Is that possible to reuse the existing node `Repartition`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131570565
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -571,7 +570,8 @@ case class UnionExec(children: Seq[SparkPlan]) extends 
SparkPlan {
  * current upstream partitions will be executed in parallel (per whatever
  * the current partitioning is).
  */
-case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends 
UnaryExecNode {
+case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: 
Option[PartitionCoalescer])
--- End diff --

ok!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18861#discussion_r131570472
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -571,7 +570,8 @@ case class UnionExec(children: Seq[SparkPlan]) extends 
SparkPlan {
  * current upstream partitions will be executed in parallel (per whatever
  * the current partitioning is).
  */
-case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends 
UnaryExecNode {
+case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: 
Option[PartitionCoalescer])
--- End diff --

Could you add the parm description of `coalescer`? also update function 
descriptions? Thanks~!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18861: [SPARK-19426][SQL] Custom coalescer for Dataset

2017-08-06 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/18861

[SPARK-19426][SQL] Custom coalescer for Dataset

## What changes were proposed in this pull request?
This pr added a new API for `coalesce` in `Dataset`; users can specify the 
custom coalescer which reduces an input Dataset into fewer partitions. This 
coalescer implementation is the same with the one in `RDD#coalesce` added in 
#11865 (SPARK-14042).

This is the rework of #16766.

## How was this patch tested?
Added tests in `DatasetSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark pr16766

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18861.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18861


commit 9abcd75474f188d7b9f4ed4caab6bb8e98c1da4d
Author: Marius van Niekerk 
Date:   2016-11-07T22:06:38Z

Implement custom coalesce

commit bb7f9b0ab06ec0be8666e14846e19fe0c28e5143
Author: Takeshi Yamamuro 
Date:   2017-08-06T14:44:51Z

Add more fixes




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org