[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-13 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22961
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22961
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98768/





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22961
  
Merged build finished. Test PASSed.





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22961
  
**[Test build #98768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98768/testReport)** for PR 22961 at commit [`6dd50b0`](https://github.com/apache/spark/commit/6dd50b02f607c6f1b34b00e85a2c0e11bc8518ff).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-13 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/22961
  
LGTM too





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-13 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22961
  
cool thanks! LGTM





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-13 Thread mu5358271
Github user mu5358271 commented on the issue:

https://github.com/apache/spark/pull/22961
  
I ran a performance evaluation on a 1 GB test dataset on a small cluster with the following script:

```
import java.util.UUID

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

import scala.util.{Random, Try}

@transient val sc = SparkContext.getOrCreate()
@transient val spark = SparkSession.builder().getOrCreate()

import spark.implicits._

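val totalSize = 28 // log2 of the total Int count: 2^28 Ints * 4 bytes = 1 GiB

val longSize = 6 // log2 of Ints per row: 2^6 Ints * 4 bytes = 256-byte records

val wideSize = 13 // log2 of Ints per row: 2^13 Ints * 4 bytes = 32 KiB records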

sc.parallelize(0 until (1 << (totalSize - longSize)), 200).
  map(_ => Array.fill(1 << longSize)(Random.nextInt)).
  toDS.
  write.mode("overwrite").parquet("long")

sc.parallelize(0 until (1 << (totalSize - wideSize)), 200).
  map(_ => Array.fill(1 << wideSize)(Random.nextInt)).
  toDS.
  write.mode("overwrite").parquet("wide")

val expensiveOrdering = udf((vs: Seq[Int]) => vs.foldLeft(0L)(_ + _))

for {
  format <- Seq("wide", "long")
  expensive <- Seq(true, false)
  trial <- 0 until 10
} yield {
  // Time one sort-and-write run; None means the job failed.
  val time =
    Try({
      val start = System.currentTimeMillis()
      spark.read.parquet(format).
        orderBy(if (expensive) expensiveOrdering('value) else 'value(0)).
        write.parquet(s"$format-${UUID.randomUUID}")
      System.currentTimeMillis() - start
    }).toOption
  (format, expensive, trial, time)
}
```
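
The size comments in the script follow from simple arithmetic, which the sketch below spells out. It is only an illustration: the object name `DatasetSizes` and the helper `shape` are hypothetical, and the figures cover the raw `Int` payload, not the on-disk Parquet size.

```scala
// Hypothetical helper, not part of the benchmark script: it only restates
// the arithmetic behind the "1G total", "256 Byte" and "32KB" comments.
object DatasetSizes {
  val bytesPerInt = 4L
  val totalSize = 28 // log2 of the total number of Ints, as in the script
  val longSize = 6   // log2 of Ints per row in the "long" dataset
  val wideSize = 13  // log2 of Ints per row in the "wide" dataset

  // Returns (row count, bytes per row, total bytes) for rows of 2^rowSizeLog2 Ints.
  def shape(rowSizeLog2: Int): (Long, Long, Long) = {
    val rows = 1L << (totalSize - rowSizeLog2)
    val rowBytes = (1L << rowSizeLog2) * bytesPerInt
    (rows, rowBytes, rows * rowBytes)
  }

  def main(args: Array[String]): Unit = {
    println(s"long: ${shape(longSize)}")
    println(s"wide: ${shape(wideSize)}")
  }
}
```

Under these assumptions the "long" dataset is 4,194,304 rows of 256 bytes and the "wide" dataset is 32,768 rows of 32,768 bytes, each 2^30 bytes of raw payload.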

Scenarios:
- after: with this change, using the default `spark.driver.maxResultSize` of 1g
- before: without this change, using the default `spark.driver.maxResultSize` of 1g
- before+: without this change, with `spark.driver.maxResultSize` increased from the default 1g to 4g

Columns 0 to 9 are the individual trial times in ms. An empty cell means the evaluation failed.

ordering | format | scenario | avg (ms) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
cheap | long | after | 23435.2 | 16259 | 31116 | 24585 | 21104 | 26732 | 15863 | 23716 | 25672 | 28313 | 20992
cheap | long | before | 24087.9 | 26391 | 24483 | 28731 | 24995 | 18151 | 27224 | 25278 | 16526 | 24290 | 24810
cheap | long | before+ | 21538.5 | 22336 | 31748 | 17915 | 21733 | 16393 | 20415 | 23558 | 21403 | 22264 | 17620
cheap | wide | after | 25028.7 | 26401 | 21526 | 27118 | 22763 | 41360 | 14608 | 22935 | 28918 | 21304 | 23354
cheap | wide | before |   |   |   |   |   |   |   |   |   |   |  
cheap | wide | before+ | 33324.1 | 42077 | 32455 | 38926 | 31055 | 30729 | 30532 | 30121 | 30127 | 30357 | 36862
expensive | long | after | 24989.2 | 22967 | 22490 | 22365 | 27159 | 23944 | 25401 | 22834 | 26073 | 28212 | 28447
expensive | long | before | 33553.1 | 30019 | 33404 | 32004 | 33547 | 35282 | 34149 | 33365 | 30934 | 36945 | 35882
expensive | long | before+ | 32839.4 | 32572 | 35354 | 32635 | 33385 | 32063 | 33350 | 35472 | 31771 | 31261 | 30531
expensive | wide | after | 26740.2 | 39559 | 30116 | 22777 | 24766 | 21391 | 22470 | 31302 | 18392 | 35768 | 20861
expensive | wide | before |   |   |   |   |   |   |   |   |   |   |  
expensive | wide | before+ | 254233.4 | 356997 | 309464 | 281589 | 226232 | 223588 | 224295 | 238064 | 226036 | 230633 | 225436


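- The suggested change performs roughly the same as before when the dataset has small rows and the ordering is cheap to evaluate (cheap/long: 23435.2 ms after vs. 24087.9 ms before).
- It reduces runtime when the ordering is expensive to evaluate, by distributing the ordering evaluation across the cluster (expensive/long: 24989.2 ms after vs. 33553.1 ms before).
- It reduces driver memory usage and lets the job complete successfully when rows are large, by reducing the size of the data collected to the driver (both wide cases fail before the change unless `spark.driver.maxResultSize` is raised).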
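
The last two points can be illustrated with a toy model in plain Scala collections. This is a sketch under stated assumptions, not the code in the patch and not Spark API: the object `KeyOnlySampling` and all of its members are hypothetical, each inner `Seq` stands in for one partition, and a fixed-step pick stands in for random sampling.

```scala
// Toy model only: contrasts collecting sampled rows with collecting sampled keys.
object KeyOnlySampling {
  type Row = Array[Int]

  // Stand-in for the expensive ordering expression in the benchmark script.
  def orderingKey(row: Row): Long = row.foldLeft(0L)(_ + _)

  // Deterministic stand-in for sampling: keep every `step`-th element.
  def sample[A](partition: Seq[A], step: Int): Seq[A] =
    partition.zipWithIndex.collect { case (a, i) if i % step == 0 => a }

  // Whole rows are sampled and collected; keys are then computed in one place.
  // Returns the keys and the number of payload bytes that had to be collected.
  def collectRows(partitions: Seq[Seq[Row]], step: Int): (Seq[Long], Long) = {
    val sampled = partitions.flatMap(p => sample(p, step))
    val bytes = sampled.map(_.length * 4L).sum
    (sampled.map(orderingKey), bytes)
  }

  // Keys are computed per partition; only the 8-byte keys are collected.
  def collectKeys(partitions: Seq[Seq[Row]], step: Int): (Seq[Long], Long) = {
    val sampled = partitions.flatMap(p => sample(p, step).map(orderingKey))
    (sampled, sampled.length * 8L)
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq.fill(4)(Seq.tabulate(8)(i => Array.fill(1024)(i)))
    println(collectRows(partitions, 2)._2) // bytes collected as whole rows
    println(collectKeys(partitions, 2)._2) // bytes collected as keys only
  }
}
```

Both functions yield the same keys, so the result of the sort is unaffected in this model; what changes is where the ordering is evaluated and how many bytes are collected, and the gap grows with row size.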





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22961
  
**[Test build #98768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98768/testReport)** for PR 22961 at commit [`6dd50b0`](https://github.com/apache/spark/commit/6dd50b02f607c6f1b34b00e85a2c0e11bc8518ff).





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22961
  
Merged build finished. Test PASSed.





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22961
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98722/





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22961
  
**[Test build #98722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98722/testReport)** for PR 22961 at commit [`54b60ab`](https://github.com/apache/spark/commit/54b60abfd11628cd12a8bf39e082d795b29427cf).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-12 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22961
  
do you have some benchmark numbers?





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22961
  
**[Test build #98722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98722/testReport)** for PR 22961 at commit [`54b60ab`](https://github.com/apache/spark/commit/54b60abfd11628cd12a8bf39e082d795b29427cf).





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-12 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22961
  
ok to test





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-08 Thread mu5358271
Github user mu5358271 commented on the issue:

https://github.com/apache/spark/pull/22961
  
cc @cloud-fan @gatorsmile @hvanhovell 





[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22961
  
Can one of the admins verify this patch?




