[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22961 thanks, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22961 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98768/ Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22961 Merged build finished. Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22961 **[Test build #98768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98768/testReport)** for PR 22961 at commit [`6dd50b0`](https://github.com/apache/spark/commit/6dd50b02f607c6f1b34b00e85a2c0e11bc8518ff).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22961 LGTM too
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22961 cool thanks! LGTM
Github user mu5358271 commented on the issue: https://github.com/apache/spark/pull/22961

Did some performance evaluation on a 1G test dataset on a small cluster with the following script:

```
import java.util.UUID

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

import scala.util.{Random, Try}

@transient val sc = SparkContext.getOrCreate()
@transient val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val totalSize = 28 // 1G total
val longSize = 6 // 256 Byte records
val wideSize = 13 // 32KB records

sc.parallelize(0 until (1 << (totalSize - longSize)), 200).
  map(_ => Array.fill(1 << longSize)(Random.nextInt)).
  toDS.
  write.mode("overwrite").parquet("long")

sc.parallelize(0 until (1 << (totalSize - wideSize)), 200).
  map(_ => Array.fill(1 << wideSize)(Random.nextInt)).
  toDS.
  write.mode("overwrite").parquet("wide")

val expensiveOrdering = udf((vs: Seq[Int]) => vs.foldLeft(0L)(_ + _))

for {
  format <- Seq("wide", "long")
  expensive <- Seq(true, false)
  trial <- 0 until 10
} yield {
  val time = Try({
    val start = System.currentTimeMillis()
    spark.read.parquet(format).orderBy(if (expensive) expensiveOrdering('value) else 'value(0)).write.parquet(s"$format-${UUID.randomUUID}")
    System.currentTimeMillis() - start
  }).toOption
  (format, expensive, trial, time)
}
```

scenarios:
- after: with this change and using default 1g spark.driver.maxResultSize
- before: without this change and using default 1g spark.driver.maxResultSize
- before+: without this change and increase spark.driver.maxResultSize from default 1g to 4g.

no value means evaluation failed.

ordering | format | scenario | avg (ms) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
cheap | long | after | 23435.2 | 16259 | 31116 | 24585 | 21104 | 26732 | 15863 | 23716 | 25672 | 28313 | 20992
cheap | long | before | 24087.9 | 26391 | 24483 | 28731 | 24995 | 18151 | 27224 | 25278 | 16526 | 24290 | 24810
cheap | long | before+ | 21538.5 | 22336 | 31748 | 17915 | 21733 | 16393 | 20415 | 23558 | 21403 | 22264 | 17620
cheap | wide | after | 25028.7 | 26401 | 21526 | 27118 | 22763 | 41360 | 14608 | 22935 | 28918 | 21304 | 23354
cheap | wide | before |  |  |  |  |  |  |  |  |  |  | 
cheap | wide | before+ | 33324.1 | 42077 | 32455 | 38926 | 31055 | 30729 | 30532 | 30121 | 30127 | 30357 | 36862
expensive | long | after | 24989.2 | 22967 | 22490 | 22365 | 27159 | 23944 | 25401 | 22834 | 26073 | 28212 | 28447
expensive | long | before | 33553.1 | 30019 | 33404 | 32004 | 33547 | 35282 | 34149 | 33365 | 30934 | 36945 | 35882
expensive | long | before+ | 32839.4 | 32572 | 35354 | 32635 | 33385 | 32063 | 33350 | 35472 | 31771 | 31261 | 30531
expensive | wide | after | 26740.2 | 39559 | 30116 | 22777 | 24766 | 21391 | 22470 | 31302 | 18392 | 35768 | 20861
expensive | wide | before |  |  |  |  |  |  |  |  |  |  | 
expensive | wide | before+ | 254233.4 | 356997 | 309464 | 281589 | 226232 | 223588 | 224295 | 238064 | 226036 | 230633 | 225436

- the suggested change has roughly the same performance as before when the dataset has small rows and the ordering evaluation is cheap.
- it reduces runtime when the ordering evaluation is expensive by distributing ordering evaluation across the cluster.
- it reduces driver memory usage and helps job complete successfully when rows are large by reducing the size of data collected to the driver
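
For readers checking the dataset sizes quoted in the script's comments, here is a minimal sketch of the arithmetic. It assumes each `Int` counts as 4 bytes of payload and ignores Parquet encoding, compression, and per-row overhead, so the on-disk sizes will differ. The object and helper names (`DatasetSizes`, `recordBytes`, `recordCount`) are hypothetical and are not part of the script above.

```scala
// Sketch only: sanity-check the "1G total", "256 Byte" and "32KB" comments.
// Assumption: 4 bytes of payload per Int; storage overhead is ignored.
object DatasetSizes {
  val bytesPerInt = 4L

  // Payload bytes of one record holding 2^sizeLog2 Ints.
  def recordBytes(sizeLog2: Int): Long = (1L << sizeLog2) * bytesPerInt

  // Number of records generated for a given record size.
  def recordCount(totalLog2: Int, sizeLog2: Int): Long =
    1L << (totalLog2 - sizeLog2)

  def main(args: Array[String]): Unit = {
    val totalSize = 28 // same exponent as in the benchmark script
    for ((name, sizeLog2) <- Seq("long" -> 6, "wide" -> 13)) {
      val count = recordCount(totalSize, sizeLog2)
      val bytes = recordBytes(sizeLog2)
      println(s"$name: $count records x $bytes bytes = ${count * bytes} bytes")
    }
  }
}
```

Under that assumption both datasets carry the same 2^30 bytes of payload; only the record width differs (2^22 narrow records versus 2^15 wide ones), which is what lets the table compare the "long" and "wide" layouts at equal total size.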
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22961 **[Test build #98768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98768/testReport)** for PR 22961 at commit [`6dd50b0`](https://github.com/apache/spark/commit/6dd50b02f607c6f1b34b00e85a2c0e11bc8518ff).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22961 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22961 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98722/ Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22961 **[Test build #98722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98722/testReport)** for PR 22961 at commit [`54b60ab`](https://github.com/apache/spark/commit/54b60abfd11628cd12a8bf39e082d795b29427cf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22961 do you have some benchmark numbers?
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22961 **[Test build #98722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98722/testReport)** for PR 22961 at commit [`54b60ab`](https://github.com/apache/spark/commit/54b60abfd11628cd12a8bf39e082d795b29427cf).
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22961 ok to test
Github user mu5358271 commented on the issue: https://github.com/apache/spark/pull/22961 cc @cloud-fan @gatorsmile @hvanhovell
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22961 Can one of the admins verify this patch?