[GitHub] spark issue #19978: [SPARK-22784][CORE][WIP] Configure reading buffer size i...
Github user MikhailErofeev commented on the issue: https://github.com/apache/spark/pull/19978

@srowen, yes, the processing is no longer IO-bound after backporting SPARK-20923.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19978: [SPARK-22784][CORE][WIP] Configure reading buffer...
Github user MikhailErofeev closed the pull request at: https://github.com/apache/spark/pull/19978
[GitHub] spark issue #19978: [SPARK-22784][CORE] Configure reading buffer size in Spa...
Github user MikhailErofeev commented on the issue: https://github.com/apache/spark/pull/19978

@squito Your guess was right; these blocks can be removed by https://issues.apache.org/jira/browse/SPARK-20923. I will test the performance after that patch and then refine or close the ticket.
[GitHub] spark issue #19978: [SPARK-22784][CORE] Configure reading buffer size in Spa...
Github user MikhailErofeev commented on the issue: https://github.com/apache/spark/pull/19978

Thanks for the constructive feedback. Here is my benchmark with a 1 MB step (buffer size in bytes, then read time). During this run the speedup was 23%; I think there was some interference on my workstation.

```
2048     213.481
1048576  212.256
2097152  206.07
3145728  199.057
4194304  192.292
5242880  187.91
6291456  182.95
7340032  179.837
8388608  176.994
9437184  174.602
10485760 173.313  < 10MB, 19% saving
11534336 172.1
12582912 176.085
13631488 171.782
14680064 172.209
15728640 172.048
16777216 168.313
17825792 169.521
18874368 167.466
19922944 167.023
20971520 167.7    < 20M, 21% saving
22020096 166.754
23068672 166.141
24117248 165.809
25165824 166.053
26214400 165.281
27262976 165.381
28311552 164.564
29360128 164.894
30408704 164.599  < 30M, x20 of avg line size, 23% saving
31457280 164.019
32505856 164.289
33554432 164.517
34603008 163.96
35651584 163.936
36700160 163.381
37748736 164.156
38797312 164.061
39845888 163.636
40894464 163.73
41943040 162.462
42991616 163.006
44040192 162.586
45088768 162.363
```

@squito our main users run iterative algorithms with many partitions. They have a lot of data and prefer smaller partitions (that is another improvement axis). So the long lines come from SparkListenerTaskEnd events and their block info:

```
"Block ID": "rdd_5_30129",
"Status": {
    "Storage Level": {
        "Use Disk": true,
        "Use Memory": false,
        "Deserialized": false,
        "Replication": 1
    },
    "Memory Size": 0,
    "Disk Size": 57174029
}
},
```
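The sweep above can be reproduced with a minimal harness along these lines (a sketch, not the actual code used for the measurements; the event-log path is a placeholder):

```python
import time

def time_read(path, buffer_size):
    """Read a file line by line with the given buffer size; return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "r", buffering=buffer_size) as f:
        for _line in f:
            pass  # ReplayListenerBus would deserialize each JSON event here
    return time.perf_counter() - start

if __name__ == "__main__":
    # Sweep from the 2048-byte default upward in 1 MB steps, mirroring the table above.
    for size in [2048] + [m * 1024 * 1024 for m in range(1, 45)]:
        print(size, time_read("eventlog.json", size))  # placeholder path
```

For stable numbers, each size should be run several times on an otherwise idle machine, since (as the 23% vs. 19-21% spread above suggests) background load skews single runs.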
[GitHub] spark issue #19978: [SPARK-22784][CORE] Configure reading buffer size in Spa...
Github user MikhailErofeev commented on the issue: https://github.com/apache/spark/pull/19978

I don't mind just setting it to a higher value. Moreover, the current default (2048 bytes) is small in any case. For my log files a 30 MB buffer was the best value (a bigger one did not bring much additional speedup), although for other files the optimal value could be larger. What do you think, is it OK to keep the value at 30 MB? With 50 cores it could eat 1.5 GB.
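For context, if the proposed setting were merged as described, configuring it would presumably look like this in `spark-defaults.conf` (a hypothetical sketch: the key name comes from this PR, which was ultimately closed, so it does not exist in released Spark; the size syntax assumes Spark's usual byte-size format):

```
# spark-defaults.conf -- hypothetical; spark.history.fs.buffer.size was only proposed in this PR
spark.history.fs.buffer.size   30m
```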
[GitHub] spark pull request #19978: [SPARK-22784][CORE] Configure reading buffer size...
GitHub user MikhailErofeev opened a pull request: https://github.com/apache/spark/pull/19978

[SPARK-22784][CORE] Configure reading buffer size in Spark History Server

## What changes were proposed in this pull request?

Added debug logging of the time spent and the line size for each job. Parametrized `ReplayListenerBus` with a new buffer size parameter, `spark.history.fs.buffer.size`. Added documentation for the parameter.

## How was this patch tested?

Existing tests for correctness; manual tests (reading a file in a loop with different buffer sizes) for performance measurements.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MikhailErofeev/spark feature/shs-buffer-upstream

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19978.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19978

commit db1dc533a17564d531e84ad4f41ae9152d8619a5
Author: m.erofeev <m.erof...@criteo.com>
Date: 2017-12-14T10:46:59Z

    [SPARK-22784][CORE] Configure reading buffer size in Spark History Server

    Increasing the buffer size in ReplayListenerBus speeds up reading when event strings are long.
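The change is small in spirit: the replay loop takes a configurable buffer size and logs how long each file took and how long its lines were. A rough Python analogue of that idea (hypothetical names; the real change is in Scala's `ReplayListenerBus`):

```python
import time

def replay(path, buffer_size=2048, log=print):
    """Replay an event log line by line, reporting elapsed time and the
    longest line seen -- analogous to the debug logging this PR adds.
    The 2048-byte default mirrors the current default discussed above."""
    start = time.perf_counter()
    max_line = 0
    with open(path, "r", buffering=buffer_size) as f:
        for line in f:
            max_line = max(max_line, len(line))
            # a real ReplayListenerBus would parse the JSON event here
    log(f"replayed {path} in {time.perf_counter() - start:.3f}s, "
        f"longest line: {max_line} chars")
    return max_line
```

A larger buffer helps precisely when `max_line` far exceeds the buffer size, as with the ~57 MB block-info lines shown earlier in the thread.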