Re: Shuffle write increases in spark 1.2
I think Xuefeng Wu's suggestion is likely correct. This difference is more likely explained by the compression library changing versions than by sort vs. hash shuffle (which should not affect output size significantly). Others have reported that switching to lz4 fixed their issue; we should document this if that turns out to be the case. I wonder if we're asking Snappy to be super-low-overhead and the new version simply does a better job of it (less overhead, less compression).

On Sat, Feb 14, 2015 at 9:32 AM, Peng Cheng pc...@uow.edu.au wrote:

> I double-checked the 1.2 feature list and found out that the new sort-based shuffle manager has nothing to do with HashPartitioner :-( Sorry for the misinformation. On the other hand, this may explain the increase in shuffle spill as a side effect of the new shuffle manager. Let me revert spark.shuffle.manager to hash and see if it makes things better (or worse, as the benchmark in https://issues.apache.org/jira/browse/SPARK-3280 indicates).
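For anyone who wants to try the lz4 suggestion, the switch is a single configuration property. A minimal sketch, assuming Spark 1.2's standard SparkConf API (the application name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Switch block compression (shuffle outputs, spills, broadcasts) from
    // the Spark 1.2 default "snappy" to "lz4".
    val conf = new SparkConf()
      .setAppName("shuffle-codec-test") // illustrative name
      .set("spark.io.compression.codec", "lz4")
    val sc = new SparkContext(conf)

The same property can also be set per job via spark-submit's --conf flag, without touching any code.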
Re: Shuffle write increases in spark 1.2
I have seen the same behavior! I would love to hear an update on this... Thanks, Ami

On Thu, Feb 5, 2015 at 8:26 AM, Anubhav Srivastav anubhav.srivas...@gmail.com wrote:

> Hi Kevin, we seem to be facing the same problem as well. Were you able to find anything after that? The ticket does not seem to have progressed anywhere. Regards, Anubhav
Re: Shuffle write increases in spark 1.2
I double-checked the 1.2 feature list and found out that the new sort-based shuffle manager has nothing to do with HashPartitioner :-( Sorry for the misinformation. On the other hand, this may explain the increase in shuffle spill as a side effect of the new shuffle manager. Let me revert spark.shuffle.manager to hash and see if it makes things better (or worse, as the benchmark in https://issues.apache.org/jira/browse/SPARK-3280 indicates).
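For reference, the revert Peng describes is likewise a one-property change. A sketch, assuming the Spark 1.2 configuration keys:

    import org.apache.spark.SparkConf

    // Spark 1.2 changed the default spark.shuffle.manager from "hash" to
    // "sort"; this pins it back to the 1.1 behaviour for comparison.
    val conf = new SparkConf().set("spark.shuffle.manager", "hash")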
Re: Shuffle write increases in spark 1.2
Same problem here: shuffle write increased from 10G to over 64G, and since I'm running on Amazon EC2 this always causes the temporary folder to consume all the disk space. Still looking for a solution. BTW, the 64G shuffle write is encountered when shuffling a pair RDD with a HashPartitioner, so it's not related to Spark 1.2.0's new features.

Yours, Peng
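A hypothetical sketch of the kind of job Peng describes - an explicit HashPartitioner shuffle on a pair RDD (key space, data size, and partition count are all illustrative, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.2

    // Build a pair RDD and force a hash-partitioned shuffle; the resulting
    // shuffle write shows up per stage in the web UI.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))
    val repartitioned = pairs.partitionBy(new HashPartitioner(200))
    repartitioned.count() // action that materializes the shuffle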
Re: Shuffle write increases in spark 1.2
Hello, as the original message never got accepted to the mailing list, I quote it here completely:

Kevin Jung wrote:

> Hi all, the size of shuffle write shown in the Spark web UI is much different when I execute the same Spark job on the same input data (100GB) in both Spark 1.1 and Spark 1.2. At the same sortBy stage, the size of the shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed, but Spark 1.2 still writes larger files than Spark 1.1. Can anyone tell me why this happens? Thanks, Kevin

I'm experiencing the same thing with my job, and here is what I tested:

* Spark 1.2.0 with sort-based shuffle
* Spark 1.2.0 with hash-based shuffle
* Spark 1.2.1 with sort-based shuffle

All three combinations show the same behaviour, which contrasts with Spark 1.1.0: there my job runs for about an hour, while in Spark 1.2.x it runs for almost four hours. The configuration is identical otherwise - I only added org.apache.spark.scheduler.CompressedMapStatus to the Kryo registrator for Spark 1.2.0 to cope with https://issues.apache.org/jira/browse/SPARK-5102. As a consequence (I think, but the causality might be different) I see lots and lots of disk spills.

I cannot provide a small test case, but maybe the log entries for a single worker thread can help someone investigate this (see below). I will also open up an issue, if nobody stops me by providing an answer ;) Any help will be greatly appreciated, because otherwise I'm stuck with Spark 1.1.0, as quadrupling the runtime is not an option.

Sincerely, Chris

2015-02-09T14:06:06.328+01:00 INFO org.apache.spark.executor.Executor Running task 9.0 in stage 18.0 (TID 300) [Executor task launch worker-18]
2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.CacheManager Partition rdd_35_9 not found, computing it [Executor task launch worker-18]
2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 10 non-empty blocks out of 10 blocks [Executor task launch worker-18]
2015-02-09T14:06:06.351+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms [Executor task launch worker-18]
2015-02-09T14:06:07.396+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(2582904) called with curMem=300174944, maxMe... [Executor task launch worker-18]
2015-02-09T14:06:07.397+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_35_9 stored as bytes in memory (estimated size 2.5... [Executor task launch worker-18]
2015-02-09T14:06:07.398+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block rdd_35_9 [Executor task launch worker-18]
2015-02-09T14:06:07.399+01:00 INFO org.apache.spark.CacheManager Partition rdd_38_9 not found, computing it [Executor task launch worker-18]
2015-02-09T14:06:07.399+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 10 non-empty blocks out of 10 blocks [Executor task launch worker-18]
2015-02-09T14:06:07.400+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms [Executor task launch worker-18]
2015-02-09T14:06:07.567+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(944848) called with curMem=302757848, maxMem... [Executor task launch worker-18]
2015-02-09T14:06:07.568+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_38_9 stored as values in memory (estimated size 92... [Executor task launch worker-18]
2015-02-09T14:06:07.569+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block rdd_38_9 [Executor task launch worker-18]
2015-02-09T14:06:07.573+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 34 non-empty blocks out of 50 blocks [Executor task launch worker-18]
2015-02-09T14:06:07.573+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 1 ms [Executor task launch worker-18]
2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.CacheManager Partition rdd_41_9 not found, computing it [Executor task launch worker-18]
2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Getting 3 non-empty blocks out of 10 blocks [Executor task launch worker-18]
2015-02-09T14:06:38.931+01:00 INFO org.apache.spark.storage.ShuffleBlockFetcherIterator Started 0 remote fetches in 0 ms [Executor task launch worker-18]
2015-02-09T14:06:38.945+01:00 INFO org.apache.spark.storage.MemoryStore ensureFreeSpace(0) called with curMem=307529127, maxMem=9261... [Executor task launch worker-18]
2015-02-09T14:06:38.945+01:00 INFO org.apache.spark.storage.MemoryStore Block rdd_41_9 stored as bytes in memory (estimated size 0.0... [Executor task launch worker-18]
2015-02-09T14:06:38.946+01:00 INFO org.apache.spark.storage.BlockManagerMaster Updated info of block rdd_41_9 [Executor task launch worker-18]
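For reference, the SPARK-5102 workaround Chris mentions amounts to a Kryo registrator along these lines (a sketch: the registrator class name is hypothetical, and Class.forName is used because CompressedMapStatus is not public API in Spark 1.2):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Registers Spark's CompressedMapStatus with Kryo so that map-status
    // serialization works when Kryo registration is required.
    class ShuffleStatusRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(Class.forName("org.apache.spark.scheduler.CompressedMapStatus"))
      }
    }

    // Enabled with, for example:
    //   spark.serializer        org.apache.spark.serializer.KryoSerializer
    //   spark.kryo.registrator  ShuffleStatusRegistrator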
Re: Shuffle write increases in spark 1.2
Hello, as a follow-up to my previous message (quoted above, with Kevin Jung's original report, the three shuffle configurations I tested, and the worker-thread log excerpt): I have now opened an issue on this, see https://issues.apache.org/jira/browse/SPARK-5715. Any help will be greatly appreciated, because otherwise I'm stuck with Spark 1.1.0, as quadrupling the runtime is not an option. Sincerely, Chris
Re: Shuffle write increases in spark 1.2
Hi Kevin, we seem to be facing the same problem as well. Were you able to find anything after that? The ticket does not seem to have progressed anywhere. Regards, Anubhav

On 5 January 2015 at 10:37, 정재부 itsjb.j...@samsung.com wrote:

> Sure, here is a ticket: https://issues.apache.org/jira/browse/SPARK-5081
Re: Shuffle write increases in spark 1.2
Sure, here is a ticket: https://issues.apache.org/jira/browse/SPARK-5081

--- Original Message ---
Sender: Josh Rosen rosenvi...@gmail.com
Date: 2015-01-05 06:14 (GMT+09:00)
Title: Re: Shuffle write increases in spark 1.2

If you have a small reproduction for this issue, can you open a ticket at https://issues.apache.org/jira/browse/SPARK ?

On December 29, 2014 at 7:10:02 PM, Kevin Jung (itsjb.j...@samsung.com) wrote:

> Hi all, the size of shuffle write shown in the Spark web UI is much different when I execute the same Spark job on the same input data (100GB) in both Spark 1.1 and Spark 1.2. At the same sortBy stage, the size of the shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed, but Spark 1.2 writes larger files than Spark 1.1. Can anyone tell me why this happened? Thanks, Kevin
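Kevin's setup, as described, boils down to something like the following hypothetical sketch (the input and output paths are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Pin the pre-1.2 shuffle manager, then run a sortBy stage like the one
    // whose shuffle write is being compared between Spark 1.1 and 1.2.
    val conf = new SparkConf().set("spark.shuffle.manager", "hash")
    val sc = new SparkContext(conf)
    val sorted = sc.textFile("hdfs:///data/input-100g").sortBy(line => line)
    sorted.saveAsTextFile("hdfs:///data/sorted") // materializes the shuffle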