Re: Shuffle write increases in spark 1.2

2015-02-15 Thread Aaron Davidson
I think Xuefeng Wu's suggestion is likely correct. This difference is more
likely explained by the compression library changing versions than by sort vs
hash shuffle (which should not affect output size significantly). Others
have reported that switching to lz4 fixed their issue.

We should document this if this is the case. I wonder if we're asking
Snappy to be super-low-overhead and as a result the new version does a
better job of it (less overhead, less compression).
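
For anyone who wants to test that hypothesis, the codec is a one-line config
change. A minimal sketch of the driver-side setting, assuming the standard
Spark 1.x I/O compression key (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: switch the I/O (and hence shuffle) compression codec from the
// 1.x default "snappy" to "lz4", then compare shuffle write sizes in the web UI.
val conf = new SparkConf()
  .setAppName("shuffle-write-comparison")     // placeholder app name
  .set("spark.io.compression.codec", "lz4")
val sc = new SparkContext(conf)

The same switch can be made without touching code by passing
--conf spark.io.compression.codec=lz4 to spark-submit.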

On Sat, Feb 14, 2015 at 9:32 AM, Peng Cheng pc...@uow.edu.au wrote:

 I double-checked the 1.2 feature list and found out that the new sort-based
 shuffle manager has nothing to do with HashPartitioner :- Sorry for the
 misinformation.

 On the other hand, this may explain the increase in shuffle spill as a side
 effect of the new shuffle manager. Let me revert spark.shuffle.manager to
 hash and see if it makes things better (or worse, as the benchmark in
 https://issues.apache.org/jira/browse/SPARK-3280 indicates).






Re: Shuffle write increases in spark 1.2

2015-02-15 Thread Ami Khandeshi
I have seen the same behavior! I would love to hear an update on this...

Thanks,

Ami

On Thu, Feb 5, 2015 at 8:26 AM, Anubhav Srivastav 
anubhav.srivas...@gmail.com wrote:

 Hi Kevin,
 We seem to be facing the same problem as well. Were you able to find
 anything after that? The ticket does not seem to have progressed anywhere.

 Regards,
 Anubhav

 On 5 January 2015 at 10:37, 정재부 itsjb.j...@samsung.com wrote:

  Sure, here is a ticket. https://issues.apache.org/jira/browse/SPARK-5081



 --- *Original Message* ---

 *Sender* : Josh Rosen rosenvi...@gmail.com

 *Date* : 2015-01-05 06:14 (GMT+09:00)

 *Title* : Re: Shuffle write increases in spark 1.2


 If you have a small reproduction for this issue, can you open a ticket at
 https://issues.apache.org/jira/browse/SPARK ?



 On December 29, 2014 at 7:10:02 PM, Kevin Jung (itsjb.j...@samsung.com)
 wrote:

  Hi all,
 The size of shuffle write showing in the spark web UI is much different when
 I execute the same spark job on the same input data (100GB) in both spark 1.1
 and spark 1.2.
 At the same sortBy stage, the size of shuffle write is 39.7GB in spark 1.1
 but 91.0GB in spark 1.2.
 I set the spark.shuffle.manager option to hash because its default value has
 changed, but spark 1.2 writes a larger file than spark 1.1.
 Can anyone tell me why this happened?

 Thanks
 Kevin








Re: Shuffle write increases in spark 1.2

2015-02-14 Thread Peng Cheng
I double-checked the 1.2 feature list and found out that the new sort-based
shuffle manager has nothing to do with HashPartitioner :- Sorry for the
misinformation.

On the other hand, this may explain the increase in shuffle spill as a side
effect of the new shuffle manager. Let me revert spark.shuffle.manager to hash
and see if it makes things better (or worse, as the benchmark in
https://issues.apache.org/jira/browse/SPARK-3280 indicates).
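
For reference, reverting to the hash-based shuffle is a plain config switch;
a minimal sketch, assuming the Spark 1.2 key and values:

import org.apache.spark.SparkConf

// Sketch only: fall back to the pre-1.2 hash-based shuffle to check whether
// the reported shuffle write size changes. "sort" is the default in Spark 1.2.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "hash")

The same setting can go into spark-defaults.conf or be passed as
--conf spark.shuffle.manager=hash to spark-submit.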






Re: Shuffle write increases in spark 1.2

2015-02-14 Thread Peng Cheng
Same problem here: shuffle write increased from 10G to over 64G, and since I'm
running on Amazon EC2 this always causes the temporary folder to consume all
the disk space. Still looking for a solution.
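
Not a fix for the size increase itself, but a sketch of pointing the shuffle
temp directories at the larger disks so the root volume does not fill up; the
/mnt paths below are assumptions about the EC2 instance layout, not a given:

import org.apache.spark.SparkConf

// Sketch only: shuffle and spill files land in spark.local.dir, so pointing it
// at the bigger instance-store volumes keeps the root disk from filling.
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/spark,/mnt2/spark")   // assumed mount points

Note that on a standalone cluster the workers' SPARK_LOCAL_DIRS environment
variable, if set, takes precedence over this property.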

BTW, the 64G shuffle write is encountered when shuffling a pairRDD with
HashPartitioner, so it's not related to Spark 1.2.0's new features.

Yours Peng






Re: Shuffle write increases in spark 1.2

2015-02-10 Thread chris
Hello,

as the original message never got accepted to the mailing list, I quote it
here completely:


Kevin Jung wrote
 Hi all,
 The size of shuffle write showing in the spark web UI is much different when
 I execute the same spark job on the same input data (100GB) in both spark 1.1
 and spark 1.2.
 At the same sortBy stage, the size of shuffle write is 39.7GB in spark 1.1
 but 91.0GB in spark 1.2.
 I set the spark.shuffle.manager option to hash because its default value has
 changed, but spark 1.2 writes a larger file than spark 1.1.
 Can anyone tell me why this happens?
 
 Thanks
 Kevin

I'm experiencing the same thing with my job, and this is what I tested:

* Spark 1.2.0 with Sort-based Shuffle
* Spark 1.2.0 with Hash-based Shuffle
* Spark 1.2.1 with Sort-based Shuffle

All three combinations show the same behaviour, which contrasts with Spark
1.1.0.

In Spark 1.1.0, my job runs for about an hour, in Spark 1.2.x it runs for
almost four hours. Configuration is identical otherwise - I only added
org.apache.spark.scheduler.CompressedMapStatus to the Kryo registrator for
Spark 1.2.0 to cope with https://issues.apache.org/jira/browse/SPARK-5102.
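
For anyone hitting the same thing, that registration looks roughly like the
sketch below; the registrator class name is a placeholder, and the class is
registered by name because CompressedMapStatus is private to Spark:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Sketch only: the SPARK-5102 workaround of registering CompressedMapStatus.
class ShuffleStatusRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(Class.forName("org.apache.spark.scheduler.CompressedMapStatus"))
    // ... register the application's own classes here ...
  }
}

// Wire it in through the usual Kryo settings (use the fully qualified name).
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "ShuffleStatusRegistrator")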


As a consequence (I think, but causality might be different) I see lots and
lots of disk spills.
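
If the spills turn out to matter, the relevant Spark 1.2 knobs are the shuffle
memory fraction and the shuffle compression settings; a sketch only, with
illustrative values rather than recommendations:

import org.apache.spark.SparkConf

// Sketch only: shuffle/spill-related settings in Spark 1.2.x. Raising the
// shuffle memory fraction leaves less room for cached RDDs, so measure both.
val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.4")    // default 0.2
  .set("spark.shuffle.compress", "true")         // default true
  .set("spark.shuffle.spill.compress", "true")   // default true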

I cannot provide a small test case, but maybe the log entries for a single
worker thread can help someone investigate on this. (See below.)


I will also open up an issue, if nobody stops me by providing an answer ;)

Any help will be greatly appreciated, because otherwise I'm stuck with Spark
1.1.0, as quadrupling runtime is not an option.

Sincerely,

Chris



2015-02-09T14:06:06.328+01:00  INFO  org.apache.spark.executor.Executor  Running task 9.0 in stage 18.0 (TID 300)  Executor task launch worker-18
2015-02-09T14:06:06.351+01:00  INFO  org.apache.spark.CacheManager  Partition rdd_35_9 not found, computing it  Executor task launch worker-18
2015-02-09T14:06:06.351+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Getting 10 non-empty blocks out of 10 blocks  Executor task launch worker-18
2015-02-09T14:06:06.351+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Started 0 remote fetches in 0 ms  Executor task launch worker-18
2015-02-09T14:06:07.396+01:00  INFO  org.apache.spark.storage.MemoryStore  ensureFreeSpace(2582904) called with curMem=300174944, maxMe...  Executor task launch worker-18
2015-02-09T14:06:07.397+01:00  INFO  org.apache.spark.storage.MemoryStore  Block rdd_35_9 stored as bytes in memory (estimated size 2.5...  Executor task launch worker-18
2015-02-09T14:06:07.398+01:00  INFO  org.apache.spark.storage.BlockManagerMaster  Updated info of block rdd_35_9  Executor task launch worker-18
2015-02-09T14:06:07.399+01:00  INFO  org.apache.spark.CacheManager  Partition rdd_38_9 not found, computing it  Executor task launch worker-18
2015-02-09T14:06:07.399+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Getting 10 non-empty blocks out of 10 blocks  Executor task launch worker-18
2015-02-09T14:06:07.400+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Started 0 remote fetches in 0 ms  Executor task launch worker-18
2015-02-09T14:06:07.567+01:00  INFO  org.apache.spark.storage.MemoryStore  ensureFreeSpace(944848) called with curMem=302757848, maxMem...  Executor task launch worker-18
2015-02-09T14:06:07.568+01:00  INFO  org.apache.spark.storage.MemoryStore  Block rdd_38_9 stored as values in memory (estimated size 92...  Executor task launch worker-18
2015-02-09T14:06:07.569+01:00  INFO  org.apache.spark.storage.BlockManagerMaster  Updated info of block rdd_38_9  Executor task launch worker-18
2015-02-09T14:06:07.573+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Getting 34 non-empty blocks out of 50 blocks  Executor task launch worker-18
2015-02-09T14:06:07.573+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Started 0 remote fetches in 1 ms  Executor task launch worker-18
2015-02-09T14:06:38.931+01:00  INFO  org.apache.spark.CacheManager  Partition rdd_41_9 not found, computing it  Executor task launch worker-18
2015-02-09T14:06:38.931+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Getting 3 non-empty blocks out of 10 blocks  Executor task launch worker-18
2015-02-09T14:06:38.931+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Started 0 remote fetches in 0 ms  Executor task launch worker-18
2015-02-09T14:06:38.945+01:00  INFO  org.apache.spark.storage.MemoryStore  ensureFreeSpace(0) called with curMem=307529127, maxMem=9261...  Executor task launch worker-18
2015-02-09T14:06:38.945+01:00  INFO  org.apache.spark.storage.MemoryStore  Block rdd_41_9 stored as bytes in memory (estimated size 0.0...  Executor task launch worker-18
2015-02-09T14:06:38.946+01:00  INFO  org.apache.spark.storage.BlockManagerMaster  Updated info of block rdd_41_9  Executor task launch worker-18

Re: Shuffle write increases in spark 1.2

2015-02-10 Thread chris
Hello,

as the original message from Kevin Jung never got accepted to the
mailing list, I quote it here completely:


Kevin Jung wrote
 Hi all,
 The size of shuffle write showing in the spark web UI is much different when
 I execute the same spark job on the same input data (100GB) in both spark 1.1
 and spark 1.2.
 At the same sortBy stage, the size of shuffle write is 39.7GB in spark 1.1
 but 91.0GB in spark 1.2.
 I set the spark.shuffle.manager option to hash because its default value has
 changed, but spark 1.2 writes a larger file than spark 1.1.
 Can anyone tell me why this happens?
 
 Thanks
 Kevin

I'm experiencing the same thing with my job, and this is what I tested:

* Spark 1.2.0 with Sort-based Shuffle
* Spark 1.2.0 with Hash-based Shuffle
* Spark 1.2.1 with Sort-based Shuffle

All three combinations show the same behaviour, which contrasts with Spark
1.1.0.

In Spark 1.1.0, my job runs for about an hour, in Spark 1.2.x it runs for
almost four hours. Configuration is identical otherwise - I only added
org.apache.spark.scheduler.CompressedMapStatus to the Kryo registrator for
Spark 1.2.0 to cope with https://issues.apache.org/jira/browse/SPARK-5102.


As a consequence (I think, but causality might be different) I see lots and
lots of disk spills.

I cannot provide a small test case, but maybe the log entries for a single
worker thread can help someone investigate on this. (See below.)


I also opened an issue on this, see
https://issues.apache.org/jira/browse/SPARK-5715

Any help will be greatly appreciated, because otherwise I'm stuck with Spark
1.1.0, as quadrupling runtime is not an option.

Sincerely,

Chris



2015-02-09T14:06:06.328+01:00  INFO  org.apache.spark.executor.Executor  Running task 9.0 in stage 18.0 (TID 300)  Executor task launch worker-18
2015-02-09T14:06:06.351+01:00  INFO  org.apache.spark.CacheManager  Partition rdd_35_9 not found, computing it  Executor task launch worker-18
2015-02-09T14:06:06.351+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Getting 10 non-empty blocks out of 10 blocks  Executor task launch worker-18
2015-02-09T14:06:06.351+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Started 0 remote fetches in 0 ms  Executor task launch worker-18
2015-02-09T14:06:07.396+01:00  INFO  org.apache.spark.storage.MemoryStore  ensureFreeSpace(2582904) called with curMem=300174944, maxMe...  Executor task launch worker-18
2015-02-09T14:06:07.397+01:00  INFO  org.apache.spark.storage.MemoryStore  Block rdd_35_9 stored as bytes in memory (estimated size 2.5...  Executor task launch worker-18
2015-02-09T14:06:07.398+01:00  INFO  org.apache.spark.storage.BlockManagerMaster  Updated info of block rdd_35_9  Executor task launch worker-18
2015-02-09T14:06:07.399+01:00  INFO  org.apache.spark.CacheManager  Partition rdd_38_9 not found, computing it  Executor task launch worker-18
2015-02-09T14:06:07.399+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Getting 10 non-empty blocks out of 10 blocks  Executor task launch worker-18
2015-02-09T14:06:07.400+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Started 0 remote fetches in 0 ms  Executor task launch worker-18
2015-02-09T14:06:07.567+01:00  INFO  org.apache.spark.storage.MemoryStore  ensureFreeSpace(944848) called with curMem=302757848, maxMem...  Executor task launch worker-18
2015-02-09T14:06:07.568+01:00  INFO  org.apache.spark.storage.MemoryStore  Block rdd_38_9 stored as values in memory (estimated size 92...  Executor task launch worker-18
2015-02-09T14:06:07.569+01:00  INFO  org.apache.spark.storage.BlockManagerMaster  Updated info of block rdd_38_9  Executor task launch worker-18
2015-02-09T14:06:07.573+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Getting 34 non-empty blocks out of 50 blocks  Executor task launch worker-18
2015-02-09T14:06:07.573+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Started 0 remote fetches in 1 ms  Executor task launch worker-18
2015-02-09T14:06:38.931+01:00  INFO  org.apache.spark.CacheManager  Partition rdd_41_9 not found, computing it  Executor task launch worker-18
2015-02-09T14:06:38.931+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Getting 3 non-empty blocks out of 10 blocks  Executor task launch worker-18
2015-02-09T14:06:38.931+01:00  INFO  org.apache.spark.storage.ShuffleBlockFetcherIterator  Started 0 remote fetches in 0 ms  Executor task launch worker-18
2015-02-09T14:06:38.945+01:00  INFO  org.apache.spark.storage.MemoryStore  ensureFreeSpace(0) called with curMem=307529127, maxMem=9261...  Executor task launch worker-18
2015-02-09T14:06:38.945+01:00  INFO  org.apache.spark.storage.MemoryStore  Block rdd_41_9 stored as bytes in memory (estimated size 0.0...  Executor task launch worker-18
2015-02-09T14:06:38.946+01:00  INFO  org.apache.spark.storage.BlockManagerMaster  Updated info of block rdd_41_9  Executor task launch worker-18

Re: Shuffle write increases in spark 1.2

2015-02-05 Thread Anubhav Srivastav
Hi Kevin,
We seem to be facing the same problem as well. Were you able to find
anything after that? The ticket does not seem to have progressed anywhere.

Regards,
Anubhav

On 5 January 2015 at 10:37, 정재부 itsjb.j...@samsung.com wrote:

  Sure, here is a ticket. https://issues.apache.org/jira/browse/SPARK-5081



 --- *Original Message* ---

 *Sender* : Josh Rosen rosenvi...@gmail.com

 *Date* : 2015-01-05 06:14 (GMT+09:00)

 *Title* : Re: Shuffle write increases in spark 1.2


 If you have a small reproduction for this issue, can you open a ticket at
 https://issues.apache.org/jira/browse/SPARK ?



 On December 29, 2014 at 7:10:02 PM, Kevin Jung (itsjb.j...@samsung.com)
 wrote:

  Hi all,
 The size of shuffle write showing in the spark web UI is much different when
 I execute the same spark job on the same input data (100GB) in both spark 1.1
 and spark 1.2.
 At the same sortBy stage, the size of shuffle write is 39.7GB in spark 1.1
 but 91.0GB in spark 1.2.
 I set the spark.shuffle.manager option to hash because its default value has
 changed, but spark 1.2 writes a larger file than spark 1.1.
 Can anyone tell me why this happened?

 Thanks
 Kevin





Re: Shuffle write increases in spark 1.2

2015-01-04 Thread 정재부


Sure, here is a ticket. https://issues.apache.org/jira/browse/SPARK-5081

--- Original Message ---
Sender : Josh Rosen rosenvi...@gmail.com
Date : 2015-01-05 06:14 (GMT+09:00)
Title : Re: Shuffle write increases in spark 1.2



If you have a small reproduction for this issue, can you open a ticket at https://issues.apache.org/jira/browse/SPARK ?


On December 29, 2014 at 7:10:02 PM, Kevin Jung (itsjb.j...@samsung.com) wrote:



Hi all,

The size of shuffle write showing in the spark web UI is much different when I
execute the same spark job on the same input data (100GB) in both spark 1.1 and
spark 1.2. At the same sortBy stage, the size of shuffle write is 39.7GB in
spark 1.1 but 91.0GB in spark 1.2. I set the spark.shuffle.manager option to
hash because its default value has changed, but spark 1.2 writes a larger file
than spark 1.1. Can anyone tell me why this happened?

Thanks
Kevin
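
A hedged sketch of the kind of job being compared here, purely to make the
setup concrete; the input path, key function, and application name are
placeholders, not the actual job:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: a job with a sortBy stage, run unchanged against the same
// ~100GB input on Spark 1.1 and 1.2 so the "Shuffle Write" column of that
// stage can be compared in the web UI.
object ShuffleWriteComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-write-comparison"))

    val lines  = sc.textFile("hdfs:///path/to/100gb/input")  // placeholder path
    val sorted = lines.sortBy(_.length)                      // sortBy triggers a shuffle
    println(s"records: ${sorted.count()}")

    sc.stop()
  }
}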