Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread XianXing Zhang
Is there any update on this thread? Has anyone encountered and solved a
similar problem before?

Any pointers will be greatly appreciated!

Best,
XianXing



Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread Eugen Cepoi
Are you using YARN?
If yes, increase the YARN memory overhead option. YARN is probably killing
your executors for exceeding their containers' memory limits.
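
Something like this should do it (an untested sketch; the property name below
is the Spark 1.x on YARN one, and the value is just an example to tune for
your job):

    import org.apache.spark.{SparkConf, SparkContext}

    // Give each executor container more off-heap headroom so YARN does not
    // kill it for exceeding its memory limit. The value is in MB.
    val conf = new SparkConf()
      .setAppName("shuffle-heavy-job") // placeholder name
      .set("spark.yarn.executor.memoryOverhead", "2048")
    val sc = new SparkContext(conf)

    // Or equivalently at submit time:
    //   spark-submit --conf spark.yarn.executor.memoryOverhead=2048 ...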

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread XianXing Zhang
Yes, we deployed Spark on top of YARN.

What you suggested is very helpful: I increased the YARN memory overhead
option and it resolved the failures in most cases. (Sometimes there are still
failures when the amount of data to be shuffled is large, but I expect that
continuing to increase the overhead will fix those too, at the expense of
consuming more memory.)
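
For anyone who finds this later: as I understand it, YARN sizes each executor
container at roughly the executor memory plus this overhead, so raising the
overhead directly raises the per-executor footprint. A quick back-of-the-envelope
check, with example numbers rather than our actual settings:

    // Example numbers only: how the overhead adds to what YARN allocates.
    val executorMemoryMb = 8 * 1024  // spark.executor.memory = 8g
    val memoryOverheadMb = 2048      // spark.yarn.executor.memoryOverhead = 2048
    val containerMb = executorMemoryMb + memoryOverheadMb
    println(s"YARN allocates roughly $containerMb MB per executor container")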

Thank you!

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-16 Thread Jia Yu
Hi Peng,

I got exactly the same error! My shuffle data is also very large. Have you
figured out a way to solve it?

Thanks,
Jia

What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-04-24 Thread Peng Cheng
I'm deploying a Spark data processing job on an EC2 cluster. The job is small
for the cluster (16 cores with 120 GB of RAM in total): the largest RDD has only
76k+ rows, but it is heavily skewed in the middle (and thus requires
repartitioning), and each row holds around 100 KB of data after serialization.
The job always gets stuck at the repartitioning step and constantly hits the
following errors and retries:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle

org.apache.spark.shuffle.FetchFailedException: Error in opening
FileSegmentManagedBuffer

org.apache.spark.shuffle.FetchFailedException:
java.io.FileNotFoundException: /tmp/spark-...

I've tried to identify the problem, but both the memory and disk consumption
of the machines throwing these errors are below 50%. I've also tried different
configurations, including:

let driver/executor memory use 60% of total memory;
let Netty prioritize the JVM shuffle buffer;
increase the shuffle streaming buffer to 128m;
use KryoSerializer and max out all its buffers;
increase the shuffle memoryFraction to 0.4.
But none of them works. The small job always triggers the same series of
errors and maxes out the retries (up to 1000 times). How can I troubleshoot
this kind of situation?
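
For reference, this is roughly what the configuration above looked like in
code. It is a reconstruction using Spark 1.x property names; mapping my
descriptions to the exact keys is a best guess, and a couple of keys were
renamed between 1.3 and 1.4:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("repartition-job") // placeholder name
      // driver/executor memory (~60% of node memory) was passed through
      // spark-submit --driver-memory / --executor-memory, not set here
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "128m")     // "max out all buffers"
      .set("spark.shuffle.memoryFraction", "0.4")         // up from the 0.2 default
      .set("spark.shuffle.io.preferDirectBufs", "false")  // my guess for "let Netty prioritize the JVM shuffle buffer"
      .set("spark.reducer.maxSizeInFlight", "128m")       // my guess for the "shuffle streaming buffer"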

Thanks a lot if you have any clue.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-are-the-likely-causes-of-org-apache-spark-shuffle-MetadataFetchFailedException-Missing-an-outpu-tp22646.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org