Re: Spark stalling during shuffle (maybe a memory issue)

2016-11-13 Thread bogdanbaraila
The issue was fixed for me by allocating just one core per executor. If I
use executors with more than one core, the issue appears again. I haven't yet
understood why this happens, but anyone hitting a similar issue can try this
workaround.
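
For anyone wanting to try the same workaround, a minimal sketch of the setting
looks like this (assuming your cluster manager honours spark.executor.cores;
the master URL and app name are placeholders, not from my actual job):

  import org.apache.spark.{SparkConf, SparkContext}

  // Workaround sketch: request a single core per executor
  val conf = new SparkConf()
    .setMaster("spark://master-host:7077")   // placeholder master URL
    .setAppName("one-core-per-executor")     // placeholder app name
    .set("spark.executor.cores", "1")
  val sc = new SparkContext(conf)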






Re: Spark stalling during shuffle (maybe a memory issue)

2016-09-14 Thread bogdanbaraila
Hello Jonathan

Did you find any working solution for your issue? If so, could you please
share it?

Thanks






Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Aaron Davidson
So the current stalling is simply sitting there with no log output? Have
you jstack'd an Executor to see where it may be hanging? Are you observing
memory or disk pressure ("df" and "df -i")?




Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread jonathan.keebler
Thanks for the suggestion, Andrew.  We have also implemented our solution
using reduceByKey, but we observe the same behavior.  For example, if we do the
following:

map1
groupByKey
map2
saveAsTextFile

Then the stalling will occur during the map1+groupByKey execution.

If we do

map1
reduceByKey
map2
saveAsTextFile

Then the reduceByKey finishes successfully, but the stalling will occur
during the map2+saveAsTextFile execution.
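
To make the shape concrete, here is a rough word-count stand-in for our actual
map1/map2 logic (the paths, master URL, and the aggregation itself are
placeholders, not our real job):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._   // pair-RDD implicits for groupByKey/reduceByKey

  val sc = new SparkContext(new SparkConf()
    .setMaster("spark://master-host:7077")              // placeholder master URL
    .setAppName("pipeline-shape-sketch"))
  val input = sc.textFile("hdfs:///tmp/input")          // placeholder path

  // Variant 1: stalls for us during map1 + groupByKey
  input.flatMap(_.split("\\s+")).map(w => (w, 1))       // map1
    .groupByKey()                                       // shuffle, no map-side combine
    .map { case (w, ones) => s"$w\t${ones.sum}" }       // map2
    .saveAsTextFile("hdfs:///tmp/out-grouped")          // placeholder path

  // Variant 2: reduceByKey finishes, stall moves to map2 + saveAsTextFile
  input.flatMap(_.split("\\s+")).map(w => (w, 1))       // map1
    .reduceByKey(_ + _)                                 // shuffle with map-side combine
    .map { case (w, n) => s"$w\t$n" }                   // map2
    .saveAsTextFile("hdfs:///tmp/out-reduced")          // placeholder path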







Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Andrew Ash
If the distribution of the keys in your groupByKey is skewed (some keys
appear far more often than others), you should consider modifying your job
to use reduceByKey instead wherever possible.
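
As a rough sketch of the substitution, with a per-key sum standing in for
whatever aggregation your job actually does (data, master URL, and app name
below are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._   // pair-RDD implicits

  val sc = new SparkContext(new SparkConf()
    .setMaster("local[2]")                 // placeholder master
    .setAppName("skew-sketch"))
  val pairs = sc.parallelize(Seq(("hot", 1), ("hot", 1), ("rare", 1)))  // stand-in data

  // groupByKey ships every value for a key to a single reducer before aggregating,
  // so a hot key concentrates the whole shuffle on one task
  val viaGroup = pairs.groupByKey().mapValues(_.sum)

  // reduceByKey combines values map-side first, so hot keys move far less data
  val viaReduce = pairs.reduceByKey(_ + _)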
On May 20, 2014 12:53 PM, "Jon Keebler"  wrote:

> So we upped the spark.akka.frameSize value to 128 MB and still observed
> the same behavior.  It's happening not necessarily when data is being sent
> back to the driver, but when there is an inter-cluster shuffle, for example
> during a groupByKey.
>
> Is it possible we should focus on tuning these parameters: 
> spark.storage.memoryFraction
> & spark.shuffle.memoryFraction ??
>
>
> On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson wrote:
>
>> This is very likely because the serialized map output locations buffer
>> exceeds the akka frame size. Please try setting "spark.akka.frameSize"
>> (default 10 MB) to some higher number, like 64 or 128.
>>
>> In the newest version of Spark, this would throw a better error, for what
>> it's worth.
>>
>>
>>
>> On Mon, May 19, 2014 at 8:39 PM, jonathan.keebler 
>> wrote:
>>
>>> Has anyone observed Spark worker threads stalling during a shuffle phase
>>> with
>>> the following message (one per worker host) being echoed to the terminal
>>> on
>>> the driver thread?
>>>
>>> INFO spark.MapOutputTrackerActor: Asked to send map output locations for
>>> shuffle 0 to [worker host]...
>>>
>>>
>>> At this point Spark-related activity on the hadoop cluster completely
>>> halts
>>> .. there's no network activity, disk IO or CPU activity, and individual
>>> tasks are not completing and the job just sits in this state.  At this
>>> point
>>> we just kill the job & a re-start of the Spark server service is
>>> required.
>>>
>>> Using identical jobs we were able to by-pass this halt point by
>>> increasing
>>> available heap memory to the workers, but it's odd we don't get an
>>> out-of-memory error or any error at all.  Upping the memory available
>>> isn't
>>> a very satisfying answer to what may be going on :)
>>>
>>> We're running Spark 0.9.0 on CDH5.0 in stand-alone mode.
>>>
>>> Thanks for any help or ideas you may have!
>>>
>>> Cheers,
>>> Jonathan
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>


Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Jon Keebler
So we upped the spark.akka.frameSize value to 128 MB and still observed the
same behavior.  It's happening not necessarily when data is being sent back
to the driver, but when there is a shuffle across the cluster, for example
during a groupByKey.

Is it possible we should focus on tuning these parameters instead:
spark.storage.memoryFraction and spark.shuffle.memoryFraction?
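
If we did go that route, I imagine the change would look roughly like this
(the fractions are illustrative guesses, not recommendations):

  import org.apache.spark.SparkConf

  // Illustrative only: shift some memory from the RDD cache toward shuffle buffers
  val conf = new SparkConf()
    .set("spark.storage.memoryFraction", "0.4")   // default is roughly 0.6
    .set("spark.shuffle.memoryFraction", "0.4")   // default is roughly 0.3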




Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-19 Thread jonathan.keebler
Thanks for the response, Aaron!  We'll give it a try tomorrow.







Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-19 Thread Aaron Davidson
This is very likely because the serialized map output locations buffer
exceeds the akka frame size. Please try setting "spark.akka.frameSize"
(default 10 MB) to some higher number, like 64 or 128.

In the newest version of Spark, this would throw a better error, for what
it's worth.
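
For example, something like this when building the context (the value is
interpreted as a whole number of MB; the master URL and app name are
placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // Raise the Akka frame size so the serialized map output statuses fit (value in MB)
  val conf = new SparkConf()
    .setMaster("spark://master-host:7077")   // placeholder master URL
    .setAppName("frame-size-example")        // placeholder app name
    .set("spark.akka.frameSize", "128")
  val sc = new SparkContext(conf)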





Spark stalling during shuffle (maybe a memory issue)

2014-05-19 Thread jonathan.keebler
Has anyone observed Spark worker threads stalling during a shuffle phase with
the following message (one per worker host) being echoed to the terminal on
the driver thread?

INFO spark.MapOutputTrackerActor: Asked to send map output locations for
shuffle 0 to [worker host]...


At this point, Spark-related activity on the Hadoop cluster completely halts:
there is no network activity, disk I/O, or CPU activity, individual tasks are
not completing, and the job just sits in this state.  At that point we kill
the job, and a restart of the Spark server service is required.

Using identical jobs we were able to bypass this halt point by increasing the
heap memory available to the workers, but it's odd that we don't get an
out-of-memory error, or any error at all.  Upping the available memory isn't
a very satisfying answer to what may be going on :)
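
For concreteness, the per-application way to raise executor heap is
spark.executor.memory; whether that or the worker-level SPARK_WORKER_MEMORY is
the exact knob, the idea is the same. The figure and master URL below are just
placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  // Example only: give each executor a larger heap
  val conf = new SparkConf()
    .setMaster("spark://master-host:7077")   // placeholder standalone master URL
    .setAppName("shuffle-job")               // placeholder app name
    .set("spark.executor.memory", "8g")      // placeholder amount
  val sc = new SparkContext(conf)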

We're running Spark 0.9.0 on CDH 5.0 in standalone mode.

Thanks for any help or ideas you may have!

Cheers,
Jonathan



