Increasing spark.worker.timeout will not help you. As the documentation describes, this value means that the master checks every 60 seconds whether the workers are still alive. But the same value also determines how often the workers send HEARTBEAT messages to notify the master of their liveness; in particular, under the default configuration the workers send such a message every 60 / 4 = 15 seconds. Increasing the value only means it takes longer (600 seconds in your case) to detect that something went wrong in the first place.
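For reference, this is a setting for the standalone master and worker daemons, e.g. via spark-env.sh on those machines (shown here with the default value, purely for illustration):

    export SPARK_JAVA_OPTS="-Dspark.worker.timeout=60"

At 60 the workers heartbeat every 15 seconds; at your 600 they would heartbeat only every 150 seconds, so a genuinely dead worker takes even longer to notice.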
spark.storage.blockManagerSlaveTimeoutMs is similar. It controls how frequently the HEARTBEAT messages are sent and how frequently they are expected to arrive. Under the default parameters, the driver checks every 45 seconds whether the block managers (living on the executors) are still alive, and each block manager sends a HEARTBEAT to the driver every 15 seconds.

If anything, increasing spark.akka.timeout is closest to what you want. It gives more leeway for the communication between the driver and the executors, so that if the executors take longer than usual to respond, the currently running task does not simply give up after 100 seconds (the default).

However, it seems that the root cause of the problem is your application's use of memory. Are you caching a lot of RDDs? You can find out more about what went wrong exactly by going through the worker logs on <master_url>:8080. The timeout exception you ran into is usually a side effect of a deeper, underlying exception.
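As you note below, persisting the factors to disk at each iteration is the right lever if GC is the root cause. A minimal Scala sketch of that pattern, assuming an iterative ALS-style job; the factor initialization, step, and numIterations here are made-up stand-ins for your own code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("factor-sketch"))
    val numIterations = 10

    // Stand-ins for the real factor initialization and update step.
    var factors: RDD[(Int, Array[Double])] =
      sc.parallelize(0 until 1000).map(i => (i, Array.fill(50)(1.0)))
    def step(rdd: RDD[(Int, Array[Double])]): RDD[(Int, Array[Double])] =
      rdd.mapValues(v => v.map(_ * 0.9))

    for (i <- 1 to numIterations) {
      val updated = step(factors)
      updated.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions to disk instead of pinning them on-heap
      updated.count()                                // materialize the new factors before dropping the old ones
      factors.unpersist()                            // release the previous iteration's blocks
      factors = updated
    }

Unpersisting the previous iteration's RDD is what keeps old copies from accumulating in the executor heaps and triggering the long GC pauses that delay the heartbeats.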
On Sat, Apr 5, 2014 at 9:33 AM, Debasish Das <debasish.da...@gmail.com> wrote:

> This does not seem to help:
>
> export SPARK_JAVA_OPTS="-Dspark.local.dir=/app/spark/tmp
> -Dspark.worker.timeout=600 -Dspark.akka.timeout=200
> -Dspark.storage.blockManagerSlaveTimeoutMs=300000"
>
> Getting the message leads to a GC failure, followed by the master declaring the worker dead!
>
> This is related to GC... Persisting the factors to disk at each iteration will resolve this issue, with a runtime cost of course...
>
> I also have another issue... I run with executor memory set to 24g, but I see 18.4 GB in the executor UI... is that expected?
>
> On Sat, Apr 5, 2014 at 8:16 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>
>> From the documentation, this is what I understood:
>>
>> 1. spark.worker.timeout: Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats. Default: 60.
>>
>> I increased it to 600.
>>
>> It was pointed out before that if there is GC overload and the worker takes time to respond, the master thinks the worker JVM died. I have seen this issue several times as well.
>>
>> 2. spark.akka.timeout: Communication timeout between Spark nodes, in seconds. Default: 100.
>>
>> I increased it to 200 as was suggested before, but I don't understand when the communication timeout is triggered. Some explanation of this setting would be very helpful.
>>
>> 3. spark.storage.blockManagerSlaveTimeoutMs: I could not find documentation, but as Patrick said, the 45000 number comes from this. How is this related to spark.worker.timeout?
>>
>> I bumped it up to 300s, but the JVM only goes into GC if there is pressure on the JVM, right... Maybe I need to do a YourKit run to understand the memory usage in more detail. Any suggestions on how to set up YourKit for memory analysis?
>>
>> I set it using the following options in spark-env.sh:
>>
>> export SPARK_JAVA_OPTS="-Dspark.local.dir=/app/spark/tmp
>> -Dspark.storage.blockManagerSlaveTimeoutMs=300000
>> -Dspark.worker.timeout=600 -Dspark.akka.timeout=200"
>>
>> Is this the correct way to specify spark.storage.blockManagerSlaveTimeoutMs?
>>
>> On Sat, Apr 5, 2014 at 4:00 AM, azurecoder <rich...@elastacloud.com> wrote:
>>
>>> Interested in a resolution to this. I'm building a large triangular matrix, so I'm doing something similar to ALS: lots of work on the worker nodes, and it keeps timing out.
>>>
>>> I tried a few updates to Akka frame sizes, timeouts, and the block manager, but was unable to complete. Will try the blockmanagerslaves property now and let you know the effect. That property doesn't appear to be documented on the site, though.
>>>
>>> Cheers!
>>>
>>> Richard
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Heartbeat-exceeds-tp3798p3809.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.