If you look in the Spark UI, do you see any garbage collection happening?
My best guess is that some of the executors are going into GC and they are
timing out. You can manually increase the timeout by setting the Spark conf:

spark.storage.blockManagerSlaveTimeoutMs

to a higher value. In your case it is currently set to 45000 ms, i.e. 45 seconds.
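For example, a minimal sketch of bumping it when constructing the context (the app name is just a placeholder, and 120000 ms is an arbitrary illustrative value, not a recommendation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Raise the block manager heartbeat timeout from the 45s default
// to 120s to ride out long GC pauses on the executors.
val conf = new SparkConf()
  .setAppName("ALSJob") // placeholder name
  .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")
val sc = new SparkContext(conf)
```

Raising the timeout only masks the symptom, though; if GC pauses are the root cause, tuning executor memory is the longer-term fix.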




On Fri, Apr 4, 2014 at 5:52 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> Hi,
>
> In my ALS runs I am noticing messages that complain about heart beats:
>
> 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(17, machine1, 53419, 0) with no recent heart beats: 48476ms
> exceeds 45000ms
> 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(12, machine2, 60714, 0) with no recent heart beats: 45328ms
> exceeds 45000ms
> 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager
> BlockManagerId(19, machine3, 39496, 0) with no recent heart beats: 53259ms
> exceeds 45000ms
>
> Is this some issue with the underlying JVM that Akka runs on? Can I
> increase the heartbeat timeout somehow to resolve these messages?
>
> Any more insight about the possible cause for the heartbeat will be
> helpful...
>
> I tried to re-run the job but it ultimately failed...
>
> Also I am noticing negative numbers in the stage duration:
>
> Any insights into the problem will be very helpful...
>
> Thanks.
> Deb
>
