Perhaps you could add explicit GC logging flags to the worker childopts, so
you can see whether the JVM running the worker that gets disconnected is
grinding in GC. I suggested it because you mentioned that the machine is
under heavy load.
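
For example, something like this in storm.yaml (the heap size and log path
are just illustrative, adjust to your setup; if I remember correctly Storm
substitutes %ID% with the worker port, so each worker gets its own log):

```yaml
# Illustrative worker.childopts with classic HotSpot GC logging flags
worker.childopts: "-Xmx4g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/storm/gc-worker-%ID%.log"
```

If the worker is stalling in long full GCs, you should see back-to-back
collections with little memory reclaimed in those logs around the time
nimbus marks the executor as not alive.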

Another thing that sometimes caused behavior like this was the machine
coming under heavy load from outside processes, since we were testing on a
shared machine. Could that be your case?

Regards,
JG

On Sun, Jun 28, 2015 at 11:46 AM, Nick R. Katsipoulakis <
[email protected]> wrote:

> Javier thank you for your response.
>
> So, do you suggest that I raise "worker.childopts" to more memory than I
> have now? Currently I have it set to 4 GB, and some of the executors do
> not use all of it (I monitor the JVM memory usage of each executor from
> the Bolt code). But I guess I can try it and see if it works.
>
> Thank you again.
>
> Regards,
> Nick
>
> 2015-06-28 11:32 GMT-04:00 Javier Gonzalez <[email protected]>:
>
>> It could be that heavy usage of an executor's machine prevents the
>> executor from communicating with nimbus, so it appears "dead" to nimbus
>> even though it's still working. I think we saw something like this at
>> some point during our PoC development, and it was fixed by allocating more
>> memory to our workers - too little memory was causing the workers to incur
>> heavy GC cycles.
>>
>> Regards,
>> Javier
>>
>> On Fri, Jun 26, 2015 at 3:53 PM, Nick R. Katsipoulakis <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> I have been running a sample topology and I can see on the nimbus.log
>>> messages like the following:
>>>
>>> 2015-06-26T19:46:35.556+0000 b.s.d.nimbus [INFO] Executor
>>> tpch-q5-top-1-1435347835:[5 5] not alive
>>> 2015-06-26T19:46:35.557+0000 b.s.d.nimbus [INFO] Executor
>>> tpch-q5-top-1-1435347835:[13 13] not alive
>>> 2015-06-26T19:46:35.557+0000 b.s.d.nimbus [INFO] Executor
>>> tpch-q5-top-1-1435347835:[21 21] not alive
>>> 2015-06-26T19:46:35.557+0000 b.s.d.nimbus [INFO] Executor
>>> tpch-q5-top-1-1435347835:[29 29] not alive
>>>
>>> So, my question is: when does nimbus come to the above decision? By the
>>> way, none of the above machines has crashed, nor is there an exception in
>>> the code. The only problem is that resource utilization on those machines
>>> reaches high levels. Is that a case where nimbus declares an executor
>>> "not alive"?
>>>
>>> Thanks,
>>> Nick
>>>
>>
>>
>>
>> --
>> Javier González Nicolini
>>
>
>
>
> --
> Nikolaos Romanos Katsipoulakis,
> University of Pittsburgh, PhD candidate
>



-- 
Javier González Nicolini
