It could be that heavy usage of an executor's machine prevents the executor from heartbeating to nimbus, hence it appears "dead" to nimbus even though it is still working (nimbus marks an executor "not alive" when it has not seen a heartbeat from it for too long). I think we saw something like this at some point during our PoC development, and it was fixed by allocating more memory to our workers - too little memory was causing the workers to incur heavy GC cycles, and the long pauses delayed the heartbeats.
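If that turns out to be the cause, the relevant knobs are in storm.yaml. A minimal sketch (the values below are illustrative for our setup, not recommendations; check the defaults for your Storm version):

```yaml
# Give each worker JVM more heap so GC pauses stay short
worker.childopts: "-Xmx2048m"

# How long nimbus waits without a heartbeat before declaring an executor not alive
nimbus.task.timeout.secs: 30

# How long a supervisor waits before restarting an unresponsive worker
supervisor.worker.timeout.secs: 30
```

Raising the timeouts only hides the symptom; raising the worker heap (or otherwise reducing load on the machine) addresses the actual GC pressure.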
Regards,
Javier

On Fri, Jun 26, 2015 at 3:53 PM, Nick R. Katsipoulakis <[email protected]> wrote:

> Hello,
>
> I have been running a sample topology and I can see in the nimbus.log
> messages like the following:
>
> 2015-06-26T19:46:35.556+0000 b.s.d.nimbus [INFO] Executor tpch-q5-top-1-1435347835:[5 5] not alive
> 2015-06-26T19:46:35.557+0000 b.s.d.nimbus [INFO] Executor tpch-q5-top-1-1435347835:[13 13] not alive
> 2015-06-26T19:46:35.557+0000 b.s.d.nimbus [INFO] Executor tpch-q5-top-1-1435347835:[21 21] not alive
> 2015-06-26T19:46:35.557+0000 b.s.d.nimbus [INFO] Executor tpch-q5-top-1-1435347835:[29 29] not alive
>
> So, my question is: when does nimbus come to the above decision? By the
> way, none of the above machines has crashed, nor is there an exception in
> the code. The only problem is that the resource utilization on those
> machines reaches high levels. Is that a case where nimbus declares an
> executor as "not alive"?
>
> Thanks,
> Nick

--
Javier González Nicolini
