Hi again,

I finally found a clue on this issue. It looks like Marathon is the one
behind the job killing spree. I still don't know *why*, but it looks like
Marathon's task reconciliation finds a discrepancy with Mesos and
decides to kill the instance.

INFO|2015-01-08 10:05:35,491|pool-1-thread-1173|MarathonScheduler.scala:299|Requesting task reconciliation with the Mesos master
INFO|2015-01-08 10:05:35,493|Thread-188479|MarathonScheduler.scala:138|Received status update for task core-compute-jobs-actualvalues-st.be0e36cc-9714-11e4-9e7c-3e6ce77341aa: TASK_RUNNING (Reconciliation: Latest task state)
INFO|2015-01-08 10:05:35,494|pool-1-thread-1171|MarathonScheduler.scala:338|Need to scale core-compute-jobs-actualvalues-st from 0 up to 1 instances

#### According to Mesos, an instance of this job is already running at this
point, so Marathon is actually scaling from 1 to 2, not from 0 to 1 as the
log says ####

INFO|2015-01-08 10:05:35,878|Thread-188483|TaskBuilder.scala:38|No matching offer for core-compute-jobs-actualvalues-st (need 1.0 CPUs, 1000.0 mem, 0.0 disk, 1 ports)
(... offers ...)
...
#### Killing ####
INFO|2015-01-08 10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:353|Scaling core-compute-jobs-actualvalues-st from 2 down to 1 instances
INFO|2015-01-08 10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:357|Killing tasks: Set(core-compute-jobs-actualvalues-st.e458118d-971d-11e4-9e7c-3e6ce77341aa)
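
In case it helps others check their own logs for the same pattern, here is
a minimal Python sketch. It is not part of Marathon, and the function names
are made up for illustration. It assumes the pipe-delimited record layout
seen in the excerpts above and that task IDs have the form <app>.<uuid>.
It re-joins lines that were wrapped in transit and flags any "scale from 0
up" decision for an app that had already reported TASK_RUNNING.

```python
import re
from collections import namedtuple

# Assumed layout: LEVEL|timestamp|thread|source:line|message
Record = namedtuple("Record", "level timestamp thread source message")

# A record starts with a log level followed by '|'. Any other non-empty
# line is treated as a wrapped continuation of the previous record.
RECORD_START = re.compile(r"^\s*(INFO|WARN|ERROR|DEBUG)\|")
SCALE = re.compile(
    r"(?:Need to scale|Scaling) (\S+) from (\d+) (up|down) to (\d+) instances"
)
# Assumption: a task ID is "<app>.<uuid>", as in the excerpts above.
RUNNING = re.compile(
    r"Received status update for task (\S+?)\.[0-9a-f-]+: TASK_RUNNING"
)


def parse_records(text):
    """Re-join wrapped lines, then split each record on '|'."""
    joined = []
    for line in text.splitlines():
        if RECORD_START.match(line):
            joined.append(line.strip())
        elif joined and line.strip():
            joined[-1] += " " + line.strip()
    records = []
    for raw in joined:
        parts = raw.split("|", 4)  # the message itself may contain '|'
        if len(parts) == 5:
            records.append(Record(*parts))
    return records


def suspicious_scale_ups(records):
    """Return (timestamp, app) for scale-ups from 0 of apps seen running."""
    running = set()
    findings = []
    for rec in records:
        m = RUNNING.search(rec.message)
        if m:
            running.add(m.group(1))
            continue
        m = SCALE.search(rec.message)
        if m and m.group(3) == "up" and m.group(2) == "0" and m.group(1) in running:
            findings.append((rec.timestamp, m.group(1)))
    return findings
```

Calling suspicious_scale_ups(parse_records(log_text)) on the scheduler log
text returns the timestamps and app names worth a closer look. It only
detects the symptom; it says nothing about why Marathon's count is off.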

Any ideas why this happens and how to fix it?

-kr, Gerard.


On Tue, Dec 2, 2014 at 1:15 AM, Gerard Maas <[email protected]> wrote:

> Thanks! I'll try that and report back once I have some interesting
> evidence.
>
> -kr, Gerard.
>
> On Tue, Dec 2, 2014 at 12:54 AM, Tim Chen <[email protected]> wrote:
>
>> Hi Gerard,
>>
>> I see. It would help diagnose your problem if you could enable verbose
>> logging (GLOG_v=1) before starting the slave, and share the slave logs
>> when it happens.
>>
>> Tim
>>
>> On Mon, Dec 1, 2014 at 3:23 PM, Gerard Maas <[email protected]>
>> wrote:
>>
>>> Hi Tim,
>>>
>>> It's quite hard to reproduce. It just "happens"... sometimes worse than
>>> others, mostly when the system is under load. We notice because the
>>> framework starts 'jumping' from one slave to another, but so far we have
>>> no clue why this is happening.
>>>
>>> What I'm currently looking for is some potential conditions that could
>>> cause Mesos to kill the executor (not the task) to validate whether any of
>>> those conditions apply to our case and try to narrow down the problem to
>>> some reproducible subset.
>>>
>>> -kr, Gerard.
>>>
>>>
>>> On Mon, Dec 1, 2014 at 11:57 PM, Tim Chen <[email protected]> wrote:
>>>
>>>> There are different reasons, but the most common one is that the
>>>> framework asks to kill the task.
>>>>
>>>> Can you provide some easy repro steps/artifacts? I've been working on
>>>> Spark on Mesos these days and can help try this out.
>>>>
>>>> Tim
>>>>
>>>> On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Sorry if this has been discussed before. I'm new to the list.
>>>>>
>>>>> We are currently running our Spark + Spark Streaming jobs on Mesos,
>>>>> submitting our jobs through Marathon.
>>>>>
>>>>> We see with some regularity that the Spark Streaming driver gets
>>>>> killed by Mesos and then restarted on some other node by Marathon.
>>>>>
>>>>> I've no clue why Mesos is killing the driver and looking at both the
>>>>> Mesos and Spark logs didn't make me any wiser.
>>>>>
>>>>> On the Spark Streaming driver logs, I find this entry of Mesos
>>>>> "signing off" my driver:
>>>>>
>>>>>> Shutting down
>>>>>> Sending SIGTERM to process tree at pid 17845
>>>>>> Killing the following process trees:
>>>>>> [
>>>>>> -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
>>>>>>  \-+- 17846 sh ./run-mesos.sh application-ts.conf
>>>>>>    \--- 17847 java -cp core-compute-job.jar
>>>>>> -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
>>>>>> ]
>>>>>> Command terminated with signal Terminated (pid: 17845)
>>>>>
>>>>>
>>>>> What would be the reasons for Mesos to kill an executor?
>>>>> Has anybody seen something similar? Any hints on where to start
>>>>> digging?
>>>>>
>>>>> -kr, Gerard.
