Hi,

This issue is on prod, running Marathon 0.6; we are currently testing 0.7.5 on Dev, but I have no results on this behavior yet. I saw your post when searching the Marathon group, but didn't think it would apply to my case since I don't see the NPE. The warning about the version mismatch between Mesos 0.20 and Marathon 0.6 is indeed important.
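In the meantime, to catch the discrepancy on prod when it happens, I'm planning to snapshot what Marathon and Mesos each think is running for the app around the time reconciliation kicks in. Roughly something like the following (hostnames are placeholders for our setup, and I'm assuming the v2 endpoints are already available on 0.6):

    # Marathon's view of the app: instance count and running tasks
    curl -s http://marathon-host:8080/v2/apps/core-compute-jobs-actualvalues-st/tasks

    # Mesos master's view of the tasks it believes are running
    curl -s http://mesos-master:5050/master/state.json

If the two disagree right before the "Need to scale ... from 0 up to 1 instances" line in the logs below, that would at least tell us on which side the bad count originates.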
Thanks,
Gerard.

On Thu, Jan 8, 2015 at 3:50 PM, Shijun Kong <[email protected]> wrote:

> Hi Gerard,
>
> What version of Marathon are you running? I ran into similar behavior
> some time back. My problem seems to be a compatibility issue between
> Marathon and Mesos: https://github.com/mesosphere/marathon/issues/595
>
> Regards,
> Shijun
>
> On Jan 8, 2015, at 9:28 AM, Gerard Maas <[email protected]> wrote:
>
> Hi again,
>
> I finally found a clue in this issue. It looks like Marathon is the one
> behind the job killing spree. I still don't know *why*, but it looks like
> the task reconciliation of Marathon finds a discrepancy with Mesos and
> decides to kill the instance.
>
> INFO|2015-01-08 10:05:35,491|pool-1-thread-1173|MarathonScheduler.scala:299|Requesting task reconciliation with the Mesos master
> INFO|2015-01-08 10:05:35,493|Thread-188479|MarathonScheduler.scala:138|Received status update for task core-compute-jobs-actualvalues-st.be0e36cc-9714-11e4-9e7c-3e6ce77341aa: TASK_RUNNING (Reconciliation: Latest task state)
> INFO|2015-01-08 10:05:35,494|pool-1-thread-1171|MarathonScheduler.scala:338|Need to scale core-compute-jobs-actualvalues-st from 0 up to 1 instances
>
> #### According to Mesos, at this point there's already an instance of this
> job running, so it's actually scaling from 1 to 2, not from 0 to 1 as the
> log says ####
>
> INFO|2015-01-08 10:05:35,878|Thread-188483|TaskBuilder.scala:38|No matching offer for core-compute-jobs-actualvalues-st (need 1.0 CPUs, 1000.0 mem, 0.0 disk, 1 ports)
> (... offers ...)
> ...
>
> #### Killing ####
> INFO|2015-01-08 10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:353|Scaling core-compute-jobs-actualvalues-st from 2 down to 1 instances
> INFO|2015-01-08 10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:357|Killing tasks: Set(core-compute-jobs-actualvalues-st.e458118d-971d-11e4-9e7c-3e6ce77341aa)
>
> Any ideas why this happens and how to fix it?
>
> -kr, Gerard.
>
> On Tue, Dec 2, 2014 at 1:15 AM, Gerard Maas <[email protected]> wrote:
>
>> Thanks! I'll try that and report back once I have some interesting
>> evidence.
>>
>> -kr, Gerard.
>>
>> On Tue, Dec 2, 2014 at 12:54 AM, Tim Chen <[email protected]> wrote:
>>
>>> Hi Gerard,
>>>
>>> I see. What would help diagnose your problem is if you can enable
>>> verbose logging (GLOG_v=1) before running the slave, and share the
>>> slave logs when it happens.
>>>
>>> Tim
>>>
>>> On Mon, Dec 1, 2014 at 3:23 PM, Gerard Maas <[email protected]> wrote:
>>>
>>>> Hi Tim,
>>>>
>>>> It's quite hard to reproduce. It just "happens"... sometimes worse
>>>> than others, mostly when the system is under load. We notice b/c the
>>>> framework starts 'jumping' from one slave to another, but so far we
>>>> have no clue why this is happening.
>>>>
>>>> What I'm currently looking for is the set of conditions that could
>>>> cause Mesos to kill the executor (not the task), to validate whether
>>>> any of those conditions apply to our case and to narrow the problem
>>>> down to some reproducible subset.
>>>>
>>>> -kr, Gerard.
>>>>
>>>> On Mon, Dec 1, 2014 at 11:57 PM, Tim Chen <[email protected]> wrote:
>>>>
>>>>> There are different reasons, but most commonly it's when the
>>>>> framework asks to kill the task.
>>>>>
>>>>> Can you provide some easy repro steps/artifacts? I've been working
>>>>> on Spark on Mesos these days and can help try this out.
>>>>>
>>>>> Tim
>>>>>
>>>>> On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Sorry if this has been discussed before. I'm new to the list.
>>>>>>
>>>>>> We are currently running our Spark + Spark Streaming jobs on Mesos,
>>>>>> submitting our jobs through Marathon.
>>>>>>
>>>>>> We see with some regularity that the Spark Streaming driver gets
>>>>>> killed by Mesos and then restarted on some other node by Marathon.
>>>>>>
>>>>>> I've no clue why Mesos is killing the driver, and looking at both
>>>>>> the Mesos and Spark logs didn't make me any wiser.
>>>>>>
>>>>>> In the Spark Streaming driver logs, I find this entry of Mesos
>>>>>> "signing off" my driver:
>>>>>>
>>>>>>> Shutting down
>>>>>>> Sending SIGTERM to process tree at pid 17845
>>>>>>> Killing the following process trees:
>>>>>>> [
>>>>>>> -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
>>>>>>>  \-+- 17846 sh ./run-mesos.sh application-ts.conf
>>>>>>>   \--- 17847 java -cp core-compute-job.jar -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
>>>>>>> ]
>>>>>>> Command terminated with signal Terminated (pid: 17845)
>>>>>>
>>>>>> What would be the reasons for Mesos to kill an executor?
>>>>>> Has anybody seen something similar? Any hints on where to start digging?
>>>>>>
>>>>>> -kr, Gerard.
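PS: One thing I noticed in the process tree above is that the driver JVM ends up two shells below the pid that Mesos signals. If we exec the JVM from run-mesos.sh, the wrapper shells drop out of the tree and the SIGTERM goes straight to the driver, which should make the shutdown a bit more predictable. A rough sketch of what I mean (this is not our actual script; the port is hard-coded here only for illustration):

    #!/bin/sh
    # Sketch only -- not the real run-mesos.sh.
    # 'exec' replaces this wrapper shell with the JVM, so the SIGTERM that
    # Mesos sends to the process tree is delivered straight to the driver.
    CONF="$1"
    exec java -cp core-compute-job.jar -Dconfig.file="$CONF" com.compute.job.FooJob 31326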
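I'll also take Tim's suggestion from earlier in the thread and restart the slave on that box with verbose glog output before trying to reproduce, roughly like this (the master URL and log dir are placeholders for our environment), and then grab the slave log around the next kill:

    # Placeholder flags -- adjust to the actual slave configuration.
    GLOG_v=1 mesos-slave \
        --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
        --log_dir=/var/log/mesos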

