Hi,

This issue is on prod, running Marathon 0.6; we are currently testing 0.7.5 on Dev, but I have no results on this behavior yet. I saw your post when searching the Marathon group, but didn't think it would apply to my case since I don't see the NPE. The warning about the version mismatch between Mesos 0.20 and Marathon 0.6 is indeed important.
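In the meantime, to catch the discrepancy on prod when it happens, I'm planning to snapshot what Marathon and Mesos each think is running for the app around the time reconciliation kicks in. Roughly something like the following (hostnames are placeholders for our setup, and I'm assuming the v2 endpoints are already available on 0.6):

    # Marathon's view of the app: instance count and running tasks
    curl -s http://marathon-host:8080/v2/apps/core-compute-jobs-actualvalues-st/tasks

    # Mesos master's view of the tasks it believes are running
    curl -s http://mesos-master:5050/master/state.json

If the two disagree right before the "Need to scale ... from 0 up to 1 instances" line in the logs below, that would at least tell us on which side the bad count originates.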
Thanks,
Gerard.

On Thu, Jan 8, 2015 at 3:50 PM, Shijun Kong <[email protected]> wrote:

> Hi Gerard,
>
> What version of Marathon are you running? I ran into similar behavior
> some time back. My problem seems to be a compatibility issue between
> Marathon and Mesos: https://github.com/mesosphere/marathon/issues/595
>
> Regards,
> Shijun
>
> On Jan 8, 2015, at 9:28 AM, Gerard Maas <[email protected]> wrote:
>
> Hi again,
>
> I finally found a clue in this issue. It looks like Marathon is the one
> behind the job killing spree. I still don't know *why*, but it looks like
> the task reconciliation of Marathon finds a discrepancy with Mesos and
> decides to kill the instance.
>
> INFO|2015-01-08 10:05:35,491|pool-1-thread-1173|MarathonScheduler.scala:299|Requesting task reconciliation with the Mesos master
> INFO|2015-01-08 10:05:35,493|Thread-188479|MarathonScheduler.scala:138|Received status update for task core-compute-jobs-actualvalues-st.be0e36cc-9714-11e4-9e7c-3e6ce77341aa: TASK_RUNNING (Reconciliation: Latest task state)
> INFO|2015-01-08 10:05:35,494|pool-1-thread-1171|MarathonScheduler.scala:338|Need to scale core-compute-jobs-actualvalues-st from 0 up to 1 instances
>
> #### According to Mesos, at this point there's already an instance of this
> job running, so it's actually scaling from 1 to 2, not from 0 to 1 as the
> log says ####
>
> INFO|2015-01-08 10:05:35,878|Thread-188483|TaskBuilder.scala:38|No matching offer for core-compute-jobs-actualvalues-st (need 1.0 CPUs, 1000.0 mem, 0.0 disk, 1 ports)
> (... offers ...)
> ...
>
> #### Killing ####
> INFO|2015-01-08 10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:353|Scaling core-compute-jobs-actualvalues-st from 2 down to 1 instances
> INFO|2015-01-08 10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:357|Killing tasks: Set(core-compute-jobs-actualvalues-st.e458118d-971d-11e4-9e7c-3e6ce77341aa)
>
> Any ideas why this happens and how to fix it?
>
> -kr, Gerard.
>
> On Tue, Dec 2, 2014 at 1:15 AM, Gerard Maas <[email protected]> wrote:
>
>> Thanks! I'll try that and report back once I have some interesting
>> evidence.
>>
>> -kr, Gerard.
>>
>> On Tue, Dec 2, 2014 at 12:54 AM, Tim Chen <[email protected]> wrote:
>>
>>> Hi Gerard,
>>>
>>> I see. What would help diagnose your problem is if you can enable
>>> verbose logging (GLOG_v=1) before running the slave, and share the
>>> slave logs when it happens.
>>>
>>> Tim
>>>
>>> On Mon, Dec 1, 2014 at 3:23 PM, Gerard Maas <[email protected]> wrote:
>>>
>>>> Hi Tim,
>>>>
>>>> It's quite hard to reproduce. It just "happens"... sometimes worse
>>>> than others, mostly when the system is under load. We notice b/c the
>>>> framework starts 'jumping' from one slave to another, but so far we
>>>> have no clue why this is happening.
>>>>
>>>> What I'm currently looking for is the set of conditions that could
>>>> cause Mesos to kill the executor (not the task), to validate whether
>>>> any of those conditions apply to our case and to narrow the problem
>>>> down to some reproducible subset.
>>>>
>>>> -kr, Gerard.
>>>>
>>>> On Mon, Dec 1, 2014 at 11:57 PM, Tim Chen <[email protected]> wrote:
>>>>
>>>>> There are different reasons, but most commonly it's when the
>>>>> framework asks to kill the task.
>>>>>
>>>>> Can you provide some easy repro steps/artifacts? I've been working
>>>>> on Spark on Mesos these days and can help try this out.
>>>>>
>>>>> Tim
>>>>>
>>>>> On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Sorry if this has been discussed before. I'm new to the list.
>>>>>>
>>>>>> We are currently running our Spark + Spark Streaming jobs on Mesos,
>>>>>> submitting our jobs through Marathon.
>>>>>>
>>>>>> We see with some regularity that the Spark Streaming driver gets
>>>>>> killed by Mesos and then restarted on some other node by Marathon.
>>>>>>
>>>>>> I've no clue why Mesos is killing the driver, and looking at both
>>>>>> the Mesos and Spark logs didn't make me any wiser.
>>>>>>
>>>>>> In the Spark Streaming driver logs, I find this entry of Mesos
>>>>>> "signing off" my driver:
>>>>>>
>>>>>>> Shutting down
>>>>>>> Sending SIGTERM to process tree at pid 17845
>>>>>>> Killing the following process trees:
>>>>>>> [
>>>>>>> -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
>>>>>>>  \-+- 17846 sh ./run-mesos.sh application-ts.conf
>>>>>>>   \--- 17847 java -cp core-compute-job.jar -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
>>>>>>> ]
>>>>>>> Command terminated with signal Terminated (pid: 17845)
>>>>>>
>>>>>> What would be the reasons for Mesos to kill an executor?
>>>>>> Has anybody seen something similar? Any hints on where to start digging?
>>>>>>
>>>>>> -kr, Gerard.
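PS: One thing I noticed in the process tree above is that the driver JVM ends up two shells below the pid that Mesos signals. If we exec the JVM from run-mesos.sh, the wrapper shells drop out of the tree and the SIGTERM goes straight to the driver, which should make the shutdown a bit more predictable. A rough sketch of what I mean (this is not our actual script; the port is hard-coded here only for illustration):

    #!/bin/sh
    # Sketch only -- not the real run-mesos.sh.
    # 'exec' replaces this wrapper shell with the JVM, so the SIGTERM that
    # Mesos sends to the process tree is delivered straight to the driver.
    CONF="$1"
    exec java -cp core-compute-job.jar -Dconfig.file="$CONF" com.compute.job.FooJob 31326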
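I'll also take Tim's suggestion from earlier in the thread and restart the slave on that box with verbose glog output before trying to reproduce, roughly like this (the master URL and log dir are placeholders for our environment), and then grab the slave log around the next kill:

    # Placeholder flags -- adjust to the actual slave configuration.
    GLOG_v=1 mesos-slave \
        --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
        --log_dir=/var/log/mesos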

