Hi again,

I finally found a clue on this issue. It looks like Marathon is the one behind the job-killing spree. I still don't know *why*, but it looks like Marathon's task reconciliation finds a discrepancy with Mesos and decides to kill the instance.
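For anyone who wants to trace the same sequence in their own Marathon logs, here is a minimal sketch of a filter that keeps only the scale-up, scale-down and kill messages. It assumes the pipe-separated layout (level|timestamp|thread|source|message) and the exact message wording seen in the excerpt below; the names `Event`, `PATTERNS` and `scaling_events` are made up for illustration, and other Marathon versions may word these messages differently.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    timestamp: str
    kind: str    # "scale_up", "scale_down" or "kill"
    detail: str  # app id and instance counts, or the killed task ids


# Message patterns copied from the log excerpt in this mail (assumption:
# they match the Marathon version in use; adjust if the wording differs).
PATTERNS = [
    ("scale_up", re.compile(r"Need to scale (\S+) from (\d+) up to (\d+) instances")),
    ("scale_down", re.compile(r"Scaling (\S+) from (\d+) down to (\d+) instances")),
    ("kill", re.compile(r"Killing tasks: Set\((.*)\)")),
]


def scaling_events(lines: Iterable[str]) -> Iterator[Event]:
    """Yield scale/kill events from pipe-separated Marathon log lines."""
    for line in lines:
        # level|timestamp|thread|source:line|message
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # not a log line in the expected layout
        _level, timestamp, _thread, _source, message = parts
        for kind, pattern in PATTERNS:
            match = pattern.search(message)
            if match:
                yield Event(timestamp, kind, " ".join(match.groups()))
                break
```

Feeding it an open log file, for example `for event in scaling_events(open("marathon.log")): print(event)` (the file name is a placeholder), lists the scaling decisions in order, which makes it easier to line them up against what the Mesos master reports at the same timestamps.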
INFO|2015-01-08 10:05:35,491|pool-1-thread-1173|MarathonScheduler.scala:299|Requesting task reconciliation with the Mesos master
INFO|2015-01-08 10:05:35,493|Thread-188479|MarathonScheduler.scala:138|Received status update for task core-compute-jobs-actualvalues-st.be0e36cc-9714-11e4-9e7c-3e6ce77341aa: TASK_RUNNING (Reconciliation: Latest task state)
INFO|2015-01-08 10:05:35,494|pool-1-thread-1171|MarathonScheduler.scala:338|Need to scale core-compute-jobs-actualvalues-st from 0 up to 1 instances

#### According to Mesos, at this point there's already an instance of this job running, so it's actually scaling from 1 to 2 and not from 0 to 1 as the log says ####

INFO|2015-01-08 10:05:35,878|Thread-188483|TaskBuilder.scala:38|No matching offer for core-compute-jobs-actualvalues-st (need 1.0 CPUs, 1000.0 mem, 0.0 disk, 1 ports)
(... offers ...)
...

#### Killing ####

INFO|2015-01-08 10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:353|Scaling core-compute-jobs-actualvalues-st from 2 down to 1 instances
INFO|2015-01-08 10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:357|Killing tasks: Set(core-compute-jobs-actualvalues-st.e458118d-971d-11e4-9e7c-3e6ce77341aa)

Any ideas why this happens and how to fix it?

-kr, Gerard.

On Tue, Dec 2, 2014 at 1:15 AM, Gerard Maas <[email protected]> wrote:

> Thanks! I'll try that and report back once I have some interesting
> evidence.
>
> -kr, Gerard.
>
> On Tue, Dec 2, 2014 at 12:54 AM, Tim Chen <[email protected]> wrote:
>
>> Hi Gerard,
>>
>> I see. What would help diagnose your problem is if you can enable
>> verbose logging (GLOG_v=1) before running the slave, and share
>> the slave logs when it happens.
>>
>> Tim
>>
>> On Mon, Dec 1, 2014 at 3:23 PM, Gerard Maas <[email protected]>
>> wrote:
>>
>>> Hi Tim,
>>>
>>> It's quite hard to reproduce. It just "happens"... sometimes worse than
>>> others, mostly when the system is under load. We notice because the framework
>>> starts 'jumping' from one slave to another, but so far we have no clue why
>>> this is happening.
>>>
>>> What I'm currently looking for is some potential conditions that could
>>> cause Mesos to kill the executor (not the task), to validate whether any of
>>> those conditions apply to our case and try to narrow down the problem to
>>> some reproducible subset.
>>>
>>> -kr, Gerard.
>>>
>>>
>>> On Mon, Dec 1, 2014 at 11:57 PM, Tim Chen <[email protected]> wrote:
>>>
>>>> There are different reasons, but the most common is when the framework
>>>> asks to kill the task.
>>>>
>>>> Can you provide some easy repro steps/artifacts? I've been working on
>>>> Spark on Mesos these days and can help try this out.
>>>>
>>>> Tim
>>>>
>>>> On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Sorry if this has been discussed before. I'm new to the list.
>>>>>
>>>>> We are currently running our Spark + Spark Streaming jobs on Mesos,
>>>>> submitting our jobs through Marathon.
>>>>>
>>>>> We see with some regularity that the Spark Streaming driver gets
>>>>> killed by Mesos and then restarted on some other node by Marathon.
>>>>>
>>>>> I've no clue why Mesos is killing the driver, and looking at both the
>>>>> Mesos and Spark logs didn't make me any wiser.
>>>>>
>>>>> In the Spark Streaming driver logs, I find this entry of Mesos
>>>>> "signing off" my driver:
>>>>>
>>>>> Shutting down
>>>>>> Sending SIGTERM to process tree at pid 17845
>>>>>> Killing the following process trees:
>>>>>> [
>>>>>> -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
>>>>>> \-+- 17846 sh ./run-mesos.sh application-ts.conf
>>>>>> \--- 17847 java -cp core-compute-job.jar
>>>>>> -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
>>>>>> ]
>>>>>> Command terminated with signal Terminated (pid: 17845)
>>>>>
>>>>>
>>>>> What would be the reasons for Mesos to kill an executor?
>>>>> Has anybody seen something similar? Any hints on where to start
>>>>> digging?
>>>>>
>>>>> -kr, Gerard.

