Here's the Chronos container log:
{"log":"I0812 08:15:42.256145 112 sched.cpp:981] Scheduler::statusUpdate took 2.572805ms\n","stream":"stderr","time":"2016-08-12T00:15:42.256201718Z"}
{"log":"Exception in thread \"Thread-1433105\" java.lang.IllegalArgumentException: no such vertex in graph\n","stream":"stderr","time":"2016-08-12T00:15:43.902249654Z"}
{"log":"\u0009at org.jgrapht.graph.AbstractGraph.assertVertexExist(AbstractGraph.java:158)\n","stream":"stderr","time":"2016-08-12T00:15:43.902297744Z"}
{"log":"\u0009at org.jgrapht.graph.AbstractBaseGraph$DirectedSpecifics.getEdgeContainer(AbstractBaseGraph.java:927)\n","stream":"stderr","time":"2016-08-12T00:15:43.902310237Z"}
{"log":"\u0009at org.jgrapht.graph.AbstractBaseGraph$DirectedSpecifics.edgesOf(AbstractBaseGraph.java:851)\n","stream":"stderr","time":"2016-08-12T00:15:43.902324329Z"}
{"log":"\u0009at org.jgrapht.graph.AbstractBaseGraph.edgesOf(AbstractBaseGraph.java:395)\n","stream":"stderr","time":"2016-08-12T00:15:43.902333866Z"}
{"log":"\u0009at org.apache.mesos.chronos.scheduler.graph.JobGraph.getChildren(JobGraph.scala:175)\n","stream":"stderr","time":"2016-08-12T00:15:43.902343167Z"}
{"log":"\u0009at org.apache.mesos.chronos.scheduler.graph.JobGraph.getExecutableChildren(JobGraph.scala:148)\n","stream":"stderr","time":"2016-08-12T00:15:43.902357005Z"}
{"log":"\u0009at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.processDependencies(JobScheduler.scala:347)\n","stream":"stderr","time":"2016-08-12T00:15:43.902369764Z"}
{"log":"\u0009at org.apache.mesos.chronos.scheduler.jobs.JobScheduler.handleFinishedTask(JobScheduler.scala:272)\n","stream":"stderr","time":"2016-08-12T00:15:43.902381643Z"}
{"log":"\u0009at org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:226)\n","stream":"stderr","time":"2016-08-12T00:15:43.902393721Z"}
{"log":"\u0009at sun.reflect.GeneratedMethodAccessor110.invoke(Unknown Source)\n","stream":"stderr","time":"2016-08-12T00:15:43.90240862Z"}
{"log":"\u0009at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n","stream":"stderr","time":"2016-08-12T00:15:43.902417552Z"}
{"log":"\u0009at java.lang.reflect.Method.invoke(Method.java:606)\n","stream":"stderr","time":"2016-08-12T00:15:43.902427895Z"}
{"log":"\u0009at com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37)\n","stream":"stderr","time":"2016-08-12T00:15:43.902447417Z"}
{"log":"\u0009at com.sun.proxy.$Proxy31.statusUpdate(Unknown Source)\n","stream":"stderr","time":"2016-08-12T00:15:43.902458275Z"}
{"log":"I0812 08:15:43.902712 96 sched.cpp:1937] Asked to abort the driver\n","stream":"stderr","time":"2016-08-12T00:15:43.902765363Z"}
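For anyone who wants to read a container log like this as a plain stack trace, here is a minimal Python sketch. It assumes the file holds one JSON record per line, which is how Docker's json-file log driver writes them, and it simply concatenates the "log" field of each record. The function name decode_docker_log is made up for illustration; the two sample records are copied from the log above.

```python
import json


def decode_docker_log(lines):
    """Return the plain text carried in the "log" field of each record."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        parts.append(record["log"])
    return "".join(parts)


# Two records from the Chronos container log, one JSON object per line.
sample = [
    r'{"log":"Exception in thread \"Thread-1433105\" java.lang.IllegalArgumentException: no such vertex in graph\n","stream":"stderr","time":"2016-08-12T00:15:43.902249654Z"}',
    r'{"log":"\u0009at org.jgrapht.graph.AbstractGraph.assertVertexExist(AbstractGraph.java:158)\n","stream":"stderr","time":"2016-08-12T00:15:43.902297744Z"}',
]

print(decode_docker_log(sample), end="")
```

The JSON escapes (\", \u0009, \n) are decoded by json.loads, so the output is the exception message followed by a tab-indented "at ..." frame.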
JNIScheduler::statusUpdate (a C++ function in
org_apache_mesos_MesosSchedulerDriver.cpp) invokes statusUpdate (a Scala
function in MesosJobFramework.scala), which queries or replaces a job.
At the same time, another thread deleted that job, so statusUpdate threw an
exception. JNIScheduler::statusUpdate caught it and then invoked
driver->abort().
In summary, this is a race-condition bug in Chronos.
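One common way to make a status-update handler tolerate this kind of race is to do the existence check and the lookup under a single lock, and to treat a job that has already been deleted as having no children instead of throwing. The Python sketch below only illustrates that idea: the JobGraph class and its methods are hypothetical stand-ins, not the Chronos code, and whether Chronos should be fixed this way is an assumption on my part.

```python
import threading


class JobGraph:
    """Toy stand-in for a job dependency graph; not the Chronos class."""

    def __init__(self):
        self._children = {}            # job name -> set of child job names
        self._lock = threading.Lock()  # guards every read and write below

    def add_job(self, name, parents=()):
        with self._lock:
            self._children.setdefault(name, set())
            for parent in parents:
                self._children.setdefault(parent, set()).add(name)

    def remove_job(self, name):
        with self._lock:
            self._children.pop(name, None)
            for kids in self._children.values():
                kids.discard(name)

    def get_children(self, name):
        # The check and the read happen under one lock, so a concurrent
        # remove_job cannot slip in between them. A job that is already
        # gone yields an empty list instead of raising.
        with self._lock:
            return sorted(self._children.get(name, ()))


graph = JobGraph()
graph.add_job("parent")
graph.add_job("child", parents=["parent"])
print(graph.get_children("parent"))  # ['child']

graph.remove_job("parent")           # e.g. deleted by another thread
print(graph.get_children("parent"))  # []
```

With this shape, a finished-task callback that arrives after the job was deleted just finds nothing to schedule, so no exception reaches the driver.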
Regards,
Zhichang Yu
________________________________
From: tommy xiao <[email protected]>
Sent: August 14, 2016 7:43
To: user
Subject: Re: Re: Deactivationg framework unexpectly
Hi Yu,
Please enable debug mode with GLOG_v=3 to see more detailed logs.
2016-08-12 14:27 GMT+08:00 志昌 余 <[email protected]>:
Hi Anindya,
The problem occurred again. The following is the scheduler driver log on the
Chronos side:
I0812 08:15:43.902712 96 sched.cpp:1937] Asked to abort the driver
I0812 08:15:43.902763 96 sched.cpp:981] Scheduler::statusUpdate took 1.436378441secs
I0812 08:15:43.902788 96 sched.cpp:988] Not sending status update acknowledgment message because the driver is not running!
I0812 08:15:43.902866 96 sched.cpp:919] Ignoring task status update message because the driver is not running!
However, the earlier log gives no clue as to why the scheduler driver was
aborted.
Thanks,
Zhichang Yu
________________________________
From: 志昌 余 <[email protected]>
Sent: August 9, 2016 18:03:31
To: [email protected]
Subject: Re: Deactivationg framework unexpectly
Hi Anindya,
Thanks for the info. I'll enable the scheduler driver log to see what happens.
Regards,
Zhichang Yu
________________________________
From: [email protected] <[email protected]> on behalf of Anindya Sinha
<[email protected]>
Sent: August 8, 2016 23:50:10
To: [email protected]
Subject: Re: Deactivationg framework unexpectly
Looks like your framework (Chronos) is sending a DeactivateFrameworkMessage
to the master. The scheduler driver also sends a DeactivateFrameworkMessage
if it is aborted
(https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L1224).
Also, the master can deactivate your framework if your framework disconnects
or fails over. Please check the master logs, or see whether your framework
received a FrameworkErrorMessage.
Thanks
Anindya
On Aug 8, 2016, at 3:35 AM, 志昌 余 <[email protected]> wrote:
Hi,
I recently ran into a weird problem. I'm running Mesos + Chronos. Chronos
often (once every several days) stops scheduling tasks because Mesos
deactivates the framework.
The following is the log of the leading Mesos master:
# grep -iP "activat|disconnected" /var/log/mesos/mesos-master.INFO
I0806 13:40:33.143658 30 master.cpp:2551] Deactivating framework 90a6a7dc-7256-4e55-bd7e-573233c5df74-0000 (chronos-2.5.0-SNAPSHOT) at [email protected]:34544
I0806 13:40:33.143908 23 hierarchical.cpp:375] Deactivated framework 90a6a7dc-7256-4e55-bd7e-573233c5df74-0000
The workaround is to manually restart the Chronos leader.
My environment:
There are 3 physical machines, each running a containerized Mesos master and
Chronos. When the issue occurred, the Mesos leader and the Chronos leader were
both running on the same machine.
Software Version:
mesos-master:0.28.0-2.0.16.ubuntu1404
chronos:2.5.0-ce4469d.ubuntu1404-mesos-0.28.0-2.0.16.ubuntu1404
Can anyone offer any insight into this problem?
Thanks,
Zhichang Yu
--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com