On Tue, Nov 29, 2016 at 09:31:08PM +0800, haosdent wrote:
> Do your jobs scheduled by marathon or your framework?

We started 3 frameworks (marathon, storm, chronos) before upgrading.

Here are the relevant logs from the leading master:

-----------------------------8<----------------------------
...
I1129 14:11:44.009774 6862 master.cpp:7460] Adding task ct:TEST_JOB0_1480396486890:4 with resources cpus(*):4.9; mem(*):64; disk(*):256 on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.009842 6862 master.cpp:7460] Adding task mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1; mem(*):1024; ports(*):[31000-31000] on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.009891 6862 master.cpp:7460] Adding task mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1; mem(*):1024; ports(*):[31000-31000] on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.009953 6862 master.cpp:7460] Adding task test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 with resources cpus(*):1; mem(*):128; ports(*):[31417-31418] on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.010197 6860 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 455974ns
W1129 14:11:44.010202 6862 master.cpp:6569] Possibly orphaned task test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 of framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0002 running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.010213 6860 replica.cpp:712] Persisted action at 102
W1129 14:11:44.010249 6862 master.cpp:6569] Possibly orphaned task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 8e87ed68-434d-4267-b83d-c6a509266a03-0000 running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
W1129 14:11:44.010406 6862 master.cpp:6569] Possibly orphaned task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0004 running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
W1129 14:11:44.010429 6862 master.cpp:6569] Possibly orphaned task ct:TEST_JOB0_1480396486890:4 of framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0003 running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
W1129 14:11:44.010447 6862 master.cpp:6596] Possibly orphaned completed task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0000 that ran on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.010645 6860 hierarchical.cpp:476] Added agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual) with cpus(*):8; mem(*):14604; disk(*):297130; ports(*):[31000-32000] (allocated: cpus(*):8; mem(*):2280; ports(*):[31000-31000, 31417-31418]; disk(*):256)
I1129 14:11:44.010646 6862 master.cpp:4885] Re-registered agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual) with cpus(*):8; mem(*):14604; disk(*):297130; ports(*):[31000-32000]
I1129 14:11:44.010764 6862 master.cpp:4953] Sending updated checkpointed resources to agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.011076 6860 replica.cpp:691] Replica received learned notice for position 102 from @0.0.0.0:0
I1129 14:11:44.011338 6861 master.cpp:5015] Received update of agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual) with total oversubscribed resources
I1129 14:11:44.011404 6861 hierarchical.cpp:540] Agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual) updated with oversubscribed resources (total: cpus(*):8; mem(*):14604; disk(*):297130; ports(*):[31000-32000], allocated: cpus(*):8; mem(*):2280; ports(*):[31000-31000, 31417-31418]; disk(*):256)
I1129 14:11:44.011510 6860 leveldb.cpp:341] Persisting action (20 bytes) to leveldb took 414611ns
I1129 14:11:44.011543 6860 leveldb.cpp:399] Deleting ~2 keys from leveldb took 12550ns
I1129 14:11:44.011561 6860 replica.cpp:712] Persisted action at 102
I1129 14:11:44.011574 6860 replica.cpp:697] Replica learned TRUNCATE action at position 102
I1129 14:11:44.011751 6859 master.cpp:5150] Status update TASK_FAILED (UUID: 0d66df11-1e25-45f1-99cc-20e79ab28c91) for task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 8e87ed68-434d-4267-b83d-c6a509266a03-0000 from agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
W1129 14:11:44.011795 6859 master.cpp:5171] Received status update TASK_FAILED (UUID: 0d66df11-1e25-45f1-99cc-20e79ab28c91) for task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 8e87ed68-434d-4267-b83d-c6a509266a03-0000 from agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual) for an unknown framework
I1129 14:11:44.011845 6859 master.cpp:6854] Updating the state of task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 8e87ed68-434d-4267-b83d-c6a509266a03-0000 (latest state: TASK_FAILED, status update state: TASK_FAILED)
I1129 14:11:44.089604 6860 master.cpp:2429] Received SUBSCRIBE call for framework 'storm096_mesos0282' at scheduler-550b9c6e-4fc9-4786-b512-8b82d397bd3a@10.23.85.233:32036
I1129 14:11:44.089687 6860 master.cpp:2505] Subscribing framework storm096_mesos0282 with checkpointing enabled and capabilities [ ]
I1129 14:11:44.090003 6861 hierarchical.cpp:269] Added framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0005
I1129 14:11:44.090212 6861 master.cpp:5738] Sending 1 offers to framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0005 (storm096_mesos0282) at scheduler-550b9c6e-4fc9-4786-b512-8b82d397bd3a@10.23.85.233:32036
I1129 14:11:44.476806 6860 master.cpp:2429] Received SUBSCRIBE call for framework 'marathon' at scheduler-f3bc64fc-53f2-4490-b500-69aad8ed7afe@10.23.85.234:6041
I1129 14:11:44.476883 6860 master.cpp:2505] Subscribing framework marathon with checkpointing enabled and capabilities [ ]
I1129 14:11:44.477252 6860 hierarchical.cpp:269] Added framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0002
...
```

Apparently, the master adds the tasks before the frameworks have re-registered, so the tasks added earlier become **orphans**.

I'm wondering whether we could write framework info into the replicated log, so that the master can load the frameworks first, before adding any existing tasks?

>
> On Tue, Nov 29, 2016 at 7:20 PM, Chengwei Yang <chengwei.yang...@gmail.com>
> wrote:
>
> Hi there,
>
> We're upgrading mesos from 0.28.2 to 1.0.2 and we found an interesting
> problem.
>
> We followed the official upgrade guide so first upgrade 2 following
> mesos-master, and then the leading master.
>
> Once the leading master upgraded, the leader switched to another 1.0.2
> mesos-master.
>
> Now, stop here.
>
> we found that the leading master does below from its log.
>
> ```
> ...
> Adding task ...
> Adding task ...
> ...
> SUBSRIBE framework
> SUBSRIBE framework
> ...
> ```
>
> So the problem is when it adding existed tasks, it can not found
> corresponding
> framework, so the task becomes **Orphan**.
>
> Is this a known preempt issue or am I missing anything?
>
> --
> Thanks,
> Chengwei
>
>
> --
> Best Regards,
> Haosdent Huang
>

--
Thanks,
Chengwei
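
P.S. The ordering described above (tasks flagged as possibly orphaned before their framework is added) can be checked mechanically in a master log. Below is a minimal Python sketch of that check. The regular expressions are assumptions derived only from the log lines quoted in this thread, and the function name `orphaned_before_subscribe` is hypothetical; other Mesos versions may word these messages differently.

```python
import re

# Assumption: message wording matches the master log lines quoted in this thread.
ORPHAN_RE = re.compile(
    r"Possibly orphaned (?:completed )?task (\S+) of framework (\S+)"
)
ADDED_RE = re.compile(r"Added framework (\S+)")


def orphaned_before_subscribe(lines):
    """Map framework ID -> task IDs flagged as possibly orphaned before that
    framework was added (or without it being added at all in the excerpt)."""
    added = set()   # framework IDs the allocator has reported as added so far
    flagged = {}    # framework ID -> list of task IDs flagged too early
    for line in lines:
        match = ADDED_RE.search(line)
        if match:
            added.add(match.group(1))
            continue
        match = ORPHAN_RE.search(line)
        if match and match.group(2) not in added:
            flagged.setdefault(match.group(2), []).append(match.group(1))
    return flagged
```

To use it, pass any iterable of log lines, for example an open file handle on the leading master's INFO log (the path depends on your own setup). Each key in the result is a framework whose tasks were re-added before the framework itself subscribed.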