As confirmed with Chengwei privately, the orphan tasks should no longer exist once the frameworks have subscribed successfully, because we could not find them in the `orphan_tasks` field of the master/state endpoint.
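A minimal sketch of that check, assuming the JSON body of the master's state endpoint has already been fetched and parsed into a dict. The helper name `orphan_task_ids`, the `id`/`framework_id` keys read from each task entry, and the sample data are illustrative assumptions; only the `orphan_tasks` field name comes from this thread.

```python
import json


def orphan_task_ids(state):
    """Return (task id, framework id) pairs listed under `orphan_tasks`.

    `state` is the parsed JSON body of the master's state endpoint.
    A missing or empty `orphan_tasks` field yields an empty list.
    """
    return [
        (task.get("id"), task.get("framework_id"))
        for task in state.get("orphan_tasks", [])
    ]


# Illustrative sample, not real output from the cluster in this thread.
sample = json.loads('{"orphan_tasks": [{"id": "t1", "framework_id": "f1"}]}')
print(orphan_task_ids(sample))
print(orphan_task_ids({"orphan_tasks": []}))
```

An empty result after all frameworks have re-subscribed would match what we observed.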

On Tue, Nov 29, 2016 at 9:53 PM, Chengwei Yang <[email protected]> wrote:

> On Tue, Nov 29, 2016 at 09:31:08PM +0800, haosdent wrote:
> > Are your jobs scheduled by Marathon or by your own framework?
>
> We started 3 frameworks (marathon, storm, chronos) before upgrading.
>
> Here are the relevant logs from the leading master:
>
> -----------------------------8<----------------------------
> ...
> I1129 14:11:44.009774 6862 master.cpp:7460] Adding task ct:TEST_JOB0_1480396486890:4 with resources cpus(*):4.9; mem(*):64; disk(*):256 on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.009842 6862 master.cpp:7460] Adding task mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1; mem(*):1024; ports(*):[31000-31000] on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.009891 6862 master.cpp:7460] Adding task mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1; mem(*):1024; ports(*):[31000-31000] on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.009953 6862 master.cpp:7460] Adding task test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 with resources cpus(*):1; mem(*):128; ports(*):[31417-31418] on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.010197 6860 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 455974ns
> W1129 14:11:44.010202 6862 master.cpp:6569] Possibly orphaned task test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 of framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0002 running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.010213 6860 replica.cpp:712] Persisted action at 102
> W1129 14:11:44.010249 6862 master.cpp:6569] Possibly orphaned task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 8e87ed68-434d-4267-b83d-c6a509266a03-0000 running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.010406 6862 master.cpp:6569] Possibly orphaned task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0004 running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.010429 6862 master.cpp:6569] Possibly orphaned task ct:TEST_JOB0_1480396486890:4 of framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0003 running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.010447 6862 master.cpp:6596] Possibly orphaned completed task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0000 that ran on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.010645 6860 hierarchical.cpp:476] Added agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual) with cpus(*):8; mem(*):14604; disk(*):297130; ports(*):[31000-32000] (allocated: cpus(*):8; mem(*):2280; ports(*):[31000-31000, 31417-31418]; disk(*):256)
> I1129 14:11:44.010646 6862 master.cpp:4885] Re-registered agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual) with cpus(*):8; mem(*):14604; disk(*):297130; ports(*):[31000-32000]
> I1129 14:11:44.010764 6862 master.cpp:4953] Sending updated checkpointed resources to agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.011076 6860 replica.cpp:691] Replica received learned notice for position 102 from @0.0.0.0:0
> I1129 14:11:44.011338 6861 master.cpp:5015] Received update of agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual) with total oversubscribed resources
> I1129 14:11:44.011404 6861 hierarchical.cpp:540] Agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual) updated with oversubscribed resources (total: cpus(*):8; mem(*):14604; disk(*):297130; ports(*):[31000-32000], allocated: cpus(*):8; mem(*):2280; ports(*):[31000-31000, 31417-31418]; disk(*):256)
> I1129 14:11:44.011510 6860 leveldb.cpp:341] Persisting action (20 bytes) to leveldb took 414611ns
> I1129 14:11:44.011543 6860 leveldb.cpp:399] Deleting ~2 keys from leveldb took 12550ns
> I1129 14:11:44.011561 6860 replica.cpp:712] Persisted action at 102
> I1129 14:11:44.011574 6860 replica.cpp:697] Replica learned TRUNCATE action at position 102
> I1129 14:11:44.011751 6859 master.cpp:5150] Status update TASK_FAILED (UUID: 0d66df11-1e25-45f1-99cc-20e79ab28c91) for task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 8e87ed68-434d-4267-b83d-c6a509266a03-0000 from agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.011795 6859 master.cpp:5171] Received status update TASK_FAILED (UUID: 0d66df11-1e25-45f1-99cc-20e79ab28c91) for task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 8e87ed68-434d-4267-b83d-c6a509266a03-0000 from agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual) for an unknown framework
> I1129 14:11:44.011845 6859 master.cpp:6854] Updating the state of task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 8e87ed68-434d-4267-b83d-c6a509266a03-0000 (latest state: TASK_FAILED, status update state: TASK_FAILED)
> I1129 14:11:44.089604 6860 master.cpp:2429] Received SUBSCRIBE call for framework 'storm096_mesos0282' at scheduler-550b9c6e-4fc9-4786-[email protected]:32036
> I1129 14:11:44.089687 6860 master.cpp:2505] Subscribing framework storm096_mesos0282 with checkpointing enabled and capabilities [ ]
> I1129 14:11:44.090003 6861 hierarchical.cpp:269] Added framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0005
> I1129 14:11:44.090212 6861 master.cpp:5738] Sending 1 offers to framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0005 (storm096_mesos0282) at [email protected]:32036
> I1129 14:11:44.476806 6860 master.cpp:2429] Received SUBSCRIBE call for framework 'marathon' at scheduler-f3bc64fc-53f2-4490-[email protected]:6041
> I1129 14:11:44.476883 6860 master.cpp:2505] Subscribing framework marathon with checkpointing enabled and capabilities [ ]
> I1129 14:11:44.477252 6860 hierarchical.cpp:269] Added framework 39b8a1b0-5ab0-478b-8175-479fb8737942-0002
> ...
> ```
>
> Apparently, it adds tasks before the frameworks have registered, so the
> previously added tasks become **orphans**.
>
> I'm wondering whether we could write framework info into the replicated
> log, so that frameworks are loaded before any existing tasks are added?
>
> > On Tue, Nov 29, 2016 at 7:20 PM, Chengwei Yang <[email protected]> wrote:
> >
> > Hi there,
> >
> > We're upgrading mesos from 0.28.2 to 1.0.2 and we found an interesting
> > problem.
> >
> > We followed the official upgrade guide, so we first upgraded the 2
> > follower mesos-masters and then the leading master.
> >
> > Once the leading master was upgraded, leadership switched to another
> > 1.0.2 mesos-master.
> >
> > Now, stop here.
> >
> > From its log, we found that the leading master does the following.
> >
> > ```
> > ...
> > Adding task ...
> > Adding task ...
> > ...
> > SUBSCRIBE framework
> > SUBSCRIBE framework
> > ...
> > ```
> >
> > So the problem is that when it adds existing tasks, it cannot find the
> > corresponding framework, so the tasks become **orphans**.
> >
> > Is this a known preempt issue, or am I missing anything?
> >
> > --
> > Thanks,
> > Chengwei
> >
> > --
> > Best Regards,
> > Haosdent Huang
>
> --
> Thanks,
> Chengwei

--
Best Regards,
Haosdent Huang
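
The ordering discussed in this thread (tasks flagged as possibly orphaned first, frameworks added afterwards) can be checked mechanically against a master log. The following is a rough sketch, assuming one log entry per line and the message wording seen in the excerpt above ("Possibly orphaned task ... of framework ..." and "Added framework ..."). The function name `unresolved_orphans` and the sample IDs are hypothetical.

```python
import re

# Patterns follow the wording of the log excerpt in this thread.
ORPHAN_RE = re.compile(
    r"Possibly orphaned (?:completed )?task (\S+) of framework (\S+)"
)
ADDED_RE = re.compile(r"Added framework (\S+)")


def unresolved_orphans(log_text):
    """Return (task, framework) pairs whose framework never shows up
    in an "Added framework" entry anywhere in the same log text."""
    orphans = []
    added = set()
    for line in log_text.splitlines():
        match = ORPHAN_RE.search(line)
        if match:
            orphans.append((match.group(1), match.group(2)))
            continue
        match = ADDED_RE.search(line)
        if match:
            added.add(match.group(1))
    return [(task, fw) for task, fw in orphans if fw not in added]


# Hypothetical sample with made-up IDs, not taken from a real cluster.
sample_log = "\n".join([
    "W master.cpp] Possibly orphaned task task-a of framework fw-1 running on agent agent-1",
    "W master.cpp] Possibly orphaned task task-b of framework fw-2 running on agent agent-1",
    "I hierarchical.cpp] Added framework fw-1",
])
print(unresolved_orphans(sample_log))
```

Tasks that remain in the result belong to frameworks that had not been added by the end of the inspected log, which is where a look at `orphan_tasks` in the state endpoint would be worthwhile.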

