I checked with Chengwei privately. The orphan tasks should no longer exist
once the frameworks have subscribed successfully: we could not find them in
the `orphan_tasks` field of the master/state endpoint.
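
For anyone who wants to repeat that check, here is a minimal sketch in Python. It only assumes that the master/state JSON has a top-level `orphan_tasks` list whose entries carry `id` and `framework_id`; the master address and both helper names are placeholders of mine, not anything from Mesos itself.

```python
import json
from urllib.request import urlopen


def fetch_state(master="http://localhost:5050"):
    """Fetch the master's state JSON. The address is a placeholder."""
    with urlopen(master + "/master/state") as resp:
        return json.load(resp)


def summarize_orphans(state):
    """Return (task id, framework id) pairs found in `orphan_tasks`.

    A missing field is treated as "no orphan tasks".
    """
    return [(task.get("id"), task.get("framework_id"))
            for task in state.get("orphan_tasks", [])]
```

Calling `summarize_orphans(fetch_state("http://<leading-master>:5050"))` after all frameworks have re-subscribed should then give an empty list if the tasks are no longer orphaned.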

On Tue, Nov 29, 2016 at 9:53 PM, Chengwei Yang <[email protected]>
wrote:

> On Tue, Nov 29, 2016 at 09:31:08PM +0800, haosdent wrote:
> > Do your jobs scheduled by marathon or your framework?
>
> We started 3 frameworks(marathon, storm, chronos) before upgrading.
>
> Here are the relevant logs from the leading master:
>
> ```
> ...
> I1129 14:11:44.009774  6862 master.cpp:7460] Adding task
> ct:TEST_JOB0_1480396486890:4 with resources cpus(*):4.9; mem(*):64;
> disk(*):256 on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.009842  6862 master.cpp:7460] Adding task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1;
> mem(*):1024; ports(*):[31000-31000] on agent 
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.009891  6862 master.cpp:7460] Adding task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1;
> mem(*):1024; ports(*):[31000-31000] on agent 
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.009953  6862 master.cpp:7460] Adding task
> test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 with resources cpus(*):1;
> mem(*):128; ports(*):[31417-31418] on agent 
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.010197  6860 leveldb.cpp:341] Persisting action (18 bytes)
> to leveldb took 455974ns
> W1129 14:11:44.010202  6862 master.cpp:6569] Possibly orphaned task
> test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 of framework
> 39b8a1b0-5ab0-478b-8175-479fb8737942-0002 running on agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.010213  6860 replica.cpp:712] Persisted action at 102
> W1129 14:11:44.010249  6862 master.cpp:6569] Possibly orphaned task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
> 8e87ed68-434d-4267-b83d-c6a509266a03-0000 running on agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.010406  6862 master.cpp:6569] Possibly orphaned task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
> 39b8a1b0-5ab0-478b-8175-479fb8737942-0004 running on agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.010429  6862 master.cpp:6569] Possibly orphaned task
> ct:TEST_JOB0_1480396486890:4 of framework 
> 39b8a1b0-5ab0-478b-8175-479fb8737942-0003
> running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@
> 10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.010447  6862 master.cpp:6596] Possibly orphaned completed
> task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
> 39b8a1b0-5ab0-478b-8175-479fb8737942-0000 that ran on agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.010645  6860 hierarchical.cpp:476] Added agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 
> (mesos-master-dev051-cqdx.qiyi.virtual)
> with cpus(*):8; mem(*):14604; disk(*):297130; ports(*):[31000-32000]
> (allocated: cpus(*):8; mem(*):2280; ports(*):[31000-31000, 31417-31418];
> disk(*):256)
> I1129 14:11:44.010646  6862 master.cpp:4885] Re-registered agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual) with cpus(*):8; mem(*):14604;
> disk(*):297130; ports(*):[31000-32000]
> I1129 14:11:44.010764  6862 master.cpp:4953] Sending updated checkpointed
> resources  to agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@
> 10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
> I1129 14:11:44.011076  6860 replica.cpp:691] Replica received learned
> notice for position 102 from @0.0.0.0:0
> I1129 14:11:44.011338  6861 master.cpp:5015] Received update of agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual) with total oversubscribed
> resources
> I1129 14:11:44.011404  6861 hierarchical.cpp:540] Agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 
> (mesos-master-dev051-cqdx.qiyi.virtual)
> updated with oversubscribed resources  (total: cpus(*):8; mem(*):14604;
> disk(*):297130; ports(*):[31000-32000], allocated: cpus(*):8; mem(*):2280;
> ports(*):[31000-31000, 31417-31418]; disk(*):256)
> I1129 14:11:44.011510  6860 leveldb.cpp:341] Persisting action (20 bytes)
> to leveldb took 414611ns
> I1129 14:11:44.011543  6860 leveldb.cpp:399] Deleting ~2 keys from leveldb
> took 12550ns
> I1129 14:11:44.011561  6860 replica.cpp:712] Persisted action at 102
> I1129 14:11:44.011574  6860 replica.cpp:697] Replica learned TRUNCATE
> action at position 102
> I1129 14:11:44.011751  6859 master.cpp:5150] Status update TASK_FAILED
> (UUID: 0d66df11-1e25-45f1-99cc-20e79ab28c91) for task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
> 8e87ed68-434d-4267-b83d-c6a509266a03-0000 from agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual)
> W1129 14:11:44.011795  6859 master.cpp:5171] Received status update
> TASK_FAILED (UUID: 0d66df11-1e25-45f1-99cc-20e79ab28c91) for task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
> 8e87ed68-434d-4267-b83d-c6a509266a03-0000 from agent
> 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
> (mesos-master-dev051-cqdx.qiyi.virtual) for an unknown framework
> I1129 14:11:44.011845  6859 master.cpp:6854] Updating the state of task
> mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
> 8e87ed68-434d-4267-b83d-c6a509266a03-0000 (latest state: TASK_FAILED,
> status update state: TASK_FAILED)
> I1129 14:11:44.089604  6860 master.cpp:2429] Received SUBSCRIBE call for
> framework 'storm096_mesos0282' at scheduler-550b9c6e-4fc9-4786-
> [email protected]:32036
> I1129 14:11:44.089687  6860 master.cpp:2505] Subscribing framework
> storm096_mesos0282 with checkpointing enabled and capabilities [  ]
> I1129 14:11:44.090003  6861 hierarchical.cpp:269] Added framework
> 39b8a1b0-5ab0-478b-8175-479fb8737942-0005
> I1129 14:11:44.090212  6861 master.cpp:5738] Sending 1 offers to framework
> 39b8a1b0-5ab0-478b-8175-479fb8737942-0005 (storm096_mesos0282) at
> [email protected]:32036
> I1129 14:11:44.476806  6860 master.cpp:2429] Received SUBSCRIBE call for
> framework 'marathon' at scheduler-f3bc64fc-53f2-4490-
> [email protected]:6041
> I1129 14:11:44.476883  6860 master.cpp:2505] Subscribing framework
> marathon with checkpointing enabled and capabilities [  ]
> I1129 14:11:44.477252  6860 hierarchical.cpp:269] Added framework
> 39b8a1b0-5ab0-478b-8175-479fb8737942-0002
> ...
> ```
>
> Apparently, the master adds the tasks before the frameworks have
> registered, so the tasks added earlier become **orphans**.
>
> I'm wondering whether we could write the framework info into the
> replicated log, so that the master can load the frameworks before adding
> any existing tasks?
>
> >
> > On Tue, Nov 29, 2016 at 7:20 PM, Chengwei Yang <[email protected]>
> > wrote:
> >
> >     Hi there,
> >
> >     We're upgrading mesos from 0.28.2 to 1.0.2 and we found an
> >     interesting problem.
> >
> >     We followed the official upgrade guide, so we first upgraded the 2
> >     following mesos-masters, and then the leading master.
> >
> >     Once the leading master was upgraded, the leader switched to another
> >     1.0.2 mesos-master.
> >
> >     Now, stop here.
> >
> >     We found from its log that the leading master does the following.
> >
> >     ```
> >     ...
> >     Adding task ...
> >     Adding task ...
> >     ...
> >     SUBSCRIBE framework
> >     SUBSCRIBE framework
> >     ...
> >     ```
> >
> >     So the problem is that when it adds the existing tasks, it cannot
> >     find the corresponding framework, so the tasks become **orphans**.
> >
> >     Is this a known issue, or am I missing something?
> >
> >     --
> >     Thanks,
> >     Chengwei
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
>
> --
> Thanks,
> Chengwei
>
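
The ordering in the quoted log can also be checked mechanically. Below is a rough sketch, assuming the unwrapped master log with one record per line and the two message formats seen above ("Possibly orphaned task ... of framework ..." and "Added framework ..."). The function name is hypothetical. It reports, for every framework flagged as possibly orphaned, whether the same framework ID was added later in the log.

```python
import re

# Message formats taken from the log excerpt quoted above.
ORPHAN = re.compile(
    r"Possibly orphaned (?:completed )?task (\S+) of framework (\S+)")
ADDED = re.compile(r"Added framework (\S+)")


def flagged_frameworks(lines):
    """Map each flagged framework ID to True if it was added afterwards."""
    flagged = {}
    for line in lines:
        match = ORPHAN.search(line)
        if match:
            # First sighting: not yet known to have subscribed.
            flagged.setdefault(match.group(2), False)
            continue
        match = ADDED.search(line)
        if match and match.group(1) in flagged:
            flagged[match.group(1)] = True
    return flagged
```

Frameworks that still map to `False` at the end of the log are the ones whose tasks would be expected to remain in `orphan_tasks`.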



-- 
Best Regards,
Haosdent Huang
