On Tue, Nov 29, 2016 at 11:31:58PM +0800, haosdent wrote:
> After checking with Chengwei privately: the orphan tasks should no longer
> exist once the frameworks have subscribed successfully, because we could not
> find them in the `orphan_tasks` field of the master/state endpoint.

Yeah, thanks @haosdent!
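
For anyone who wants to repeat that check, here is a minimal sketch in Python.
It is only an illustration, not something we ran as-is: the helper names
`fetch_state` and `orphan_task_ids` are made up for this mail, and the `id` and
`framework_id` keys on each task entry are an assumption about the state JSON
layout.

```python
import json
from urllib.request import urlopen


def fetch_state(master_url):
    """Fetch and parse the state JSON from a master's master/state endpoint."""
    with urlopen(master_url.rstrip("/") + "/master/state") as resp:
        return json.loads(resp.read().decode("utf-8"))


def orphan_task_ids(state):
    """Return (framework_id, task_id) pairs listed under `orphan_tasks`.

    The `id` and `framework_id` keys are assumed; a missing key yields None.
    """
    return [
        (task.get("framework_id"), task.get("id"))
        for task in state.get("orphan_tasks", [])
    ]
```

Calling `orphan_task_ids(fetch_state("http://<leading-master>:5050"))` once the
frameworks have subscribed again should give an empty list if no orphan tasks
are left (5050 is the default master port; adjust the URL for your cluster).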

> 
> On Tue, Nov 29, 2016 at 9:53 PM, Chengwei Yang <[email protected]>
> wrote:
> 
>     On Tue, Nov 29, 2016 at 09:31:08PM +0800, haosdent wrote:
>     > Do your jobs scheduled by marathon or your framework?
> 
>     We started 3 frameworks (marathon, storm, chronos) before upgrading.
> 
>     Here are the relevant logs from the leading master:
> 
>     ```
>     ...
>     I1129 14:11:44.009774  6862 master.cpp:7460] Adding task
>     ct:TEST_JOB0_1480396486890:4 with resources cpus(*):4.9; mem(*):64; disk
>     (*):256 on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0
>     (mesos-master-dev051-cqdx.qiyi.virtual)
>     I1129 14:11:44.009842  6862 master.cpp:7460] Adding task
>     mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1; mem
>     (*):1024; ports(*):[31000-31000] on agent 26cad8b0-b963-44b6-bc97-
>     4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
>     I1129 14:11:44.009891  6862 master.cpp:7460] Adding task
>     mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1; mem
>     (*):1024; ports(*):[31000-31000] on agent 26cad8b0-b963-44b6-bc97-
>     4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
>     I1129 14:11:44.009953  6862 master.cpp:7460] Adding task
>     test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 with resources cpus(*):1; mem
>     (*):128; ports(*):[31417-31418] on agent 26cad8b0-b963-44b6-bc97-
>     4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
>     I1129 14:11:44.010197  6860 leveldb.cpp:341] Persisting action (18 bytes)
>     to leveldb took 455974ns
>     W1129 14:11:44.010202  6862 master.cpp:6569] Possibly orphaned task
>     test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 of framework
>     39b8a1b0-5ab0-478b-8175-479fb8737942-0002 running on agent
>     26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
>     (mesos-master-dev051-cqdx.qiyi.virtual)
>     I1129 14:11:44.010213  6860 replica.cpp:712] Persisted action at 102
>     W1129 14:11:44.010249  6862 master.cpp:6569] Possibly orphaned task
>     mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
>     8e87ed68-434d-4267-b83d-c6a509266a03-0000 running on agent
>     26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
>     (mesos-master-dev051-cqdx.qiyi.virtual)
>     W1129 14:11:44.010406  6862 master.cpp:6569] Possibly orphaned task
>     mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
>     39b8a1b0-5ab0-478b-8175-479fb8737942-0004 running on agent
>     26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
>     (mesos-master-dev051-cqdx.qiyi.virtual)
>     W1129 14:11:44.010429  6862 master.cpp:6569] Possibly orphaned task
>     ct:TEST_JOB0_1480396486890:4 of framework 39b8a1b0-5ab0-478b-8175-
>     479fb8737942-0003 running on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0
>     at slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
>     W1129 14:11:44.010447  6862 master.cpp:6596] Possibly orphaned completed
>     task mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
>     39b8a1b0-5ab0-478b-8175-479fb8737942-0000 that ran on agent
>     26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
>     (mesos-master-dev051-cqdx.qiyi.virtual)
>     I1129 14:11:44.010645  6860 hierarchical.cpp:476] Added agent
>     26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.
>     qiyi.virtual) with cpus(*):8; mem(*):14604; disk(*):297130; ports(*):
>     [31000-32000] (allocated: cpus(*):8; mem(*):2280; ports(*):[31000-31000,
>     31417-31418]; disk(*):256)
>     I1129 14:11:44.010646  6862 master.cpp:4885] Re-registered agent
>     26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
>     (mesos-master-dev051-cqdx.qiyi.virtual) with cpus(*):8; mem(*):14604; disk
>     (*):297130; ports(*):[31000-32000]
>     I1129 14:11:44.010764  6862 master.cpp:4953] Sending updated checkpointed
>     resources  to agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@
>     10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
>     I1129 14:11:44.011076  6860 replica.cpp:691] Replica received learned
>     notice for position 102 from @0.0.0.0:0
>     I1129 14:11:44.011338  6861 master.cpp:5015] Received update of agent
>     26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
>     (mesos-master-dev051-cqdx.qiyi.virtual) with total oversubscribed resources
>     I1129 14:11:44.011404  6861 hierarchical.cpp:540] Agent
>     26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.
>     qiyi.virtual) updated with oversubscribed resources  (total: cpus(*):8; mem
>     (*):14604; disk(*):297130; ports(*):[31000-32000], allocated: cpus(*):8;
>     mem(*):2280; ports(*):[31000-31000, 31417-31418]; disk(*):256)
>     I1129 14:11:44.011510  6860 leveldb.cpp:341] Persisting action (20 bytes)
>     to leveldb took 414611ns
>     I1129 14:11:44.011543  6860 leveldb.cpp:399] Deleting ~2 keys from leveldb
>     took 12550ns
>     I1129 14:11:44.011561  6860 replica.cpp:712] Persisted action at 102
>     I1129 14:11:44.011574  6860 replica.cpp:697] Replica learned TRUNCATE
>     action at position 102
>     I1129 14:11:44.011751  6859 master.cpp:5150] Status update TASK_FAILED
>     (UUID: 0d66df11-1e25-45f1-99cc-20e79ab28c91) for task
>     mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
>     8e87ed68-434d-4267-b83d-c6a509266a03-0000 from agent
>     26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
>     (mesos-master-dev051-cqdx.qiyi.virtual)
>     W1129 14:11:44.011795  6859 master.cpp:5171] Received status update
>     TASK_FAILED (UUID: 0d66df11-1e25-45f1-99cc-20e79ab28c91) for task
>     mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
>     8e87ed68-434d-4267-b83d-c6a509266a03-0000 from agent
>     26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051
>     (mesos-master-dev051-cqdx.qiyi.virtual) for an unknown framework
>     I1129 14:11:44.011845  6859 master.cpp:6854] Updating the state of task
>     mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework
>     8e87ed68-434d-4267-b83d-c6a509266a03-0000 (latest state: TASK_FAILED,
>     status update state: TASK_FAILED)
>     I1129 14:11:44.089604  6860 master.cpp:2429] Received SUBSCRIBE call for
>     framework 'storm096_mesos0282' at scheduler-550b9c6e-4fc9-4786-
>     [email protected]:32036
>     I1129 14:11:44.089687  6860 master.cpp:2505] Subscribing framework
>     storm096_mesos0282 with checkpointing enabled and capabilities [  ]
>     I1129 14:11:44.090003  6861 hierarchical.cpp:269] Added framework
>     39b8a1b0-5ab0-478b-8175-479fb8737942-0005
>     I1129 14:11:44.090212  6861 master.cpp:5738] Sending 1 offers to framework
>     39b8a1b0-5ab0-478b-8175-479fb8737942-0005 (storm096_mesos0282) at
>     [email protected]:32036
>     I1129 14:11:44.476806  6860 master.cpp:2429] Received SUBSCRIBE call for
>     framework 'marathon' at scheduler-f3bc64fc-53f2-4490-
>     [email protected]:6041
>     I1129 14:11:44.476883  6860 master.cpp:2505] Subscribing framework marathon
>     with checkpointing enabled and capabilities [  ]
>     I1129 14:11:44.477252  6860 hierarchical.cpp:269] Added framework
>     39b8a1b0-5ab0-478b-8175-479fb8737942-0002
>     ...
>     ```
> 
>     Apparently, it adds tasks before the frameworks have re-registered, so the
>     tasks added earlier become **orphans**.
> 
>     I'm wondering whether we could write framework info into the replicated log,
>     so that the master can load the frameworks first, before adding any existing
>     tasks?
> 
>     >
>     > On Tue, Nov 29, 2016 at 7:20 PM, Chengwei Yang <
>     [email protected]>
>     > wrote:
>     >
>     >     Hi there,
>     >
>     >     We're upgrading Mesos from 0.28.2 to 1.0.2 and found an interesting
>     >     problem.
>     >
>     >     We followed the official upgrade guide, so we first upgraded the 2
>     >     follower mesos-masters, and then the leading master.
>     >
>     >     Once the leading master was upgraded, leadership switched to another
>     >     1.0.2 mesos-master.
>     >
>     >     Now, stop here.
>     >
>     >     From its log, we found that the new leading master does the following:
>     >
>     >     ```
>     >     ...
>     >     Adding task ...
>     >     Adding task ...
>     >     ...
>     >     SUBSCRIBE framework
>     >     SUBSCRIBE framework
>     >     ...
>     >     ```
>     >
>     >     So the problem is that when it adds the existing tasks, it cannot find
>     >     the corresponding framework, so the tasks become **orphans**.
>     >
>     >     Is this a known issue, or am I missing anything?
>     >
>     >     --
>     >     Thanks,
>     >     Chengwei
>     >
>     >
>     >
>     >
>     > --
>     > Best Regards,
>     > Haosdent Huang
>    
>     --
>     Thanks,
>     Chengwei
> 
> 
> 
> 
> --
> Best Regards,
> Haosdent Huang

-- 
Thanks,
Chengwei
