On Tue, Nov 29, 2016 at 09:31:08PM +0800, haosdent wrote:
> Are your jobs scheduled by marathon or by your own framework?

We started 3 frameworks (marathon, storm, chronos) before upgrading.

Here are the relevant logs from the leading master:

```
...
I1129 14:11:44.009774  6862 master.cpp:7460] Adding task 
ct:TEST_JOB0_1480396486890:4 with resources cpus(*):4.9; mem(*):64; disk(*):256 
on agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 
(mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.009842  6862 master.cpp:7460] Adding task 
mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1; 
mem(*):1024; ports(*):[31000-31000] on agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.009891  6862 master.cpp:7460] Adding task 
mesos-master-dev051-cqdx.qiyi.virtual-31000 with resources cpus(*):1; 
mem(*):1024; ports(*):[31000-31000] on agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.009953  6862 master.cpp:7460] Adding task 
test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 with resources cpus(*):1; 
mem(*):128; ports(*):[31417-31418] on agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.010197  6860 leveldb.cpp:341] Persisting action (18 bytes) to 
leveldb took 455974ns
W1129 14:11:44.010202  6862 master.cpp:6569] Possibly orphaned task 
test-all.35819d0c-b5df-11e6-971e-02429c7d09a1 of framework 
39b8a1b0-5ab0-478b-8175-479fb8737942-0002 running on agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 
(mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.010213  6860 replica.cpp:712] Persisted action at 102
W1129 14:11:44.010249  6862 master.cpp:6569] Possibly orphaned task 
mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 
8e87ed68-434d-4267-b83d-c6a509266a03-0000 running on agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 
(mesos-master-dev051-cqdx.qiyi.virtual)
W1129 14:11:44.010406  6862 master.cpp:6569] Possibly orphaned task 
mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 
39b8a1b0-5ab0-478b-8175-479fb8737942-0004 running on agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 
(mesos-master-dev051-cqdx.qiyi.virtual)
W1129 14:11:44.010429  6862 master.cpp:6569] Possibly orphaned task 
ct:TEST_JOB0_1480396486890:4 of framework 
39b8a1b0-5ab0-478b-8175-479fb8737942-0003 running on agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 
(mesos-master-dev051-cqdx.qiyi.virtual)
W1129 14:11:44.010447  6862 master.cpp:6596] Possibly orphaned completed task 
mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 
39b8a1b0-5ab0-478b-8175-479fb8737942-0000 that ran on agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 
(mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.010645  6860 hierarchical.cpp:476] Added agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual) 
with cpus(*):8; mem(*):14604; disk(*):297130; ports(*):[31000-32000] 
(allocated: cpus(*):8; mem(*):2280; ports(*):[31000-31000, 31417-31418]; 
disk(*):256)
I1129 14:11:44.010646  6862 master.cpp:4885] Re-registered agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 
(mesos-master-dev051-cqdx.qiyi.virtual) with cpus(*):8; mem(*):14604; 
disk(*):297130; ports(*):[31000-32000]
I1129 14:11:44.010764  6862 master.cpp:4953] Sending updated checkpointed 
resources  to agent 26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at 
slave(1)@10.23.85.18:5051 (mesos-master-dev051-cqdx.qiyi.virtual)
I1129 14:11:44.011076  6860 replica.cpp:691] Replica received learned notice 
for position 102 from @0.0.0.0:0
I1129 14:11:44.011338  6861 master.cpp:5015] Received update of agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 
(mesos-master-dev051-cqdx.qiyi.virtual) with total oversubscribed resources
I1129 14:11:44.011404  6861 hierarchical.cpp:540] Agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 (mesos-master-dev051-cqdx.qiyi.virtual) 
updated with oversubscribed resources  (total: cpus(*):8; mem(*):14604; 
disk(*):297130; ports(*):[31000-32000], allocated: cpus(*):8; mem(*):2280; 
ports(*):[31000-31000, 31417-31418]; disk(*):256)
I1129 14:11:44.011510  6860 leveldb.cpp:341] Persisting action (20 bytes) to 
leveldb took 414611ns
I1129 14:11:44.011543  6860 leveldb.cpp:399] Deleting ~2 keys from leveldb took 
12550ns
I1129 14:11:44.011561  6860 replica.cpp:712] Persisted action at 102
I1129 14:11:44.011574  6860 replica.cpp:697] Replica learned TRUNCATE action at 
position 102
I1129 14:11:44.011751  6859 master.cpp:5150] Status update TASK_FAILED (UUID: 
0d66df11-1e25-45f1-99cc-20e79ab28c91) for task 
mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 
8e87ed68-434d-4267-b83d-c6a509266a03-0000 from agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 
(mesos-master-dev051-cqdx.qiyi.virtual)
W1129 14:11:44.011795  6859 master.cpp:5171] Received status update TASK_FAILED 
(UUID: 0d66df11-1e25-45f1-99cc-20e79ab28c91) for task 
mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 
8e87ed68-434d-4267-b83d-c6a509266a03-0000 from agent 
26cad8b0-b963-44b6-bc97-4216f13d17eb-S0 at slave(1)@10.23.85.18:5051 
(mesos-master-dev051-cqdx.qiyi.virtual) for an unknown framework
I1129 14:11:44.011845  6859 master.cpp:6854] Updating the state of task 
mesos-master-dev051-cqdx.qiyi.virtual-31000 of framework 
8e87ed68-434d-4267-b83d-c6a509266a03-0000 (latest state: TASK_FAILED, status 
update state: TASK_FAILED)
I1129 14:11:44.089604  6860 master.cpp:2429] Received SUBSCRIBE call for 
framework 'storm096_mesos0282' at 
scheduler-550b9c6e-4fc9-4786-b512-8b82d397bd3a@10.23.85.233:32036
I1129 14:11:44.089687  6860 master.cpp:2505] Subscribing framework 
storm096_mesos0282 with checkpointing enabled and capabilities [  ]
I1129 14:11:44.090003  6861 hierarchical.cpp:269] Added framework 
39b8a1b0-5ab0-478b-8175-479fb8737942-0005
I1129 14:11:44.090212  6861 master.cpp:5738] Sending 1 offers to framework 
39b8a1b0-5ab0-478b-8175-479fb8737942-0005 (storm096_mesos0282) at 
scheduler-550b9c6e-4fc9-4786-b512-8b82d397bd3a@10.23.85.233:32036
I1129 14:11:44.476806  6860 master.cpp:2429] Received SUBSCRIBE call for 
framework 'marathon' at 
scheduler-f3bc64fc-53f2-4490-b500-69aad8ed7afe@10.23.85.234:6041
I1129 14:11:44.476883  6860 master.cpp:2505] Subscribing framework marathon 
with checkpointing enabled and capabilities [  ]
I1129 14:11:44.477252  6860 hierarchical.cpp:269] Added framework 
39b8a1b0-5ab0-478b-8175-479fb8737942-0002
...
```

Apparently, the master adds the tasks before their frameworks have
re-registered, so the tasks added earlier become **orphans**.
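
For anyone who wants to check this against their own master log, here is a rough sketch of a script that does so. It is not part of Mesos; the helper names (`records`, `orphan_report`) are made up for illustration, and the patterns are only inferred from the log lines quoted above. It joins the wrapped lines back into one record per log entry, collects every "Possibly orphaned task" warning, and reports whether the same framework ID shows up later in an "Added framework" line.

```python
import re

# Start of a glog-style record, e.g.
# "W1129 14:11:44.010202  6862 master.cpp:6569] Possibly orphaned task"
RECORD_START = re.compile(r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ \S+:\d+\]")
# Patterns inferred from the log excerpt above (assumption, not a Mesos API).
ORPHAN = re.compile(r"Possibly orphaned (?:completed )?task (\S+) of framework (\S+)")
ADDED = re.compile(r"Added framework (\S+)")


def records(lines):
    """Join hard-wrapped continuation lines into one string per log entry."""
    current = None
    for line in lines:
        line = line.strip()
        if RECORD_START.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None and line:
            current += " " + line
    if current is not None:
        yield current


def orphan_report(lines):
    """Return (task_id, framework_id, framework_added) for each orphan warning."""
    orphans = []
    added = set()
    for record in records(lines):
        match = ORPHAN.search(record)
        if match:
            orphans.append((match.group(1), match.group(2)))
            continue
        match = ADDED.search(record)
        if match:
            added.add(match.group(1))
    return [(task, fw, fw in added) for task, fw in orphans]


if __name__ == "__main__":
    import sys

    # Usage: python orphan_report.py < mesos-master.INFO
    for task, framework, came_back in orphan_report(sys.stdin):
        status = "framework added later" if came_back else "framework not seen"
        print(f"{task}  {framework}  {status}")
```

A task reported as "framework not seen" only means that no "Added framework" line for that ID appears in the part of the log that was fed in, nothing more.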

I'm wondering whether we could write framework info into the replicated log,
so that the master can load the frameworks first, before adding any existing
tasks?
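
To make the idea concrete, here is a toy model of the ordering problem. It is purely illustrative and is not how the Mesos master is implemented; `replay_recovery` and the event tuples are hypothetical. It replays "task added" and "framework subscribed" events in order and marks a task as orphaned when its framework is not yet known, with an optional set of frameworks loaded up front to stand in for framework info restored from persisted state.

```python
def replay_recovery(events, known_frameworks=()):
    """Replay recovery events in order and return the orphaned task IDs.

    events: ("task", task_id, framework_id) or ("framework", framework_id).
    known_frameworks: framework IDs loaded before replay starts; this stands
    in for the hypothetical persisted framework info proposed above.
    """
    frameworks = set(known_frameworks)
    orphans = []
    for event in events:
        if event[0] == "framework":
            frameworks.add(event[1])
        elif event[0] == "task":
            _, task_id, framework_id = event
            if framework_id not in frameworks:
                orphans.append(task_id)
    return orphans


# Same order as in the log: the agent re-registers (tasks are added) first,
# the frameworks subscribe afterwards.
events = [
    ("task", "t1", "fw-A"),
    ("task", "t2", "fw-B"),
    ("framework", "fw-A"),
    ("framework", "fw-B"),
]

print(replay_recovery(events))                                     # both orphaned
print(replay_recovery(events, known_frameworks={"fw-A", "fw-B"}))  # none orphaned
```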

> 
> On Tue, Nov 29, 2016 at 7:20 PM, Chengwei Yang <chengwei.yang...@gmail.com>
> wrote:
> 
>     Hi there,
> 
>     We're upgrading mesos from 0.28.2 to 1.0.2 and we found an interesting
>     problem.
> 
>     We followed the official upgrade guide: first we upgraded the 2 following
>     mesos-masters, and then the leading master.
> 
>     Once the leading master was upgraded, leadership switched to another 1.0.2
>     mesos-master.
> 
>     Now, stop here.
> 
>     From its log, we found that the new leading master does the following:
> 
>     ```
>     ...
>     Adding task ...
>     Adding task ...
>     ...
>     SUBSCRIBE framework
>     SUBSCRIBE framework
>     ...
>     ```
> 
>     So the problem is that when it adds the existing tasks, it cannot find the
>     corresponding framework, so each task becomes an **orphan**.
> 
>     Is this a known preempt issue or am I missing anything?
>    
>     --
>     Thanks,
>     Chengwei
> 
> 
> 
> 
> --
> Best Regards,
> Haosdent Huang

-- 
Thanks,
Chengwei
