Benjamin Mahler created MESOS-1758:
--------------------------------------

             Summary: Freezer failure leads to lost task during container destruction.
                 Key: MESOS-1758
                 URL: https://issues.apache.org/jira/browse/MESOS-1758
             Project: Mesos
          Issue Type: Bug
          Components: containerization
            Reporter: Benjamin Mahler


In the past we've seen numerous issues around the cgroups freezer. Lately, on 
the 2.6.44 kernel, we've seen cases where we're unable to freeze the cgroup:

(1) An OOM occurs.
(2) There is no indication of the OOM in the kernel logs.
(3) The slave is unable to freeze the cgroup.
(4) The task is marked as lost.

{noformat}
I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 
15488MB Maximum Used: 15488MB

MEMORY STATISTICS:
cache 7958691840
rss 8281653248
mapped_file 9474048
pgpgin 4487861
pgpgout 522933
pgfault 2533780
pgmajfault 11
inactive_anon 0
active_anon 8281653248
inactive_file 7631708160
active_file 326852608
unevictable 0
hierarchical_memory_limit 16240345088
total_cache 7958691840
total_rss 8281653248
total_mapped_file 9474048
total_pgpgin 4487861
total_pgpgout 522933
total_pgfault 2533780
total_pgmajfault 11
total_inactive_anon 0
total_active_anon 8281653248
total_inactive_file 7631728640
total_active_file 326852608
total_unevictable 0
I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container 
bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource 
mem(*):1.62403e+10 and will be terminated
I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 
'bbb9732a-d600-4c1b-b326-846338c608c3'
I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.710848ms
I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.588224ms
I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
2.15296ms
I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.643008ms
I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. Max allowed 
age: 5.630238827780799days
I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.511168ms
I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for 
'/slave(1)/stats.json'
E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of 
framework '201104070004-0000002563-0000' failed: Failed to destroy container: 
discarded future
I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST 
(UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 
201104070004-0000002563-0000 from @0.0.0.0:0
I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' 
to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:24.963541 25471 cpushare.cpp:338] Updated 'cpu.shares' to 256 (cpus 
0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:24.964756 25471 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 
100ms and 'cpu.cfs_quota_us' to 25ms (cpus 0.25) for container 
bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:43.406610 25476 status_update_manager.cpp:320] Received status 
update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of 
framework 201104070004-0000002563-0000
I0903 16:47:43.406991 25476 status_update_manager.hpp:342] Checkpointing UPDATE 
for status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for 
task T of framework 201104070004-0000002563-0000
I0903 16:47:43.410475 25476 status_update_manager.cpp:373] Forwarding status 
update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of 
framework 201104070004-0000002563-0000 to master@<scrubbed_ip>:5050
I0903 16:47:43.439923 25480 status_update_manager.cpp:398] Received status 
update acknowledgement (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T 
of framework 201104070004-0000002563-0000
I0903 16:47:43.440115 25480 status_update_manager.hpp:342] Checkpointing ACK 
for status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for 
task T of framework 201104070004-0000002563-0000
I0903 16:47:43.443595 25480 slave.cpp:2709] Cleaning up executor 'E' of 
framework 201104070004-0000002563-0000
{noformat}

We should consider avoiding the freezer entirely in favor of a kill(2) loop. We 
don't have to wait for pid namespaces to remove the freezer dependency.

At the very least, when the freezer fails, we should proceed with a kill(2) 
loop to ensure that we destroy the cgroup.
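
As a rough illustration of the kill(2) loop idea, here is a minimal sketch in 
Python rather than the slave's C++. It assumes the pids can be read from the 
cgroup's cgroup.procs file; the names read_cgroup_pids and kill_loop, and the 
max_rounds and delay parameters, are hypothetical and do not come from the 
Mesos code base. Without the freezer, processes can fork between listing and 
killing, so the loop re-reads the pid list every round and gives up after a 
bounded number of rounds.

```python
import os
import signal
import time
from typing import Callable, Iterable, List


def read_cgroup_pids(cgroup_path: str) -> List[int]:
    """Return the pids currently listed in <cgroup_path>/cgroup.procs."""
    with open(os.path.join(cgroup_path, "cgroup.procs")) as f:
        return [int(line) for line in f if line.strip()]


def kill_loop(
    list_pids: Callable[[], Iterable[int]],
    kill: Callable[[int, int], None] = os.kill,
    max_rounds: int = 100,
    delay: float = 0.01,
) -> bool:
    """SIGKILL every listed pid until none remain (hypothetical helper).

    Returns True once the pid list is empty, or False if processes are
    still present after max_rounds rounds.
    """
    for _ in range(max_rounds):
        pids = list(list_pids())
        if not pids:
            return True
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                # The process exited between listing and killing.
                pass
        # Give the kernel a moment to reap before listing again.
        time.sleep(delay)
    return not list(list_pids())
```

For a container cgroup this could be called as 
kill_loop(lambda: read_cgroup_pids(path)), where path is the freezer cgroup 
directory shown in the log above. A False return would be the point at which 
to report the destroy as failed rather than retrying indefinitely.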



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
