[jira] [Commented] (MESOS-1758) Freezer failure leads to lost task during container destruction.

2014-09-07 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124977#comment-14124977
 ] 

Joe Smith commented on MESOS-1758:
--

Can we make sure this gets into 0.21.0? This is continuing to hit us with LOST 
tasks, so just want to make sure it gets included.

Thanks!

 Freezer failure leads to lost task during container destruction.
 

 Key: MESOS-1758
 URL: https://issues.apache.org/jira/browse/MESOS-1758
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Benjamin Mahler

 In the past we've seen numerous issues around the freezer. Lately, on the 
 2.6.44 kernel, we've seen issues where we're unable to freeze the cgroup:
 (1) An oom occurs.
 (2) No indication of oom in the kernel logs.
 (3) The slave is unable to freeze the cgroup.
 (4) The task is marked as lost.
 {noformat}
 I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 15488MB Maximum Used: 15488MB
 MEMORY STATISTICS:
 cache 7958691840
 rss 8281653248
 mapped_file 9474048
 pgpgin 4487861
 pgpgout 522933
 pgfault 2533780
 pgmajfault 11
 inactive_anon 0
 active_anon 8281653248
 inactive_file 7631708160
 active_file 326852608
 unevictable 0
 hierarchical_memory_limit 16240345088
 total_cache 7958691840
 total_rss 8281653248
 total_mapped_file 9474048
 total_pgpgin 4487861
 total_pgpgout 522933
 total_pgfault 2533780
 total_pgmajfault 11
 total_inactive_anon 0
 total_active_anon 8281653248
 total_inactive_file 7631728640
 total_active_file 326852608
 total_unevictable 0
 I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource mem(*):1.62403e+10 and will be terminated
 I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 'bbb9732a-d600-4c1b-b326-846338c608c3'
 I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.710848ms
 I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.588224ms
 I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 2.15296ms
 I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.643008ms
 I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. Max allowed age: 5.630238827780799days
 I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.511168ms
 I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for '/slave(1)/stats.json'
 E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of framework '201104070004-002563-' failed: Failed to destroy container: discarded future
 I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-002563- from @0.0.0.0:0
 I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3
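
For readers unfamiliar with the pattern in the log above, the freeze/thaw/retry cycle can be sketched roughly as follows. This is an illustrative Python sketch against the cgroup v1 freezer.state file, not the Mesos implementation: the function names, the retry count and the polling interval are assumptions, and only the 10-second attempt length is taken from the timestamps in the log.

```python
import os
import time


def read_state(cgroup: str) -> str:
    """Return the current value of freezer.state (THAWED, FREEZING or FROZEN)."""
    with open(os.path.join(cgroup, "freezer.state")) as f:
        return f.read().strip()


def write_state(cgroup: str, state: str) -> None:
    """Request a state change by writing FROZEN or THAWED to freezer.state."""
    with open(os.path.join(cgroup, "freezer.state"), "w") as f:
        f.write(state)


def freeze(cgroup: str, timeout: float = 10.0, retries: int = 6,
           poll: float = 0.1) -> bool:
    """Try to freeze the cgroup; after each timed-out attempt, thaw and retry.

    Returns True once the cgroup reports FROZEN, False if every attempt
    timed out (the situation that ends in TASK_LOST in the log).
    """
    for _ in range(retries):
        write_state(cgroup, "FROZEN")
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if read_state(cgroup) == "FROZEN":
                return True
            time.sleep(poll)
        # The cgroup is stuck (e.g. in FREEZING): thaw it before retrying.
        write_state(cgroup, "THAWED")
    return False
```

Only after freeze() succeeds would the caller signal every process in the cgroup and thaw it again; when it returns False the container cannot be destroyed cleanly.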
 

[jira] [Commented] (MESOS-1765) Use PID namespace to avoid freezing cgroup

2014-09-07 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124979#comment-14124979
 ] 

Joe Smith commented on MESOS-1765:
--

[~wangcong] can you share a link to the kernel bug? (Or a pointer to more 
discussion?) Sounds like we should also keep tabs on fixing that as well.

 Use PID namespace to avoid freezing cgroup
 --

 Key: MESOS-1765
 URL: https://issues.apache.org/jira/browse/MESOS-1765
 Project: Mesos
  Issue Type: Story
  Components: containerization
Reporter: Cong Wang

 There is a known kernel issue when we freeze the whole cgroup upon OOM. 
 Mesos could probably use a PID namespace instead, so that we only need to kill 
 the init process of the namespace rather than freezing all the processes and 
 killing them one by one. I am not sure whether this would break the 
 existing code, though.
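
A rough sketch of what the destroy path reduces to under this proposal, assuming the container's processes were launched inside their own PID namespace (which this snippet does not set up). Per pid_namespaces(7), when the init process of a PID namespace terminates, the kernel sends SIGKILL to every other process in that namespace, so a single kill is enough. The function name and the timeout are hypothetical.

```python
import signal
import subprocess


def destroy_via_init(init_proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Send SIGKILL to the (assumed) namespace init process and reap it.

    If init_proc really is PID 1 of a PID namespace, the kernel kills the
    remaining processes in that namespace; no freezer is involved.
    Returns the return code, which is -9 when the process died from SIGKILL.
    """
    init_proc.send_signal(signal.SIGKILL)
    return init_proc.wait(timeout=timeout)
```

Compared with the freezer approach, nothing here depends on every process in the container reaching a frozen state first.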



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1758) Freezer failure leads to lost task during container destruction.

2014-09-07 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124983#comment-14124983
 ] 

Jie Yu commented on MESOS-1758:
---

Instead of investing more time in fixing the cgroups freezer, I am in favor of 
implementing PID namespace support, as that will be our ultimate solution.


[jira] [Commented] (MESOS-1773) Make shutdown grace period configurable per task

2014-09-07 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125017#comment-14125017
 ] 

Alexander Rukletsov commented on MESOS-1773:


The shutdown timeout can be passed in the {{TaskInfo}} protobuf. However, updating 
the timeout only in the framework executor (e.g. {{CommandExecutor}}) doesn't 
make much sense if the timeouts at higher levels ({{Executor}}, {{containerizer}}) 
haven't been updated as well. Suggestion: the slave should remember the largest 
timeout among all current tasks, and update it whenever a task is staged or 
enters a terminal state.

Thoughts?
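
To make the suggestion concrete, here is a minimal sketch of the bookkeeping it implies. It is written in Python for brevity; the class and method names are hypothetical and do not correspond to existing Mesos code. The slave-wide flag value is assumed to act as the default when a task does not specify its own grace period.

```python
from typing import Dict, Optional


class GracePeriodTracker:
    """Tracks per-task shutdown grace periods and exposes the largest one."""

    def __init__(self, default_seconds: float) -> None:
        self._default = default_seconds          # value of the slave flag
        self._by_task: Dict[str, float] = {}     # task id -> grace period

    def task_staged(self, task_id: str,
                    grace_seconds: Optional[float] = None) -> None:
        """Record a task's grace period when the task is staged."""
        self._by_task[task_id] = (
            self._default if grace_seconds is None else grace_seconds
        )

    def task_terminated(self, task_id: str) -> None:
        """Forget a task once it has entered a terminal state."""
        self._by_task.pop(task_id, None)

    def current(self) -> float:
        """Largest grace period among current tasks, or the default if none."""
        return max(self._by_task.values(), default=self._default)
```

The higher levels (executor, containerizer) would then consult current() instead of the flag value directly.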

 Make shutdown grace period configurable per task
 

 Key: MESOS-1773
 URL: https://issues.apache.org/jira/browse/MESOS-1773
 Project: Mesos
  Issue Type: Improvement
  Components: general, slave
Reporter: Alexander Rukletsov
  Labels: patch

 With [Issue 1571|https://issues.apache.org/jira/browse/MESOS-1571] fixed, 
 shutdown grace periods on all levels are dependent on the slave flag. The 
 next step is to make it configurable per task.


