[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY

2020-05-07 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101572#comment-17101572
 ] 

Andrei Budnik commented on MESOS-10107:
---

{code:java}
commit 0cb1591b709e3c9f32093d943b8e2ddcdcf7999f
Author: Charles-Francois Natali 
Date:   Sat May 2 01:41:09 2020 +0100

Keep retrying to remove cgroup on EBUSY.

This is a follow-up to MESOS-10107, which introduced retries when
calling `rmdir` on a seemingly empty cgroup fails with `EBUSY`
because of various kernel bugs.
At the time, the fix introduced a bounded number of retries, using an
exponential backoff summing up to slightly over 1s. This was done
because it was similar to what Docker does, and worked during testing.
However, after 1 month without seeing this error in our cluster at work,
we finally experienced one case where the 1s timeout wasn't enough.
It could be because the machine was busy at the time, or some other
random factor.
So instead of only trying for 1s, I think it might make sense to just
keep retrying, until the top-level container destruction timeout - set
at 1 minute - kicks in.
This actually makes more sense, and avoids having a magical timeout in
the cgroup code.
We just need to ensure that when the destroyer is finalized, it discards
the future in charge of doing the periodic remove.

This closes #362
{code}
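
For reference, below is a minimal standalone sketch of the retry pattern the commit describes: keep retrying rmdir() on EBUSY with exponential backoff, stopping only on success, an unexpected error, or cancellation by the caller. The cancellation flag stands in for discarding the libprocess Future that drives the periodic remove in Mesos; the function name, delays, and error handling are illustrative assumptions, not the actual patch.

{code:cpp}
// Illustrative sketch only -- not the actual Mesos patch.
// Retry rmdir() on EBUSY with exponential backoff and no overall
// deadline: the loop stops on success, on an unexpected error, or when
// the caller cancels (standing in for discarding the libprocess Future
// in charge of the periodic remove).
#include <algorithm>
#include <atomic>
#include <cerrno>
#include <chrono>
#include <cstring>
#include <iostream>
#include <string>
#include <thread>

#include <unistd.h>  // rmdir()

bool removeCgroupDir(
    const std::string& path,
    const std::atomic<bool>& cancelled)
{
  auto delay = std::chrono::milliseconds(1);
  const auto maxDelay = std::chrono::milliseconds(100);

  while (!cancelled.load()) {
    if (::rmdir(path.c_str()) == 0 || errno == ENOENT) {
      return true;  // Removed, or already gone.
    }

    if (errno != EBUSY) {
      std::cerr << "rmdir('" << path << "') failed: "
                << std::strerror(errno) << std::endl;
      return false;  // Unexpected error; give up.
    }

    // Seemingly empty cgroup still reported busy by the kernel:
    // back off and try again.
    std::this_thread::sleep_for(delay);
    delay = std::min(delay * 2, maxDelay);
  }

  return false;  // Cancelled, e.g. the 1-minute destroy timeout fired.
}
{code}

In the committed fix the loop lives behind a libprocess Future, so when the destroyer is finalized it simply discards that future instead of flipping a flag as in this sketch.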

> containeriser: failed to remove cgroup - EBUSY
> --
>
> Key: MESOS-10107
> URL: https://issues.apache.org/jira/browse/MESOS-10107
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Charles N
>Assignee: Charles Natali
>Priority: Major
>  Labels: cgroups, containerization
> Fix For: 1.10.0
>
> Attachments: mesos-remove-cgroup-race.diff, 
> reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container
> cgroup isn't properly destroyed after the OOM killer kicks in when the
> memory limit is exceeded - see the analysis and patch below:
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 
> 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS: 
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource 
> [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be 
> terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state 
> of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING 
> after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy 
> container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for 
> executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 
> 'task-0

[jira] [Commented] (MESOS-10116) Attempt to reactivate disconnected agent crashes the master

2020-05-07 Thread Andrei Sekretenko (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101542#comment-17101542
 ] 

Andrei Sekretenko commented on MESOS-10116:
---

master:
{noformat}
commit a32513a1fc6a149b30f04721f866e3cbb6003661
Author: Andrei Sekretenko 
Date:   Tue Apr 14 18:55:59 2020 +0200

Added test for reactivation of a disconnected drained agent.

Review: https://reviews.apache.org/r/72364
{noformat}

1.9.x:
{noformat}
commit b3b6dbb27a93a9ace4e4d2d1e83b16ea92f1a8e1
Author: Andrei Sekretenko 
Date:   Tue Apr 14 18:55:59 2020 +0200

Added test for reactivation of a disconnected drained agent.

Review: https://reviews.apache.org/r/72364
{noformat}

> Attempt to reactivate disconnected agent crashes the master
> ---
>
> Key: MESOS-10116
> URL: https://issues.apache.org/jira/browse/MESOS-10116
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Critical
>
> Observed the following scenario on a production cluster:
>  - operator performs agent draining
>  - draining completes, operator disconnects the agent
>  - operator reactivates agent via REACTIVATE_AGENT call
>  - *master issues an offer for a reactivated disconnected agent*
>  - a framework issues ACCEPT call with this offer
>  - master crashes with the following stack trace:
> {noformat}
> F0311 09:06:18.852365 11289 validation.cpp:2123] Check failed: 
> slave->connected Offer 4067082c-ec7a-4efc-ac2d-c6e7cbc77356-O13981526 
> outlived disconnected agent 968ea9b2-374d-45cb-b5b3-c4ffb45a4a78-S0 at 
> slave(1)@10.50.7.59:5051 (10.50.7.59)
> *** Check failure stack trace: ***
> @ 0x7feac6a1dc6d google::LogMessage::Fail()
> @ 0x7feac6a1fec8 google::LogMessage::SendToLog()
> @ 0x7feac6a1d803 google::LogMessage::Flush()
> @ 0x7feac6a20809 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7feac57cdea0 mesos::internal::master::validation::offer::validateSlave()
> @ 0x7feac57d09c1 std::_Function_handler<>::_M_invoke()
> @ 0x7feac57d0fd1 std::function<>::operator()()
> @ 0x7feac57cea3c mesos::internal::master::validation::offer::validate()
> @ 0x7feac56d5565 mesos::internal::master::Master::accept()
> @ 0x7feac56468f0 mesos::internal::master::Master::Http::scheduler()
> @ 0x7feac5689797 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionINS2_14authentication9PrincipalEEEZN5mesos8internal6master6Master10initializeEvEUlS7_SD_E1_E9_M_invokeERKSt9_Any_dataS7_SD_
> @ 0x7feac697038c 
> _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZZNS1_11ProcessBase8_consumeERKNSB_12HttpEndpointERKSsRKNS1_5OwnedINS3_7RequestNKUlRK6OptionINS3_14authentication20AuthenticationResultEEE0_clESR_EUlbE0_IbclEv
> @ 0x7feac53f30e7 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_IST_SI_St12_PlaceholderILi1EEclEOS3_
> @ 0x7feac6966561 process::ProcessBase::consume()
> @ 0x7feac697db5b process::ProcessManager::resume()
> @ 0x7feac69837f6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7feac262f070 (unknown)
> @ 0x7feac1e4de65 start_thread
> @ 0x7feac1b7688d __clone
> {noformat}
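
To make the failure mode concrete, the snippet below is a simplified illustration (not the actual Mesos source) of the invariant that fires in the trace above: offer validation CHECK-fails when the agent backing the offer is no longer connected, so handing out an offer for a disconnected agent turns a recoverable validation error into a fatal abort on the framework's ACCEPT. The type definitions are assumptions made for the sketch; only the CHECK condition and message come from the trace.

{code:cpp}
// Simplified illustration -- not the actual Mesos source. The names
// mirror the stack trace (validateSlave, Offer, Slave); the struct
// bodies are assumptions made for this sketch.
#include <glog/logging.h>

#include <string>

struct Slave
{
  std::string id;
  bool connected;
};

struct Offer
{
  std::string id;
};

void validateSlave(const Offer& offer, const Slave* slave)
{
  // The master is expected never to have outstanding offers for a
  // disconnected agent. After REACTIVATE_AGENT on a drained,
  // disconnected agent that invariant is violated, and this CHECK
  // aborts the whole master process when a framework ACCEPTs the offer.
  CHECK(slave->connected)
    << "Offer " << offer.id
    << " outlived disconnected agent " << slave->id;
}
{code}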



--
This message was sent by Atlassian Jira
(v8.3.4#803005)