[jira] [Commented] (MESOS-8098) Benchmark Master failover performance
[ https://issues.apache.org/jira/browse/MESOS-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241338#comment-16241338 ] Benjamin Mahler commented on MESOS-8098: Looking through the bottom layer, I see the majority of the width is taken up by protobuf serialization, de-serialization, copying and destruction. So that should be a good area to focus on. Also, the profiling tools on macOS are actually really nice, I've found, if you are OK with slowing down the program significantly to get a more complete profile. > Benchmark Master failover performance > - > > Key: MESOS-8098 > URL: https://issues.apache.org/jira/browse/MESOS-8098 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Yan Xu >Assignee: Yan Xu > Attachments: withoutperfpatches.perf.svg, withperfpatches.perf.svg > > > Master failover performance often sheds light on the master's performance in > general as it's often the time the master experiences the highest load. Ways > we can benchmark the failover include the time it takes for all agents to > reregister, all frameworks to resubscribe or fully reconcile. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1
[ https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241278#comment-16241278 ] Julien Pepy commented on MESOS-7007: Hi, I was looking into rebasing [~chhsia0]'s patch (https://reviews.apache.org/r/58980/) on v1.4.0, but as [~naelyn] noticed the codebase has diverged a lot since May, mostly due to MESOS-7449. So here is a new slightly different patch: https://reviews.apache.org/r/63598/ It fills ContainerInfo from the Executor, when present, so that it becomes the default if no ContainerInfo is present in TaskInfo (whether using a container image or a command). It seemed logical, since agents can be configured with default ContainerInfo to pass to the Executor. It has been tested successfully on v1.4.0. Is it possible to integrate it into this ticket? Thanks! > filesystem/shared and --default_container_info broken since 1.1 > --- > > Key: MESOS-7007 > URL: https://issues.apache.org/jira/browse/MESOS-7007 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.1.0, 1.2.0 >Reporter: Pierre Cheynier >Assignee: Chun-Hung Hsiao > Labels: storage > > I face this issue, which prevents me from upgrading to 1.1.0 (and the change was > consequently introduced in this version): > I'm using default_container_info to mount a /tmp volume in the container's > mount namespace from its current sandbox, meaning that each container has a > dedicated /tmp, thanks to the {{filesystem/shared}} isolator. > I noticed through our automation pipeline that integration tests were failing > and found that this is because /tmp (the one from the host!) contents are > trashed each time a container is created. 
> Here is my setup: > * > {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}} > * > {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}} > I discovered this issue in the early days of 1.1 (end of Nov, spoke with > someone on Slack), but had unfortunately no time to dig into the symptoms a > bit more. > I found nothing interesting even using GLOGv=3. > Maybe it's a bad usage of isolators that trigger this issue ? If it's the > case, then at least a documentation update should be done. > Let me know if more information is needed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking
[ https://issues.apache.org/jira/browse/MESOS-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-8169: --- Shepherd: James Peach (was: James Peach) > master validation incorrectly rejects slaves, buggy executorID checking > --- > > Key: MESOS-8169 > URL: https://issues.apache.org/jira/browse/MESOS-8169 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: James DeFelice >Assignee: James DeFelice > Labels: mesosphere > > proposed fix: https://github.com/apache/mesos/pull/248 > I observed this in my environment, where I had two frameworks that used the > same ExecutorID and then triggered a master failover. The master refuses to > reregister the slave because it's not considering the owning-framework of the > ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) > that there's an erroneous duplicate executor ID: > {code} > W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of > agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: > Executor has a duplicate ExecutorID 'default' > {code} > (yes, "default" is probably a terrible name for an ExecutorID - that's a > separate discussion!) > /cc [~neilc] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking
[ https://issues.apache.org/jira/browse/MESOS-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8169: -- Shepherd: James Peach Assignee: James DeFelice > master validation incorrectly rejects slaves, buggy executorID checking > --- > > Key: MESOS-8169 > URL: https://issues.apache.org/jira/browse/MESOS-8169 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: James DeFelice >Assignee: James DeFelice > Labels: mesosphere > > proposed fix: https://github.com/apache/mesos/pull/248 > I observed this in my environment, where I had two frameworks that used the > same ExecutorID and then triggered a master failover. The master refuses to > reregister the slave because it's not considering the owning-framework of the > ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) > that there's an erroneous duplicate executor ID: > {code} > W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of > agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: > Executor has a duplicate ExecutorID 'default' > {code} > (yes, "default" is probably a terrible name for an ExecutorID - that's a > separate discussion!) > /cc [~neilc] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8174) clang-format incorrectly indents aggregate initializations
Benjamin Bannier created MESOS-8174: --- Summary: clang-format incorrectly indents aggregate initializations Key: MESOS-8174 URL: https://issues.apache.org/jira/browse/MESOS-8174 Project: Mesos Issue Type: Bug Reporter: Benjamin Bannier Aggregate initializations are incorrectly indented. I would expect the following indentation,
{code}
Foo bar{
    123,
    456,
    789};
{code}
Instead this is indented as
{code}
Foo bar{123,
        456,
        789};
{code}
Forcing a line break after the opening curly incorrectly indents the arguments with two instead of four spaces,
{code}
Foo bar{
  123,
  456,
  789};
{code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8146) Mesos agent fails containers on restart if containers were started with memory-swap less than memory + 64mb
[ https://issues.apache.org/jira/browse/MESOS-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240756#comment-16240756 ] Joseph Wu commented on MESOS-8146: -- One important thing to note is that specifying arbitrary parameters to the DockerContainerizer is not guaranteed to work: https://github.com/apache/mesos/blob/1.4.x/include/mesos/mesos.proto#L2850-L2854 The error here probably comes from a conflict with the underlying resource isolation. Under the covers, Mesos can resize the container's cpu/memory. The extra parameters you specify break the assumption Mesos is making (about how Docker works). > Mesos agent fails containers on restart if containers were started with > memory-swap less than memory + 64mb > --- > > Key: MESOS-8146 > URL: https://issues.apache.org/jira/browse/MESOS-8146 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.4.0 > Environment: Mesos 1.4.0 > Redhat 7.4 > Marathon 1.4.8 > docker 1.12.6 > docker api 1.24 >Reporter: Guchakov Nikita > > I've seen some strange behaviour with Mesos when trying to disable swap on > docker containers. Our mesos version in use is 1.4.0. > When marathon deploys containers with > ``` > "parameters": [ > { > "key": "memory", > "value": "1024m" > }, > { > "key": "memory-swap", > "value": "1024m" > } > ] > ``` > then it deploys successfully. BUT when mesos-slave restarts and tries to > deregister executor it fails: > ```E1027 11:11:47.367416 12626 slave.cpp:4287] Failed to update resources for > container 6e3e07af-db09-4dc0-88f8-4e5599529cbe of executor > 'templates-api.d72549fd-baed-11e7-9742-96b37b4eca54' of framework > 20171020-202151-141892780-5050-1-0001, destroying container: Failed to set > 'memory.limit_in_bytes': Invalid argument > ``` > Things get weirder when I tried different memory-swap configurations: > Containers aren't destroyed on slave restart only when memory-swap >= > memory + 64mb. 
[jira] [Updated] (MESOS-7124) Replace monadic type get() functions with operator*
[ https://issues.apache.org/jira/browse/MESOS-7124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-7124: Description: In MESOS-2757 we introduced {{T* operator->}} for {{Option}}, {{Future}} and {{Try}}. This provided a convenient short-hand for existing member functions {{T& get}} providing identical functionality. To finalize the work of MESOS-2757 we should replace the existing {{T& get()}} member functions with functions {{T& operator*}}. This is desirable as having both {{operator->}} and {{get}} in the code base at the same time lures developers into using the old-style {{get}} instead of {{operator->}} where it is not needed, e.g., {code} m.get().fun(); {code} instead of {code} m->fun(); {code} We still require the functionality of {{get}} to directly access the contained value, but the current API unnecessarily conflates two (at least from a usage perspective) unrelated aspects; in these instances, we should use an {{operator*}} instead, {code} void f(const T&); Try m = ..; f(*m); // instead of: f(m.get()); {code} Using {{operator*}} in these instances makes it much less likely that users would use it in instances when they wanted to call functions of the wrapped value, i.e., {code} m->fun(); {code} appears more natural than {code} (*m).fun(); {code} Note that this proposed change is in line with the interface of {{std::optional}}. Also, {{std::shared_ptr}}'s {{get}} is a useful function and implements an unrelated interface: it surfaces the wrapped pointer as opposed to its {{operator*}} which dereferences the wrapped pointer. Similarly, our current {{get}} also produce values, and are unrelated to {{std::shared_ptr}}'s {{get}}. was: In MESOS-2757 we introduced {{T* operator->}} for {{Option}}, {{Future}} and {{Try}}. This provided a convenient short-hand for existing member functions {{T* get}} providing identical functionality. 
To finalize the work of MESOS-2757 we should replace the existing {{T* get()}} member functions with functions {{T* operator*}}. This is desirable as having both {{operator->}} and {{get}} in the code base at the same time lures developers into using the old-style {{get}} instead of {{operator->}} where it is not needed, e.g., {code} m.get().fun(); {code} instead of {code} m->fun(); {code} We still require the functionality of {{get}} to directly access the contained value, but the current API unnecessarily conflates two (at least from a usage perspective) unrelated aspects; in these instances, we should use an {{operator*}} instead, {code} void f(const T&); Try m = ..; f(*m); // instead of: f(m.get()); {code} Using {{operator*}} in these instances makes it much less likely that users would use it in instances when they wanted to call functions of the wrapped value, i.e., {code} m->fun(); {code} appears more natural than {code} (*m).fun(); {code} Note that this proposed change is in line with the interface of {{std::optional}}. Also, {{std::shared_ptr}}'s {{get}} is a useful function and implements an unrelated interface: it surfaces the wrapped pointer as opposed to its {{operator*}} which dereferences the wrapped pointer. Similarly, our current {{get}} also produce values, and are unrelated to {{std::shared_ptr}}'s {{get}}. > Replace monadic type get() functions with operator* > --- > > Key: MESOS-7124 > URL: https://issues.apache.org/jira/browse/MESOS-7124 > Project: Mesos > Issue Type: Improvement > Components: libprocess, stout >Reporter: Benjamin Bannier > Labels: tech-debt > > In MESOS-2757 we introduced {{T* operator->}} for {{Option}}, {{Future}} and > {{Try}}. This provided a convenient short-hand for existing member functions > {{T& get}} providing identical functionality. > To finalize the work of MESOS-2757 we should replace the existing {{T& > get()}} member functions with functions {{T& operator*}}. 
> This is desirable as having both {{operator->}} and {{get}} in the code base > at the same time lures developers into using the old-style {{get}} instead of > {{operator->}} where it is not needed, e.g., > {code} > m.get().fun(); > {code} > instead of > {code} > m->fun(); > {code} > We still require the functionality of {{get}} to directly access the > contained value, but the current API unnecessarily conflates two (at least > from a usage perspective) unrelated aspects; in these instances, we should > use an {{operator*}} instead, > {code} > void f(const T&); > > Try m = ..; > f(*m); // instead of: f(m.get()); > {code} > Using {{operator*}} in these instances makes it much less likely that users > would use it in instances when they wanted to call functions of the wrapped > value, i.e., > {code} > m->fun(); > {code} > appears more natural than >
[jira] [Assigned] (MESOS-8173) Improve fetcher exit status message
[ https://issues.apache.org/jira/browse/MESOS-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-8173: -- Assignee: James Peach > Improve fetcher exit status message > --- > > Key: MESOS-8173 > URL: https://issues.apache.org/jira/browse/MESOS-8173 > Project: Mesos > Issue Type: Bug > Components: fetcher >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > When the fetcher fails, we emit a message: > {code} > return Failure("Failed to fetch all URIs for container '" + >stringify(containerId) + >"' with exit status: " + >stringify(status.get())); > {code} > But `status` is the return value from > [wait(2)|http://man7.org/linux/man-pages/man2/waitpid.2.html] so we should be > using {{WSTRINGIFY}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8173) Improve fetcher exit status message
James Peach created MESOS-8173: -- Summary: Improve fetcher exit status message Key: MESOS-8173 URL: https://issues.apache.org/jira/browse/MESOS-8173 Project: Mesos Issue Type: Bug Components: fetcher Reporter: James Peach Priority: Minor When the fetcher fails, we emit a message: {code} return Failure("Failed to fetch all URIs for container '" + stringify(containerId) + "' with exit status: " + stringify(status.get())); {code} But `status` is the return value from [wait(2)|http://man7.org/linux/man-pages/man2/waitpid.2.html] so we should be using {{WSTRINGIFY}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8172) Agent --authenticate_http_executors commandline flag unrecognized in 1.4.0
Dan Leary created MESOS-8172: Summary: Agent --authenticate_http_executors commandline flag unrecognized in 1.4.0 Key: MESOS-8172 URL: https://issues.apache.org/jira/browse/MESOS-8172 Project: Mesos Issue Type: Bug Components: executor, security Affects Versions: 1.4.0 Environment: Ubuntu 16.04.3 with mesos 1.4.0 compiled from source tarball. Reporter: Dan Leary Apparently the mesos-agent authenticate_http_executors commandline arg was introduced in 1.3.0 by MESOS-6365. But running "mesos-agent --authenticate_http_executors ..." in 1.4.0 yields {noformat} Failed to load unknown flag 'authenticate_http_executors' {noformat} ...followed by a usage report that does not include "--authenticate_http_executors". Presumably this means executor authentication is no longer configurable. It is still documented at https://mesos.apache.org/documentation/latest/authentication/#agent -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7924) Add a javascript linter to the webui.
[ https://issues.apache.org/jira/browse/MESOS-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armand Grillet updated MESOS-7924: -- Sprint: Mesosphere Sprint 63, Mesosphere Sprint 64, Mesosphere Sprint 67 (was: Mesosphere Sprint 63, Mesosphere Sprint 64) > Add a javascript linter to the webui. > - > > Key: MESOS-7924 > URL: https://issues.apache.org/jira/browse/MESOS-7924 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Benjamin Mahler >Assignee: Armand Grillet > Labels: tech-debt > Fix For: 1.5.0 > > > As far as I can tell, javascript linters (e.g. ESLint) help catch some > functional errors as well, for example, we've made some "strict" mistakes a > few times that ESLint can catch: MESOS-6624, MESOS-7912. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process
[ https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240294#comment-16240294 ] Andor Molnar commented on MESOS-4065: - [~tillt] Hi, We've started to review your code changes. If the patch is still required on the ZooKeeper side, please come over to GitHub and elaborate on the use case a little bit. Thanks. > slave FD for ZK tcp connection leaked to executor process > - > > Key: MESOS-4065 > URL: https://issues.apache.org/jira/browse/MESOS-4065 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.24.1, 0.25.0, 1.2.2 >Reporter: James DeFelice > Labels: mesosphere, security > > {code} > core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd > root 1432 99.3 0.0 202420 12928 ?Rsl 21:32 13:51 > ./etcd-mesos-executor -log_dir=./ > root 1450 0.4 0.1 38332 28752 ?Sl 21:32 0:03 ./etcd > --data-dir=etcd_data --name=etcd-1449178273 > --listen-peer-urls=http://10.0.0.45:1025 > --initial-advertise-peer-urls=http://10.0.0.45:1025 > --listen-client-urls=http://10.0.0.45:1026 > --advertise-client-urls=http://10.0.0.45:1026 > --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025 > --initial-cluster-state=existing > core 1651 0.0 0.0 6740 928 pts/0S+ 21:46 0:00 grep > --colour=auto -e etcd > core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181 > etcd-meso 1432 root 10u IPv4 21973 0t0TCP > ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181 > (ESTABLISHED) > core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave > root 1124 0.2 0.1 900496 25736 ?Ssl 21:11 0:04 > /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave > core 1658 0.0 0.0 6740 832 pts/0S+ 21:46 0:00 grep > --colour=auto -e slave > core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181 > mesos-sla 1124 root 10u IPv4 21973 0t0TCP > ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181 
> (ESTABLISHED) > {code} > I only tested against mesos 0.24.1 and 0.25.0. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240156#comment-16240156 ] Rob Johnson commented on MESOS-7966: sorry I missed this - I'll take a look to see if we still have the master logs from that time. > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0 >Reporter: Rob Johnson >Assignee: Armand Grillet >Priority: Blocker > Labels: reliability > > We interact with the maintenance API frequently to orchestrate gracefully > draining agents of tasks without impacting service availability. > Occasionally we seem to trigger a fatal error in Mesos when interacting with > the API. This happens relatively frequently, and impacts us when downstream > frameworks (marathon) react badly to leader elections. > Here is the log line that we see when the master dies: > {code} > F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: > slaves[slaveId].maintenance.isSome() > {code} > It's quite possible we're using the maintenance API in the wrong way. We're > happy to provide any other logs you need - please let me know what would be > useful for debugging. > Thanks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error
[ https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240148#comment-16240148 ] Alexander Rukletsov commented on MESOS-7966: [~robjohnson] do you still have master logs? > check for maintenance on agent causes fatal error > - > > Key: MESOS-7966 > URL: https://issues.apache.org/jira/browse/MESOS-7966 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0 >Reporter: Rob Johnson >Assignee: Armand Grillet >Priority: Blocker > Labels: reliability > > We interact with the maintenance API frequently to orchestrate gracefully > draining agents of tasks without impacting service availability. > Occasionally we seem to trigger a fatal error in Mesos when interacting with > the API. This happens relatively frequently, and impacts us when downstream > frameworks (marathon) react badly to leader elections. > Here is the log line that we see when the master dies: > {code} > F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: > slaves[slaveId].maintenance.isSome() > {code} > It's quite possible we're using the maintenance API in the wrong way. We're > happy to provide any other logs you need - please let me know what would be > useful for debugging. > Thanks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-7991) fatal, check failed !framework->recovered()
[ https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215386#comment-16215386 ] Alexander Rukletsov edited comment on MESOS-7991 at 11/6/17 10:44 AM: -- This could happen if we have master failover, agent re-registers and then again re-registers (https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629). The statement in https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070 thus does not seem correct and the change https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073 from the review request https://reviews.apache.org/r/53897/ that happened to follow this comment should be removed. The strange thing is that the tasks are known to the master but not to the agent according to the logs (master.cpp:7568), the fact that the agent kept its id but not its tasks seems unlikely. [~drribosome] Could you give more context around the agent, the registration attempt and also the master logs since the failover and the agent logs around that timeframe? We should write a test reproducing the issue -(having a master + agent, launching a task, restarting master, block framework re-registration, let agent re-registers twice by spoofing the second re-registration)- and then remove the line 8073. was (Author: armandgrillet): This could happen if we have master failover, agent re-registers and then again re-registers (https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629). 
The statement in https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070 thus does not seem correct and the change https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073 from the review request https://reviews.apache.org/r/53897/ that happened to follow this comment should be removed. The strange thing is that the tasks are known to the master but not to the agent according to the logs (master.cpp:7568), the fact that the agent kept its id but not its tasks seems unlikely. Could you give more context around the agent, the registration attempt and also the master logs since the failover and the agent logs around that timeframe? We should write a test reproducing the issue -(having a master + agent, launching a task, restarting master, block framework re-registration, let agent re-registers twice by spoofing the second re-registration)- and then remove the line 8073. > fatal, check failed !framework->recovered() > --- > > Key: MESOS-7991 > URL: https://issues.apache.org/jira/browse/MESOS-7991 > Project: Mesos > Issue Type: Bug >Reporter: Jack Crawford >Assignee: Armand Grillet >Priority: Blocker > Labels: reliability > > mesos master crashed on what appears to be framework recovery > mesos master version: 1.3.1 > mesos agent version: 1.3.1 > {code} > W0920 14:58:54.756364 25452 master.cpp:7568] Task > 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756369 25452 master.cpp:7568] Task > 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > 
W0920 14:58:54.756376 25452 master.cpp:7568] Task > 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756381 25452 master.cpp:7568] Task > e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756386 25452 master.cpp:7568] Task > f838a03c-5cd4-47eb-8606-69b004d89808 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756392 25452 master.cpp:7568] Task > 685ca5da-fa24-494d-a806-06e03bbf00bd of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871
[jira] [Comment Edited] (MESOS-7991) fatal, check failed !framework->recovered()
[ https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215386#comment-16215386 ] Alexander Rukletsov edited comment on MESOS-7991 at 11/6/17 10:43 AM: -- This could happen if we have master failover, agent re-registers and then again re-registers (https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629). The statement in https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070 thus does not seem correct and the change https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073 from the review request https://reviews.apache.org/r/53897/ that happened to follow this comment should be removed. The strange thing is that the tasks are known to the master but not to the agent according to the logs (master.cpp:7568), the fact that the agent kept its id but not its tasks seems unlikely. Could you give more context around the agent, the registration attempt and also the master logs since the failover and the agent logs around that timeframe? We should write a test reproducing the issue -(having a master + agent, launching a task, restarting master, block framework re-registration, let agent re-registers twice by spoofing the second re-registration)- and then remove the line 8073. was (Author: armandgrillet): This could happen if we have master failover, agent re-registers and then again re-registers (https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629). 
The statement in https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070 thus does not seem correct and the change https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073 from the review request https://reviews.apache.org/r/53897/ that happened to follow this comment should be removed. The strange thing is that the tasks are known to the master but not to the agent according to the logs (master.cpp:7568), the fact that the agent kept its id but not its tasks seem unlikely. Could you give more context around the agent, the registration attempt and also the master logs since the failover and the agent logs around that timeframe? We should write a test reproducing the issue -(having a master + agent, launching a task, restarting master, block framework re-registration, let agent re-registers twice by spoofing the second re-registration)- and then remove the line 8073. > fatal, check failed !framework->recovered() > --- > > Key: MESOS-7991 > URL: https://issues.apache.org/jira/browse/MESOS-7991 > Project: Mesos > Issue Type: Bug >Reporter: Jack Crawford >Assignee: Armand Grillet >Priority: Blocker > Labels: reliability > > mesos master crashed on what appears to be framework recovery > mesos master version: 1.3.1 > mesos agent version: 1.3.1 > {code} > W0920 14:58:54.756364 25452 master.cpp:7568] Task > 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756369 25452 master.cpp:7568] Task > 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > 
W0920 14:58:54.756376 25452 master.cpp:7568] Task > 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756381 25452 master.cpp:7568] Task > e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756386 25452 master.cpp:7568] Task > f838a03c-5cd4-47eb-8606-69b004d89808 of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent > a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1) > @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with > the agent > W0920 14:58:54.756392 25452 master.cpp:7568] Task > 685ca5da-fa24-494d-a806-06e03bbf00bd of framework > 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the