[jira] [Commented] (MESOS-8098) Benchmark Master failover performance

2017-11-06 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241338#comment-16241338
 ] 

Benjamin Mahler commented on MESOS-8098:


Looking through the bottom layer, I see the majority of the width is taken up 
by protobuf serialization, de-serialization, copying and destruction. So that 
should be a good area to focus on. Also, I've found the profiling tools on 
macOS are really nice if you are OK with slowing down the program significantly 
to get a more complete profile.
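
As a rough, self-contained way to measure that cost in isolation, here is a 
hypothetical micro-benchmark sketch (the message type, payload, and iteration 
count are placeholders, not taken from the profiles attached to this ticket):

{code}
// Hypothetical sketch: time protobuf serialization/de-serialization round
// trips for a populated message to see what a single round trip costs.
#include <chrono>
#include <iostream>
#include <string>

#include <mesos/mesos.pb.h>

int main()
{
  mesos::SlaveInfo info;
  info.set_hostname("agent.example.com");  // placeholder payload

  const int iterations = 1000000;
  std::string bytes;

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    bytes.clear();
    info.SerializeToString(&bytes);  // serialization cost

    mesos::SlaveInfo copy;
    copy.ParseFromString(bytes);     // de-serialization + copy/destroy cost
  }
  auto elapsed = std::chrono::steady_clock::now() - start;

  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed)
                   .count()
            << "ms for " << iterations << " round trips" << std::endl;

  return 0;
}
{code}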

> Benchmark Master failover performance
> -
>
> Key: MESOS-8098
> URL: https://issues.apache.org/jira/browse/MESOS-8098
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Yan Xu
>Assignee: Yan Xu
> Attachments: withoutperfpatches.perf.svg, withperfpatches.perf.svg
>
>
> Master failover performance often sheds light on the master's performance in 
> general as it's often the time the master experiences the highest load. Ways 
> we can benchmark the failover include the time it takes for all agents to 
> reregister, all frameworks to resubscribe or fully reconcile.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-11-06 Thread Julien Pepy (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241278#comment-16241278
 ] 

Julien Pepy commented on MESOS-7007:


Hi,

I was looking into rebasing [~chhsia0]'s patch 
(https://reviews.apache.org/r/58980/) on v1.4.0, but as [~naelyn] noticed, the 
codebase has diverged a lot since May, mostly due to MESOS-7449.

So here is a new, slightly different patch: https://reviews.apache.org/r/63598/
It fills ContainerInfo from the Executor, when present, so that it becomes the 
default if no ContainerInfo is present in TaskInfo (whether using a container 
image or a command). This seemed logical, since agents can be configured with a 
default ContainerInfo to pass to the Executor.
It has been tested successfully on v1.4.0.
Is it possible to integrate it into this ticket? Thanks!
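
For clarity, here is a minimal sketch of the fallback order described above 
(an illustration only, not the code in https://reviews.apache.org/r/63598/; 
the helper name and plumbing are assumptions):

{code}
#include <mesos/mesos.pb.h>

#include <stout/option.hpp>

// Hypothetical helper illustrating the intended precedence: the task's
// ContainerInfo wins, then the executor's, then --default_container_info.
Option<mesos::ContainerInfo> chooseContainerInfo(
    const mesos::TaskInfo& task,
    const mesos::ExecutorInfo& executor,
    const Option<mesos::ContainerInfo>& defaultContainerInfo)
{
  if (task.has_container()) {
    return task.container();      // explicit per-task configuration
  }

  if (executor.has_container()) {
    return executor.container();  // fall back to the executor's info
  }

  return defaultContainerInfo;    // finally, the agent-wide default
}
{code}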

> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pierre Cheynier
>Assignee: Chun-Hung Hsiao
>  Labels: storage
>
> I'm facing an issue that prevents me from upgrading to 1.1.0 (the change that 
> causes it was introduced in this version):
> I'm using default_container_info to mount a /tmp volume in the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of Nov, spoke with 
> someone on Slack), but unfortunately had no time to dig into the symptoms 
> further.
> I found nothing interesting even using GLOGv=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking

2017-11-06 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-8169:
---
Shepherd: James Peach  (was: James Peach)

> master validation incorrectly rejects slaves, buggy executorID checking
> ---
>
> Key: MESOS-8169
> URL: https://issues.apache.org/jira/browse/MESOS-8169
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: James DeFelice
>Assignee: James DeFelice
>  Labels: mesosphere
>
> proposed fix: https://github.com/apache/mesos/pull/248
> I observed this in my environment, where I had two frameworks that used the 
> same ExecutorID and then triggered a master failover. The master refuses to 
> reregister the slave because it's not considering the owning-framework of the 
> ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) 
> that there's an erroneous duplicate executor ID:
> {code}
> W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of 
> agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: 
> Executor has a duplicate ExecutorID 'default'
> {code}
> (yes, "default" is probably a terrible name for an ExecutorID - that's a 
> separate discussion!)
> /cc [~neilc]
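
To make the description above concrete, here is a hedged sketch (not the 
actual master validation code and not the linked pull request) of what 
framework-scoped uniqueness checking looks like: keying on (FrameworkID, 
ExecutorID) pairs rather than bare ExecutorIDs.

{code}
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical check: an ExecutorID only counts as a duplicate if it
// repeats within the same framework.
bool hasDuplicateExecutorId(
    const std::vector<std::pair<std::string, std::string>>& executors)
{
  // Each element is a (frameworkId, executorId) pair.
  std::set<std::pair<std::string, std::string>> seen;

  for (const auto& executor : executors) {
    if (!seen.insert(executor).second) {
      return true;  // same ExecutorID seen twice for the same framework
    }
  }

  return false;
}
{code}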



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking

2017-11-06 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8169:
--

Shepherd: James Peach
Assignee: James DeFelice

> master validation incorrectly rejects slaves, buggy executorID checking
> ---
>
> Key: MESOS-8169
> URL: https://issues.apache.org/jira/browse/MESOS-8169
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: James DeFelice
>Assignee: James DeFelice
>  Labels: mesosphere
>
> proposed fix: https://github.com/apache/mesos/pull/248
> I observed this in my environment, where I had two frameworks that used the 
> same ExecutorID and then triggered a master failover. The master refuses to 
> reregister the slave because it's not considering the owning-framework of the 
> ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) 
> that there's an erroneous duplicate executor ID:
> {code}
> W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of 
> agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: 
> Executor has a duplicate ExecutorID 'default'
> {code}
> (yes, "default" is probably a terrible name for an ExecutorID - that's a 
> separate discussion!)
> /cc [~neilc]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8174) clang-format incorrectly indents aggregate initializations

2017-11-06 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8174:
---

 Summary: clang-format incorrectly indents aggregate initializations
 Key: MESOS-8174
 URL: https://issues.apache.org/jira/browse/MESOS-8174
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Bannier


Aggregate initializations are incorrectly indented. I would expect the 
following indentation,

{code}
Foo bar{
    123,
    456,
    789};
{code}

Instead this is indented as

{code}
Foo bar{123,
        456,
        789};
{code}

Forcing a line break after the opening curly incorrectly indents the arguments 
with two spaces instead of four,

{code}
Foo bar{
  123,
  456,
  789};
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8146) Mesos agent fails containers on restart if containers were started with memory-swap less than memory + 64mb

2017-11-06 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240756#comment-16240756
 ] 

Joseph Wu commented on MESOS-8146:
--

One important thing to note is that specifying arbitrary parameters to the 
DockerContainerizer is not guaranteed to work:
https://github.com/apache/mesos/blob/1.4.x/include/mesos/mesos.proto#L2850-L2854

The error here probably comes from a conflict with the underlying resource 
isolation.  Under the covers, Mesos can resize the container's cpu/memory.  The 
extra parameters you specify break the assumptions Mesos makes (about how 
Docker works).
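
As a self-contained illustration of the likely underlying cgroup constraint 
(assumptions: cgroup v1 memory controller, hypothetical paths and helper; this 
is not the containerizer code): the kernel rejects a memory.limit_in_bytes 
value above memory.memsw.limit_in_bytes with EINVAL, so a container started 
with memory-swap equal to memory leaves no headroom when Mesos later tries to 
raise the memory limit, and growing both limits only works if memsw is raised 
first.

{code}
#include <fstream>
#include <string>

// Hypothetical illustration: when *growing* the limits, memory+swap must be
// raised before memory, otherwise the write to memory.limit_in_bytes fails
// with "Invalid argument" (the error quoted in this ticket).
void growLimits(
    const std::string& cgroup,
    long long memoryBytes,
    long long memorySwapBytes)
{
  std::ofstream(cgroup + "/memory.memsw.limit_in_bytes") << memorySwapBytes;
  std::ofstream(cgroup + "/memory.limit_in_bytes") << memoryBytes;
}
{code}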

> Mesos agent fails containers on restart if containers were started with 
> memory-swap less than memory + 64mb
> ---
>
> Key: MESOS-8146
> URL: https://issues.apache.org/jira/browse/MESOS-8146
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.4.0
> Environment: Mesos 1.4.0
> Redhat 7.4
> Marathon 1.4.8
> docker 1.12.6 
> docker api 1.24
>Reporter: Guchakov Nikita
>
> I've seen some strange behaviour with Mesos when trying to disable swap on 
> Docker containers. Our Mesos version in use is 1.4.0.
> When marathon deploys containers with
> ```
> "parameters": [
> {
>   "key": "memory",
>   "value": "1024m"
> },
> {
>   "key": "memory-swap",
>   "value": "1024m"
> }
>   ]
> ```
> then it deploys successfully. BUT when the mesos-slave restarts and tries to 
> deregister the executor, it fails:
> ```E1027 11:11:47.367416 12626 slave.cpp:4287] Failed to update resources for 
> container 6e3e07af-db09-4dc0-88f8-4e5599529cbe of executor 
> 'templates-api.d72549fd-baed-11e7-9742-96b37b4eca54' of framework 
> 20171020-202151-141892780-5050-1-0001, destroying container: Failed to set 
> 'memory.limit_in_bytes': Invalid argument
> ```
> Things got weirder when I tried different memory-swap configurations:
> Containers are not destroyed on the slave's restart only when memory-swap >= 
> memory + 64mb.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7124) Replace monadic type get() functions with operator*

2017-11-06 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-7124:

Description: 
In MESOS-2757 we introduced {{T* operator->}} for {{Option}}, {{Future}} and 
{{Try}}. This provided a convenient short-hand for existing member functions 
{{T& get}} providing identical functionality.

To finalize the work of MESOS-2757 we should replace the existing {{T& get()}} 
member functions with functions {{T& operator*}}.

This is desirable as having both {{operator->}} and {{get}} in the code base at 
the same time lures developers into using the old-style {{get}} instead of 
{{operator->}} where it is not needed, e.g.,
{code}
m.get().fun();
{code}
instead of
{code}
m->fun();
{code}

We still require the functionality of {{get}} to directly access the contained 
value, but the current API unnecessarily conflates two (at least from a usage 
perspective) unrelated aspects; in these instances, we should use an 
{{operator*}} instead,
{code}
void f(const T&);

Try<T> m = ..;

f(*m); // instead of: f(m.get());
{code}

Using {{operator*}} in these instances makes it much less likely that users 
would use it in instances when they wanted to call functions of the wrapped 
value, i.e.,
{code}
m->fun();
{code}
appears more natural than
{code}
(*m).fun();
{code}

Note that this proposed change is in line with the interface of 
{{std::optional}}. Also, {{std::shared_ptr}}'s {{get}} is a useful function and 
implements an unrelated interface: it surfaces the wrapped pointer as opposed 
to its {{operator*}} which dereferences the wrapped pointer. Similarly, our 
current {{get}} also produce values, and are unrelated to {{std::shared_ptr}}'s 
{{get}}.

  was:
In MESOS-2757 we introduced {{T* operator->}} for {{Option}}, {{Future}} and 
{{Try}}. This provided a convenient short-hand for existing member functions 
{{T* get}} providing identical functionality.

To finalize the work of MESOS-2757 we should replace the existing {{T* get()}} 
member functions with functions {{T* operator*}}.

This is desirable as having both {{operator->}} and {{get}} in the code base at 
the same time lures developers into using the old-style {{get}} instead of 
{{operator->}} where it is not needed, e.g.,
{code}
m.get().fun();
{code}
instead of
{code}
m->fun();
{code}

We still require the functionality of {{get}} to directly access the contained 
value, but the current API unnecessarily conflates two (at least from a usage 
perspective) unrelated aspects; in these instances, we should use an 
{{operator*}} instead,
{code}
void f(const T&);

Try<T> m = ..;

f(*m); // instead of: f(m.get());
{code}

Using {{operator*}} in these instances makes it much less likely that users 
would use it in instances when they wanted to call functions of the wrapped 
value, i.e.,
{code}
m->fun();
{code}
appears more natural than
{code}
(*m).fun();
{code}

Note that this proposed change is in line with the interface of 
{{std::optional}}. Also, {{std::shared_ptr}}'s {{get}} is a useful function and 
implements an unrelated interface: it surfaces the wrapped pointer as opposed 
to its {{operator*}} which dereferences the wrapped pointer. Similarly, our 
current {{get}} also produce values, and are unrelated to {{std::shared_ptr}}'s 
{{get}}.
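
A toy, self-contained illustration of the shape such a wrapper takes (this is 
not stout's actual Option/Try implementation; the class is purely 
illustrative):

{code}
#include <cassert>
#include <string>
#include <utility>

// Toy wrapper: operator* exposes the contained value and operator-> exposes
// its members, which is what makes a separate get() redundant for value
// access.
template <typename T>
class Wrapper
{
public:
  explicit Wrapper(T value) : value_(std::move(value)) {}

  T& operator*() { return value_; }
  T* operator->() { return &value_; }

private:
  T value_;
};

int main()
{
  Wrapper<std::string> w("mesos");

  assert(w->size() == 5);   // member access via operator->
  assert(*w == "mesos");    // value access via operator*

  return 0;
}
{code}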


> Replace monadic type get() functions with operator*
> ---
>
> Key: MESOS-7124
> URL: https://issues.apache.org/jira/browse/MESOS-7124
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess, stout
>Reporter: Benjamin Bannier
>  Labels: tech-debt
>
> In MESOS-2757 we introduced {{T* operator->}} for {{Option}}, {{Future}} and 
> {{Try}}. This provided a convenient short-hand for existing member functions 
> {{T& get}} providing identical functionality.
> To finalize the work of MESOS-2757 we should replace the existing {{T& 
> get()}} member functions with functions {{T& operator*}}.
> This is desirable as having both {{operator->}} and {{get}} in the code base 
> at the same time lures developers into using the old-style {{get}} instead of 
> {{operator->}} where it is not needed, e.g.,
> {code}
> m.get().fun();
> {code}
> instead of
> {code}
> m->fun();
> {code}
> We still require the functionality of {{get}} to directly access the 
> contained value, but the current API unnecessarily conflates two (at least 
> from a usage perspective) unrelated aspects; in these instances, we should 
> use an {{operator*}} instead,
> {code}
> void f(const T&);
> 
> Try<T> m = ..;
> f(*m); // instead of: f(m.get());
> {code}
> Using {{operator*}} in these instances makes it much less likely that users 
> would use it in instances when they wanted to call functions of the wrapped 
> value, i.e.,
> {code}
> m->fun();
> {code}
> appears more natural than
> 

[jira] [Assigned] (MESOS-8173) Improve fetcher exit status message

2017-11-06 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-8173:
--

Assignee: James Peach

> Improve fetcher exit status message
> ---
>
> Key: MESOS-8173
> URL: https://issues.apache.org/jira/browse/MESOS-8173
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> When the fetcher fails, we emit a message:
> {code}
> return Failure("Failed to fetch all URIs for container '" +
>stringify(containerId) +
>"' with exit status: " +
>stringify(status.get()));
> {code}
> But `status` is the return value from 
> [wait(2)|http://man7.org/linux/man-pages/man2/waitpid.2.html] so we should be 
> using {{WSTRINGIFY}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8173) Improve fetcher exit status message

2017-11-06 Thread James Peach (JIRA)
James Peach created MESOS-8173:
--

 Summary: Improve fetcher exit status message
 Key: MESOS-8173
 URL: https://issues.apache.org/jira/browse/MESOS-8173
 Project: Mesos
  Issue Type: Bug
  Components: fetcher
Reporter: James Peach
Priority: Minor


When the fetcher fails, we emit a message:
{code}
return Failure("Failed to fetch all URIs for container '" +
   stringify(containerId) +
   "' with exit status: " +
   stringify(status.get()));
{code}

But `status` is the return value from 
[wait(2)|http://man7.org/linux/man-pages/man2/waitpid.2.html] so we should be 
using {{WSTRINGIFY}}.
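
For illustration, here is a hedged sketch of the kind of decoding 
{{WSTRINGIFY}} is meant to provide (this uses the standard wait(2) macros 
directly; it is not the stout implementation nor the final fix, and the helper 
name is an assumption):

{code}
#include <sys/wait.h>

#include <string>

// Hypothetical helper: turn a raw wait(2) status into a readable message
// instead of printing the packed integer.
std::string describeWaitStatus(int status)
{
  if (WIFEXITED(status)) {
    return "exited with status " + std::to_string(WEXITSTATUS(status));
  }

  if (WIFSIGNALED(status)) {
    return "terminated by signal " + std::to_string(WTERMSIG(status));
  }

  return "unknown wait status " + std::to_string(status);
}
{code}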



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8172) Agent --authenticate_http_executors commandline flag unrecognized in 1.4.0

2017-11-06 Thread Dan Leary (JIRA)
Dan Leary created MESOS-8172:


 Summary: Agent --authenticate_http_executors commandline flag 
unrecognized in 1.4.0
 Key: MESOS-8172
 URL: https://issues.apache.org/jira/browse/MESOS-8172
 Project: Mesos
  Issue Type: Bug
  Components: executor, security
Affects Versions: 1.4.0
 Environment: Ubuntu 16.04.3 with mesos 1.4.0 compiled from source 
tarball.
Reporter: Dan Leary


Apparently the mesos-agent authenticate_http_executors commandline arg was 
introduced in 1.3.0 by MESOS-6365.   But running "mesos-agent 
--authenticate_http_executors ..." in 1.4.0 yields
{noformat}
Failed to load unknown flag 'authenticate_http_executors'
{noformat}
...followed by a usage report that does not include 
"--authenticate_http_executors".
Presumably this means executor authentication is no longer configurable.
It is still documented at 
https://mesos.apache.org/documentation/latest/authentication/#agent




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7924) Add a javascript linter to the webui.

2017-11-06 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-7924:
--
Sprint: Mesosphere Sprint 63, Mesosphere Sprint 64, Mesosphere Sprint 67  
(was: Mesosphere Sprint 63, Mesosphere Sprint 64)

> Add a javascript linter to the webui.
> -
>
> Key: MESOS-7924
> URL: https://issues.apache.org/jira/browse/MESOS-7924
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Benjamin Mahler
>Assignee: Armand Grillet
>  Labels: tech-debt
> Fix For: 1.5.0
>
>
> As far as I can tell, javascript linters (e.g. ESLint) help catch some 
> functional errors as well; for example, we've made some "strict" mistakes a 
> few times that ESLint can catch: MESOS-6624, MESOS-7912.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2017-11-06 Thread Andor Molnar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240294#comment-16240294
 ] 

Andor Molnar commented on MESOS-4065:
-

[~tillt]

Hi,
We've started to review your code changes. 
If the patch is still required on the ZooKeeper side, please come over to 
GitHub and elaborate on the use case a little bit.
Thanks.
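
For context, a minimal sketch of the usual mitigation for this class of leak 
(illustrative only; the real change belongs in the ZooKeeper client / Mesos 
plumbing discussed in the patch): mark the socket close-on-exec so forked 
executor processes do not inherit it.

{code}
#include <fcntl.h>

// Hypothetical helper: set FD_CLOEXEC on a descriptor so it is closed
// across exec() and is not leaked to child processes such as executors.
bool setCloexec(int fd)
{
  int flags = fcntl(fd, F_GETFD);
  if (flags == -1) {
    return false;
  }

  return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) != -1;
}
{code}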

> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1, 0.25.0, 1.2.2
>Reporter: James DeFelice
>  Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
> ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
> --data-dir=etcd_data --name=etcd-1449178273 
> --listen-peer-urls=http://10.0.0.45:1025 
> --initial-advertise-peer-urls=http://10.0.0.45:1025 
> --listen-client-urls=http://10.0.0.45:1026 
> --advertise-client-urls=http://10.0.0.45:1026 
> --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
>  --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
> --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
> /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
> --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> {code}
> I only tested against mesos 0.24.1 and 0.25.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error

2017-11-06 Thread Rob Johnson (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240156#comment-16240156
 ] 

Rob Johnson commented on MESOS-7966:


Sorry I missed this - I'll take a look to see if we still have the master logs 
from that time.

> check for maintenance on agent causes fatal error
> -
>
> Key: MESOS-7966
> URL: https://issues.apache.org/jira/browse/MESOS-7966
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Rob Johnson
>Assignee: Armand Grillet
>Priority: Blocker
>  Labels: reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the API. This happens relatively frequently, and impacts us when downstream 
> frameworks (marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: 
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error

2017-11-06 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240148#comment-16240148
 ] 

Alexander Rukletsov commented on MESOS-7966:


[~robjohnson] do you still have master logs?

> check for maintenance on agent causes fatal error
> -
>
> Key: MESOS-7966
> URL: https://issues.apache.org/jira/browse/MESOS-7966
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Rob Johnson
>Assignee: Armand Grillet
>Priority: Blocker
>  Labels: reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the API. This happens relatively frequently, and impacts us when downstream 
> frameworks (marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: 
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7991) fatal, check failed !framework->recovered()

2017-11-06 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215386#comment-16215386
 ] 

Alexander Rukletsov edited comment on MESOS-7991 at 11/6/17 10:44 AM:
--

This could happen if we have a master failover and the agent re-registers and 
then re-registers again 
(https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629).
 The statement in 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070
 thus does not seem correct and the change 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073
 from the review request https://reviews.apache.org/r/53897/ that happened to 
follow this comment should be removed.

The strange thing is that the tasks are known to the master but not to the 
agent according to the logs (master.cpp:7568); the fact that the agent kept its 
id but not its tasks seems unlikely. [~drribosome] Could you give more context 
around the agent, the registration attempt and also the master logs since the 
failover and the agent logs around that timeframe?

We should write a test reproducing the issue -(having a master + agent, 
launching a task, restarting the master, blocking framework re-registration, 
letting the agent re-register twice by spoofing the second re-registration)- 
and then remove line 8073.


was (Author: armandgrillet):
This could happen if we have master failover, agent re-registers and then again 
re-registers 
(https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629).
 The statement in 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070
 thus does not seem correct and the change 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073
 from the review request https://reviews.apache.org/r/53897/ that happened to 
follow this comment should be removed.

The strange thing is that the tasks are known to the master but not to the 
agent according to the logs (master.cpp:7568), the fact that the agent kept its 
id but not its tasks seems unlikely. Could you give more context around the 
agent, the registration attempt and also the master logs since the failover and 
the agent logs around that timeframe?

We should write a test reproducing the issue -(having a master + agent, 
launching a task, restarting master, block framework re-registration, let agent 
re-registers twice by spoofing the second re-registration)- and then remove the 
line 8073.

> fatal, check failed !framework->recovered()
> ---
>
> Key: MESOS-7991
> URL: https://issues.apache.org/jira/browse/MESOS-7991
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Armand Grillet
>Priority: Blocker
>  Labels: reliability
>
> mesos master crashed on what appears to be framework recovery
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> {code}
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 
> 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 
> 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 
> 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task 
> e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task 
> f838a03c-5cd4-47eb-8606-69b004d89808 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 
> 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 

[jira] [Comment Edited] (MESOS-7991) fatal, check failed !framework->recovered()

2017-11-06 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215386#comment-16215386
 ] 

Alexander Rukletsov edited comment on MESOS-7991 at 11/6/17 10:43 AM:
--

This could happen if we have a master failover and the agent re-registers and 
then re-registers again 
(https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629).
 The statement in 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070
 thus does not seem correct and the change 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073
 from the review request https://reviews.apache.org/r/53897/ that happened to 
follow this comment should be removed.

The strange thing is that the tasks are known to the master but not to the 
agent according to the logs (master.cpp:7568); the fact that the agent kept its 
id but not its tasks seems unlikely. Could you give more context around the 
agent, the registration attempt and also the master logs since the failover and 
the agent logs around that timeframe?

We should write a test reproducing the issue -(having a master + agent, 
launching a task, restarting the master, blocking framework re-registration, 
letting the agent re-register twice by spoofing the second re-registration)- 
and then remove line 8073.


was (Author: armandgrillet):
This could happen if we have master failover, agent re-registers and then again 
re-registers 
(https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629).
 The statement in 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070
 thus does not seem correct and the change 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073
 from the review request https://reviews.apache.org/r/53897/ that happened to 
follow this comment should be removed.

The strange thing is that the tasks are known to the master but not to the 
agent according to the logs (master.cpp:7568), the fact that the agent kept its 
id but not its tasks seem unlikely. Could you give more context around the 
agent, the registration attempt and also the master logs since the failover and 
the agent logs around that timeframe?

We should write a test reproducing the issue -(having a master + agent, 
launching a task, restarting master, block framework re-registration, let agent 
re-registers twice by spoofing the second re-registration)- and then remove the 
line 8073.

> fatal, check failed !framework->recovered()
> ---
>
> Key: MESOS-7991
> URL: https://issues.apache.org/jira/browse/MESOS-7991
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Armand Grillet
>Priority: Blocker
>  Labels: reliability
>
> mesos master crashed on what appears to be framework recovery
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> {code}
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 
> 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 
> 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 
> 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task 
> e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task 
> f838a03c-5cd4-47eb-8606-69b004d89808 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 
> 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the