[jira] [Commented] (MESOS-2487) Ensure protobuf "==" operator does not go out of sync with new protobuf fields

2018-01-26 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341925#comment-16341925
 ] 

Benjamin Mahler commented on MESOS-2487:


An update: When looking at a recent review introducing another operator, I 
noticed protobuf now provides a {{MessageDifferencer}}:
https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.util.message_differencer

This has pretty nice functionality:
* Custom ignore criteria
* Ability to treat fields as sets, lists, or maps
* Reporting of differences

This could replace the majority of our custom ones! cc [~kaysoky]

> Ensure protobuf "==" operator does not go out of sync with new protobuf fields
> --
>
> Key: MESOS-2487
> URL: https://issues.apache.org/jira/browse/MESOS-2487
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Priority: Major
>
> Currently, when a new field is added to a protobuf message that has a custom "==" 
> operator defined, nothing ensures that the field is accounted for in the 
> comparison. Ideally we should catch such errors at build time or at 'make check' 
> time. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-3160) CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS Flaky

2018-01-26 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341866#comment-16341866
 ] 

Greg Mann commented on MESOS-3160:
--

In the testing I've done today, the most common reason for this failure is when 
the {{MemoryTestHelper}} receives EOF from the subprocess's output FD, [at this 
line|https://github.com/apache/mesos/blob/15fc434e47e026790a6f6dc8e974a8440d0b1bdf/src/tests/containerizer/memory_test_helper.cpp#L156].

Another failure mode I observed occurred at [this 
line|https://github.com/apache/mesos/blob/15fc434e47e026790a6f6dc8e974a8440d0b1bdf/src/tests/containerizer/cgroups_tests.cpp#L1163],
 with {{critical == 1}}.

> CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS Flaky
> 
>
> Key: MESOS-3160
> URL: https://issues.apache.org/jira/browse/MESOS-3160
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.0, 0.26.0
> Environment: Ubuntu 14.04
> CentOS 7
>Reporter: Paul Brett
>Assignee: Greg Mann
>Priority: Major
>  Labels: cgroups, flaky-test, mesosphere
>
> Test will occasionally fail with:
> [ RUN  ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseUnlockedRSS
> ../../src/tests/containerizer/cgroups_tests.cpp:1103: Failure
> helper.increaseRSS(getpagesize()): Failed to sync with the subprocess
> ../../src/tests/containerizer/cgroups_tests.cpp:1103: Failure
> helper.increaseRSS(getpagesize()): The subprocess has not been spawned yet
> [  FAILED  ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseUnlockedRSS 
> (223 ms)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.

2018-01-26 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-7258:
--
  Sprint: Mesosphere Sprint 73
Story Points: 5

> Provide scheduler calls to subscribe to additional roles and unsubscribe from 
> roles.
> 
>
> Key: MESOS-7258
> URL: https://issues.apache.org/jira/browse/MESOS-7258
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, scheduler api
>Reporter: Benjamin Mahler
>Assignee: Kapil Arya
>Priority: Major
>  Labels: multitenancy
>
> The current support for schedulers to subscribe to additional roles or 
> unsubscribe from some of their roles requires that the scheduler obtain a new 
> subscription with the master which invalidates the event stream.
> A more lightweight mechanism would be to provide calls for the scheduler to 
> subscribe to additional roles or unsubscribe from some roles such that the 
> existing event stream remains open and offers to the new roles arrive on the 
> existing event stream. E.g.
> SUBSCRIBE_TO_ROLE
>  UNSUBSCRIBE_FROM_ROLE
> One open question pertains to the terminology here, whether we would want to 
> avoid using "subscribe" in this context. An alternative would be:
> UPDATE_FRAMEWORK_INFO
> Which provides a generic mechanism for a framework to perform framework info 
> updates without obtaining a new event stream.
> In addition, it would be easier to use if it returned 200 on success and an 
> error response if invalid, rather than returning 202.
> *NOTE*: Not specific to this issue, but we need to figure out how to allow 
> the framework to not leak reservations, e.g. MESOS-7651.
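> To make this concrete, a hypothetical v1 scheduler API call body might look as 
> follows (the call type and field names here are illustrative only, not an 
> agreed design):
> {noformat}
> POST /api/v1/scheduler
>
> {
>   "framework_id": {"value": "<framework id>"},
>   "type": "SUBSCRIBE_TO_ROLE",
>   "subscribe_to_role": {
>     "roles": ["ads", "analytics"]
>   }
> }
> {noformat}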



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8501) Benchmark for framework with large number of roles

2018-01-26 Thread Kapil Arya (JIRA)
Kapil Arya created MESOS-8501:
-

 Summary: Benchmark for framework with large number of roles
 Key: MESOS-8501
 URL: https://issues.apache.org/jira/browse/MESOS-8501
 Project: Mesos
  Issue Type: Task
Reporter: Kapil Arya






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8500) Enhanced support for multi-role scalability

2018-01-26 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-8500:
--
Description: CC: [~bmahler]

> Enhanced support for multi-role scalability
> ---
>
> Key: MESOS-8500
> URL: https://issues.apache.org/jira/browse/MESOS-8500
> Project: Mesos
>  Issue Type: Epic
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>Priority: Major
>
> CC: [~bmahler]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8500) Enhanced support for multi-role scalability

2018-01-26 Thread Kapil Arya (JIRA)
Kapil Arya created MESOS-8500:
-

 Summary: Enhanced support for multi-role scalability
 Key: MESOS-8500
 URL: https://issues.apache.org/jira/browse/MESOS-8500
 Project: Mesos
  Issue Type: Epic
Reporter: Kapil Arya
Assignee: Kapil Arya






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8497) Docker parameter `name` does not work with Docker Containerizer.

2018-01-26 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8497:

 Labels: containerizer  (was: )
Component/s: containerization

> Docker parameter `name` does not work with Docker Containerizer.
> 
>
> Key: MESOS-8497
> URL: https://issues.apache.org/jira/browse/MESOS-8497
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jörg Schad
>Priority: Major
>  Labels: containerizer
> Attachments: agent.log, master.log
>
>
> When deploying a marathon app with Docker Containerizer (need to check Mesos 
> Containerizer) and the parameter name set, Mesos is not able to 
> recognize/control/kill the started container.
> Steps to reproduce 
>  # Deploy the below marathon app definition
>  #  Watch task being stuck in staging and mesos not being able to kill it
>  # Check on node and see container running, but not being recognized by mesos
> {noformat}
> {
> "id": "/docker-test",
> "instances": 1,
> "portDefinitions": [],
> "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
> "image": "ubuntu:16.04",
> "parameters": [
> {
> "key": "name",
> "value": "myname"
> }
> ]
> }
> },
> "cpus": 0.1,
> "mem": 128,
> "requirePorts": false,
> "networks": [],
> "healthChecks": [],
> "fetch": [],
> "constraints": [],
> "cmd": "sleep 1000"
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8497) Docker parameter `name` does not work with Docker Containerizer.

2018-01-26 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341667#comment-16341667
 ] 

Gilbert Song edited comment on MESOS-8497 at 1/26/18 10:10 PM:
---

The root cause is here: 
[https://github.com/apache/mesos/blob/1.5.0-rc1/src/docker/docker.cpp#L1083~#L1090]

We define the name before the parameters, so a user-defined `--name` in the 
parameters will overwrite the name Mesos gives to the docker container 
(background: Mesos relies on `--name` to identify docker containers; the docker 
container is named by Mesos as `mesos`). As a result, any new name passed into 
the parameters means Mesos can no longer find this container.

Two solutions:
1. Validate the parameters in master::validate(), to make sure no `--name` exists 
in the arbitrary docker parameters.
2. Have the docker containerizer return a failure if the name is overwritten by 
anything else.

#1 is probably the preferable and more straightforward solution.


was (Author: gilbert):
The root cause is here: 
[https://github.com/apache/mesos/blob/1.5.0-rc1/src/docker/docker.cpp#L1083~#L1090]

We define the name before the parameters, so a user-defined `--name` in the 
parameters will overwrite the name Mesos gives to the docker container 
(background: Mesos relies on `--name` to identify docker containers; the docker 
container is named by Mesos as `mesos-`). As a result, any new name passed 
into the parameters means Mesos can no longer find this container.

Two solutions:
1. Validate the parameters in master::validate(), to make sure no `--name` exists 
in the arbitrary docker parameters.
2. Have the docker containerizer return a failure if the name is overwritten by 
anything else.

#1 is probably the preferable and more straightforward solution.

> Docker parameter `name` does not work with Docker Containerizer.
> 
>
> Key: MESOS-8497
> URL: https://issues.apache.org/jira/browse/MESOS-8497
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jörg Schad
>Priority: Major
>  Labels: containerizer
> Attachments: agent.log, master.log
>
>
> When deploying a marathon app with Docker Containerizer (need to check Mesos 
> Containerizer) and the parameter name set, Mesos is not able to 
> recognize/control/kill the started container.
> Steps to reproduce 
>  # Deploy the below marathon app definition
>  #  Watch task being stuck in staging and mesos not being able to kill it
>  # Check on node and see container running, but not being recognized by mesos
> {noformat}
> {
> "id": "/docker-test",
> "instances": 1,
> "portDefinitions": [],
> "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
> "image": "ubuntu:16.04",
> "parameters": [
> {
> "key": "name",
> "value": "myname"
> }
> ]
> }
> },
> "cpus": 0.1,
> "mem": 128,
> "requirePorts": false,
> "networks": [],
> "healthChecks": [],
> "fetch": [],
> "constraints": [],
> "cmd": "sleep 1000"
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8497) Docker parameter `name` does not work with Docker Containerizer.

2018-01-26 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341667#comment-16341667
 ] 

Gilbert Song commented on MESOS-8497:
-

The root cause is here: 
[https://github.com/apache/mesos/blob/1.5.0-rc1/src/docker/docker.cpp#L1083~#L1090]

We define the name before the parameters, so a user-defined `--name` in the 
parameters will overwrite the name Mesos gives to the docker container 
(background: Mesos relies on `--name` to identify docker containers; the docker 
container is named by Mesos as `mesos-`). As a result, any new name passed 
into the parameters means Mesos can no longer find this container.

Two solutions:
1. Validate the parameters in master::validate(), to make sure no `--name` exists 
in the arbitrary docker parameters.
2. Have the docker containerizer return a failure if the name is overwritten by 
anything else.

#1 is probably the preferable and more straightforward solution.

> Docker parameter `name` does not work with Docker Containerizer.
> 
>
> Key: MESOS-8497
> URL: https://issues.apache.org/jira/browse/MESOS-8497
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jörg Schad
>Priority: Major
> Attachments: agent.log, master.log
>
>
> When deploying a marathon app with Docker Containerizer (need to check Mesos 
> Containerizer) and the parameter name set, Mesos is not able to 
> recognize/control/kill the started container.
> Steps to reproduce 
>  # Deploy the below marathon app definition
>  #  Watch task being stuck in staging and mesos not being able to kill it
>  # Check on node and see container running, but not being recognized by mesos
> {noformat}
> {
> "id": "/docker-test",
> "instances": 1,
> "portDefinitions": [],
> "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
> "image": "ubuntu:16.04",
> "parameters": [
> {
> "key": "name",
> "value": "myname"
> }
> ]
> }
> },
> "cpus": 0.1,
> "mem": 128,
> "requirePorts": false,
> "networks": [],
> "healthChecks": [],
> "fetch": [],
> "constraints": [],
> "cmd": "sleep 1000"
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8498) Enable docker health checks on Windows

2018-01-26 Thread Akash Gupta (JIRA)
Akash Gupta created MESOS-8498:
--

 Summary: Enable docker health checks on Windows
 Key: MESOS-8498
 URL: https://issues.apache.org/jira/browse/MESOS-8498
 Project: Mesos
  Issue Type: Task
Reporter: Akash Gupta
Assignee: Akash Gupta


Currently, the docker health checks do not work on Windows. They use a 
Linux-only method of switching process namespaces, which does not work on 
Windows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8497) Docker parameter `name` does not work with Docker Containerizer.

2018-01-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörg Schad updated MESOS-8497:
--
Attachment: master.log
agent.log

> Docker parameter `name` does not work with Docker Containerizer.
> 
>
> Key: MESOS-8497
> URL: https://issues.apache.org/jira/browse/MESOS-8497
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jörg Schad
>Priority: Major
> Attachments: agent.log, master.log
>
>
> When deploying a marathon app with Docker Containerizer (need to check Mesos 
> Containerizer) and the parameter name set, Mesos is not able to 
> recognize/control/kill the started container.
> Steps to reproduce 
>  # Deploy the below marathon app definition
>  #  Watch task being stuck in staging and mesos not being able to kill it
>  # Check on node and see container running, but not being recognized by mesos
> {noformat}
> {
> "id": "/docker-test",
> "instances": 1,
> "portDefinitions": [],
> "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
> "image": "ubuntu:16.04",
> "parameters": [
> {
> "key": "name",
> "value": "myname"
> }
> ]
> }
> },
> "cpus": 0.1,
> "mem": 128,
> "requirePorts": false,
> "networks": [],
> "healthChecks": [],
> "fetch": [],
> "constraints": [],
> "cmd": "sleep 1000"
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8497) Docker parameter `name` does not work with Docker Containerizer.

2018-01-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341655#comment-16341655
 ] 

Jörg Schad commented on MESOS-8497:
---

[^agent.log]

[^master.log]

> Docker parameter `name` does not work with Docker Containerizer.
> 
>
> Key: MESOS-8497
> URL: https://issues.apache.org/jira/browse/MESOS-8497
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jörg Schad
>Priority: Major
> Attachments: agent.log, master.log
>
>
> When deploying a marathon app with Docker Containerizer (need to check Mesos 
> Containerizer) and the parameter name set, Mesos is not able to 
> recognize/control/kill the started container.
> Steps to reproduce 
>  # Deploy the below marathon app definition
>  #  Watch task being stuck in staging and mesos not being able to kill it
>  # Check on node and see container running, but not being recognized by mesos
> {noformat}
> {
> "id": "/docker-test",
> "instances": 1,
> "portDefinitions": [],
> "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
> "image": "ubuntu:16.04",
> "parameters": [
> {
> "key": "name",
> "value": "myname"
> }
> ]
> }
> },
> "cpus": 0.1,
> "mem": 128,
> "requirePorts": false,
> "networks": [],
> "healthChecks": [],
> "fetch": [],
> "constraints": [],
> "cmd": "sleep 1000"
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8184) Implement master's AcknowledgeOfferOperationMessage handler.

2018-01-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291303#comment-16291303
 ] 

Gastón Kleiman edited comment on MESOS-8184 at 1/26/18 9:51 PM:


https://reviews.apache.org/r/65357/
https://reviews.apache.org/r/65358/ 
https://reviews.apache.org/r/65359/
https://reviews.apache.org/r/65360/
https://reviews.apache.org/r/65361/
https://reviews.apache.org/r/65300/
https://reviews.apache.org/r/65362/
https://reviews.apache.org/r/64618/


was (Author: gkleiman):
[https://reviews.apache.org/r/65300/]

[https://reviews.apache.org/r/64618/]

> Implement master's AcknowledgeOfferOperationMessage handler.
> 
>
> Key: MESOS-8184
> URL: https://issues.apache.org/jira/browse/MESOS-8184
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: mesosphere
>
> This handler should validate the message and forward it to the corresponding 
> agent/ERP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8497) Docker parameter `name` does not work with Docker Containerizer.

2018-01-26 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341588#comment-16341588
 ] 

Vinod Kone commented on MESOS-8497:
---

[~js84] Can you share the master and agent logs?

> Docker parameter `name` does not work with Docker Containerizer.
> 
>
> Key: MESOS-8497
> URL: https://issues.apache.org/jira/browse/MESOS-8497
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jörg Schad
>Priority: Major
>
> When deploying a marathon app with Docker Containerizer (need to check Mesos 
> Containerizer) and the parameter name set, Mesos is not able to 
> recognize/control/kill the started container.
> Steps to reproduce 
>  # Deploy the below marathon app definition
>  #  Watch task being stuck in staging and mesos not being able to kill it
>  # Check on node and see container running, but not being recognized by mesos
> {noformat}
> {
> "id": "/docker-test",
> "instances": 1,
> "portDefinitions": [],
> "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
> "image": "ubuntu:16.04",
> "parameters": [
> {
> "key": "name",
> "value": "myname"
> }
> ]
> }
> },
> "cpus": 0.1,
> "mem": 128,
> "requirePorts": false,
> "networks": [],
> "healthChecks": [],
> "fetch": [],
> "constraints": [],
> "cmd": "sleep 1000"
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8497) Docker parameter `name` does not work with Docker Containerizer.

2018-01-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/MESOS-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörg Schad updated MESOS-8497:
--
Description: 
When deploying a marathon app with Docker Containerizer (need to check Mesos 
Containerizer) and the parameter name set, Mesos is not able to 
recognize/control/kill the started container.

Steps to reproduce 
 # Deploy the below marathon app definition
 #  Watch task being stuck in staging and mesos not being able to kill it
 # Check on node and see container running, but not being recognized by mesos

{noformat}

{
"id": "/docker-test",
"instances": 1,
"portDefinitions": [],
"container": {
"type": "DOCKER",
"volumes": [],
"docker": {
"image": "ubuntu:16.04",
"parameters": [
{
"key": "name",
"value": "myname"
}
]
}
},
"cpus": 0.1,
"mem": 128,
"requirePorts": false,
"networks": [],
"healthChecks": [],
"fetch": [],
"constraints": [],
"cmd": "sleep 1000"
}

{noformat}

  was:
When deploying a marathon app with Docker Containerizer (need to check Mesos 
Containerizer) and the parameter name set, Mesos is not able to 
recognize/control/kill the started container.

Steps to reproduce 
 # Deploy the below marathon app definition
 #  Watch task being stuck in staging and mesos not being able to kill it
 # Check on node and see container running, but not being recognized by mesos

{quote}{
"id": "/docker-test",
"instances": 1,
"portDefinitions": [],
"container": {
"type": "DOCKER",
"volumes": [],
"docker": {
"image": "ubuntu:16.04",
"parameters": [
{
"key": "name",
"value": "myname"
}
]
}
},
"cpus": 0.1,
"mem": 128,
"requirePorts": false,
"networks": [],
"healthChecks": [],
"fetch": [],
"constraints": [],
"cmd": "sleep 1000"
}
{quote}


> Docker parameter `name` does not work with Docker Containerizer.
> 
>
> Key: MESOS-8497
> URL: https://issues.apache.org/jira/browse/MESOS-8497
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jörg Schad
>Priority: Major
>
> When deploying a marathon app with Docker Containerizer (need to check Mesos 
> Containerizer) and the parameter name set, Mesos is not able to 
> recognize/control/kill the started container.
> Steps to reproduce 
>  # Deploy the below marathon app definition
>  #  Watch task being stuck in staging and mesos not being able to kill it
>  # Check on node and see container running, but not being recognized by mesos
> {noformat}
> {
> "id": "/docker-test",
> "instances": 1,
> "portDefinitions": [],
> "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
> "image": "ubuntu:16.04",
> "parameters": [
> {
> "key": "name",
> "value": "myname"
> }
> ]
> }
> },
> "cpus": 0.1,
> "mem": 128,
> "requirePorts": false,
> "networks": [],
> "healthChecks": [],
> "fetch": [],
> "constraints": [],
> "cmd": "sleep 1000"
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8485) MasterTest.RegistryGcByCount is flaky

2018-01-26 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341377#comment-16341377
 ] 

Benno Evers commented on MESOS-8485:


Review posted at: https://reviews.apache.org/r/65354

> MasterTest.RegistryGcByCount is flaky
> -
>
> Key: MESOS-8485
> URL: https://issues.apache.org/jira/browse/MESOS-8485
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed this while testing Mesos 1.5.0-rc1 in ASF CI.
>  
> {code}
> 3: [ RUN      ] MasterTest.RegistryGcByCount
> ..snip...
> 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master
> 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master 
> master@172.17.0.2:45634
> 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 
> authenticatee
> 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client 
> SASL connection
> 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating 
> slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication 
> session for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL 
> connection
> 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL 
> authentication mechanisms: CRAM-MD5
> 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to 
> authenticate with mechanism 'CRAM-MD5'
> 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL 
> authentication start
> 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires 
> more steps
> 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL 
> authentication step
> 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL 
> authentication step
> 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: false 
> 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*userPassword'
> 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*cmusaslsecretCRAM-MD5'
> 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: true 
> 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*userPassword' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934324 15987 authenticator.cpp:318] Authentication success
> 3: I0123 19:22:05.934463 15995 authenticatee.cpp:299] Authentication success
> 3: I0123 19:22:05.934563 16002 master.cpp:8988] Successfully authenticated 
> principal 'test-principal' at slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.934708 15993 authenticator.cpp:432] Authentication session 
> cleanup for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.934891 15995 slave.cpp:1320] Successfully authenticated 
> with master master@172.17.0.2:45634
> 3: I0123 19:22:05.935261 15995 slave.cpp:1764] Will retry registration in 
> 2.234083ms if necessary
> 3: I0123 19:22:05.935436 15999 master.cpp:6061] Received register agent 
> message from slave(442)@172.17.0.2:45634 (455912973e2c)
> 3: I0123 19:22:05.935662 15999 master.cpp:3867] Authorizing agent with 
> principal 'test-principal'
> 3: I0123 19:22:05.936161 15992 master.cpp:6123] Authorized registration of 
> agent at slave(442)@172.17.0.2:45634 (455912973e2c)
> 3: I0123 19:22:05.936261 15992 master.cpp:6234] Registering agent at 
> slave(442)@172.17.0.2:45634 (455912973e2c) with id 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0
> 3: I0123 19:22:05.936993 15989 registrar.cpp:495] Applied 1 operations in 
> 227911ns; attempting to update the registry
> 3: I0123 19:22:05.937814 15989 registrar.cpp:552] Successfully updated the 
> registry in 743168ns
> 3: I0123 19:22:05.938057 15991 master.cpp:6282] Admitted agent 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 
> (455912973e2c)
> 3: I0123 19:22:05.938891 15991 master.cpp:6331] Registered agent 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 
> (455912973e2c) with cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> 3: I0123 19:22:05.939159 16002 

[jira] [Commented] (MESOS-8485) MasterTest.RegistryGcByCount is flaky

2018-01-26 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341216#comment-16341216
 ] 

Benno Evers commented on MESOS-8485:


This is fairly reproducible when putting the test machine under heavy load 
(i.e. ca. 1 failure per 3000 runs when I'm compiling Mesos with 24 threads at 
the same time)

 

What happens is the following:

The test case is starting two different instances of `mesos-agent`, marking 
both of them as gone, and forcing one of them to be garbage collected. It 
expects that after this is done, one of the slaves will be marked as "gone" and 
the other will be unknown. To get the agent ids of the agents as they register, 
the following code is used:

 
{noformat}
  Future<SlaveRegisteredMessage> slaveRegisteredMessage =
    FUTURE_PROTOBUF(SlaveRegisteredMessage(), master.get()->pid, _);
  Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), slaveFlags);
  AWAIT_READY(slaveRegisteredMessage);

  [...] (the slave is marked as gone here)

  Future<SlaveRegisteredMessage> slaveRegisteredMessage2 =
    FUTURE_PROTOBUF(SlaveRegisteredMessage(), master.get()->pid, _);
  Try<Owned<cluster::Slave>> slave2 = StartSlave(detector.get(), slaveFlags2);
  AWAIT_READY(slaveRegisteredMessage2);{noformat}
 

In the failure case, the registration of the first agent works as follows:
{noformat}
agent0: Sends RegisterSlaveMessage
master: Does registration, adds SlaveRegisteredMessage to outbound message queue
agent0: Didn't get an answer after timeout, resends RegisterSlaveMessage
agent0: Gets the previously sent SlaveRegisteredMessage
master: Gets the second RegisterSlaveMessage, notices that agent0 is already 
registered and just resends the SlaveRegisteredMessage
test: Proceeds to mark agent0 as gone, creates the 
Future<SlaveRegisteredMessage> for agent1
test: The future is satisfied by the second SlaveRegisteredMessage sent by the 
master{noformat}
This leads the test code to think that agent1 has the agent id of agent0, which 
causes the subsequent test failure.

 

Mesos basically works correctly here, so the correct fix seems to be to rewrite 
the test to wait for a `SlaveRegisteredMessage` that is actually destined for 
the correct pid.

 

 

> MasterTest.RegistryGcByCount is flaky
> -
>
> Key: MESOS-8485
> URL: https://issues.apache.org/jira/browse/MESOS-8485
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed this while testing Mesos 1.5.0-rc1 in ASF CI.
>  
> {code}
> 3: [ RUN      ] MasterTest.RegistryGcByCount
> ..snip...
> 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master
> 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master 
> master@172.17.0.2:45634
> 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 
> authenticatee
> 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client 
> SASL connection
> 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating 
> slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication 
> session for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL 
> connection
> 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL 
> authentication mechanisms: CRAM-MD5
> 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to 
> authenticate with mechanism 'CRAM-MD5'
> 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL 
> authentication start
> 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires 
> more steps
> 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL 
> authentication step
> 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL 
> authentication step
> 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: false 
> 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*userPassword'
> 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*cmusaslsecretCRAM-MD5'
> 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: true 
> 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*userPassword' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property 
> 

[jira] [Commented] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused

2018-01-26 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341146#comment-16341146
 ] 

Qian Zhang commented on MESOS-8125:
---

{quote}Also looks like docker containerizer doesn't recover the executor pid!?
{quote}
Yes, I have verified that after agent recovery `container->executorPid` is 
`None()`. We should have set it in `DockerContainerizerProcess::_recover`.
{quote}We should fix `_recover` to do `container->status.set(None())` when the 
container->pid is None(). 
{quote}
I think there are two cases that we need to handle:
 # The Docker container was stopped while the agent was down: In this case, when the agent 
recovers, `container->pid` will be `None()` in 
`DockerContainerizerProcess::_recover` (we can get this info from the method's 
second parameter `_containers`), and we should do `container->status.set(None())`.
 # The Docker container was removed while the agent was down: In this case, when the agent 
recovers, we will not find the relevant Docker container in `_containers` in 
`DockerContainerizerProcess::_recover`, and we should do 
`container->status.set(None())` as well.
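The two cases above boil down to one small predicate. The following is an illustrative model only, not the real Mesos code: `shouldSetStatusNone` is a made-up name, and `std::optional` stands in for Mesos's `Option`/`None()`:

```cpp
#include <optional>

// Hypothetical, simplified model of the decision in
// DockerContainerizerProcess::_recover (illustrative names, not the real
// Mesos API): given whether Docker still knows about the container and
// whether a forked pid was recovered for it, should the container's
// `status` future be completed with None()?
bool shouldSetStatusNone(bool foundInDocker, std::optional<int> recoveredPid) {
  if (!foundInDocker) {
    // Case 2: the container was removed while the agent was down.
    return true;
  }
  if (!recoveredPid.has_value()) {
    // Case 1: the container was stopped while the agent was down, so no
    // pid could be recovered for it.
    return true;
  }
  // The container is still running; leave the status future pending.
  return false;
}
```

In both terminal cases the point is the same: complete the status future with "no exit status" instead of leaving it pending forever.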

> Agent should properly handle recovering an executor when its pid is reused
> --
>
> Key: MESOS-8125
> URL: https://issues.apache.org/jira/browse/MESOS-8125
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Priority: Critical
>
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen 
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta 
> directory, e.g., 
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
>  I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host
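Step 3 of the reproduction can be sketched as follows. This is an illustrative snippet, not part of Mesos: `corruptForkedPid` is a made-up helper, and `metaDir` stands in for the executor's run directory under the real `/var/lib/mesos/slave/meta` tree:

```cpp
#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Illustrative sketch of step 3: overwrite the checkpointed forked.pid
// with pid 2 (normally kthreadd), so that on recovery the agent reads a
// pid that belongs to a different process.
std::string corruptForkedPid(const fs::path& metaDir) {
  const fs::path pidFile = metaDir / "pids" / "forked.pid";
  fs::create_directories(pidFile.parent_path());

  std::ofstream(pidFile) << "2";  // replaces the executor's real pid

  // Read it back, as the recovering agent would.
  std::ifstream in(pidFile);
  std::stringstream contents;
  contents << in.rdbuf();
  return contents.str();
}
```

After the reboot, pid 2 exists but belongs to a kernel thread, which is what confuses the agent's recovery.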



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8490) UpdateSlaveMessageWithPendingOffers is flaky.

2018-01-26 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht reassigned MESOS-8490:
---

Assignee: Jan Schlicht  (was: Benjamin Bannier)

> UpdateSlaveMessageWithPendingOffers is flaky.
> -
>
> Key: MESOS-8490
> URL: https://issues.apache.org/jira/browse/MESOS-8490
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6 with SSL
> Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Jan Schlicht
>Priority: Major
>  Labels: flaky-test
> Attachments: UpdateSlaveMessageWithPendingOffers-badrun1.txt, 
> UpdateSlaveMessageWithPendingOffers-badrun2.txt
>
>
> {noformat}
> ../../src/tests/master_tests.cpp:8728
> Failed to wait 15secs for offers
> {noformat}
> Full logs attached. Log output from two failures looks different, might be an 
> indicator of multiple issues.





[jira] [Created] (MESOS-8495) Use folder structure in Roles tab

2018-01-26 Thread Armand Grillet (JIRA)
Armand Grillet created MESOS-8495:
-

 Summary: Use folder structure in Roles tab
 Key: MESOS-8495
 URL: https://issues.apache.org/jira/browse/MESOS-8495
 Project: Mesos
  Issue Type: Task
  Components: webui
Reporter: Armand Grillet
Assignee: Armand Grillet
 Attachments: Screen Shot 2018-01-26 à 13.21.06.png

Current table structure:

!Screen Shot 2018-01-26 à 13.21.06.png!
Instead, we should display by default:
 * slave_public
 * slave_public/dcos-edgelb_pools_edgelb-persistent-pool-role
 * slave_public/kubernetes-role

Even better, we should have a way to collapse all the {{slave_public/}} roles.





[jira] [Commented] (MESOS-8474) Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky

2018-01-26 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340938#comment-16340938
 ] 

Alexander Rukletsov commented on MESOS-8474:


Disabled the test for now.

> Test StorageLocalResourceProviderTest.ROOT_ConvertPreExistingVolume is flaky
> 
>
> Key: MESOS-8474
> URL: https://issues.apache.org/jira/browse/MESOS-8474
> Project: Mesos
>  Issue Type: Bug
>  Components: storage, test
>Affects Versions: 1.5.0
>Reporter: Benjamin Bannier
>Assignee: Chun-Hung Hsiao
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: consoleText.txt, consoleText.txt
>
>
> Observed on our internal CI on ubuntu16.04 with SSL and GRPC enabled,
> {noformat}
> ../../src/tests/storage_local_resource_provider_tests.cpp:1898
>   Expected: 2u
>   Which is: 2
> To be equal to: destroyed.size()
>   Which is: 1
> {noformat}





[jira] [Commented] (MESOS-8232) SlaveTest.RegisteredAgentReregisterAfterFailover is flaky.

2018-01-26 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340936#comment-16340936
 ] 

Alexander Rukletsov commented on MESOS-8232:


Disabled the test for now.

> SlaveTest.RegisteredAgentReregisterAfterFailover is flaky.
> --
>
> Key: MESOS-8232
> URL: https://issues.apache.org/jira/browse/MESOS-8232
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 17.04
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test
> Attachments: RegisteredAgentReregisterAfterFailover-badrun.txt, 
> RegisteredAgentReregisterAfterFailover-badrun2.txt
>
>
> Observed it in our CI:
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/slave_tests.cpp:3740
> Mock function called more times than expected - taking default action 
> specified at:
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/mock_registrar.cpp:54:
> Function call: apply(16-byte object <60-F1 01-F4 38-7F 00-00 90-D0 02-F4 
> 38-7F 00-00>)
>   Returns: 16-byte object  00-00>
>  Expected: to be never called
>Actual: called once - over-saturated and active
> {noformat}
> Full log attached.





[jira] [Commented] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

2018-01-26 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340939#comment-16340939
 ] 

Alexander Rukletsov commented on MESOS-8336:


Disabled the test for now.

> MasterTest.RegistryUpdateAfterReconfiguration is flaky
> --
>
> Key: MESOS-8336
> URL: https://issues.apache.org/jira/browse/MESOS-8336
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
> Attachments: RegistryUpdateAfterReconfiguration-badrun.txt
>
>
> Observed here: 
> https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/
> The test here failed because the registry contained 2 slaves, when it should 
> have only one.
> Looking through the log, everything seems normal (in particular, only 1 slave 
> id appears throughout this test). The only thing out of the ordinary seems to 
> be the agent sending two `RegisterSlaveMessage`s and two 
> `ReregisterSlaveMessage`s, but looking at the code for generating the random 
> backoff factor in the slave that seems to be more or less normal, and 
> shouldn't break the test.





[jira] [Commented] (MESOS-8490) UpdateSlaveMessageWithPendingOffers is flaky.

2018-01-26 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340935#comment-16340935
 ] 

Alexander Rukletsov commented on MESOS-8490:


Disabled the test for now.

> UpdateSlaveMessageWithPendingOffers is flaky.
> -
>
> Key: MESOS-8490
> URL: https://issues.apache.org/jira/browse/MESOS-8490
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: CentOS 6 with SSL
> Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: flaky-test
> Attachments: UpdateSlaveMessageWithPendingOffers-badrun1.txt, 
> UpdateSlaveMessageWithPendingOffers-badrun2.txt
>
>
> {noformat}
> ../../src/tests/master_tests.cpp:8728
> Failed to wait 15secs for offers
> {noformat}
> Full logs attached. Log output from two failures looks different, might be an 
> indicator of multiple issues.





[jira] [Updated] (MESOS-8210) ReconciliationTest.RemovalInProgress is flaky.

2018-01-26 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8210:
---
Attachment: RemovalInProgress-badrun2.txt

> ReconciliationTest.RemovalInProgress is flaky.
> --
>
> Key: MESOS-8210
> URL: https://issues.apache.org/jira/browse/MESOS-8210
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test
> Attachments: RemovalInProgress-badrun.txt, 
> RemovalInProgress-badrun2.txt
>
>
> Observed it today on our internal CI:
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/reconciliation_tests.cpp:655
> Mock function called more times than expected - taking default action 
> specified at:
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/mock_registrar.cpp:54:
> Function call: apply(16-byte object  D0-7F 00-00>)
>   Returns: 16-byte object <90-C5 04-00 D0-7F 00-00 F0-DB 05-00 D0-7F 
> 00-00>
>  Expected: to be called once
>Actual: called twice - over-saturated and active
> {noformat}
> Full log attached.





[jira] [Commented] (MESOS-8210) ReconciliationTest.RemovalInProgress is flaky.

2018-01-26 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340934#comment-16340934
 ] 

Alexander Rukletsov commented on MESOS-8210:


Disabled the test for now.

> ReconciliationTest.RemovalInProgress is flaky.
> --
>
> Key: MESOS-8210
> URL: https://issues.apache.org/jira/browse/MESOS-8210
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test
> Attachments: RemovalInProgress-badrun.txt, 
> RemovalInProgress-badrun2.txt
>
>
> Observed it today on our internal CI:
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/reconciliation_tests.cpp:655
> Mock function called more times than expected - taking default action 
> specified at:
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/mock_registrar.cpp:54:
> Function call: apply(16-byte object  D0-7F 00-00>)
>   Returns: 16-byte object <90-C5 04-00 D0-7F 00-00 F0-DB 05-00 D0-7F 
> 00-00>
>  Expected: to be called once
>Actual: called twice - over-saturated and active
> {noformat}
> Full log attached.





[jira] [Updated] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

2018-01-26 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8336:
---
Labels: flaky-test  (was: )

> MasterTest.RegistryUpdateAfterReconfiguration is flaky
> --
>
> Key: MESOS-8336
> URL: https://issues.apache.org/jira/browse/MESOS-8336
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
> Attachments: RegistryUpdateAfterReconfiguration-badrun.txt
>
>
> Observed here: 
> https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/
> The test here failed because the registry contained 2 slaves, when it should 
> have only one.
> Looking through the log, everything seems normal (in particular, only 1 slave 
> id appears throughout this test). The only thing out of the ordinary seems to 
> be the agent sending two `RegisterSlaveMessage`s and two 
> `ReregisterSlaveMessage`s, but looking at the code for generating the random 
> backoff factor in the slave that seems to be more or less normal, and 
> shouldn't break the test.





[jira] [Updated] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

2018-01-26 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8336:
---
Attachment: RegistryUpdateAfterReconfiguration-badrun.txt

> MasterTest.RegistryUpdateAfterReconfiguration is flaky
> --
>
> Key: MESOS-8336
> URL: https://issues.apache.org/jira/browse/MESOS-8336
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
> Attachments: RegistryUpdateAfterReconfiguration-badrun.txt
>
>
> Observed here: 
> https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/
> The test here failed because the registry contained 2 slaves, when it should 
> have only one.
> Looking through the log, everything seems normal (in particular, only 1 slave 
> id appears throughout this test). The only thing out of the ordinary seems to 
> be the agent sending two `RegisterSlaveMessage`s and two 
> `ReregisterSlaveMessage`s, but looking at the code for generating the random 
> backoff factor in the slave that seems to be more or less normal, and 
> shouldn't break the test.





[jira] [Updated] (MESOS-8232) SlaveTest.RegisteredAgentReregisterAfterFailover is flaky.

2018-01-26 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8232:
---
Attachment: RegisteredAgentReregisterAfterFailover-badrun2.txt

> SlaveTest.RegisteredAgentReregisterAfterFailover is flaky.
> --
>
> Key: MESOS-8232
> URL: https://issues.apache.org/jira/browse/MESOS-8232
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 17.04
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test
> Attachments: RegisteredAgentReregisterAfterFailover-badrun.txt, 
> RegisteredAgentReregisterAfterFailover-badrun2.txt
>
>
> Observed it in our CI:
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/slave_tests.cpp:3740
> Mock function called more times than expected - taking default action 
> specified at:
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/mock_registrar.cpp:54:
> Function call: apply(16-byte object <60-F1 01-F4 38-7F 00-00 90-D0 02-F4 
> 38-7F 00-00>)
>   Returns: 16-byte object  00-00>
>  Expected: to be never called
>Actual: called once - over-saturated and active
> {noformat}
> Full log attached.


