[jira] [Updated] (MESOS-7967) Make `mesos-execute` work with old-style resources

2017-10-23 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7967:

Description: 
{{mesos-execute}} should be updated to be able to handle the
"pre-reservation-refinement" resource format.

For reservation refinement, a new resource format was introduced.
The master and agent have been carefully updated to be able to handle
pre/post reservation-refinement resource formats, whereas the example
frameworks and {{mesos-execute}} were updated such that they require
the new resource format. While the example frameworks are probably fine
being updated to use the new format, {{mesos-execute}} is used as a
developer tool, and as such we should update it to be more robust in its
handling of resource formats.

  was:
{{mesos-execute}} should be updated to be able to handle 
"pre-reservation-refinement" resource format.

For reservation refinement, a new resource format was introduced.
The master and agent have been carefully updated to be able to handle
pre/post reservation-refinement resource formats, whereas the example
frameworks and {{mesos-execute}} were updated such that they require
the new resource format. While the example frameworks are probably fine
being updated to use the new format, {{mesos-execute}} is used as a
developer tool, and as such we should update it to be more robust in its
handling of resource formats.


> Make `mesos-execute` work with old-style resources
> --
>
> Key: MESOS-7967
> URL: https://issues.apache.org/jira/browse/MESOS-7967
> Project: Mesos
>  Issue Type: Improvement
>  Components: cli
>Reporter: Michael Park
>
> {{mesos-execute}} should be updated to be able to handle the
> "pre-reservation-refinement" resource format.
> For reservation refinement, a new resource format was introduced.
> The master and agent have been carefully updated to be able to handle
> pre/post reservation-refinement resource formats, whereas the example
> frameworks and {{mesos-execute}} were updated such that they require
> the new resource format. While the example frameworks are probably fine
> being updated to use the new format, {{mesos-execute}} is used as a
> developer tool, and as such we should update it to be more robust in its
> handling of resource formats.
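A sketch of what "more robust handling" could look like, using hypothetical simplified types (not the actual Mesos {{Resource}} protobuf): detect which format a resource is in and upgrade it to the canonical post-refinement form on the way in.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the Resource protobuf: the
// pre-refinement format encodes a single reservation via `role`, while
// the post-refinement format uses a `reservations` stack and leaves
// `role` as the default "*".
struct Resource {
  std::string name;
  double scalar;
  std::string role;                       // pre-refinement encoding
  std::vector<std::string> reservations;  // post-refinement encoding
};

// True if the resource is in the old, pre-reservation-refinement format.
bool isPreRefinementFormat(const Resource& r) {
  return r.reservations.empty() && !r.role.empty() && r.role != "*";
}

// Upgrade to the canonical format: move the single `role` reservation
// onto the `reservations` stack so downstream code sees one format only.
Resource upgradeToPostRefinement(Resource r) {
  if (isPreRefinementFormat(r)) {
    r.reservations.push_back(r.role);
    r.role = "*";
  }
  return r;
}
```

With a normalization step like this at the input boundary, the rest of the tool can assume a single format regardless of which one the user supplied.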



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8126) Consider decoupling the authorization logic from response creation.

2017-10-23 Thread Michael Park (JIRA)
Michael Park created MESOS-8126:
---

 Summary: Consider decoupling the authorization logic from response 
creation.
 Key: MESOS-8126
 URL: https://issues.apache.org/jira/browse/MESOS-8126
 Project: Mesos
  Issue Type: Task
Reporter: Michael Park


Currently the {{createAgentResponse}} function performs some authorization,
given an optional {{rolesAcceptor}}. {{_getAgents}} function uses this helper
*with* a {{rolesAcceptor}}. {{createAgentAdded}} on the other hand uses the
helper *without* a {{rolesAcceptor}} and is passed to 
{{Master::Subscriber::send}}
for authorization post-hoc.

At first glance, it seemed like there were two authorizations being done for no
reason, and it seems like it could be beneficial to pull the authorization
logic out of the response creation logic, rather than coupling them and
bypassing authorization when we want *custom* authorization logic.
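The shape of the proposed decoupling can be sketched as follows, with hypothetical simplified types standing in for the real master types ({{Agent}}, {{RolesAcceptor}}, and {{AgentResponse}} here are illustrations, not the Mesos API):

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical simplified model of an agent and its visible roles.
struct Agent {
  std::string id;
  std::vector<std::string> roles;
};

struct AgentResponse {
  std::string id;
  std::vector<std::string> roles;
};

using RolesAcceptor = std::function<bool(const std::string&)>;

// Authorization as its own step: filter down to the roles the caller
// is allowed to see.
std::vector<std::string> authorizeRoles(
    const std::vector<std::string>& roles, const RolesAcceptor& acceptor) {
  std::vector<std::string> accepted;
  for (const std::string& role : roles) {
    if (acceptor(role)) {
      accepted.push_back(role);
    }
  }
  return accepted;
}

// Response creation takes already-authorized data and does no filtering
// itself. Callers that need custom authorization (e.g., post-hoc in
// `Master::Subscriber::send`) simply pre-filter differently, instead of
// calling the helper without an acceptor to bypass authorization.
AgentResponse createAgentResponse(
    const Agent& agent, const std::vector<std::string>& authorizedRoles) {
  return AgentResponse{agent.id, authorizedRoles};
}
```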





[jira] [Commented] (MESOS-7851) Master stores old resource format in the registry

2017-10-23 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216150#comment-16216150
 ] 

Michael Park commented on MESOS-7851:
-

https://reviews.apache.org/r/63232/

> Master stores old resource format in the registry
> -
>
> Key: MESOS-7851
> URL: https://issues.apache.org/jira/browse/MESOS-7851
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Greg Mann
>Assignee: Michael Park
>  Labels: master, mesosphere, reservation
>
> We intend for the master to store all internal resource representations in 
> the new, post-reservation-refinement format. However, [when persisting 
> registered agents to the 
> registrar|https://github.com/apache/mesos/blob/498a000ac1bb8f51dc871f22aea265424a407a17/src/master/master.cpp#L5861-L5876],
>  the master does not convert the resources; agents provide resources in the 
> pre-reservation-refinement format, and these resources are stored as-is. This 
> means that after recovery, any agents in the master's {{slaves.recovered}} 
> map will have {{SlaveInfo.resources}} in the pre-reservation-refinement 
> format.
> We should update the master to convert these resources before persisting them 
> to the registry.





[jira] [Updated] (MESOS-7851) Master stores old resource format in the registry

2017-10-23 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7851:

Target Version/s: 1.5.0

> Master stores old resource format in the registry
> -
>
> Key: MESOS-7851
> URL: https://issues.apache.org/jira/browse/MESOS-7851
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Greg Mann
>Assignee: Michael Park
>  Labels: master, mesosphere, reservation
>
> We intend for the master to store all internal resource representations in 
> the new, post-reservation-refinement format. However, [when persisting 
> registered agents to the 
> registrar|https://github.com/apache/mesos/blob/498a000ac1bb8f51dc871f22aea265424a407a17/src/master/master.cpp#L5861-L5876],
>  the master does not convert the resources; agents provide resources in the 
> pre-reservation-refinement format, and these resources are stored as-is. This 
> means that after recovery, any agents in the master's {{slaves.recovered}} 
> map will have {{SlaveInfo.resources}} in the pre-reservation-refinement 
> format.
> We should update the master to convert these resources before persisting them 
> to the registry.





[jira] [Assigned] (MESOS-6971) Use arena allocation to improve protobuf message passing performance.

2017-10-23 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-6971:
--

   Resolution: Fixed
 Assignee: Dmitry Zhuk
Fix Version/s: 1.5.0

{noformat}
commit 834053d976e2db18c16e1612b3b723fe1c8ca1ac
Author: Dmitry Zhuk 
Date:   Thu Oct 12 14:45:26 2017 -0700

Used protobuf arenas for creating messages in ProtobufProcess.

When passing const protobuf messages and fields, we can allocate
the protobuf message within an arena. Arenas dramatically reduce
the number of malloc's involved. The use of arenas also improves
the cache locality of the protobuf memory.

Review: https://reviews.apache.org/r/62901/
{noformat}

> Use arena allocation to improve protobuf message passing performance.
> -
>
> Key: MESOS-6971
> URL: https://issues.apache.org/jira/browse/MESOS-6971
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>Assignee: Dmitry Zhuk
>  Labels: mesosphere, performance, tech-debt
> Fix For: 1.5.0
>
>
> The protobuf message passing provided by {{ProtobufProcess}} provides const 
> access to the message and/or its fields within the handler function.
> This means that we can leverage the [arena 
> allocator|https://developers.google.com/protocol-buffers/docs/reference/arenas]
>  provided by protobuf to reduce the memory allocation cost during 
> de-serialization and improve cache efficiency.
> This would require using protobuf 3.x with "proto2" syntax (which appears to 
> be the default if unspecified) to maintain our existing "proto2" 
> requirements. The upgrade to protobuf 3.x while keeping "proto2" syntax 
> should be tackled via a separate ticket that blocks this one.
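The mechanism behind the win is illustrated below with a minimal bump-pointer arena. This is not the protobuf API ({{google::protobuf::Arena}} is the real thing), but it works on the same principle: one large block, pointer-bump allocation, and bulk free, which is why malloc counts drop and cache locality improves.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Minimal bump-pointer arena sketch (illustrative, not protobuf's Arena).
class Arena {
 public:
  explicit Arena(size_t capacity) : block_(capacity), offset_(0) {}

  // Allocation is a pointer bump: no per-object malloc, and objects end
  // up adjacent in one block, which helps cache locality.
  void* allocate(size_t bytes) {
    const size_t alignment = alignof(std::max_align_t);
    const size_t aligned = (bytes + alignment - 1) & ~(alignment - 1);
    if (offset_ + aligned > block_.size()) {
      return nullptr;  // capacity exhausted; a real arena would grow
    }
    void* p = block_.data() + offset_;
    offset_ += aligned;
    return p;
  }

  template <typename T, typename... Args>
  T* create(Args&&... args) {
    void* p = allocate(sizeof(T));
    return p == nullptr ? nullptr : new (p) T(static_cast<Args&&>(args)...);
  }

  size_t used() const { return offset_; }

  // Everything is freed at once when the arena is destroyed. Callers must
  // only place trivially destructible objects here (or run destructors
  // themselves) -- the same contract protobuf arenas impose by default.
 private:
  std::vector<uint8_t> block_;
  size_t offset_;
};
```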





[jira] [Updated] (MESOS-6985) os::getenv() can segfault

2017-10-23 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-6985:
---
Labels: flaky-test reliability stout  (was: reliability stout)

> os::getenv() can segfault
> -
>
> Key: MESOS-6985
> URL: https://issues.apache.org/jira/browse/MESOS-6985
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
> Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without 
> libevent/SSL
>Reporter: Greg Mann
>Assignee: Ilya Pronin
>  Labels: flaky-test, reliability, stout
> Attachments: 
> MasterMaintenanceTest.InverseOffersFilters-truncated.txt, 
> MasterTest.MultipleExecutors.txt
>
>
> This was observed on ASF CI. The segfault first showed up on CI on 9/20/16 
> and has been produced by the tests {{MasterTest.MultipleExecutors}} and 
> {{MasterMaintenanceTest.InverseOffersFilters}}. In both cases, 
> {{os::getenv()}} segfaults with the same stack trace:
> {code}
> *** Aborted at 1485241617 (unix time) try "date -d @1485241617" if you are 
> using GNU date ***
> PC: @ 0x2ad59e3ae82d (unknown)
> I0124 07:06:57.422080 28619 exec.cpp:162] Version: 1.2.0
> *** SIGSEGV (@0xf0) received by PID 28591 (TID 0x2ad5a7b87700) from PID 240; 
> stack trace: ***
> I0124 07:06:57.422336 28615 exec.cpp:212] Executor started at: 
> executor(75)@172.17.0.2:45752 with pid 28591
> @ 0x2ad5ab953197 (unknown)
> @ 0x2ad5ab957479 (unknown)
> @ 0x2ad59e165330 (unknown)
> @ 0x2ad59e3ae82d (unknown)
> @ 0x2ad594631358 os::getenv()
> @ 0x2ad59aba6acf mesos::internal::slave::executorEnvironment()
> @ 0x2ad59ab845c0 mesos::internal::slave::Framework::launchExecutor()
> @ 0x2ad59ab818a2 mesos::internal::slave::Slave::_run()
> @ 0x2ad59ac1ec10 
> _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_
> @ 0x2ad59ac1e6bf 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x2ad59bce2304 std::function<>::operator()()
> @ 0x2ad59bcc9824 process::ProcessBase::visit()
> @ 0x2ad59bd4028e process::DispatchEvent::visit()
> @ 0x2ad594616df1 process::ProcessBase::serve()
> @ 0x2ad59bcc72b7 process::ProcessManager::resume()
> @ 0x2ad59bcd567c 
> process::ProcessManager::init_threads()::$_2::operator()()
> @ 0x2ad59bcd5585 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2ad59bcd std::_Bind_simple<>::operator()()
> @ 0x2ad59bcd552c std::thread::_Impl<>::_M_run()
> @ 0x2ad59d9e6a60 (unknown)
> @ 0x2ad59e15d184 start_thread
> @ 0x2ad59e46d37d (unknown)
> make[4]: *** [check-local] Segmentation fault
> {code}
> Find attached the full log from a failed run of 
> {{MasterTest.MultipleExecutors}} and a truncated log from a failed run of 
> {{MasterMaintenanceTest.InverseOffersFilters}}.
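A sketch of the usual mitigation for this class of crash (not necessarily the fix that landed in stout): {{::getenv}} returns a pointer into the environment, which a concurrent {{::setenv}} from another thread can invalidate mid-read. Serializing access within the process and copying the value out closes that window.

```cpp
#include <cstdlib>
#include <mutex>
#include <string>

// Illustrative thread-safe environment wrappers. All access within the
// process must go through these for the lock to help; it cannot protect
// against third-party code calling ::setenv directly.
namespace safe_env {

std::mutex mutex;

// Returns the empty string when unset. A real API would distinguish
// "unset" from "set to empty" (stout's os::getenv returns an
// Option<std::string> for this reason).
std::string get(const std::string& name) {
  std::lock_guard<std::mutex> lock(mutex);
  const char* value = ::getenv(name.c_str());
  return value == nullptr ? std::string() : std::string(value);
}

void set(const std::string& name, const std::string& value) {
  std::lock_guard<std::mutex> lock(mutex);
  ::setenv(name.c_str(), value.c_str(), 1);  // POSIX; overwrite = 1
}

}  // namespace safe_env
```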





[jira] [Created] (MESOS-8125) Agent shouldn't try to recover executors after a reboot

2017-10-23 Thread Gastón Kleiman (JIRA)
Gastón Kleiman created MESOS-8125:
-

 Summary: Agent shouldn't try to recover executors after a reboot
 Key: MESOS-8125
 URL: https://issues.apache.org/jira/browse/MESOS-8125
 Project: Mesos
  Issue Type: Bug
Reporter: Gastón Kleiman


We know that all executors will be gone once the host on which an agent is 
running is rebooted, so there's no need to try to recover these executors.

Trying to recover stopped executors can lead to problems if another process is 
assigned the same pid that the executor had before the reboot. In this case the 
agent will unsuccessfully try to reregister with the executor, and then 
transition it to a {{TERMINATING}} state. The executor will sadly get stuck in 
that state, and the tasks that it started will get stuck in whatever state they 
were in at the time of the reboot.

One way of getting rid of stuck executors is to remove the {{latest}} symlink 
under {{work_dir/meta/slaves/latest/frameworks//executors//runs}}.

Here's how to reproduce this issue:

# Start a task using the Docker containerizer (the same will probably happen 
with the command executor).
# Stop the corresponding Mesos agent while the task is running.
# Change the executor's checkpointed forked pid, which is located in the meta 
directory, e.g., 
{{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
 I used pid 2, which is normally used by {{kthreadd}}.
# Reboot the host
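One way to implement the reboot check, sketched below under the assumption of a Linux host: {{/proc/sys/kernel/random/boot_id}} changes on every boot, so an agent that checkpoints it can tell on recovery whether the host rebooted and, if so, skip executor recovery entirely. The function names here are illustrative, not the agent's actual API.

```cpp
#include <fstream>
#include <string>

// Linux-specific: read the kernel's per-boot identifier.
std::string currentBootId() {
  std::ifstream file("/proc/sys/kernel/random/boot_id");
  std::string id;
  std::getline(file, id);
  return id;
}

// The recovery decision, factored out so it is testable without a reboot.
// A different boot id means the host rebooted: all executor processes are
// gone, and any checkpointed pids may now belong to unrelated processes.
bool shouldRecoverExecutors(const std::string& checkpointedBootId,
                            const std::string& currentBootId) {
  return !checkpointedBootId.empty() && checkpointedBootId == currentBootId;
}
```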





[jira] [Comment Edited] (MESOS-7851) Master stores old resource format in the registry

2017-10-23 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142405#comment-16142405
 ] 

Michael Park edited comment on MESOS-7851 at 10/23/17 11:35 PM:


Just writing down what should be done here.

The master has 2 things that contain resources that go into the registry:
{{SlaveInfo}}, and {{QuotaInfo}}. In order to support master downgrades
(e.g., 1.4.0 => 1.3.1), we must store the resources in
the "pre-reservation-refinement" format. This happens for {{SlaveInfo}} today
(albeit incidentally), but not for {{QuotaInfo}}.

Resources inside {{QuotaInfo}} should probably be downgraded for the 
{{Registry}}
and upgraded on their way out. However, with the current requirement that
{{QuotaInfo}} can only hold unreserved resources, we don't need to do anything
for this. (tested manually by setting a quota with 1.4.0 master, downgrading to
1.3.1 and hitting the quota endpoint).

{{Master::_recover}} should upgrade the resources inside {{SlaveInfo}} before
inserting it into the {{slaves.recovered}} map. {{authorizeResources}} can be
updated after this.
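The downgrade-for-registry / upgrade-on-recovery pair described above can be sketched with hypothetical simplified types (the real code operates on the {{Resource}} protobuf and the {{Registry}}):

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified resource record: `role` is the
// pre-refinement encoding; `reservations` is the post-refinement stack.
struct Resource {
  std::string name;
  std::string role;
  std::vector<std::string> reservations;
};

// Downgrade before persisting: old masters (e.g., 1.3.1) only understand
// the `role` field, so collapse a single-entry reservation stack onto it.
// A refined (multi-entry) stack cannot be represented in the old format.
Resource downgradeForRegistry(Resource r) {
  if (r.reservations.size() == 1) {
    r.role = r.reservations.front();
    r.reservations.clear();
  }
  return r;
}

// Upgrade on recovery: convert a stored pre-refinement resource into the
// canonical in-memory format before it enters `slaves.recovered`.
Resource upgradeOnRecovery(Resource r) {
  if (r.reservations.empty() && !r.role.empty() && r.role != "*") {
    r.reservations.push_back(r.role);
    r.role = "*";
  }
  return r;
}
```

The two functions round-trip, so the registry keeps the old format for downgrade safety while the master only ever sees the new format in memory.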


was (Author: mcypark):
Just writing down what should be done here.

The master has 2 things that contain resources that go into the registry: 
{{SlaveInfo}}, and {{QuotaInfo}}.
In order to support master downgrades (e.g., 1.4.0 => 1.3.1), we must store the 
resources
in the "pre-reservation-refinement" format. This happens for {{SlaveInfo}} 
today (albeit incidentally), but not
for {{QuotaInfo}}.

Resources inside {{QuotaInfo}} should probably be downgraded for the 
{{Registry}} and upgraded on their way out.
However, with the current requirement that {{QuotaInfo}} can only hold 
unreserved resources,
we don't need to do anything for this. (tested manually by setting a quota with 
1.4.0 master,
downgrading to 1.3.1 and hitting the quota endpoint).

Changes that should be made:
  - {{Master::_recover}} should upgrade the resources inside {{SlaveInfo}} 
before inserting it into the {{slaves.recovered}} map. {{authorizeResources}} 
can be fixed after this. (tech debt)

> Master stores old resource format in the registry
> -
>
> Key: MESOS-7851
> URL: https://issues.apache.org/jira/browse/MESOS-7851
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Greg Mann
>Assignee: Michael Park
>  Labels: master, mesosphere, reservation
>
> We intend for the master to store all internal resource representations in 
> the new, post-reservation-refinement format. However, [when persisting 
> registered agents to the 
> registrar|https://github.com/apache/mesos/blob/498a000ac1bb8f51dc871f22aea265424a407a17/src/master/master.cpp#L5861-L5876],
>  the master does not convert the resources; agents provide resources in the 
> pre-reservation-refinement format, and these resources are stored as-is. This 
> means that after recovery, any agents in the master's {{slaves.recovered}} 
> map will have {{SlaveInfo.resources}} in the pre-reservation-refinement 
> format.
> We should update the master to convert these resources before persisting them 
> to the registry.





[jira] [Comment Edited] (MESOS-7851) Master stores old resource format in the registry

2017-10-23 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142405#comment-16142405
 ] 

Michael Park edited comment on MESOS-7851 at 10/23/17 11:33 PM:


Just writing down what should be done here.

The master has 2 things that contain resources that go into the registry: 
{{SlaveInfo}}, and {{QuotaInfo}}.
In order to support master downgrades (e.g., 1.4.0 => 1.3.1), we must store the 
resources
in the "pre-reservation-refinement" format. This happens for {{SlaveInfo}} 
today (albeit incidentally), but not
for {{QuotaInfo}}.

Resources inside {{QuotaInfo}} should probably be downgraded for the 
{{Registry}} and upgraded on their way out.
However, with the current requirement that {{QuotaInfo}} can only
hold unreserved resources, we don't need to do anything for this.
(tested manually by setting a quota with 1.4.0 master, downgrading to 1.3.1 and 
hitting the quota endpoint).

Changes that should be made:
  - {{Master::_recover}} should upgrade the resources inside {{SlaveInfo}} 
before inserting it into the {{slaves.recovered}} map. {{authorizeResources}} 
can be fixed after this. (tech debt)


was (Author: mcypark):
Just writing down what should be done here.

The master has 2 things that contain resources that go into the registry: 
{{SlaveInfo}}, and {{QuotaInfo}}.
In order to support master downgrades (e.g., 1.4.0 => 1.3.1), we must store the 
resources
in the "pre-reservation-refinement" format. This happens for {{SlaveInfo}} 
today (albeit incidentally), but not
for {{QuotaInfo}}.

Resources inside {{QuotaInfo}} should probably be downgraded for the 
{{Registry}} and upgraded on their way out. However, with the current 
requirement that {{QuotaInfo}} can only hold unreserved resources, we don't 
need to do anything for this. (tested manually by setting a quota with 1.4.0 
master, downgrading to 1.3.1 and hitting the quota endpoint).

Changes that should be made:
  - {{Master::_recover}} should upgrade the resources inside {{SlaveInfo}} 
before inserting it into the {{slaves.recovered}} map. {{authorizeResources}} 
can be fixed after this. (tech debt)

> Master stores old resource format in the registry
> -
>
> Key: MESOS-7851
> URL: https://issues.apache.org/jira/browse/MESOS-7851
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Greg Mann
>Assignee: Michael Park
>  Labels: master, mesosphere, reservation
>
> We intend for the master to store all internal resource representations in 
> the new, post-reservation-refinement format. However, [when persisting 
> registered agents to the 
> registrar|https://github.com/apache/mesos/blob/498a000ac1bb8f51dc871f22aea265424a407a17/src/master/master.cpp#L5861-L5876],
>  the master does not convert the resources; agents provide resources in the 
> pre-reservation-refinement format, and these resources are stored as-is. This 
> means that after recovery, any agents in the master's {{slaves.recovered}} 
> map will have {{SlaveInfo.resources}} in the pre-reservation-refinement 
> format.
> We should update the master to convert these resources before persisting them 
> to the registry.





[jira] [Comment Edited] (MESOS-7851) Master stores old resource format in the registry

2017-10-23 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142405#comment-16142405
 ] 

Michael Park edited comment on MESOS-7851 at 10/23/17 11:33 PM:


Just writing down what should be done here.

The master has 2 things that contain resources that go into the registry: 
{{SlaveInfo}}, and {{QuotaInfo}}.
In order to support master downgrades (e.g., 1.4.0 => 1.3.1), we must store the 
resources
in the "pre-reservation-refinement" format. This happens for {{SlaveInfo}} 
today (albeit incidentally), but not
for {{QuotaInfo}}.

Resources inside {{QuotaInfo}} should probably be downgraded for the 
{{Registry}} and upgraded on their way out.
However, with the current requirement that {{QuotaInfo}} can only hold 
unreserved resources,
we don't need to do anything for this. (tested manually by setting a quota with 
1.4.0 master,
downgrading to 1.3.1 and hitting the quota endpoint).

Changes that should be made:
  - {{Master::_recover}} should upgrade the resources inside {{SlaveInfo}} 
before inserting it into the {{slaves.recovered}} map. {{authorizeResources}} 
can be fixed after this. (tech debt)


was (Author: mcypark):
Just writing down what should be done here.

The master has 2 things that contain resources that go into the registry: 
{{SlaveInfo}}, and {{QuotaInfo}}.
In order to support master downgrades (e.g., 1.4.0 => 1.3.1), we must store the 
resources
in the "pre-reservation-refinement" format. This happens for {{SlaveInfo}} 
today (albeit incidentally), but not
for {{QuotaInfo}}.

Resources inside {{QuotaInfo}} should probably be downgraded for the 
{{Registry}} and upgraded on their way out.
However, with the current requirement that {{QuotaInfo}} can only
hold unreserved resources, we don't need to do anything for this.
(tested manually by setting a quota with 1.4.0 master, downgrading to 1.3.1 and 
hitting the quota endpoint).

Changes that should be made:
  - {{Master::_recover}} should upgrade the resources inside {{SlaveInfo}} 
before inserting it into the {{slaves.recovered}} map. {{authorizeResources}} 
can be fixed after this. (tech debt)

> Master stores old resource format in the registry
> -
>
> Key: MESOS-7851
> URL: https://issues.apache.org/jira/browse/MESOS-7851
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Greg Mann
>Assignee: Michael Park
>  Labels: master, mesosphere, reservation
>
> We intend for the master to store all internal resource representations in 
> the new, post-reservation-refinement format. However, [when persisting 
> registered agents to the 
> registrar|https://github.com/apache/mesos/blob/498a000ac1bb8f51dc871f22aea265424a407a17/src/master/master.cpp#L5861-L5876],
>  the master does not convert the resources; agents provide resources in the 
> pre-reservation-refinement format, and these resources are stored as-is. This 
> means that after recovery, any agents in the master's {{slaves.recovered}} 
> map will have {{SlaveInfo.resources}} in the pre-reservation-refinement 
> format.
> We should update the master to convert these resources before persisting them 
> to the registry.





[jira] [Commented] (MESOS-6985) os::getenv() can segfault

2017-10-23 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215959#comment-16215959
 ] 

James Peach commented on MESOS-6985:


Isn't the emulation code in {{os::execvpe}} just fundamentally unsafe? On Linux 
we could use [execvpe(3)|http://man7.org/linux/man-pages/man3/exec.3.html] 
directly, and IIUC everywhere else we could emulate it by searching {{$PATH}} 
before doing an {{execve(2)}}?
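The $PATH search half of that emulation could look like the sketch below (illustrative only; in real code the predicate would be {{access(path, X_OK)}}, and the caller would then {{execve()}} the resolved path). The existence check is injected so the resolution logic is testable in isolation:

```cpp
#include <functional>
#include <sstream>
#include <string>

// Resolve `file` against a colon-separated `path`, returning the first
// candidate the predicate accepts, or the empty string if none match.
std::string searchPath(
    const std::string& file,
    const std::string& path,
    const std::function<bool(const std::string&)>& isExecutable) {
  // Names containing '/' bypass the search, matching execvp semantics.
  if (file.find('/') != std::string::npos) {
    return isExecutable(file) ? file : std::string();
  }

  std::istringstream directories(path);
  std::string directory;
  while (std::getline(directories, directory, ':')) {
    if (directory.empty()) {
      directory = ".";  // an empty PATH entry means the current directory
    }
    std::string candidate = directory + "/" + file;
    if (isExecutable(candidate)) {
      return candidate;
    }
  }
  return std::string();
}
```

Note the remaining race even with this approach: the file can change between the existence check and the {{execve(2)}}, which is part of why the glibc {{execvpe(3)}} is preferable where available.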

> os::getenv() can segfault
> -
>
> Key: MESOS-6985
> URL: https://issues.apache.org/jira/browse/MESOS-6985
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
> Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without 
> libevent/SSL
>Reporter: Greg Mann
>Assignee: Ilya Pronin
>  Labels: reliability, stout
> Attachments: 
> MasterMaintenanceTest.InverseOffersFilters-truncated.txt, 
> MasterTest.MultipleExecutors.txt
>
>
> This was observed on ASF CI. The segfault first showed up on CI on 9/20/16 
> and has been produced by the tests {{MasterTest.MultipleExecutors}} and 
> {{MasterMaintenanceTest.InverseOffersFilters}}. In both cases, 
> {{os::getenv()}} segfaults with the same stack trace:
> {code}
> *** Aborted at 1485241617 (unix time) try "date -d @1485241617" if you are 
> using GNU date ***
> PC: @ 0x2ad59e3ae82d (unknown)
> I0124 07:06:57.422080 28619 exec.cpp:162] Version: 1.2.0
> *** SIGSEGV (@0xf0) received by PID 28591 (TID 0x2ad5a7b87700) from PID 240; 
> stack trace: ***
> I0124 07:06:57.422336 28615 exec.cpp:212] Executor started at: 
> executor(75)@172.17.0.2:45752 with pid 28591
> @ 0x2ad5ab953197 (unknown)
> @ 0x2ad5ab957479 (unknown)
> @ 0x2ad59e165330 (unknown)
> @ 0x2ad59e3ae82d (unknown)
> @ 0x2ad594631358 os::getenv()
> @ 0x2ad59aba6acf mesos::internal::slave::executorEnvironment()
> @ 0x2ad59ab845c0 mesos::internal::slave::Framework::launchExecutor()
> @ 0x2ad59ab818a2 mesos::internal::slave::Slave::_run()
> @ 0x2ad59ac1ec10 
> _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_
> @ 0x2ad59ac1e6bf 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x2ad59bce2304 std::function<>::operator()()
> @ 0x2ad59bcc9824 process::ProcessBase::visit()
> @ 0x2ad59bd4028e process::DispatchEvent::visit()
> @ 0x2ad594616df1 process::ProcessBase::serve()
> @ 0x2ad59bcc72b7 process::ProcessManager::resume()
> @ 0x2ad59bcd567c 
> process::ProcessManager::init_threads()::$_2::operator()()
> @ 0x2ad59bcd5585 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2ad59bcd std::_Bind_simple<>::operator()()
> @ 0x2ad59bcd552c std::thread::_Impl<>::_M_run()
> @ 0x2ad59d9e6a60 (unknown)
> @ 0x2ad59e15d184 start_thread
> @ 0x2ad59e46d37d (unknown)
> make[4]: *** [check-local] Segmentation fault
> {code}
> Find attached the full log from a failed run of 
> {{MasterTest.MultipleExecutors}} and a truncated log from a failed run of 
> {{MasterMaintenanceTest.InverseOffersFilters}}.





[jira] [Updated] (MESOS-8124) PosixRLimitsIsolatorTest.TaskExceedingLimit is flaky.

2017-10-23 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-8124:
---
Attachment: failed.txt
success.txt

> PosixRLimitsIsolatorTest.TaskExceedingLimit is flaky.
> -
>
> Key: MESOS-8124
> URL: https://issues.apache.org/jira/browse/MESOS-8124
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>  Labels: flaky-test
> Attachments: failed.txt, success.txt
>
>
> This test is flaky on CI:
> {noformat}
> ../../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:348: Failure
> Failed to wait 15secs for statusFailed
> ../../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:333: Failure
> Actual function call count doesn't match EXPECT_CALL(sched, 
> statusUpdate(, _))...
>  Expected: to be called 3 times
>Actual: called twice - unsatisfied and active
> {noformat}





[jira] [Created] (MESOS-8124) PosixRLimitsIsolatorTest.TaskExceedingLimit is flaky.

2017-10-23 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-8124:
--

 Summary: PosixRLimitsIsolatorTest.TaskExceedingLimit is flaky.
 Key: MESOS-8124
 URL: https://issues.apache.org/jira/browse/MESOS-8124
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler


This test is flaky on CI:

{noformat}
../../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:348: Failure
Failed to wait 15secs for statusFailed
../../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:333: Failure
Actual function call count doesn't match EXPECT_CALL(sched, 
statusUpdate(, _))...
 Expected: to be called 3 times
   Actual: called twice - unsatisfied and active
{noformat}





[jira] [Commented] (MESOS-6985) os::getenv() can segfault

2017-10-23 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215880#comment-16215880
 ] 

Greg Mann commented on MESOS-6985:
--

Hey [~ipronin]! The approach you proposed here back in January sounds good to 
me. Do you have any cycles to work on this at present? If so, I can shepherd 
the ticket.

> os::getenv() can segfault
> -
>
> Key: MESOS-6985
> URL: https://issues.apache.org/jira/browse/MESOS-6985
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
> Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without 
> libevent/SSL
>Reporter: Greg Mann
>Assignee: Ilya Pronin
>  Labels: reliability, stout
> Attachments: 
> MasterMaintenanceTest.InverseOffersFilters-truncated.txt, 
> MasterTest.MultipleExecutors.txt
>
>
> This was observed on ASF CI. The segfault first showed up on CI on 9/20/16 
> and has been produced by the tests {{MasterTest.MultipleExecutors}} and 
> {{MasterMaintenanceTest.InverseOffersFilters}}. In both cases, 
> {{os::getenv()}} segfaults with the same stack trace:
> {code}
> *** Aborted at 1485241617 (unix time) try "date -d @1485241617" if you are 
> using GNU date ***
> PC: @ 0x2ad59e3ae82d (unknown)
> I0124 07:06:57.422080 28619 exec.cpp:162] Version: 1.2.0
> *** SIGSEGV (@0xf0) received by PID 28591 (TID 0x2ad5a7b87700) from PID 240; 
> stack trace: ***
> I0124 07:06:57.422336 28615 exec.cpp:212] Executor started at: 
> executor(75)@172.17.0.2:45752 with pid 28591
> @ 0x2ad5ab953197 (unknown)
> @ 0x2ad5ab957479 (unknown)
> @ 0x2ad59e165330 (unknown)
> @ 0x2ad59e3ae82d (unknown)
> @ 0x2ad594631358 os::getenv()
> @ 0x2ad59aba6acf mesos::internal::slave::executorEnvironment()
> @ 0x2ad59ab845c0 mesos::internal::slave::Framework::launchExecutor()
> @ 0x2ad59ab818a2 mesos::internal::slave::Slave::_run()
> @ 0x2ad59ac1ec10 
> _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_
> @ 0x2ad59ac1e6bf 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x2ad59bce2304 std::function<>::operator()()
> @ 0x2ad59bcc9824 process::ProcessBase::visit()
> @ 0x2ad59bd4028e process::DispatchEvent::visit()
> @ 0x2ad594616df1 process::ProcessBase::serve()
> @ 0x2ad59bcc72b7 process::ProcessManager::resume()
> @ 0x2ad59bcd567c 
> process::ProcessManager::init_threads()::$_2::operator()()
> @ 0x2ad59bcd5585 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2ad59bcd std::_Bind_simple<>::operator()()
> @ 0x2ad59bcd552c std::thread::_Impl<>::_M_run()
> @ 0x2ad59d9e6a60 (unknown)
> @ 0x2ad59e15d184 start_thread
> @ 0x2ad59e46d37d (unknown)
> make[4]: *** [check-local] Segmentation fault
> {code}
> Find attached the full log from a failed run of 
> {{MasterTest.MultipleExecutors}} and a truncated log from a failed run of 
> {{MasterMaintenanceTest.InverseOffersFilters}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-6985) os::getenv() can segfault

2017-10-23 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6985:
-
Shepherd: Greg Mann

> os::getenv() can segfault
> -
>
> Key: MESOS-6985
> URL: https://issues.apache.org/jira/browse/MESOS-6985
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
> Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without 
> libevent/SSL
>Reporter: Greg Mann
>  Labels: reliability, stout
> Attachments: 
> MasterMaintenanceTest.InverseOffersFilters-truncated.txt, 
> MasterTest.MultipleExecutors.txt
>
>





[jira] [Assigned] (MESOS-6985) os::getenv() can segfault

2017-10-23 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-6985:


Assignee: Ilya Pronin

> os::getenv() can segfault
> -
>
> Key: MESOS-6985
> URL: https://issues.apache.org/jira/browse/MESOS-6985
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
> Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without 
> libevent/SSL
>Reporter: Greg Mann
>Assignee: Ilya Pronin
>  Labels: reliability, stout
> Attachments: 
> MasterMaintenanceTest.InverseOffersFilters-truncated.txt, 
> MasterTest.MultipleExecutors.txt
>
>





[jira] [Commented] (MESOS-7726) MasterTest.IgnoreOldAgentReregistration test is flaky

2017-10-23 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215815#comment-16215815
 ] 

Benjamin Mahler commented on MESOS-7726:


{quote}
I believe the agent should ignore SlaveRegisteredMessage after sending 
ReregisterSlaveMessage.
{quote}

[~alexr] can you file a separate issue for this? I don't think that's why this 
test is flaky. My read is that the lack of a {{Clock::settle()}} prior to 
advancing the clock for re-registration meant that the clock was advanced 
before the {{delay}} in the agent was scheduled. It's not that the agent 
thought it was registered; it's that the agent was still waiting for the 
initial re-registration backoff, which never fired.

> MasterTest.IgnoreOldAgentReregistration test is flaky
> -
>
> Key: MESOS-7726
> URL: https://issues.apache.org/jira/browse/MESOS-7726
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Benjamin Mahler
>  Labels: flaky-test, mesosphere-oncall
> Attachments: IgnoreOldAgentReregistration-badrun.txt, 
> IgnoreOldAgentReregistration-goodrun.txt
>
>

[jira] [Assigned] (MESOS-7726) MasterTest.IgnoreOldAgentReregistration test is flaky

2017-10-23 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-7726:
--

Assignee: Benjamin Mahler

> MasterTest.IgnoreOldAgentReregistration test is flaky
> -
>
> Key: MESOS-7726
> URL: https://issues.apache.org/jira/browse/MESOS-7726
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Benjamin Mahler
>  Labels: flaky-test, mesosphere-oncall
> Attachments: IgnoreOldAgentReregistration-badrun.txt, 
> IgnoreOldAgentReregistration-goodrun.txt
>
>
> Observed this on ASF CI.
> {code}
> [ RUN  ] MasterTest.IgnoreOldAgentReregistration
> I0627 05:23:06.031154  4917 cluster.cpp:162] Creating default 'local' 
> authorizer
> I0627 05:23:06.033433  4945 master.cpp:438] Master 
> a8778782-0da1-49a5-9cb8-9f6d11701733 (c43debbe7e32) started on 
> 172.17.0.4:41747
> I0627 05:23:06.033457  4945 master.cpp:440] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/2BARnF/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-1.4.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/2BARnF/master" --zk_session_timeout="10secs"
> I0627 05:23:06.033771  4945 master.cpp:490] Master only allowing 
> authenticated frameworks to register
> I0627 05:23:06.033787  4945 master.cpp:504] Master only allowing 
> authenticated agents to register
> I0627 05:23:06.033798  4945 master.cpp:517] Master only allowing 
> authenticated HTTP frameworks to register
> I0627 05:23:06.033812  4945 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/2BARnF/credentials'
> I0627 05:23:06.034080  4945 master.cpp:562] Using default 'crammd5' 
> authenticator
> I0627 05:23:06.034221  4945 http.cpp:974] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0627 05:23:06.034409  4945 http.cpp:974] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0627 05:23:06.034569  4945 http.cpp:974] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0627 05:23:06.034688  4945 master.cpp:642] Authorization enabled
> I0627 05:23:06.034862  4938 whitelist_watcher.cpp:77] No whitelist given
> I0627 05:23:06.034868  4950 hierarchical.cpp:169] Initialized hierarchical 
> allocator process
> I0627 05:23:06.037211  4957 master.cpp:2161] Elected as the leading master!
> I0627 05:23:06.037236  4957 master.cpp:1700] Recovering from registrar
> I0627 05:23:06.037333  4938 registrar.cpp:345] Recovering registrar
> I0627 05:23:06.038146  4938 registrar.cpp:389] Successfully fetched the 
> registry (0B) in 768256ns
> I0627 05:23:06.038290  4938 registrar.cpp:493] Applied 1 operations in 
> 30798ns; attempting to update the registry
> I0627 05:23:06.038861  4938 registrar.cpp:550] Successfully updated the 
> registry in 510976ns
> I0627 05:23:06.038960  4938 registrar.cpp:422] Successfully recovered 
> registrar
> I0627 05:23:06.039364  4941 hierarchical.cpp:207] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I0627 05:23:06.039594  4958 master.cpp:1799] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> I0627 05:23:06.043999  4917 containerizer.cpp:230] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni,environment_secret
> W0627 05:23:06.044456  4917 backend.cpp:76] Failed to create 'aufs' backend: 
> AufsBackend requires root privileges
> W0627 05:23:06.044548  4917 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I0627 05:23:06.044580  4917 provisioner.cpp:255] Using default backend 'copy'
> I0627 05:23:06.046222  

[jira] [Updated] (MESOS-8070) Bundled GRPC build does not build on Debian 8

2017-10-23 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-8070:

Target Version/s: 1.5.0

> Bundled GRPC build does not build on Debian 8
> -
>
> Key: MESOS-8070
> URL: https://issues.apache.org/jira/browse/MESOS-8070
> Project: Mesos
>  Issue Type: Bug
>Reporter: Zhitao Li
>Assignee: Chun-Hung Hsiao
> Fix For: 1.5.0
>
>
> Debian 8 includes an outdated version of libc-ares-dev, which prevents the 
> bundled gRPC from building.
> I believe [~chhsia0] already has a fix.





[jira] [Commented] (MESOS-7991) fatal, check failed !framework->recovered()

2017-10-23 Thread Armand Grillet (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215386#comment-16215386
 ] 

Armand Grillet commented on MESOS-7991:
---

This could happen if we have a master failover and the agent re-registers, then 
re-registers again 
(https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629).
The statement in 
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070
thus does not seem correct, and the change in
https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073
from the review request https://reviews.apache.org/r/53897/ that followed this 
comment should be removed.

The strange thing is that the tasks are known to the master but not to the 
agent according to the logs (master.cpp:7568); the fact that the agent kept its 
ID but not its tasks seems unlikely. Could you give more context around the 
agent and the registration attempt, as well as the master logs since the 
failover and the agent logs around that timeframe?

We should write a test reproducing the issue (having a master + agent, 
launching a task, restarting the master, blocking framework re-registration, 
and letting the agent re-register twice by spoofing the second re-registration) 
and then remove line 8073.

> fatal, check failed !framework->recovered()
> ---
>
> Key: MESOS-7991
> URL: https://issues.apache.org/jira/browse/MESOS-7991
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Armand Grillet
>Priority: Blocker
>  Labels: reliability
>
> mesos master crashed on what appears to be framework recovery
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> {code}
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 
> 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 
> 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 
> 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task 
> e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task 
> f838a03c-5cd4-47eb-8606-69b004d89808 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 
> 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756397 25452 master.cpp:7568] Task 
> 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: 
> !framework->recovered()
> *** Check failure stack trace: ***
> @ 0x7f7bf80087ed  google::LogMessage::Fail()
> @ 0x7f7bf800a5a0  google::LogMessage::SendToLog()
> @ 0x7f7bf80083d3  google::LogMessage::Flush()
> @ 0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f7bf736fe7e  
> mesos::internal::master::Master::reconcileKnownSlave()
> @ 0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f7bf73a580e  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc
> 

[jira] [Commented] (MESOS-7306) Support mount propagation for host volumes.

2017-10-23 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215286#comment-16215286
 ] 

Jie Yu commented on MESOS-7306:
---

https://reviews.apache.org/r/63200
https://reviews.apache.org/r/63210
https://reviews.apache.org/r/63211
https://reviews.apache.org/r/63212
https://reviews.apache.org/r/63213


> Support mount propagation for host volumes.
> ---
>
> Key: MESOS-7306
> URL: https://issues.apache.org/jira/browse/MESOS-7306
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: mesosphere, storage
>
> Currently, all mounts in a container are marked as 'slave' by default. 
> However, in some cases, we may want mounts under certain directories in a 
> container to be propagated back to the root mount namespace. This is useful 
> for cases where we want the mounts to survive container failures.
> See more documentation about mount propagation in:
> https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
> Given that mount propagation is very hard for users to understand, it is 
> probably worth limiting this to just host volumes, because that is the only 
> use case we see at the moment.
> Some relevant discussion can be found here:
> https://github.com/kubernetes/community/blob/master/contributors/design-proposals/propagation.md





[jira] [Updated] (MESOS-7306) Support mount propagation for host volumes.

2017-10-23 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7306:
--
Story Points: 8  (was: 5)

> Support mount propagation for host volumes.
> ---
>
> Key: MESOS-7306
> URL: https://issues.apache.org/jira/browse/MESOS-7306
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: mesosphere, storage
>
> Currently, all mounts in a container are marked as 'slave' by default. 
> However, in some cases, we may want mounts under certain directories in a 
> container to be propagated back to the root mount namespace. This is useful 
> for cases where we want the mounts to survive container failures.
> See more documentation about mount propagation in:
> https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
> Given that mount propagation is very hard for users to understand, it is 
> probably worth limiting this to just host volumes, because that is the only 
> use case we see at the moment.
> Some relevant discussion can be found here:
> https://github.com/kubernetes/community/blob/master/contributors/design-proposals/propagation.md





[jira] [Commented] (MESOS-7935) CMake build should fail immediately for in-source builds

2017-10-23 Thread Nathan Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215177#comment-16215177
 ] 

Nathan Jackson commented on MESOS-7935:
---

[~kaysoky] Have you had a chance to look at my review?

> CMake build should fail immediately for in-source builds
> 
>
> Key: MESOS-7935
> URL: https://issues.apache.org/jira/browse/MESOS-7935
> Project: Mesos
>  Issue Type: Improvement
>  Components: cmake
> Environment: macOS 10.12
> GNU/Linux Debian Stretch
>Reporter: Damien Gerard
>Assignee: Nathan Jackson
>  Labels: build
>
> In-source builds are neither recommended nor supported.  It is simple enough 
> to add a check that fails the build immediately.
> ---
> In-source build of master branch was broken with:
> {noformat}
> cd /Users/damien.gerard/projects/acp/mesos/src && 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
>   -DBUILD_FLAGS=\"\" -DBUILD_JAVA_JVM_LIBRARY=\"\" -DHAS_AUTHENTICATION=1 
> -DLIBDIR=\"/usr/local/libmesos\" -DPICOJSON_USE_INT64 
> -DPKGDATADIR=\"/usr/local/share/mesos\" 
> -DPKGLIBEXECDIR=\"/usr/local/libexec/mesos\" -DUSE_CMAKE_BUILD_CONFIG 
> -DUSE_STATIC_LIB -DVERSION=\"1.4.0\" -D__STDC_FORMAT_MACROS 
> -Dmesos_1_4_0_EXPORTS -I/Users/damien.gerard/projects/acp/mesos/include 
> -I/Users/damien.gerard/projects/acp/mesos/include/mesos 
> -I/Users/damien.gerard/projects/acp/mesos/src -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/protobuf-3.3.0/src/protobuf-3.3.0-lib/lib/include
>  -isystem /Users/damien.gerard/projects/acp/mesos/3rdparty/libprocess/include 
> -isystem /usr/local/opt/apr/libexec/include/apr-1 -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/boost-1.53.0/src/boost-1.53.0
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/elfio-3.2/src/elfio-3.2 
> -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/glog-0.3.3/src/glog-0.3.3-lib/lib/include
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/nvml-352.79/src/nvml-352.79 
> -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/picojson-1.3.0/src/picojson-1.3.0
>  -isystem /usr/local/include/subversion-1 -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/stout/include -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/http_parser-2.6.2/src/http_parser-2.6.2
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/concurrentqueue-1.0.0-beta/src/concurrentqueue-1.0.0-beta
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/libev-4.22/src/libev-4.22 
> -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/zookeeper-3.4.8/src/zookeeper-3.4.8/src/c/include
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/zookeeper-3.4.8/src/zookeeper-3.4.8/src/c/generated
>  -isystem 
> /Users/damien.gerard/projects/acp/mesos/3rdparty/leveldb-1.19/src/leveldb-1.19/include
>   -std=c++11 -fPIC   -o 
> CMakeFiles/mesos-1.4.0.dir/slave/containerizer/mesos/provisioner/backends/copy.cpp.o
>  -c 
> /Users/damien.gerard/projects/acp/mesos/src/slave/containerizer/mesos/provisioner/backends/copy.cpp
> /Users/damien.gerard/projects/acp/mesos/src/slave/containerizer/mesos/provisioner/appc/store.cpp:132:46:
>  error: no member named 'fetcher' in namespace 'mesos::uri'; did you mean 
> 'Fetcher'?
>   Try uriFetcher = uri::fetcher::create();
> ~^~~
>  Fetcher
> /Users/damien.gerard/projects/acp/mesos/include/mesos/uri/fetcher.hpp:46:7: 
> note: 'Fetcher' declared here
> class Fetcher
>   ^
> /Users/damien.gerard/projects/acp/mesos/src/slave/containerizer/mesos/provisioner/appc/store.cpp:132:55:
>  error: no member named 'create' in 'mesos::uri::Fetcher'
>   Try uriFetcher = uri::fetcher::create();
> {noformat}
> Both Linux & macOS, not tested elsewhere, on {{master}} and tag 1.4.0-rc3
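A common CMake idiom for the fail-fast check requested above is to compare the source and binary directories at the top of the root CMakeLists.txt (a sketch only; the actual review under discussion may implement it differently):

```cmake
# Fail immediately when the source and build directories are the same,
# before any cache files or generated targets pollute the source tree.
if (CMAKE_SOURCE_DIR STREQUAL CMAKE_BINARY_DIR)
  message(FATAL_ERROR
    "In-source builds are not supported. Please create a separate build "
    "directory (e.g. `mkdir build && cd build && cmake ..`) and remove "
    "CMakeCache.txt and CMakeFiles/ from the source tree before retrying.")
endif ()
```

Note that by the time this check runs, CMake has already written CMakeCache.txt into the source tree, which is why the message asks the user to delete it.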





[jira] [Assigned] (MESOS-8078) Some fields went missing with no replacement in api/v1

2017-10-23 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8078:
-

Shepherd: Till Toenshoff
Assignee: Vinod Kone
  Sprint: Mesosphere Sprint 66
Story Points: 2

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>Assignee: Vinod Kone
>  Labels: mesosphere
>
> Hi friends, 
> These fields are available via state.json but went missing in v1 of the API:
> leader_info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI but would like to 
> stop using state.json, it would be great to have them somewhere in v1.





[jira] [Comment Edited] (MESOS-7851) Master stores old resource format in the registry

2017-10-23 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142405#comment-16142405
 ] 

Michael Park edited comment on MESOS-7851 at 10/23/17 6:23 AM:
---

Just writing down what should be done here.

The master has 2 things that contain resources that go into the registry: 
{{SlaveInfo}}, and {{QuotaInfo}}.
In order to support master downgrades (e.g., 1.4.0 => 1.3.1), we must store the 
resources
in the "pre-reservation-refinement" format. This happens for {{SlaveInfo}} 
today (albeit incidentally), but not
for {{QuotaInfo}}.

Resources inside {{QuotaInfo}} should probably be downgraded for the 
{{Registry}} and upgraded on their way out. However, with the current 
requirement that {{QuotaInfo}} can only hold unreserved resources, we don't 
need to do anything for this. (tested manually by setting a quota with 1.4.0 
master, downgrading to 1.3.1 and hitting the quota endpoint).

Changes that should be made:
  - {{Master::_recover}} should upgrade the resources inside {{SlaveInfo}} 
before inserting it into the {{slaves.recovered}} map. {{authorizeResources}} 
can be fixed after this. (tech debt)


was (Author: mcypark):
Just writing down what should be done here.

The master has 2 things that contain resources that go into the registry: 
{{SlaveInfo}}, and {{QuotaInfo}}.
In order to support master downgrades (e.g., 1.4.0 => 1.3.1), we must store the 
resources
in the "pre-reservation-refinement" format. This happens for {{SlaveInfo}} 
today (albeit incidentally), but not
for {{QuotaInfo}}.

Resources inside {{QuotaInfo}} should probably be downgraded for the 
{{Registry}} and upgraded on their way out. However, with the current 
requirement that {{QuotaInfo}} can only hold unreserved resources, we don't 
need to do anything for this. (tested manually by setting a quota with 1.4.0 
master, downgrading to 1.3.1 and hitting the quota endpoint).

Changes that should be made:
  - {{Master::_recover}} should upgrade the resources inside {{SlaveInfo}} 
before inserting it into the {{slaves.recovered}} map. {{authorizeResources}} 
can be fixed after this. (tech debt)
  - The master should upgrade the resources inside {{SlaveInfo}} earlier in the 
{{(re)registerSlave}} handlers, and use the downgraded version just for the 
{{Registry}}. (clean-up)

> Master stores old resource format in the registry
> -
>
> Key: MESOS-7851
> URL: https://issues.apache.org/jira/browse/MESOS-7851
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Greg Mann
>Assignee: Michael Park
>  Labels: master, mesosphere, reservation
>
> We intend for the master to store all internal resource representations in 
> the new, post-reservation-refinement format. However, [when persisting 
> registered agents to the 
> registrar|https://github.com/apache/mesos/blob/498a000ac1bb8f51dc871f22aea265424a407a17/src/master/master.cpp#L5861-L5876],
>  the master does not convert the resources; agents provide resources in the 
> pre-reservation-refinement format, and these resources are stored as-is. This 
> means that after recovery, any agents in the master's {{slaves.recovered}} 
> map will have {{SlaveInfo.resources}} in the pre-reservation-refinement 
> format.
> We should update the master to convert these resources before persisting them 
> to the registry.


