[jira] [Commented] (MESOS-10011) Operation feedback with stale agent ID crashes the master

2019-10-11 Thread Yan Xu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949656#comment-16949656
 ] 

Yan Xu commented on MESOS-10011:


{{removeOperation}} is probably called from 
[here|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/master/master.cpp#L9245]
 because these operations don't have IDs.

{noformat}
I0918 23:15:32.563908 37981 slave.cpp:6285] Forwarding status update of 
operation with no ID (operation_uuid: d2a369e9-ec7c-4be6-9bdb-8ab1961aa773) for 
framework 9ead69cb-63b1-4986-968a-ecd99b7ba95d-2469
{noformat}
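To illustrate the failure mode, here is a minimal Python sketch (names and structures invented for illustration; the real logic is the CHECK in {{Master::removeOperation}}): the master looks up the operation's agent by its checkpointed agent ID, and an operation carrying a stale ID fails the lookup and aborts the process.

```python
class Master:
    def __init__(self):
        self.slaves = {}  # registered agents, keyed by agent ID

    def remove_operation(self, operation):
        slave = self.slaves.get(operation["slave_id"])
        # Models the CHECK(slave != nullptr): an operation checkpointed
        # under a stale agent ID reaches here and aborts the master.
        assert slave is not None, operation["slave_id"]
        slave["operations"].pop(operation["uuid"], None)

master = Master()
master.slaves["S1"] = {"operations": {"op-1": {}}}
master.remove_operation({"slave_id": "S1", "uuid": "op-1"})  # fine

crashed = False
try:
    # Operation checkpointed under an agent ID that is no longer registered:
    master.remove_operation({"slave_id": "S0-stale", "uuid": "op-2"})
except AssertionError:
    crashed = True
```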

> Operation feedback with stale agent ID crashes the master
> -
>
> Key: MESOS-10011
> URL: https://issues.apache.org/jira/browse/MESOS-10011
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, master
>Affects Versions: 1.9.0
>Reporter: Yan Xu
>Priority: Critical
>
> We have observed the following in our environment.
> {noformat}
> F1003 17:35:30.742681 58334 master.cpp:12528] Check failed: slave != nullptr 
> f664c4a9-d1ca-4cd0-88e4-0a6acf20e629-S218
> *** Check failure stack trace: ***
> @ 0x7fd36ca9cf4d  google::LogMessage::Fail()
> @ 0x7fd36ca9f13d  google::LogMessage::SendToLog()
> @ 0x7fd36ca9ca87  google::LogMessage::Flush()
> @ 0x7fd36ca9fbc9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fd36b5ae3bc  mesos::internal::master::Master::removeOperation()
> @ 0x7fd36b5b3446  
> mesos::internal::master::Master::updateOperationStatus()
> {noformat}
> This follows registration of an agent that has changed its agent ID due to 
> losing its local state.
> The check failure code is in 
> [Master::removeOperation|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/master/master.cpp#L12451].
> The masters would enter a crash loop unless the operation checkpoint state 
> (i.e., {{resources_and_operations.state}}) on the offending agent is deleted.
>  Even though we try to minimize the cases where an agent would lose its 
> state, it can still happen when the {{latest}} symlink is removed either by 
> an operator or automatically [in certain 
> cases|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/slave.cpp#L7719-L7725].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10011) Operation feedback with stale agent ID crashes the master

2019-10-10 Thread Yan Xu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949147#comment-16949147
 ] 

Yan Xu commented on MESOS-10011:


To stay consistent with how the agent handles checkpointed resources, the agent 
should probably convert the agent ID to the new one upon recovery. But would 
this change confuse the master?






[jira] [Commented] (MESOS-10011) Operation feedback with stale agent ID crashes the master

2019-10-10 Thread Yan Xu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949139#comment-16949139
 ] 

Yan Xu commented on MESOS-10011:


[~greggomann] any thoughts on how this should be addressed?






[jira] [Commented] (MESOS-10011) Operation feedback with stale agent ID crashes the master

2019-10-10 Thread Yan Xu (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949069#comment-16949069
 ] 

Yan Xu commented on MESOS-10011:


In our environment we only use old-style RESERVE/CREATE persistent volumes. A 
plausible scenario is that the scheduler fails to acknowledge the operation 
feedback, so the checkpointed update still exists with the original agent ID 
in it.

After the agent loses its state, because 
[resources_and_operations.state|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/paths.hpp#L74]
 lives outside the 
[slaves|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/paths.hpp#L59]
 state, the unacknowledged operations don't get cleaned up and now carry the 
stale agent ID.
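The layout can be sketched as follows (paths are illustrative; the real constants live in {{src/slave/paths.hpp}}): per-agent state sits under {{meta/slaves/<agent-id>}} and is abandoned when the agent comes back with a new ID, while the operations checkpoint sits beside it and survives the ID change.

```python
import posixpath

META = "/var/lib/mesos/meta"  # hypothetical <work_dir>/meta

def agent_state_path(agent_id):
    # Recreated per agent ID: effectively lost when the agent
    # comes back with a new ID.
    return posixpath.join(META, "slaves", agent_id)

def operations_checkpoint_path():
    # Lives outside meta/slaves/<agent-id>, so it survives an agent ID
    # change and keeps referencing the old (stale) agent ID.
    return posixpath.join(META, "resources_and_operations.state")

old = agent_state_path("S218")  # ID before losing local state
new = agent_state_path("S219")  # ID after re-registration
ops = operations_checkpoint_path()
```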






[jira] [Created] (MESOS-10011) Operation feedback with stale agent ID crashes the master

2019-10-10 Thread Yan Xu (Jira)
Yan Xu created MESOS-10011:
--

 Summary: Operation feedback with stale agent ID crashes the master
 Key: MESOS-10011
 URL: https://issues.apache.org/jira/browse/MESOS-10011
 Project: Mesos
  Issue Type: Bug
  Components: agent, master
Affects Versions: 1.9.0
Reporter: Yan Xu


We have observed the following in our environment.
{noformat}
F1003 17:35:30.742681 58334 master.cpp:12528] Check failed: slave != nullptr 
f664c4a9-d1ca-4cd0-88e4-0a6acf20e629-S218
*** Check failure stack trace: ***
@ 0x7fd36ca9cf4d  google::LogMessage::Fail()
@ 0x7fd36ca9f13d  google::LogMessage::SendToLog()
@ 0x7fd36ca9ca87  google::LogMessage::Flush()
@ 0x7fd36ca9fbc9  google::LogMessageFatal::~LogMessageFatal()
@ 0x7fd36b5ae3bc  mesos::internal::master::Master::removeOperation()
@ 0x7fd36b5b3446  
mesos::internal::master::Master::updateOperationStatus()
{noformat}
This follows registration of an agent that has changed its agent ID due to 
losing its local state.

The check failure code is in 
[Master::removeOperation|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/master/master.cpp#L12451].

The masters would enter a crash loop unless the operation checkpoint state 
(i.e., {{resources_and_operations.state}}) on the offending agent is deleted.

 Even though we try to minimize the cases where an agent would lose its state, 
it can still happen when the {{latest}} symlink is removed either by an 
operator or automatically [in certain 
cases|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/slave.cpp#L7719-L7725].





[jira] [Commented] (MESOS-9768) Allow operators to mount the container rootfs with the `nosuid` flag

2019-05-17 Thread Yan Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842819#comment-16842819
 ] 

Yan Xu commented on MESOS-9768:
---

What we are primarily interested in is setting it for the {{overlay}} 
backend, but there are multiple backend options. It seems like a common flag 
{{--image_mount_options}} could be applicable to the {{bind}} backend as well 
(maybe {{aufs}} too? [~gilbert]). It doesn't apply to the {{copy}} backend, of 
course.

One could argue that since this is a security concern, perhaps one flag to 
control all mounts (volumes) makes sense, but I am afraid that would be very 
broad and increase complexity. Also, AFAIK you can just set {{nosuid}} on the 
underlying partition for those cases; it's overlayfs that doesn't honor it, so 
we have to protect it this way.

We can probably start off with a generic flag {{--image_mount_options}} and use 
documentation to indicate which backends are supported.

[~jamespeach] [~gilbert] [~jieyu] WDYT?

> Allow operators to mount the container rootfs with the `nosuid` flag
> 
>
> Key: MESOS-9768
> URL: https://issues.apache.org/jira/browse/MESOS-9768
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Priority: Major
>
> If cluster users are allowed to launch containers with arbitrary images, 
> those images may container setuid programs. For security reasons (auditing, 
> privilege escalation), operators may wish to ensure that setuid programs 
> cannot be used within a container.
>  
> We should provide a way for operators to specify that container 
> volumes (including `/`) should be mounted with the `nosuid` flag.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable

2018-11-02 Thread Yan Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673503#comment-16673503
 ] 

Yan Xu commented on MESOS-9368:
---

cc [~fiu]

> The agent can be resending status updates too aggressively and the backoff is 
> not configurable
> --
>
> Key: MESOS-9368
> URL: https://issues.apache.org/jira/browse/MESOS-9368
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Major
>
> The current behavior is that the agent queues status updates in a 
> "stream" which has an exponential backoff window from 10 secs to 10 mins. In 
> each retry the front of the queue is sent, so if multiple statuses are queued 
> up, subsequent ones are not attempted until the first one is acked. So if 
> the frameworks are for some reason not able to ack at all, there is one 
> update per task in flight at a time.
> If in a cluster we have 500,000 tasks with pending status updates and the 
> master fails over, after each agent reregisters it starts to send these 
> updates, so we are looking at 500,000 updates ~immediately + 500,000 updates 
> 10 secs later + 500,000 updates 20, 40, 80, 160, 320, 600 secs later.
> Given that the initial communication of task state is covered by the agent 
> reregistration message and the framework reconciliation requests, it seems 
> that we can safely reduce the retry frequency further, optionally of course. 
> It's not currently configurable so we need to expose a flag for it.





[jira] [Commented] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable

2018-11-02 Thread Yan Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673497#comment-16673497
 ] 

Yan Xu commented on MESOS-9368:
---

[~ipronin] [~jasonlai] do you guys feel similarly for your environments?






[jira] [Created] (MESOS-9368) The agent can be resending status updates too aggressively and the backoff is not configurable

2018-11-02 Thread Yan Xu (JIRA)
Yan Xu created MESOS-9368:
-

 Summary: The agent can be resending status updates too 
aggressively and the backoff is not configurable
 Key: MESOS-9368
 URL: https://issues.apache.org/jira/browse/MESOS-9368
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


The current behavior is that the agent queues status updates in a "stream" 
which has an exponential backoff window from 10 secs to 10 mins. In each retry 
the front of the queue is sent, so if multiple statuses are queued up, 
subsequent ones are not attempted until the first one is acked. So if the 
frameworks are for some reason not able to ack at all, there is one update per 
task in flight at a time.

If in a cluster we have 500,000 tasks with pending status updates and the 
master fails over, after each agent reregisters it starts to send these 
updates, so we are looking at 500,000 updates ~immediately + 500,000 updates 
10 secs later + 500,000 updates 20, 40, 80, 160, 320, 600 secs later.

Given that the initial communication of task state is covered by the agent 
reregistration message and the framework reconciliation requests, it seems that 
we can safely reduce the retry frequency further, optionally of course. It's 
not currently configurable so we need to expose a flag for it.
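The retry schedule described above can be sketched as follows (a Python sketch; the 10 sec initial interval and 10 min cap come from the description, and the interval values mirror the 10, 20, 40, 80, 160, 320, 600 sec sequence):

```python
import itertools

def backoff_schedule(initial_secs=10, cap_secs=600):
    """Exponential backoff: double the interval each retry, capped at 10 mins."""
    interval = initial_secs
    while True:
        yield min(interval, cap_secs)
        interval *= 2

# The first seven retry intervals after the initial send:
intervals = list(itertools.islice(backoff_schedule(), 7))

# With 500,000 tasks pending and one in-flight update per task, every
# interval boundary triggers another full wave of updates:
total_updates = 500_000 * (1 + len(intervals))  # initial send + 7 retries
```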





[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-12 Thread Yan Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612870#comment-16612870
 ] 

Yan Xu commented on MESOS-9178:
---

So my proposal is that we have the following metrics:

{noformat:title=}
"master/p25_agents_reregistered_secs": 1,
"master/p50_agents_reregistered_secs": 2,
"master/p75_agents_reregistered_secs": 3,
"master/p90_agents_reregistered_secs": 4,
"master/p99_agents_reregistered_secs": 5,
"master/p100_agents_reregistered_secs": 6,
{noformat}

(suggestions welcome for the precise naming and units)

Note that each metric only appears once that percentage of agents has 
reregistered, and it persists until the master fails over, at which point we 
start over with none of these metrics. Monitoring systems I have worked with 
all support filling missing values with their previous values, so if you plot 
these I do expect them to continuously show the changes in failover 
performance over time.
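A sketch of how these gauges could be computed (illustrative only; the metric names follow the proposal above, and the timer-vs-gauge question is left to the review):

```python
import math

PERCENTILES = (25, 50, 75, 90, 99, 100)

def reregistration_metrics(reregistered_secs, total_agents):
    """reregistered_secs: sorted reregistration times (seconds since the
    master was elected) of the agents seen so far.  A pNN gauge only exists
    once ceil(NN% of total_agents) have reregistered, and is then pinned to
    the time at which that happened."""
    metrics = {}
    for p in PERCENTILES:
        needed = math.ceil(total_agents * p / 100)
        if 0 < needed <= len(reregistered_secs):
            metrics[f"master/p{p}_agents_reregistered_secs"] = \
                reregistered_secs[needed - 1]
    return metrics

# 4 of 10 agents back: only the p25 gauge exists so far.
partial = reregistration_metrics([1, 2, 3, 4], 10)
# All 6 of 6 agents back: every gauge exists.
full = reregistration_metrics([1, 2, 3, 4, 5, 6], 6)
```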

I agree that we can publish to the event stream (we currently have AGENT_ADDED 
and AGENT_REMOVED) but for monitoring purposes it's shifting the metric 
creation logic to an external entity.

In terms of implementation, given the current tools we have, I think it works 
best if each metric above is its own timer (but I'll comment in more detail in 
the review).

> Add a metric for master failover time.
> --
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Xudong Ni
>Assignee: Xudong Ni
>Priority: Minor
>
> When an agent reregisters, the time delta from that moment to the master 
> elected time is saved. Over the course of reregistration, each data entry 
> represents the registration time delta from the master elected time. The 
> percentiles of these data points can represent overall reregistration 
> progress; in case of degradation toward the end of reregistration, the high 
> percentiles will reflect it.
> Note: these metrics only represent completed reregistrations. They do not 
> cover agents that were ultimately marked unreachable (i.e., whose 
> reregistration never actually happened); unreachable agents are already 
> monitored by existing metrics.
> https://reviews.apache.org/r/68706/





[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-08-22 Thread Yan Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589463#comment-16589463
 ] 

Yan Xu commented on MESOS-9178:
---

+1. Yup that's the approach we talked about. Sorry the JIRA didn't mention it.

> Add a metric for master failover time.
> --
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Xudong Ni
>Assignee: Xudong Ni
>Priority: Minor
>
> Quote from Yan Xu: Previously the argument against it was that you don't 
> know if all agents are going to come back after a master failover, so there's 
> no certain point that marks the end of "full reregistration of all agents". 
> However, empirically the number of agents usually doesn't change during the 
> failover, and there's an upper bound on the wait (after a 10 min timeout the 
> agents that haven't reregistered are going to be marked unreachable), so we 
> can just use that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for", i.e., either reregistered or marked 
> as unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later, after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.





[jira] [Created] (MESOS-9171) Mesos agent crashes

2018-08-21 Thread Yan Xu (JIRA)
Yan Xu created MESOS-9171:
-

 Summary: Mesos agent crashes
 Key: MESOS-9171
 URL: https://issues.apache.org/jira/browse/MESOS-9171
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.7.0
Reporter: Yan Xu


The error:

{noformat:title=}
../../3rdparty/stout/include/stout/option.hpp:118: const T& Option<T>::get() 
const & [with T = std::basic_string<char>]: Assertion `isSome()' failed.
{noformat}

The backtrace:

{noformat:title=}
Program terminated with signal SIGABRT, Aborted.
#0  0x7fd0ab922495 in raise () from /lib64/libc.so.6
#0  0x7fd0ab922495 in raise () from /lib64/libc.so.6
#1  0x7fd0ab923c75 in abort () from /lib64/libc.so.6
#2  0x7fd0ab91b60e in __assert_fail_base () from /lib64/libc.so.6
#3  0x7fd0ab91b6d0 in __assert_fail () from /lib64/libc.so.6
#4  0x7fd0ae473c33 in Option::get() const & 
(this=0x7fd0a4deb5a8) at ../../3rdparty/stout/include/stout/option.hpp:118
#5  0x7fd0ae48ae94 in get (this=0x7fd0a4deb5a8) at 
/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/unordered_set.h:93
#6  mesos::internal::slave::NetworkCniIsolatorProcess::usage 
(this=0x7fd0a4dea800, containerId=...) at 
../../src/slave/containerizer/mesos/isolators/network/cni/cni.cpp:1516
#7  0x7fd0ae1770da in operator() (process=, a0=..., 
promise=..., __closure=) at 
../../3rdparty/libprocess/include/process/dispatch.hpp:354
#8  invoke&, process::Future 
(T::*)(P0), A0&&) [with R = mesos::ResourceStatistics; T = 
mesos::internal::slave::MesosIsolatorProcess; P0 = const mesos::ContainerID&; 
A0 = const 
mesos::ContainerID&]::,
 std::default_delete > >, 
std::decay::type&&, process::ProcessBase*)>, 
std::unique_ptr, 
std::default_delete > >, 
mesos::ContainerID, process::ProcessBase*> (f=...) at 
../../3rdparty/stout/include/stout/cpp17.hpp:42
#9  invoke_expand&, process::Future 
(T::*)(P0), A0&&) [with R = mesos::ResourceStatistics; T = 
mesos::internal::slave::MesosIsolatorProcess; P0 = const mesos::ContainerID&; 
A0 = const 
mesos::ContainerID&]::,
 std::default_delete > >, 
std::decay::type&&, process::ProcessBase*)>, 
std::tuple, 
std::default_delete > >, 
mesos::ContainerID, std::_Placeholder<1> >, 
std::tuple, 0ul, 1ul, 2ul> (args=..., bound_args=..., 
f=...) at ../../3rdparty/stout/include/stout/lambda.hpp:292
#10 operator() (this=) at 
../../3rdparty/stout/include/stout/lambda.hpp:331
#11 invoke&, 
process::Future (T::*)(P0), A0&&) [with R = mesos::ResourceStatistics; T = 
mesos::internal::slave::MesosIsolatorProcess; P0 = const mesos::ContainerID&; 
A0 = const 
mesos::ContainerID&]::,
 std::default_delete > >, 
std::decay::type&&, process::ProcessBase*)>, 
std::unique_ptr, 
std::default_delete > >, 
mesos::ContainerID, std::_Placeholder<1> >, process::ProcessBase*> (f=...) at 
../../3rdparty/stout/include/stout/cpp17.hpp:42
#12 operator()&, process::Future (T::*)(P0), A0&&) [with R = 
mesos::ResourceStatistics; T = mesos::internal::slave::MesosIsolatorProcess; P0 
= const mesos::ContainerID&; A0 = const 
mesos::ContainerID&]::,
 std::default_delete > >, 
std::decay::type&&, process::ProcessBase*)>, 
std::unique_ptr, 
std::default_delete > >, 
mesos::ContainerID, std::_Placeholder<1> >, process::ProcessBase*> (f=..., 
this=) at ../../3rdparty/stout/include/stout/lambda.hpp:398
#13 lambda::CallableOnce::CallableFn
 process::dispatch(process::PID const&, 
process::Future 
(mesos::internal::slave::MesosIsolatorProcess::*)(mesos::ContainerID const&), 
mesos::ContainerID 
const&)::{lambda(std::unique_ptr, 
std::default_delete > >, 
mesos::ContainerID&&, process::ProcessBase*)#1}, 
std::unique_ptr, 
std::default_delete > >, 
mesos::ContainerID, std::_Placeholder<1> > 
>::operator()(process::ProcessBase*&&) && (this=0x7fd099a2a630, 
args#0=) at ../../3rdparty/stout/include/stout/lambda.hpp:463
#14 0x7fd0aed493a2 in operator() (args#0=0x7fd0a4deb6b8, this=) at ../../../3rdparty/stout/include/stout/lambda.hpp:443
#15 process::ProcessBase::consume(process::DispatchEvent&&) (this=, event=...) at ../../../3rdparty/libprocess/src/process.cpp:3563
#16 0x7fd0aed88609 in serve (event=..., this=0x7fd0a4deb6b8) at 
../../../3rdparty/libprocess/include/process/process.hpp:87
#17 process::ProcessManager::resume (this=, 
process=0x7fd0a4deb6b8) at ../../../3rdparty/libprocess/src/process.cpp:2988
#18 0x7fd0aed8f856 in operator() (__closure=0x7fd0a4d44dd8) at 
../../../3rdparty/libprocess/src/process.cpp:2497
#19 _M_invoke<> (this=0x7fd0a4d44dd8) at 
/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/functional:1700
#20 operator() (this=0x7fd0a4d44dd8) at 
/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/functional:1688
#21 
std::thread::_Impl()>
 >::_M_run(void) (this=0x7fd0a4d44dc0) at 
/opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/thread:115
#22 0x7fd0abd3a470 in ?? () from /usr/lib64/libstdc++.so.6
#23 0x7fd0abf91aa1 in start_thread () from /lib64/libpthread.so.0
#24 

[jira] [Commented] (MESOS-8897) ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky

2018-05-09 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469112#comment-16469112
 ] 

Yan Xu commented on MESOS-8897:
---

cc [~hdost]

> ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky
> -
>
> Key: MESOS-8897
> URL: https://issues.apache.org/jira/browse/MESOS-8897
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
>Reporter: Yan Xu
>Priority: Major
>
> {noformat:title=}
> [ RUN ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill
> meta-data=/dev/loop0 isize=256 agcount=2, agsize=5120 blks
>  = sectsz=512 attr=2, projid32bit=1
>  = crc=0
> data = bsize=4096 blocks=10240, imaxpct=25
>  = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0
> log =internal log bsize=4096 blocks=1200, version=2
>  = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> I0508 17:55:12.353438 13453 exec.cpp:162] Version: 1.7.0
> I0508 17:55:12.370332 13451 exec.cpp:236] Executor registered on agent 
> 49668ffa-2a69-4867-b31a-4972b4ac13d2-S0
> I0508 17:55:12.376093 13447 executor.cpp:178] Received SUBSCRIBED event
> I0508 17:55:12.376771 13447 executor.cpp:182] Subscribed executor on 
> mesos.vagrant
> I0508 17:55:12.377038 13447 executor.cpp:178] Received LAUNCH event
> I0508 17:55:12.381901 13447 executor.cpp:665] Starting task 
> edb798b4-1b16-4de4-828c-0db132df70ab
> I0508 17:55:12.387936 13447 executor.cpp:485] Running 
> '/tmp/mesos-build/mesos/build/src/mesos-containerizer launch 
> '
> I0508 17:55:12.392854 13447 executor.cpp:678] Forked command at 13456
> 2+0 records in
> 2+0 records out
> 2097152 bytes (2.1 MB) copied, 0.00404074 s, 519 MB/s
> ../../src/tests/containerizer/xfs_quota_tests.cpp:618: Failure
> Expected: (limit.disk().get()) > (Megabytes(1)), actual: 1MB vs 1MB
> [ FAILED ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill (1182 ms)
> {noformat}
> [~jpe...@apache.org] mentioned that 
> {code}
> // If the soft limit is exceeded the container should be killed.
> if (quotaInfo->used > quotaInfo->softLimit) {
>   Resource resource;
>   resource.set_name("disk");
>   resource.set_type(Value::SCALAR);
>   resource.mutable_scalar()->set_value(
>     quotaInfo->used.bytes() / Bytes::MEGABYTES);
>
>   info->limitation.set(
>       protobuf::slave::createContainerLimitation(
>           Resources(resource),
>           "Disk usage (" + stringify(quotaInfo->used) +
>           ") exceeds quota (" +
>           stringify(quotaInfo->softLimit) + ")",
>           TaskStatus::REASON_CONTAINER_LIMITATION_DISK));
> }
>   }
> {code}
> Converting to MB is rounding down, so we report less space than was actually 
> used.





[jira] [Created] (MESOS-8897) ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky

2018-05-09 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8897:
-

 Summary: ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill is flaky
 Key: MESOS-8897
 URL: https://issues.apache.org/jira/browse/MESOS-8897
 Project: Mesos
  Issue Type: Bug
  Components: flaky, test
Reporter: Yan Xu


{noformat:title=}
[ RUN ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill
meta-data=/dev/loop0 isize=256 agcount=2, agsize=5120 blks
 = sectsz=512 attr=2, projid32bit=1
 = crc=0
data = bsize=4096 blocks=10240, imaxpct=25
 = sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=1200, version=2
 = sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
I0508 17:55:12.353438 13453 exec.cpp:162] Version: 1.7.0
I0508 17:55:12.370332 13451 exec.cpp:236] Executor registered on agent 
49668ffa-2a69-4867-b31a-4972b4ac13d2-S0
I0508 17:55:12.376093 13447 executor.cpp:178] Received SUBSCRIBED event
I0508 17:55:12.376771 13447 executor.cpp:182] Subscribed executor on 
mesos.vagrant
I0508 17:55:12.377038 13447 executor.cpp:178] Received LAUNCH event
I0508 17:55:12.381901 13447 executor.cpp:665] Starting task 
edb798b4-1b16-4de4-828c-0db132df70ab
I0508 17:55:12.387936 13447 executor.cpp:485] Running 
'/tmp/mesos-build/mesos/build/src/mesos-containerizer launch 
'
I0508 17:55:12.392854 13447 executor.cpp:678] Forked command at 13456
2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.00404074 s, 519 MB/s
../../src/tests/containerizer/xfs_quota_tests.cpp:618: Failure
Expected: (limit.disk().get()) > (Megabytes(1)), actual: 1MB vs 1MB
[ FAILED ] ROOT_XFS_QuotaTest.DiskUsageExceedsQuotaWithKill (1182 ms)
{noformat}

[~jpe...@apache.org] mentioned that 

{code}
// If the soft limit is exceeded the container should be killed.
if (quotaInfo->used > quotaInfo->softLimit) {
  Resource resource;
  resource.set_name("disk");
  resource.set_type(Value::SCALAR);
  resource.mutable_scalar()->set_value(
    quotaInfo->used.bytes() / Bytes::MEGABYTES);

  info->limitation.set(
      protobuf::slave::createContainerLimitation(
          Resources(resource),
          "Disk usage (" + stringify(quotaInfo->used) +
          ") exceeds quota (" +
          stringify(quotaInfo->softLimit) + ")",
          TaskStatus::REASON_CONTAINER_LIMITATION_DISK));
}
  }
{code}

Converting to MB is rounding down, so we report less space than was actually 
used.
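The rounding behavior can be sketched in isolation; the helper names below are illustrative, not Mesos APIs:

```cpp
#include <cstdint>

// Illustrative helpers (not Mesos APIs): integer division by the number
// of bytes per megabyte rounds down, so usage just under a boundary is
// under-reported (e.g. 2 MB minus 1 byte reports as 1 MB).
uint64_t bytesToMegabytesTruncated(uint64_t bytes) {
  const uint64_t MB = 1024ULL * 1024;
  return bytes / MB;
}

// Rounding up instead never claims less space than was actually used.
uint64_t bytesToMegabytesRoundedUp(uint64_t bytes) {
  const uint64_t MB = 1024ULL * 1024;
  return (bytes + MB - 1) / MB;
}
```

Rounding up (or reporting raw bytes) would avoid reporting less disk than the container actually consumed.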



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8750) Check failed: !slaves.registered.contains(task->slave_id)

2018-05-03 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463379#comment-16463379
 ] 

Yan Xu commented on MESOS-8750:
---

{code:title=}
commit 520b729857223aeade345cbdf61209ec4f395ad9
Author: Megha Sharma 
Date:   Thu May 3 22:09:02 2018 -0700

Remove unknown unreachable tasks when agent reregisters.

A RunTaskMesssage could get dropped for an agent while it's
disconnected from the master and when such an agent goes unreachable
then this dropped task message gets added to the unreachable tasks.
When the agent reregisters, the master sends status updates for the
tasks that the agent reported when re-registering and these tasks are
also removed from the unreachableTasks on the framework but since the
agent doesn't know about the dropped task so it doesn't get removed
from the unreachableTasks leading to a check failure when
this inconsistency is detected during framework removal.

Review: https://reviews.apache.org/r/66644/
{code}
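A minimal sketch of the pruning this patch describes, with stand-in types for the master's real data structures:

```cpp
#include <iterator>
#include <string>
#include <unordered_map>

using TaskId = std::string;
using SlaveId = std::string;

// Hypothetical sketch of the fix: when an agent reregisters, drop *all*
// of the framework's unreachable tasks recorded against that agent, not
// just the ones the agent reports, so tasks from dropped RunTaskMessages
// can't linger and trip the CHECK during framework removal.
void pruneUnreachableTasks(
    std::unordered_map<TaskId, SlaveId>& unreachableTasks,
    const SlaveId& reregisteredAgent) {
  for (auto it = unreachableTasks.begin(); it != unreachableTasks.end();) {
    // erase() returns the next valid iterator, so iteration stays safe.
    it = (it->second == reregisteredAgent) ? unreachableTasks.erase(it)
                                           : std::next(it);
  }
}
```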

> Check failed: !slaves.registered.contains(task->slave_id)
> -
>
> Key: MESOS-8750
> URL: https://issues.apache.org/jira/browse/MESOS-8750
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Affects Versions: 1.6.0
>Reporter: Megha Sharma
>Assignee: Megha Sharma
>Priority: Critical
>
> It appears that in certain circumstances an unreachable task doesn't get 
> cleaned up from {{framework.unreachableTasks}} when the respective agent 
> re-registers, leading to this check failure later when the framework is being 
> removed. When an agent goes unreachable, the master adds the tasks from this 
> agent to {{framework.unreachableTasks}}; when such an agent re-registers, the 
> master removes the tasks that it specifies during re-registration from this 
> data structure. However, there could be tasks that the agent doesn't know 
> about, e.g., tasks whose runTask message got dropped, and such tasks will not 
> get removed from {{unreachableTasks}}.
> {noformat}
> F0310 13:30:58.856665 62740 master.cpp:9671] Check failed: 
> !slaves.registered.contains(task->slave_id()) Unreachable task  of 
> framework 4f57975b-05dd-4118-8674-5b29a86c6a6c-0850 was found on registered 
> agent 683c4a92-b5a0-490c-998a-6113fc86d37a-S1428
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8630) All subsequent registry operations fail after the registrar is aborted after a failed update

2018-05-01 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8630:
-

Assignee: Xudong Ni

> All subsequent registry operations fail after the registrar is aborted after 
> a failed update
> 
>
> Key: MESOS-8630
> URL: https://issues.apache.org/jira/browse/MESOS-8630
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Yan Xu
>Assignee: Xudong Ni
>Priority: Major
>
> A failure to update the registry always aborts the registrar but doesn't 
> always abort the master process.
> When the registrar fails to update the registry it would abort the actor and 
> fail all future operations. The rationale as explained here: 
> [https://github.com/apache/mesos/commit/5eaf1eb346fc2f46c852c1246bdff12a89216b60]
> {quote}In this event, the Master won't commit suicide until the initial
>  failure is processed. However, in the interim, subsequent operations
>  are potentially being performed against the Registrar. This could lead
>  to fighting between masters if a "demoted" master re-attempts to
>  acquire log-leadership!
> {quote}
> However, when the registry update is requested by an operator API 
> (maintenance, quota update, etc.), the master process doesn't shut down (a 500 
> error is returned to the client instead) and all subsequent registry 
> operations fail!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8618) ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.

2018-04-30 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459442#comment-16459442
 ] 

Yan Xu commented on MESOS-8618:
---

{noformat:title=}
commit 1c6d9e5e6d7439444c77d6c91b18642f69557dfe
Author: Jiang Yan Xu 
Date:   Mon Apr 23 14:59:44 2018 -0700

Fixed flaky ReconciliationTest.ReconcileStatusUpdateTaskState.

To simulate a master failover we need to use `replicated_log` as the
registry otherwise the master loses persisted info about the agents.

Review: https://reviews.apache.org/r/66769
{noformat}

> ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.
> ---
>
> Key: MESOS-8618
> URL: https://issues.apache.org/jira/browse/MESOS-8618
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: ec Debian 9 with SSL
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Major
>  Labels: flaky-test
> Fix For: 1.6.0
>
> Attachments: 
> ReconciliationTest.ReconcileStatusUpdateTaskState-badrun.txt
>
>
> {noformat}
> ../../src/tests/reconciliation_tests.cpp:1129
>   Expected: TASK_RUNNING
> To be equal to: update->state()
>   Which is: TASK_FINISHED
> {noformat}
> {noformat}
> ../../src/tests/reconciliation_tests.cpp:1130: Failure
>   Expected: TaskStatus::REASON_RECONCILIATION
>   Which is: 9
> To be equal to: update->reason()
>   Which is: 32
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8855) Change TaskStatus.Reason's default value to something

2018-04-30 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8855:
-

 Summary: Change TaskStatus.Reason's default value to something 
 Key: MESOS-8855
 URL: https://issues.apache.org/jira/browse/MESOS-8855
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


We are constantly adding new task reasons, and clients that don't recognize a 
new reason fall back to the default enum value. Right now the default (first) 
value is {{REASON_COMMAND_EXECUTOR_FAILED}}, and we should change it to 
something more appropriate.

Also [~jieyu] has this TODO

{code:title=}
enum Reason {
// TODO(jieyu): The default value when a caller doesn't check for
// presence is 0 and so ideally the 0 reason is not a valid one.
// Since this is not used anywhere, consider removing this reason.
REASON_COMMAND_EXECUTOR_FAILED = 0;
}
{code}

Note that Mesos already defines {{REASON_TASK_UNKNOWN}}, which is already in 
use, and the fact that there's a task state {{TASK_UNKNOWN}} may influence the 
naming of the default enum field.
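As a sketch of one wire-compatible option (the value name and tag number below are hypothetical): in proto2, the default for an enum-typed field is the first value *listed* in the enum definition, not necessarily the value numbered 0, so a neutral default can be introduced without renumbering existing reasons:

```protobuf
enum Reason {
  // Hypothetical neutral default: listed first, so proto2 treats it as
  // the default for fields that don't set a reason. The tag number is
  // illustrative; existing values keep their tags, so the change is
  // wire-compatible.
  REASON_UNKNOWN_DEFAULT = 99;

  REASON_COMMAND_EXECUTOR_FAILED = 0;
  // ... existing reasons unchanged ...
}
```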



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8618) ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.

2018-04-23 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448971#comment-16448971
 ] 

Yan Xu commented on MESOS-8618:
---

This test failed because we didn't enable the replicated log registry, so the 
master doesn't know about the agent when it reregisters. With MESOS-6406 the 
intention was not to actively send status updates when the agent is known.

However, the discussions about this test exposed a bug: we are not sending 
the "status update state" in this case. I filed MESOS-8824 for that and will 
fix it next.

> ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.
> ---
>
> Key: MESOS-8618
> URL: https://issues.apache.org/jira/browse/MESOS-8618
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: ec Debian 9 with SSL
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Major
>  Labels: flaky-test
> Attachments: 
> ReconciliationTest.ReconcileStatusUpdateTaskState-badrun.txt
>
>
> {noformat}
> ../../src/tests/reconciliation_tests.cpp:1129
>   Expected: TASK_RUNNING
> To be equal to: update->state()
>   Which is: TASK_FINISHED
> {noformat}
> {noformat}
> ../../src/tests/reconciliation_tests.cpp:1130: Failure
>   Expected: TaskStatus::REASON_RECONCILIATION
>   Which is: 9
> To be equal to: update->reason()
>   Which is: 32
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8824) Send the task's latest "status update state" to frameworks when an unreachable agent reregisters.

2018-04-23 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8824:
-

 Summary: Send the task's latest "status update state" to 
frameworks when an unreachable agent reregisters.
 Key: MESOS-8824
 URL: https://issues.apache.org/jira/browse/MESOS-8824
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


With MESOS-6406 the master started to actively send frameworks status updates 
for reregistering agents if the agent:
 - has previously been removed by the master for being unreachable or
 - is unknown to the master due to the garbage collection of the
 unreachable and gone agents in the registry and the master's state.

However we sent the task's [latest 
state|https://github.com/apache/mesos/blob/3711d66aa9eb70e12b184d3c2f79bf56fbd9cffa/include/mesos/v1/mesos.proto#L2147]
 instead of its [latest status update 
state|https://github.com/apache/mesos/blob/3711d66aa9eb70e12b184d3c2f79bf56fbd9cffa/include/mesos/v1/mesos.proto#L2154]
 which means the framework could first get an update with a {{TASK_FINISHED}} 
and then later {{TASK_RUNNING}}.

This is inconsistent with the handling of other master-generated updates, e.g., 
[during 
reconciliation|https://github.com/apache/mesos/blob/3711d66aa9eb70e12b184d3c2f79bf56fbd9cffa/src/master/master.cpp#L8603];
 we should send the status update state instead.
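The distinction can be sketched with stand-in types (the real fields are {{Task.state}} and {{Task.status_update_state}} in mesos.proto; the struct and strings below are illustrative):

```cpp
#include <string>

// Stand-in for mesos::internal::Task: a task tracks both its latest
// in-memory state and the latest state conveyed via status updates.
struct TaskLite {
  std::string latestState;        // Task.state, e.g. "TASK_FINISHED"
  std::string statusUpdateState;  // Task.status_update_state, e.g. "TASK_RUNNING"
};

// When the master synthesizes an update for a reregistering agent's task,
// report the status update state, matching what reconciliation does, so
// frameworks never see TASK_FINISHED followed by TASK_RUNNING.
std::string stateToReport(const TaskLite& task) {
  return task.statusUpdateState;
}
```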



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8618) ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.

2018-04-23 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8618:
-

Assignee: Yan Xu

> ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.
> ---
>
> Key: MESOS-8618
> URL: https://issues.apache.org/jira/browse/MESOS-8618
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: ec Debian 9 with SSL
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Major
>  Labels: flaky-test
> Attachments: 
> ReconciliationTest.ReconcileStatusUpdateTaskState-badrun.txt
>
>
> {noformat}
> ../../src/tests/reconciliation_tests.cpp:1129
>   Expected: TASK_RUNNING
> To be equal to: update->state()
>   Which is: TASK_FINISHED
> {noformat}
> {noformat}
> ../../src/tests/reconciliation_tests.cpp:1130: Failure
>   Expected: TaskStatus::REASON_RECONCILIATION
>   Which is: 9
> To be equal to: update->reason()
>   Which is: 32
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8630) All subsequent registry operations fail after the registrar is aborted after a failed update

2018-04-06 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429105#comment-16429105
 ] 

Yan Xu commented on MESOS-8630:
---

A first step could be to identify all the places that update the registry and 
make them {{LOG(FATAL)}} on failure; we can also see if we can abstract that 
logic out.
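A sketch of what such an abstraction could look like (the function name is hypothetical, and {{std::abort}} stands in for {{LOG(FATAL)}}):

```cpp
#include <cstdlib>
#include <functional>
#include <iostream>

// Hypothetical helper: centralize the "abort the master if a registry
// update fails" policy, so operator-API code paths can't leave the
// registrar wedged while the master keeps serving requests.
bool applyRegistryOperation(const std::function<bool()>& update) {
  if (!update()) {
    // Stand-in for LOG(FATAL): terminate the process instead of
    // returning a 500 to the client and failing all later operations.
    std::cerr << "FATAL: registry update failed; aborting master" << std::endl;
    std::abort();
  }
  return true;
}
```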

> All subsequent registry operations fail after the registrar is aborted after 
> a failed update
> 
>
> Key: MESOS-8630
> URL: https://issues.apache.org/jira/browse/MESOS-8630
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Yan Xu
>Priority: Major
>
> A failure to update the registry always aborts the registrar but doesn't 
> always abort the master process.
> When the registrar fails to update the registry it would abort the actor and 
> fail all future operations. The rationale as explained here: 
> [https://github.com/apache/mesos/commit/5eaf1eb346fc2f46c852c1246bdff12a89216b60]
> {quote}In this event, the Master won't commit suicide until the initial
>  failure is processed. However, in the interim, subsequent operations
>  are potentially being performed against the Registrar. This could lead
>  to fighting between masters if a "demoted" master re-attempts to
>  acquire log-leadership!
> {quote}
> However, when the registry update is requested by an operator API 
> (maintenance, quota update, etc.), the master process doesn't shut down (a 500 
> error is returned to the client instead) and all subsequent registry 
> operations fail!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8636) Master should store `completed` frameworks for lifecycle enforcement separately from that for webUI and endpoints

2018-03-05 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386503#comment-16386503
 ] 

Yan Xu commented on MESOS-8636:
---

This is in line with what Mesos already does for [gone 
agents|https://github.com/apache/mesos/blob/76c38f9d03ee6854e6bcd00a959d697472e0ea58/src/master/registry.proto#L62]
 and should still be valid when we introduce framework persistence 
(MESOS-1719), because we'll likely store only the framework IDs in the 
registry and reconstruct the full framework archive data from agents.

> Master should store `completed` frameworks for lifecycle enforcement 
> separately from that for webUI and endpoints
> -
>
> Key: MESOS-8636
> URL: https://issues.apache.org/jira/browse/MESOS-8636
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Yan Xu
>Priority: Major
>
> Currently the master stores the history of completed frameworks in a map with 
> the full historical data of the framework.
> {code:title=}
> struct Frameworks 
> {
>   BoundedHashMap<FrameworkID, Owned<Framework>> completed;
> }
> {code}
> This map serves two purposes:
> # Rejecting frameworks from reregistering if they have previously been 
> marked as completed.
> # Displaying the history of this framework (i.e., its tasks) via the webUI 
> and endpoints.
> However, because the full framework object is large, it could be 
> prohibitively expensive to keep a long history for purpose 2, which is of 
> relatively low importance. For purpose 1, on the other hand, we only need to 
> persist the framework ID, and keeping a longer history is essential to the 
> integrity of the cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8636) Master should store `completed` frameworks for lifecycle enforcement separately from that for webUI and endpoints

2018-03-05 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8636:
-

 Summary: Master should store `completed` frameworks for lifecycle 
enforcement separately from that for webUI and endpoints
 Key: MESOS-8636
 URL: https://issues.apache.org/jira/browse/MESOS-8636
 Project: Mesos
  Issue Type: Improvement
Reporter: Yan Xu


Currently the master stores the history of completed frameworks in a map with 
the full historical data of the framework.

{code:title=}
struct Frameworks 
{
  BoundedHashMap<FrameworkID, Owned<Framework>> completed;
}
{code}

This map serves two purposes:
# Rejecting frameworks from reregistering if they have previously been marked 
as completed.
# Displaying the history of this framework (i.e., its tasks) via the webUI and 
endpoints.

However, because the full framework object is large, it could be prohibitively 
expensive to keep a long history for purpose 2, which is of relatively low 
importance.

For purpose 1, on the other hand, we only need to persist the framework ID, 
and keeping a longer history is essential to the integrity of the cluster.
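A sketch of the proposed separation (the class and its names are hypothetical, with {{std::string}} standing in for {{FrameworkID}}): a long bounded history of bare IDs for re-registration rejection, kept apart from the short bounded history of full framework objects used by the webUI.

```cpp
#include <deque>
#include <string>
#include <unordered_set>

// Hypothetical bounded history of completed framework IDs: cheap enough
// to keep far more entries than the full-object history.
class CompletedFrameworkIds {
public:
  explicit CompletedFrameworkIds(size_t capacity) : capacity_(capacity) {}

  void insert(const std::string& id) {
    if (ids_.count(id)) return;
    if (order_.size() == capacity_) {
      // Evict the oldest ID once the bound is reached.
      ids_.erase(order_.front());
      order_.pop_front();
    }
    order_.push_back(id);
    ids_.insert(id);
  }

  bool contains(const std::string& id) const { return ids_.count(id) > 0; }

private:
  size_t capacity_;
  std::deque<std::string> order_;        // insertion order for eviction
  std::unordered_set<std::string> ids_;  // O(1) membership checks
};
```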



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6422) cgroups_tests not correctly tearing down testing hierarchies

2018-03-02 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384307#comment-16384307
 ] 

Yan Xu commented on MESOS-6422:
---

Sorry this is low priority for me right now so I am unassigning.

> cgroups_tests not correctly tearing down testing hierarchies
> 
>
> Key: MESOS-6422
> URL: https://issues.apache.org/jira/browse/MESOS-6422
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, containerization
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Minor
>  Labels: cgroups
>
> We currently do the following in 
> [CgroupsTest::TearDownTestCase()|https://github.com/apache/mesos/blob/5e850a362edbf494921fedff4037cf4b53088c10/src/tests/containerizer/cgroups_tests.cpp#L83]
> {code:title=}
> static void TearDownTestCase()
> {
>   AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
> }
> {code}
> One of its derived tests, {{CgroupsNoHierarchyTest}}, treats 
> {{TEST_CGROUPS_HIERARCHY}} as a hierarchy, so it's able to clean it up as a 
> hierarchy.
> However another derived test {{CgroupsAnyHierarchyTest}} would create new 
> hierarchies (if none is available) using {{TEST_CGROUPS_HIERARCHY}} as a 
> parent directory (i.e., base hierarchy) and not as a hierarchy, so when it's 
> time to clean up, it fails:
> {noformat:title=}
> [   OK ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems (1 ms)
> ../../src/tests/containerizer/cgroups_tests.cpp:88: Failure
> (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-6422) cgroups_tests not correctly tearing down testing hierarchies

2018-03-02 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-6422:
-

Assignee: (was: Yan Xu)

> cgroups_tests not correctly tearing down testing hierarchies
> 
>
> Key: MESOS-6422
> URL: https://issues.apache.org/jira/browse/MESOS-6422
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, containerization
>Reporter: Yan Xu
>Priority: Minor
>  Labels: cgroups
>
> We currently do the following in 
> [CgroupsTest::TearDownTestCase()|https://github.com/apache/mesos/blob/5e850a362edbf494921fedff4037cf4b53088c10/src/tests/containerizer/cgroups_tests.cpp#L83]
> {code:title=}
> static void TearDownTestCase()
> {
>   AWAIT_READY(cgroups::cleanup(TEST_CGROUPS_HIERARCHY));
> }
> {code}
> One of its derived tests, {{CgroupsNoHierarchyTest}}, treats 
> {{TEST_CGROUPS_HIERARCHY}} as a hierarchy, so it's able to clean it up as a 
> hierarchy.
> However another derived test {{CgroupsAnyHierarchyTest}} would create new 
> hierarchies (if none is available) using {{TEST_CGROUPS_HIERARCHY}} as a 
> parent directory (i.e., base hierarchy) and not as a hierarchy, so when it's 
> time to clean up, it fails:
> {noformat:title=}
> [   OK ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems (1 ms)
> ../../src/tests/containerizer/cgroups_tests.cpp:88: Failure
> (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8630) All subsequent registry operations fail after the registrar is aborted after a failed update

2018-03-02 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8630:
-

 Summary: All subsequent registry operations fail after the 
registrar is aborted after a failed update
 Key: MESOS-8630
 URL: https://issues.apache.org/jira/browse/MESOS-8630
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Yan Xu


Failure to update registry always aborts the registrar but don't always abort 
the master process.

When the registrar fails to update the registry it would abort the actor and 
fail all future operations. The rationale as explained here: 
[https://github.com/apache/mesos/commit/5eaf1eb346fc2f46c852c1246bdff12a89216b60]
{quote}In this event, the Master won't commit suicide until the initial
 failure is processed. However, in the interim, subsequent operations
 are potentially being performed against the Registrar. This could lead
 to fighting between masters if a "demoted" master re-attempts to
 acquire log-leadership!
{quote}
However when the registrar updates is requested by an operator API 
(maintenance, quota update, etc) the master process doesn't shut down (a 500 
error is returned to the client instead) and all subsequent operations will 
fail!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8622) Agent should send a task status update upon receiving the task

2018-02-27 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8622:
-

 Summary: Agent should send a task status update upon receiving the 
task
 Key: MESOS-8622
 URL: https://issues.apache.org/jira/browse/MESOS-8622
 Project: Mesos
  Issue Type: Improvement
Reporter: Yan Xu


Currently, before the first status update of a successful task launch is sent, 
the steps include filesystem image provisioning and artifact fetching, whose 
duration depends largely on the task itself and not on the performance of "the 
infrastructure", i.e., the Mesos stack, host load or other problems, etc.

Ideally the scheduler would be able to set a timeout on this delay, excluding 
the time spent on FS provisioning and artifact fetching, so it can relaunch 
the task somewhere else instead of waiting indefinitely.

{{TASK_STARTING}} wouldn't work for this purpose because it's sent only after 
the executor is registered.

We can actually just have the agent send {{TASK_STAGING}}. Its 
{{TaskStatus.source = SOURCE_SLAVE}} and unset {{TaskStatus.reason}} can help 
the scheduler distinguish it from the updates generated by reconciliation. 
Creating a new state for this feels unnecessary?
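A sketch of how a scheduler could make that distinction, using stand-in types rather than the real Mesos protobufs:

```cpp
// Stand-ins for mesos::TaskStatus::Source and TaskStatus::Reason; the
// real enums have more values, and an unset reason is modeled here as
// Reason::NONE.
enum class Source { SOURCE_MASTER, SOURCE_SLAVE, SOURCE_EXECUTOR };
enum class Reason { NONE, REASON_RECONCILIATION };

struct TaskStatusLite {
  Source source;
  Reason reason;
};

// An agent-generated "task received" TASK_STAGING would carry
// source == SOURCE_SLAVE with no reason set, whereas reconciliation
// updates carry REASON_RECONCILIATION.
bool isAgentReceiptUpdate(const TaskStatusLite& status) {
  return status.source == Source::SOURCE_SLAVE &&
         status.reason == Reason::NONE;
}
```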



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8602) Subscribers::send incorrectly assumes frameworks are registered

2018-02-22 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8602:
-

 Summary: Subscribers::send incorrectly assumes frameworks are 
registered
 Key: MESOS-8602
 URL: https://issues.apache.org/jira/browse/MESOS-8602
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


We observed this in prod
{noformat}
F0214 00:36:15.746939 3827787 master.cpp:11190] Check failed: 'framework' Must 
be non NULL
{noformat}
which is here in code: 
[https://github.com/apache/mesos/blob/9635d4a2d12fc77935c3d5d166469258634c6b7e/src/master/master.cpp#L11203]
h2. Diagnosis

The checks were added in 
[https://github.com/apache/mesos/commit/cf331184714f692f21988a53fd04fa64fbbb3aba]
 (MESOS-8469):
{code:java}
Framework* framework =
master->getFramework(event.task_added().task().framework_id());

CHECK_NOTNULL(framework);
{code}
However, at least when we recover tasks as the agent reregisters after a 
master failover, the frameworks may not have reregistered yet, so they don't 
show up in the result of {{master->getFramework}}. The checks failed to 
consider this.
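A sketch of the safer lookup shape, with toy stand-ins for the master's framework map:

```cpp
#include <string>
#include <unordered_map>

struct Framework { std::string id; };

// Stand-in for the master's registry of currently (re)registered
// frameworks.
using FrameworkMap = std::unordered_map<std::string, Framework>;

// Return nullptr instead of CHECK-failing: after a master failover an
// agent can reregister (and report its tasks) before its frameworks do,
// so the caller should skip or defer the event when the lookup misses.
const Framework* getFrameworkOrNull(const FrameworkMap& frameworks,
                                    const std::string& id) {
  auto it = frameworks.find(id);
  return it == frameworks.end() ? nullptr : &it->second;
}
```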



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8602) Subscribers::send incorrectly assumes frameworks are registered

2018-02-22 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373661#comment-16373661
 ] 

Yan Xu commented on MESOS-8602:
---

/cc [~greggomann]

> Subscribers::send incorrectly assumes frameworks are registered
> ---
>
> Key: MESOS-8602
> URL: https://issues.apache.org/jira/browse/MESOS-8602
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Major
>
> We observed this in prod
> {noformat}
> F0214 00:36:15.746939 3827787 master.cpp:11190] Check failed: 'framework' 
> Must be non NULL
> {noformat}
> which is here in code: 
> [https://github.com/apache/mesos/blob/9635d4a2d12fc77935c3d5d166469258634c6b7e/src/master/master.cpp#L11203]
> h2. Diagnosis
> The checks were added in 
> [https://github.com/apache/mesos/commit/cf331184714f692f21988a53fd04fa64fbbb3aba]
>  (MESOS-8469):
> {code:java}
> Framework* framework =
> master->getFramework(event.task_added().task().framework_id());
> CHECK_NOTNULL(framework);
> {code}
> However, at least when we recover tasks as the agent reregisters after a 
> master failover, the frameworks may not have reregistered yet, so they don't 
> show up in the result of {{master->getFramework}}. The checks failed to 
> consider this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8595) Mesos agent's use of /tmp for overlayfs could be confusing

2018-02-19 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369442#comment-16369442
 ] 

Yan Xu commented on MESOS-8595:
---

/cc [~gilbert] [~zhitao]

> Mesos agent's use of /tmp for overlayfs could be confusing
> --
>
> Key: MESOS-8595
> URL: https://issues.apache.org/jira/browse/MESOS-8595
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Minor
>
> With MESOS-6000 Mesos creates temp directories under {{/tmp}}. This could be 
> surprising for operators who see no Mesos flags specified with a {{/tmp}} 
> prefix (or defaulting to one) but then discover such directories on the 
> host.
> We should at least group them under {{/tmp/mesos}} to suggest that Mesos 
> created those.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8595) Mesos agent

2018-02-19 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8595:
-

 Summary: Mesos agent
 Key: MESOS-8595
 URL: https://issues.apache.org/jira/browse/MESOS-8595
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


With MESOS-6000 Mesos creates temp directories under {{/tmp}}. This could be 
surprising for operators who see no Mesos flags specified with a {{/tmp}} 
prefix (or defaulting to one) but then discover such directories on the host.

We should at least group them under {{/tmp/mesos}} to suggest that Mesos 
created those.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8544) Required mesos.Task.state doesn't support upgrades.

2018-02-05 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8544:
-

 Summary: Required mesos.Task.state doesn't support upgrades.
 Key: MESOS-8544
 URL: https://issues.apache.org/jira/browse/MESOS-8544
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


Another case for the problem detailed in MESOS-4997. This ticket tracks adding 
an UNKNOWN default to TaskState and fixing all the places that use it as a 
required field.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8232) SlaveTest.RegisteredAgentReregisterAfterFailover is flaky.

2018-01-31 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347285#comment-16347285
 ] 

Yan Xu commented on MESOS-8232:
---

[~alexr] thanks a lot for diligently cleaning up flaky tests and filing tickets 
for them! Sorry for responding late.

> SlaveTest.RegisteredAgentReregisterAfterFailover is flaky.
> --
>
> Key: MESOS-8232
> URL: https://issues.apache.org/jira/browse/MESOS-8232
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 17.04
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Major
>  Labels: flaky-test
> Attachments: RegisteredAgentReregisterAfterFailover-badrun.txt, 
> RegisteredAgentReregisterAfterFailover-badrun2.txt
>
>
> Observed it in our CI:
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/slave_tests.cpp:3740
> Mock function called more times than expected - taking default action 
> specified at:
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/mock_registrar.cpp:54:
> Function call: apply(16-byte object <60-F1 01-F4 38-7F 00-00 90-D0 02-F4 
> 38-7F 00-00>)
>   Returns: 16-byte object  00-00>
>  Expected: to be never called
>Actual: called once - over-saturated and active
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8232) SlaveTest.RegisteredAgentReregisterAfterFailover is flaky.

2018-01-31 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8232:
-

Assignee: Yan Xu

> SlaveTest.RegisteredAgentReregisterAfterFailover is flaky.
> --
>
> Key: MESOS-8232
> URL: https://issues.apache.org/jira/browse/MESOS-8232
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 17.04
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Major
>  Labels: flaky-test
> Attachments: RegisteredAgentReregisterAfterFailover-badrun.txt, 
> RegisteredAgentReregisterAfterFailover-badrun2.txt
>
>
> Observed it in our CI:
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/slave_tests.cpp:3740
> Mock function called more times than expected - taking default action 
> specified at:
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/mock_registrar.cpp:54:
> Function call: apply(16-byte object <60-F1 01-F4 38-7F 00-00 90-D0 02-F4 
> 38-7F 00-00>)
>   Returns: 16-byte object  00-00>
>  Expected: to be never called
>Actual: called once - over-saturated and active
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8507) SLRP discards reservations when the agent is discarded, which could lead to leaked volumes.

2018-01-29 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344362#comment-16344362
 ] 

Yan Xu commented on MESOS-8507:
---

/cc [~chhsia0] [~jieyu]

> SLRP discards reservations when the agent is discarded, which could lead to 
> leaked volumes.
> ---
>
> Key: MESOS-8507
> URL: https://issues.apache.org/jira/browse/MESOS-8507
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Major
>
> In the current SLRP implementation the reservations for new SLRP/CSI backed 
> volumes are checkpointed under {{/slaves/latest/resource_providers}} so 
> when the agent runs into incompatible configuration changes (the kinds that 
> cannot be addressed by MESOS-1739), the operator has to remove the symlink 
> and then the reservations are gone. 
> Then the agent recovers with a new {{SlaveInfo}} and new SLRPs are created to 
> recover the CSI volumes. These CSI volumes will not have reservations and 
> thus will be offered to frameworks of any role, potentially with the data 
> already written by the previous owner. 
>  
> The framework doesn't have any control over this and any chance to clean up 
> before the volumes are re-offered, which is undesired for security reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8507) SLRP discards reservations when the agent is discarded, which could lead to leaked volumes.

2018-01-29 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8507:
-

 Summary: SLRP discards reservations when the agent is discarded, 
which could lead to leaked volumes.
 Key: MESOS-8507
 URL: https://issues.apache.org/jira/browse/MESOS-8507
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


In the current SLRP implementation the reservations for new SLRP/CSI backed 
volumes are checkpointed under {{/slaves/latest/resource_providers}} so 
when the agent runs into incompatible configuration changes (the kinds that 
cannot be addressed by MESOS-1739), the operator has to remove the symlink 
and then the reservations are gone. 

Then the agent recovers with a new {{SlaveInfo}} and new SLRPs are created to 
recover the CSI volumes. These CSI volumes will not have reservations and thus 
will be offered to frameworks of any role, potentially with the data already 
written by the previous owner. 

 

The framework doesn't have any control over this and any chance to clean up 
before the volumes are re-offered, which is undesired for security reasons.





[jira] [Updated] (MESOS-8507) SLRP discards reservations when the agent is discarded, which could lead to leaked volumes.

2018-01-29 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8507:
--
Description: 
In the current SLRP implementation the reservations for new SLRP/CSI backed 
volumes are checkpointed under {{/slaves/latest/resource_providers}} so 
when the agent runs into incompatible configuration changes (the kinds that 
cannot be addressed by MESOS-1739), the operator has to remove the symlink and 
then the reservations are gone. 

Then the agent recovers with a new {{SlaveInfo}} and new SLRPs are created to 
recover the CSI volumes. These CSI volumes will not have reservations and thus 
will be offered to frameworks of any role, potentially with the data already 
written by the previous owner. 

 

The framework doesn't have any control over this and any chance to clean up 
before the volumes are re-offered, which is undesired for security reasons.

  was:
In the current SLRP implementation the reservations for new SLRP/CSI backed 
volumes are checkpointed under {{/slaves/latest/resource_providers}} so 
when the agent runs into incompatible configuration changes (the kinds that 
cannot be addressed by **MESOS-1739), the operator has to remove the symlink 
and then the reservations are gone. 

Then the agent recovers with a new {{SlaveInfo}} and new SLRPs are created to 
recover the CSI volumes. These CSI volumes will not have reservations and thus 
will be offered to frameworks of any role, potentially with the data already 
written by the previous owner. 

 

The framework doesn't have any control over this and any chance to clean up 
before the volumes are re-offered, which is undesired for security reasons.


> SLRP discards reservations when the agent is discarded, which could lead to 
> leaked volumes.
> ---
>
> Key: MESOS-8507
> URL: https://issues.apache.org/jira/browse/MESOS-8507
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Major
>
> In the current SLRP implementation the reservations for new SLRP/CSI backed 
> volumes are checkpointed under {{/slaves/latest/resource_providers}} so 
> when the agent runs into incompatible configuration changes (the kinds that 
> cannot be addressed by MESOS-1739), the operator has to remove the symlink 
> and then the reservations are gone. 
> Then the agent recovers with a new {{SlaveInfo}} and new SLRPs are created to 
> recover the CSI volumes. These CSI volumes will not have reservations and 
> thus will be offered to frameworks of any role, potentially with the data 
> already written by the previous owner. 
>  
> The framework doesn't have any control over this and any chance to clean up 
> before the volumes are re-offered, which is undesired for security reasons.





[jira] [Commented] (MESOS-5368) Consider introducing persistent agent ID

2018-01-29 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344190#comment-16344190
 ] 

Yan Xu commented on MESOS-5368:
---

[~vinodkone] It still seems to me that the proposal to tie the current agent ID 
(and the latest agent symlink) to the entire work_dir is problematic. Even with 
MESOS-1739 there still exist possibilities for the agent's checkpointed info to 
lose compatibility with the new configuration. If that happens, {{rm -f 
/slaves/latest}} is still the cleanest way to discard the state of the 
"agent" (and not the resources it manages). So we can still end up with the 
need to clean up the "agent" but keep the metadata for the resources on the 
host. Of course we should design this in light of the local resource providers.

> Consider introducing persistent agent ID
> 
>
> Key: MESOS-5368
> URL: https://issues.apache.org/jira/browse/MESOS-5368
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Neil Conway
>Priority: Major
>  Labels: mesosphere
>
> Currently, agent IDs identify a single "session" by an agent: that is, an 
> agent receives an agent ID when it registers with the master; it reuses that 
> agent ID if it disconnects and successfully reregisters; if the agent shuts 
> down and restarts, it registers anew and receives a new agent ID.
> It would be convenient to have a "persistent agent ID" that remains the same 
> for the duration of a given agent {{work_dir}}. This would mean that a given 
> persistent volume would not migrate between different persistent agent IDs 
> over time, for example (see MESOS-4894). If we supported permanently removing 
> an agent from the cluster (i.e., the {{work_dir}} and any volumes used by the 
> agent will never be reused), we could use the persistent agent ID to report 
> which agent has been removed.





[jira] [Comment Edited] (MESOS-5368) Consider introducing persistent agent ID

2018-01-29 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344190#comment-16344190
 ] 

Yan Xu edited comment on MESOS-5368 at 1/29/18 11:27 PM:
-

[~vinodkone] It still seems to me that the proposal to tie the current agent ID 
(and the latest agent symlink) to the entire work_dir is problematic. Even with 
MESOS-1739 there still exist possibilities for the agent's checkpointed info to 
lose compatibility with the new configuration. If that happens {{rm -f 
/slaves/latest}} is still the cleanest way to discard the state of the 
"agent" (and not the resources it manages). So we can still end up with the 
need to clean up the "agent" but keep the metadata for the resources on the 
host. Of course we should design this in light of the local resource providers.


was (Author: xujyan):
[~vinodkone] It still seems to me that the proposal to tie the current agent ID 
(and the latest agent symlink) to the entire work_dir is problematic. Even with 
MESOS-1739 there still exits possibilities for the agent's checkpointed info to 
lose compatibility with the new configuration. If that happens {{rm -f 
/slaves/latest}} is still the cleaned way to discard the state of the 
"agent" (and not the resources it manages). So we can still end up with the 
need to clean up the "agent" but keep the metadata for the resources on the 
host. Of course we should design this in light of the local resource providers.

> Consider introducing persistent agent ID
> 
>
> Key: MESOS-5368
> URL: https://issues.apache.org/jira/browse/MESOS-5368
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Neil Conway
>Priority: Major
>  Labels: mesosphere
>
> Currently, agent IDs identify a single "session" by an agent: that is, an 
> agent receives an agent ID when it registers with the master; it reuses that 
> agent ID if it disconnects and successfully reregisters; if the agent shuts 
> down and restarts, it registers anew and receives a new agent ID.
> It would be convenient to have a "persistent agent ID" that remains the same 
> for the duration of a given agent {{work_dir}}. This would mean that a given 
> persistent volume would not migrate between different persistent agent IDs 
> over time, for example (see MESOS-4894). If we supported permanently removing 
> an agent from the cluster (i.e., the {{work_dir}} and any volumes used by the 
> agent will never be reused), we could use the persistent agent ID to report 
> which agent has been removed.





[jira] [Commented] (MESOS-8337) Invalid state transition attempted when agent is lost.

2018-01-12 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16324742#comment-16324742
 ] 

Yan Xu commented on MESOS-8337:
---

{noformat:title=}
commit 35ac2f047abf2c0ea452b98a249c3dbb90d64282 (HEAD -> 1.5.x, apache/1.5.x)
Author: Jiang Yan Xu 
Date:   Fri Jan 12 15:30:15 2018 -0800

Updated CHANGELOG with MESOS-6406, MESOS-7215 and MESOS-8337.

These are all changes we made around partition-awareness in 1.5.0 so
they are grouped together.

commit d59109808443ab2987fd0204d94f9a4e3e84dd9b
Author: James Peach 
Date:   Fri Jan 12 13:46:27 2018 -0800

Prevented a crash when an agent with terminal tasks is partitioned.

If an agent is lost, we try to remove all the tasks that might have
been lost. If a task is already terminal but has unacknowledged status
updates, it is expected that we track it in the unreachable tasks list
so we should remove the CHECK that prevents this. This patch also
changes how unreachable tasks are presented in the HTTP endpoints
so that terminal but unacknowledged tasks are shown in the list of
unreachable tasks and not completed tasks, which is different from
1.4.x where they are shown as completed.

Review: https://reviews.apache.org/r/64940/
{noformat}

> Invalid state transition attempted when agent is lost.
> --
>
> Key: MESOS-8337
> URL: https://issues.apache.org/jira/browse/MESOS-8337
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: James Peach
>Assignee: James Peach
>Priority: Blocker
> Fix For: 1.5.0
>
>
> The change in MESOS-7215 can attempt to transition a task from {{FAILED}} to 
> {{LOST}} when removing a lost agent. This ends up triggering a {{CHECK}} that 
> was added in the same patch.
> {noformat}
> I1214 23:42:16.507931 22396 master.cpp:10155] Removing task 
> mobius-mloop-1512774555_3661616380-xxx with resources disk(allocated: *):200; 
> cpus(allocated: *):0.01; mem(allocated: *):200; ports(allocated: 
> *):[31068-31068, 31069-31069, 31072-31072] of framework 
> afcbfa05-7973-4ad3-8399-4153556a8fa9-3607 on agent 
> daceae53-448b-4349-8503-9dd8132a6828-S4 at slave(1)@17.147.52.220:5 
> (magent0006.xxx.com)
> F1214 23:42:16.507961 22396 master.hpp:2342] Check failed: task->state() == 
> TASK_UNREACHABLE || task->state() == TASK_LOST TASK_FAILED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused

2018-01-09 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319818#comment-16319818
 ] 

Yan Xu commented on MESOS-8125:
---

We used to not need to handle recovering executors after a reboot because the 
agent would have been considered lost: not only did we not need to recover 
the executors, we also didn't need to resume unacknowledged status updates, etc.

In the new scenario we need to handle these, so we cannot simply remove the 
{{latest}} executor run symlink. I guess we should just short-circuit the 
executor reconnect/reregister logic based on the {{rebooted}} field in the 
top-level {{State}} but keep the rest of the recovery logic.

> Agent should properly handle recovering an executor when its pid is reused
> --
>
> Key: MESOS-8125
> URL: https://issues.apache.org/jira/browse/MESOS-8125
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Assignee: Megha Sharma
>Priority: Critical
>
> We know that all executors will be gone once the host on which an agent is 
> running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process 
> is assigned the same pid that the executor had before the reboot. In this 
> case the agent will unsuccessfully try to reregister with the executor, and 
> then transition it to a {{TERMINATING}} state. The executor will sadly get 
> stuck in that state, and the tasks that it started will get stuck in whatever 
> state they were in at the time of the reboot.
> One way of getting rid of stuck executors is to remove the {{latest}} symlink 
> under {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen 
> with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta 
> directory, e.g., 
> {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}.
>  I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host





[jira] [Assigned] (MESOS-8334) PartitionedSlaveReregistrationMasterFailover is flaky.

2018-01-03 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8334:
-

Assignee: Yan Xu  (was: Megha Sharma)

> PartitionedSlaveReregistrationMasterFailover is flaky.
> --
>
> Key: MESOS-8334
> URL: https://issues.apache.org/jira/browse/MESOS-8334
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>  Labels: flaky-test
> Attachments: PartitionedSlaveReregistrationMasterFailover-badrun.txt
>
>
> This test became extremely flaky on various Linux platforms, presumably after 
> the chain with https://reviews.apache.org/r/64098/ has been committed.
> {noformat}
> ../../src/tests/partition_tests.cpp:1032
> Failed to wait 15secs for runningAgainStatus1
> {noformat}
> Full log attached.





[jira] [Commented] (MESOS-8334) PartitionedSlaveReregistrationMasterFailover is flaky.

2018-01-03 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310404#comment-16310404
 ] 

Yan Xu commented on MESOS-8334:
---

The agent reregistered before one of the schedulers did, hence the status update was dropped.

{noformat:title=}
I1212 20:14:38.542122 13272 master.cpp:6696] Re-admitted agent 
87ccc512-13d3-43ff-ae38-f2640ccb7cc3-S0 at slave(121)@172.16.10.222:41309 
(ip-172-16-10-222.ec2.internal)
W1212 20:14:38.542284 13272 master.cpp:6833] Dropping update TASK_RUNNING for 
task d781c9d7-f9c5-4cd1-9b80-4b6aa6b4dd0e of framework 
87ccc512-13d3-43ff-ae38-f2640ccb7cc3-0001 'Unknown agent re-registered' for 
unknown framework 87ccc512-13d3-43ff-ae38-f2640ccb7cc3-0001
I1212 20:14:38.542412 13272 master.cpp:7894] Sending status update TASK_RUNNING 
for task 01284ba6-a6e8-4204-9628-bd419bc67fea of framework 
87ccc512-13d3-43ff-ae38-f2640ccb7cc3- 'Unknown agent re-registered'
{noformat}

> PartitionedSlaveReregistrationMasterFailover is flaky.
> --
>
> Key: MESOS-8334
> URL: https://issues.apache.org/jira/browse/MESOS-8334
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Assignee: Megha Sharma
>  Labels: flaky-test
> Attachments: PartitionedSlaveReregistrationMasterFailover-badrun.txt
>
>
> This test became extremely flaky on various Linux platforms, presumably after 
> the chain with https://reviews.apache.org/r/64098/ has been committed.
> {noformat}
> ../../src/tests/partition_tests.cpp:1032
> Failed to wait 15secs for runningAgainStatus1
> {noformat}
> Full log attached.





[jira] [Commented] (MESOS-6406) Send latest status for partition-aware tasks when agent reregisters

2017-12-12 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288018#comment-16288018
 ] 

Yan Xu commented on MESOS-6406:
---

{noformat:title=}
commit 5e5a8102c3281db25a37157dac123b0ca546e030 (HEAD -> master, apache/master)
Author: Megha Sharma 
Date:   Tue Dec 12 08:21:19 2017 -0800

Send status updates when an unreachable agent re-registers.

Master will send task status updates to frameworks upon agent
re-registration if the agent:
- has previously been removed by the master for being unreachable or
- is unknown to the master due to the garbage collection of the
  unreachable and gone agents in the registry and the master's state.

Review: https://reviews.apache.org/r/64098/

commit 34503f8b429e3459a7a132ca8cf02acdec3c7881
Author: Megha Sharma 
Date:   Tue Dec 12 08:21:14 2017 -0800

Added a new reason to task status.

Added new reason `REASON_AGENT_REREGISTERED`
(`REASON_SLAVE_REREGISTERED` in v0) to task status.

The new reason will be used when master starts to send status update
during the re-registration of an unreachable or unknown agent.

Review: https://reviews.apache.org/r/64250/
{noformat}

> Send latest status for partition-aware tasks when agent reregisters
> ---
>
> Key: MESOS-6406
> URL: https://issues.apache.org/jira/browse/MESOS-6406
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Megha Sharma
>  Labels: mesosphere
>
> When an agent reregisters, we should notify frameworks about the current 
> status of any partition-aware tasks that were/are running on the agent -- 
> i.e., report the current state of the task at the agent to the framework.





[jira] [Commented] (MESOS-8306) Restrict which agents can statically reserve resources for which roles

2017-12-11 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286429#comment-16286429
 ] 

Yan Xu commented on MESOS-8306:
---

After investigating it, I found that it makes more sense to reuse the 
{{ReserveResources}} ACL for static reservations in the process of authorizing 
the agent. This ACL is clearer in its intent to authorize reservations, and its 
implementation and semantics don't rule out static reservations. We can think 
of the agent as the subject that requests that the master reserve resources, 
i.e., setting {{--resources}} flags on the agent doesn't make them final w.r.t. 
static reservations until the master approves them.

Do you see any problems with this approach, [~arojas] [~mcypark] 
[~jpe...@apache.org]?

> Restrict which agents can statically reserve resources for which roles
> --
>
> Key: MESOS-8306
> URL: https://issues.apache.org/jira/browse/MESOS-8306
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> In some use cases part of a Mesos cluster could be reserved for certain 
> frameworks/roles. A common approach is to use static reservation so the 
> resources of an agent are only offered to frameworks of the designated roles. 
> However without proper authorization any (compromised) agent can register 
> with these special roles and accept workload from these frameworks.
> We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} 
> is allowed to register with static reservation roles {{bar, baz}}; no other 
> principals are allowed to register with static reservation roles {{bar, baz}}.





[jira] [Comment Edited] (MESOS-8306) Restrict which agents can statically reserve resources for which roles

2017-12-11 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286857#comment-16286857
 ] 

Yan Xu edited comment on MESOS-8306 at 12/12/17 12:35 AM:
--

https://reviews.apache.org/r/64514
https://reviews.apache.org/r/64515
https://reviews.apache.org/r/64516


was (Author: xujyan):
{noformat:title=}
https://reviews.apache.org/r/64514
https://reviews.apache.org/r/64515
https://reviews.apache.org/r/64516
{noformat}

> Restrict which agents can statically reserve resources for which roles
> --
>
> Key: MESOS-8306
> URL: https://issues.apache.org/jira/browse/MESOS-8306
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> In some use cases part of a Mesos cluster could be reserved for certain 
> frameworks/roles. A common approach is to use static reservation so the 
> resources of an agent are only offered to frameworks of the designated roles. 
> However without proper authorization any (compromised) agent can register 
> with these special roles and accept workload from these frameworks.
> We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} 
> is allowed to register with static reservation roles {{bar, baz}}; no other 
> principals are allowed to register with static reservation roles {{bar, baz}}.





[jira] [Commented] (MESOS-8306) Restrict which agents can statically reserve resources for which roles

2017-12-11 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286750#comment-16286750
 ] 

Yan Xu commented on MESOS-8306:
---

So in order to authorize the static reservations, the master would be 
configured to use the {{reserve_resources}} ACL against the agent's principal 
like this:

{code:title=}
"register_agents": [
  {
    "principals": { "values": ["low-security-agent", "high-security-agent"] },
    "agents": { "type": "ANY" }
  },
  {
    "principals": { "type": "ANY" },
    "agents": { "type": "NONE" }
  }
],
"reserve_resources": [
  {
    "principals": { "values": ["high-security-agent"] },
    "roles": { "values": ["high-security-role"] }
  },
  {
    "principals": { "type": "NONE" },
    "roles": { "values": ["high-security-role"] }
  }
]
{code}

As part of agent registration, both ACLs are checked.

If the {{low-security-agent}} principal is compromised, it still cannot reserve 
resources for the {{high-security-role}} role.

> Restrict which agents can statically reserve resources for which roles
> --
>
> Key: MESOS-8306
> URL: https://issues.apache.org/jira/browse/MESOS-8306
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> In some use cases part of a Mesos cluster could be reserved for certain 
> frameworks/roles. A common approach is to use static reservation so the 
> resources of an agent are only offered to frameworks of the designated roles. 
> However without proper authorization any (compromised) agent can register 
> with these special roles and accept workload from these frameworks.
> We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} 
> is allowed to register with static reservation roles {{bar, baz}}; no other 
> principals are allowed to register with static reservation roles {{bar, baz}}.





[jira] [Assigned] (MESOS-621) `HierarchicalAllocatorProcess::removeSlave` doesn't properly handle framework allocations/resources

2017-12-06 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-621:


Assignee: (was: Yan Xu)

> `HierarchicalAllocatorProcess::removeSlave` doesn't properly handle framework 
> allocations/resources
> ---
>
> Key: MESOS-621
> URL: https://issues.apache.org/jira/browse/MESOS-621
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>  Labels: tech-debt
>
> Currently a slaveRemoved() simply removes the slave from 'slaves' map and 
> slave's resources from 'roleSorter'. Looking at resourcesRecovered(), more 
> things need to be done when a slave is removed (e.g., framework 
> unallocations).
> It would be nice to fix this and have a test for this.





[jira] [Assigned] (MESOS-8306) Restrict which agents can statically reserve resources for which roles

2017-12-06 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8306:
-

Assignee: Yan Xu

> Restrict which agents can statically reserve resources for which roles
> --
>
> Key: MESOS-8306
> URL: https://issues.apache.org/jira/browse/MESOS-8306
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> In some use cases part of a Mesos cluster could be reserved for certain 
> frameworks/roles. A common approach is to use static reservation so the 
> resources of an agent are only offered to frameworks of the designated roles. 
> However without proper authorization any (compromised) agent can register 
> with these special roles and accept workload from these frameworks.
> We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} 
> is allowed to register with static reservation roles {{bar, baz}}; no other 
> principals are allowed to register with static reservation roles {{bar, baz}}.





[jira] [Created] (MESOS-8306) Restrict which agents can statically reserve resources for which roles

2017-12-06 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8306:
-

 Summary: Restrict which agents can statically reserve resources 
for which roles
 Key: MESOS-8306
 URL: https://issues.apache.org/jira/browse/MESOS-8306
 Project: Mesos
  Issue Type: Improvement
Reporter: Yan Xu


In some use cases part of a Mesos cluster could be reserved for certain 
frameworks/roles. A common approach is to use static reservation so the 
resources of an agent are only offered to frameworks of the designated roles. 
However without proper authorization any (compromised) agent can register with 
these special roles and accept workload from these frameworks.

We can enhance the {{RegisterAgent}} ACL to express: agent principal {{foo}} is 
allowed to register with static reservation roles {{bar, baz}}; no other 
principals are allowed to register with static reservation roles {{bar, baz}}.





[jira] [Commented] (MESOS-8223) Master crashes when suppressed on subscribe is enabled.

2017-12-01 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275169#comment-16275169
 ] 

Yan Xu commented on MESOS-8223:
---

{noformat:title=}
commit 8c2f972b5c0c42e1519d09275cc26e1765a0c5de
Author: Jiang Yan Xu 
Date:   Tue Nov 14 00:12:17 2017 -0800

Fixed a bug that removed the suppressed framework from sorter.

Review: https://reviews.apache.org/r/63831
{noformat}

> Master crashes when suppressed on subscribe is enabled.
> ---
>
> Key: MESOS-8223
> URL: https://issues.apache.org/jira/browse/MESOS-8223
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.0
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Critical
> Fix For: 1.5.0
>
>
> Introduced in MESOS-7015, this feature is not actually turned on due to 
> MESOS-8200. However once this is addressed and the feature enabled, the 
> master crashes with:
> {noformat:title=}
> I1113 17:17:37.240901 11285 master.cpp:3309] Disconnecting framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework)
> I1113 17:17:37.240911 11285 master.cpp:1435] Giving framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework) 3days to failover
> I1113 17:17:37.241953 11285 master.cpp:2612] Received subscription request 
> for HTTP framework 'test-framework'
> I1113 17:17:37.242807 11285 master.cpp:2748] Subscribing framework 
> 'test-framework' with checkpointing enabled, roles { * } suppressed and 
> capabilities [ SHARED_RESOURCES, TASK_KILLING_STATE ]
> I1113 17:17:37.242820 11285 master.cpp:6994] Updating info for framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110
> I1113 17:17:37.252637 11270 hierarchical.cpp:380] Activated framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110
> I1113 17:17:37.272457 11289 master.cpp:7723] Performing implicit task state 
> reconciliation for framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 
> (test-framework)
> I1113 17:17:37.272507 11289 master.cpp:7723] Performing implicit task state 
> reconciliation for framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 
> (test-framework)
> I1113 17:17:41.966331 11271 master.cpp:5564] Processing REVIVE call for 
> framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework)
> F1113 17:17:41.966380 11280 sorter.cpp:270] Check failed: 'find(clientPath)' 
> Must be non NULL
> *** Check failure stack trace: ***
> @ 0x7f3467efd0dd  (unknown)
> {noformat}
> This happens when an unsuppressed framework reregisters with suppressed roles 
> and then revives.





[jira] [Commented] (MESOS-8200) Suppressed roles are not honoured for v1 scheduler subscribe requests.

2017-12-01 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275165#comment-16275165
 ] 

Yan Xu commented on MESOS-8200:
---

{noformat:title=}
commit 3711233fcec761be8625af6a028a228fe9d8dc5a
Author: Jiang Yan Xu 
Date:   Fri Nov 10 12:15:37 2017 -0800

Fixed 'NoOffersWithAllRolesSuppressed' test.

Review: https://reviews.apache.org/r/63830

commit 5d9209e69a0a9600ec8c02fbf852ab912b208a88
Author: Jiang Yan Xu 
Date:   Fri Nov 10 12:16:45 2017 -0800

Fixed a bug in devolving framework subscription with suppressed roles.

Review: https://reviews.apache.org/r/63741
{noformat}

> Suppressed roles are not honoured for v1 scheduler subscribe requests.
> --
>
> Key: MESOS-8200
> URL: https://issues.apache.org/jira/browse/MESOS-8200
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler api, scheduler driver
>Affects Versions: 1.4.0
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Critical
> Fix For: 1.5.0
>
>
> When triaging MESOS-7996 I've found out that 
> {{Call.subscribe.suppressed_roles}} field is empty when the master processes 
> the request from a v1 HTTP scheduler. More precisely, [this 
> conversion|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/master/http.cpp#L969]
>  wipes the field. This is likely because this conversion relies on a general 
> [protobuf conversion 
> utility|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/internal/devolve.cpp#L28-L50],
>  which fails to copy {{suppressed_roles}} because they have different tags, 
> compare 
> [v0|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/scheduler/scheduler.proto#L271]
>  and 
> [v1|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/v1/scheduler/scheduler.proto#L258].





[jira] [Commented] (MESOS-6406) Send latest status for partition-aware tasks when agent reregisters

2017-11-29 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271862#comment-16271862
 ] 

Yan Xu commented on MESOS-6406:
---

[~ipronin] No, not if the agent's entry was GCed. The master does know all the 
"registered" agents though. I guess to support this the master can choose to 
send status updates for agents that are either 1) unreachable or 2) totally 
unknown. Would this work?

I am mainly not sure it's a good idea to send status updates for all 
non-completed (pending, running, or terminated-but-unacked) tasks during master 
failover, which is a time when the master is under very heavy load.


> Send latest status for partition-aware tasks when agent reregisters
> ---
>
> Key: MESOS-6406
> URL: https://issues.apache.org/jira/browse/MESOS-6406
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Megha Sharma
>  Labels: mesosphere
>
> When an agent reregisters, we should notify frameworks about the current 
> status of any partition-aware tasks that were/are running on the agent -- 
> i.e., report the current state of the task at the agent to the framework.





[jira] [Commented] (MESOS-6406) Send latest status for partition-aware tasks when agent reregisters

2017-11-29 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271772#comment-16271772
 ] 

Yan Xu commented on MESOS-6406:
---

So I think we can probably improve on the approach stated in the JIRA: when the 
master fails over, perhaps we don't need to send status updates for these tasks 
for agents that haven't been unreachable?

For unreachable agents we have informed the frameworks about these tasks via 
{{TASK_UNREACHABLE}}, so upon reregistration we need to inform frameworks that 
these tasks are back.

For other agents, if the state of a task has changed during master failover, 
the agent is going to send new status updates with retries, so we don't need to 
worry about the schedulers not getting updates; if the state hasn't changed, 
the scheduler is already aware of the latest state of the task, so the master 
doesn't need to send them either.

/cc [~megha.sharma] [~ipronin] [~vinodkone]
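The policy above can be sketched as follows (hypothetical names, not the actual master code): resend the latest task statuses on agent reregistration only when the frameworks' view may be stale.

```python
# Sketch of the proposed policy: after a master failover, resend the
# latest task statuses on agent reregistration only when the frameworks'
# view may be stale. Hypothetical names, not the actual master code.

def should_resend_statuses(agent_id, unreachable, registered):
    if agent_id in unreachable:
        # Frameworks were told TASK_UNREACHABLE; tell them the tasks are back.
        return True
    if agent_id not in registered:
        # Agent is totally unknown to the master (e.g., its entry was GCed).
        return True
    # Agent stayed registered: changed statuses are retried by the agent
    # itself, and unchanged ones are already known to the scheduler.
    return False

assert should_resend_statuses("S1", unreachable={"S1"}, registered=set())
assert should_resend_statuses("S2", unreachable=set(), registered=set())
assert not should_resend_statuses("S3", unreachable=set(), registered={"S3"})
```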

> Send latest status for partition-aware tasks when agent reregisters
> ---
>
> Key: MESOS-6406
> URL: https://issues.apache.org/jira/browse/MESOS-6406
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Megha Sharma
>  Labels: mesosphere
>
> When an agent reregisters, we should notify frameworks about the current 
> status of any partition-aware tasks that were/are running on the agent -- 
> i.e., report the current state of the task at the agent to the framework.





[jira] [Created] (MESOS-8276) Benchmark agent reregistration after master failover with connected frameworks.

2017-11-29 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8276:
-

 Summary: Benchmark agent reregistration after master failover with 
connected frameworks.
 Key: MESOS-8276
 URL: https://issues.apache.org/jira/browse/MESOS-8276
 Project: Mesos
  Issue Type: Task
  Components: master, test
Reporter: Yan Xu


As an extension to MESOS-8098, if we add connected frameworks, we can test more 
scenarios such as the performance of sending status updates to a large number 
of frameworks upon master failover (e.g., for MESOS-6406).





[jira] [Commented] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-27 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267891#comment-16267891
 ] 

Yan Xu commented on MESOS-8185:
---

[~ipronin] sure and Megha just submitted a RR for MESOS-6406.

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>  Labels: reliability
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # Connection from the master to the agent stops working
> # Agent doesn't see pings from the master and attempts to re-register
> # Master sends {{SlaveReregisteredMessage}} and {{ShutdownFrameworkMessage}}, 
> which don't reach the agent because of the connection failure. Agent is 
> marked registered.
> # Network issue resolves and the connection recovers. Agent retries 
> re-registration.
> # Master thinks that the agent has been registered since step (3) and just 
> re-sends {{SlaveReregisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.
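The comparison proposed in the description could be sketched as (hypothetical helper, not Mesos code):

```python
# Sketch of the proposed check: when an already-registered agent
# reregisters, any task it reports that the master doesn't know about
# should not exist on the agent and can be killed. Hypothetical helper.

def tasks_to_kill(reported_tasks, master_known_tasks):
    """Task IDs present in the agent's reregistration but unknown to the master."""
    return sorted(set(reported_tasks) - set(master_known_tasks))

reported = ["task-1", "task-2", "task-3"]  # from the agent's reregistration
known = ["task-1", "task-3"]               # master's in-memory view
assert tasks_to_kill(reported, known) == ["task-2"]
```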





[jira] [Updated] (MESOS-6406) Send latest status for partition-aware tasks when agent reregisters

2017-11-27 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-6406:
--
Shepherd: Yan Xu  (was: Vinod Kone)

> Send latest status for partition-aware tasks when agent reregisters
> ---
>
> Key: MESOS-6406
> URL: https://issues.apache.org/jira/browse/MESOS-6406
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Megha Sharma
>  Labels: mesosphere
>
> When an agent reregisters, we should notify frameworks about the current 
> status of any partition-aware tasks that were/are running on the agent -- 
> i.e., report the current state of the task at the agent to the framework.





[jira] [Commented] (MESOS-7711) Master updates registry for reregistering agents even when they haven't been unreachable

2017-11-22 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263224#comment-16263224
 ] 

Yan Xu commented on MESOS-7711:
---

Clarification on the fix: by not calling the registrar in the mentioned 
scenario, we eliminate the delay of the registrar dispatching back into the 
master actor (which can be backed up significantly during a master failover) 
after the operation is done. This reduces the overall time a reregistration 
request from the agent spends in the master; we have seen a ~50% reduction in 
the total time for all agents to reregister after a master failover.
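The shape of the fix — skipping the registrar round trip for reregistering agents that were never unreachable — might look like this (hypothetical names and structure, not the actual Mesos code):

```python
# Sketch of the fix: only agents that were actually marked unreachable
# need a registry write (and the registrar's dispatch back into the
# master actor) on reregistration. Hypothetical names, not Mesos code.

def reregister(agent_id, recovered, unreachable, apply_registry_op):
    if agent_id in unreachable:
        # Must durably move the agent from 'unreachable' back to 'admitted'.
        apply_registry_op(("mark_reachable", agent_id))
        unreachable.discard(agent_id)
        return
    # Agent is in slaves.recovered (already admitted in the recovered
    # registry) or already registered: no registry update needed.
    recovered.discard(agent_id)

ops = []
unreachable = {"S9"}
reregister("S1", recovered={"S1"}, unreachable=unreachable,
           apply_registry_op=ops.append)
assert ops == []  # no registrar round trip for a recovered agent
reregister("S9", recovered=set(), unreachable=unreachable,
           apply_registry_op=ops.append)
assert ops == [("mark_reachable", "S9")]
```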

> Master updates registry for reregistering agents even when they haven't been 
> unreachable
> 
>
> Key: MESOS-7711
> URL: https://issues.apache.org/jira/browse/MESOS-7711
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Yan Xu
>Assignee: Yan Xu
> Fix For: 1.4.0
>
>
> During a master failover we observed many registry updates, on average _one 
> per two agents_, as indicated by the log line 
> {noformat:title=}
> I0609 04:46:25.220196 48864 registrar.cpp:550] Successfully updated the 
> registry in 42.904064ms
> {noformat}
> [code|https://github.com/apache/mesos/blob/19a6134d03141dc2cb073a904378c2c129b5138d/src/master/registrar.cpp#L550]
> In this case few agents were ever unreachable, so most of these updates are 
> redundant. 
> Associated with each registry update is also the time spent on applying the 
> operations
> {noformat:title=}
> I0609 04:46:26.475761 48897 registrar.cpp:493] Applied 1 operations in 
> 11.673082ms; attempting to update the registry
> {noformat}
> [code|https://github.com/apache/mesos/blob/19a6134d03141dc2cb073a904378c2c129b5138d/src/master/registrar.cpp#L493]
> Even though these operations don't consume the Master actor's time, all agent 
> reregistrations are guarded and delayed by them, and this could be easily 
> avoided by checking the {{slaves.recovered}} field in {{Master}}.





[jira] [Commented] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-17 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257398#comment-16257398
 ] 

Yan Xu commented on MESOS-8185:
---

I think so. [~ipronin] with MESOS-7215 no tasks will be killed by the master, 
even if the framework is not partition-aware. Will this work?

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>  Labels: reliability
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # Connection from the master to the agent stops working
> # Agent doesn't see pings from the master and attempts to re-register
> # Master sends {{SlaveReregisteredMessage}} and {{ShutdownFrameworkMessage}}, 
> which don't reach the agent because of the connection failure. Agent is 
> marked registered.
> # Network issue resolves and the connection recovers. Agent retries 
> re-registration.
> # Master thinks that the agent has been registered since step (3) and just 
> re-sends {{SlaveReregisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.





[jira] [Updated] (MESOS-8200) Suppressed roles are not honoured for v1 scheduler subscribe requests.

2017-11-15 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8200:
--
Affects Version/s: 1.4.0

> Suppressed roles are not honoured for v1 scheduler subscribe requests.
> --
>
> Key: MESOS-8200
> URL: https://issues.apache.org/jira/browse/MESOS-8200
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler api, scheduler driver
>Affects Versions: 1.4.0
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Critical
>
> When triaging MESOS-7996 I've found out that 
> {{Call.subscribe.suppressed_roles}} field is empty when the master processes 
> the request from a v1 HTTP scheduler. More precisely, [this 
> conversion|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/master/http.cpp#L969]
>  wipes the field. This is likely because this conversion relies on a general 
> [protobuf conversion 
> utility|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/internal/devolve.cpp#L28-L50],
>  which fails to copy {{suppressed_roles}} because they have different tags, 
> compare 
> [v0|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/scheduler/scheduler.proto#L271]
>  and 
> [v1|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/v1/scheduler/scheduler.proto#L258].





[jira] [Commented] (MESOS-8223) Master crashes when suppressed on subscribe is enabled.

2017-11-14 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252014#comment-16252014
 ] 

Yan Xu commented on MESOS-8223:
---

The problem is that this 
[code|https://github.com/apache/mesos/blob/bb2deb3baafffb9a35d1dfbc35b0d43677b0b842/src/master/allocator/mesos/hierarchical.cpp#L447-L460]
 treats frameworks moving off a role and frameworks suppressing a role the same 
way. The former should untrack the framework under that role and the latter 
shouldn't.
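The distinction could be sketched as follows (hypothetical names; the real allocator tracks sorter clients per role, not plain sets):

```python
# Sketch of the distinction: roles a framework drops should be untracked,
# while suppressed roles must stay tracked (so a later REVIVE can find
# the role's sorter client) -- they just receive no offers meanwhile.
# Hypothetical names; not the actual hierarchical allocator code.

def update_framework(new_roles, suppressed_roles):
    tracked = set(new_roles)                     # every subscribed role stays tracked
    offerable = tracked - set(suppressed_roles)  # suppressed roles get no offers
    return tracked, offerable

# Framework resubscribes with "*" suppressed: still tracked, not offerable.
tracked, offerable = update_framework(new_roles=["*"], suppressed_roles=["*"])
assert tracked == {"*"}
assert offerable == set()

# Framework drops "web" entirely: it disappears from tracking.
tracked, offerable = update_framework(new_roles=["*"], suppressed_roles=[])
assert "web" not in tracked and offerable == {"*"}
```

Untracking a suppressed role is exactly what lets the later {{REVIVE}} hit the {{Check failed: 'find(clientPath)'}} in the sorter.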

> Master crashes when suppressed on subscribe is enabled.
> ---
>
> Key: MESOS-8223
> URL: https://issues.apache.org/jira/browse/MESOS-8223
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.0
>Reporter: Yan Xu
>Priority: Critical
>
> Introduced in MESOS-7015, this feature is not actually turned on due to 
> MESOS-8200. However once this is addressed and the feature enabled, the 
> master crashes with:
> {noformat:title=}
> I1113 17:17:37.240901 11285 master.cpp:3309] Disconnecting framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework)
> I1113 17:17:37.240911 11285 master.cpp:1435] Giving framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework) 3days to failover
> I1113 17:17:37.241953 11285 master.cpp:2612] Received subscription request 
> for HTTP framework 'test-framework'
> I1113 17:17:37.242807 11285 master.cpp:2748] Subscribing framework 
> 'test-framework' with checkpointing enabled, roles { * } suppressed and 
> capabilities [ SHARED_RESOURCES, TASK_KILLING_STATE ]
> I1113 17:17:37.242820 11285 master.cpp:6994] Updating info for framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110
> I1113 17:17:37.252637 11270 hierarchical.cpp:380] Activated framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110
> I1113 17:17:37.272457 11289 master.cpp:7723] Performing implicit task state 
> reconciliation for framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 
> (test-framework)
> I1113 17:17:37.272507 11289 master.cpp:7723] Performing implicit task state 
> reconciliation for framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 
> (test-framework)
> I1113 17:17:41.966331 11271 master.cpp:5564] Processing REVIVE call for 
> framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework)
> F1113 17:17:41.966380 11280 sorter.cpp:270] Check failed: 'find(clientPath)' 
> Must be non NULL
> *** Check failure stack trace: ***
> @ 0x7f3467efd0dd  (unknown)
> {noformat}
> This happens when an unsuppressed framework reregisters with suppressed roles 
> and then revives.





[jira] [Assigned] (MESOS-8223) Master crashes when suppressed on subscribe is enabled.

2017-11-14 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8223:
-

Assignee: Yan Xu

> Master crashes when suppressed on subscribe is enabled.
> ---
>
> Key: MESOS-8223
> URL: https://issues.apache.org/jira/browse/MESOS-8223
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.0
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Critical
>
> Introduced in MESOS-7015, this feature is not actually turned on due to 
> MESOS-8200. However once this is addressed and the feature enabled, the 
> master crashes with:
> {noformat:title=}
> I1113 17:17:37.240901 11285 master.cpp:3309] Disconnecting framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework)
> I1113 17:17:37.240911 11285 master.cpp:1435] Giving framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework) 3days to failover
> I1113 17:17:37.241953 11285 master.cpp:2612] Received subscription request 
> for HTTP framework 'test-framework'
> I1113 17:17:37.242807 11285 master.cpp:2748] Subscribing framework 
> 'test-framework' with checkpointing enabled, roles { * } suppressed and 
> capabilities [ SHARED_RESOURCES, TASK_KILLING_STATE ]
> I1113 17:17:37.242820 11285 master.cpp:6994] Updating info for framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110
> I1113 17:17:37.252637 11270 hierarchical.cpp:380] Activated framework 
> 40f7bdc0-e54b-46da-ace1-48162171baf4-0110
> I1113 17:17:37.272457 11289 master.cpp:7723] Performing implicit task state 
> reconciliation for framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 
> (test-framework)
> I1113 17:17:37.272507 11289 master.cpp:7723] Performing implicit task state 
> reconciliation for framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 
> (test-framework)
> I1113 17:17:41.966331 11271 master.cpp:5564] Processing REVIVE call for 
> framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework)
> F1113 17:17:41.966380 11280 sorter.cpp:270] Check failed: 'find(clientPath)' 
> Must be non NULL
> *** Check failure stack trace: ***
> @ 0x7f3467efd0dd  (unknown)
> {noformat}
> This happens when an unsuppressed framework reregisters with suppressed roles 
> and then revives.





[jira] [Created] (MESOS-8223) Master crashes when suppressed on subscribe is enabled.

2017-11-14 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8223:
-

 Summary: Master crashes when suppressed on subscribe is enabled.
 Key: MESOS-8223
 URL: https://issues.apache.org/jira/browse/MESOS-8223
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.4.0
Reporter: Yan Xu
Priority: Critical


Introduced in MESOS-7015, this feature is not actually turned on due to 
MESOS-8200. However once this is addressed and the feature enabled, the master 
crashes with:

{noformat:title=}
I1113 17:17:37.240901 11285 master.cpp:3309] Disconnecting framework 
40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework)
I1113 17:17:37.240911 11285 master.cpp:1435] Giving framework 
40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework) 3days to failover
I1113 17:17:37.241953 11285 master.cpp:2612] Received subscription request for 
HTTP framework 'test-framework'
I1113 17:17:37.242807 11285 master.cpp:2748] Subscribing framework 
'test-framework' with checkpointing enabled, roles { * } suppressed and 
capabilities [ SHARED_RESOURCES, TASK_KILLING_STATE ]
I1113 17:17:37.242820 11285 master.cpp:6994] Updating info for framework 
40f7bdc0-e54b-46da-ace1-48162171baf4-0110
I1113 17:17:37.252637 11270 hierarchical.cpp:380] Activated framework 
40f7bdc0-e54b-46da-ace1-48162171baf4-0110
I1113 17:17:37.272457 11289 master.cpp:7723] Performing implicit task state 
reconciliation for framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 
(test-framework)
I1113 17:17:37.272507 11289 master.cpp:7723] Performing implicit task state 
reconciliation for framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 
(test-framework)
I1113 17:17:41.966331 11271 master.cpp:5564] Processing REVIVE call for 
framework 40f7bdc0-e54b-46da-ace1-48162171baf4-0110 (test-framework)
F1113 17:17:41.966380 11280 sorter.cpp:270] Check failed: 'find(clientPath)' 
Must be non NULL
*** Check failure stack trace: ***
@ 0x7f3467efd0dd  (unknown)
{noformat}

This happens when an unsuppressed framework reregisters with suppressed roles 
and then revives.





[jira] [Commented] (MESOS-8200) Suppressed roles are not honoured for v1 scheduler subscribe requests.

2017-11-13 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250233#comment-16250233
 ] 

Yan Xu commented on MESOS-8200:
---

[~vinodkone] yes. I have the patch for the devolve code ready, but it seems to 
be exposing some other bugs. I'm addressing them right now and will have it 
ready soon.

> Suppressed roles are not honoured for v1 scheduler subscribe requests.
> --
>
> Key: MESOS-8200
> URL: https://issues.apache.org/jira/browse/MESOS-8200
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler api, scheduler driver
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>Priority: Critical
>
> When triaging MESOS-7996 I've found out that 
> {{Call.subscribe.suppressed_roles}} field is empty when the master processes 
> the request from a v1 HTTP scheduler. More precisely, [this 
> conversion|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/master/http.cpp#L969]
>  wipes the field. This is likely because this conversion relies on a general 
> [protobuf conversion 
> utility|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/internal/devolve.cpp#L28-L50],
>  which fails to copy {{suppressed_roles}} because they have different tags, 
> compare 
> [v0|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/scheduler/scheduler.proto#L271]
>  and 
> [v1|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/v1/scheduler/scheduler.proto#L258].





[jira] [Commented] (MESOS-8200) Suppressed roles are not honoured for v1 scheduler subscribe requests.

2017-11-10 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247947#comment-16247947
 ] 

Yan Xu commented on MESOS-8200:
---

The easiest fix is probably to just change the tag to 3: {{repeated string 
suppressed_roles = 3;}} However, it feels bad if the new API has to carefully 
coordinate with the old one in terms of tag numbers.

Will change devolve to handle it properly.
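A devolve that copies mismatched fields by name, regardless of tag numbers, avoids the wipe. A toy sketch with illustrative schemas (not the real .proto definitions or the actual devolve() implementation):

```python
# A devolve that copies fields by *name* is immune to tag mismatches.
# Toy model with illustrative schemas -- not the real .proto definitions
# or the actual devolve() implementation.

V1_SCHEMA = {"framework_info": 1, "suppressed_roles": 2}  # hypothetical tags
V0_SCHEMA = {"framework_info": 1, "suppressed_roles": 3}

def devolve_by_name(v1_msg, v0_schema):
    # Keep every field that also exists (by name) in the v0 definition,
    # regardless of the tag number each side assigned to it.
    return {name: value for name, value in v1_msg.items() if name in v0_schema}

v1 = {"framework_info": "fw", "suppressed_roles": ["roleA"]}
assert devolve_by_name(v1, V0_SCHEMA) == v1  # nothing is dropped
```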

> Suppressed roles are not honoured for v1 scheduler subscribe requests.
> --
>
> Key: MESOS-8200
> URL: https://issues.apache.org/jira/browse/MESOS-8200
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler api, scheduler driver
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>
> When triaging MESOS-7996 I've found out that 
> {{Call.subscribe.suppressed_roles}} field is empty when the master processes 
> the request from a v1 HTTP scheduler. More precisely, [this 
> conversion|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/master/http.cpp#L969]
>  wipes the field. This is likely because this conversion relies on a general 
> [protobuf conversion 
> utility|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/src/internal/devolve.cpp#L28-L50],
>  which fails to copy {{suppressed_roles}} because they have different tags, 
> compare 
> [v0|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/scheduler/scheduler.proto#L271]
>  and 
> [v1|https://github.com/apache/mesos/blob/1132e1ddafa6a1a9bc8aa966bd01d7b35c7682d9/include/mesos/v1/scheduler/scheduler.proto#L258].





[jira] [Assigned] (MESOS-8178) UnreachableAgentReregisterAfterFailover is flaky.

2017-11-07 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8178:
-

Assignee: Yan Xu

> UnreachableAgentReregisterAfterFailover is flaky.
> -
>
> Key: MESOS-8178
> URL: https://issues.apache.org/jira/browse/MESOS-8178
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>  Labels: flaky-test
> Attachments: UnreachableAgentReregisterAfterFailover-badrun.txt
>
>
> {noformat}
> ../../src/tests/slave_tests.cpp:3680: Failure
> Failed to wait 15secs for markUnreachable
> I1107 12:09:52.007308  6705 master.cpp:1148] Master terminating
> I1107 12:09:52.007480  6699 hierarchical.cpp:626] Removed agent 
> a835010f-3c94-4d07-b30a-ab3285263aed-S1
> ../../src/tests/slave_tests.cpp:3673: Failure
> Actual function call count doesn't match 
> EXPECT_CALL(*master.get()->registrar, apply(_))...
>  Expected: to be called once
>Actual: never called - unsatisfied and active
> {noformat}
> Full log attached.





[jira] [Commented] (MESOS-8098) Benchmark Master failover performance

2017-11-03 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238131#comment-16238131
 ] 

Yan Xu commented on MESOS-8098:
---

{noformat:title=}
commit ac0fa281472c2ba891f7bd0837fbd728ace73039
Author: Jiang Yan Xu 
Date:   Wed Oct 18 01:53:11 2017 -0700

Added a benchmark for agent reregistration during master failover.

Review: https://reviews.apache.org/r/63174
{noformat}

> Benchmark Master failover performance
> -
>
> Key: MESOS-8098
> URL: https://issues.apache.org/jira/browse/MESOS-8098
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Major
> Attachments: withoutperfpatches.perf.svg, withperfpatches.perf.svg
>
>
> Master failover performance often sheds light on the master's performance in 
> general as it's often the time the master experiences the highest load. Ways 
> we can benchmark the failover include the time it takes for all agents to 
> reregister, all frameworks to resubscribe or fully reconcile.





[jira] [Updated] (MESOS-8098) Benchmark Master failover performance

2017-11-03 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8098:
--
Attachment: withoutperfpatches.perf.svg
withperfpatches.perf.svg

Attaching two flame graphs comparing the benchmark running against the two 
versions below:

withperfpatches.perf.svg: 
https://github.com/apache/mesos/commit/41193181d6b75eeecae2729bf98007d9318e351a 
(close to the HEAD when the benchmark was created).

vs. 

withoutperfpatches.perf.svg: 
https://github.com/apache/mesos/commit/d9c90bf1d9c8b3a7dcc47be0cb773efff57cfb9d 
(before https://issues.apache.org/jira/browse/MESOS-7713 was merged)

The perf data was captured by invoking gdb-mesos-tests.sh, setting two 
breakpoints on the two {{cout}} lines (right before and after the bulk 
reregistration), running, and coordinating {{perf record}} with the 
breakpoints so it only captures the process behavior in between.

However I couldn't find much useful info in the resulting graphs. Perhaps 
someone can help me take a look? /cc [~bmahler] [~ipronin] [~dzhuk]?

> Benchmark Master failover performance
> -
>
> Key: MESOS-8098
> URL: https://issues.apache.org/jira/browse/MESOS-8098
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Yan Xu
>Assignee: Yan Xu
>Priority: Major
> Attachments: withoutperfpatches.perf.svg, withperfpatches.perf.svg
>
>
> Master failover performance often sheds light on the master's performance in 
> general as it's often the time the master experiences the highest load. Ways 
> we can benchmark the failover include the time it takes for all agents to 
> reregister, all frameworks to resubscribe or fully reconcile.





[jira] [Created] (MESOS-8160) Support idempotent framework registration

2017-11-01 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8160:
-

 Summary: Support idempotent framework registration
 Key: MESOS-8160
 URL: https://issues.apache.org/jira/browse/MESOS-8160
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu
Priority: Major


Right now when a framework registers/subscribes, the master assigns a framework 
ID to it and sends it back so the framework can use it to tear itself down on 
Mesos when it's done. However if the framework fails to receive it (e.g., it 
fails over before it receives the ID), it doesn't have a way to do the teardown 
or failover.

One apparent solution would be to allow the frameworks to supply the framework 
IDs themselves, but for backwards compatibility it may be easier to come up 
with a new framework-supplied "unique name/handle" concept that allows the 
framework to identify itself.
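A sketch of the unique-name idea (hypothetical API, not an actual Mesos proposal detail): subscription is keyed by a framework-supplied name, so resubscribing is idempotent.

```python
# Sketch of idempotent subscription keyed by a framework-supplied unique
# name: resubscribing with the same name yields the same framework ID, so
# a framework that failed over before persisting its ID can still recover
# it (and tear itself down). Hypothetical API, not the actual proposal.
import uuid

class Master:
    def __init__(self):
        self.ids_by_name = {}

    def subscribe(self, unique_name):
        # First subscription mints an ID; later ones return the same one.
        if unique_name not in self.ids_by_name:
            self.ids_by_name[unique_name] = str(uuid.uuid4())
        return self.ids_by_name[unique_name]

master = Master()
original = master.subscribe("analytics-scheduler")
# Framework fails over before it receives/persists the ID, then retries:
assert master.subscribe("analytics-scheduler") == original
assert master.subscribe("other-scheduler") != original
```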





[jira] [Commented] (MESOS-8138) Master can fail to detect HTTP framework disconnection if it disconnects very fast

2017-10-26 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221487#comment-16221487
 ] 

Yan Xu commented on MESOS-8138:
---

{quote}
the master realizes the disconnection when it tries to the pipe immediately
{quote}

How? You mean the 
[following|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8178-L8179]
 will execute {{Self::exited}} immediately?

{code:title=}
  http.closed()
    .onAny(defer(self(), &Self::exited, framework->id(), http));
{code}

It looks like it won't, because {{http.closed()}} internally tracks a 
{{process::http::Pipe::Writer writer}} 
[object|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.hpp#L303]
 which is instantiated 
[here|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/http.cpp#L843]
 and is not connected to the broken socket at all if the HttpProxy has been 
terminated. 

Right?
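The race can be modeled minimally (illustrative Python, not libprocess): the {{exited}} callback watches a freshly created pipe, not the pipe of the already-terminated proxy, so it never fires.

```python
# Minimal model of the race: libprocess tears down the HttpProxy for the
# broken socket before the master handles SUBSCRIBE, and no pipe reader
# was ever attached to that proxy. The master later wires its exited()
# callback to a *fresh* pipe, so the close is never observed.
# Illustrative only -- not the libprocess implementation.

class Pipe:
    def __init__(self):
        self.closed_flag = False
        self.callbacks = []

    def on_closed(self, cb):
        # Fires immediately if already closed, later otherwise.
        if self.closed_flag:
            cb()
        else:
            self.callbacks.append(cb)

    def close(self):
        self.closed_flag = True
        for cb in self.callbacks:
            cb()

# Socket breaks; the proxy's pipe is closed with nobody watching.
proxy_pipe = Pipe()
proxy_pipe.close()

# Master processes SUBSCRIBE afterwards and creates a fresh pipe for the
# response stream, attaching the exited() callback there.
fresh_pipe = Pipe()
fired = []
fresh_pipe.on_closed(lambda: fired.append("exited"))

assert fired == []  # exited() never runs; the framework appears connected
```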

> Master can fail to detect HTTP framework disconnection if it disconnects very 
> fast
> --
>
> Key: MESOS-8138
> URL: https://issues.apache.org/jira/browse/MESOS-8138
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, master
>Reporter: Yan Xu
>
> What we've observed is that if the framework disconnects before the master 
> actor processes the initial subscribe request, the master would [set up an 
> exited 
> callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179]
>  that never gets triggered.
> It looks like it's because when the socket closes and libprocess terminates 
> the HttpProxy for this socket, [the pipe reader for this proxy is not 
> set|https://github.com/apache/mesos/blob/f599839bb854c7aff3d610e49f7e5465d7fe9341/3rdparty/libprocess/src/process.cpp#L1515-L1518].
>  
> Later when the master [sets up the 
> callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179],
>  it would be a noop in this regard.





[jira] [Updated] (MESOS-8138) Master can fail to detect HTTP framework disconnection if it disconnects very fast

2017-10-26 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8138:
--
Description: 
What we've observed is that if the framework disconnects before the master 
actor processes the initial subscribe request, the master would [set up an 
exited 
callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179]
 that never gets triggered.

It looks like it's because when the socket closes and libprocess terminates the 
HttpProxy for this socket, [the pipe reader for this proxy is not 
set|https://github.com/apache/mesos/blob/f599839bb854c7aff3d610e49f7e5465d7fe9341/3rdparty/libprocess/src/process.cpp#L1515-L1518].
 

Later when the master [sets up the 
callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179],
 it would be a noop in this regard.

  was:
What we've observed is that if the framework disconnects before the master 
actor processes the request, the master would [set up an exited 
callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179]
 that never gets triggered.

It looks like it's because when the socket closes and libprocess terminates the 
HttpProxy for this socket, [the pipe reader for this proxy is not 
set|https://github.com/apache/mesos/blob/f599839bb854c7aff3d610e49f7e5465d7fe9341/3rdparty/libprocess/src/process.cpp#L1515-L1518].
 

Later when the master [sets up the 
callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179],
 it would be a noop in this regard.


> Master can fail to detect HTTP framework disconnection if it disconnects very 
> fast
> --
>
> Key: MESOS-8138
> URL: https://issues.apache.org/jira/browse/MESOS-8138
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, master
>Reporter: Yan Xu
>
> What we've observed is that if the framework disconnects before the master 
> actor processes the initial subscribe request, the master would [set up an 
> exited 
> callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179]
>  that never gets triggered.
> It looks like it's because when the socket closes and libprocess terminates 
> the HttpProxy for this socket, [the pipe reader for this proxy is not 
> set|https://github.com/apache/mesos/blob/f599839bb854c7aff3d610e49f7e5465d7fe9341/3rdparty/libprocess/src/process.cpp#L1515-L1518].
>  
> Later when the master [sets up the 
> callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179],
>  it would be a noop in this regard.





[jira] [Commented] (MESOS-8138) Master can fail to detect HTTP framework disconnection if it disconnects very fast

2017-10-26 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221369#comment-16221369
 ] 

Yan Xu commented on MESOS-8138:
---

/cc [~anandmazumdar] who implemented MESOS-2294.

> Master can fail to detect HTTP framework disconnection if it disconnects very 
> fast
> --
>
> Key: MESOS-8138
> URL: https://issues.apache.org/jira/browse/MESOS-8138
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, master
>Reporter: Yan Xu
>
> What we've observed is that if the framework disconnects before the master 
> actor processes the request, the master would [set up an exited 
> callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179]
>  that never gets triggered.
> It looks like it's because when the socket closes and libprocess terminates 
> the HttpProxy for this socket, [the pipe reader for this proxy is not 
> set|https://github.com/apache/mesos/blob/f599839bb854c7aff3d610e49f7e5465d7fe9341/3rdparty/libprocess/src/process.cpp#L1515-L1518].
>  
> Later when the master [sets up the 
> callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179],
>  it would be a noop in this regard.





[jira] [Created] (MESOS-8138) Master can fail to detect HTTP framework disconnection if it disconnects very fast

2017-10-26 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8138:
-

 Summary: Master can fail to detect HTTP framework disconnection if 
it disconnects very fast
 Key: MESOS-8138
 URL: https://issues.apache.org/jira/browse/MESOS-8138
 Project: Mesos
  Issue Type: Bug
  Components: HTTP API, master
Reporter: Yan Xu


What we've observed is that if the framework disconnects before the master 
actor processes the request, the master would [set up an exited 
callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179]
 that never gets triggered.

It looks like it's because when the socket closes and libprocess terminates the 
HttpProxy for this socket, [the pipe reader for this proxy is not 
set|https://github.com/apache/mesos/blob/f599839bb854c7aff3d610e49f7e5465d7fe9341/3rdparty/libprocess/src/process.cpp#L1515-L1518].
 

Later when the master [sets up the 
callback|https://github.com/apache/mesos/blob/f26ffcee0a359a644968feca1ec91243401f589a/src/master/master.cpp#L8179],
 it would be a noop in this regard.





[jira] [Commented] (MESOS-5368) Consider introducing persistent agent ID

2017-10-25 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219668#comment-16219668
 ] 

Yan Xu commented on MESOS-5368:
---

/cc [~anandmazumdar] does my comment above make sense?

> Consider introducing persistent agent ID
> 
>
> Key: MESOS-5368
> URL: https://issues.apache.org/jira/browse/MESOS-5368
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Neil Conway
>  Labels: mesosphere
>
> Currently, agent IDs identify a single "session" by an agent: that is, an 
> agent receives an agent ID when it registers with the master; it reuses that 
> agent ID if it disconnects and successfully reregisters; if the agent shuts 
> down and restarts, it registers anew and receives a new agent ID.
> It would be convenient to have a "persistent agent ID" that remains the same 
> for the duration of a given agent {{work_dir}}. This would mean that a given 
> persistent volume would not migrate between different persistent agent IDs 
> over time, for example (see MESOS-4894). If we supported permanently removing 
> an agent from the cluster (i.e., the {{work_dir}} and any volumes used by the 
> agent will never be reused), we could use the persistent agent ID to report 
> which agent has been removed.





[jira] [Assigned] (MESOS-8085) No point in deallocate() for a framework for maintenance if it is deactivated.

2017-10-16 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8085:
-

Assignee: Yan Xu

> No point in deallocate() for a framework for maintenance if it is deactivated.
> --
>
> Key: MESOS-8085
> URL: https://issues.apache.org/jira/browse/MESOS-8085
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Assignee: Yan Xu
>  Labels: maintenance
>
> The {{UnavailableResources}} sent from the allocator to the master are going 
> to be dropped by the master anyway, which results in the following line being 
> printed per inactive framework per allocation, spamming the master log. We 
> could tune down the log level, but it's better for the allocator to just not 
> send the {{UnavailableResources}}.
> {code:title=}
> LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
>   << " because the framework has terminated or is inactive";
> {code}





[jira] [Created] (MESOS-8098) Benchmark Master failover performance

2017-10-16 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8098:
-

 Summary: Benchmark Master failover performance
 Key: MESOS-8098
 URL: https://issues.apache.org/jira/browse/MESOS-8098
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Yan Xu


Master failover performance often sheds light on the master's performance in 
general, as failover is often when the master experiences its highest load. 
Ways we can benchmark the failover include measuring the time it takes for all 
agents to reregister, for all frameworks to resubscribe, or for state to fully 
reconcile.
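One way to time such failover phases is a generic stopwatch helper. This is an illustrative sketch only (the `elapsedMs` helper and the notion of a "phase" callable are assumptions, not an existing Mesos benchmark API):

```cpp
#include <chrono>

// Time an arbitrary phase of a failover benchmark, e.g. "from new leader
// elected until the last agent reregisters", in milliseconds.
template <typename F>
long long elapsedMs(F&& phase) {
  const auto start = std::chrono::steady_clock::now();
  phase();  // run the phase to completion (e.g. wait for all reregistrations)
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
      .count();
}
```

A real benchmark would pass a callable that blocks until the relevant event (all agents reregistered, all frameworks resubscribed) is observed.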





[jira] [Assigned] (MESOS-8098) Benchmark Master failover performance

2017-10-16 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8098:
-

Assignee: Yan Xu

> Benchmark Master failover performance
> -
>
> Key: MESOS-8098
> URL: https://issues.apache.org/jira/browse/MESOS-8098
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> Master failover performance often sheds light on the master's performance in 
> general, as failover is often when the master experiences its highest load. 
> Ways we can benchmark the failover include measuring the time it takes for 
> all agents to reregister, for all frameworks to resubscribe, or for state to 
> fully reconcile.





[jira] [Updated] (MESOS-8085) No point in deallocate() for a framework for maintenance if it is deactivated.

2017-10-16 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8085:
--
Description: 
The {{UnavailableResources}} sent from the allocator to the master are going to 
be dropped by the master anyway, which results in the following line being 
printed per inactive framework per allocation, spamming the master log. We 
could tune down the log level, but it's better for the allocator to just not 
send the {{UnavailableResources}}.

{code:title=}
LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
  << " because the framework has terminated or is inactive";
{code}

  was:
The {{UnavailableResources}} sent from the allocator to the master are going to 
be dropped by the master anyways, which results in the following line to be 
printed per inactive framework per allocation which spams the master log. We 
could tune down the log level but it's better to just not send the 
{{UnavailableResources}}.

{code:title=}
LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
  << " because the framework has terminated or is inactive";
{code}


> No point in deallocate() for a framework for maintenance if it is deactivated.
> --
>
> Key: MESOS-8085
> URL: https://issues.apache.org/jira/browse/MESOS-8085
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>  Labels: maintenance
>
> The {{UnavailableResources}} sent from the allocator to the master are going 
> to be dropped by the master anyway, which results in the following line being 
> printed per inactive framework per allocation, spamming the master log. We 
> could tune down the log level, but it's better for the allocator to just not 
> send the {{UnavailableResources}}.
> {code:title=}
> LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
>   << " because the framework has terminated or is inactive";
> {code}





[jira] [Updated] (MESOS-8085) No point in deallocate() for a framework for maintenance if it is deactivated.

2017-10-12 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8085:
--
Summary: No point in deallocate() for a framework for maintenance if it is 
deactivated.  (was: No point in deallocate() for a framework for maintenance it 
is deactivated.)

> No point in deallocate() for a framework for maintenance if it is deactivated.
> --
>
> Key: MESOS-8085
> URL: https://issues.apache.org/jira/browse/MESOS-8085
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>  Labels: maintenance
>
> The {{UnavailableResources}} sent from the allocator to the master are going 
> to be dropped by the master anyway, which results in the following line being 
> printed per inactive framework per allocation, spamming the master log. We 
> could tune down the log level, but it's better to just not send the 
> {{UnavailableResources}}.
> {code:title=}
> LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
>   << " because the framework has terminated or is inactive";
> {code}





[jira] [Created] (MESOS-8085) No point in deallocate() for a framework for maintenance it is deactivated.

2017-10-12 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8085:
-

 Summary: No point in deallocate() for a framework for maintenance 
it is deactivated.
 Key: MESOS-8085
 URL: https://issues.apache.org/jira/browse/MESOS-8085
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


The {{UnavailableResources}} sent from the allocator to the master are going to 
be dropped by the master anyway, which results in the following line being 
printed per inactive framework per allocation, spamming the master log. We 
could tune down the log level, but it's better to just not send the 
{{UnavailableResources}}.

{code:title=}
LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
  << " because the framework has terminated or is inactive";
{code}
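The proposed fix amounts to a guard in the allocator. A minimal sketch (illustrative names only; `Framework` and `inverseOfferTargets` are stand-ins, not the actual allocator code): skip deactivated frameworks when computing inverse-offer recipients, rather than sending and letting the master drop them with a log line per allocation.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for the allocator's per-framework state.
struct Framework {
  bool active;
};

// Returns the frameworks that should actually receive inverse offers
// (i.e. UnavailableResources) for maintenance.
std::vector<std::string> inverseOfferTargets(
    const std::map<std::string, Framework>& frameworks) {
  std::vector<std::string> targets;
  for (const auto& [id, framework] : frameworks) {
    if (!framework.active) {
      continue;  // would be dropped by the master anyway; don't send
    }
    targets.push_back(id);
  }
  return targets;
}
```

With this guard the noisy "Master ignoring inverse offers" line is never triggered for inactive frameworks in the first place.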





[jira] [Updated] (MESOS-8085) No point in deallocate() for a framework for maintenance it is deactivated.

2017-10-12 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8085:
--
Labels: maintenance  (was: )

> No point in deallocate() for a framework for maintenance it is deactivated.
> ---
>
> Key: MESOS-8085
> URL: https://issues.apache.org/jira/browse/MESOS-8085
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>  Labels: maintenance
>
> The {{UnavailableResources}} sent from the allocator to the master are going 
> to be dropped by the master anyway, which results in the following line being 
> printed per inactive framework per allocation, spamming the master log. We 
> could tune down the log level, but it's better to just not send the 
> {{UnavailableResources}}.
> {code:title=}
> LOG(INFO) << "Master ignoring inverse offers to framework " << frameworkId
>   << " because the framework has terminated or is inactive";
> {code}





[jira] [Commented] (MESOS-5368) Consider introducing persistent agent ID

2017-10-12 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202636#comment-16202636
 ] 

Yan Xu commented on MESOS-5368:
---

Also, how does this relate to MESOS-8008? From there it sounds like the agent 
(and its checkpointed resources) can indeed still come back?

> Consider introducing persistent agent ID
> 
>
> Key: MESOS-5368
> URL: https://issues.apache.org/jira/browse/MESOS-5368
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Neil Conway
>  Labels: mesosphere
>
> Currently, agent IDs identify a single "session" by an agent: that is, an 
> agent receives an agent ID when it registers with the master; it reuses that 
> agent ID if it disconnects and successfully reregisters; if the agent shuts 
> down and restarts, it registers anew and receives a new agent ID.
> It would be convenient to have a "persistent agent ID" that remains the same 
> for the duration of a given agent {{work_dir}}. This would mean that a given 
> persistent volume would not migrate between different persistent agent IDs 
> over time, for example (see MESOS-4894). If we supported permanently removing 
> an agent from the cluster (i.e., the {{work_dir}} and any volumes used by the 
> agent will never be reused), we could use the persistent agent ID to report 
> which agent has been removed.





[jira] [Created] (MESOS-8083) Mesos containerizer should run isolate() sequentially.

2017-10-12 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8083:
-

 Summary: Mesos containerizer should run isolate() sequentially.
 Key: MESOS-8083
 URL: https://issues.apache.org/jira/browse/MESOS-8083
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Yan Xu


Currently all the isolate() calls to all isolators are done in parallel, unlike 
prepare() and destroy(), which are done sequentially.

The following comment was provided as justification: 
https://github.com/apache/mesos/blob/615f1f0bcdfab4df264f37d2ebf528da2e6aa426/src/slave/containerizer/mesos/containerizer.cpp#L1894-L1907

{noformat:title=}
  // Isolate the executor with each isolator.
  // NOTE: This is done in parallel and is not sequenced like prepare
  // or destroy because we assume there are no dependencies in
  // isolation.
{noformat}

However, this is not strictly true; in particular, it's uncertain for isolator 
modules. To be safe, we should handle isolate() consistently with the other 
isolator calls and make it sequential.
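The proposed change can be sketched as follows (an illustrative model, not the actual containerizer code; `Step` and `isolateSequentially` are hypothetical names): each isolator's isolate() step runs strictly after the previous one completes, so any ordering dependency between isolators is honored.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Each isolator contributes one isolate() step that may append to the log.
using Step = std::function<void(std::vector<std::string>&)>;

// Sequential version: unlike a parallel collect() over all steps, each
// step only starts after the previous one has finished.
std::vector<std::string> isolateSequentially(
    const std::vector<std::pair<std::string, Step>>& isolators) {
  std::vector<std::string> log;
  for (const auto& [name, step] : isolators) {
    log.push_back("start:" + name);
    step(log);  // runs to completion before the next isolator starts
    log.push_back("done:" + name);
  }
  return log;
}
```

In real code this would be future chaining (each isolate() future's `.then()` starting the next) rather than a synchronous loop, matching how prepare() and destroy() are already sequenced.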





[jira] [Commented] (MESOS-5368) Consider introducing persistent agent ID

2017-10-12 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202473#comment-16202473
 ] 

Yan Xu commented on MESOS-5368:
---

[~vinodkone] This sounds good to me, just a few details which I hope are 
covered:

* Right now when the agent recovery fails we recommend {{rm -f 
/meta/slaves/latest}}; I guess going forward this will be changed to 
{{rm -f }}?
* Currently the agent would GC (instead of deleting immediately) all sandbox 
data from previous agents under the same . Going forward, are we 
requiring that "in order to start with a new agent, all sandboxes need to be 
deleted immediately (because of {{rm -f }})"?
* Currently if we delete the work_dir, the data in external volumes remains 
unchanged and will reappear when these volumes are used later. Should we 
provide a "purging" functionality to clean them up?
* Should we eventually remove the "slaves" and "latest" file system structure, 
since there is only going to be one agent per work dir?

> Consider introducing persistent agent ID
> 
>
> Key: MESOS-5368
> URL: https://issues.apache.org/jira/browse/MESOS-5368
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.2.1, 1.3.0
>Reporter: Neil Conway
>  Labels: mesosphere
>
> Currently, agent IDs identify a single "session" by an agent: that is, an 
> agent receives an agent ID when it registers with the master; it reuses that 
> agent ID if it disconnects and successfully reregisters; if the agent shuts 
> down and restarts, it registers anew and receives a new agent ID.
> It would be convenient to have a "persistent agent ID" that remains the same 
> for the duration of a given agent {{work_dir}}. This would mean that a given 
> persistent volume would not migrate between different persistent agent IDs 
> over time, for example (see MESOS-4894). If we supported permanently removing 
> an agent from the cluster (i.e., the {{work_dir}} and any volumes used by the 
> agent will never be reused), we could use the persistent agent ID to report 
> which agent has been removed.





[jira] [Updated] (MESOS-8076) PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy is flaky.

2017-10-12 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8076:
--
Shepherd: Alexander Rukletsov

> PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy is flaky.
> -
>
> Key: MESOS-8076
> URL: https://issues.apache.org/jira/browse/MESOS-8076
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>  Labels: flaky, flaky-test
> Attachments: SharedPersistentVolumeRescindOnDestroy-badrun.txt, 
> SharedPersistentVolumeRescindOnDestroy-goodrun.txt
>
>
> I'm observing 
> {{ROOT_MountDiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0}}
>  being flaky on our internal CI. From what I see in the logs, when 
> {{framework1}} accepts an offer, creates volumes, launches a task, and kills 
> it right after, the executor might manage to register in-between and hence an 
> unexpected {{TASK_RUNNING}} status update is sent. To fix this, one approach 
> is to explicitly wait for {{TASK_RUNNING}} before attempting to kill the task.





[jira] [Assigned] (MESOS-8076) PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy is flaky.

2017-10-12 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-8076:
-

Assignee: Yan Xu

> PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy is flaky.
> -
>
> Key: MESOS-8076
> URL: https://issues.apache.org/jira/browse/MESOS-8076
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
>Reporter: Alexander Rukletsov
>Assignee: Yan Xu
>  Labels: flaky, flaky-test
> Attachments: SharedPersistentVolumeRescindOnDestroy-badrun.txt, 
> SharedPersistentVolumeRescindOnDestroy-goodrun.txt
>
>
> I'm observing 
> {{ROOT_MountDiskResource/PersistentVolumeTest.SharedPersistentVolumeRescindOnDestroy/0}}
>  being flaky on our internal CI. From what I see in the logs, when 
> {{framework1}} accepts an offer, creates volumes, launches a task, and kills 
> it right after, the executor might manage to register in-between and hence an 
> unexpected {{TASK_RUNNING}} status update is sent. To fix this, one approach 
> is to explicitly wait for {{TASK_RUNNING}} before attempting to kill the task.





[jira] [Updated] (MESOS-8062) Master sends messages to the agent before it reregisters

2017-10-09 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-8062:
--
Component/s: master

> Master sends messages to the agent before it reregisters
> 
>
> Key: MESOS-8062
> URL: https://issues.apache.org/jira/browse/MESOS-8062
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Yan Xu
>Priority: Minor
>
> In a few instances the master sends messages to the agent regardless of 
> whether it is active, and these messages end up being dropped by the agent 
> with a warning:
> https://github.com/apache/mesos/blob/d79dd983e484bfe5690d34b53716e2c97f1d288e/src/master/master.cpp#L7173
> https://github.com/apache/mesos/blob/d79dd983e484bfe5690d34b53716e2c97f1d288e/src/master/master.cpp#L8479
> It can happen if the agent restarts but the message arrives before the agent 
> has reregistered. The master could check {{Slave.active}} and not send the 
> message. We should prune the warnings from these expected cases to reduce the 
> noise in the logs.





[jira] [Created] (MESOS-8062) Master sends messages to the agent before it reregisters

2017-10-09 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8062:
-

 Summary: Master sends messages to the agent before it reregisters
 Key: MESOS-8062
 URL: https://issues.apache.org/jira/browse/MESOS-8062
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu
Priority: Minor


In a few instances the master sends messages to the agent regardless of whether 
it is active, and these messages end up being dropped by the agent with a 
warning:

https://github.com/apache/mesos/blob/d79dd983e484bfe5690d34b53716e2c97f1d288e/src/master/master.cpp#L7173

https://github.com/apache/mesos/blob/d79dd983e484bfe5690d34b53716e2c97f1d288e/src/master/master.cpp#L8479

It can happen if the agent restarts but the message arrives before the agent has 
reregistered. The master could check {{Slave.active}} and not send the message. 
We should prune the warnings from these expected cases to reduce the noise in 
the logs.
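The suggested check can be sketched as a simple guard (illustrative only; this `Slave` struct and `shouldSend` are stand-ins, not the master's actual class or code path):

```cpp
// Stand-in for the master's per-agent bookkeeping: `active` is set once
// the agent has (re)registered with this master.
struct Slave {
  bool connected;
  bool active;
};

// Returns true iff a message should actually be sent to this agent.
// Checking before sending avoids the agent-side "dropping message"
// warning for the expected restart-before-reregistration window.
bool shouldSend(const Slave& slave) {
  return slave.connected && slave.active;
}
```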





[jira] [Commented] (MESOS-6918) Prometheus exporter endpoints for metrics

2017-10-05 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194161#comment-16194161
 ] 

Yan Xu commented on MESOS-6918:
---

[~bmahler] let's chat about the reviews? [~jpe...@apache.org] and I have 
already discussed this offline and I have added comments to the design doc and 
the reviews. Here's the summary:

- I am not convinced about the newly introduced {{enum Semantics \{COUNTER, 
GAUGE\}}}. We already have metric *types* called {{Counter}} and {{Gauge}}, 
and I think people could confuse Counter the semantics with Counter the type, 
for example.
-- I understand that the semantics is supposed to help express:
bq. {{Timer}}'s value should be cumulative / monotonically increasing
(because it's more useful that way, as explained in the design doc), but this 
enum seems to suggest that all metric types (potentially future ones as well) 
can and should be classified into one of the two buckets. Are we sure this is 
the right/only criterion? (The examples cited in the design doc don't 
consistently define this, and none defines it as "semantics".) Could there be 
other dimensions / features to classify metrics? To me 
{{s/Semantics/Monotonicity/}} would have been clearer, but I am not sure about 
the usefulness of that either.
-- The only use of this enum right now is to pass the metric type info down to 
the Prometheus formatter. We can instead define {{enum Type \{COUNTER, GAUGE, 
TIMER\}}} and pass that down.
- I hope we confine the Prometheus logic to a 
`metrics/formatters/prometheus.hpp|cpp` file and keep the {{MetricsProcess}} 
logic generic.
- I think we can keep the meaning of the existing field {{Timer.value()}} (the 
last sampled value). We can add a new field {{sum}} in the {{TimeSeries}} 
alongside the new {{total}} (can we name it something like {{totalCount}}?) to 
provide Prometheus its required info.
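A minimal sketch of the direction suggested above (illustrative only; the enum and function names are hypothetical, not the proposed Mesos API): a plain metric {{Type}} enum passed down to a self-contained Prometheus text-format writer, keeping the formatting logic in one place.

```cpp
#include <sstream>
#include <string>

// Plain metric type enum, instead of a separate "Semantics" concept.
enum class Type { COUNTER, GAUGE, TIMER };

// Render one sample in the Prometheus text exposition format.
// Prometheus itself only has counter/gauge (plus summary/histogram);
// a Timer's cumulative count would be exposed as a counter and its
// last sampled value as a gauge.
std::string prometheusLine(const std::string& name, Type type, double value) {
  std::ostringstream out;
  const char* ptype = (type == Type::COUNTER) ? "counter" : "gauge";
  out << "# TYPE " << name << " " << ptype << "\n"
      << name << " " << value << "\n";
  return out.str();
}
```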

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.





[jira] [Commented] (MESOS-1280) Add replace task primitive

2017-10-02 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188488#comment-16188488
 ] 

Yan Xu commented on MESOS-1280:
---

Probably not all fields in the TaskInfo make equal sense to be updatable or 
justify the complexity. If possible we probably still prefer treating tasks as 
cattle and want to only give them pet treatment for certain important benefits. 

[~zhitao] could you elaborate on the use cases you were thinking about? I see 
that in 

 you mentioned in-place upgrades and launching zero-resource tasks onto running 
executors, among others. 

I am asking because I recently started looking into something related to this. 
I may poll the user/dev thread later but am starting here first.

> Add replace task primitive
> --
>
> Key: MESOS-1280
> URL: https://issues.apache.org/jira/browse/MESOS-1280
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, c++ api, master
>Reporter: Niklas Quarfot Nielsen
>  Labels: mesosphere
>
> Also along the lines of MESOS-938, replaceTask would one of a couple of 
> primitives needed to support various task replacement and scaling scenarios. 
> This replaceTask() version is significantly simpler than the first proposed 
> one; it's only responsibility is to run a new task info on a running tasks 
> resources.
> The running task will be killed as usual, but the newly freed resources will 
> never be announced and the new task will run on them instead.





[jira] [Commented] (MESOS-7215) Race condition on re-registration of non-partition-aware frameworks

2017-09-29 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186286#comment-16186286
 ] 

Yan Xu commented on MESOS-7215:
---

[~megha.sharma] Per offline discussion, we should probably bundle what was 
described in MESOS-6406 into this ticket, but for all tasks and not just 
partition-aware tasks, since we are not killing the non-partition-aware tasks 
now. We can commit both in this JIRA (because one without the other could be 
seen as a regression) and mark MESOS-6406 as fixed by this ticket.

> Race condition on re-registration of non-partition-aware frameworks
> ---
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yan Xu
>Assignee: Megha Sharma
>Priority: Critical
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos is now sending 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered)
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent is assuming the framework to be going away (and act accordingly) when 
> it's not. 





[jira] [Commented] (MESOS-7964) Heavy-duty GC makes the agent unresponsive

2017-09-26 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16181817#comment-16181817
 ] 

Yan Xu commented on MESOS-7964:
---

{noformat:title=master}
commit 06341309e61a5cee702ea3c7b6d3ef340ac95ad0
Author: Chun-Hung Hsiao 
Date:   Tue Sep 26 17:07:11 2017 -0700

Prevent GC path removals from blocking other processes.

This patch dispatches all path removals to a single executor instead of
one `AsyncExecutor` per path such that heavy-duty removals won't occupy
all worker threads and block other actors.

Review: https://reviews.apache.org/r/62230/
{noformat}


{noformat:title=1.4.x}
commit 27b83565082720cbc9c93b3b892305b899af84b7
Author: Chun-Hung Hsiao 
Date:   Tue Sep 26 17:07:11 2017 -0700

Prevent GC path removals from blocking other processes.

This patch dispatches all path removals to a single executor instead of
one `AsyncExecutor` per path such that heavy-duty removals won't occupy
all worker threads and block other actors.

Review: https://reviews.apache.org/r/62230/

commit bf82953f1ede7ddf182f9cad79a3248ef2630dc8
Author: Chun-Hung Hsiao 
Date:   Mon Sep 25 14:10:27 2017 -0700

Added `process::Executor::execute()`.

This patch adds a convenient interface to `process::Executor` to
asynchronously execute arbitrary functions.

Review: https://reviews.apache.org/r/62252/
{noformat}
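The idea in the committed patch, funneling all path removals through one executor so a burst of heavy deletions occupies a single thread rather than the whole worker pool, can be sketched with a minimal serial executor. This is a simplified standalone sketch, not libprocess's actual `Executor` or the patch itself:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Minimal serial executor: tasks run one at a time, in submission order, on a
// single dedicated thread. Heavy-duty removals queued here cannot starve the
// rest of the actor system's worker threads.
class SerialExecutor {
 public:
  SerialExecutor() : worker([this] { run(); }) {}

  ~SerialExecutor() {
    {
      std::lock_guard<std::mutex> lock(mutex);
      done = true;
    }
    cv.notify_one();
    worker.join();  // drains any remaining tasks before exiting
  }

  void execute(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex);
      tasks.push(std::move(task));
    }
    cv.notify_one();
  }

 private:
  void run() {
    while (true) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [this] { return done || !tasks.empty(); });
        if (tasks.empty()) return;  // done and fully drained
        task = std::move(tasks.front());
        tasks.pop();
      }
      task();  // executed outside the lock, serially
    }
  }

  std::mutex mutex;
  std::condition_variable cv;
  std::queue<std::function<void()>> tasks;
  bool done = false;
  std::thread worker;  // declared last so it starts after the other members
};
```

With one `AsyncExecutor` per path, ~300 concurrent removals could occupy every worker thread; with a single serial executor they cost at most one.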

> Heavy-duty GC makes the agent unresponsive
> --
>
> Key: MESOS-7964
> URL: https://issues.apache.org/jira/browse/MESOS-7964
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.4.0
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
> Fix For: 1.4.1, 1.5.0
>
>
> An agent is observed to perform heavy-duty GC every half an hour:
> {noformat}
> Sep 07 18:15:56 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:15:56.900282 16054 slave.cpp:5920] Current disk 
> usage 93.61%. Max allowed age: 0ns
> Sep 07 18:15:56 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:15:56.900476 16054 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99022105972148days
> ...
> Sep 07 18:22:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:22:08.173645 16050 gc.cpp:178] Deleted 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__f33065c9-eb42-44a7-9013-25bafc306bd5'
> ...
> Sep 07 18:41:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:41:08.195329 16051 slave.cpp:5920] Current disk 
> usage 90.85%. Max allowed age: 0ns
> Sep 07 18:41:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:41:08.195503 16051 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99028708946667days
> ...
> Sep 07 18:49:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:49:01.253906 16049 gc.cpp:178] Deleted 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__014b451a-30de-41ee-b0b1-3733c790382c/runs/c5b922e8-eee0-4793-8637-7abbd7f8507e'
> ...
> Sep 07 19:08:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:08:01.291092 16048 slave.cpp:5920] Current disk 
> usage 91.39%. Max allowed age: 0ns
> Sep 07 19:08:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:08:01.291285 16048 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99028598086815days
> ...
> Sep 07 19:14:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: W0907 19:14:50.737226 16050 gc.cpp:174] Failed to delete 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__4139bf2e-e33b-4743-8527-f8f50ac49280/runs/b1991e28-7ff8-476f-8122-1a483e431ff2':
>  No such file or directory
> ...
> Sep 07 19:33:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:33:50.758191 16052 slave.cpp:5920] Current disk 
> usage 91.39%. Max allowed age: 0ns
> Sep 07 19:33:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:33:50.758872 16047 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99028057238519days
> ...
> Sep 07 19:39:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:39:43.081485 16052 gc.cpp:178] Deleted 
> 

[jira] [Updated] (MESOS-7964) Heavy-duty GC makes the agent unresponsive

2017-09-26 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-7964:
--
Fix Version/s: 1.5.0

> Heavy-duty GC makes the agent unresponsive
> --
>
> Key: MESOS-7964
> URL: https://issues.apache.org/jira/browse/MESOS-7964
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.4.0
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
> Fix For: 1.4.1, 1.5.0
>
>
> An agent is observed to perform heavy-duty GC every half an hour:
> {noformat}
> Sep 07 18:15:56 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:15:56.900282 16054 slave.cpp:5920] Current disk 
> usage 93.61%. Max allowed age: 0ns
> Sep 07 18:15:56 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:15:56.900476 16054 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99022105972148days
> ...
> Sep 07 18:22:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:22:08.173645 16050 gc.cpp:178] Deleted 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__f33065c9-eb42-44a7-9013-25bafc306bd5'
> ...
> Sep 07 18:41:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:41:08.195329 16051 slave.cpp:5920] Current disk 
> usage 90.85%. Max allowed age: 0ns
> Sep 07 18:41:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:41:08.195503 16051 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99028708946667days
> ...
> Sep 07 18:49:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:49:01.253906 16049 gc.cpp:178] Deleted 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__014b451a-30de-41ee-b0b1-3733c790382c/runs/c5b922e8-eee0-4793-8637-7abbd7f8507e'
> ...
> Sep 07 19:08:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:08:01.291092 16048 slave.cpp:5920] Current disk 
> usage 91.39%. Max allowed age: 0ns
> Sep 07 19:08:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:08:01.291285 16048 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99028598086815days
> ...
> Sep 07 19:14:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: W0907 19:14:50.737226 16050 gc.cpp:174] Failed to delete 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__4139bf2e-e33b-4743-8527-f8f50ac49280/runs/b1991e28-7ff8-476f-8122-1a483e431ff2':
>  No such file or directory
> ...
> Sep 07 19:33:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:33:50.758191 16052 slave.cpp:5920] Current disk 
> usage 91.39%. Max allowed age: 0ns
> Sep 07 19:33:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:33:50.758872 16047 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99028057238519days
> ...
> Sep 07 19:39:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:39:43.081485 16052 gc.cpp:178] Deleted 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__d89dce1f-609b-4cf8-957a-5ba198be7828'
> ...
> Sep 07 19:59:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:59:43.150535 16048 slave.cpp:5920] Current disk 
> usage 94.56%. Max allowed age: 0ns
> Sep 07 19:59:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:59:43.150869 16054 gc.cpp:218] Pruning 
> directories with remaining removal time 1.98959316198222days
> ...
> Sep 07 20:06:16 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 20:06:16.251552 16051 gc.cpp:178] Deleted 
> '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0259/executors/data__45283e7d-9a5e-4d4b-9901-b7f1e096cd54/runs/5cfc5e3e-3975-41aa-846b-c125eb529fbe'
> {noformat}
> Each GC activity took 5+ minutes. During this period the agent became 
> unresponsive: the health check timed out and no endpoint responded either. 
> When a disk-usage GC is triggered, around 300 pruning actors would be 
> generated 
> (https://github.com/apache/mesos/blob/master/src/slave/gc.cpp#L229). My 
> hypothesis is that these actors would use all of the 

[jira] [Updated] (MESOS-7964) Heavy-duty GC makes the agent unresponsive

2017-09-26 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-7964:
--
Affects Version/s: 1.4.0

> Heavy-duty GC makes the agent unresponsive
> --
>
> Key: MESOS-7964
> URL: https://issues.apache.org/jira/browse/MESOS-7964
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.4.0
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
> Fix For: 1.4.1
>
>
> An agent is observed to perform heavy-duty GC every half an hour:
> {noformat}
> Sep 07 18:15:56 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:15:56.900282 16054 slave.cpp:5920] Current disk 
> usage 93.61%. Max allowed age: 0ns
> Sep 07 18:15:56 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:15:56.900476 16054 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99022105972148days
> ...
> Sep 07 18:22:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:22:08.173645 16050 gc.cpp:178] Deleted 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__f33065c9-eb42-44a7-9013-25bafc306bd5'
> ...
> Sep 07 18:41:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:41:08.195329 16051 slave.cpp:5920] Current disk 
> usage 90.85%. Max allowed age: 0ns
> Sep 07 18:41:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:41:08.195503 16051 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99028708946667days
> ...
> Sep 07 18:49:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 18:49:01.253906 16049 gc.cpp:178] Deleted 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__014b451a-30de-41ee-b0b1-3733c790382c/runs/c5b922e8-eee0-4793-8637-7abbd7f8507e'
> ...
> Sep 07 19:08:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:08:01.291092 16048 slave.cpp:5920] Current disk 
> usage 91.39%. Max allowed age: 0ns
> Sep 07 19:08:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:08:01.291285 16048 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99028598086815days
> ...
> Sep 07 19:14:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: W0907 19:14:50.737226 16050 gc.cpp:174] Failed to delete 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__4139bf2e-e33b-4743-8527-f8f50ac49280/runs/b1991e28-7ff8-476f-8122-1a483e431ff2':
>  No such file or directory
> ...
> Sep 07 19:33:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:33:50.758191 16052 slave.cpp:5920] Current disk 
> usage 91.39%. Max allowed age: 0ns
> Sep 07 19:33:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:33:50.758872 16047 gc.cpp:218] Pruning 
> directories with remaining removal time 1.99028057238519days
> ...
> Sep 07 19:39:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:39:43.081485 16052 gc.cpp:178] Deleted 
> '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__d89dce1f-609b-4cf8-957a-5ba198be7828'
> ...
> Sep 07 19:59:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:59:43.150535 16048 slave.cpp:5920] Current disk 
> usage 94.56%. Max allowed age: 0ns
> Sep 07 19:59:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 19:59:43.150869 16054 gc.cpp:218] Pruning 
> directories with remaining removal time 1.98959316198222days
> ...
> Sep 07 20:06:16 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com 
> mesos-agent[16040]: I0907 20:06:16.251552 16051 gc.cpp:178] Deleted 
> '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0259/executors/data__45283e7d-9a5e-4d4b-9901-b7f1e096cd54/runs/5cfc5e3e-3975-41aa-846b-c125eb529fbe'
> {noformat}
> Each GC activity took 5+ minutes. During this period the agent became 
> unresponsive: the health check timed out and no endpoint responded either. 
> When a disk-usage GC is triggered, around 300 pruning actors would be 
> generated 
> (https://github.com/apache/mesos/blob/master/src/slave/gc.cpp#L229). My 
> hypothesis is that these actors would use all of the 

[jira] [Commented] (MESOS-7921) process::EventQueue sometimes crashes

2017-09-06 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155706#comment-16155706
 ] 

Yan Xu commented on MESOS-7921:
---

Tried out the patch and it seems to work: I ran mesos-tests for many 
iterations and no crash occurred (whereas the top of tree does crash in the 
same environment).

The environment I used is a standard [ubuntu 
xenial|https://app.vagrantup.com/ubuntu/boxes/xenial64] VM with two CPUs, 
running on my desktop.

> process::EventQueue sometimes crashes
> -
>
> Key: MESOS-7921
> URL: https://issues.apache.org/jira/browse/MESOS-7921
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.4.0
> Environment: autotools,gcc,--verbose,GLOG_v=1 
> MESOS_VERBOSE=1,ubuntu:14.04,(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)
> Note that --enable-lock-free-event-queue is not enabled.
> Details: 
> https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/injectedEnvVars/
>Reporter: Yan Xu
>Priority: Blocker
> Attachments: 
> FetcherCacheTest.CachedCustomOutputFileWithSubdirectory.log.txt, 
> MesosContainerizerSlaveRecoveryTest.ResourceStatisticsFullLog.txt
>
>
> The following segfault is found on 
> [ASF|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/]
>  in {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} but it's flaky 
> and shows up in other tests and environments (with or without 
> --enable-lock-free-event-queue) as well.
> {noformat: title=Configuration}
> ./bootstrap '&&' ./configure --verbose '&&' make -j6 distcheck
> {noformat}
> {noformat:title=}
> *** Aborted at 1503937885 (unix time) try "date -d @1503937885" if you are 
> using GNU date ***
> PC: @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> *** SIGSEGV (@0x8) received by PID 751 (TID 0x2b9e31978700) from PID 8; stack 
> trace: ***
> @ 0x2b9e29d26330 (unknown)
> @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> @ 0x2b9e25800a40 process::ProcessManager::resume()
> @ 0x2b9e2580f891 
> process::ProcessManager::init_threads()::$_9::operator()()
> @ 0x2b9e2580f7d5 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_9vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2b9e2580f7a5 std::_Bind_simple<>::operator()()
> @ 0x2b9e2580f77c std::thread::_Impl<>::_M_run()
> @ 0x2b9e29fe5a60 (unknown)
> @ 0x2b9e29d1e184 start_thread
> @ 0x2b9e2a851ffd (unknown)
> make[3]: *** [CMakeFiles/check] Segmentation fault (core dumped)
> {noformat}
> A bui...@mesos.apache.org query shows many such instances: 
> https://lists.apache.org/list.html?bui...@mesos.apache.org:lte=1M:process%3A%3AEventQueue%3A%3AConsumer%3A%3Aempty





[jira] [Commented] (MESOS-7921) process::EventQueue sometimes crashes

2017-09-01 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151119#comment-16151119
 ] 

Yan Xu commented on MESOS-7921:
---

So libprocess's GC deletes managed processes upon their exit: 
https://github.com/apache/mesos/blob/1ae308c2f1344d9e62e094ab11cc195c96eb5c04/3rdparty/libprocess/include/process/gc.hpp#L45

{code:title=}
  virtual void exited(const UPID& pid)
  {
if (processes.count(pid) > 0) {
  const ProcessBase* process = processes[pid];
  processes.erase(pid);
  delete process;
}
  }
{code}

What happens when another process that is waiting on it donates its thread to 
this process, and the process is terminated after it is extracted from the run 
queue? Could it be destructed before it is resumed?
 
https://github.com/apache/mesos/blob/1ae308c2f1344d9e62e094ab11cc195c96eb5c04/3rdparty/libprocess/src/process.cpp#L3581-L3587

{code:title=}
  if (process != nullptr) {
VLOG(2) << "Donating thread to " << process->pid << " while waiting";
ProcessBase* donator = __process__;
resume(process);
running.fetch_sub(1);
__process__ = donator;
  }
{code}
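The suspected lifetime hazard above can be illustrated deterministically with toy types (these are hypothetical stand-ins, not libprocess's `ProcessBase` or run queue): if the GC deletes a process after it is dequeued but before the donated thread resumes it, `resume()` runs on a dangling pointer. Pinning the dequeued process with a strong reference is one way such a race could be ruled out:

```cpp
#include <memory>
#include <string>

// Toy process type. Calling resume() on a deleted instance would be
// undefined behavior -- the hazard asked about above.
struct ToyProcess {
  std::string pid;
  bool resumed = false;
  void resume() { resumed = true; }
};

// The donating thread extracts the process from the "run queue" and keeps a
// strong reference, so a concurrent GC dropping its own reference (as in
// GarbageCollector::exited()) cannot destroy the process mid-resume.
inline bool donateAndResume(std::shared_ptr<ToyProcess>& gcOwned) {
  std::shared_ptr<ToyProcess> pinned = gcOwned;  // pin before resuming
  gcOwned.reset();   // models GC running exited() and dropping its reference
  pinned->resume();  // still safe: we hold the last reference
  return pinned->resumed;
}
```

With raw pointers, as in the quoted `exited()` above, the `delete` and the `resume(process)` are not ordered with respect to each other, which would match the observed use-after-free-style segfault.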

> process::EventQueue sometimes crashes
> -
>
> Key: MESOS-7921
> URL: https://issues.apache.org/jira/browse/MESOS-7921
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.4.0
> Environment: autotools,gcc,--verbose,GLOG_v=1 
> MESOS_VERBOSE=1,ubuntu:14.04,(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)
> Note that --enable-lock-free-event-queue is not enabled.
> Details: 
> https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/injectedEnvVars/
>Reporter: Yan Xu
>Priority: Blocker
> Attachments: 
> FetcherCacheTest.CachedCustomOutputFileWithSubdirectory.log.txt, 
> MesosContainerizerSlaveRecoveryTest.ResourceStatisticsFullLog.txt
>
>
> The following segfault is found on 
> [ASF|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/]
>  in {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} but it's flaky 
> and shows up in other tests and environments (with or without 
> --enable-lock-free-event-queue) as well.
> {noformat: title=Configuration}
> ./bootstrap '&&' ./configure --verbose '&&' make -j6 distcheck
> {noformat}
> {noformat:title=}
> *** Aborted at 1503937885 (unix time) try "date -d @1503937885" if you are 
> using GNU date ***
> PC: @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> *** SIGSEGV (@0x8) received by PID 751 (TID 0x2b9e31978700) from PID 8; stack 
> trace: ***
> @ 0x2b9e29d26330 (unknown)
> @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> @ 0x2b9e25800a40 process::ProcessManager::resume()
> @ 0x2b9e2580f891 
> process::ProcessManager::init_threads()::$_9::operator()()
> @ 0x2b9e2580f7d5 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_9vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2b9e2580f7a5 std::_Bind_simple<>::operator()()
> @ 0x2b9e2580f77c std::thread::_Impl<>::_M_run()
> @ 0x2b9e29fe5a60 (unknown)
> @ 0x2b9e29d1e184 start_thread
> @ 0x2b9e2a851ffd (unknown)
> make[3]: *** [CMakeFiles/check] Segmentation fault (core dumped)
> {noformat}
> A bui...@mesos.apache.org query shows many such instances: 
> https://lists.apache.org/list.html?bui...@mesos.apache.org:lte=1M:process%3A%3AEventQueue%3A%3AConsumer%3A%3Aempty





[jira] [Commented] (MESOS-7921) process::EventQueue sometimes crashes

2017-09-01 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151096#comment-16151096
 ] 

Yan Xu commented on MESOS-7921:
---

New failure on ASF CI: 
https://lists.apache.org/thread.html/bf6cacef549f0822814914b32e281a55ce32a02232bef5070cce512c@%3Cbuilds.mesos.apache.org%3E

Similar to the one posted in the JIRA description.
{noformat:title=}
*** Aborted at 1504241455 (unix time) try "date -d @1504241455" if you are 
using GNU date ***
I0901 04:50:55.571101 779 registrar.cpp:424] Successfully recovered registrar
I0901 04:50:55.571496 779 master.cpp:1804] Recovered 0 agents from the registry 
(129B); allowing 10mins for agents to re-register
I0901 04:50:55.571521 793 hierarchical.cpp:209] Skipping recovery of 
hierarchical allocator: nothing to recover
PC: @ 0x2b4f0af34c80 process::EventQueue::Consumer::empty()
*** SIGSEGV (@0x8) received by PID 773 (TID 0x2b4f17caa700) from PID 8; stack 
trace: ***
@ 0x2b4f0f452330 (unknown)
@ 0x2b4f0af34c80 process::EventQueue::Consumer::empty()
@ 0x2b4f0af18c20 process::ProcessManager::resume()
@ 0x2b4f0af27a71 process::ProcessManager::init_threads()::$_9::operator()()
@ 0x2b4f0af279b5 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_9vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x2b4f0af27985 std::_Bind_simple<>::operator()()
@ 0x2b4f0af2795c std::thread::_Impl<>::_M_run()
@ 0x2b4f0f711a60 (unknown)
@ 0x2b4f0f44a184 start_thread
@ 0x2b4f0ff7dffd (unknown)
{noformat}

> process::EventQueue sometimes crashes
> -
>
> Key: MESOS-7921
> URL: https://issues.apache.org/jira/browse/MESOS-7921
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.4.0
> Environment: autotools,gcc,--verbose,GLOG_v=1 
> MESOS_VERBOSE=1,ubuntu:14.04,(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)
> Note that --enable-lock-free-event-queue is not enabled.
> Details: 
> https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/injectedEnvVars/
>Reporter: Yan Xu
>Priority: Blocker
> Attachments: 
> FetcherCacheTest.CachedCustomOutputFileWithSubdirectory.log.txt, 
> MesosContainerizerSlaveRecoveryTest.ResourceStatisticsFullLog.txt
>
>
> The following segfault is found on 
> [ASF|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/]
>  in {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} but it's flaky 
> and shows up in other tests and environments (with or without 
> --enable-lock-free-event-queue) as well.
> {noformat: title=Configuration}
> ./bootstrap '&&' ./configure --verbose '&&' make -j6 distcheck
> {noformat}
> {noformat:title=}
> *** Aborted at 1503937885 (unix time) try "date -d @1503937885" if you are 
> using GNU date ***
> PC: @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> *** SIGSEGV (@0x8) received by PID 751 (TID 0x2b9e31978700) from PID 8; stack 
> trace: ***
> @ 0x2b9e29d26330 (unknown)
> @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> @ 0x2b9e25800a40 process::ProcessManager::resume()
> @ 0x2b9e2580f891 
> process::ProcessManager::init_threads()::$_9::operator()()
> @ 0x2b9e2580f7d5 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_9vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2b9e2580f7a5 std::_Bind_simple<>::operator()()
> @ 0x2b9e2580f77c std::thread::_Impl<>::_M_run()
> @ 0x2b9e29fe5a60 (unknown)
> @ 0x2b9e29d1e184 start_thread
> @ 0x2b9e2a851ffd (unknown)
> make[3]: *** [CMakeFiles/check] Segmentation fault (core dumped)
> {noformat}
> A bui...@mesos.apache.org query shows many such instances: 
> https://lists.apache.org/list.html?bui...@mesos.apache.org:lte=1M:process%3A%3AEventQueue%3A%3AConsumer%3A%3Aempty





[jira] [Commented] (MESOS-6918) Prometheus exporter endpoints for metrics

2017-09-01 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150848#comment-16150848
 ] 

Yan Xu commented on MESOS-6918:
---

I think [~bmahler]'s questions (and mine below) suggest we need a (mini) design 
doc about the overarching methodology here. The summary of each review and the 
existing comments in this JIRA do not provide enough of a high-level design, so 
the justification for each patch is not clear.

A few more questions:

- I understand that complex Prometheus metric types such as {{summary}} require 
more data than is currently provided, so we need to add it somewhere. But such 
additions should go into Mesos' existing "core" metrics classes only if they 
are generic (backwards-compatible) improvements that make sense regardless of 
the Prometheus support. I believe this is indeed your goal, but we need to 
articulate how the current timer/statistics modeling is lacking/wrong.
-- There are some patches that remove history from simple metrics because it 
doesn't make sense there. Should the history then be put in another base type 
that {{Counter}} and {{Gauge}} don't derive from?
- After the above is done, is Prometheus support merely a specific output 
format? If so, can we encapsulate the formatting logic in a formatter 
class/method instead of in the main metrics endpoint actor?
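The formatter encapsulation suggested in the last question could look roughly like the following. This is a hypothetical sketch (class and method names are invented, not an actual Mesos API), assuming the endpoint actor gathers a snapshot of name/value pairs and hands it to a class that only knows the Prometheus text exposition format:

```cpp
#include <map>
#include <sstream>
#include <string>

// Renders a metrics snapshot in the Prometheus text exposition format,
// keeping all format-specific knowledge out of the metrics endpoint actor.
class PrometheusFormatter {
 public:
  static std::string format(const std::map<std::string, double>& snapshot) {
    std::ostringstream out;
    for (const auto& entry : snapshot) {
      out << sanitize(entry.first) << " " << entry.second << "\n";
    }
    return out.str();
  }

 private:
  // Prometheus metric names may not contain '/', so hierarchical Mesos names
  // like "master/uptime_secs" are mapped onto e.g. "master_uptime_secs".
  static std::string sanitize(const std::string& name) {
    std::string result = name;
    for (char& c : result) {
      if (c == '/' || c == '.' || c == '-') {
        c = '_';
      }
    }
    return result;
  }
};
```

Under this split, supporting another wire format would mean adding another formatter rather than touching the endpoint actor.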


> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.





[jira] [Commented] (MESOS-7921) process::EventQueue sometimes crashes

2017-08-31 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149737#comment-16149737
 ] 

Yan Xu commented on MESOS-7921:
---

[~benjaminhindman]

In the newly attached 
FetcherCacheTest.CachedCustomOutputFileWithSubdirectory.log.txt:

{noformat:title=}
W0831 22:06:30.170509 32070 process.cpp:3240] Attempted to spawn already 
running process version@127.0.1.1:43674
W0831 22:06:30.179316 32070 process.cpp:3240] Attempted to spawn already 
running process files@127.0.1.1:43674
{noformat}

{noformat:title=}
*** Aborted at 1504217191 (unix time) try "date -d @1504217191" if you are 
using GNU date ***
PC: @ 0x7f43f8cb7956 process::EventQueue::Consumer::empty()
*** SIGSEGV (@0x8) received by PID 32070 (TID 0x7f43fa98c800) from PID 8; stack 
trace: ***
@ 0x7f43f070f390 (unknown)
@ 0x7f43f8cb7956 process::EventQueue::Consumer::empty()
@ 0x7f43f8ca2be5 process::ProcessManager::resume()
@ 0x7f43f8ca3b80 process::ProcessManager::wait()
@ 0x7f43f8ca8d7d process::wait()
@ 0x7f43f8c4c3c7 process::Latch::await()
@  0x1ea6749 process::Future<>::await()
@ 0x7f43f7e897f0 
mesos::internal::slave::FetcherProcess::Metrics::~Metrics()
@ 0x7f43f7e89d16 
mesos::internal::slave::FetcherProcess::~FetcherProcess()
@  0x195c578 
mesos::internal::tests::MockFetcherProcess::~MockFetcherProcess()
@  0x195c5ce 
mesos::internal::tests::MockFetcherProcess::~MockFetcherProcess()
@  0x13b68c1 process::Owned<>::Data::~Data()
@  0x13bfdfe std::_Sp_counted_ptr<>::_M_dispose()
@   0xd065ce std::_Sp_counted_base<>::_M_release()
@   0xd04875 std::__shared_count<>::~__shared_count()
@  0x139f916 std::__shared_ptr<>::~__shared_ptr()
@  0x139f932 std::shared_ptr<>::~shared_ptr()
@  0x139f94e process::Owned<>::~Owned()
@ 0x7f43f7e87b38 mesos::internal::slave::Fetcher::~Fetcher()
@ 0x7f43f7e87b7c mesos::internal::slave::Fetcher::~Fetcher()
@  0x115ee95 process::Owned<>::Data::~Data()
@  0x1160f5c std::_Sp_counted_ptr<>::_M_dispose()
@   0xd065ce std::_Sp_counted_base<>::_M_release()
@   0xd04875 std::__shared_count<>::~__shared_count()
@  0x1150d3c std::__shared_ptr<>::~__shared_ptr()
@  0x1150d58 std::shared_ptr<>::~shared_ptr()
@  0x1150d74 process::Owned<>::~Owned()
@  0x13a0054 
mesos::internal::tests::FetcherCacheTest::~FetcherCacheTest()
@  0x13bf7b2 
mesos::internal::tests::FetcherCacheTest_CachedCustomOutputFileWithSubdirectory_Test::~FetcherCacheTest_CachedCustomOutputFileWithSubdirectory_Test()
@  0x13bf7e2 
mesos::internal::tests::FetcherCacheTest_CachedCustomOutputFileWithSubdirectory_Test::~FetcherCacheTest_CachedCustomOutputFileWithSubdirectory_Test()
@  0x23a9eee testing::Test::DeleteSelf_()
@  0x23b65bf 
testing::internal::HandleSehExceptionsInMethodIfSupported<>()
Segmentation fault (core dumped)
{noformat}

> process::EventQueue sometimes crashes
> -
>
> Key: MESOS-7921
> URL: https://issues.apache.org/jira/browse/MESOS-7921
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.4.0
> Environment: autotools,gcc,--verbose,GLOG_v=1 
> MESOS_VERBOSE=1,ubuntu:14.04,(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)
> Note that --enable-lock-free-event-queue is not enabled.
> Details: 
> https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/injectedEnvVars/
>Reporter: Yan Xu
>Priority: Blocker
> Attachments: 
> FetcherCacheTest.CachedCustomOutputFileWithSubdirectory.log.txt, 
> MesosContainerizerSlaveRecoveryTest.ResourceStatisticsFullLog.txt
>
>
> The following segfault is found on 
> [ASF|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)/4159/]
>  in {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} but it's flaky 
> and shows up in other tests and environments (with or without 
> --enable-lock-free-event-queue) as well.
> {noformat: title=Configuration}
> ./bootstrap '&&' ./configure --verbose '&&' make -j6 distcheck
> {noformat}
> {noformat:title=}
> *** Aborted at 1503937885 (unix time) try "date -d @1503937885" if you are 
> using GNU date ***
> PC: @ 0x2b9e2581caa0 process::EventQueue::Consumer::empty()
> *** SIGSEGV (@0x8) received by PID 751 (TID 0x2b9e31978700) from PID 8; stack 
> trace: ***
> @ 
