[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-13 Thread Xudong Ni (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614185#comment-16614185
 ] 

Xudong Ni commented on MESOS-9178:
--

[~bmahler] What are your thoughts?

> Add a metric for master failover time.
> --
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
> Issue Type: Improvement
> Components: master
> Reporter: Xudong Ni
> Assignee: Xudong Ni
> Priority: Minor
>
> When an agent reregisters, the time delta between that moment and
> the master's election time is saved. As reregistration progresses,
> each data entry represents one agent's reregistration time delta from
> the master's election time. The percentiles of these data, as exposed
> by these metrics, represent overall reregistration progress. If
> reregistration degrades towards the end, the high percentiles will
> reflect it.
> Note: These metrics only cover completed reregistrations. They do not
> track agents that never reregistered and were eventually marked
> unreachable; unreachable agents are already covered by existing
> metrics.
> https://reviews.apache.org/r/68706/
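
For illustration only, here is a minimal sketch of the calculation the description outlines. It is not the code in the review: the function names, the nearest-rank percentile rule and the sample timestamps are all assumptions made for the example.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list (p is an integer, 0..100)."""
    if not sorted_values:
        raise ValueError("no completed reregistrations yet")
    # Integer ceiling of p * n / 100; clamp to 1 so p=0 returns the minimum.
    rank = max(1, (p * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def reregistration_percentiles(elected_at, reregistered_at, ps=(50, 90, 99, 100)):
    """Percentiles of (reregistration time - master elected time).

    Only completed reregistrations are passed in, so agents that end up
    unreachable never contribute a data entry.
    """
    deltas = sorted(t - elected_at for t in reregistered_at)
    return {f"p{p}": percentile(deltas, p) for p in ps}


# Hypothetical timestamps in seconds: master elected at t=100.
stats = reregistration_percentiles(100, [110, 101, 104, 102, 103])
print(stats)
```

A single slow agent (the one at t=110 here) shows up in the high percentiles while the median stays low, which is the late-degradation signal the description mentions.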



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-12 Thread Xudong Ni (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612890#comment-16612890
 ] 

Xudong Ni commented on MESOS-9178:
--

The difference between the PR and what Yan suggested is how we calculate the 
comparison base. The comparison base in the PR is the number of reregistrations 
that actually happened (so every percentile is guaranteed to exist, and the max 
is the last reregistration). The comparison base in what Yan suggested is the 
number of reregistrations that actually happened plus the reregistrations that 
didn't go through, such as agents marked unreachable. Since we already have a 
metric covering unreachable agents, I think it may be better not to bake that 
factor into this metric. The percentage in the proposal represents not only 
reregistration performance; it is also affected by the number of unreachable 
agents.
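
A small worked example of the two comparison bases, as a sketch under assumed numbers: 10 agents recovered from the registry, 8 reregistered, 2 marked unreachable. The helper name `time_to_reach` and the nearest-rank rounding are hypothetical and come from neither the PR nor the proposal.

```python
def time_to_reach(deltas, base, pct):
    """Seconds until pct% of `base` agents have reregistered, or None if
    that share is never reached."""
    needed = max(1, (pct * base + 99) // 100)  # integer ceiling of pct% of base
    ordered = sorted(deltas)
    return ordered[needed - 1] if needed <= len(ordered) else None


# Hypothetical data: 8 agents reregistered 1..8 seconds after the election,
# 2 more were recovered from the registry but never came back.
deltas = [1, 2, 3, 4, 5, 6, 7, 8]
recovered = 10

# Base = completed reregistrations: every percentile exists.
completed_p100 = time_to_reach(deltas, len(deltas), 100)

# Base = all recovered agents: high percentiles are never reached.
recovered_p75 = time_to_reach(deltas, recovered, 75)
recovered_p90 = time_to_reach(deltas, recovered, 90)
print(completed_p100, recovered_p75, recovered_p90)
```

With the recovered-agents base, the same reregistration timings give different answers depending on how many agents went unreachable, which is the coupling described above.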



[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-12 Thread Yan Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612870#comment-16612870
 ] 

Yan Xu commented on MESOS-9178:
---

So my proposal is that we have the following metrics:

{noformat:title=}
"master/p25_agents_reregistered_secs": 1,
"master/p50_agents_reregistered_secs": 2,
"master/p75_agents_reregistered_secs": 3,
"master/p90_agents_reregistered_secs": 4,
"master/p99_agents_reregistered_secs": 5,
"master/p100_agents_reregistered_secs": 6,
{noformat}

(suggestions for the precise naming and unit are welcome)

Note that each metric only appears once that percentage of agents has 
reregistered, and the metrics persist until the master fails over, at which 
point we start over with none of these metrics. The monitoring systems I have 
worked with all support filling missing values with their previous values, so 
if you plot these I expect them to continuously show the changes in failover 
performance over time.
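
As a rough sketch of the behaviour described here (not the implementation in the review), the tracker below publishes each metric the first time its share of recovered agents has reregistered. The class and method names are hypothetical; only the metric keys are taken from the list above.

```python
class ReregistrationProgress:
    """Tracks when each percentage of recovered agents has reregistered.

    A new instance is created on every master election, which models
    "start over from having none of these metrics".
    """

    THRESHOLDS = (25, 50, 75, 90, 99, 100)

    def __init__(self, recovered_agents, elected_at):
        self.recovered = recovered_agents  # agents recovered from the registry
        self.elected_at = elected_at       # master elected time, in seconds
        self.reregistered = 0
        self.metrics = {}                  # only holds thresholds already reached

    def on_reregistered(self, now):
        self.reregistered += 1
        for pct in self.THRESHOLDS:
            key = f"master/p{pct}_agents_reregistered_secs"
            reached = self.reregistered * 100 >= pct * self.recovered
            if reached and key not in self.metrics:
                # Set once, then left untouched until the next failover.
                self.metrics[key] = now - self.elected_at


progress = ReregistrationProgress(recovered_agents=4, elected_at=0)
for t in (1, 2, 3):
    progress.on_reregistered(now=t)
print(sorted(progress.metrics))
```

After three of four agents the p90, p99 and p100 keys are still absent. They only appear if the fourth agent reregisters, so with this base an agent that never returns leaves them missing.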

I agree that we can publish to the event stream (we currently have AGENT_ADDED 
and AGENT_REMOVED), but for monitoring purposes that shifts the metric 
creation logic to an external entity.

In terms of implementation, given the tools we currently have, I think it works 
best if each metric above is its own timer (I will comment in more detail in 
the review).



[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-11 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610795#comment-16610795
 ] 

James Peach commented on MESOS-9178:


Another way to measure this is to publish it in the event stream.

> Add a metric for master failover time.
> --
>
> Key: MESOS-9178
> URL: https://issues.apache.org/jira/browse/MESOS-9178
> Project: Mesos
> Issue Type: Improvement
> Components: master
> Reporter: Xudong Ni
> Assignee: Xudong Ni
> Priority: Minor
>
> Quote from Yan Xu: Previously the argument against it was that you don't know 
> if all agents are going to come back after a master failover, so there's no 
> certain point that marks the end of "full reregistration of all agents". 
> However, empirically the number of agents usually doesn't change during the 
> failover, and there's an upper bound on the wait (after a 10min timeout the 
> agents that haven't reregistered are marked unreachable), so we can 
> just use that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for" i.e., either reregistered or marked as 
> unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.
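
A minimal sketch of the definition quoted in the description, assuming the 10min timeout it mentions and hypothetical agent IDs and timestamps. The function name and its inputs are illustrative only.

```python
UNREACHABLE_TIMEOUT_SECS = 600.0  # the 10min timeout mentioned in the description


def failover_time(recovered, reregistered_at, elected_at,
                  timeout=UNREACHABLE_TIMEOUT_SECS):
    """Time until every recovered agent is accounted for.

    recovered:       set of agent IDs recovered from the registry
    reregistered_at: dict of agent ID -> reregistration timestamp (seconds)

    An agent is accounted for when it reregisters or when the timeout marks
    it unreachable, so the result never exceeds `timeout`.
    """
    deadline = elected_at + timeout
    on_time = {agent: t for agent, t in reregistered_at.items()
               if agent in recovered and t <= deadline}
    if len(on_time) == len(recovered):
        # Everyone came back: the last reregistration stops the timer.
        return max(on_time.values(), default=elected_at) - elected_at
    # At least one agent was marked unreachable: the timeout stops the timer.
    return timeout


# Hypothetical failover: master elected at t=100, three recovered agents.
all_back = failover_time({"a", "b", "c"}, {"a": 105, "b": 130, "c": 112}, 100)
one_lost = failover_time({"a", "b", "c"}, {"a": 105, "b": 130}, 100)
print(all_back, one_lost)
```

The second call shows why a single missing agent pins this value at the timeout.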





[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-10 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609911#comment-16609911
 ] 

James Peach commented on MESOS-9178:


Say you have a time-series gauge at various percentages as per [~bmahler]'s 
suggestion. The gauge value would have to persist, so once it is set, it would 
remain at that value thereafter. If you needed to do analytics, you would need to 
carefully choose the first sample after a failover. For time-series, the 
easiest thing to do is to plot it, and it's not at all clear to me how you 
could do that and show a meaningful graph because what you really want is to 
compare the historical failover times. I'm not that experienced with Grafana 
but I don't know how I would do that.



[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-07 Thread Xudong Ni (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607612#comment-16607612
 ] 

Xudong Ni commented on MESOS-9178:
--

Per Benjamin's suggestion, the metrics will record the progress of agent 
reregistration after a master failover: count, max, min, p50, p90, p95, p99, 
etc. We can look at the metrics from different perspectives, and I think this 
metric is valuable.

The first, as Yan commented, is probably the main usage: since the number of 
agents usually won't change, we can compare the metrics before and after a 
failover, for example by comparing p95. Second, we can look at the distribution 
itself, e.g. whether there is a long tail. We can even make relative comparisons 
across different numbers of agents using the count and the percentiles.



[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-09-07 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607548#comment-16607548
 ] 

James Peach commented on MESOS-9178:


Are we convinced that a metric is the right approach? This seems like something 
that you might want to compare over long time periods, which might be better 
suited to doing analytics on logs.



[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-08-22 Thread Yan Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589463#comment-16589463
 ] 

Yan Xu commented on MESOS-9178:
---

+1. Yup that's the approach we talked about. Sorry the JIRA didn't mention it.



[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-08-22 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589375#comment-16589375
 ] 

Benjamin Mahler commented on MESOS-9178:


Such a metric would be rather brittle: you only need 1 agent to be unable to 
re-register after a master failover for it to be useless. I would love to see 
some alternatives explored here, e.g.

We could have some progress-oriented metrics:
* Time taken for the failed-over master to register (25%, 50%, 75%, 90%, 99%, 
100%) of agents. The metric described in this ticket would be the 100% case, 
but most users will probably monitor a lower percentage.



[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

2018-08-22 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589159#comment-16589159
 ] 

James Peach commented on MESOS-9178:


/cc [~bmahler]
