Re: Metrics collection affected when libprocess queue builds up

2017-01-06 Thread Benjamin Mahler
Yep, thanks! For https://issues.apache.org/jira/browse/MESOS-6872 it sounds like you're referring to the help information? We already list the timeout but perhaps we need an example section in our help pages. http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/ Or are you refe

Re: Metrics collection affected when libprocess queue builds up

2017-01-06 Thread Zhitao Li
Hi Benjamin, I've filed MESOS-6872 and MESOS-6873 for doc and gauge change, and will fix them. Can you shepherd these? I'll do another pass of other gauge usage in allocator to see whether ther

Re: Metrics collection affected when libprocess queue builds up

2017-01-04 Thread Benjamin Mahler
A patch to update the documentation with a NOTE about this would be great. It excludes all metrics that were not available within the timeout; there is no indication within a particular result whether any timed out and were excluded. My feeling is that taking the difference between enqueued and de
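
Since a snapshot gives no indication of which metrics were excluded, one workaround is for the collector to compare each snapshot against the keys it expects. The following is a minimal sketch, assuming the snapshot has already been parsed into a flat dict of metric name to value; `missing_metrics` and the expected-key list are hypothetical names for illustration, not part of Mesos.

```python
def missing_metrics(expected_keys, snapshot):
    """Return the expected metric names that are absent from a snapshot.

    `snapshot` is assumed to be the parsed JSON object (a flat dict).
    An absent key may mean the metric timed out, but it may also mean
    the metric does not exist on this master, so treat it as a hint.
    """
    return sorted(key for key in expected_keys if key not in snapshot)


# Illustrative data only: the expected list is chosen by the operator.
expected = [
    "allocator/mesos/event_queue_dispatches",
    "master/uptime_secs",
]
snapshot = {"master/uptime_secs": 12.0}

absent = missing_metrics(expected, snapshot)
```

A monitoring script could alert when `absent` is non-empty for several consecutive polls, which would suggest that the owning component is slow to respond.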

Re: Metrics collection affected when libprocess queue builds up

2016-12-30 Thread Zhitao Li
Hi Benjamin, Thanks for the response. This is the first time I've heard of the `timeout` parameter. I'll fix our monitoring scripts to always specify it. One question on timeout: does it simply drop any metric callback which is not collected within the timeout? Does the caller know which metrics are dropped due to

Re: Metrics collection affected when libprocess queue builds up

2016-12-27 Thread Benjamin Mahler
The /metrics endpoint exposes a timeout parameter if you want to receive a response with all of the metrics that were available within the timeout, e.g. /metrics/snapshot.json?timeout=10secs I'd recommend using this when collecting metrics so that you can maintain visibility when a particular comp
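
As a rough illustration of the suggestion above, a collector could always pass the `timeout` parameter when polling. This is a minimal sketch, not an official client: the helper names, the master address, and the client-side `http_timeout` value are assumptions; only the endpoint path and the `timeout=10secs` form come from the message.

```python
import json
import urllib.parse
import urllib.request


def snapshot_url(base_url, timeout="10secs"):
    """Build the snapshot URL with the server-side `timeout` parameter."""
    query = urllib.parse.urlencode({"timeout": timeout})
    return f"{base_url.rstrip('/')}/metrics/snapshot?{query}"


def fetch_snapshot(base_url, timeout="10secs", http_timeout=15.0):
    """Fetch and parse a metrics snapshot.

    `http_timeout` (seconds) is a client-side socket timeout and is an
    assumption here; it is set a little above the server-side timeout
    so the partial response has a chance to arrive.
    """
    url = snapshot_url(base_url, timeout)
    with urllib.request.urlopen(url, timeout=http_timeout) as resp:
        return json.load(resp)


# Hypothetical master address, used only to show the resulting URL.
url = snapshot_url("http://master.example.com:5050/")
```

Calling `fetch_snapshot("http://master.example.com:5050")` against a reachable master would return the metrics that were available within the timeout as a dict.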

Re: Metrics collection affected when libprocess queue builds up

2016-12-19 Thread Zameer Manji
I believe Zhitao is referring to `/metrics/snapshot` returning a result after 10-30 seconds. I think in a typical environment, this will cause most metrics collection tooling to time out. This leaves the operator without any visibility into the system, making debugging/fighting the problem very

Re: Metrics collection affected when libprocess queue builds up

2016-12-19 Thread haosdent
Hi, @zhitao > the `/metrics/snapshot` could take 10-30 seconds to respond. Do you mean `/metrics/snapshot` returns a result after 10~30 seconds? Or that `/metrics/snapshot` takes 10~30 seconds to reflect the change of the `allocator/mesos/event_queue_dispatches` gauge? On Mon, Dec 19, 2016 at 1:11 PM, Z

Metrics collection affected when libprocess queue builds up

2016-12-18 Thread Zhitao Li
Hi all, While I was debugging an allocator message queue build-up issue on master (which I plan to share in another thread), I noticed that `/metrics/snapshot` is also badly affected. For example, when the allocator queue has ~3k dispatches in it (revealed by the allocator/mesos/event_queue_dispatch