Hello list. I’m improving a metric collector for a haproxy cluster and want to
confirm whether my findings and the statements below are correct. My main goal
with these metrics is to know how far from exhaustion my haproxy cluster is.

1. Source of the metric:

I’m parsing `show info` from the admin socket and collecting Idle_pct. I’m
collecting this metric every 500ms because this is the exact time between
Idle_pct updates. If I collected more often it would just waste haproxy’s time
for nothing; if I collected less often I’d lose information, leading to an
imprecise result. Out of curiosity: I’m converting Idle_pct back to time in
order to store the amount of processing time in a time series db. This allows
me to view the metric at distinct resolutions and easily convert it back to an
idle/busy pct using `rate()`.
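
For reference, a minimal sketch of the collection loop in Python (the socket
path, interval handling and counter name are placeholders, not the exact code
I run):

#!/usr/bin/env python3
# Poll Idle_pct over the admin socket every 500ms and convert it back to
# the amount of time haproxy spent busy during that interval.
import socket
import time

SOCKET_PATH = "/var/run/haproxy.sock"  # placeholder admin socket path
INTERVAL = 0.5                         # 500ms between collections

def read_idle_pct():
    """Send `show info` and return the Idle_pct value (0-100)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SOCKET_PATH)
        s.sendall(b"show info\n")
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    for line in data.decode().splitlines():
        if line.startswith("Idle_pct:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("Idle_pct not found in `show info` output")

busy_seconds_total = 0.0  # cumulative counter, suitable for rate()

while True:
    idle = read_idle_pct()
    # 500ms window * busy fraction = busy time accumulated in this window.
    busy_seconds_total += (100 - idle) / 100.0 * INTERVAL
    print(f"idle={idle}% busy_seconds_total={busy_seconds_total:.3f}")
    time.sleep(INTERVAL)

Storing the cumulative busy time means that, for example in Prometheus, a
query like rate(haproxy_busy_seconds_total[1m]) (metric name made up here)
gives the busy fraction back at whatever resolution I want.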

2. Meaning of Idle_pct in a multi-threaded deployment:

If I understood the code correctly, whenever haproxy processes my `show info`
command, a distinct thread might be responsible for it. Every thread has its
own Idle_pct calculation and these are not shared; I cannot, for example,
query the mean Idle_pct across all running threads. On the one hand I’m always
looking at a fraction of the haproxy process on every collection; on the other
hand the workload should be distributed evenly between all threads, so the
metric should be almost the same, with a negligible difference between them.
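
As an illustration of what I mean by "almost the same", I could smooth the
per-thread variance by averaging a short rolling window of samples, assuming
successive admin connections may land on different threads (read_idle_pct()
is the helper from the sketch above):

from collections import deque

WINDOW = 8                     # number of recent samples to average
samples = deque(maxlen=WINDOW)

def smoothed_idle_pct():
    """Mean Idle_pct over the last few samples, likely from several threads."""
    samples.append(read_idle_pct())
    return sum(samples) / len(samples)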

3. Meaning of the latency collecting `show info`

I’m also using the latency of collecting the metric to measure how fast
haproxy is doing its job. If the mean time to get a response to `show info` is
growing, the same should be happening with client requests and server
responses as well, because neither the TCP sockets used for client/server
traffic nor the unix socket for admin queries has a shortcut: all of them are
processed in the same queue, in the same order the kernel received them.
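
Concretely, I’m thinking of it as the round-trip time of the query itself,
something along these lines (again reusing the placeholder SOCKET_PATH from
the first sketch):

import socket
import time

def show_info_latency():
    """Seconds between connecting and receiving the full `show info` reply."""
    start = time.monotonic()
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(SOCKET_PATH)
        s.sendall(b"show info\n")
        while s.recv(4096):   # drain until haproxy closes the connection
            pass
    return time.monotonic() - start

print(f"show info latency: {show_info_latency() * 1000:.1f} ms")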

Are my statements correct? Is there anything else I could look at in order to
improve my metrics?
