Hello list. I'm improving a metric collector for an HAProxy cluster and want to confirm whether my findings and the statements below are correct. My main goal with these metrics is to know how far my HAProxy cluster is from exhaustion.
1. Source of the metric: I'm parsing `show info` from the admin socket and collecting Idle_pct. I collect this metric every 500ms because that is the exact interval between Idle_pct updates: collecting more often would just waste HAProxy's time for nothing, while collecting less often would lose information and lead to an imprecise result. Out of curiosity: I'm converting Idle_pct back into time so I can store the amount of processing time in a time-series database. This lets me query the metric at different resolutions and easily convert it back to an idle/busy percentage using `rate()`.

2. Meaning of Idle_pct in a multi-threaded deployment: If I understood the code correctly, whenever HAProxy processes my `show info` command, a different thread may be responsible for it. Each thread keeps its own Idle_pct calculation and they are not shared; I cannot, for example, query the mean Idle_pct across all running threads. So on one hand I'm always looking at a fraction of the HAProxy process on every collection; on the other hand, the workload should be distributed evenly across all threads, so the metric should be almost the same for each of them, with only a negligible difference between them.

3. Meaning of the latency of collecting `show info`: I'm also using the latency of each collection to measure how fast HAProxy is doing its job. If the mean time to get a response to `show info` is growing, the same should be happening to client requests and server responses as well, because neither the TCP sockets used for client/server traffic nor the unix socket for admin queries has a shortcut: all of them are processed in the same queue, in the order the kernel received them.

Are my statements correct? Is there anything else I could look at in order to improve my metrics?
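For reference, here is a minimal sketch of the collection and conversion described in point 1, in Python. The socket path is a hypothetical placeholder, and the parsing/conversion helpers are my own illustrative names, not part of any HAProxy library; the sketch assumes Idle_pct is reported as an integer percentage, as in `show info` output.

```python
import socket
import time

# Hypothetical path; adjust to your haproxy stats socket configuration.
ADMIN_SOCKET = "/var/run/haproxy.sock"

def parse_idle_pct(info_text):
    """Extract the Idle_pct value from raw `show info` output, or None if absent."""
    for line in info_text.splitlines():
        if line.startswith("Idle_pct:"):
            return int(line.split(":", 1)[1].strip())
    return None

def busy_time_ms(idle_pct, interval_ms=500):
    """Convert an idle percentage into busy time (ms) within the sampling window,
    so the value can be stored in a time-series DB and later turned back into a
    percentage with rate()."""
    return (100 - idle_pct) * interval_ms / 100.0

def collect(path=ADMIN_SOCKET):
    """Query `show info` over the admin unix socket, returning (idle_pct, latency_s).
    The wall time of the round trip doubles as the latency metric from point 3."""
    start = time.monotonic()
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(path)
        s.sendall(b"show info\n")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    latency = time.monotonic() - start
    return parse_idle_pct(b"".join(chunks).decode()), latency
```

A collector loop would call `collect()` every 500ms, store `busy_time_ms(idle_pct)` as a counter-like sample, and record `latency` as a separate series.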