[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user gtenev commented on the issue: https://github.com/apache/trafficserver/pull/996 @zwoop, it is ready to land now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user atsci commented on the issue: https://github.com/apache/trafficserver/pull/996 Linux build *successful*! See https://ci.trafficserver.apache.org/job/Github-Linux/1021/ for details.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user atsci commented on the issue: https://github.com/apache/trafficserver/pull/996 FreeBSD build *successful*! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/1127/ for details.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user atsci commented on the issue: https://github.com/apache/trafficserver/pull/996 Linux build *successful*! See https://ci.trafficserver.apache.org/job/Github-Linux/1020/ for details.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user atsci commented on the issue: https://github.com/apache/trafficserver/pull/996 FreeBSD build *successful*! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/1126/ for details.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user atsci commented on the issue: https://github.com/apache/trafficserver/pull/996 Linux build *successful*! See https://ci.trafficserver.apache.org/job/Github-Linux/1019/ for details.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user atsci commented on the issue: https://github.com/apache/trafficserver/pull/996 FreeBSD build *successful*! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/1124/ for details.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user gtenev commented on the issue: https://github.com/apache/trafficserver/pull/996 @jpeach added docs, @SolidWallOfCode made the requested change.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user SolidWallOfCode commented on the issue: https://github.com/apache/trafficserver/pull/996 @gtenev Just do the tweaks and we'll commit this.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user SolidWallOfCode commented on the issue: https://github.com/apache/trafficserver/pull/996 Are the casts on the state indices required?
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user gtenev commented on the issue: https://github.com/apache/trafficserver/pull/996 @zwoop, I have not heard any objections for a while, so unless @jpeach and @SolidWallOfCode have any concerns with the latest patch, I think we can land it.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user zwoop commented on the issue: https://github.com/apache/trafficserver/pull/996 @gtenev This is ready to land now, right?
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user gtenev commented on the issue: https://github.com/apache/trafficserver/pull/996

@jpeach and @SolidWallOfCode, I really appreciate your feedback! Here is the new patch.

Removed `_count`. I had added it because internally the "bad disk metric" and "disk errors" both had counter semantics, and there are already metrics containing `_count`.

Renamed `disk` to `span`, which is indeed more accurate.

I like the idea of exposing "online", "offline", etc. The new patch exposes the following metrics as gauges:

```
proxy.process.cache.span.failing
proxy.process.cache.span.offline
proxy.process.cache.span.online
```

A span "moves" from the "online" bucket (`errors == 0`) to "failing" (`errors > 0 && errors < proxy.config.cache.max_disk_errors`) to "offline" (`errors >= proxy.config.cache.max_disk_errors`). Please note that "failing" + "offline" + "online" = total number of spans.

It was possible to split the read and write metrics, so I removed `proxy.process.cache.disk.errors` in favor of:

```
proxy.process.cache.span.errors.read
proxy.process.cache.span.errors.write
```

Removed the "per-volume" stats. @SolidWallOfCode, incrementing the metrics for all volumes on each failure does not make sense either. I would need to add more code to increment the metrics for the right volume, and I am not sure it is worth the effort (and maintenance). I think the general idea of this change is to give the operator a signal that the disks need to be inspected; the operator would then diagnose them with more sophisticated, lower-level disk utilities.

I noticed that the metrics would not get updated properly (they would become inconsistent) if there were failures during cache initialization (before "cache enabled"), and I tried to fix this wherever I noticed it and was able to test / reproduce it.
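For readers following the thread, the bucket rules in the comment above can be sketched in a few lines. This is only an illustrative Python sketch, not the code from the patch: `classify_span` and `span_gauges` are hypothetical helper names, and the per-span error counts and the threshold of 5 are made-up sample values. Only the three bucket names and the comparison against `proxy.config.cache.max_disk_errors` come from the comment.

```python
# Illustrative sketch only; the actual patch is C++ inside Traffic Server.
# classify_span and span_gauges are hypothetical names, not ATS APIs.

def classify_span(errors: int, max_disk_errors: int) -> str:
    """Map a span's error count to the bucket described in the comment."""
    if errors == 0:
        return "online"
    if errors < max_disk_errors:
        return "failing"
    return "offline"  # errors >= max_disk_errors


def span_gauges(error_counts, max_disk_errors):
    """Count how many spans currently sit in each bucket (gauge semantics)."""
    gauges = {"online": 0, "failing": 0, "offline": 0}
    for errors in error_counts:
        gauges[classify_span(errors, max_disk_errors)] += 1
    return gauges


# Sample data (assumed): five spans with these error counts, threshold of 5.
sample_errors = [0, 0, 2, 5, 7]
gauges = span_gauges(sample_errors, max_disk_errors=5)

# The invariant from the comment: the three gauges add up to the span count.
assert sum(gauges.values()) == len(sample_errors)
print(gauges)
```

Because each span lands in exactly one bucket, the three values always sum to the total number of spans, which is the invariant noted in the comment. The sketch does not model a threshold of 0 or the initialization-time failures mentioned at the end of the comment.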
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user atsci commented on the issue: https://github.com/apache/trafficserver/pull/996 Linux build *successful*! See https://ci.trafficserver.apache.org/job/Github-Linux/725/ for details.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user atsci commented on the issue: https://github.com/apache/trafficserver/pull/996 FreeBSD build *successful*! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/829/ for details.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user jpeach commented on the issue: https://github.com/apache/trafficserver/pull/996 @gtenev See my comments here and in the bug. I think that these new metrics should follow the existing conventions of the cache system.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user jpeach commented on the issue: https://github.com/apache/trafficserver/pull/996

My comments from [TS-4834](https://issues.apache.org/jira/browse/TS-4834):

> I think the nomenclature should mirror the existing metrics which typically don't include "count" in the names. Especially since the count of bad disks (gauge) does not have the same semantics as the count of disk errors (counter). Looking at the code I think the bad disk metric is a counter not a gauge.

> Is "disk" the right nomenclature, or would "span" be more accurate? How feasible is it to have separate metrics for read and write errors? Does it make sense to also have metrics for disk read and write operations, or are system tools sufficient for that? If you have a count of bad disks, does it make sense to have other disk counts? ie. good disks, offline disks, etc.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user atsci commented on the issue: https://github.com/apache/trafficserver/pull/996 Linux build *successful*! See https://ci.trafficserver.apache.org/job/Github-Linux/651/ for details.
[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures
Github user atsci commented on the issue: https://github.com/apache/trafficserver/pull/996 FreeBSD build *successful*! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/755/ for details.