[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-31 Thread gtenev
Github user gtenev commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
@zwoop, it is ready to land now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-28 Thread atsci
Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
Linux build *successful*! See 
https://ci.trafficserver.apache.org/job/Github-Linux/1021/ for details.
 





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-28 Thread atsci
Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
FreeBSD build *successful*! See 
https://ci.trafficserver.apache.org/job/Github-FreeBSD/1127/ for details.
 





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-28 Thread atsci
Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
Linux build *successful*! See 
https://ci.trafficserver.apache.org/job/Github-Linux/1020/ for details.
 





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-28 Thread atsci
Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
FreeBSD build *successful*! See 
https://ci.trafficserver.apache.org/job/Github-FreeBSD/1126/ for details.
 





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-28 Thread atsci
Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
Linux build *successful*! See 
https://ci.trafficserver.apache.org/job/Github-Linux/1019/ for details.
 





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-28 Thread atsci
Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
FreeBSD build *successful*! See 
https://ci.trafficserver.apache.org/job/Github-FreeBSD/1124/ for details.
 





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-28 Thread gtenev
Github user gtenev commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
@jpeach, I added docs. @SolidWallOfCode, I made the requested change.





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-28 Thread SolidWallOfCode
Github user SolidWallOfCode commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
@gtenev Just do the tweaks and we'll commit this.




[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-28 Thread SolidWallOfCode
Github user SolidWallOfCode commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
Are the casts on the state indices required?




[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-28 Thread gtenev
Github user gtenev commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
@zwoop, I have not heard any objections for a while, so unless @jpeach and 
@SolidWallOfCode have any concerns with the latest patch, I think we can land it.




[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-10-27 Thread zwoop
Github user zwoop commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
@gtenev This is ready to land now, right?




[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-09-17 Thread gtenev
Github user gtenev commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
@jpeach and @SolidWallOfCode, really appreciate your feedback! Here is the 
new patch.

Removed `_count`. The reason I added it was that internally the "bad disk 
metric" and "disk errors" both had counter semantics, and there are already 
metrics containing `_count`.

Renamed `disk` to `span`; this does seem more accurate.

I like the idea of exposing "online", "offline", etc. The new patch exposes 
the following metrics as gauges: 
```
proxy.process.cache.span.failing
proxy.process.cache.span.offline
proxy.process.cache.span.online
```
A span "moves" from the "online" bucket (`errors == 0`) to "failing" 
(`errors > 0 && errors < proxy.config.cache.max_disk_errors`) to "offline" 
(`errors >= proxy.config.cache.max_disk_errors`).

Please note that "failing" + "offline" + "online" = total number of spans.
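
To make the bucket rule concrete, here is a minimal Python sketch of the thresholds described above. It is an illustration only and not code from the patch: `SpanState`, `classify_span` and `span_gauges` are hypothetical names, and the per-span error counts are assumed to be available as plain non-negative integers.

```python
from collections import Counter
from enum import Enum


class SpanState(Enum):
    ONLINE = "online"
    FAILING = "failing"
    OFFLINE = "offline"


def classify_span(errors: int, max_disk_errors: int) -> SpanState:
    """Map one span's error count to a bucket (hypothetical helper)."""
    if errors == 0:
        return SpanState.ONLINE
    if errors < max_disk_errors:
        return SpanState.FAILING
    return SpanState.OFFLINE


def span_gauges(error_counts, max_disk_errors):
    """Return the three gauge values for a list of per-span error counts."""
    buckets = Counter(classify_span(e, max_disk_errors) for e in error_counts)
    # Every state is reported, even when no span is currently in it.
    return {state.value: buckets[state] for state in SpanState}


gauges = span_gauges([0, 0, 2, 5, 9], max_disk_errors=5)
# Each span lands in exactly one bucket, so the gauges sum to the span count.
assert sum(gauges.values()) == 5
```

Because every span is assigned to exactly one bucket, the three values always add up to the total number of spans.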

It was possible to split the read and write metrics, so I removed 
`proxy.process.cache.disk.errors` in favor of
```
proxy.process.cache.span.errors.read
proxy.process.cache.span.errors.write
```

Removed the "per-volume" stats. @SolidWallOfCode, incrementing the metrics 
for all volumes on each failure does not make sense either. I would need to add 
more code to be able to increment the metrics for the right volume, and I am not 
sure it is worth the effort (and maintenance).

In general, I think the idea of this change is to give the operator a signal 
that the disks need to be inspected; the operator would then diagnose them 
with more sophisticated, lower-level disk utilities.
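
As a rough sketch of how an operator-side check might use that signal, the Python below compares two snapshots of the metrics named above. It is an assumption-laden illustration: `disk_warnings` is a hypothetical helper, and the snapshots are assumed to have been collected already (by whatever means) into plain dicts of metric name to integer value.

```python
PREFIX = "proxy.process.cache.span."


def disk_warnings(prev: dict, curr: dict) -> list:
    """Return warnings from two metric snapshots (hypothetical helper)."""
    warnings = []
    # Gauges: any span outside the "online" bucket deserves a look.
    for bucket in ("failing", "offline"):
        count = curr.get(PREFIX + bucket, 0)
        if count > 0:
            warnings.append(f"{count} span(s) {bucket}")
    # Counters: only the increase since the previous snapshot matters.
    for op in ("read", "write"):
        name = PREFIX + "errors." + op
        delta = curr.get(name, 0) - prev.get(name, 0)
        if delta > 0:
            warnings.append(f"{delta} new {op} error(s)")
    return warnings


previous = {PREFIX + "errors.read": 1}
current = {
    PREFIX + "online": 3,
    PREFIX + "failing": 1,
    PREFIX + "offline": 0,
    PREFIX + "errors.read": 4,
    PREFIX + "errors.write": 0,
}
for line in disk_warnings(previous, current):
    print(line)
```

The check only says that something needs inspecting; which disk is affected and why would still be a job for lower-level tools.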

I noticed that the metrics would not be updated properly (they would become 
inconsistent) if there were failures during cache initialization (before 
"cache enabled"), and I tried to fix this wherever I noticed it and was able 
to test / reproduce it.





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-09-17 Thread atsci
Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
Linux build *successful*! See 
https://ci.trafficserver.apache.org/job/Github-Linux/725/ for details.
 





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-09-17 Thread atsci
Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
FreeBSD build *successful*! See 
https://ci.trafficserver.apache.org/job/Github-FreeBSD/829/ for details.
 





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-09-12 Thread jpeach
Github user jpeach commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
@gtenev See my comments here and in the bug. I think that these new metrics 
should follow the existing conventions of the cache system.




[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-09-09 Thread jpeach
Github user jpeach commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
My comments from [TS-4834](https://issues.apache.org/jira/browse/TS-4834):

>I think the nomenclature should mirror the existing metrics which 
typically don't include "count" in the names. Especially since the count of bad 
disks (gauge) does not have the same semantics as the count of disk errors 
(counter).

Looking at the code, I think the bad disk metric is a counter, not a gauge.

> Is "disk" the right nomenclature, or would "span" be more accurate?
How feasible is it to have separate metrics for read and write errors? Does 
it make sense to also have metrics for disk read and write operations, or are 
system tools sufficient for that?
If you have a count of bad disks, does it make sense to have other disk 
counts, i.e. good disks, offline disks, etc.?




[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-09-08 Thread atsci
Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
Linux build *successful*! See 
https://ci.trafficserver.apache.org/job/Github-Linux/651/ for details.
 





[GitHub] trafficserver issue #996: TS-4834 Expose bad disk and disk access failures

2016-09-08 Thread atsci
Github user atsci commented on the issue:

https://github.com/apache/trafficserver/pull/996
  
FreeBSD build *successful*! See 
https://ci.trafficserver.apache.org/job/Github-FreeBSD/755/ for details.
 


