https://bugzilla.wikimedia.org/show_bug.cgi?id=71260
Bug ID: 71260
Summary: Speed up health check.
Product: OCG
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: General/Unknown
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected]
Web browser: ---
Mobile Platform: ---
icinga warning:
icinga-wm: PROBLEM - LVS HTTP IPv4 on ocg.svc.eqiad.wmnet is CRITICAL: CRITICAL
- Socket timeout after 10 seconds
The port is responsive interactively; it looks like the timeout is just a bit
too short for what the health check is trying to do.
In particular, the health check does a `du -s` of several cache directories,
one of which is >6G now. (The icinga limit for that directory is 40G.) The
latest deploy included a change which partially serialized that `du`
(Ie359701c6972cd49786ffde1e8be1cb64d356fa2), which might be why we have
recently started running right up against the timeout.
We should improve the speed of the health check. Probably the best way to do
this is to cache the sizes of the directories or do the `du` step less
frequently. Alternatively, we could add a `quick` check that skips the cache
size step. Re-adding some of the parallelism to the `du` might help somewhat,
but probably not enough to cover us when the cache directory climbs nearer
its 40G limit.
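
For illustration, the "cache the sizes" option could look roughly like the
sketch below. It is written in Python for brevity and is not taken from the
OCG code; the class name `CachedDirSize`, the `ttl` parameter and the
injectable `clock` are all hypothetical. It also sums apparent file sizes
rather than disk blocks, so its numbers will not exactly match `du -s`.

```python
import os
import time


class CachedDirSize:
    """Remember a directory's total size for `ttl` seconds (hypothetical helper)."""

    def __init__(self, path, ttl=300.0, clock=time.monotonic):
        self.path = path
        self.ttl = ttl
        self._clock = clock    # injectable so the expiry logic can be tested
        self._value = None     # last measured size in bytes
        self._stamp = None     # clock reading at the last scan
        self.scans = 0         # number of real directory walks performed

    def _scan(self):
        # Sum apparent file sizes; unlike `du -s` this does not count blocks.
        total = 0
        for root, _dirs, files in os.walk(self.path):
            for name in files:
                try:
                    total += os.lstat(os.path.join(root, name)).st_size
                except OSError:
                    pass  # file disappeared between listing and stat
        return total

    def get(self):
        # Walk the tree only when there is no cached value or it has expired.
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self.ttl:
            self._value = self._scan()
            self._stamp = now
            self.scans += 1
        return self._value
```

With something like this, the health check would call `get()` on every
request and only pay for the directory walk once per `ttl` interval. The
trade-off is that the reported size can be stale by up to `ttl` seconds,
which seems acceptable when the alert threshold is 40G.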