| Imarlier added a comment. |
I've been running this in a tmux session on a few of the wdqs servers: while :; do DSTAMP=$(date); CW=$(sudo netstat -anet | grep 208.80.154.224 | grep -c CLOSE_WAIT); echo "${DSTAMP}: ${CW}"; sleep 1; done >> ~/close_waits.txt. (154.224 is the edge for the text cache cluster.)
It's a very small sample at this point, but there are a few things that make me think that this is the issue, or one of the issues:
wdqs1006 had three errors thrown at 2018-12-03T20:09:56. This is the close_wait count around that time:
Mon Dec 3 20:09:30 UTC 2018: 0 Mon Dec 3 20:09:31 UTC 2018: 0 Mon Dec 3 20:09:32 UTC 2018: 0 Mon Dec 3 20:09:33 UTC 2018: 0 Mon Dec 3 20:09:34 UTC 2018: 0 Mon Dec 3 20:09:35 UTC 2018: 0 Mon Dec 3 20:09:36 UTC 2018: 0 Mon Dec 3 20:09:37 UTC 2018: 0 Mon Dec 3 20:09:38 UTC 2018: 10 Mon Dec 3 20:09:39 UTC 2018: 10 Mon Dec 3 20:09:40 UTC 2018: 10 Mon Dec 3 20:09:41 UTC 2018: 10 Mon Dec 3 20:09:42 UTC 2018: 10 Mon Dec 3 20:09:43 UTC 2018: 10 Mon Dec 3 20:09:44 UTC 2018: 10 Mon Dec 3 20:09:45 UTC 2018: 10 Mon Dec 3 20:09:46 UTC 2018: 10 Mon Dec 3 20:09:47 UTC 2018: 10 Mon Dec 3 20:09:48 UTC 2018: 10 Mon Dec 3 20:09:49 UTC 2018: 10 Mon Dec 3 20:09:50 UTC 2018: 10 Mon Dec 3 20:09:51 UTC 2018: 10 Mon Dec 3 20:09:52 UTC 2018: 10 Mon Dec 3 20:09:53 UTC 2018: 10 Mon Dec 3 20:09:54 UTC 2018: 10 Mon Dec 3 20:09:55 UTC 2018: 10 Mon Dec 3 20:09:57 UTC 2018: 0 Mon Dec 3 20:09:58 UTC 2018: 0 Mon Dec 3 20:09:59 UTC 2018: 0 Mon Dec 3 20:10:00 UTC 2018: 0
wdqs1004 had three exceptions thrown at 20:08:23. This is the close_wait count around that time:
Mon Dec 3 20:08:10 UTC 2018: 0 Mon Dec 3 20:08:11 UTC 2018: 0 Mon Dec 3 20:08:12 UTC 2018: 0 Mon Dec 3 20:08:13 UTC 2018: 0 Mon Dec 3 20:08:14 UTC 2018: 0 Mon Dec 3 20:08:15 UTC 2018: 10 Mon Dec 3 20:08:16 UTC 2018: 10 Mon Dec 3 20:08:17 UTC 2018: 10 Mon Dec 3 20:08:18 UTC 2018: 10 Mon Dec 3 20:08:19 UTC 2018: 10 Mon Dec 3 20:08:20 UTC 2018: 10 Mon Dec 3 20:08:21 UTC 2018: 10 Mon Dec 3 20:08:22 UTC 2018: 10 Mon Dec 3 20:08:23 UTC 2018: 0 Mon Dec 3 20:08:24 UTC 2018: 0 Mon Dec 3 20:08:25 UTC 2018: 0 Mon Dec 3 20:08:26 UTC 2018: 0 Mon Dec 3 20:08:27 UTC 2018: 0 Mon Dec 3 20:08:28 UTC 2018: 0 Mon Dec 3 20:08:29 UTC 2018: 0wdqs1005 had three exceptions thrown at 20:07:51. This is the close_wait count around that time:
Mon Dec 3 20:07:20 UTC 2018: 0 Mon Dec 3 20:07:21 UTC 2018: 0 Mon Dec 3 20:07:22 UTC 2018: 10 Mon Dec 3 20:07:23 UTC 2018: 10 Mon Dec 3 20:07:24 UTC 2018: 10 Mon Dec 3 20:07:25 UTC 2018: 10 Mon Dec 3 20:07:26 UTC 2018: 10 Mon Dec 3 20:07:27 UTC 2018: 10 Mon Dec 3 20:07:28 UTC 2018: 10 Mon Dec 3 20:07:29 UTC 2018: 10 Mon Dec 3 20:07:30 UTC 2018: 10 Mon Dec 3 20:07:31 UTC 2018: 10 Mon Dec 3 20:07:32 UTC 2018: 10 Mon Dec 3 20:07:33 UTC 2018: 10 Mon Dec 3 20:07:35 UTC 2018: 10 Mon Dec 3 20:07:36 UTC 2018: 10 Mon Dec 3 20:07:37 UTC 2018: 10 Mon Dec 3 20:07:38 UTC 2018: 10 Mon Dec 3 20:07:39 UTC 2018: 10 Mon Dec 3 20:07:40 UTC 2018: 10 Mon Dec 3 20:07:41 UTC 2018: 10 Mon Dec 3 20:07:42 UTC 2018: 10 Mon Dec 3 20:07:43 UTC 2018: 10 Mon Dec 3 20:07:44 UTC 2018: 10 Mon Dec 3 20:07:45 UTC 2018: 10 Mon Dec 3 20:07:46 UTC 2018: 10 Mon Dec 3 20:07:47 UTC 2018: 10 Mon Dec 3 20:07:48 UTC 2018: 10 Mon Dec 3 20:07:49 UTC 2018: 10 Mon Dec 3 20:07:50 UTC 2018: 10 Mon Dec 3 20:07:51 UTC 2018: 0 Mon Dec 3 20:07:52 UTC 2018: 0 Mon Dec 3 20:07:53 UTC 2018: 0 Mon Dec 3 20:07:54 UTC 2018: 0 Mon Dec 3 20:07:55 UTC 2018: 0 Mon Dec 3 20:07:56 UTC 2018: 0 Mon Dec 3 20:07:57 UTC 2018: 0 Mon Dec 3 20:07:58 UTC 2018: 0 Mon Dec 3 20:07:59 UTC 2018: 0wdqs1006 at 20:07:06:
Mon Dec 3 20:06:46 UTC 2018: 0 Mon Dec 3 20:06:47 UTC 2018: 0 Mon Dec 3 20:06:48 UTC 2018: 9 Mon Dec 3 20:06:49 UTC 2018: 9 Mon Dec 3 20:06:50 UTC 2018: 9 Mon Dec 3 20:06:51 UTC 2018: 9 Mon Dec 3 20:06:52 UTC 2018: 9 Mon Dec 3 20:06:53 UTC 2018: 9 Mon Dec 3 20:06:54 UTC 2018: 9 Mon Dec 3 20:06:55 UTC 2018: 9 Mon Dec 3 20:06:56 UTC 2018: 9 Mon Dec 3 20:06:57 UTC 2018: 9 Mon Dec 3 20:06:58 UTC 2018: 9 Mon Dec 3 20:07:00 UTC 2018: 9 Mon Dec 3 20:07:01 UTC 2018: 9 Mon Dec 3 20:07:02 UTC 2018: 9 Mon Dec 3 20:07:03 UTC 2018: 9 Mon Dec 3 20:07:04 UTC 2018: 9 Mon Dec 3 20:07:05 UTC 2018: 9 Mon Dec 3 20:07:06 UTC 2018: 0 Mon Dec 3 20:07:07 UTC 2018: 0 Mon Dec 3 20:07:08 UTC 2018: 0 Mon Dec 3 20:07:09 UTC 2018: 0 Mon Dec 3 20:07:10 UTC 2018: 0 Mon Dec 3 20:07:11 UTC 2018: 0The pattern is consistent enough that I feel pretty comfortable concluding that this is the cause.
https://jira.apache.org/jira/browse/HTTPCLIENT-1918 has a bunch of suggestions around this. It also notes that HttpClient has connection management logging that can be used to detect cases where connections aren't being released back to the pool correctly.
Cc: BBlack, ema, Gehel, Aklapper, Smalyshev, Legado_Shulgin, Nandana, thifranc, AndyTan, Davinaclare77, Qtn1293, Imarlier, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, Th3d3v1ls, Hfbn0, QZanden, EBjune, merbst, LawExplorer, Vali.matei, Zppix, _jensen, D3r1ck01, Jonas, Xmlizer, Wong128hk, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, fgiunchedi
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
