Imarlier added a comment.

I've been running this in a tmux session on a few of the wdqs servers: while :; do DSTAMP=$(date); CW=$(sudo netstat -anet | grep 208.80.154.224 | grep -c CLOSE_WAIT); echo "${DSTAMP}: ${CW}"; sleep 1; done >> ~/close_waits.txt. (154.224 is the edge for the text cache cluster.)

It's a very small sample at this point, but there are a few things that make me think that this is the issue, or one of the issues:

wdqs1006 had three errors thrown at 2018-12-03T20:09:56. This is the close_wait count around that time:

Mon Dec  3 20:09:30 UTC 2018: 0
Mon Dec  3 20:09:31 UTC 2018: 0
Mon Dec  3 20:09:32 UTC 2018: 0
Mon Dec  3 20:09:33 UTC 2018: 0
Mon Dec  3 20:09:34 UTC 2018: 0
Mon Dec  3 20:09:35 UTC 2018: 0
Mon Dec  3 20:09:36 UTC 2018: 0
Mon Dec  3 20:09:37 UTC 2018: 0
Mon Dec  3 20:09:38 UTC 2018: 10
Mon Dec  3 20:09:39 UTC 2018: 10
Mon Dec  3 20:09:40 UTC 2018: 10
Mon Dec  3 20:09:41 UTC 2018: 10
Mon Dec  3 20:09:42 UTC 2018: 10
Mon Dec  3 20:09:43 UTC 2018: 10
Mon Dec  3 20:09:44 UTC 2018: 10
Mon Dec  3 20:09:45 UTC 2018: 10
Mon Dec  3 20:09:46 UTC 2018: 10
Mon Dec  3 20:09:47 UTC 2018: 10
Mon Dec  3 20:09:48 UTC 2018: 10
Mon Dec  3 20:09:49 UTC 2018: 10
Mon Dec  3 20:09:50 UTC 2018: 10
Mon Dec  3 20:09:51 UTC 2018: 10
Mon Dec  3 20:09:52 UTC 2018: 10
Mon Dec  3 20:09:53 UTC 2018: 10
Mon Dec  3 20:09:54 UTC 2018: 10
Mon Dec  3 20:09:55 UTC 2018: 10
Mon Dec  3 20:09:57 UTC 2018: 0
Mon Dec  3 20:09:58 UTC 2018: 0
Mon Dec  3 20:09:59 UTC 2018: 0
Mon Dec  3 20:10:00 UTC 2018: 0

wdqs1004 had three exceptions thrown at 20:08:23. This is the close_wait count around that time:

Mon Dec  3 20:08:10 UTC 2018: 0
Mon Dec  3 20:08:11 UTC 2018: 0
Mon Dec  3 20:08:12 UTC 2018: 0
Mon Dec  3 20:08:13 UTC 2018: 0
Mon Dec  3 20:08:14 UTC 2018: 0
Mon Dec  3 20:08:15 UTC 2018: 10
Mon Dec  3 20:08:16 UTC 2018: 10
Mon Dec  3 20:08:17 UTC 2018: 10
Mon Dec  3 20:08:18 UTC 2018: 10
Mon Dec  3 20:08:19 UTC 2018: 10
Mon Dec  3 20:08:20 UTC 2018: 10
Mon Dec  3 20:08:21 UTC 2018: 10
Mon Dec  3 20:08:22 UTC 2018: 10
Mon Dec  3 20:08:23 UTC 2018: 0
Mon Dec  3 20:08:24 UTC 2018: 0
Mon Dec  3 20:08:25 UTC 2018: 0
Mon Dec  3 20:08:26 UTC 2018: 0
Mon Dec  3 20:08:27 UTC 2018: 0
Mon Dec  3 20:08:28 UTC 2018: 0
Mon Dec  3 20:08:29 UTC 2018: 0

wdqs1005 had three exceptions thrown at 20:07:51. This is the close_wait count around that time:

Mon Dec  3 20:07:20 UTC 2018: 0
Mon Dec  3 20:07:21 UTC 2018: 0
Mon Dec  3 20:07:22 UTC 2018: 10
Mon Dec  3 20:07:23 UTC 2018: 10
Mon Dec  3 20:07:24 UTC 2018: 10
Mon Dec  3 20:07:25 UTC 2018: 10
Mon Dec  3 20:07:26 UTC 2018: 10
Mon Dec  3 20:07:27 UTC 2018: 10
Mon Dec  3 20:07:28 UTC 2018: 10
Mon Dec  3 20:07:29 UTC 2018: 10
Mon Dec  3 20:07:30 UTC 2018: 10
Mon Dec  3 20:07:31 UTC 2018: 10
Mon Dec  3 20:07:32 UTC 2018: 10
Mon Dec  3 20:07:33 UTC 2018: 10
Mon Dec  3 20:07:35 UTC 2018: 10
Mon Dec  3 20:07:36 UTC 2018: 10
Mon Dec  3 20:07:37 UTC 2018: 10
Mon Dec  3 20:07:38 UTC 2018: 10
Mon Dec  3 20:07:39 UTC 2018: 10
Mon Dec  3 20:07:40 UTC 2018: 10
Mon Dec  3 20:07:41 UTC 2018: 10
Mon Dec  3 20:07:42 UTC 2018: 10
Mon Dec  3 20:07:43 UTC 2018: 10
Mon Dec  3 20:07:44 UTC 2018: 10
Mon Dec  3 20:07:45 UTC 2018: 10
Mon Dec  3 20:07:46 UTC 2018: 10
Mon Dec  3 20:07:47 UTC 2018: 10
Mon Dec  3 20:07:48 UTC 2018: 10
Mon Dec  3 20:07:49 UTC 2018: 10
Mon Dec  3 20:07:50 UTC 2018: 10
Mon Dec  3 20:07:51 UTC 2018: 0
Mon Dec  3 20:07:52 UTC 2018: 0
Mon Dec  3 20:07:53 UTC 2018: 0
Mon Dec  3 20:07:54 UTC 2018: 0
Mon Dec  3 20:07:55 UTC 2018: 0
Mon Dec  3 20:07:56 UTC 2018: 0
Mon Dec  3 20:07:57 UTC 2018: 0
Mon Dec  3 20:07:58 UTC 2018: 0
Mon Dec  3 20:07:59 UTC 2018: 0

wdqs1006 at 20:07:06:

Mon Dec  3 20:06:46 UTC 2018: 0
Mon Dec  3 20:06:47 UTC 2018: 0
Mon Dec  3 20:06:48 UTC 2018: 9
Mon Dec  3 20:06:49 UTC 2018: 9
Mon Dec  3 20:06:50 UTC 2018: 9
Mon Dec  3 20:06:51 UTC 2018: 9
Mon Dec  3 20:06:52 UTC 2018: 9
Mon Dec  3 20:06:53 UTC 2018: 9
Mon Dec  3 20:06:54 UTC 2018: 9
Mon Dec  3 20:06:55 UTC 2018: 9
Mon Dec  3 20:06:56 UTC 2018: 9
Mon Dec  3 20:06:57 UTC 2018: 9
Mon Dec  3 20:06:58 UTC 2018: 9
Mon Dec  3 20:07:00 UTC 2018: 9
Mon Dec  3 20:07:01 UTC 2018: 9
Mon Dec  3 20:07:02 UTC 2018: 9
Mon Dec  3 20:07:03 UTC 2018: 9
Mon Dec  3 20:07:04 UTC 2018: 9
Mon Dec  3 20:07:05 UTC 2018: 9
Mon Dec  3 20:07:06 UTC 2018: 0
Mon Dec  3 20:07:07 UTC 2018: 0
Mon Dec  3 20:07:08 UTC 2018: 0
Mon Dec  3 20:07:09 UTC 2018: 0
Mon Dec  3 20:07:10 UTC 2018: 0
Mon Dec  3 20:07:11 UTC 2018: 0

The pattern is consistent enough that I feel pretty comfortable concluding that this is the cause.

https://jira.apache.org/jira/browse/HTTPCLIENT-1918 has a bunch of suggestions around this. It also notes that HttpClient has connection management logging that can be used to detect cases where connections aren't being released back to the pool correctly.


TASK DETAIL
https://phabricator.wikimedia.org/T207718

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Imarlier
Cc: BBlack, ema, Gehel, Aklapper, Smalyshev, Legado_Shulgin, Nandana, thifranc, AndyTan, Davinaclare77, Qtn1293, Imarlier, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, Th3d3v1ls, Hfbn0, QZanden, EBjune, merbst, LawExplorer, Vali.matei, Zppix, _jensen, D3r1ck01, Jonas, Xmlizer, Wong128hk, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, fgiunchedi
_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Reply via email to