Imarlier assigned this task to aaron.
Imarlier moved this task from Inbox to Doing on the Performance-Team board.
Imarlier added a comment.
@aaron is going to see what else can be done to reduce spam, will then assign back to @ArielGlenn
TASK DETAIL https://phabricator.wikimedia.org
Imarlier reassigned this task from Imarlier to Smalyshev.
TASK DETAIL https://phabricator.wikimedia.org/T207718
EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: Smalyshev, Imarlier
Cc: Imarlier, BBlack, ema, Gehel, Aklapper, Smalyshev, Legado_Shulgin, Nandana
Imarlier added a comment.
@Smalyshev Guessing this should go back to you for followup?
TASK DETAIL https://phabricator.wikimedia.org/T207718
To: Imarlier
Cc: BBlack, ema, Gehel, Aklapper, Smalyshev, Legado_Shulgin
Imarlier added a comment.
I've been running this in a tmux session on a few of the wdqs servers: while :; do DSTAMP=$(date); CW=$(sudo netstat -anet | grep 208.80.154.224 | grep -c CLOSE_WAIT); echo "${DSTAMP}: ${CW}"; sleep 1; done >> ~/close_waits.txt. (154.224 is the edge fo
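A rough Python equivalent of the counting step in the loop above, offered only as a sketch: it counts `CLOSE_WAIT` rows for one peer address in captured `netstat -anet` output. The function name `count_close_wait` and the sample rows (including the `10.64.0.5` local address) are made up for illustration, and it assumes the usual Linux column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State).

```python
def count_close_wait(netstat_output: str, peer_ip: str) -> int:
    """Count CLOSE_WAIT sockets whose local or foreign host equals peer_ip.

    Assumes Linux `netstat -ant`-style rows:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State [User Inode]
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[5] != "CLOSE_WAIT":
            continue
        # Strip the ":port" suffix before comparing hosts.
        local_host = fields[3].rsplit(":", 1)[0]
        foreign_host = fields[4].rsplit(":", 1)[0]
        if peer_ip in (local_host, foreign_host):
            count += 1
    return count


# Hypothetical sample rows, not taken from the wdqs servers.
SAMPLE = """\
tcp 0 0 10.64.0.5:51234 208.80.154.224:443 CLOSE_WAIT 0 12345
tcp 0 0 10.64.0.5:51236 208.80.154.224:443 ESTABLISHED 0 12346
tcp 1 0 10.64.0.5:51240 208.80.154.224:443 CLOSE_WAIT 0 12347
tcp 0 0 10.64.0.5:51300 10.64.0.9:9999 CLOSE_WAIT 0 12348
"""

if __name__ == "__main__":
    print(count_close_wait(SAMPLE, "208.80.154.224"))
```

Unlike the `grep` pipeline, this compares the address columns exactly rather than matching the IP anywhere in the line, which avoids counting rows where the string appears as part of a longer address.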
Imarlier added a comment.
@Smalyshev Yes, it would be slower, but it would also be diagnostic -- if persistent connections are disabled and the errors stop, we can be pretty confident that something about the way that they're configured is what's resulting in this issue.
Imarlier added a comment.
@Smalyshev Another thought: why not just disable pooling, and have the client close each connection after each request?
TASK DETAIL https://phabricator.wikimedia.org/T207718
To: Imarlier
Imarlier added a comment.
@BBlack @ema Couple of questions for you about Nginx:
Do we have nginx configured to handle a specific number of requests on a given worker process/thread, and then shut that down?
Is it possible for nginx to be restarted (interrupting existing persistent connections
Imarlier added a comment.
Hrm. In that case, very likely that you're right, and what I'm seeing is
the retry.
Out of curiosity, have you examined GC behavior around the times that these
issues crop up?
Imarlier added a comment.
In T207718#4748289, @Smalyshev wrote:
So, an interesting thing: in at least some of these cases, there is a web request that is making it to wikidata, and that is returning a 200.
The request is retried if it fails, are you sure it's not the retry that you are seeing
Imarlier added a comment.
Doesn't appear to have solved the issue, but I need to verify that the patches have actually been deployed: https://logstash.wikimedia.org/goto/d72dcf3ef04eb8d02cb6a4c602754cfd
Imarlier added a comment.
@Gehel have the patches referenced above been deployed?
TASK DETAIL https://phabricator.wikimedia.org/T202764
To: Imarlier
Cc: Stashbot, Marostegui, Banyek, Reedy, gerritbot, Krinkle, Addshore
Imarlier added a comment.
Given what we're seeing so far, I suggest that we let https://gerrit.wikimedia.org/r/471737 land and see if that changes behavior. It seems reasonable to me that it might be the culprit.
Imarlier added a comment.
So, an interesting thing: in at least some of these cases, there is a web request that is making it to wikidata, and that is returning a 200. I put together a jupyter notebook that pulls down a list of errors from logstash, and then queries the webrequest table in Hive
Imarlier added a comment.
Can someone add me to T202765, mentioned above by @Gehel? The issue that @Smalyshev created last week seems to involve requests that are never getting to wikidata, which suggests a wdqs related issue of some sort (and thus potentially related to spikes in resource use
Imarlier added subscribers: ema, BBlack.
Imarlier added a comment.
Actually, as I think about it, the lack of a matching webrequest is itself a potentially interesting clue. Varnish should have recorded the request even if there were timeouts from the backend, but there's no indication
Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.
TASK DETAIL https://phabricator.wikimedia.org/T198946
To: Niedzielski, Imarlier
Cc: Gilles, nray, Smalyshev, WMDE-leszek, Tarrow
Imarlier moved this task from Inbox to Doing on the Performance-Team board.
Imarlier claimed this task.
TASK DETAIL https://phabricator.wikimedia.org/T207718
WORKBOARD https://phabricator.wikimedia.org/project/board/1212/
Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.
TASK DETAIL https://phabricator.wikimedia.org/T97368
To: Imarlier
Cc: Stashbot, gerritbot, Jdforrester-WMF, Joe, mark, Addshore
Imarlier added a comment.
Hey, @Smalyshev -- Did you tag perf team on this because you're hoping that we can help with the investigation?
TASK DETAIL https://phabricator.wikimedia.org/T202764
To: Imarlier
Cc: Imarlier
Imarlier added a comment.
Made this suggestion in T201409 but figured it should be repeated here: whatever the implementation ends up being, it should be easy (or possibly required?) to set the User Agent string, with the goal that any caller will use that to specify what job/service/etc is making
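As a loose illustration of the "possibly required" variant of this suggestion, the sketch below uses Python's standard `urllib.request` to build a request that cannot be created without a descriptive User-Agent. The helper name `build_request`, the example URL, and the agent string are all hypothetical; this is not the implementation discussed in the task.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build an HTTP request, refusing to proceed without a User-Agent.

    The caller is expected to name the job or service making the request,
    so that server-side logs can attribute traffic to its source.
    """
    if not user_agent or not user_agent.strip():
        raise ValueError("a descriptive User-Agent string is required")
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Hypothetical caller identifying itself; no network request is sent here.
    req = build_request(
        "https://example.org/api",
        "example-update-job/1.0 (contact: ops@example.org)",
    )
    print(req.get_header("User-agent"))
```

Making the argument mandatory, rather than falling back to a library default, is the design choice that forces each caller to identify itself.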
Imarlier added a comment.
FWIW, I'd maybe offer a soft poke in the direction of Guzzle, mostly because I find that having REST-y functionality integrated can be helpful. But it's not something that I feel strongly about.
Imarlier moved this task from Inbox to Radar on the Performance-Team board.
Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.
TASK DETAIL https://phabricator.wikimedia.org/T152185
WORKBOARD https://phabricator.wikimedia.org/project/board/1212/
Imarlier added a comment.
This was reverted back to using php5 in https://phabricator.wikimedia.org/T182348
TASK DETAIL https://phabricator.wikimedia.org/T117534
To: hoo, Imarlier
Cc: Imarlier, nichtich
Imarlier added a comment.
Just noticed this in SAL:
15:00 <jynus@tin> Synchronized wmf-config/db-eqiad.php: Depool db1100 (duration: 00m 58s)
Potentially interesting, given that timeouts had started to be reported slightly before that.
The commit that was being synced just refers to th
Imarlier claimed this task.
TASK DETAIL https://phabricator.wikimedia.org/T183101
To: Imarlier
Cc: Legoktm, Imarlier, aaron, Aklapper, EBernhardson, dcausse, Smalyshev, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune
Imarlier added a comment.
@aaron -- would love your thoughts on this, when you get a chance.
TASK DETAIL https://phabricator.wikimedia.org/T183101
To: Imarlier
Cc: Legoktm, Imarlier, aaron, Aklapper, EBernhardson
Imarlier added a comment.
Alternate theory that I think is wrong: there's a chance that something wonky happened with db1070 around this time, which caused replication to stop and may also have caused issues getting locks. If that were the case I would expect to see far more write errors here
Imarlier added a comment.
First appearance of a timeout on db1100:
2017-12-16T14:37:46 mw1311 WARNING Wikimedia\Rdbms\LoadBalancer::doWait: Timed out waiting on db1100 pos db1070-bin.001655/808461384
First exception trying to get the lock:
2017-12-16T14:38:16 mw1290 ERROR
Imarlier added a comment.
Just throwing a couple of notes in here:
Q45825741 and Q45825750 were both updated at very close to the same time (2017-12-16 14:54:00-15:02:00); both had a number of updates that happened within a very tight window (<1 minute). Both also have log entries indicat
Imarlier added a comment.
Aaron is on vacation this week, likely not paying any attention to notifications. We'll try to take a look at it without him.
TASK DETAIL https://phabricator.wikimedia.org/T183101