Gehel added a subscriber: Volans.
Gehel added a comment.

All credit for the findings below goes to @Volans:

  • We see dropped packets on the NICs of wdqs[12]003 and of other servers, but the drop count is higher on wdqs[12]003.
  • NIC interrupts are processed only by CPU0 (see /proc/interrupts). We could spread them across cores, which might increase throughput.
  • dmesg shows messages such as "NMI handler (ghes_notify_nmi) took too long to run: 1.259 msecs", probably related to contention on processing the NIC queues.

The above would explain the correlation between high load and the error rate seen in T202764. Since wdqs[12]003 have more, but slower, cores, and their interrupts are handled by a single core, it would also explain the lower throughput on some servers.
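
To make the "everything lands on CPU0" observation easy to re-check, here is a minimal sketch that sums the per-CPU counters in /proc/interrupts for rows matching a NIC name. The interface name `eth0`, the function name, and the sample numbers are illustrative assumptions, not values taken from the wdqs hosts.

```python
import re

# Hypothetical excerpt in the layout of /proc/interrupts (made-up numbers).
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  0:         36          0          0          0  IR-IO-APIC    2-edge      timer
 64:    9120345          0          0          0  IR-PCI-MSI 1048576-edge      eth0-0
 65:    8734001          0          0          0  IR-PCI-MSI 1048577-edge      eth0-1
"""


def irq_counts_per_cpu(text, pattern):
    """Sum interrupt counts per CPU for rows whose text matches `pattern`."""
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        if not re.search(pattern, line):
            continue
        fields = line.split()
        # fields[0] is the IRQ label ("64:"), then one counter per CPU.
        for i, value in enumerate(fields[1:1 + len(cpus)]):
            if value.isdigit():
                totals[i] += int(value)
    return dict(zip(cpus, totals))


print(irq_counts_per_cpu(SAMPLE, r"eth0"))
```

On a real host, the text would come from reading /proc/interrupts, with the pattern adjusted to the actual NIC queue names. A distribution where only CPU0 is non-zero matches the finding above.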

One hard problem to solve is going to be the CPU contention between blazegraph and other processes. At some point, if blazegraph generates too much CPU load, whatever the tuning, we are going to see contention in other areas. We might be able to limit the number of parallel requests.

Potential actions:

  • reduce load on the networking stack:
    • stop logging nginx to logstash
    • tune logging for wdqs-blazegraph and wdqs-updater
  • tune networking stack to spread interrupts across all CPUs (having a look at the config of LVS servers might help)
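
As a rough illustration of the last action, the sketch below builds a round-robin plan that pins each NIC queue IRQ to a different CPU through /proc/irq/<N>/smp_affinity_list. It only computes the writes and does not apply them. The IRQ numbers, CPU list, and function names are hypothetical, and this is not necessarily how the LVS servers are configured. It also assumes the NIC exposes several queues, and a running irqbalance daemon may override such settings.

```python
def plan_irq_affinity(irqs, cpus):
    """Assign each IRQ to one CPU, round-robin. Returns {irq: cpu}."""
    if not cpus:
        raise ValueError("need at least one CPU")
    return {irq: cpus[i % len(cpus)] for i, irq in enumerate(sorted(irqs))}


def affinity_writes(plan):
    """Render a plan as (path, value) pairs; nothing is written here."""
    return [
        (f"/proc/irq/{irq}/smp_affinity_list", str(cpu))
        for irq, cpu in sorted(plan.items())
    ]


# Hypothetical example: four NIC queue IRQs spread over four CPUs.
plan = plan_irq_affinity([64, 65, 66, 67], [0, 1, 2, 3])
for path, value in affinity_writes(plan):
    print(f"echo {value} > {path}")
```

Keeping the plan separate from the writes makes it possible to review the mapping before applying anything as root.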

TASK DETAIL
https://phabricator.wikimedia.org/T200563
