Gehel added a subscriber: Volans. Gehel added a comment.
All credit for the findings below goes to @Volans:
- we see dropped packets on the NICs of wdqs[12]003 and of other servers, but the drop rate is higher on wdqs[12]003.
- NIC interrupts are processed only by CPU0 (see /proc/interrupts); we could spread them across CPUs, which might increase throughput.
- we see messages like this in dmesg: "NMI handler (ghes_notify_nmi) took too long to run: 1.259 msecs". They are probably related to contention in processing the NIC queues.
The above would explain the correlation between high load and the error rate reported in T202764. Since wdqs[12]003 have more, but slower, cores, and all NIC interrupts are handled by a single core, it would also explain the lower throughput on some servers.
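To check how unevenly NIC interrupts are distributed, the per-CPU counters in /proc/interrupts can be summed for one interface. The following is a minimal sketch, not something that was run on the wdqs hosts: the interface name "eth0" and the sample counters are assumptions for illustration only, and the function name is hypothetical.

```python
def irq_counts_per_cpu(interrupts_text, device):
    """Sum per-CPU interrupt counts for IRQ lines whose description mentions `device`.

    `interrupts_text` is the content of /proc/interrupts: a header row of CPU
    names, then one row per IRQ with one counter per CPU and a description.
    """
    lines = interrupts_text.splitlines()
    cpus = lines[0].split()
    totals = [0] * len(cpus)
    for line in lines[1:]:
        _label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = fields[:len(cpus)]
        # Skip summary rows (e.g. ERR) that do not carry one counter per CPU.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        description = " ".join(fields[len(cpus):])
        if device in description:
            for i, count in enumerate(counts):
                totals[i] += int(count)
    return dict(zip(cpus, totals))


# Made-up sample in the /proc/interrupts layout; "eth0" is an assumed name.
SAMPLE = """\
           CPU0       CPU1
 64:       1000          0   PCI-MSI 1048576-edge      eth0-TxRx-0
 65:        500          0   PCI-MSI 1048577-edge      eth0-TxRx-1
NMI:          3          4   Non-maskable interrupts
ERR:          0
"""

print(irq_counts_per_cpu(SAMPLE, "eth0"))
```

On a live host the same function could be fed the real file content, for example open("/proc/interrupts").read(), with the actual interface name. A result where only CPU0 has a non-zero total would match the observation above.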
One hard problem to solve is going to be the CPU contention between blazegraph and other processes. At some point, if blazegraph generates too much CPU load, we are going to see contention in other areas no matter how we tune the system. We might be able to limit the number of parallel requests.
Potential actions:
- reduce load on the networking stack:
  - stop logging nginx to logstash
  - tune logging for wdqs-blazegraph and wdqs-updater
- tune networking stack to spread interrupts across all CPUs (having a look at the config of LVS servers might help)
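As an illustration of the last action, here is a minimal sketch of a round-robin plan that pins each NIC queue IRQ to a different CPU through /proc/irq/<irq>/smp_affinity_list. It only builds the commands and writes nothing. The IRQ numbers and CPU count are hypothetical, the function names are made up for this example, and this is not necessarily how the LVS servers are configured. Note also that a running irqbalance daemon may overwrite manual affinity settings.

```python
def plan_irq_affinity(irqs, cpu_count):
    """Assign each IRQ to one CPU, round-robin over CPUs 0..cpu_count-1."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return {irq: index % cpu_count for index, irq in enumerate(sorted(irqs))}


def affinity_commands(plan):
    """Render shell commands that would apply the plan (to be run as root)."""
    return [
        f"echo {cpu} > /proc/irq/{irq}/smp_affinity_list"
        for irq, cpu in sorted(plan.items())
    ]


# Hypothetical example: four NIC queue IRQs spread over four CPUs.
for command in affinity_commands(plan_irq_affinity([64, 65, 66, 67], 4)):
    print(command)
```

The number of IRQs that can be spread is bounded by the number of queues the NIC exposes, so a single-queue NIC would still land on one CPU with this approach.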
To: Gehel
Cc: Volans, Stashbot, Gehel, Aklapper, Smalyshev, Nandana, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
_______________________________________________ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs