Gehel added a subscriber: BBlack.
Gehel added a comment.
My current understanding of the issue:
All IRQs from the NIC are handled by a single CPU. Under load, Blazegraph saturates this CPU (and others), which creates CPU contention with the NIC IRQ handling and leads to packets being dropped. Note that we also need a way to limit Blazegraph CPU consumption (T206108). Spreading those IRQs over multiple CPUs should mitigate the contention.
Currently, all NIC-related interrupts are handled by CPU0 (P7629).
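To make the "everything lands on CPU0" observation easy to re-check, here is a minimal Python sketch that sums per-CPU interrupt counts for one interface. It assumes the input has the usual /proc/interrupts layout (a header row of CPU names, then one row per IRQ). The helper names and the SAMPLE text are made up for illustration; the numbers are not taken from wdqs2003 or from P7629.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {irq_label: (counts, description)}."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < ncpu + 1:        # rows such as "ERR:" carry a single counter
            continue
        try:
            counts = [int(x) for x in fields[1:1 + ncpu]]
        except ValueError:
            continue
        label = fields[0].rstrip(":")
        result[label] = (counts, " ".join(fields[1 + ncpu:]))
    return result


def per_cpu_totals(text, ifname):
    """Sum per-CPU counts over every IRQ whose description mentions ifname."""
    rows = [counts for counts, desc in parse_interrupts(text).values()
            if ifname in desc]
    return [sum(col) for col in zip(*rows)] if rows else []


# Illustrative sample only (hypothetical values, 4 CPUs).
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 78:     500000          0          0          0   IR-PCI-MSI 1048576-edge      eno1-tx-0
 79:     900000          0          0          0   IR-PCI-MSI 1048577-edge      eno1-rx-1
NMI:         10         12         11          9   Non-maskable interrupts
ERR:          0
"""

totals = per_cpu_totals(SAMPLE, "eno1")
print(totals)
```

On a real host the text would come from reading /proc/interrupts instead of SAMPLE; a result with a single non-zero column would match the behaviour described above.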
The NIC is currently configured with 1 TX and 4 RX queues, with a hardware maximum of 4 queues:
gehel@wdqs2003:~$ sudo ethtool -l eno1
Channel parameters for eno1:
Pre-set maximums:
RX: 4
TX: 4
Other: 0
Combined: 0
Current hardware settings:
RX: 4
TX: 1
Other: 0
Combined: 0
For IRQ 79 (en1-rx-1), the affinity appears to be configured to spread interrupts across all CPUs in NUMA node 0:
gehel@wdqs2003:/proc/irq/79$ cat smp_affinity
00ff00ff
gehel@wdqs2003:/proc/irq/79$ cat smp_affinity_list
0-7,16-23
gehel@wdqs2003:/proc/irq/79$ cat affinity_hint
00000000

My understanding of the documentation I have read is that the smp_affinity above should be sufficient to spread the IRQs across those CPUs. This does not match what I'm seeing, so I'm probably missing something.
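As a sanity check on the values above, smp_affinity and smp_affinity_list are two encodings of the same CPU set: bit N of the hex mask corresponds to CPU N. A minimal Python sketch of that conversion follows. The function names are made up for illustration, and the comma handling assumes that masks on larger machines are printed in comma-separated 32-bit groups.

```python
def mask_to_cpus(mask_hex):
    """Return the CPU numbers whose bits are set in a hex affinity mask."""
    value = int(mask_hex.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def cpus_to_list(cpus):
    """Format sorted CPU numbers as ranges, in the style of smp_affinity_list."""
    ranges = []
    start = prev = None
    for cpu in cpus:
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu                      # extend the current run
        else:
            ranges.append((start, prev))    # close the run, start a new one
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(cpus_to_list(mask_to_cpus("00ff00ff")))  # prints 0-7,16-23
```

This only confirms that the mask allows CPUs 0-7 and 16-23; it says nothing about which of those CPUs actually receives the interrupts.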
@BBlack: a review of the above and any pointers in the right direction would be welcome!
Cc: BBlack, Aklapper, Gehel, Nandana, AndyTan, Davinaclare77, Qtn1293, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, Th3d3v1ls, Hfbn0, QZanden, EBjune, merbst, LawExplorer, Zppix, Jonas, Xmlizer, Wong128hk, jkroll, Smalyshev, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, fgiunchedi
