On 06/18/2012 12:05 AM, Richard Elling wrote:
> You might try some of the troubleshooting techniques described in Chapter 5
> of the DTrace book by Brendan Gregg and Jim Mauro. It is not clear from your
> description that you are seeing the same symptoms, but the technique should
> -- richard
Thanks for the advice, I'll try it. In the meantime, I'm beginning to
suspect I'm hitting some PCI-e issue on the Dell R715 machine. Looking at
# mdb -k
IRQ  Vect IPL Bus  Trg Type CPU Share APIC/INT# ISR(s)
91   0x82 7   PCI  Edg MSI  5   1     -         pcieb_intr_handler
In mpstat I can see that during normal operation, CPU 5 is nearly floored:
CPU minf mjf xcal intr ithr  csw icsw migr smtx srw syscl usr sys wt idl
  5    0   0    0  512    0 1054    0    0  870   0     0   0  93  0   7
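For anyone who wants to watch for this automatically, here is a minimal
sketch of how one could flag a saturated core from mpstat output. It assumes
the column layout shown above (a header row with CPU, usr and sys columns);
the function name busy_cpus and the 90% threshold are illustrative choices
only, not part of any tool.

```python
# Sketch: flag CPUs whose usr+sys time reaches a threshold in mpstat-style text.
# Assumes a header row starting with "CPU" and containing "usr" and "sys".
# busy_cpus and the default threshold are hypothetical, for illustration only.

def busy_cpus(mpstat_text, threshold=90):
    header = None
    hot = []
    for line in mpstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            # mpstat repeats the header for every sample; remember the latest one.
            header = fields
            continue
        if header is None or len(fields) != len(header):
            # Skip anything that does not match the header's column count.
            continue
        row = dict(zip(header, fields))
        busy = int(row["usr"]) + int(row["sys"])
        if busy >= threshold:
            hot.append((int(row["CPU"]), busy))
    return hot


sample = """\
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
5 0 0 0 512 0 1054 0 0 870 0 0 0 93 0 7
"""

print(busy_cpus(sample))
```

Piping a longer mpstat capture through the same function would list every
sample in which a core crossed the threshold, which makes it easier to line
the spikes up with txg flushes or xcall storms.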
Then, when anything happens that disturbs the PCI-e bus (e.g. a txg flush
or the xcall storm), the CPU goes to 100% utilization and my networking
throughput drops accordingly. The issue can be mitigated by lowering the
input bandwidth from ~46MB/s to below 20MB/s. At that point I'm getting
only about 10% utilization on the core in question, and no xcall storm or
txg flush can influence my network (I do see the CPU get about 70% busy
during those events, but that still leaves enough headroom to avoid
packet loss).
So it seems I'm hitting some hardware design issue, or something similar.
I'll try moving my network card to the second PCI-e I/O bridge tomorrow
(which seems to be bound to CPU 6).
Any other ideas on what I might try to get the PCI-e I/O bridge
bandwidth back? Or how to fight the starvation of the CPU by other
activities in the system (xcalls and/or txg flushes)? I already tried
putting the CPUs in question into an empty processor set, but that isn't
enough, it seems.