Re: TG3 data corruption (TSO ?)
Ok, it seems like we might have more than just the missing barrier in TG3. Possibly some IOMMU problems on some machines as well. Unfortunately, I don't have a tg3 on a PCI-X or PCI-E card to test on a pSeries or some other machine. [Olof: I've disabled the new U4 DART invalidate code (reverted to the old one) and added an unconditional barrier to dart_flush and I yet have to reproduce the problem. I suspect a problem with the DART invalidate one thingy, maybe a HW problem with the U4 chip. Now regarding the barrier in flush, we'll talk about it later, I think we might have a problem with the way we do the DART accesses (they might leak out of the lock) though I yet have to see that cause a problem in practice due to the round-robin nature of our allocation algorithm.] Ben. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
#define tw32_rx_mbox(reg, val) do { wmb(); tp-write32_rx_mbox(tp, reg, val); } while(0) #define tw32_tx_mbox(reg, val) do { wmb(); tp-write32_tx_mbox(tp, reg, val); } while(0) That should do it. I think we need those tcpdump after all. Can you send it to me? Looks like adding a sync to writel does fix it though... I'm trying to figure out which specific writel in the driver makes a difference. I'll then look into slicing those tcpdumps. Adding a PowerPC sync to every writel() will slow down things by so much that it could well just hide the problem, not solve it... Michael, we see this problem only with TSO on, and then the failure we see is bad data being sent out, with the correct header (but the header is constructed by the tg3 in this case, so no surprise). I'm theorizing that this same failure can happen with TSO off as well, but the header sent on the wire will be bogus as well as the data, so the receiving NIC won't pick it up. tcpdump probably won't see it either... will need a low-level ethernet analyser. Segher - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
On Mon, 2006-09-11 at 15:52 +1000, Benjamin Herrenschmidt wrote: Looks like adding a sync to writel does fix it though... I'm trying to figure out which specific writel in the driver makes a difference. I'll then look into slicing those tcpdumps. During runtime in the fast path, the only writel()'s we do in tg3 are to the tx mbox, rx_mbox, and the interrupt mbox. The interrupt mbox shouldn't matter that much since it has no dependencies on other memory writes before it. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
On Sun, 2006-09-10 at 22:33 -0700, Michael Chan wrote: Benjamin Herrenschmidt wrote: I've done: #define tw32_rx_mbox(reg, val) do { wmb(); tp-write32_rx_mbox(tp, reg, val); } while(0) #define tw32_tx_mbox(reg, val) do { wmb(); tp-write32_tx_mbox(tp, reg, val); } while(0) That should do it. I think we need those tcpdump after all. Can you send it to me? Looks like adding a sync to writel does fix it though... I'm trying to figure out which specific writel in the driver makes a difference. I'll then look into slicing those tcpdumps. Ben. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
Oh, we know about this. The powerpc writel() used to have memory barriers in 2.4 kernels but not any more in 2.6 kernels. Red Hat's version of tg3 has extra wmb()'s to fix this problem. David doesn't think that the upstream version of tg3 should have these wmb()'s, and the problem should instead be fixed in powerpc's writel(). I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still reproduce the problem. I've also done a 2 days run without TSO enabled without a failure (my test program normally fails after a couple of minutes). Thus, do you see any other code path in the driver where a synchronisation might be missing ? Is there any case where the chip might use data in memory before it has been told to do so with a mailbox write ? (There are no OWN bits that I can see in the descriptors, thus I doubt it will use a transmit descriptor that is half-built before the store to the mailbox allows using it) but who knows That leads to the question that there might be an unrelated bug in the driver. Segher thinks we might be overriding live descriptors, though I haven't seen how yet. It seems to be TSO specific tho... maybe some missing smp synchronisation in the driver itself or a problem when the TX ring is full ? I don't have the chip docs and I'm not familiar with the driver, so I'll keep looking, but advice is welcome. I'll also see if I can reproduce with some other TSO capable card, in case the problem is in the kernel TSO code and not in the driver. Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
Benjamin Herrenschmidt wrote: I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still reproduce the problem. I've also done a 2 days run without TSO enabled without a failure (my test program normally fails after a couple of minutes). Hi Ben, The code is a bit tricky. It uses function pointers for the various register read/write methods. For the 5780, I believe it will be assigned a simple writel() and not tg3_write32_tx_mbox(). Can you double check to make sure you have actually added the wmb()? It's probably easiest to just add the wmb() in tg3_xmit_dma_bug() before the tw32_tx_mbox(). - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
On Sun, 2006-09-10 at 22:18 -0700, Michael Chan wrote: Benjamin Herrenschmidt wrote: I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still reproduce the problem. I've also done a 2 days run without TSO enabled without a failure (my test program normally fails after a couple of minutes). Hi Ben, The code is a bit tricky. It uses function pointers for the various register read/write methods. For the 5780, I believe it will be assigned a simple writel() and not tg3_write32_tx_mbox(). Can you double check to make sure you have actually added the wmb()? It's probably easiest to just add the wmb() in tg3_xmit_dma_bug() before the tw32_tx_mbox(). I've done: #define tw32_rx_mbox(reg, val) do { wmb(); tp-write32_rx_mbox(tp, reg, val); } while(0) #define tw32_tx_mbox(reg, val) do { wmb(); tp-write32_tx_mbox(tp, reg, val); } while(0) Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
Benjamin Herrenschmidt wrote: I've done: #define tw32_rx_mbox(reg, val)do { wmb(); tp-write32_rx_mbox(tp, reg, val); } while(0) #define tw32_tx_mbox(reg, val)do { wmb(); tp-write32_tx_mbox(tp, reg, val); } while(0) That should do it. I think we need those tcpdump after all. Can you send it to me? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
On Sat, 2006-09-09 at 02:22 -0700, David Miller wrote: From: Benjamin Herrenschmidt [EMAIL PROTECTED] Date: Sat, 09 Sep 2006 07:46:02 +1000 I don't think that in general, you have ordering guarantees between cacheable and non-cacheable stores unless you use explicit barriers. In fact, on most systems you absolutely do have ordering between MMIO and memory accesses. Well, at least powerpc and ia64 don't in hw, I don't know about others... out of order in general is getting fascionable in processor design ... So you are making an extremely poor engineering decision by trying to fixup all the drivers to match PowerPC's semantics. I think a smart engineer would decrease his debugging burdon, by matching his platform's MMIO accessors such that it matches what other platforms do and therefore inheriting the testing coverage provided by all platforms. Otherwise you will be hunting down these kinds of memory barrier issues forever. Well, some of you (Alan, you, etc...) seem to imply that it's always been the rule to have a memory store followed by an MMIO write be strongly ordered. However, if you look at drivers like e1000, USB OHCI, or even sungem (:-) they, all have at least wmb()'s between updating descriptor in memory and the MMIO that triggers reading those by the chip. So it seems that I'm not the only one to have thought otherwise ;-) More specificaly, at least ia64 I think, like PowerPC, assumes no ordering requirement here. So they would need fixing too. My main problem is the cost... it's actually very expensive to do that sort of synchronisation. I don't know for ia64 or other potentially out of order architectures, but we do introduced a measureable performance hit by adding the one we already have to guard against spin_unlock. So if we decide to go the way of making writel synchronous vs. previous MMIOs, I'd really like to have a clearly defined relaxed version as well. However, I'm not sure any of our current relaxed accessor have clear semantics. At least what is implemented currently on PowerPC is the __raw_* versions which not only have no barriers at all (they don't even order between MMIOs, for example, readl might cross writel), and do no endian swap. Quite a mess of semantics if you ask me... Then there has been talks about those *_relaxed operations but those are more a match to the relaxed PCI-X and PCI-E cycels, that is they relax ordering vs. requests in a different direction on the bus, they have nothing to do with storage domains on the CPU. Ben. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
Ar Sul, 2006-09-10 am 08:36 +1000, ysgrifennodd Benjamin Herrenschmidt: Well, some of you (Alan, you, etc...) seem to imply that it's always been the rule to have a memory store followed by an MMIO write be strongly ordered. It has always been the rule However, if you look at drivers like e1000, USB OHCI, or even sungem (:-) they, all have at least wmb()'s between updating descriptor in Driver hacks to cope with platform authors who got read/writel wrong. semantics. At least what is implemented currently on PowerPC is the __raw_* versions which not only have no barriers at all (they don't even order between MMIOs, for example, readl might cross writel), and do no endian swap. Quite a mess of semantics if you ask me... Then there has __writel/__readl seems more in keeping for just not locking. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
semantics. At least what is implemented currently on PowerPC is the __raw_* versions which not only have no barriers at all (they don't even order between MMIOs, for example, readl might cross writel), and do no endian swap. Quite a mess of semantics if you ask me... Then there has __writel/__readl seems more in keeping for just not locking. Not locking... you mean not ordering I suppose. Ok, so the question is no ordering at all (that is even between MMIO read/writes, thus a __readl can cross a __writel), or just no ordering between MMIO and cacheable storage ? It's an important difference and both have their use. For example, on PowerPC, if I completely remove barriers, I get the first semantic and I get the ability to write combine on non-cacheable storage as a benefit (provided we add an ioremap_wc or such, as the Guarded bit we set on normal non-cacheable space does also prevent write combining on most implementations). However, if I keep at least ordering between MMIOs, then I leave an eieio in there, which is not nearly as expensive than a full sync but will not order cacheable cs. non-cacheable. However, it will also prevent write combine as far as I remember. Ben. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
From: Benjamin Herrenschmidt [EMAIL PROTECTED] Date: Sat, 09 Sep 2006 07:46:02 +1000 I don't think that in general, you have ordering guarantees between cacheable and non-cacheable stores unless you use explicit barriers. In fact, on most systems you absolutely do have ordering between MMIO and memory accesses. So you are making an extremely poor engineering decision by trying to fixup all the drivers to match PowerPC's semantics. I think a smart engineer would decrease his debugging burdon, by matching his platform's MMIO accessors such that it matches what other platforms do and therefore inheriting the testing coverage provided by all platforms. Otherwise you will be hunting down these kinds of memory barrier issues forever. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
Copying netdev. Benjamin Herrenschmidt wrote: Hi ! I've been chasing with Segher a data corruption problem lately. Basically transferring huge amount of data (several Gb) and I get corrupted data at the rx side. I cannot tell for sure wether what I've been observing here is the same problem that segher's been seing on is blades, he will confirm or not. He also seemed to imply that reverting to an older kernel on the -receiver- side fixed it, which makes me wonder, since it's looks really like a sending side problem (see explanation below), if some change in, for exmaple, window scaling, might hide or trigger it. Please send me lspci and tg3 probing output so that I know what tg3 hardware you're using. I also want to look at the tcpdump or ethereal on the mirrored port that shows the packet being corrupted. Now, first, I've been playing with ssh from /dev/zero on one machine to /dev/zero on the other. That allowed me to run enough tests all over the place to have some idea of where the problem comes from since ssh will shoke at decryption when hitting the corruption. The base setup where it happens often is 2 Quad G5's connected to a gigabit switch. Both were running some versions of 2.6.18-rc4 and -rc5 (some random git actually, but see below as I've reproduced the problem with today's git snapshot which includes the TG3 tx race fix among others). I have reproduced with various machines as the receiver. A sungem in a Dual G5 and a virtual ethernet in a Power5 partition (so the packets go to an e1000 then routed through an AIX IO server to a virtual ethernet :) are good examples of variety :) I haven't tested with non-PowerPC machines so far. I've also never been able to reproduce with TSO disabled on the emitting TG3's Then, I've hacked tridge socklib test program (a simple TCP server that pushes a known buffer and a simple TCP receiver that connects to it and reads the data). I've added comparison of the data with what they are supposed to be on the receiving end. The interesting thing is that is much faster than ssh or whatever else I tried. ssh or rsync between those 2 Quad G5s give me about 35Mb/sec while I get to 107Mb/sec average with the small test program. The fun thing is, I've not been able to reproduce at all that way. When the link is pretty much saturated, the problem doesn't occur ! As soon as I introduce a small delay (some crap waiting loop) in the sender to slow down the throughput to about 80Mb/sec, then the problem starts occuring every now and then (I don't have precise frquency data but I get a corruption every couple of gigabytes I'd say). As for my previous tests, disabling TSO on the sending side fixes it. Below is a dump of what the corruption look like. I've trimmed the beginning and end of the dumped packet (the receiver does 8k reads). The 0x5a are the expected data, the rest is corruption. They look like kernel pointers, but that isn't always the case (often though but that might not be relevant). The size and position within the buffer of the corrupted data is variable (doesn't seem to be specifically a page or anything nice and round like that). I've configured the switch to send all the traffic between the two machines to a 3rd box and then recorded it with tcpdump (the spy uses an e1000) and I can see the corrupted data in the recorded traces (the exact same pattern as detected by the receiver). So it seems very likely at this point that the corruption happens on the sending side. The TCP checksums are correct I assume. I don't see any error count on the receiving tg3 nor suspicous message in dmesg indicating they aren't. That's all the data I have at this point. I can't guarantee 100% that it's a TSO bug (it might be a bug that TSO renders visible due to timing effects) but it looks like it since I've not reproduced yet with TSO disabled. I'll do an overnight test to confirm that though... sometimes the bug can take it's time to show up ... I've seen it wait 20Gb before it kicked in. Also the fact that fully loading the machine never produced it is strange smells like a race. Cheers, Ben. 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 00 00 00 00 00 00 00 00 2f 63 70 75 73 00 7f 7e c0 00 00 00 01 cb 70 82 00 00 00 04 bf 1d db 4d c0 00 00 00 01 cb 92 00 c0 00 00 01 7b fe 6d 98 c0 00 00 00 01 cb 70 91 00 00 00 04 df 5d fe fd c0 00 00 00 01 cb 92 10 c0 00 00 01 7b fe 6d b8 c0 00 00 00 01 cb 71 0e 00 00 00 04 fe e2 fb cf c0 00 00 00 01 cb 92 20 c0 00 00 01 7b fe 6d d8 c0 00 00 00 01 cb 71 1f 00 00 00 04 73 69 ed ff c0 00 00 00 01 cb 92 30 c0 00 00 01 7b fe 6d f8 c0 00 00 00 01
Re: TG3 data corruption (TSO ?)
I've been chasing with Segher a data corruption problem lately. Basically transferring huge amount of data (several Gb) and I get corrupted data at the rx side. I cannot tell for sure wether what I've been observing here is the same problem that segher's been seing on is blades, he will confirm or not. He also seemed to imply that reverting to an older kernel on the -receiver- side fixed it, which makes me wonder, since it's looks really like a sending side problem (see explanation below), if some change in, for exmaple, window scaling, might hide or trigger it. Please send me lspci and tg3 probing output so that I know what tg3 hardware you're using. I use a 5780 rev. A3, but the problem is not limited to this chip. I also want to look at the tcpdump or ethereal on the mirrored port that shows the packet being corrupted. I don't have such, sorry. That's all the data I have at this point. I can't guarantee 100% that it's a TSO bug (it might be a bug that TSO renders visible due to timing effects) but it looks like it since I've not reproduced yet with TSO disabled. It seems to indeed to only be exposed by TSO, not actually a bug of it /an sich/. I've got a patch that seems so solve the problem, it needs more testing though (maybe Ben can do this :-) ). The problem is that there should be quite a few wmb()'s in the code that are just not there; adding some to tg3_set_txd() seems to fix the immediate problem but more is needed (and I don't see why those should be needed, unless tg3_set_txd() is updating a life ring entry in place or something like that). More testing is needed, but the problem is definitely the lack of memory ordering. Segher - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
On Fri, 2006-09-08 at 21:29 +0200, Segher Boessenkool wrote: I've got a patch that seems so solve the problem, it needs more testing though (maybe Ben can do this :-) ). The problem is that there should be quite a few wmb()'s in the code that are just not there; adding some to tg3_set_txd() seems to fix the immediate problem but more is needed (and I don't see why those should be needed, unless tg3_set_txd() is updating a life ring entry in place or something like that). More testing is needed, but the problem is definitely the lack of memory ordering. Oh, we know about this. The powerpc writel() used to have memory barriers in 2.4 kernels but not any more in 2.6 kernels. Red Hat's version of tg3 has extra wmb()'s to fix this problem. David doesn't think that the upstream version of tg3 should have these wmb()'s, and the problem should instead be fixed in powerpc's writel(). - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
Please send me lspci and tg3 probing output so that I know what tg3 hardware you're using. I also want to look at the tcpdump or ethereal on the mirrored port that shows the packet being corrupted. Hi Michael ! It's the dual controller of an Apple Quad G5, thus afaik in an HT2000 chip: 0001:05:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet (rev 03) Subsystem: Apple Computer Inc. Unknown device 0085 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR- Latency: 64 (16000ns min) Interrupt: pin A routed to IRQ 66 Region 0: Memory at fa53 (64-bit, non-prefetchable) [size=64K] Region 2: Memory at fa52 (64-bit, non-prefetchable) [size=64K] Capabilities: [40] PCI-X non-bridge device Command: DPERE- ERO- RBC=512 OST=1 Status: Dev=05:04.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz- Capabilities: [48] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable+ DSel=0 DScale=1 PME- Capabilities: [50] Vital Product Data Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable- Address: 00011aa5c8ce4904 Data: 18d8 0001:05:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet (rev 03) Subsystem: Apple Computer Inc. Unknown device 0085 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR- Latency: 16 (16000ns min), Cache Line Size: 64 bytes Interrupt: pin B routed to IRQ 67 Region 0: Memory at fa51 (64-bit, non-prefetchable) [size=64K] Region 2: Memory at fa50 (64-bit, non-prefetchable) [size=64K] Capabilities: [40] PCI-X non-bridge device Command: DPERE- ERO+ RBC=512 OST=1 Status: Dev=05:04.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz- Capabilities: [48] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=1 PME- Capabilities: [50] Vital Product Data Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable- Address: 4e001a0002804460 Data: 00a2 And the dmesg bits: tg3.c:v3.65 (August 07, 2006) eth0: Tigon3 [partno(BCM95780) rev 8003 PHY(5780)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:51:65:e6:90 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] eth0: dma_rwctrl[76144000] dma_mask[40-bit] eth1: Tigon3 [partno(BCM95780) rev 8003 PHY(5780)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:51:65:e6:91 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] eth1: dma_rwctrl[76144000] dma_mask[40-bit] As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a bit of surgery on it to get only the interesting part. I'll try to do that later today (but it may have to wait for monday). Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
On Sat, 2006-09-09 at 07:41 +1000, Benjamin Herrenschmidt wrote: As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a bit of surgery on it to get only the interesting part. I'll try to do that later today (but it may have to wait for monday). Ben, We probably don't need the tcpdump anymore now that we know it's a memory ordering issue. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
On Sat, 2006-09-09 at 07:46 +1000, Benjamin Herrenschmidt wrote: The PowerPC writel has a full sync _after_ the write, mostly to prevent it from leaking out of a spinlock, and for ordering it vs. other writel's or readl's. It doesn't provide any ordering guarantee vs cacheable storage (and was never intended to do so afaik). Such ordering shall be provided explicitely. It's possible that 2.4 used a big hammer approach but we've since been actively fixing drivers for that. It's to be noted that PowerPC might not be the only architecture affected as I don't think that in general, you have ordering guarantees between cacheable and non-cacheable stores unless you use explicit barriers. I think 2.4 might have an additional sync before the write which will guarantee that the buffer descriptor is written before telling the chip to DMA it. Thus I disagree with fixing the powerpc writel(). The barries shall definitely go into tg3. You'll have to take this up with David. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
On Fri, 2006-09-08 at 15:07 -0700, Michael Chan wrote: On Sat, 2006-09-09 at 07:41 +1000, Benjamin Herrenschmidt wrote: As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a bit of surgery on it to get only the interesting part. I'll try to do that later today (but it may have to wait for monday). Ben, We probably don't need the tcpdump anymore now that we know it's a memory ordering issue. Ok. I'm trying to figure out what's the best way with fixing that. I can see the flamewar coming on wether stores to memory vs. writel shall be ordered or not :) I'm very reluctant to add another sync instruction to our writel though. It needs one already after the stores to prevent leaking out of spinlocks (and thus possible mmio vs. mmio order issues on SMP with stores from different CPUs being re-ordered). Fixing the above would require one before the store as well. We already pay a pretty high price for that sync, having 2 would be a real shame. (Unfortunately, there is no cheap barrier available for ordering cacheable vs. non cacheable storage on PowerPC, they are completely separate domains). One option I was discussing with others would be to drop that sync after the store, and instead start requiring drivers to use mmiowb() (as defined by the ia64 folks) to provide ordering of writel's vs. locks. But that probably means breaking and then having to fix a while bunch of drivers in the tree who haven't been updated to use it... I'd rather not have to do that, or even if I go that way, not have to add that sync at all before the store and thus get back the few percent of perfs lost due to those sync's on some heavy IO benchmarks. Ben. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
On Sat, 2006-09-09 at 08:25 +1000, Benjamin Herrenschmidt wrote: Ok. I'm trying to figure out what's the best way with fixing that. I can see the flamewar coming on wether stores to memory vs. writel shall be ordered or not :) I'm very reluctant to add another sync instruction to our writel though. It needs one already after the stores to prevent leaking out of spinlocks (and thus possible mmio vs. mmio order issues on SMP with stores from different CPUs being re-ordered). Fixing the above would require one before the store as well. We already pay a pretty high price for that sync, having 2 would be a real shame. (Unfortunately, there is no cheap barrier available for ordering cacheable vs. non cacheable storage on PowerPC, they are completely separate domains). One option I was discussing with others would be to drop that sync after the store, and instead start requiring drivers to use mmiowb() (as defined by the ia64 folks) to provide ordering of writel's vs. locks. But that probably means breaking and then having to fix a while bunch of drivers in the tree who haven't been updated to use it... I'd rather not have to do that, or even if I go that way, not have to add that sync at all before the store and thus get back the few percent of perfs lost due to those sync's on some heavy IO benchmarks. Another way to fix this without requiring drivers to add all kinds of barriers in the driver code is to add a writel_sync() variant. So on powerpc, writel_sync() will have a sync before and after the write. On most other architectures, writel_sync() is the same as writel() if the ordering is guaranteed. We'll then convert tg3 and other drivers to use writel_sync() in places where they're needed. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TG3 data corruption (TSO ?)
I'd rather not have to do that, or even if I go that way, not have to add that sync at all before the store and thus get back the few percent of perfs lost due to those sync's on some heavy IO benchmarks. Another way to fix this without requiring drivers to add all kinds of barriers in the driver code is to add a writel_sync() variant. So on powerpc, writel_sync() will have a sync before and after the write. On most other architectures, writel_sync() is the same as writel() if the ordering is guaranteed. We'll then convert tg3 and other drivers to use writel_sync() in places where they're needed. I think the preferred approach for that sort of thing is to have writel be the sync version and add special relaxed version. Now there have been talks and debates about relaxed IOs but they generally map to something different, typically IOs that are relaxed vs. DMA (PCI-X/PCIe relaxed ordering options for example). Adding yet another round of IO accessors sounds like a bit nasty to me, driver writers will potentially not understand which ones to use etc... Anyway, I think I'll let Anton and Paulus argue that one for now. Cheers, Ben. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html