Re: TG3 data corruption (TSO ?)

2006-09-11 Thread Benjamin Herrenschmidt
Ok, it seems like we might have more than just the missing barrier in
TG3. Possibly some IOMMU problems on some machines as well.
Unfortunately, I don't have a tg3 on a PCI-X or PCI-E card to test on a
pSeries or some other machine.

[Olof: I've disabled the new U4 DART invalidate code (reverted to the
old one) and added an unconditional barrier to dart_flush and I yet have
to reproduce the problem. I suspect a problem with the DART invalidate
one thingy, maybe a HW problem with the U4 chip. Now regarding the
barrier in flush, we'll talk about it later, I think we might have a
problem with the way we do the DART accesses (they might leak out of the
lock) though I yet have to see that cause a problem in practice due to
the round-robin nature of our allocation algorithm.]

Ben.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-11 Thread Segher Boessenkool

#define tw32_rx_mbox(reg, val)  do { wmb();

tp-write32_rx_mbox(tp, reg, val); } while(0)

#define tw32_tx_mbox(reg, val)  do { wmb();

tp-write32_tx_mbox(tp, reg, val); } while(0)




That should do it.

I think we need those tcpdump after all.  Can you send it to me?


Looks like adding a sync to writel does fix it though... I'm trying to
figure out which specific writel in the driver makes a difference.  
I'll

then look into slicing those tcpdumps.


Adding a PowerPC sync to every writel() will slow down things by
so much that it could well just hide the problem, not solve it...

Michael, we see this problem only with TSO on, and then the failure
we see is bad data being sent out, with the correct header (but the
header is constructed by the tg3 in this case, so no surprise).

I'm theorizing that this same failure can happen with TSO off as well,
but the header sent on the wire will be bogus as well as the data,
so the receiving NIC won't pick it up.  tcpdump probably won't see
it either...  will need a low-level ethernet analyser.


Segher

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-11 Thread Michael Chan
On Mon, 2006-09-11 at 15:52 +1000, Benjamin Herrenschmidt wrote:

 Looks like adding a sync to writel does fix it though... I'm trying to
 figure out which specific writel in the driver makes a difference. I'll
 then look into slicing those tcpdumps.

During runtime in the fast path, the only writel()'s we do in tg3 are to
the tx mbox, rx_mbox, and the interrupt mbox.  The interrupt mbox
shouldn't matter that much since it has no dependencies on other memory
writes before it.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-10 Thread Benjamin Herrenschmidt
On Sun, 2006-09-10 at 22:33 -0700, Michael Chan wrote:
 Benjamin Herrenschmidt wrote:
 
  I've done:
  
  #define tw32_rx_mbox(reg, val)  do { wmb();
 tp-write32_rx_mbox(tp, reg, val); } while(0)
  #define tw32_tx_mbox(reg, val)  do { wmb();
 tp-write32_tx_mbox(tp, reg, val); } while(0)
  
 
 That should do it.
 
 I think we need those tcpdump after all.  Can you send it to me?

Looks like adding a sync to writel does fix it though... I'm trying to
figure out which specific writel in the driver makes a difference. I'll
then look into slicing those tcpdumps.

Ben.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-10 Thread Benjamin Herrenschmidt

 Oh, we know about this.  The powerpc writel() used to have memory
 barriers in 2.4 kernels but not any more in 2.6 kernels.  Red Hat's
 version of tg3 has extra wmb()'s to fix this problem.  David doesn't
 think that the upstream version of tg3 should have these wmb()'s, and
 the problem should instead be fixed in powerpc's writel().

I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still
reproduce the problem. I've also done a 2 days run without TSO enabled
without a failure (my test program normally fails after a couple of
minutes).

Thus, do you see any other code path in the driver where a
synchronisation might be missing ? Is there any case where the chip
might use data in memory before it has been told to do so  with a
mailbox write ? (There are no OWN bits that I can see in the
descriptors, thus I doubt it will use a transmit descriptor that is
half-built before the store to the mailbox allows using it) but who
knows

That leads to the question that there might be an unrelated bug in the
driver. Segher thinks we might be overriding live descriptors, though
I haven't seen how yet. It seems to be TSO specific tho... maybe some
missing smp synchronisation in the driver itself or a problem when the
TX ring is full ?

I don't have the chip docs and I'm not familiar with the driver, so I'll
keep looking, but advice is welcome. I'll also see if I can reproduce
with some other TSO capable card, in case the problem is in the kernel
TSO code and not in the driver.

Cheers,
Ben.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-10 Thread Michael Chan
Benjamin Herrenschmidt wrote:

 I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still
 reproduce the problem. I've also done a 2 days run without TSO enabled
 without a failure (my test program normally fails after a couple of
 minutes).
 

Hi Ben,

The code is a bit tricky.  It uses function pointers for the various
register read/write methods.  For the 5780, I believe it will be
assigned a simple writel() and not tg3_write32_tx_mbox().  Can you
double check to make sure you have actually added the wmb()?

It's probably easiest to just add the wmb() in tg3_xmit_dma_bug()
before the tw32_tx_mbox().

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-10 Thread Benjamin Herrenschmidt
On Sun, 2006-09-10 at 22:18 -0700, Michael Chan wrote:
 Benjamin Herrenschmidt wrote:
 
  I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still
  reproduce the problem. I've also done a 2 days run without TSO enabled
  without a failure (my test program normally fails after a couple of
  minutes).
  
 
 Hi Ben,
 
 The code is a bit tricky.  It uses function pointers for the various
 register read/write methods.  For the 5780, I believe it will be
 assigned a simple writel() and not tg3_write32_tx_mbox().  Can you
 double check to make sure you have actually added the wmb()?
 
 It's probably easiest to just add the wmb() in tg3_xmit_dma_bug()
 before the tw32_tx_mbox().

I've done:

#define tw32_rx_mbox(reg, val)  do { wmb(); tp-write32_rx_mbox(tp, reg, val); 
} while(0)
#define tw32_tx_mbox(reg, val)  do { wmb(); tp-write32_tx_mbox(tp, reg, val); 
} while(0)

Cheers,
Ben.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-10 Thread Michael Chan
Benjamin Herrenschmidt wrote:

 I've done:
 
 #define tw32_rx_mbox(reg, val)do { wmb();
tp-write32_rx_mbox(tp, reg, val); } while(0)
 #define tw32_tx_mbox(reg, val)do { wmb();
tp-write32_tx_mbox(tp, reg, val); } while(0)
 

That should do it.

I think we need those tcpdump after all.  Can you send it to me?

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-09 Thread Benjamin Herrenschmidt
On Sat, 2006-09-09 at 02:22 -0700, David Miller wrote:
 From: Benjamin Herrenschmidt [EMAIL PROTECTED]
 Date: Sat, 09 Sep 2006 07:46:02 +1000
 
  I don't think that in general, you have ordering guarantees between
  cacheable and non-cacheable stores unless you use explicit barriers.
 
 In fact, on most systems you absolutely do have ordering between
 MMIO and memory accesses.

Well, at least powerpc and ia64 don't in hw, I don't know about
others... out of order in general is getting fascionable in processor
design ... 

 So you are making an extremely poor engineering decision
 by trying to fixup all the drivers to match PowerPC's
 semantics.  I think a smart engineer would decrease his
 debugging burdon, by matching his platform's MMIO accessors
 such that it matches what other platforms do and therefore
 inheriting the testing coverage provided by all platforms.
 
 Otherwise you will be hunting down these kinds of memory
 barrier issues forever.

Well, some of you (Alan, you, etc...) seem to imply that it's always
been the rule to have a memory store followed by an MMIO write be
strongly ordered.

However, if you look at drivers like e1000, USB OHCI, or even sungem
(:-) they, all have at least wmb()'s between updating descriptor in
memory and the MMIO that triggers reading those by the chip. So it seems
that I'm not the only one to have thought otherwise ;-) More
specificaly, at least ia64 I think, like PowerPC, assumes no ordering
requirement here. So they would need fixing too.

My main problem is the cost... it's actually very expensive to do that
sort of synchronisation. I don't know for ia64 or other potentially out
of order architectures, but we do introduced a measureable performance
hit by adding the one we already have to guard against spin_unlock.

So if we decide to go the way of making writel synchronous vs. previous
MMIOs, I'd really like to have a clearly defined relaxed version as
well.

However, I'm not sure any of our current relaxed accessor have clear
semantics. At least what is implemented currently on PowerPC is the
__raw_* versions which not only have no barriers at all (they don't even
order between MMIOs, for example, readl might cross writel), and do no
endian swap. Quite a mess of semantics if you ask me... Then there has
been talks about those *_relaxed operations but those are more a match
to the relaxed PCI-X and PCI-E cycels, that is they relax ordering vs.
requests in a different direction on the bus, they have nothing to do
with storage domains on the CPU.

Ben.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-09 Thread Alan Cox
Ar Sul, 2006-09-10 am 08:36 +1000, ysgrifennodd Benjamin Herrenschmidt:
 Well, some of you (Alan, you, etc...) seem to imply that it's always
 been the rule to have a memory store followed by an MMIO write be
 strongly ordered.

It has always been the rule

 However, if you look at drivers like e1000, USB OHCI, or even sungem
 (:-) they, all have at least wmb()'s between updating descriptor in

Driver hacks to cope with platform authors who got read/writel wrong.

 semantics. At least what is implemented currently on PowerPC is the
 __raw_* versions which not only have no barriers at all (they don't even
 order between MMIOs, for example, readl might cross writel), and do no
 endian swap. Quite a mess of semantics if you ask me... Then there has

__writel/__readl seems more in keeping for just not locking.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-09 Thread Benjamin Herrenschmidt

  semantics. At least what is implemented currently on PowerPC is the
  __raw_* versions which not only have no barriers at all (they don't even
  order between MMIOs, for example, readl might cross writel), and do no
  endian swap. Quite a mess of semantics if you ask me... Then there has
 
 __writel/__readl seems more in keeping for just not locking.

Not locking... you mean not ordering I suppose. Ok, so the question is
no ordering at all (that is even between MMIO read/writes, thus a
__readl can cross a __writel), or just no ordering between MMIO and
cacheable storage ?

It's an important difference and both have their use. For example, on
PowerPC, if I completely remove barriers, I get the first semantic and I
get the ability to write combine on non-cacheable storage as a benefit
(provided we add an ioremap_wc or such, as the Guarded bit we set on
normal non-cacheable space does also prevent write combining on most
implementations). However, if I keep at least ordering between MMIOs,
then I leave an eieio in there, which is not nearly as expensive than a
full sync but will not order cacheable cs. non-cacheable. However, it
will also prevent write combine as far as I remember.

Ben.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-09 Thread David Miller
From: Benjamin Herrenschmidt [EMAIL PROTECTED]
Date: Sat, 09 Sep 2006 07:46:02 +1000

 I don't think that in general, you have ordering guarantees between
 cacheable and non-cacheable stores unless you use explicit barriers.

In fact, on most systems you absolutely do have ordering between
MMIO and memory accesses.

So you are making an extremely poor engineering decision
by trying to fixup all the drivers to match PowerPC's
semantics.  I think a smart engineer would decrease his
debugging burdon, by matching his platform's MMIO accessors
such that it matches what other platforms do and therefore
inheriting the testing coverage provided by all platforms.

Otherwise you will be hunting down these kinds of memory
barrier issues forever.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-08 Thread Michael Chan
Copying netdev.

Benjamin Herrenschmidt wrote:
 Hi !
 
 I've been chasing with Segher a data corruption problem lately.
 Basically transferring huge amount of data (several Gb) and I get
 corrupted data at the rx side. I cannot tell for sure wether what I've
 been observing here is the same problem that segher's been seing on is
 blades, he will confirm or not. He also seemed to imply that reverting
 to an older kernel on the -receiver- side fixed it, which makes me
 wonder, since it's looks really like a sending side problem (see
 explanation below), if some change in, for exmaple, window scaling,
 might hide or trigger it. 

Please send me lspci and tg3 probing output so that I know what
tg3 hardware you're using.  I also want to look at the tcpdump or
ethereal on the mirrored port that shows the packet being corrupted.

 
 Now, first, I've been playing with ssh from /dev/zero on one machine
 to /dev/zero on the other. That allowed me to run enough 
 tests all over
 the place to have some idea of where the problem comes from since ssh
 will shoke at decryption when hitting the corruption.
 
 The base setup where it happens often is 2 Quad G5's connected to a
 gigabit switch. Both were running some versions of 2.6.18-rc4 and -rc5
 (some random git actually, but see below as I've reproduced 
 the problem
 with today's git snapshot which includes the TG3 tx race fix among
 others).
 
 I have reproduced with various machines as the receiver. A sungem in a
 Dual G5 and a virtual ethernet in a Power5 partition (so the 
 packets go
 to an e1000 then routed through an AIX IO server to a virtual
 ethernet :) are good examples of variety :) I haven't tested with
 non-PowerPC machines so far. I've also never been able to 
 reproduce with
 TSO disabled on the emitting TG3's
 
 Then, I've hacked tridge socklib test program (a simple TCP server
 that pushes a known buffer and a simple TCP receiver that 
 connects to it
 and reads the data). I've added comparison of the data with what they
 are supposed to be on the receiving end. The interesting thing is that
 is much faster than ssh or whatever else I tried. ssh or rsync between
 those 2 Quad G5s give me about 35Mb/sec while I get to 
 107Mb/sec average
 with the small test program.
 
 The fun thing is, I've not been able to reproduce at all that 
 way. When
 the link is pretty much saturated, the problem doesn't occur !
 
 As soon as I introduce a small delay (some crap waiting loop) in the
 sender to slow down the throughput to about 80Mb/sec, then the problem
 starts occuring every now and then (I don't have precise frquency data
 but I get a corruption every couple of gigabytes I'd say).
 
 As for my previous tests, disabling TSO on the sending side 
 fixes it.
 
 Below is a dump of what the corruption look like. I've trimmed the
 beginning and end of the dumped packet (the receiver does 8k 
 reads). The
 0x5a are the expected data, the rest is corruption. They look like
 kernel pointers, but that isn't always the case (often though but that
 might not be relevant). The size and position within the buffer of the
 corrupted data is variable (doesn't seem to be specifically a page or
 anything nice and round like that).
 
 I've configured the switch to send all the traffic between the two
 machines to a 3rd box and then recorded it with tcpdump (the 
 spy uses
 an e1000) and I can see the corrupted data in the recorded
 traces (the exact same pattern as detected by the receiver). 
 So it seems
 very likely at this point that the corruption happens on the sending
 side. The TCP checksums are correct I assume. I don't see any error
 count on the receiving tg3 nor suspicous message in dmesg indicating
 they aren't.
 
 That's all the data I have at this point. I can't guarantee 100% that
 it's a TSO bug (it might be a bug that TSO renders visible 
 due to timing
 effects) but it looks like it since I've not reproduced yet with TSO
 disabled. I'll do an overnight test to confirm that though... 
 sometimes
 the bug can take it's time to show up ... I've seen it wait 
 20Gb before
 it kicked in. Also the fact that fully loading the machine never
 produced it is strange smells like a race.
 
 Cheers,
 Ben.
 
 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
 5a 5a 5a 5a 5a 5a 5a 5a 00 00 00 00 00 00 00 00
 2f 63 70 75 73 00 7f 7e c0 00 00 00 01 cb 70 82
 00 00 00 04 bf 1d db 4d c0 00 00 00 01 cb 92 00
 c0 00 00 01 7b fe 6d 98 c0 00 00 00 01 cb 70 91
 00 00 00 04 df 5d fe fd c0 00 00 00 01 cb 92 10
 c0 00 00 01 7b fe 6d b8 c0 00 00 00 01 cb 71 0e
 00 00 00 04 fe e2 fb cf c0 00 00 00 01 cb 92 20
 c0 00 00 01 7b fe 6d d8 c0 00 00 00 01 cb 71 1f
 00 00 00 04 73 69 ed ff c0 00 00 00 01 cb 92 30
 c0 00 00 01 7b fe 6d f8 c0 00 00 00 01 

Re: TG3 data corruption (TSO ?)

2006-09-08 Thread Segher Boessenkool

I've been chasing with Segher a data corruption problem lately.
Basically transferring huge amount of data (several Gb) and I get
corrupted data at the rx side. I cannot tell for sure wether what  
I've
been observing here is the same problem that segher's been seing  
on is
blades, he will confirm or not. He also seemed to imply that  
reverting

to an older kernel on the -receiver- side fixed it, which makes me
wonder, since it's looks really like a sending side problem (see
explanation below), if some change in, for exmaple, window scaling,
might hide or trigger it.


Please send me lspci and tg3 probing output so that I know what
tg3 hardware you're using.


I use a 5780 rev. A3, but the problem is not limited to this chip.


I also want to look at the tcpdump or
ethereal on the mirrored port that shows the packet being corrupted.


I don't have such, sorry.


That's all the data I have at this point. I can't guarantee 100% that
it's a TSO bug (it might be a bug that TSO renders visible
due to timing
effects) but it looks like it since I've not reproduced yet with TSO
disabled.


It seems to indeed to only be exposed by TSO, not actually a
bug of it /an sich/.

I've got a patch that seems so solve the problem, it needs more testing
though (maybe Ben can do this :-) ).  The problem is that there should
be quite a few wmb()'s in the code that are just not there; adding some
to tg3_set_txd() seems to fix the immediate problem but more is needed
(and I don't see why those should be needed, unless tg3_set_txd() is
updating a life ring entry in place or something like that).

More testing is needed, but the problem is definitely the lack of memory
ordering.


Segher

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-08 Thread Michael Chan
On Fri, 2006-09-08 at 21:29 +0200, Segher Boessenkool wrote:

 I've got a patch that seems so solve the problem, it needs more testing
 though (maybe Ben can do this :-) ).  The problem is that there should
 be quite a few wmb()'s in the code that are just not there; adding some
 to tg3_set_txd() seems to fix the immediate problem but more is needed
 (and I don't see why those should be needed, unless tg3_set_txd() is
 updating a life ring entry in place or something like that).
 
 More testing is needed, but the problem is definitely the lack of memory
 ordering.
 
Oh, we know about this.  The powerpc writel() used to have memory
barriers in 2.4 kernels but not any more in 2.6 kernels.  Red Hat's
version of tg3 has extra wmb()'s to fix this problem.  David doesn't
think that the upstream version of tg3 should have these wmb()'s, and
the problem should instead be fixed in powerpc's writel().

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-08 Thread Benjamin Herrenschmidt

 Please send me lspci and tg3 probing output so that I know what
 tg3 hardware you're using.  I also want to look at the tcpdump or
 ethereal on the mirrored port that shows the packet being corrupted.

Hi Michael !

It's the dual controller of an Apple Quad G5, thus afaik in an HT2000
chip:

0001:05:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 
Gigabit Ethernet (rev 03)
Subsystem: Apple Computer Inc. Unknown device 0085
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium TAbort- 
TAbort- MAbort- SERR- PERR-
Latency: 64 (16000ns min)
Interrupt: pin A routed to IRQ 66
Region 0: Memory at fa53 (64-bit, non-prefetchable) [size=64K]
Region 2: Memory at fa52 (64-bit, non-prefetchable) [size=64K]
Capabilities: [40] PCI-X non-bridge device
Command: DPERE- ERO- RBC=512 OST=1
Status: Dev=05:04.0 64bit+ 133MHz+ SCD- USC- DC=simple 
DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
Capabilities: [48] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
Capabilities: [50] Vital Product Data
Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 
Enable-
Address: 00011aa5c8ce4904  Data: 18d8

0001:05:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 
Gigabit Ethernet (rev 03)
Subsystem: Apple Computer Inc. Unknown device 0085
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium TAbort- 
TAbort- MAbort- SERR- PERR-
Latency: 16 (16000ns min), Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 67
Region 0: Memory at fa51 (64-bit, non-prefetchable) [size=64K]
Region 2: Memory at fa50 (64-bit, non-prefetchable) [size=64K]
Capabilities: [40] PCI-X non-bridge device
Command: DPERE- ERO+ RBC=512 OST=1
Status: Dev=05:04.1 64bit+ 133MHz+ SCD- USC- DC=simple 
DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
Capabilities: [48] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] Vital Product Data
Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 
Enable-
Address: 4e001a0002804460  Data: 00a2

And the dmesg bits:

tg3.c:v3.65 (August 07, 2006)
eth0: Tigon3 [partno(BCM95780) rev 8003 PHY(5780)] (PCIX:133MHz:64-bit) 
10/100/1000BaseT Ethernet 00:14:51:65:e6:90
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
eth0: dma_rwctrl[76144000] dma_mask[40-bit]
eth1: Tigon3 [partno(BCM95780) rev 8003 PHY(5780)] (PCIX:133MHz:64-bit) 
10/100/1000BaseT Ethernet 00:14:51:65:e6:91
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[76144000] dma_mask[40-bit]

As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a 
bit of surgery on it to
get only the interesting part. I'll try to do that later today (but it may have 
to wait for monday).

Cheers,
Ben.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-08 Thread Michael Chan
On Sat, 2006-09-09 at 07:41 +1000, Benjamin Herrenschmidt wrote:

 As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a 
 bit of surgery on it to
 get only the interesting part. I'll try to do that later today (but it may 
 have to wait for monday).
 
Ben, We probably don't need the tcpdump anymore now that we know it's a
memory ordering issue.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-08 Thread Michael Chan
On Sat, 2006-09-09 at 07:46 +1000, Benjamin Herrenschmidt wrote:

 The PowerPC writel has a full sync _after_ the write, mostly to prevent
 it from leaking out of a spinlock, and for ordering it vs. other
 writel's or readl's. It doesn't provide any ordering guarantee vs
 cacheable storage (and was never intended to do so afaik). Such ordering
 shall
 be provided explicitely. It's possible that 2.4 used a big hammer
 approach but we've since been actively fixing drivers for that. It's to
 be noted that PowerPC might not be the only architecture affected as I
 don't think that in general, you have ordering guarantees between
 cacheable and non-cacheable stores unless you use explicit barriers.

I think 2.4 might have an additional sync before the write which will
guarantee that the buffer descriptor is written before telling the chip
to DMA it.

 
 Thus I disagree with fixing the powerpc writel(). The barries shall
 definitely go into tg3.
 

You'll have to take this up with David.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-08 Thread Benjamin Herrenschmidt
On Fri, 2006-09-08 at 15:07 -0700, Michael Chan wrote:
 On Sat, 2006-09-09 at 07:41 +1000, Benjamin Herrenschmidt wrote:
 
  As for the tcpdump output, well, I have a 3Gb file for now :) I need to do 
  a bit of surgery on it to
  get only the interesting part. I'll try to do that later today (but it may 
  have to wait for monday).
  
 Ben, We probably don't need the tcpdump anymore now that we know it's a
 memory ordering issue.

Ok. I'm trying to figure out what's the best way with fixing that. I can
see the flamewar coming on wether stores to memory vs. writel shall be
ordered or not :)

I'm very reluctant to add another sync instruction to our writel though.
It needs one already after the stores to prevent leaking out of
spinlocks (and thus possible mmio vs. mmio order issues on SMP with
stores from different CPUs being re-ordered). Fixing the above would
require one before the store as well. We already pay a pretty high price
for that sync, having 2 would be a real shame.

(Unfortunately, there is no cheap barrier available for ordering
cacheable vs. non cacheable storage on PowerPC, they are completely
separate domains).

One option I was discussing with others would be to drop that sync after
the store, and instead start requiring drivers to use mmiowb() (as
defined by the ia64 folks) to provide ordering of writel's vs. locks.
But that probably means breaking and then having to fix a while bunch of
drivers in the tree who haven't been updated to use it...

I'd rather not have to do that, or even if I go that way, not have to
add that sync at all before the store and thus get back the few percent
of perfs lost due to those sync's on some heavy IO benchmarks.

Ben.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-08 Thread Michael Chan
On Sat, 2006-09-09 at 08:25 +1000, Benjamin Herrenschmidt wrote:
 Ok. I'm trying to figure out what's the best way with fixing that. I can
 see the flamewar coming on wether stores to memory vs. writel shall be
 ordered or not :)
 
 I'm very reluctant to add another sync instruction to our writel though.
 It needs one already after the stores to prevent leaking out of
 spinlocks (and thus possible mmio vs. mmio order issues on SMP with
 stores from different CPUs being re-ordered). Fixing the above would
 require one before the store as well. We already pay a pretty high price
 for that sync, having 2 would be a real shame.
 
 (Unfortunately, there is no cheap barrier available for ordering
 cacheable vs. non cacheable storage on PowerPC, they are completely
 separate domains).
 
 One option I was discussing with others would be to drop that sync after
 the store, and instead start requiring drivers to use mmiowb() (as
 defined by the ia64 folks) to provide ordering of writel's vs. locks.
 But that probably means breaking and then having to fix a while bunch of
 drivers in the tree who haven't been updated to use it...
 
 I'd rather not have to do that, or even if I go that way, not have to
 add that sync at all before the store and thus get back the few percent
 of perfs lost due to those sync's on some heavy IO benchmarks.
 
Another way to fix this without requiring drivers to add all kinds of
barriers in the driver code is to add a writel_sync() variant.  So on
powerpc, writel_sync() will have a sync before and after the write.  On
most other architectures, writel_sync() is the same as writel() if the
ordering is guaranteed.  We'll then convert tg3 and other drivers to use
writel_sync() in places where they're needed.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TG3 data corruption (TSO ?)

2006-09-08 Thread Benjamin Herrenschmidt

  I'd rather not have to do that, or even if I go that way, not have to
  add that sync at all before the store and thus get back the few percent
  of perfs lost due to those sync's on some heavy IO benchmarks.
  
 Another way to fix this without requiring drivers to add all kinds of
 barriers in the driver code is to add a writel_sync() variant.  So on
 powerpc, writel_sync() will have a sync before and after the write.  On
 most other architectures, writel_sync() is the same as writel() if the
 ordering is guaranteed.  We'll then convert tg3 and other drivers to use
 writel_sync() in places where they're needed.

I think the preferred approach for that sort of thing is to have writel
be the sync version and add special relaxed version. Now there have
been talks and debates about relaxed IOs but they generally map to
something different, typically IOs that are relaxed vs. DMA (PCI-X/PCIe
relaxed ordering options for example).

Adding yet another round of IO accessors sounds like a bit nasty to me,
driver writers will potentially not understand which ones to use etc...

Anyway, I think I'll let Anton and Paulus argue that one for now.

Cheers,
Ben.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html