Re: slab corruption in skb allocs

2005-03-21 Thread Andrew Morton
Richard Fuchs <[EMAIL PROTECTED]> wrote:
>
> the memory allocation debugger gives me the following messages under a
> vanilla 2.6.10 and 2.6.11 kernel when doing
> 
> 1) hdparm -d0 on my hard disk
> 2) tar c / > /dev/null
> 3) sending lots of network traffic to the machine (e.g. close to 100
> mbit/s udp packets)
> 

We ended up deciding that this was a bug in the e100 NAPI implementation.

I have a not-very-official patch in -mm, at
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.12-rc1/2.6.12-rc1-mm1/broken-out/e100-napi-state-machine-fix.patch.
Would you be able to test that?

AFAIK there has been no official fix for this yet.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slab corruption in skb allocs

2005-03-07 Thread Richard Fuchs
Scott Feldman wrote:
> Would you mind giving this patch a try against 2.6.11?  I think it's
> equivalent to Jesse's patch, but less intrusive to the driver.

looks good, no more memory corruption errors. thanks for this.

cheers
richard


Re: slab corruption in skb allocs

2005-03-06 Thread Scott Feldman
On Mar 6, 2005, at 10:40 AM, Richard Fuchs wrote:
> Scott Feldman wrote:
>> A bug in the driver.  I have a hunch: please try this patch with
>> 2.6.9 or higher:
>>
>> http://marc.theaimsgroup.com/?l=linux-netdev&m=110726809431611&w=2
>
> bingo, that fixes it. too bad neither this patch nor the removal of
> the NAPI config option made it into 2.6.11...

Jesse Brandeburg @ Intel found the fix for the bug but I don't think 
it's been pushed out to Jeff's tree yet, AFAIK.  Soon, I would guess.

Would you mind giving this patch a try against 2.6.11?  I think it's 
equivalent to Jesse's patch, but less intrusive to the driver.

--- linux-2.6.11/drivers/net/e100.c.orig	Sun Mar  6 20:58:15 2005
+++ linux-2.6.11/drivers/net/e100.c	Sun Mar  6 21:01:34 2005
@@ -1471,8 +1471,12 @@ static inline int e100_rx_indicate(struc
 
 	/* If data isn't ready, nothing to indicate */
 	if(unlikely(!(rfd_status & cb_complete)))
-		return -EAGAIN;
+		return -ENODATA;
 
+	/* This allows for a fast restart without re-enabling interrupts */
+	if(le16_to_cpu(rfd->command) & cb_el)
+		nic->ru_running = 0;
+
 	/* Get actual data size */
 	actual_size = le16_to_cpu(rfd->actual_size) & 0x3FFF;
 	if(unlikely(actual_size > RFD_BUF_LEN - sizeof(struct rfd)))
@@ -1527,7 +1531,11 @@ static inline void e100_rx_clean(struct
 			break; /* Better luck next time (see watchdog) */
 	}
 
-	e100_start_receiver(nic);
+	/* NAPI: attempt to restart the receiver iff the list is
+	 * totally clean otherwise we'll race between hardware and
+	 * nic->rx_to_clean. */
+	if(!work_done || *work_done == 0)
+		e100_start_receiver(nic);
 }
 
 static void e100_rx_clean_list(struct nic *nic)

>> No.  e1000 is a totally different driver/device with very similar
>> name.
>
> too bad, i was hoping for an explanation for some unexplainable
> crashes i've been experiencing... ;)

Take the e1000 issue to linux-netdev.

-scott
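
The control flow changed by the two hunks of the patch can be paraphrased in a few lines. The Python model below is only a sketch of that logic, not driver code: `Nic`, `rx_indicate` and `should_restart_receiver` are hypothetical stand-ins for the C functions, and the bit values are assumed for illustration. It shows the new return code for a descriptor that is not yet complete, the clearing of `ru_running` when the end-of-list bit is seen, and the rule that the receiver is restarted only when no packets were processed in this pass.

```python
import errno

# Assumed bit values, chosen for this sketch only.
CB_COMPLETE = 0x8000  # status bit: hardware has finished with the descriptor
CB_EL = 0x8000        # command bit: descriptor is the end of the list


class Nic:
    """Hypothetical stand-in for the driver's per-device state."""

    def __init__(self):
        self.ru_running = True


def rx_indicate(nic, rfd_status, rfd_command):
    """Model of the patched check at the top of e100_rx_indicate()."""
    if not (rfd_status & CB_COMPLETE):
        # Nothing ready yet: the patch returns -ENODATA instead of -EAGAIN.
        return -errno.ENODATA
    if rfd_command & CB_EL:
        # End of list reached: remember that the receive unit has stopped.
        nic.ru_running = False
    return 0


def should_restart_receiver(work_done):
    """Model of `if(!work_done || *work_done == 0)`.

    `work_done` is None when the caller passes no counter (the NULL
    pointer case in C), otherwise the number of packets handled so far.
    """
    return work_done is None or work_done == 0


if __name__ == "__main__":
    nic = Nic()
    print(rx_indicate(nic, 0x0000, 0x0000), nic.ru_running)
    print(rx_indicate(nic, CB_COMPLETE, CB_EL), nic.ru_running)
    print([should_restart_receiver(n) for n in (None, 0, 3)])
```

Restarting only when nothing was processed is what the patch comment describes as avoiding a race between the hardware and `nic->rx_to_clean`.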


Re: slab corruption in skb allocs

2005-03-06 Thread Richard Fuchs
Scott Feldman wrote:
> On Mar 5, 2005, at 11:10 AM, Richard Fuchs wrote:
>
>> looks like you are right, enabling NAPI in 2.6.7 does trigger this.
>> what exactly is this?
>
> A bug in the driver.  I have a hunch: please try this patch with 2.6.9
> or higher:
>
> http://marc.theaimsgroup.com/?l=linux-netdev&m=110726809431611&w=2

bingo, that fixes it. too bad neither this patch nor the removal of the
NAPI config option made it into 2.6.11...

>> also, does this affect the e1000 driver in any way?
>
> No.  e1000 is a totally different driver/device with very similar name.

too bad, i was hoping for an explanation for some unexplainable crashes
i've been experiencing... ;)

cheers
richard


Re: slab corruption in skb allocs

2005-03-06 Thread Scott Feldman
On Mar 5, 2005, at 11:10 AM, Richard Fuchs wrote:
> Scott Feldman wrote:
>> Was NAPI turned on for e100 in 2.6.7?  If not, turn NAPI on in the
>> 2.6.7 driver and see if you get the same result.  If you do, it's
>> very likely the bug is in the e100 driver's NAPI implementation.
>
> looks like you are right, enabling NAPI in 2.6.7 does trigger this.
> what exactly is this?

A bug in the driver.  I have a hunch: please try this patch with 2.6.9
or higher:

http://marc.theaimsgroup.com/?l=linux-netdev&m=110726809431611&w=2

> i didn't enable NAPI in any of the newer kernel versions i was trying,
> so i'm somewhat confused. :)

NAPI is the only option for new kernels.  2.6.7 had both NAPI and
non-NAPI.

> also, does this affect the e1000 driver in any way?

No.  e1000 is a totally different driver/device with very similar name.

-scott


Re: slab corruption in skb allocs

2005-03-05 Thread Richard Fuchs
Scott Feldman wrote:
> Was NAPI turned on for e100 in 2.6.7?  If not, turn NAPI on in the 2.6.7
> driver and see if you get the same result.  If you do, it's very likely
> the bug is in the e100 driver's NAPI implementation.

looks like you are right, enabling NAPI in 2.6.7 does trigger this.
what exactly is this? i didn't enable NAPI in any of the newer kernel 
versions i was trying, so i'm somewhat confused. :)  also, does this 
affect the e1000 driver in any way?

cheers
richard


Re: slab corruption in skb allocs

2005-03-05 Thread Scott Feldman
On Mar 4, 2005, at 4:23 AM, Richard Fuchs wrote:
> kernel 2.6.7 doesn't show this behavior, while all kernels from 2.6.9
> and up do. (i didn't test 2.6.8.x).

Was NAPI turned on for e100 in 2.6.7?  If not, turn NAPI on in the 
2.6.7 driver and see if you get the same result.  If you do, it's very 
likely the bug is in the e100 driver's NAPI implementation.

-scott


Re: slab corruption in skb allocs

2005-03-04 Thread Matt Mackall
On Fri, Mar 04, 2005 at 10:19:21PM +0100, Richard Fuchs wrote:
> _correction_ to my previous mail, this does _not_ happen with the 
> eepro100 driver. (sorry for the confusion, i got the kernel images mixed 
> up with all the testing i've been doing.)
> 
> could this affect the e1000 driver as well?

Yes. 

> >Send the output of ethtool, please.

Doh. 'ethtool -k' is what's needed, sorry.

If it's reproducible, try turning off rx/tx hardware checksumming:

ethtool -k eth0 rx off tx off

-- 
Mathematics is the supreme nostalgia of our time.


Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
Matt Mackall wrote:
> Doh. 'ethtool -k' is what's needed, sorry.

doh myself. :) this won't be very helpful though, as i get the same on
all machines (with both drivers):

Offload parameters for eth0:
Cannot get device rx csum settings: Operation not supported
Cannot get device tx csum settings: Operation not supported
Cannot get device scatter-gather settings: Operation not supported
Cannot get device tcp segmentation offload settings: Operation not supported
no offload info available

> ethtool -k eth0 rx off tx off

ditto. i'll try to reproduce this on a machine with e1000 though...

cheers
richard
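
A note on the commands in this exchange: in ethtool, lowercase `-k` only queries offload settings, while uppercase `-K` changes them, so the form that would actually turn checksumming off is `ethtool -K eth0 rx off tx off`. The output quoted above can also be checked mechanically. The Python sketch below is an illustration only; the function name `unsupported_offloads` is made up for this example, and it assumes the exact message wording shown in the report.

```python
REPORT = """\
Offload parameters for eth0:
Cannot get device rx csum settings: Operation not supported
Cannot get device tx csum settings: Operation not supported
Cannot get device scatter-gather settings: Operation not supported
Cannot get device tcp segmentation offload settings: Operation not supported
no offload info available
"""

PREFIX = "Cannot get device "
SUFFIX = " settings: Operation not supported"


def unsupported_offloads(text):
    """Return the offload features the driver refused to report on."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(PREFIX) and line.endswith(SUFFIX):
            features.append(line[len(PREFIX):-len(SUFFIX)])
    return features


if __name__ == "__main__":
    for feature in unsupported_offloads(REPORT):
        print("not supported:", feature)
```

If every feature is reported as unsupported, as here, the driver exposes no offload controls through ethtool, which is why the checksum-offload theory is dropped later in the thread.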


Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
_correction_ to my previous mail, this does _not_ happen with the 
eepro100 driver. (sorry for the confusion, i got the kernel images mixed 
up with all the testing i've been doing.)

could this affect the e1000 driver as well?

Matt Mackall wrote:
> Send the output of ethtool, please.

box 1, affected:
Settings for eth0:
Supported ports: [ TP MII ]
Supported link modes:   10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Full
Port: MII
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Current message level: 0x20c1 (8385)
Link detected: yes

box 2, affected:
Supported ports: [ TP MII ]
Supported link modes:   10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Full
Port: MII
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: g
Current message level: 0x0007 (7)
Link detected: yes

box 3, not affected:
Supported ports: [ TP MII ]
Supported link modes:   10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Full
Port: MII
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: g
Current message level: 0x0007 (7)
Link detected: yes

> This tends to be checksum
> offloading not working as it should or the like. Can you repeat this
> with bulk ssh traffic?

yes, with various strange effects:

Received disconnect from 195.58.172.154: 2: Bad packet length 919251405.

or

Received disconnect from 195.58.172.154: 2: Corrupted MAC on input.

cheers
richard


Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
Matt Mackall wrote:
> Which card/driver is this? Is this the same card that's showing ssh
> troubles? My theory about your ssh trouble only applies to cards with
> checksum offload.

i got the same on all three machines i was testing with, with both the 
e100 and the eepro100 driver. one of those three machines was the one 
with the ssh troubles, its card is identified as "Intel Corp. 82557/8/9 
[Ethernet Pro 100] (rev 08)", pci id 8086:1229. plus, i couldn't 
reproduce those problems on a machine with e1000, which does support all 
kinds of checksum offloading. (there might still be something fishy with 
the e1000 as well, as i'm not entirely trusting the errors from the slab 
checkers alone. especially since i don't see those messages when i 
enable page alloc debugging.)

another machine behaves even more strangely... its nic is identified as 
"Intel Corp. 82801BD PRO/100 VE (LOM) Ethernet Controller (rev 81)", pci 
id 8086:1039, also apparently not supporting hardware checksums. it does 
immediately produce the slab debug errors when i bombard it with udp 
packets while having disk access w/o dma, but remains silent when doing 
the same with a tcp transfer instead of udp packets. neither ssh traffic 
nor /dev/zero piped through netcat (no matter in which direction) makes 
it catch any errors. i only got a _single_ message from the slab 
debugger when sending /dev/zero through netcat in _both_ directions at 
the same time (in and out). however, i do get pages and pages of those 
messages when sending a simple stream of udp packets to the box... 
again, this is all with the e100 driver, i couldn't produce any similar 
results with the eepro100 or the e1000 driver yet, but apparently this 
doesn't necessarily mean that there isn't something wrong anyway...

cheers
richard


Re: slab corruption in skb allocs

2005-03-04 Thread Matt Mackall
On Fri, Mar 04, 2005 at 10:52:38PM +0100, Richard Fuchs wrote:
> Matt Mackall wrote:
> 
> >Doh. 'ethtool -k' is what's needed, sorry.
> 
> doh myself. :) this won't be very helpful though, as i get the same on 
> all machines (with both drivers):
> 
> Offload parameters for eth0:
> Cannot get device rx csum settings: Operation not supported
> Cannot get device tx csum settings: Operation not supported
> Cannot get device scatter-gather settings: Operation not supported
> Cannot get device tcp segmentation offload settings: Operation not supported
> no offload info available

Which card/driver is this? Is this the same card that's showing ssh
troubles? My theory about your ssh trouble only applies to cards with
checksum offload.

-- 
Mathematics is the supreme nostalgia of our time.


Re: slab corruption in skb allocs

2005-03-04 Thread Matt Mackall
On Fri, Mar 04, 2005 at 01:23:48PM +0100, Richard Fuchs wrote:
> Andrew Morton wrote:
> 
> >I guess it could be hardware.  But given that disabling DMA _causes_ the
> >problem, rather than fixes it, it seems unlikely.
> >
> >Could you enable CONFIG_DEBUG_PAGEALLOC in .config and see if that triggers
> >an oops?
> 
> by now, i could reproduce this on two different machines with quite 
> different hardware, while a third doesn't seem to show those symptoms. 
> on the second machine, i got the corruption errors from the slab 
> debugger mostly from the disk access alone, the network traffic was only 
> minimal (but still present). i was doing write operations on the hdd in 
> this test.
> 
> kernel 2.6.7 doesn't show this behavior, while all kernels from 2.6.9 
> and up do. (i didn't test 2.6.8.x).
> 
> as for DEBUG_PAGEALLOC... when i enable this option, the errors from 
> DEBUG_SLAB magically disappear. however, my ssh session got disconnected 
> once while doing the disk access with the message:
> 
> Received disconnect from 195.58.172.154: 2: Bad packet length 4239103034.

Send the output of ethtool, please. This tends to be checksum
offloading not working as it should or the like. Can you repeat this
with bulk ssh traffic?

-- 
Mathematics is the supreme nostalgia of our time.


Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
Richard Fuchs wrote:
> [e100]
> i will try again the eepro100 driver and see if it does the same...

yes, the same thing happens with the eepro100 driver.

cheers
richard


Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
Dave Jones wrote:
> Which network drivers are in use on the box that gets the corruption ?

all three that i tested it on are using the e100 driver. the boxes with 
pci id 8086:1039 and 8086:1229 are seeing corruptions, the one with pci 
id 8086:2449 is not.

i will try again the eepro100 driver and see if it does the same...
cheers
richard


Re: slab corruption in skb allocs

2005-03-04 Thread Dave Jones
On Fri, Mar 04, 2005 at 10:55:31AM +0100, Richard Fuchs wrote:
 > hello all!
 > 
 > the memory allocation debugger gives me the following messages under a
 > vanilla 2.6.10 and 2.6.11 kernel when doing
 > 
 > 1) hdparm -d0 on my hard disk
 > 2) tar c / > /dev/null
 > 3) sending lots of network traffic to the machine (e.g. close to 100
 > mbit/s udp packets)

Which network drivers are in use on the box that gets the corruption ?

Dave



Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
Andrew Morton wrote:
> I guess it could be hardware.  But given that disabling DMA _causes_ the
> problem, rather than fixes it, it seems unlikely.
>
> Could you enable CONFIG_DEBUG_PAGEALLOC in .config and see if that triggers
> an oops?

by now, i could reproduce this on two different machines with quite 
different hardware, while a third doesn't seem to show those symptoms. 
on the second machine, i got the corruption errors from the slab 
debugger mostly from the disk access alone, the network traffic was only 
minimal (but still present). i was doing write operations on the hdd in 
this test.

kernel 2.6.7 doesn't show this behavior, while all kernels from 2.6.9 
and up do. (i didn't test 2.6.8.x).

as for DEBUG_PAGEALLOC... when i enable this option, the errors from 
DEBUG_SLAB magically disappear. however, my ssh session got disconnected 
once while doing the disk access with the message:

Received disconnect from 195.58.172.154: 2: Bad packet length 4239103034.

never seen this before and not sure if this has anything to do with it...

cheers
richard


Re: slab corruption in skb allocs

2005-03-04 Thread Andrew Morton
Richard Fuchs <[EMAIL PROTECTED]> wrote:
>
> hello all!
> 
> the memory allocation debugger gives me the following messages under a
> vanilla 2.6.10 and 2.6.11 kernel when doing
> 
> 1) hdparm -d0 on my hard disk
> 2) tar c / > /dev/null
> 3) sending lots of network traffic to the machine (e.g. close to 100
> mbit/s udp packets)
> 
> -
> Slab corruption: start=de9141a4, len=2048
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
> 010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3b c0
> 020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
> 030: 00 df 08 00 45 00 00 1c 41 d0 40 00 40 11 33 78
> 040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
> 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
> ...
> 
> and so on. the disk activity alone or the network traffic alone doesn't
> trigger this. also doing the same with dma enabled doesn't trigger this
> either, but when everything comes together i get this within a second.
> kernel is not smp and preempt is not enabled.
> 
> kernel config (from 2.6.11) is attached; if you need any more info, let
> me know. is this a kernel issue, or could the hardware be at fault?

I guess it could be hardware.  But given that disabling DMA _causes_ the
problem, rather than fixes it, it seems unlikely.

Could you enable CONFIG_DEBUG_PAGEALLOC in .config and see if that triggers
an oops?



slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
hello all!

the memory allocation debugger gives me the following messages under a
vanilla 2.6.10 and 2.6.11 kernel when doing

1) hdparm -d0 on my hard disk
2) tar c / > /dev/null
3) sending lots of network traffic to the machine (e.g. close to 100
mbit/s udp packets)

-
Slab corruption: start=de9141a4, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3b c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c 41 d0 40 00 40 11 33 78
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Next obj: start=de9149b0, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=de92e8b0, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3c c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c 41 cf 40 00 40 11 33 79
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Prev obj: start=de92e0a4, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=de92f0bc, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=def5e3a4, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3b c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c e3 14 40 00 40 11 92 33
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Next obj: start=def5ebb0, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=de938b30, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3c c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c e3 13 40 00 40 11 92 34
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Prev obj: start=de938324, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=de93933c, len=2048
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [](alloc_skb+0x47/0xf0)
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 20 a0 00 00 42 cd a9 1e ff ff ff ff 3c c0
Slab corruption: start=de96aa30, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3b c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c 0d e4 40 00 40 11 67 64
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Prev obj: start=de96a224, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c03b8163>](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=de96b23c, len=2048
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [](alloc_skb+0x47/0xf0)
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 00 00 00 00 b6 81 15 1f ff ff ff ff 00 00
Slab corruption: start=de8fa5a4, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3c c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c 0d e3 40 00 40 11 67 65
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Next obj: start=de8fadb0, len=2048
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [c03b7e97](alloc_skb+0x47/0xf0)
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 00 00 00 00 ce 96 92 1e ff ff ff ff 00 00
-
and so on. the disk activity alone or the network traffic alone doesn't
trigger this. also doing the same with dma enabled doesn't trigger this
either, but when everything comes together i get this within a second.
kernel is not smp and preempt is not enabled.
kernel config (from 2.6.11) is attached; if you need any more info, let
me know. is this a kernel issue, or could the hardware be at fault?
cheers
richard
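
A note for readers trying to interpret the dump: the non-poison bytes that begin at offset 0x22 of each corrupted object appear to parse as a VLAN-tagged Ethernet frame carrying an empty UDP datagram. The sketch below decodes the object at start=de938b30. The helper names `parse_dump` and `decode_frame`, and the assumption that the frame starts at offset 0x22, are illustrative and not part of the original report.

```python
import struct

# Bytes 0x10..0x5f of the corrupted object at start=de938b30, copied from the dump.
DUMP = """
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3c c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c e3 13 40 00 40 11 92 34
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
"""


def parse_dump(text):
    """Return {offset: byte value} for every byte listed in the dump."""
    mem = {}
    for line in text.strip().splitlines():
        off, _, rest = line.partition(":")
        base = int(off, 16)
        for i, tok in enumerate(rest.split()):
            mem[base + i] = int(tok, 16)
    return mem


def decode_frame(mem, start=0x22):
    """Decode Ethernet + 802.1Q + IPv4 + UDP headers (start offset is an assumption)."""
    raw = bytes(mem[o] for o in range(start, max(mem) + 1))
    dst, src, tpid, tci, ethertype = struct.unpack("!6s6sHHH", raw[:18])
    ip = raw[18:38]
    (_ver_ihl, _tos, total_len, _ident, _frag,
     _ttl, proto, _csum, saddr, daddr) = struct.unpack("!BBHHHBBH4s4s", ip)
    # One's-complement sum over the whole IPv4 header; 0xffff means it verifies.
    s = sum(struct.unpack("!10H", ip))
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    sport, dport, udp_len, _udp_csum = struct.unpack("!HHHH", raw[38:46])
    return {
        "dst_mac": dst.hex(":"),
        "src_mac": src.hex(":"),
        "tpid": tpid,
        "vlan": tci & 0x0FFF,
        "ethertype": ethertype,
        "ip_total_len": total_len,
        "ip_proto": proto,
        "ip_checksum_ok": s == 0xFFFF,
        "src_ip": ".".join(str(b) for b in saddr),
        "dst_ip": ".".join(str(b) for b in daddr),
        "sport": sport,
        "dport": dport,
        "udp_len": udp_len,
    }


if __name__ == "__main__":
    for key, value in decode_frame(parse_dump(DUMP)).items():
        print(f"{key}: {value}")
```

If that reading is right, the overwritten region (0x22 through 0x5d) is 60 bytes: 46 bytes of headers plus 14 bytes of zero padding, which is the minimum Ethernet frame size without the FCS. That would be consistent with a received UDP test packet being written into the buffer after it was freed, though the dump alone does not prove who wrote it.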

#
# Automatically generated make config: don't edit
# 

slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
hello all!
the memory allocation debugger gives me the following messages under a
vanilla 2.6.10 and 2.6.11 kernel when doing
1) hdparm -d0 on my hard disk
2) tar c / > /dev/null
3) sending lots of network traffic to the machine (e.g. close to 100
mbit/s udp packets)
-
Slab corruption: start=de9141a4, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3b c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c 41 d0 40 00 40 11 33 78
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Next obj: start=de9149b0, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=de92e8b0, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3c c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c 41 cf 40 00 40 11 33 79
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Prev obj: start=de92e0a4, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=de92f0bc, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=def5e3a4, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3b c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c e3 14 40 00 40 11 92 33
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Next obj: start=def5ebb0, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=de938b30, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3c c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c e3 13 40 00 40 11 92 34
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Prev obj: start=de938324, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=de93933c, len=2048
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [c03b7e97](alloc_skb+0x47/0xf0)
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 20 a0 00 00 42 cd a9 1e ff ff ff ff 3c c0
Slab corruption: start=de96aa30, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3b c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c 0d e4 40 00 40 11 67 64
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Prev obj: start=de96a224, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Next obj: start=de96b23c, len=2048
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [c03b7e97](alloc_skb+0x47/0xf0)
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 00 00 00 00 b6 81 15 1f ff ff ff ff 00 00
Slab corruption: start=de8fa5a4, len=2048
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [c03b8163](kfree_skbmem+0x13/0x30)
010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3c c0
020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
030: 00 df 08 00 45 00 00 1c 0d e3 40 00 40 11 67 65
040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
Next obj: start=de8fadb0, len=2048
Redzone: 0x170fc2a5/0x170fc2a5.
Last user: [c03b7e97](alloc_skb+0x47/0xf0)
000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 00 00 00 00 ce 96 92 1e ff ff ff ff 00 00
-
and so on. the disk activity alone or the network traffic alone doesn't
trigger this. also doing the same with dma enabled doesn't trigger this
either, but when everything comes together i get this within a second.
kernel is not smp and preempt is not enabled.
kernel config (from 2.6.11) is attached; if you need any more info, let
me know. is this a 

Re: slab corruption in skb allocs

2005-03-04 Thread Andrew Morton
Richard Fuchs [EMAIL PROTECTED] wrote:

> hello all!
>
> the memory allocation debugger gives me the following messages under a
> vanilla 2.6.10 and 2.6.11 kernel when doing
>
> 1) hdparm -d0 on my hard disk
> 2) tar c / > /dev/null
> 3) sending lots of network traffic to the machine (e.g. close to 100
> mbit/s udp packets)
>
> -
> Slab corruption: start=de9141a4, len=2048
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [c03b8163](kfree_skbmem+0x13/0x30)
> 010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3b c0
> 020: 6b 6b 00 0b cd 1e 1f d2 00 04 23 01 c7 6f 81 00
> 030: 00 df 08 00 45 00 00 1c 41 d0 40 00 40 11 33 78
> 040: c0 a8 22 1d c0 a8 22 1b 80 52 30 18 00 08 89 ea
> 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6b 6b
> ...
>
> and so on. the disk activity alone or the network traffic alone doesn't
> trigger this. also doing the same with dma enabled doesn't trigger this
> either, but when everything comes together i get this within a second.
> kernel is not smp and preempt is not enabled.
>
> kernel config (from 2.6.11) is attached; if you need any more info, let
> me know. is this a kernel issue, or could the hardware be at fault?

I guess it could be hardware.  But given that disabling DMA _causes_ the
problem, rather than fixes it, it seems unlikely.

Could you enable CONFIG_DEBUG_PAGEALLOC in .config and see if that triggers
an oops?
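
Background for reading the dumps quoted above: with slab debugging enabled, a freed object is filled with a poison byte and checked again later, so any byte that is no longer poison was written after the free. The sketch below flags such bytes in one dump line. The constant values (0x6b for freed memory, 0x5a for allocated but uninitialised memory, and the two redzone words) are the ones I recall from the 2.6-era mm/slab.c and should be checked against the source; the function names are hypothetical.

```python
# Values as recalled from 2.6-era mm/slab.c (an assumption; verify against the tree).
POISON_INUSE = 0x5A      # fill for allocated, not yet initialised objects
POISON_FREE = 0x6B       # fill for freed objects
RED_INACTIVE = 0x5A2CF071  # redzone word while the object is free
RED_ACTIVE = 0x170FC2A5    # redzone word while the object is allocated


def redzone_state(word):
    """Map a redzone word from the 'Redzone:' line to a readable state."""
    if word == RED_INACTIVE:
        return "free"
    if word == RED_ACTIVE:
        return "allocated"
    return "corrupt"


def overwritten_runs(line, poison=POISON_FREE):
    """Return (first_offset, last_offset) runs of non-poison bytes in one dump line."""
    off, _, rest = line.partition(":")
    base = int(off, 16)
    runs, run_start, last = [], None, None
    for i, tok in enumerate(rest.split()):
        addr = base + i
        if int(tok, 16) != poison:
            if run_start is None:
                run_start = addr
            last = addr
        elif run_start is not None:
            runs.append((run_start, last))
            run_start = None
    if run_start is not None:
        runs.append((run_start, last))
    return runs


if __name__ == "__main__":
    sample = "010: 6b 6b 20 a0 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 3b c0"
    for first, last in overwritten_runs(sample):
        print(f"overwritten: 0x{first:03x}..0x{last:03x}")
    print(redzone_state(0x5A2CF071))
```

Read this way, the "Slab corruption" objects in the report carry the free-state redzone and were last touched by kfree_skbmem, yet contain non-poison bytes, which is what the debugger is complaining about.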



Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
Andrew Morton wrote:
> I guess it could be hardware.  But given that disabling DMA _causes_ the
> problem, rather than fixes it, it seems unlikely.
>
> Could you enable CONFIG_DEBUG_PAGEALLOC in .config and see if that triggers
> an oops?

by now, i could reproduce this on two different machines with quite 
different hardware, while a third doesn't seem to show those symptoms. 
on the second machine, i got the corruption errors from the slab 
debugger mostly from the disk access alone, the network traffic was only 
minimal (but still present). i was doing write operations on the hdd in 
this test.

kernel 2.6.7 doesn't show this behavior, while all kernels from 2.6.9 
and up do. (i didn't test 2.6.8.x).

as for DEBUG_PAGEALLOC... when i enable this option, the errors from 
DEBUG_SLAB magically disappear. however, my ssh session got disconnected 
once while doing the disk access with the message:

Received disconnect from 195.58.172.154: 2: Bad packet length 4239103034.

never seen this before and not sure if this has anything to do with it...
cheers
richard


Re: slab corruption in skb allocs

2005-03-04 Thread Dave Jones
On Fri, Mar 04, 2005 at 10:55:31AM +0100, Richard Fuchs wrote:
 > hello all!
 >
 > the memory allocation debugger gives me the following messages under a
 > vanilla 2.6.10 and 2.6.11 kernel when doing
 >
 > 1) hdparm -d0 on my hard disk
 > 2) tar c / > /dev/null
 > 3) sending lots of network traffic to the machine (e.g. close to 100
 > mbit/s udp packets)

Which network drivers are in use on the box that gets the corruption ?

Dave



Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
Dave Jones wrote:
> Which network drivers are in use on the box that gets the corruption ?

all three that i tested it on are using the e100 driver. the boxes with 
pci id 8086:1039 and 8086:1229 are seeing corruptions, the one with pci 
id 8086:2449 is not.

i will try again the eepro100 driver and see if it does the same...
cheers
richard


Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
Richard Fuchs wrote:
> [e100]
> i will try again the eepro100 driver and see if it does the same...

yes, the same thing happens with the eepro100 driver.
cheers
richard


Re: slab corruption in skb allocs

2005-03-04 Thread Matt Mackall
On Fri, Mar 04, 2005 at 01:23:48PM +0100, Richard Fuchs wrote:
> Andrew Morton wrote:
>
> > I guess it could be hardware.  But given that disabling DMA _causes_ the
> > problem, rather than fixes it, it seems unlikely.
> >
> > Could you enable CONFIG_DEBUG_PAGEALLOC in .config and see if that triggers
> > an oops?
>
> by now, i could reproduce this on two different machines with quite
> different hardware, while a third doesn't seem to show those symptoms.
> on the second machine, i got the corruption errors from the slab
> debugger mostly from the disk access alone, the network traffic was only
> minimal (but still present). i was doing write operations on the hdd in
> this test.
>
> kernel 2.6.7 doesn't show this behavior, while all kernels from 2.6.9
> and up do. (i didn't test 2.6.8.x).
>
> as for DEBUG_PAGEALLOC... when i enable this option, the errors from
> DEBUG_SLAB magically disappear. however, my ssh session got disconnected
> once while doing the disk access with the message:
>
> Received disconnect from 195.58.172.154: 2: Bad packet length 4239103034.

Send the output of ethtool, please. This tends to be checksum
offloading not working as it should or the like. Can you repeat this
with bulk ssh traffic?

-- 
Mathematics is the supreme nostalgia of our time.


Re: slab corruption in skb allocs

2005-03-04 Thread Matt Mackall
On Fri, Mar 04, 2005 at 10:52:38PM +0100, Richard Fuchs wrote:
> Matt Mackall wrote:
>
> > Doh. 'ethtool -k' is what's needed, sorry.
>
> doh myself. :) this won't be very helpful though, as i get the same on
> all machines (with both drivers):
>
> Offload parameters for eth0:
> Cannot get device rx csum settings: Operation not supported
> Cannot get device tx csum settings: Operation not supported
> Cannot get device scatter-gather settings: Operation not supported
> Cannot get device tcp segmentation offload settings: Operation not supported
> no offload info available

Which card/driver is this? Is this the same card that's showing ssh
troubles? My theory about your ssh trouble only applies to cards with
checksum offload.

-- 
Mathematics is the supreme nostalgia of our time.


Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
Matt Mackall wrote:
> Which card/driver is this? Is this the same card that's showing ssh
> troubles? My theory about your ssh trouble only applies to cards with
> checksum offload.

i got the same on all three machines i was testing with, with both the 
e100 and the eepro100 driver. one of those three machines was the one 
with the ssh troubles, its card is identified as Intel Corp. 82557/8/9 
[Ethernet Pro 100] (rev 08), pci id 8086:1229. plus, i couldn't 
reproduce those problems on a machine with e1000, which does support all 
kinds of checksum offloading. (there might still be something fishy with 
the e1000 as well, as i'm not entirely trusting the errors from the slab 
checkers alone. especially since i don't see those messages when i 
enable page alloc debugging.)

another machine behaves even more strangely... its nic is identified as 
Intel Corp. 82801BD PRO/100 VE (LOM) Ethernet Controller (rev 81), pci 
id 8086:1039, also apparently not supporting hardware checksums. it does 
immediately produce the slab debug errors when i bombard it with udp 
packets while having disk access w/o dma, but remains silent when doing 
the same with a tcp transfer instead of udp packets. neither ssh traffic 
nor /dev/zero piped through netcat (no matter in which direction) makes 
it catch any errors. i only got a _single_ message from the slab 
debugger when sending /dev/zero through netcat in _both_ directions at 
the same time (in and out). however, i do get pages and pages of those 
messages when sending a simple stream of udp packets to the box... 
again, this is all with the e100 driver, i couldn't produce any similar 
results with the eepro100 or the e1000 driver yet, but apparently this 
doesn't necessarily mean that there isn't something wrong anyway...

cheers
richard
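
For anyone who wants to try the UDP half of this test, here is a minimal sketch of a sender that pushes a stream of small UDP datagrams at a target. It is not the tool used in the report; the function name, target address and port are placeholders. A Python loop will also fall well short of 100 mbit/s, so a dedicated traffic generator is likely needed to match the load described above.

```python
import socket


def send_udp_burst(host, port, count, payload=b""):
    """Send `count` UDP datagrams to (host, port) and return how many were sent.

    An empty payload yields the smallest possible datagram; all parameters
    here are illustrative placeholders, not values from the report.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            sock.sendto(payload, (host, port))
            sent += 1
    return sent


if __name__ == "__main__":
    # Hypothetical target; replace with the address of the box under test.
    print(send_udp_burst("127.0.0.1", 9, 1000))
```

Run it while the disk test (DMA off, tar reading the filesystem) is going on the receiving machine, and watch that machine's kernel log for the slab debugger messages.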


Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
_correction_ to my previous mail, this does _not_ happen with the 
eepro100 driver. (sorry for the confusion, i got the kernel images mixed 
up with all the testing i've been doing.)

could this affect the e1000 driver as well?
Matt Mackall wrote:
> Send the output of ethtool, please.

box 1, affected:
Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Current message level: 0x20c1 (8385)
        Link detected: yes

box 2, affected:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x0007 (7)
        Link detected: yes

box 3, not affected:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x0007 (7)
        Link detected: yes

> This tends to be checksum
> offloading not working as it should or the like. Can you repeat this
> with bulk ssh traffic?

yes, with various strange effects:
Received disconnect from 195.58.172.154: 2: Bad packet length 919251405.
or
Received disconnect from 195.58.172.154: 2: Corrupted MAC on input.
cheers
richard
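
A note on the two ssh errors: every SSH binary packet begins with a 32-bit big-endian packet_length field (RFC 4253, section 6), so if the stream is corrupted the client can decrypt garbage and read an absurd length. The sketch below shows the reported numbers as raw 32-bit values. The 256 KiB ceiling is an assumption about the client's sanity limit and the function name is hypothetical.

```python
import struct

# Assumed upper bound on a sane packet length; the real client limit may differ.
MAX_PACKET = 256 * 1024


def describe_length(n):
    """Return (hex of the 4 length bytes as sent on the wire, exceeds_limit)."""
    raw = struct.pack("!I", n)  # network byte order, unsigned 32-bit
    return raw.hex(), n > MAX_PACKET


if __name__ == "__main__":
    for reported in (4239103034, 919251405):
        hex_bytes, too_big = describe_length(reported)
        print(reported, hex_bytes, "rejected" if too_big else "plausible")
```

Both values look like arbitrary bit patterns rather than real lengths, and "Corrupted MAC on input" likewise means the received bytes failed the integrity check. That fits data being damaged somewhere between the wire and the ssh process, without saying where.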


Re: slab corruption in skb allocs

2005-03-04 Thread Richard Fuchs
Matt Mackall wrote:
> Doh. 'ethtool -k' is what's needed, sorry.

doh myself. :) this won't be very helpful though, as i get the same on 
all machines (with both drivers):

Offload parameters for eth0:
Cannot get device rx csum settings: Operation not supported
Cannot get device tx csum settings: Operation not supported
Cannot get device scatter-gather settings: Operation not supported
Cannot get device tcp segmentation offload settings: Operation not supported
no offload info available

> ethtool -K eth0 rx off tx off

ditto. i'll try to reproduce this on a machine with e1000 though...
cheers
richard


Re: slab corruption in skb allocs

2005-03-04 Thread Matt Mackall
On Fri, Mar 04, 2005 at 10:19:21PM +0100, Richard Fuchs wrote:
> _correction_ to my previous mail, this does _not_ happen with the
> eepro100 driver. (sorry for the confusion, i got the kernel images mixed
> up with all the testing i've been doing.)
>
> could this affect the e1000 driver as well?

Yes. 

> > Send the output of ethtool, please.

Doh. 'ethtool -k' is what's needed, sorry.

If it's reproducible, try turning off rx/tx hardware checksumming:

ethtool -K eth0 rx off tx off

-- 
Mathematics is the supreme nostalgia of our time.