Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Andreas Steinmetz
Andrew Morton wrote:
> On Tue, 20 Mar 2007 00:25:02 +0100
> Andreas Steinmetz <[EMAIL PROTECTED]> wrote:
> 
>> Mike Christie wrote:
>>> Mike Christie wrote:
>>>> James Bottomley wrote:
>>>>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>>>>>> I can't even say if the tapes are written correctly as I can't read them
>>>>>>> (one does not reboot production machines back to 2.4.x just to try to
>>>>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>>>>>> machines).
>>>>>> Could you try this patch
>>>>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>>>>>> I thought st was modified to not send offsets in the last elements but
>>>>>> it looks like it wasn't.
>>>>> Actually, there are two patches in the email referred to.  If the
>>>>> analysis that we're passing NULL to mempool_free is correct, it should
>>>>> be the second one that fixes the problem (the one that checks
>>>>> bio->bi_io_vec before freeing it).  Which would mean we have a
>>>>> nr_vecs==0 bio generated by the tar somehow.
>>>>>
>>>> I think we might only need the first patch if the problem is similar to
>>>> what the lsi guys were seeing. I thought the problem is that we are not
>>>> estimating how large the transfer is correctly because we do not take
>>>> into account offsets at the end. This results in nr_vecs being zero when
>>>> it should be a valid value. I thought Kai's patch:
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=7919
>>>> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
>>>> fixed the problem on st's side,
>>> Oh, I noticed that the subject for the mail references 2.6.20.3 and the
>>> patch for st in the bugzilla did not make it into 2.6.20 and is not in .3.
>>> Could we try the st patch in the bugzilla first?
>> Ok, the st patch from bugzilla solves the problem (tested on both
>> affected machines).
> 
> 
> If you're referring to the below patch then it's already in mainline, and
> has been for a month.
> 

Yes, that's the patch I'm referring to.

> Have you tested 2.6.21-rc4?  If not, please do so.
> 

Sorry, this is not possible on these machines. They are production
servers and every problem on them that cannot be easily solved via
remote access is a 40km (one way) drive in the middle of the night.

> Perhaps we should merge this into 2.6.20.x?
> 

I would suggest so.

> 
> 
> commit 9abe16c670bd3d4ab5519257514f9f291383d104
> Author: Kai Makisara <[EMAIL PROTECTED]>
> Date:   Sat Feb 3 13:21:29 2007 +0200
> 
> [SCSI] st: fix Tape dies if wrong block size used, bug 7919
> 
> On Thu, 1 Feb 2007, Andrew Morton wrote:
> > On Thu, 1 Feb 2007 15:34:29 -0800
> > [EMAIL PROTECTED] wrote:
> >
> > > http://bugzilla.kernel.org/show_bug.cgi?id=7919
> > >
> > >Summary: Tape dies if wrong block size used
> > > Kernel Version: 2.6.20-rc5
> > > Status: NEW
> > >   Severity: normal
> > >  Owner: [EMAIL PROTECTED]
> > >  Submitter: [EMAIL PROTECTED]
> > >
> > >
> > > Most recent kernel where this bug did *NOT* occur: 2.6.17.14
> > >
> > > Other Kernels Tested and Results:
> > >
> > > OK 2.6.15.7
> > > OK 2.6.16.37
> > > OK 2.6.17.14
> > > BAD 2.6.18.6
> > > BAD 2.6.18-1.2869.fc6
> > > BAD 2.6.19.2 +
> > > BAD 2.6.20-rc5
> > >
> > > NOTE: 2.6.18-1.2869.fc6 is a Fedora modified kernel, all others are from kernel.org
> > >
> ...
> > > Steps to reproduce:
> > > Get an Adaptec AHA-2940U/UW/D / AIC-7881U card and a tape drive,
> > > install a recent kernel
> > > set the tape block size - mt setblk 4096
> > > read from or write to tape using wrong block size - tar -b 7 -cvf /dev/tape foo
> > >
> Write does not trigger this bug because the driver refuses in fixed block
> mode writes that are not a multiple of the block size. Read does trigger
> it in my system.
> 
> The bug is not associated with any specific HBA. st tries to do direct i/o
> in fixed block mode with reads that are not a multiple of tape block size.
> 
> The patch in this message fixes the st problem by switching to using the
> driver buffer up to the next close of the device file in fixed block mode
> if the user asks for a read like this.
> 
> I don't know why the bug has surfaced only after 2.6.17 although the st
> problem is old. There may be another bug in the block subsystem and this
> patch works around it. However, the patch fixes a problem in st and in
> this way it is a valid fix.
> 
> This patch may also fix the bug 7900.
> 
> The patch compiles and is lightly tested.
> 
> Signed-off-by: Kai Makisara <[EMAIL PROTECTED]>
> Signed-off-by: James Bottomley <[EMAIL PROTECTED]>
> 
> diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> index 

Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Andrew Morton
On Tue, 20 Mar 2007 00:25:02 +0100
Andreas Steinmetz <[EMAIL PROTECTED]> wrote:

> Mike Christie wrote:
> > Mike Christie wrote:
> >> James Bottomley wrote:
> >>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
> >>>>> I can't even say if the tapes are written correctly as I can't read them
> >>>>> (one does not reboot production machines back to 2.4.x just to try to
> >>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
> >>>>> machines).
> >>>> Could you try this patch
> >>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
> >>>> I thought st was modified to not send offsets in the last elements but
> >>>> it looks like it wasn't.
> >>> Actually, there are two patches in the email referred to.  If the
> >>> analysis that we're passing NULL to mempool_free is correct, it should
> >>> be the second one that fixes the problem (the one that checks
> >>> bio->bi_io_vec before freeing it).  Which would mean we have a
> >>> nr_vecs==0 bio generated by the tar somehow.
> >>>
> >> I think we might only need the first patch if the problem is similar to
> >> what the lsi guys were seeing. I thought the problem is that we are not
> >> estimating how large the transfer is correctly because we do not take
> >> into account offsets at the end. This results in nr_vecs being zero when
> >> it should be a valid value. I thought Kai's patch:
> >> http://bugzilla.kernel.org/show_bug.cgi?id=7919
> >> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
> >> fixed the problem on st's side,
> > 
> > Oh, I noticed that the subject for the mail references 2.6.20.3 and the
> > patch for st in the bugzilla did not make it into 2.6.20 and is not in .3.
> > Could we try the st patch in the bugzilla first?
> 
> Ok, the st patch from bugzilla solves the problem (tested on both
> affected machines).


If you're referring to the below patch then it's already in mainline, and
has been for a month.

Have you tested 2.6.21-rc4?  If not, please do so.

Perhaps we should merge this into 2.6.20.x?



commit 9abe16c670bd3d4ab5519257514f9f291383d104
Author: Kai Makisara <[EMAIL PROTECTED]>
Date:   Sat Feb 3 13:21:29 2007 +0200

[SCSI] st: fix Tape dies if wrong block size used, bug 7919

On Thu, 1 Feb 2007, Andrew Morton wrote:
> On Thu, 1 Feb 2007 15:34:29 -0800
> [EMAIL PROTECTED] wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=7919
> >
> >Summary: Tape dies if wrong block size used
> > Kernel Version: 2.6.20-rc5
> > Status: NEW
> >   Severity: normal
> >  Owner: [EMAIL PROTECTED]
> >  Submitter: [EMAIL PROTECTED]
> >
> >
> > Most recent kernel where this bug did *NOT* occur: 2.6.17.14
> >
> > Other Kernels Tested and Results:
> >
> > OK 2.6.15.7
> > OK 2.6.16.37
> > OK 2.6.17.14
> > BAD 2.6.18.6
> > BAD 2.6.18-1.2869.fc6
> > BAD 2.6.19.2 +
> > BAD 2.6.20-rc5
> >
> > NOTE: 2.6.18-1.2869.fc6 is a Fedora modified kernel, all others are from kernel.org
> >
...
> > Steps to reproduce:
> > Get an Adaptec AHA-2940U/UW/D / AIC-7881U card and a tape drive,
> > install a recent kernel
> > set the tape block size - mt setblk 4096
> > read from or write to tape using wrong block size - tar -b 7 -cvf /dev/tape foo
> >
Write does not trigger this bug because the driver refuses in fixed block
mode writes that are not a multiple of the block size. Read does trigger
it in my system.

The bug is not associated with any specific HBA. st tries to do direct i/o
in fixed block mode with reads that are not a multiple of tape block size.

The patch in this message fixes the st problem by switching to using the
driver buffer up to the next close of the device file in fixed block mode
if the user asks for a read like this.

I don't know why the bug has surfaced only after 2.6.17 although the st
problem is old. There may be another bug in the block subsystem and this
patch works around it. However, the patch fixes a problem in st and in
this way it is a valid fix.

This patch may also fix the bug 7900.

The patch compiles and is lightly tested.

Signed-off-by: Kai Makisara <[EMAIL PROTECTED]>
Signed-off-by: James Bottomley <[EMAIL PROTECTED]>

diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index e016e09..fba8b20 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -9,7 +9,7 @@
Steve Hirsch, Andreas Koppenh"ofer, Michael Leodolter, Eyal Lebedinsky,
Michael Schaefer, J"org Weule, and Eric Youngdale.
 
-   Copyright 1992 - 2006 Kai Makisara
+   Copyright 1992 - 2007 Kai Makisara
email [EMAIL PROTECTED]
 
Some small formal changes - aeb, 950809
@@ -17,7 +17,7 @@
Last modified: 18-JAN-1998 Richard Gooch <[EMAIL PROTECTED]> Devfs 

Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Gene Heskett
On Monday 19 March 2007, James Bottomley wrote:
>On Mon, 2007-03-19 at 17:47 -0400, Gene Heskett wrote:
>> James, could this also be the cause of a tar based backup going crazy
>> and thinking all data is new under any 2.6.21-rc* kernel I've tested
>> so far with amanda, which in my case uses tar?  I've tried the fedora
>> patched tar-1.15-1, and one I hand built right after 1.15-1 came out
>> over a year ago, and they both do it, but only when booted to a
>> 2.6.21-rc* kernel.
>>
>> This obviously will be a show-stopper, either for amanda (and by
>> inference, any app that uses tar), or for the migration of an amanda
>> user's machinery to a 2.6.21 kernel.
>
>Er, I don't think so .. that sounds like mtime miscompare, which is
>either a problem with the filesystem or a problem with the way mtime is
>stored in the tar archive.
>
>James

Well, since the times reported by ls -l --full-time are sane
[EMAIL PROTECTED] pix]# ls -l --full-time
total 924784
-rw-r--r-- 1 root   root 985324 2002-06-09 18:14:54.0 -0400 0203.jpg
[... the rest of a 100k listing, booted to 2.6.20.3-rdsl-0.31]

And:
[EMAIL PROTECTED] pix]# ls -l --full-time
total 924784
-rw-r--r-- 1 root   root 985324 2002-06-09 18:14:54.0 -0400 0203.jpg

booted to 2.6.21-rc4

(although the fractional second is a string of .0 even when booted to
a tar-unfriendly kernel), that would tend to point at tar. But two
differently built versions of tar have been confirmed as misbehaving
with every 2.6.21-series kernel I've tried so far.

I'm going to reboot twice more tonight: once to verify that the output of
ls -l --full-time is as I said above (I'll save this, do that again, and
clip it in after a reboot to 2.6.21-rc4), and once to 2.6.20.4-rc1 to
see if by chance one of those patches is the guilty party. I'll leave
the latter running tonight for the amanda run and see what falls out.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Try to relax and enjoy the crisis.
-- Ashleigh Brilliant
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Andreas Steinmetz
Mike Christie wrote:
> James Bottomley wrote:
>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>>> I can't even say if the tapes are written correctly as I can't read them
>>>> (one does not reboot production machines back to 2.4.x just to try to
>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>>> machines).
>>> Could you try this patch
>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>>> I thought st was modified to not send offsets in the last elements but
>>> it looks like it wasn't.
>> Actually, there are two patches in the email referred to.  If the
>> analysis that we're passing NULL to mempool_free is correct, it should
>> be the second one that fixes the problem (the one that checks
>> bio->bi_io_vec before freeing it).  Which would mean we have a
>> nr_vecs==0 bio generated by the tar somehow.
>>
> 
> I think we might only need the first patch if the problem is similar to
> what the lsi guys were seeing. I thought the problem is that we are not
> estimating how large the transfer is correctly because we do not take
> into account offsets at the end. This results in nr_vecs being zero when
> it should be a valid value. I thought Kai's patch:
> http://bugzilla.kernel.org/show_bug.cgi?id=7919
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
> fixed the problem on st's side, but I guess not, so you are probably right.
> 
> Here is a patch that dumps the sgl we are getting from st so we can see
> for sure what we are getting and can decide if we need the first patch,
> second patch or both.
> 

Here's the patch output:

sg length 6 offset 0
sg length 12 offset 0
sg length 4096 offset 0
sg length 4096 offset 0
sg length 2048 offset 0

Please note (as replied in the other mail) that the bugzilla patch
solves the problem.
> 
> 
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 5f95570..81005aa 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -306,6 +306,10 @@ static int scsi_req_map_sg(struct reques
>   struct bio *bio = NULL;
>   int i, err, nr_vecs = 0;
>  
> + for (i = 0; i < nsegs; i++)
> + printk(KERN_INFO "sg length %u offset %u\n", sgl[i].length,
> + sgl[i].offset);
> +
>   for (i = 0; i < nsegs; i++) {
>   page = sgl[i].page;
>   off = sgl[i].offset;


-- 
Andreas Steinmetz   SPAMmers use [EMAIL PROTECTED]


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Andreas Steinmetz
Mike Christie wrote:
> Mike Christie wrote:
>> James Bottomley wrote:
>>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>>>> I can't even say if the tapes are written correctly as I can't read them
>>>>> (one does not reboot production machines back to 2.4.x just to try to
>>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>>>> machines).
>>>> Could you try this patch
>>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>>>> I thought st was modified to not send offsets in the last elements but
>>>> it looks like it wasn't.
>>> Actually, there are two patches in the email referred to.  If the
>>> analysis that we're passing NULL to mempool_free is correct, it should
>>> be the second one that fixes the problem (the one that checks
>>> bio->bi_io_vec before freeing it).  Which would mean we have a
>>> nr_vecs==0 bio generated by the tar somehow.
>>>
>> I think we might only need the first patch if the problem is similar to
>> what the lsi guys were seeing. I thought the problem is that we are not
>> estimating how large the transfer is correctly because we do not take
>> into account offsets at the end. This results in nr_vecs being zero when
>> it should be a valid value. I thought Kai's patch:
>> http://bugzilla.kernel.org/show_bug.cgi?id=7919
>> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
>> fixed the problem on st's side,
> 
> Oh, I noticed that the subject for the mail references 2.6.20.3 and the
> patch for st in the bugzilla did not make it into 2.6.20 and is not in .3.
> Could we try the st patch in the bugzilla first?

Ok, the st patch from bugzilla solves the problem (tested on both
affected machines).
-- 
Andreas Steinmetz   SPAMmers use [EMAIL PROTECTED]


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread James Bottomley
On Mon, 2007-03-19 at 17:47 -0400, Gene Heskett wrote:
> James, could this also be the cause of a tar based backup going crazy and 
> thinking all data is new under any 2.6.21-rc* kernel I've tested so far 
> with amanda, which in my case uses tar?  I've tried the fedora patched 
> tar-1.15-1, and one I hand built right after 1.15-1 came out over a year 
> ago, and they both do it, but only when booted to a 2.6.21-rc* kernel.
> 
> This obviously will be a show-stopper, either for amanda (and by 
> inference, any app that uses tar), or for the migration of an amanda 
> user's machinery to a 2.6.21 kernel.

Er, I don't think so .. that sounds like mtime miscompare, which is
either a problem with the filesystem or a problem with the way mtime is
stored in the tar archive.

James




Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Gene Heskett
On Monday 19 March 2007, James Bottomley wrote:
>On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>> > I can't even say if the tapes are written correctly as I can't read
>> > them (one does not reboot production machines back to 2.4.x just to
>> > try to read a backup tape - I don't have 2.6.x older than 2.6.20 on
>> > these machines).
>>
>> Could you try this patch
>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>> I thought st was modified to not send offsets in the last elements but
>> it looks like it wasn't.
>
>Actually, there are two patches in the email referred to.  If the
>analysis that we're passing NULL to mempool_free is correct, it should
>be the second one that fixes the problem (the one that checks
>bio->bi_io_vec before freeing it).  Which would mean we have a
>nr_vecs==0 bio generated by the tar somehow.
>
>James

James, could this also be the cause of a tar based backup going crazy and 
thinking all data is new under any 2.6.21-rc* kernel I've tested so far 
with amanda, which in my case uses tar?  I've tried the fedora patched 
tar-1.15-1, and one I hand built right after 1.15-1 came out over a year 
ago, and they both do it, but only when booted to a 2.6.21-rc* kernel.

This obviously will be a show-stopper, either for amanda (and by 
inference, any app that uses tar), or for the migration of an amanda 
user's machinery to a 2.6.21 kernel.




-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
fractal radiation jamming the backbone


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Mike Christie
Mike Christie wrote:
> James Bottomley wrote:
>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>>> I can't even say if the tapes are written correctly as I can't read them
>>>> (one does not reboot production machines back to 2.4.x just to try to
>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>>> machines).
>>> Could you try this patch
>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>>> I thought st was modified to not send offsets in the last elements but
>>> it looks like it wasn't.
>> Actually, there are two patches in the email referred to.  If the
>> analysis that we're passing NULL to mempool_free is correct, it should
>> be the second one that fixes the problem (the one that checks
>> bio->bi_io_vec before freeing it).  Which would mean we have a
>> nr_vecs==0 bio generated by the tar somehow.
>>
> 
> I think we might only need the first patch if the problem is similar to
> what the lsi guys were seeing. I thought the problem is that we are not
> estimating how large the transfer is correctly because we do not take
> into account offsets at the end. This results in nr_vecs being zero when
> it should be a valid value. I thought Kai's patch:
> http://bugzilla.kernel.org/show_bug.cgi?id=7919
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
> fixed the problem on st's side,

Oh, I noticed that the subject for the mail references 2.6.20.3 and the
patch for st in the bugzilla did not make it into 2.6.20 and is not in .3.
Could we try the st patch in the bugzilla first?


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Mike Christie
James Bottomley wrote:
> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>> I can't even say if the tapes are written correctly as I can't read them
>>> (one does not reboot production machines back to 2.4.x just to try to
>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>> machines).
>> Could you try this patch
>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>> I thought st was modified to not send offsets in the last elements but
>> it looks like it wasn't.
> 
> Actually, there are two patches in the email referred to.  If the
> analysis that we're passing NULL to mempool_free is correct, it should
> be the second one that fixes the problem (the one that checks
> bio->bi_io_vec before freeing it).  Which would mean we have a
> nr_vecs==0 bio generated by the tar somehow.
> 

I think we might only need the first patch if the problem is similar to
what the lsi guys were seeing. I thought the problem is that we are not
estimating how large the transfer is correctly because we do not take
into account offsets at the end. This results in nr_vecs being zero when
it should be a valid value. I thought Kai's patch:
http://bugzilla.kernel.org/show_bug.cgi?id=7919
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
fixed the problem on st's side, but I guess not, so you are probably right.

Here is a patch that dumps the sgl we are getting from st so we can see
for sure what we are getting and can decide if we need the first patch,
second patch or both.
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5f95570..81005aa 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -306,6 +306,10 @@ static int scsi_req_map_sg(struct reques
 	struct bio *bio = NULL;
 	int i, err, nr_vecs = 0;
 
+	for (i = 0; i < nsegs; i++)
+		printk(KERN_INFO "sg length %u offset %u\n", sgl[i].length,
+			sgl[i].offset);
+
 	for (i = 0; i < nsegs; i++) {
 		page = sgl[i].page;
 		off = sgl[i].offset;


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread James Bottomley
On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
> > I can't even say if the tapes are written correctly as I can't read them
> > (one does not reboot production machines back to 2.4.x just to try to
> > read a backup tape - I don't have 2.6.x older than 2.6.20 on these
> > machines).
> 
> Could you try this patch
> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
> I thought st was modified to not send offsets in the last elements but
> it looks like it wasn't.

Actually, there are two patches in the email referred to.  If the
analysis that we're passing NULL to mempool_free is correct, it should
be the second one that fixes the problem (the one that checks
bio->bi_io_vec before freeing it).  Which would mean we have a
nr_vecs==0 bio generated by the tar somehow.

James




Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Mike Christie
Andreas Steinmetz wrote:
> As posted to lkml and linux-scsi on 2007-03-15 without reply, see
> http://marc.info/?l=linux-kernel&m=117395128412313&w=2 for original post:
> 
> It is not so nice when one can write backup tapes but the tapes cannot
> be read. I don't know if memory management or the st driver is the
> culprit, but this is a not so nice situation.
> 
> I can't even say if the tapes are written correctly as I can't read them
> (one does not reboot production machines back to 2.4.x just to try to
> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
> machines).

Could you try this patch
http://marc.info/?l=linux-scsi&m=116464965414878&w=2
I thought st was modified to not send offsets in the last elements but
it looks like it wasn't.


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Pekka Enberg

On 3/19/07, Pekka Enberg <[EMAIL PROTECTED]> wrote:

> You can see that mempool_free is passing a NULL pointer to
> kmem_cache_free() which doesn't handle it properly. The NULL pointer
> comes from bio_free() where ->bi_io_vec is NULL because nr_iovecs
> passed to bio_alloc_bioset() was zero.
>
> The question is, why is nr_pages zero in scsi_req_map_sg()?


Note that the following patch I posted only addresses the part where
slab is clearly failing here:

http://lkml.org/lkml/2007/3/19/42

So, while it should fix the oops, there might be a bug lurking in the
SCSI or block layer still.


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Pekka Enberg

On 3/19/07, Pekka Enberg <[EMAIL PROTECTED]> wrote:

> EIP is at kmem_cache_free+0x29/0x5a
> eax: c180   ebx: f0ae12c0   ecx: c18f73c0   edx: c180
> esi: c1919de0   edi:    ebp: 1000   esp: f1fe7e14
> ds: 007b   es: 007b   ss: 0068
>
> But somehow eax and edx have the same value 0xc180 here. Hmm?


Aah, but if you look at contents of the stack:

Stack: f0ae12c0 c1919de0 ffea c0137f97  f0ae12c0 c1919e20 c0168d45
  f0ae12c0 1000 c0168fb9 c02a77e3 1000   
   c17bb6e0 1000  f1b38be8 0003 f54ac050 c1b9d6e8
Call Trace:
[<c0137f97>] mempool_free+0x48/0x4c
[<c0168d45>] bio_free+0x21/0x2c
[<c0168fb9>] bio_put+0x22/0x23

You can see that mempool_free is passing a NULL pointer to
kmem_cache_free() which doesn't handle it properly. The NULL pointer
comes from bio_free() where ->bi_io_vec is  NULL because nr_iovecs
passed to bio_alloc_bioset() was zero.

The question is, why is nr_pages zero in scsi_req_map_sg()?


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Pekka Enberg

On 3/19/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> BUG_ON(!PageSlab(page));
>
> that's seriously screwed up.  Do you have CONFIG_DEBUG_SLAB enabled?  If
> not, please enable it and retest.


This is scary. Looking at disassembly of the OOPS:

Disassembly of section .text:

00000000 <.text>:
   0:   5f                      pop    %edi
   1:   c3                      ret
   2:   57                      push   %edi
   3:   89 c1                   mov    %eax,%ecx
   5:   89 d7                   mov    %edx,%edi
   7:   8d 92 00 00 00 40       lea    0x40000000(%edx),%edx
   d:   56                      push   %esi
   e:   c1 ea 0c                shr    $0xc,%edx
  11:   53                      push   %ebx
  12:   c1 e2 05                shl    $0x5,%edx
  15:   03 15 40 5d 5a c0       add    0xc05a5d40,%edx

At this point, edx has the result of virt_to_page().

  1b:   8b 02                   mov    (%edx),%eax
  1d:   f6 c4 40                test   $0x40,%ah
  20:   74 03                   je     0x25

If it's a compound page, look up the real page from ->private.

  22:   8b 52 0c                mov    0xc(%edx),%edx

Now, reload page flags.

  25:   8b 02                   mov    (%edx),%eax

And test...

  27:   a8 80                   test   $0x80,%al
  29:   75 04                   jne    0x2f
  2b:   0f 0b                   ud2a
  2d:   eb fe                   jmp    0x2d
  2f:   39 4a 18                cmp    %ecx,0x18(%edx)

[snip, snip]

EIP is at kmem_cache_free+0x29/0x5a
eax: c180   ebx: f0ae12c0   ecx: c18f73c0   edx: c180
esi: c1919de0   edi:    ebp: 1000   esp: f1fe7e14
ds: 007b   es: 007b   ss: 0068

But somehow eax and edx have the same value 0xc180 here. Hmm?

  Pekka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Pekka Enberg

On 3/19/07, Andrew Morton [EMAIL PROTECTED] wrote:

BUG_ON(!PageSlab(page));

that's seriously screwed up.  Do you have CONFIG_DEBUG_SLAB enabled?  If
not, please enable it and retest.


This is scary. Looking at disassembly of the OOPS:

Disassembly of section .text:

 .text:
  0:   5f  pop%edi
  1:   c3  ret
  2:   57  push   %edi
  3:   89 c1   mov%eax,%ecx
  5:   89 d7   mov%edx,%edi
  7:   8d 92 00 00 00 40   lea0x4000(%edx),%edx
  d:   56  push   %esi
  e:   c1 ea 0cshr$0xc,%edx
 11:   53  push   %ebx
 12:   c1 e2 05shl$0x5,%edx
 15:   03 15 40 5d 5a c0   add0xc05a5d40,%edx

At this point, edx has the result of virt_to_page().

 1b:   8b 02   mov(%edx),%eax
 1d:   f6 c4 40test   $0x40,%ah
 20:   74 03   je 0x25

If it's a compound page, look up the real page from -private.

 22:   8b 52 0cmov0xc(%edx),%edx

Now, reload page flags.

 25:   8b 02   mov(%edx),%eax

And test...

 27:   a8 80   test   $0x80,%al
 29:   75 04   jne0x2f
 2b:   0f 0b   ud2a
 2d:   eb fe   jmp0x2d
 2f:   39 4a 18cmp%ecx,0x18(%edx)

[snip, snip]

EIP is at kmem_cache_free+0x29/0x5a
eax: c180   ebx: f0ae12c0   ecx: c18f73c0   edx: c180
esi: c1919de0   edi:    ebp: 1000   esp: f1fe7e14
ds: 007b   es: 007b   ss: 0068

But somehow eax and edx have the same value 0xc180 here. Hmm?

  Pekka
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Pekka Enberg

On 3/19/07, Pekka Enberg [EMAIL PROTECTED] wrote:

EIP is at kmem_cache_free+0x29/0x5a
eax: c180   ebx: f0ae12c0   ecx: c18f73c0   edx: c180
esi: c1919de0   edi:    ebp: 1000   esp: f1fe7e14
ds: 007b   es: 007b   ss: 0068

But somehow eax and edx have the same value 0xc180 here. Hmm?


Aah, but if you look at contents of the stack:

Stack: f0ae12c0 c1919de0 ffea c0137f97  f0ae12c0 c1919e20 c0168d45
  f0ae12c0 1000 c0168fb9 c02a77e3 1000   
   c17bb6e0 1000  f1b38be8 0003 f54ac050 c1b9d6e8
Call Trace:
 [<c0137f97>] mempool_free+0x48/0x4c
 [<c0168d45>] bio_free+0x21/0x2c
 [<c0168fb9>] bio_put+0x22/0x23

You can see that mempool_free is passing a NULL pointer to
kmem_cache_free() which doesn't handle it properly. The NULL pointer
comes from bio_free() where ->bi_io_vec is NULL because nr_iovecs
passed to bio_alloc_bioset() was zero.
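The chain of events described here can be modelled in a few lines of userspace C. Everything in the sketch is hypothetical: mini_bio, mini_bio_init(), mini_bio_free() and model_cache_free() are illustrative stand-ins, not the kernel's bio or slab API. The model assumes, as described above, that a request for zero vectors leaves the vector pointer NULL. In userspace free(NULL) is harmless, so the model counts NULL frees instead of crashing. It also shows that a NULL check before the free keeps that count unchanged.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the kernel's struct bio / struct bio_vec. */
struct mini_vec {
    void *page;
    unsigned int len, offset;
};

struct mini_bio {
    struct mini_vec *bi_io_vec;  /* NULL when zero vectors were requested */
    unsigned int nr_vecs;
};

/* Number of times a NULL object reached the cache-free stand-in. */
static unsigned int null_frees;

/* Stand-in for kmem_cache_free(); in the oops above, NULL ended in
 * BUG_ON(!PageSlab(page)). Here it is only counted. */
static void model_cache_free(void *obj)
{
    if (obj == NULL)
        null_frees++;
    free(obj);
}

/* Allocate a vector array only when at least one vector is requested. */
int mini_bio_init(struct mini_bio *bio, unsigned int nr_vecs)
{
    bio->nr_vecs = nr_vecs;
    bio->bi_io_vec = NULL;
    if (nr_vecs == 0)
        return 0;
    bio->bi_io_vec = calloc(nr_vecs, sizeof(*bio->bi_io_vec));
    return bio->bi_io_vec ? 0 : -1;
}

/* guarded != 0 skips the free when there is no vector array. */
void mini_bio_free(struct mini_bio *bio, int guarded)
{
    if (guarded && bio->bi_io_vec == NULL)
        return;
    model_cache_free(bio->bi_io_vec);
    bio->bi_io_vec = NULL;
}

void bio_model_self_check(void)
{
    struct mini_bio a, b;

    null_frees = 0;
    mini_bio_init(&a, 0);
    mini_bio_free(&a, 0);   /* unguarded: NULL reaches the cache free */
    assert(null_frees == 1);

    mini_bio_init(&b, 0);
    mini_bio_free(&b, 1);   /* guarded: nothing is freed */
    assert(null_frees == 1);
}
```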

The question is, why is nr_pages zero in scsi_req_map_sg()?


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Pekka Enberg

On 3/19/07, Pekka Enberg <[EMAIL PROTECTED]> wrote:

> You can see that mempool_free is passing a NULL pointer to
> kmem_cache_free() which doesn't handle it properly. The NULL pointer
> comes from bio_free() where ->bi_io_vec is NULL because nr_iovecs
> passed to bio_alloc_bioset() was zero.
>
> The question is, why is nr_pages zero in scsi_req_map_sg()?


Note that the following patch I posted only addresses the part where
slab is clearly failing here:

http://lkml.org/lkml/2007/3/19/42

So, while it should fix the oops, there might be a bug lurking in the
SCSI or block layer still.


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Mike Christie
Andreas Steinmetz wrote:
> As posted to lkml and linux-scsi on 2007-03-15 without reply, see
> http://marc.info/?l=linux-kernel&m=117395128412313&w=2 for original post:
>
> It is not so nice when one can write backup tapes but the tapes cannot
> be read. I don't know if memory management or the st driver is the
> culprit, but this is a not so nice situation.
>
> I can't even say if the tapes are written correctly as I can't read them
> (one does not reboot production machines back to 2.4.x just to try to
> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
> machines).

Could you try this patch
http://marc.info/?l=linux-scsi&m=116464965414878&w=2
I thought st was modified to not send offsets in the last elements but
it looks like it wasn't.


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread James Bottomley
On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>> I can't even say if the tapes are written correctly as I can't read them
>> (one does not reboot production machines back to 2.4.x just to try to
>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>> machines).
>
> Could you try this patch
> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
> I thought st was modified to not send offsets in the last elements but
> it looks like it wasn't.

Actually, there are two patches in the email referred to.  If the
analysis that we're passing NULL to mempool_free is correct, it should
be the second one that fixes the problem (the one that checks
bio->bi_io_vec before freeing it).  Which would mean we have a
nr_vecs==0 bio generated by the tar somehow.

James




Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Mike Christie
James Bottomley wrote:
> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>> I can't even say if the tapes are written correctly as I can't read them
>>> (one does not reboot production machines back to 2.4.x just to try to
>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>> machines).
>> Could you try this patch
>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>> I thought st was modified to not send offsets in the last elements but
>> it looks like it wasn't.
>
> Actually, there are two patches in the email referred to.  If the
> analysis that we're passing NULL to mempool_free is correct, it should
> be the second one that fixes the problem (the one that checks
> bio->bi_io_vec before freeing it).  Which would mean we have a
> nr_vecs==0 bio generated by the tar somehow.
>

I think we might only need the first patch if the problem is similar to
what the lsi guys were seeing. I thought the problem is that we are not
estimating how large the transfer is correctly because we do not take
into account offsets at the end. This results in nr_vecs being zero when
it should be a valid value. I thought Kai's patch:
http://bugzilla.kernel.org/show_bug.cgi?id=7919
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
fixed the problem on st's side, but I guess not so you are probably right.

Here is a patch that dumps the sgl we are getting from st so we can see
for sure what we are getting and can decide if we need the first patch,
second patch or both.
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5f95570..81005aa 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -306,6 +306,10 @@ static int scsi_req_map_sg(struct reques
 	struct bio *bio = NULL;
 	int i, err, nr_vecs = 0;
 
+	for (i = 0; i < nsegs; i++)
+		printk(KERN_INFO "sg length %u offset %u\n", sgl[i].length,
+			sgl[i].offset);
+
 	for (i = 0; i < nsegs; i++) {
 		page = sgl[i].page;
 		off = sgl[i].offset;


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Mike Christie
Mike Christie wrote:
> James Bottomley wrote:
>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>>> I can't even say if the tapes are written correctly as I can't read them
>>>> (one does not reboot production machines back to 2.4.x just to try to
>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>>> machines).
>>> Could you try this patch
>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>>> I thought st was modified to not send offsets in the last elements but
>>> it looks like it wasn't.
>> Actually, there are two patches in the email referred to.  If the
>> analysis that we're passing NULL to mempool_free is correct, it should
>> be the second one that fixes the problem (the one that checks
>> bio->bi_io_vec before freeing it).  Which would mean we have a
>> nr_vecs==0 bio generated by the tar somehow.
>>
>
> I think we might only need the first patch if the problem is similar to
> what the lsi guys were seeing. I thought the problem is that we are not
> estimating how large the transfer is correctly because we do not take
> into account offsets at the end. This results in nr_vecs being zero when
> it should be a valid value. I thought Kai's patch:
> http://bugzilla.kernel.org/show_bug.cgi?id=7919
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
> fixed the problem on st's side,

Oh, I noticed that the subject for the mail references 2.6.20.3 and the
patch for st in the bugzilla did not make it into 2.6.20 and is not in .3.
Could we try the st patch in the bugzilla first?


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Gene Heskett
On Monday 19 March 2007, James Bottomley wrote:
> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>> I can't even say if the tapes are written correctly as I can't read
>>> them (one does not reboot production machines back to 2.4.x just to
>>> try to read a backup tape - I don't have 2.6.x older than 2.6.20 on
>>> these machines).
>>
>> Could you try this patch
>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>> I thought st was modified to not send offsets in the last elements but
>> it looks like it wasn't.
>
> Actually, there are two patches in the email referred to.  If the
> analysis that we're passing NULL to mempool_free is correct, it should
> be the second one that fixes the problem (the one that checks
> bio->bi_io_vec before freeing it).  Which would mean we have a
> nr_vecs==0 bio generated by the tar somehow.
>
> James

James, could this also be the cause of a tar based backup going crazy and 
thinking all data is new under any 2.6.21-rc* kernel I've tested so far 
with amanda, which in my case uses tar?  I've tried the fedora patched 
tar-1.15-1, and one I hand built right after 1.15-1 came out over a year 
ago, and they both do it, but only when booted to a 2.6.21-rc* kernel.

This obviously will be a show-stopper, either for amanda (and by 
inference, any app that uses tar), or for the migration of an amanda 
user's machinery to a 2.6.21 kernel.





-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
fractal radiation jamming the backbone


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread James Bottomley
On Mon, 2007-03-19 at 17:47 -0400, Gene Heskett wrote:
> James, could this also be the cause of a tar based backup going crazy and
> thinking all data is new under any 2.6.21-rc* kernel I've tested so far
> with amanda, which in my case uses tar?  I've tried the fedora patched
> tar-1.15-1, and one I hand built right after 1.15-1 came out over a year
> ago, and they both do it, but only when booted to a 2.6.21-rc* kernel.
>
> This obviously will be a show-stopper, either for amanda (and by
> inference, any app that uses tar), or for the migration of an amanda
> users machinery to a 2.6.21 kernel.

Er, I don't think so .. that sounds like mtime miscompare, which is
either a problem with the filesystem or a problem with the way mtime is
stored in the tar archive.
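A minimal sketch of the kind of mtime comparison meant here follows. It assumes that an incremental backup treats a file as new when its mtime is later than a reference time recorded for the previous run. This is a hypothetical illustration, not GNU tar's or amanda's actual logic, and is_newer() is an invented name. The point it shows is that if the reference time read back is wrong (zero, in this example), every file compares as new. It needs struct timespec from <time.h> (C11 or POSIX).

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical incremental-backup test, not GNU tar's or amanda's code:
 * a file counts as "new" when its mtime is later than the reference time
 * recorded for the previous run.
 */
int is_newer(struct timespec mtime, struct timespec reference)
{
    if (mtime.tv_sec != reference.tv_sec)
        return mtime.tv_sec > reference.tv_sec;
    return mtime.tv_nsec > reference.tv_nsec;
}

void mtime_model_self_check(void)
{
    /* 2002-06-09 18:14:54 -0400, the 0203.jpg mtime from the thread. */
    struct timespec file_mtime = { .tv_sec = 1023660894, .tv_nsec = 0 };
    /* An arbitrary reference time around mid-March 2007. */
    struct timespec last_run   = { .tv_sec = 1174000000, .tv_nsec = 0 };
    /* A reference time that was stored or decoded wrongly. */
    struct timespec garbled    = { .tv_sec = 0, .tv_nsec = 0 };

    assert(!is_newer(file_mtime, last_run));  /* unchanged file is skipped */
    assert(is_newer(file_mtime, garbled));    /* bad reference: looks new */
}
```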

James




Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Andreas Steinmetz
Mike Christie wrote:
> James Bottomley wrote:
>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>>> I can't even say if the tapes are written correctly as I can't read them
>>>> (one does not reboot production machines back to 2.4.x just to try to
>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>>> machines).
>>> Could you try this patch
>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>>> I thought st was modified to not send offsets in the last elements but
>>> it looks like it wasn't.
>> Actually, there are two patches in the email referred to.  If the
>> analysis that we're passing NULL to mempool_free is correct, it should
>> be the second one that fixes the problem (the one that checks
>> bio->bi_io_vec before freeing it).  Which would mean we have a
>> nr_vecs==0 bio generated by the tar somehow.
>>
>
> I think we might only need the first patch if the problem is similar to
> what the lsi guys were seeing. I thought the problem is that we are not
> estimating how large the transfer is correctly because we do not take
> into account offsets at the end. This results in nr_vecs being zero when
> it should be a valid value. I thought Kai's patch:
> http://bugzilla.kernel.org/show_bug.cgi?id=7919
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
> fixed the problem on st's side, but I guess not so you are probably right.
>
> Here is a patch that dumps the sgl we are getting from st so we can see
> for sure what we are getting and can decide if we need the first patch,
> second patch or both.
>

Here's the patch output:

sg length 6 offset 0
sg length 12 offset 0
sg length 4096 offset 0
sg length 4096 offset 0
sg length 2048 offset 0

Please note (as replied in the other mail) that the bugzilla patch
solves the problem.
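The hypothesis in this thread is that the number of bio vectors is estimated too low, so that a later allocation asks for zero. The sketch below replays the sg output above under two assumptions that are not taken from the kernel source:

- the estimate is the total byte count rounded up to whole pages;
- every scatterlist element occupies at least one vector of its own (one per page it touches, counting its offset).

With the five elements above, the estimate is 3 while 5 vectors are consumed. If vectors were handed out from that estimate, the budget would run out after three. model_sg, estimate_pages() and vectors_needed() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Hypothetical mirror of the two fields printed by the debug patch. */
struct model_sg {
    unsigned int length;
    unsigned int offset;
};

/* Assumed estimate: total bytes rounded up to whole pages. */
unsigned int estimate_pages(const struct model_sg *sg, size_t n)
{
    unsigned int total = 0;

    for (size_t i = 0; i < n; i++)
        total += sg[i].length;
    return (total + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/* Vectors used if each element takes one vector per page it touches. */
unsigned int vectors_needed(const struct model_sg *sg, size_t n)
{
    unsigned int vecs = 0;

    for (size_t i = 0; i < n; i++) {
        if (sg[i].length == 0)
            continue;
        vecs += (sg[i].offset + sg[i].length + MODEL_PAGE_SIZE - 1) /
                MODEL_PAGE_SIZE;
    }
    return vecs;
}

void sg_model_self_check(void)
{
    /* Lengths from the output above: 6, 12, 4096, 4096, 2048, offset 0. */
    const struct model_sg observed[] = {
        { 6, 0 }, { 12, 0 }, { 4096, 0 }, { 4096, 0 }, { 2048, 0 },
    };
    size_t n = sizeof(observed) / sizeof(observed[0]);

    assert(estimate_pages(observed, n) == 3);   /* 10258 bytes in total */
    assert(vectors_needed(observed, n) == 5);
}
```

The same model also shows the offset effect. A single 4096-byte element starting 2048 bytes into a page is estimated at one page but touches two.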
>
>
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 5f95570..81005aa 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -306,6 +306,10 @@ static int scsi_req_map_sg(struct reques
>  	struct bio *bio = NULL;
>  	int i, err, nr_vecs = 0;
>  
> +	for (i = 0; i < nsegs; i++)
> +		printk(KERN_INFO "sg length %u offset %u\n", sgl[i].length,
> +			sgl[i].offset);
> +
>  	for (i = 0; i < nsegs; i++) {
>  		page = sgl[i].page;
>  		off = sgl[i].offset;


-- 
Andreas Steinmetz   SPAMmers use [EMAIL PROTECTED]


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Andreas Steinmetz
Mike Christie wrote:
> Mike Christie wrote:
>> James Bottomley wrote:
>>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>>>> I can't even say if the tapes are written correctly as I can't read them
>>>>> (one does not reboot production machines back to 2.4.x just to try to
>>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>>>> machines).
>>>> Could you try this patch
>>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>>>> I thought st was modified to not send offsets in the last elements but
>>>> it looks like it wasn't.
>>> Actually, there are two patches in the email referred to.  If the
>>> analysis that we're passing NULL to mempool_free is correct, it should
>>> be the second one that fixes the problem (the one that checks
>>> bio->bi_io_vec before freeing it).  Which would mean we have a
>>> nr_vecs==0 bio generated by the tar somehow.
>>
>> I think we might only need the first patch if the problem is similar to
>> what the lsi guys were seeing. I thought the problem is that we are not
>> estimating how large the transfer is correctly because we do not take
>> into account offsets at the end. This results in nr_vecs being zero when
>> it should be a valid value. I thought Kai's patch:
>> http://bugzilla.kernel.org/show_bug.cgi?id=7919
>> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
>> fixed the problem on st's side,
>
> Oh, I noticed that the subject for the mail references 2.6.30.3 and the
> patch for st in the bugzilla did not make into 2.6.20 and is not in .3.
> Could we try the st patch in the bugzilla first?

Ok, the st patch from bugzilla solves the problem (tested on both
affected machines).
-- 
Andreas Steinmetz   SPAMmers use [EMAIL PROTECTED]


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Gene Heskett
On Monday 19 March 2007, James Bottomley wrote:
> On Mon, 2007-03-19 at 17:47 -0400, Gene Heskett wrote:
>> James, could this also be the cause of a tar based backup going crazy
>> and thinking all data is new under any 2.6.21-rc* kernel I've tested
>> so far with amanda, which in my case uses tar?  I've tried the fedora
>> patched tar-1.15-1, and one I hand built right after 1.15-1 came out
>> over a year ago, and they both do it, but only when booted to a
>> 2.6.21-rc* kernel.
>>
>> This obviously will be a show-stopper, either for amanda (and by
>> inference, any app that uses tar), or for the migration of an amanda
>> users machinery to a 2.6.21 kernel.
>
> Er, I don't think so .. that sounds like mtime miscompare, which is
> either a problem with the filesystem or a problem with the way mtime is
> stored in the tar archive.
>
> James

Well, the times reported by ls -l --full-time are sane:
[EMAIL PROTECTED] pix]# ls -l --full-time
total 924784
-rw-r--r-- 1 root   root 985324 2002-06-09 18:14:54.0 -0400
0203.jpg
[... the rest of a 100k listing, booted to 2.6.20.3-rdsl-0.31]

And:
[EMAIL PROTECTED] pix]# ls -l --full-time
total 924784
-rw-r--r-- 1 root   root 985324 2002-06-09 18:14:54.0 -0400
0203.jpg

booted to 2.6.21-rc4

Although the fractional second is a string of .0, the listing is the same
even when booted to a tar-unfriendly kernel, which would tend to point at
tar. But two differently built versions of tar have been confirmed as
misbehaving with every kernel in the 2.6.21 series tested so far.

I'm going to reboot twice more tonight. The first reboot is to 2.6.21-rc4,
to verify that the output of ls -l --full-time is as I said above; I'll
save this, do that again and clip it in after the reboot. The second is to
2.6.20.4-rc1, to see if by chance one of those patches is the guilty party.
I'll leave the latter running tonight for the amanda run & see what falls out.

-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
Try to relax and enjoy the crisis.
-- Ashleigh Brilliant


Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Andrew Morton
On Tue, 20 Mar 2007 00:25:02 +0100
Andreas Steinmetz <[EMAIL PROTECTED]> wrote:

> Mike Christie wrote:
>> Mike Christie wrote:
>>> James Bottomley wrote:
>>>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>>>>> I can't even say if the tapes are written correctly as I can't read them
>>>>>> (one does not reboot production machines back to 2.4.x just to try to
>>>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>>>>> machines).
>>>>> Could you try this patch
>>>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>>>>> I thought st was modified to not send offsets in the last elements but
>>>>> it looks like it wasn't.
>>>> Actually, there are two patches in the email referred to.  If the
>>>> analysis that we're passing NULL to mempool_free is correct, it should
>>>> be the second one that fixes the problem (the one that checks
>>>> bio->bi_io_vec before freeing it).  Which would mean we have a
>>>> nr_vecs==0 bio generated by the tar somehow.
>>>
>>> I think we might only need the first patch if the problem is similar to
>>> what the lsi guys were seeing. I thought the problem is that we are not
>>> estimating how large the transfer is correctly because we do not take
>>> into account offsets at the end. This results in nr_vecs being zero when
>>> it should be a valid value. I thought Kai's patch:
>>> http://bugzilla.kernel.org/show_bug.cgi?id=7919
>>> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
>>> fixed the problem on st's side,
>>
>> Oh, I noticed that the subject for the mail references 2.6.30.3 and the
>> patch for st in the bugzilla did not make into 2.6.20 and is not in .3.
>> Could we try the st patch in the bugzilla first?
>
> Ok, the st patch from bugzilla solves the problem (tested on both
> affected machines).


If you're referring to the below patch then it's already in mainline, and
has been for a month.

Have you tested 2.6.21-rc4?  If not, please do so.

Perhaps we should merge this into 2.6.20.x?



commit 9abe16c670bd3d4ab5519257514f9f291383d104
Author: Kai Makisara <[EMAIL PROTECTED]>
Date:   Sat Feb 3 13:21:29 2007 +0200

    [SCSI] st: fix "Tape dies if wrong block size used", bug 7919

On Thu, 1 Feb 2007, Andrew Morton wrote:
 On Thu, 1 Feb 2007 15:34:29 -0800
 [EMAIL PROTECTED] wrote:

  http://bugzilla.kernel.org/show_bug.cgi?id=7919
 
 Summary: Tape dies if wrong block size used
  Kernel Version: 2.6.20-rc5
  Status: NEW
Severity: normal
   Owner: [EMAIL PROTECTED]
   Submitter: [EMAIL PROTECTED]
 
 
  Most recent kernel where this bug did *NOT* occur: 2.6.17.14
 
  Other Kernels Tested and Results:
 
  OK 2.6.15.7
  OK 2.6.16.37
  OK 2.6.17.14
  BAD 2.6.18.6
  BAD 2.6.18-1.2869.fc6
  BAD 2.6.19.2 +
  BAD 2.6.20-rc5
 
  NOTE: 2.6.18-1.2869.fc6 is a Fedora modified kernel, all others are 
from kernel.org
 
...
  Steps to reproduce:
  Get a Adaptec AHA-2940U/UW/D / AIC-7881U card and a tape drive,
  install a recent kernel
  set the tape block size - mt setblk 4096
  read from or write to tape using wrong block size - tar -b 7 -cvf 
/dev/tape foo
 
Write does not trigger this bug because the driver refuses in fixed block
mode writes that are not a multiple of the block size. Read does trigger
it in my system.

The bug is not associated with any specific HBA. st tries to do direct i/o
in fixed block mode with reads that are not a multiple of tape block size.

The patch in this message fixes the st problem by switching to using the
driver buffer up to the next close of the device file in fixed block mode
if the user asks for a read like this.
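The rule in the commit message can be expressed as a small state machine. This is a hypothetical sketch, not code from st.c. st_model_file, st_model_read_uses_buffer(), st_model_close() and tar_record_bytes() are invented names, and a block size of 0 is assumed to mean variable block mode. The sketch also shows why the reproducer hits the case: tar's -b 7 asks for 7 x 512 = 3584-byte records, which is not a multiple of the 4096-byte block size set with mt setblk 4096.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-open state; not the real struct scsi_tape. */
struct st_model_file {
    unsigned int block_size;   /* 0 is taken to mean variable block mode */
    bool buffer_until_close;   /* sticky switch away from direct i/o */
};

/* tar record size: blocking factor times 512-byte tar blocks. */
unsigned int tar_record_bytes(unsigned int blocking_factor)
{
    return blocking_factor * 512u;
}

/*
 * In this model, returns true when a read of count bytes uses the driver
 * buffer. In fixed block mode, a read that is not a whole number of blocks
 * switches the file to buffered reads until close.
 */
bool st_model_read_uses_buffer(struct st_model_file *f, unsigned int count)
{
    if (f->block_size != 0 && count % f->block_size != 0)
        f->buffer_until_close = true;
    return f->buffer_until_close;
}

void st_model_close(struct st_model_file *f)
{
    f->buffer_until_close = false;
}

void st_model_self_check(void)
{
    struct st_model_file f = { .block_size = 4096, .buffer_until_close = false };

    assert(tar_record_bytes(7) == 3584);
    assert(!st_model_read_uses_buffer(&f, 8192));          /* whole blocks */
    assert(st_model_read_uses_buffer(&f, tar_record_bytes(7)));
    assert(st_model_read_uses_buffer(&f, 8192));           /* still sticky */
    st_model_close(&f);
    assert(!st_model_read_uses_buffer(&f, 8192));
}
```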

I don't know why the bug has surfaced only after 2.6.17 although the st
problem is old. There may be another bug in the block subsystem and this
patch works around it. However, the patch fixes a problem in st and in
this way it is a valid fix.

This patch may also fix the bug 7900.

The patch compiles and is lightly tested.

Signed-off-by: Kai Makisara <[EMAIL PROTECTED]>
Signed-off-by: James Bottomley <[EMAIL PROTECTED]>

diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index e016e09..fba8b20 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -9,7 +9,7 @@
Steve Hirsch, Andreas Koppenhofer, Michael Leodolter, Eyal Lebedinsky,
Michael Schaefer, Jorg Weule, and Eric Youngdale.
 
-   Copyright 1992 - 2006 Kai Makisara
+   Copyright 1992 - 2007 Kai Makisara
email [EMAIL PROTECTED]
 
Some small formal changes - aeb, 950809
@@ -17,7 +17,7 @@
Last modified: 18-JAN-1998 Richard Gooch [EMAIL PROTECTED] Devfs support
  */
 
-static const char *verstr = "20061107";
+static const char *verstr = "20070203";
 
 #include <linux/module.h>
 
@@ -1168,6 +1168,7 @@ static int st_open(struct inode *inode, 
  

Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-19 Thread Andreas Steinmetz
Andrew Morton wrote:
> On Tue, 20 Mar 2007 00:25:02 +0100
> Andreas Steinmetz <[EMAIL PROTECTED]> wrote:
>
>> Mike Christie wrote:
>>> Mike Christie wrote:
>>>> James Bottomley wrote:
>>>>> On Mon, 2007-03-19 at 12:49 -0500, Mike Christie wrote:
>>>>>>> I can't even say if the tapes are written correctly as I can't read them
>>>>>>> (one does not reboot production machines back to 2.4.x just to try to
>>>>>>> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
>>>>>>> machines).
>>>>>> Could you try this patch
>>>>>> http://marc.info/?l=linux-scsi&m=116464965414878&w=2
>>>>>> I thought st was modified to not send offsets in the last elements but
>>>>>> it looks like it wasn't.
>>>>> Actually, there are two patches in the email referred to.  If the
>>>>> analysis that we're passing NULL to mempool_free is correct, it should
>>>>> be the second one that fixes the problem (the one that checks
>>>>> bio->bi_io_vec before freeing it).  Which would mean we have a
>>>>> nr_vecs==0 bio generated by the tar somehow.
>>>>>
>>>> I think we might only need the first patch if the problem is similar to
>>>> what the lsi guys were seeing. I thought the problem is that we are not
>>>> estimating how large the transfer is correctly because we do not take
>>>> into account offsets at the end. This results in nr_vecs being zero when
>>>> it should be a valid value. I thought Kai's patch:
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=7919
>>>> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=9abe16c670bd3d4ab5519257514f9f291383d104
>>>> fixed the problem on st's side,
>>> Oh, I noticed that the subject for the mail references 2.6.30.3 and the
>>> patch for st in the bugzilla did not make into 2.6.20 and is not in .3.
>>> Could we try the st patch in the bugzilla first?
>> Ok, the st patch from bugzilla solves the problem (tested on both
>> affected machines).
>
>
> If you're referring to the below patch then it's already in mainline, and
> has been for a month.
>

Yes, that's the patch I'm referring to.

> Have you tested 2.6.21-rc4?  If not, please do so.
>

Sorry, this is not possible on these machines. They are production
servers and every problem on them that cannot be easily solved via
remote access is a 40km (one way) drive in the middle of the night.

> Perhaps we should merge this into 2.6.20.x?
>

I would suggest so.

 
 
 commit 9abe16c670bd3d4ab5519257514f9f291383d104
 Author: Kai Makisara [EMAIL PROTECTED]
 Date:   Sat Feb 3 13:21:29 2007 +0200
 
 [SCSI] st: fix Tape dies if wrong block size used, bug 7919
 
 On Thu, 1 Feb 2007, Andrew Morton wrote:
  On Thu, 1 Feb 2007 15:34:29 -0800
  [EMAIL PROTECTED] wrote:
 
   http://bugzilla.kernel.org/show_bug.cgi?id=7919
  
  Summary: Tape dies if wrong block size used
   Kernel Version: 2.6.20-rc5
   Status: NEW
 Severity: normal
Owner: [EMAIL PROTECTED]
Submitter: [EMAIL PROTECTED]
  
  
   Most recent kernel where this bug did *NOT* occur: 2.6.17.14
  
   Other Kernels Tested and Results:
  
   OK 2.6.15.7
   OK 2.6.16.37
   OK 2.6.17.14
   BAD 2.6.18.6
   BAD 2.6.18-1.2869.fc6
   BAD 2.6.19.2 +
   BAD 2.6.20-rc5
  
   NOTE: 2.6.18-1.2869.fc6 is a Fedora modified kernel, all others are 
 from kernel.org
  
 ...
   Steps to reproduce:
   Get a Adaptec AHA-2940U/UW/D / AIC-7881U card and a tape drive,
   install a recent kernel
   set the tape block size - mt setblk 4096
   read from or write to tape using wrong block size - tar -b 7 -cvf 
 /dev/tape foo
  
 Write does not trigger this bug because the driver refuses in fixed block
 mode writes that are not a multiple of the block size. Read does trigger
 it in my system.
 
 The bug is not associated with any specific HBA. st tries to do direct i/o
 in fixed block mode with reads that are not a multiple of tape block size.
 
 The patch in this message fixes the st problem by switching to using the
 driver buffer up to the next close of the device file in fixed block mode
 if the user asks for a read like this.
 
 I don't know why the bug has surfaced only after 2.6.17 although the st
 problem is old. There may be another bug in the block subsystem and this
 patch works around it. However, the patch fixes a problem in st and in
 this way it is a valid fix.
 
 This patch may also fix the bug 7900.
 
 The patch compiles and is lightly tested.
 
 Signed-off-by: Kai Makisara [EMAIL PROTECTED]
 Signed-off-by: James Bottomley [EMAIL PROTECTED]
 
 diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
 index e016e09..fba8b20 100644
 --- a/drivers/scsi/st.c
 +++ b/drivers/scsi/st.c
 @@ -9,7 +9,7 @@
 Steve Hirsch, Andreas Koppenhofer, Michael Leodolter, Eyal Lebedinsky,
 Michael Schaefer, Jorg Weule, and Eric Youngdale.
  
 -   Copyright 1992 - 2006 Kai Makisara
 +   Copyright 1992 - 2007 Kai Makisara
  

Re: 2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-18 Thread Andrew Morton
On Mon, 19 Mar 2007 01:34:22 +0100 Andreas Steinmetz <[EMAIL PROTECTED]> wrote:

> As posted to lkml and linux-scsi on 2007-03-15 without reply, see
> http://marc.info/?l=linux-kernel&m=117395128412313&w=2 for original post:

Repeatable oops in our most recently released kernel, nobody bothers to
reply.

> It is not so nice when one can write backup tapes but the tapes cannot
> be read. I don't know if memory management or the st driver is the
> culprit, but this is a not so nice situation.
> 
> I can't even say if the tapes are written correctly as I can't read them
> (one does not reboot production machines back to 2.4.x just to try to
> read a backup tape - I don't have 2.6.x older than 2.6.20 on these
> machines).

BUG_ON(!PageSlab(page));

that's seriously screwed up.  Do you have CONFIG_DEBUG_SLAB enabled?  If
not, please enable it and retest.



2.6.20.3: kernel BUG at mm/slab.c:597 try#2

2007-03-18 Thread Andreas Steinmetz
As posted to lkml and linux-scsi on 2007-03-15 without reply, see
http://marc.info/?l=linux-kernel&m=117395128412313&w=2 for original post:

It is not so nice when one can write backup tapes but the tapes cannot
be read. I don't know if memory management or the st driver is the
culprit, but this is a not so nice situation.

I can't even say if the tapes are written correctly as I can't read them
(one does not reboot production machines back to 2.4.x just to try to
read a backup tape - I don't have 2.6.x older than 2.6.20 on these
machines).
-- 
Andreas Steinmetz   SPAMmers use [EMAIL PROTECTED]
