Re: Multipath failover handling (Was: Re: 2.6.24-rc3-mm1)

2008-01-07 Thread Mike Christie

James Bottomley wrote:
> > > However, there's still devloss_tmo to consider ... even in
> > > multipath, I don't think you want to signal path failure until
> > > devloss_tmo has fired otherwise you'll get too many transient up/down
> > > events which damage performance if the array has an expensive failover
> > > model.
> > >
> > Yes. But currently we have a very high failover latency as we always have
> > to wait for the requeued commands to time-out.
> > Hence we're damaging performance on arrays with inexpensive failover.
>
> If it's an either/or choice between the two that's showing our current
> approach to multi-path is broken.
>
> > > The other problem is what to do with in-flight commands at the time the
> > > link went down.  With your current patch, they're still stuck until they
> > > time out ... surely there needs to be some type of recovery mechanism
> > > for these?
> > >
> > Well, the in-flight commands are owned by the HBA driver, which should
> > have the proper code to terminate / return those commands with the
> > appropriate codes. They will then be rescheduled and will be caught
> > like 'normal' IO requests.
>
> But my point is that if a driver goes blocked, those commands will be
> forced to wait the blocked timeout anyway, so your proposed patch does
> nothing to improve the case for dm anyway ... you only avoid commands
> stuck when a device goes blocked if by chance its request queue was
> empty.
>

How about my patches to use new transport error values and make
iSCSI and FC behave the same?


The problem I think Hannes and I are both trying to solve is this:

1. We do not want to wait for dev_loss_tmo seconds for failover.

2. The FC drivers can hook into the fast_io_fail_tmo related callouts, and
with that the tmo can be set to a very low value, like a couple of seconds,
if they are using multipath, so failovers are fast. However, there is a bug:
when the fast_io_fail_tmo fires, requests that made it to the
driver get failed and returned to the multipath layer, but commands in
the blocked request queue are stuck there until dev_loss_tmo fires.


With my patches here (need to be rediffed and for FC I need to handle 
JamesS's comments about not using a new field for the fast_fail_timeout 
state bit):


http://marc.info/?l=linux-scsi&m=117399843216280&w=2
http://marc.info/?l=linux-scsi&m=117399544112073&w=2
http://marc.info/?l=linux-scsi&m=117399844316771&w=2
http://marc.info/?l=linux-scsi&m=117400203324693&w=2
http://marc.info/?l=linux-scsi&m=117400203324690&w=2

For FC we can use the fast_io_fail_tmo for fast failovers, and commands
will not get stuck in a blocked queue for dev_loss_tmo seconds, because
when the fast_io_fail_tmo fires the target's queues are unblocked and
fc_remote_port_chkready() kicks in (iSCSI does the same with the
patches in the links). And with the patches, if multipath-tools is
sending its path-testing IO, it will get a DID_TRANSPORT_* error code
that it can use to make a decent path-failing decision.
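
The timing difference described above can be sketched as a toy user-space
model. This is not kernel code: the struct, the function names and the
unblock_on_fast_fail switch are hypothetical, and only the names
fast_io_fail_tmo and dev_loss_tmo come from this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one remote port; every name here is hypothetical. */
struct rport_model {
    unsigned int fast_io_fail_tmo; /* seconds until in-driver commands are failed */
    unsigned int dev_loss_tmo;     /* seconds until the remote port is given up   */
    bool unblock_on_fast_fail;     /* true = behaviour proposed in the patches    */
};

/* Seconds after link loss until a command that reached the driver is failed. */
unsigned int in_driver_fail_time(const struct rport_model *m)
{
    assert(m->fast_io_fail_tmo <= m->dev_loss_tmo);
    return m->fast_io_fail_tmo;
}

/* Seconds after link loss until a command parked in the blocked queue is failed. */
unsigned int queued_fail_time(const struct rport_model *m)
{
    assert(m->fast_io_fail_tmo <= m->dev_loss_tmo);
    /* Without the unblock, queued commands wait for dev_loss_tmo. */
    return m->unblock_on_fast_fail ? m->fast_io_fail_tmo : m->dev_loss_tmo;
}
```

With illustrative values of 5 and 60 seconds, a queued command in this model
fails back after 60 seconds without the unblock and after 5 seconds with it,
while a command already in the driver fails back after 5 seconds either way.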

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Multipath failover handling (Was: Re: 2.6.24-rc3-mm1)

2008-01-07 Thread James Bottomley
On Mon, 2008-01-07 at 15:05 +0100, Hannes Reinecke wrote:
> James Bottomley wrote:
> > On Fri, 2007-12-14 at 10:00 +0100, Hannes Reinecke wrote:
> >> James Bottomley wrote:
> >>> On Mon, 2007-11-26 at 22:15 -0800, Andrew Morton wrote:
> >>>> OK, thanks.  I'll assume that James and Hannes have this in hand (or will
> >>>> have, by mid-week) and I won't do anything here.
> >>> Just to confirm what I think I'm going to be doing:  rebasing the
> >>> scsi-misc tree to remove this commit:
> >>>
> >>> commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> >>> Author: Hannes Reinecke <[EMAIL PROTECTED]>
> >>> Date:   Tue Nov 6 09:23:40 2007 +0100
> >>>
> >>> [SCSI] Do not requeue requests if REQ_FAILFAST is set
> >>>
> >>> And its allied fix ups:
> >>>
> >>> commit 983289045faa96fba8841d3c51b98bb8623d9504
> >>> Author: James Bottomley <[EMAIL PROTECTED]>
> >>> Date:   Sat Nov 24 19:47:25 2007 +0200
> >>>
> >>> [SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE
> >>>
> >>> commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
> >>> Author: James Bottomley <[EMAIL PROTECTED]>
> >>> Date:   Sat Nov 24 19:55:53 2007 +0200
> >>>
> >>> [SCSI] fix domain validation to work again
> >>>
> >>> James
> >>>
> >>>
> >> Or just apply my latest patch (cf Undo __scsi_kill_request).
> >> The main point is that we shouldn't retry requests
> >> with FAILFAST set when the queue is blocked. AFAICS
> >> only FC and iSCSI transports set the queue to blocked,
> >> and use this to indicate a loss of connection. So any
> >> retry with queue blocked is futile.
> > 
> > I still don't think this is the right approach.
> > 
> > For link up/down events, those are direct pathing events and should be
> > signalled along a kernel notifier, not by mucking with the SCSI state
> > machine.
> Of course they will be signalled. And eventually we should patch up
> multipath-tools to read the existing events from the uevent socket.
> But even with that patch there is a quite largish window during
> which IOs will be sent to the blocked device, and hence will be
> stuck in the request queue until the timer expires.

But the assumption your code makes is that if REQ_FAILFAST is set then
it's a dm request ... and that's not true.  The code in question
negatively impacts other users of REQ_FAILFAST.  For every user other
than dm, the right thing to do is to wait out the block.

> > However, there's still devloss_tmo to consider ... even in
> > multipath, I don't think you want to signal path failure until
> > devloss_tmo has fired otherwise you'll get too many transient up/down
> > events which damage performance if the array has an expensive failover
> > model.
> > 
> Yes. But currently we have a very high failover latency as we always have
> to wait for the requeued commands to time-out.
> Hence we're damaging performance on arrays with inexpensive failover.

If it's an either/or choice between the two, that shows our current
approach to multipath is broken.

> > The other problem is what to do with in-flight commands at the time the
> > link went down.  With your current patch, they're still stuck until they
> > time out ... surely there needs to be some type of recovery mechanism
> > for these?
> > 
> Well, the in-flight commands are owned by the HBA driver, which should
> have the proper code to terminate / return those commands with the
> appropriate codes. They will then be rescheduled and will be caught
> like 'normal' IO requests.

But my point is that if a driver goes blocked, those commands will be
forced to wait the blocked timeout anyway, so your proposed patch does
nothing to improve the case for dm anyway ... you only avoid commands
stuck when a device goes blocked if by chance its request queue was
empty.

James




Multipath failover handling (Was: Re: 2.6.24-rc3-mm1)

2008-01-07 Thread Hannes Reinecke
James Bottomley wrote:
> On Fri, 2007-12-14 at 10:00 +0100, Hannes Reinecke wrote:
>> James Bottomley wrote:
>>> On Mon, 2007-11-26 at 22:15 -0800, Andrew Morton wrote:
>>>> OK, thanks.  I'll assume that James and Hannes have this in hand (or will
>>>> have, by mid-week) and I won't do anything here.
>>> Just to confirm what I think I'm going to be doing:  rebasing the
>>> scsi-misc tree to remove this commit:
>>>
>>> commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
>>> Author: Hannes Reinecke <[EMAIL PROTECTED]>
>>> Date:   Tue Nov 6 09:23:40 2007 +0100
>>>
>>> [SCSI] Do not requeue requests if REQ_FAILFAST is set
>>>
>>> And its allied fix ups:
>>>
>>> commit 983289045faa96fba8841d3c51b98bb8623d9504
>>> Author: James Bottomley <[EMAIL PROTECTED]>
>>> Date:   Sat Nov 24 19:47:25 2007 +0200
>>>
>>> [SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE
>>>
>>> commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
>>> Author: James Bottomley <[EMAIL PROTECTED]>
>>> Date:   Sat Nov 24 19:55:53 2007 +0200
>>>
>>> [SCSI] fix domain validation to work again
>>>
>>> James
>>>
>>>
>> Or just apply my latest patch (cf Undo __scsi_kill_request).
>> The main point is that we shouldn't retry requests
>> with FAILFAST set when the queue is blocked. AFAICS
>> only FC and iSCSI transports set the queue to blocked,
>> and use this to indicate a loss of connection. So any
>> retry with queue blocked is futile.
> 
> I still don't think this is the right approach.
> 
> For link up/down events, those are direct pathing events and should be
> signalled along a kernel notifier, not by mucking with the SCSI state
> machine.
Of course they will be signalled. And eventually we should patch up
multipath-tools to read the existing events from the uevent socket.
But even with that patch there is a quite largish window during
which IOs will be sent to the blocked device, and hence will be
stuck in the request queue until the timer expires.

> However, there's still devloss_tmo to consider ... even in
> multipath, I don't think you want to signal path failure until
> devloss_tmo has fired otherwise you'll get too many transient up/down
> events which damage performance if the array has an expensive failover
> model.
> 
Yes. But currently we have a very high failover latency as we always have
to wait for the requeued commands to time-out.
Hence we're damaging performance on arrays with inexpensive failover.

> The other problem is what to do with in-flight commands at the time the
> link went down.  With your current patch, they're still stuck until they
> time out ... surely there needs to be some type of recovery mechanism
> for these?
> 
Well, the in-flight commands are owned by the HBA driver, which should
have the proper code to terminate / return those commands with the
appropriate codes. They will then be rescheduled and will be caught
like 'normal' IO requests.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
[EMAIL PROTECTED] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


Re: 2.6.24-rc3-mm1

2007-12-14 Thread James Bottomley

On Fri, 2007-12-14 at 10:00 +0100, Hannes Reinecke wrote:
> James Bottomley wrote:
> > On Mon, 2007-11-26 at 22:15 -0800, Andrew Morton wrote:
> >> OK, thanks.  I'll assume that James and Hannes have this in hand (or will
> >> have, by mid-week) and I won't do anything here.
> > 
> > Just to confirm what I think I'm going to be doing:  rebasing the
> > scsi-misc tree to remove this commit:
> > 
> > commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> > Author: Hannes Reinecke <[EMAIL PROTECTED]>
> > Date:   Tue Nov 6 09:23:40 2007 +0100
> > 
> > [SCSI] Do not requeue requests if REQ_FAILFAST is set
> > 
> > And its allied fix ups:
> > 
> > commit 983289045faa96fba8841d3c51b98bb8623d9504
> > Author: James Bottomley <[EMAIL PROTECTED]>
> > Date:   Sat Nov 24 19:47:25 2007 +0200
> > 
> > [SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE
> > 
> > commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
> > Author: James Bottomley <[EMAIL PROTECTED]>
> > Date:   Sat Nov 24 19:55:53 2007 +0200
> > 
> > [SCSI] fix domain validation to work again
> > 
> > James
> > 
> > 
> Or just apply my latest patch (cf Undo __scsi_kill_request).
> The main point is that we shouldn't retry requests
> with FAILFAST set when the queue is blocked. AFAICS
> only FC and iSCSI transports set the queue to blocked,
> and use this to indicate a loss of connection. So any
> retry with queue blocked is futile.

I still don't think this is the right approach.

For link up/down events, those are direct pathing events and should be
signalled along a kernel notifier, not by mucking with the SCSI state
machine.  However, there's still devloss_tmo to consider ... even in
multipath, I don't think you want to signal path failure until
devloss_tmo has fired otherwise you'll get too many transient up/down
events which damage performance if the array has an expensive failover
model.

The other problem is what to do with in-flight commands at the time the
link went down.  With your current patch, they're still stuck until they
time out ... surely there needs to be some type of recovery mechanism
for these?

James




Re: 2.6.24-rc3-mm1

2007-12-14 Thread Hannes Reinecke
James Bottomley wrote:
> On Mon, 2007-11-26 at 22:15 -0800, Andrew Morton wrote:
>> OK, thanks.  I'll assume that James and Hannes have this in hand (or will
>> have, by mid-week) and I won't do anything here.
> 
> Just to confirm what I think I'm going to be doing:  rebasing the
> scsi-misc tree to remove this commit:
> 
> commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> Author: Hannes Reinecke <[EMAIL PROTECTED]>
> Date:   Tue Nov 6 09:23:40 2007 +0100
> 
> [SCSI] Do not requeue requests if REQ_FAILFAST is set
> 
> And its allied fix ups:
> 
> commit 983289045faa96fba8841d3c51b98bb8623d9504
> Author: James Bottomley <[EMAIL PROTECTED]>
> Date:   Sat Nov 24 19:47:25 2007 +0200
> 
> [SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE
> 
> commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
> Author: James Bottomley <[EMAIL PROTECTED]>
> Date:   Sat Nov 24 19:55:53 2007 +0200
> 
> [SCSI] fix domain validation to work again
> 
> James
> 
> 
Or just apply my latest patch (cf Undo __scsi_kill_request).
The main point is that we shouldn't retry requests
with FAILFAST set when the queue is blocked. AFAICS
only FC and iSCSI transports set the queue to blocked,
and use this to indicate a loss of connection. So any
retry with queue blocked is futile.
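
The rule proposed above can be sketched as a small decision function. This
is a user-space illustration only, not the patch itself: the enum and
function names are hypothetical, and the real SCSI mid-layer has more
states than the two modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-state view of a device queue. */
enum model_qstate { MODEL_RUNNING, MODEL_BLOCKED };

/* What the model does with a request that arrives at the queue. */
enum model_action { MODEL_DISPATCH, MODEL_REQUEUE, MODEL_FAIL };

/*
 * Proposed rule: while the queue is blocked, a request with FAILFAST set
 * is failed at once instead of being requeued; everything else is
 * requeued and waits for the queue to be unblocked.
 */
enum model_action model_blocked_policy(enum model_qstate state, bool failfast)
{
    assert(state == MODEL_RUNNING || state == MODEL_BLOCKED);

    if (state == MODEL_RUNNING)
        return MODEL_DISPATCH;
    return failfast ? MODEL_FAIL : MODEL_REQUEUE;
}
```

In this model, a FAILFAST request hitting a blocked queue returns
MODEL_FAIL, so a multipath layer above it could retry on another path
without waiting for the block to end.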

Cheers,

Hannes

-- 
Dr. Hannes Reinecke   zSeries & Storage
[EMAIL PROTECTED] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)


Re: 2.6.24-rc3-mm1

2007-12-12 Thread Jens Axboe
On Wed, Dec 12 2007, Boaz Harrosh wrote:
> On Tue, Dec 11 2007 at 18:33 +0200, James Bottomley <[EMAIL PROTECTED]> wrote:
> > On Mon, 2007-11-26 at 22:15 -0800, Andrew Morton wrote:
> >> OK, thanks.  I'll assume that James and Hannes have this in hand (or will
> >> have, by mid-week) and I won't do anything here.
> > 
> > Just to confirm what I think I'm going to be doing:  rebasing the
> > scsi-misc tree to remove this commit:
> > 
> > commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> > Author: Hannes Reinecke <[EMAIL PROTECTED]>
> > Date:   Tue Nov 6 09:23:40 2007 +0100
> > 
> > [SCSI] Do not requeue requests if REQ_FAILFAST is set
> > 
> > And its allied fix ups:
> > 
> > commit 983289045faa96fba8841d3c51b98bb8623d9504
> > Author: James Bottomley <[EMAIL PROTECTED]>
> > Date:   Sat Nov 24 19:47:25 2007 +0200
> > 
> > [SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE
> > 
> > commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
> > Author: James Bottomley <[EMAIL PROTECTED]>
> > Date:   Sat Nov 24 19:55:53 2007 +0200
> > 
> > [SCSI] fix domain validation to work again
> > 
> > James
> > 
> 
> The problems caused by this patch were nagging me at the back of my head
> from the beginning. Why should we fail on a check of FAIL_FAST in all kinds
> of weird places like boot, when the only place that should ever set the
> flag should be one of the multi-path drivers? Finally it struck me:
>
> It might be a bug in ll_rw_blk; in blk_rq_bio_prep() there is this:
> 
> static void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
>   struct bio *bio)
> {
>   /* first two bits are identical in rq->cmd_flags and bio->bi_rw */
>   rq->cmd_flags |= (bio->bi_rw & 3);
>   ...
> 
> Now this is no longer true and is a bug.
> Second bit of bio->bi_rw defined in bio.h is:
> #define BIO_RW_AHEAD  1
> but
> Second bit of rq->cmd_flags is __REQ_FAILFAST
> 
> so maybe we are getting FAILFAST in the wrong places?

But that's actually on purpose, though the comment is pretty much crap.
We don't want to be retrying readahead requests, those should always
just be tossable.
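
The flag propagation being discussed can be reproduced in a small
user-space sketch. The bit position 1 for both BIO_RW_AHEAD and
__REQ_FAILFAST is taken from Boaz's description quoted above; the model_*
types and names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>

enum {
    MODEL_BIO_RW_AHEAD_BIT = 1, /* from the thread: BIO_RW_AHEAD is 1          */
    MODEL_REQ_FAILFAST_BIT = 1  /* from the thread: second bit of rq cmd_flags */
};

/* Hypothetical stand-ins for struct bio and struct request. */
struct model_bio     { unsigned long bi_rw; };
struct model_request { unsigned int cmd_flags; };

/* Mirrors the quoted line: copy the two low bits of bi_rw into cmd_flags. */
void model_rq_bio_prep(struct model_request *rq, const struct model_bio *bio)
{
    assert(rq != 0 && bio != 0);
    rq->cmd_flags |= (unsigned int)(bio->bi_rw & 3);
}

/* Returns 1 if the request ended up with the fail-fast bit set, else 0. */
int model_rq_is_failfast(const struct model_request *rq)
{
    return (int)((rq->cmd_flags >> MODEL_REQ_FAILFAST_BIT) & 1u);
}
```

In this sketch a bio with the read-ahead bit set yields a request for which
model_rq_is_failfast() returns 1, which is the "readahead implies no retry"
behaviour described above; a bio without that bit leaves the request
unmarked.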

-- 
Jens Axboe



Re: 2.6.24-rc3-mm1

2007-12-12 Thread Boaz Harrosh
On Tue, Dec 11 2007 at 18:33 +0200, James Bottomley <[EMAIL PROTECTED]> wrote:
> On Mon, 2007-11-26 at 22:15 -0800, Andrew Morton wrote:
>> OK, thanks.  I'll assume that James and Hannes have this in hand (or will
>> have, by mid-week) and I won't do anything here.
> 
> Just to confirm what I think I'm going to be doing:  rebasing the
> scsi-misc tree to remove this commit:
> 
> commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> Author: Hannes Reinecke <[EMAIL PROTECTED]>
> Date:   Tue Nov 6 09:23:40 2007 +0100
> 
> [SCSI] Do not requeue requests if REQ_FAILFAST is set
> 
> And its allied fix ups:
> 
> commit 983289045faa96fba8841d3c51b98bb8623d9504
> Author: James Bottomley <[EMAIL PROTECTED]>
> Date:   Sat Nov 24 19:47:25 2007 +0200
> 
> [SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE
> 
> commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
> Author: James Bottomley <[EMAIL PROTECTED]>
> Date:   Sat Nov 24 19:55:53 2007 +0200
> 
> [SCSI] fix domain validation to work again
> 
> James
> 

The problems caused by this patch were nagging me at the back of my head
from the beginning. Why should we fail on a check of FAIL_FAST in all kinds
of weird places like boot, when the only place that should ever set the
flag should be one of the multi-path drivers? Finally it struck me:

It might be a bug in ll_rw_blk; in blk_rq_bio_prep() there is this:

static void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
                            struct bio *bio)
{
        /* first two bits are identical in rq->cmd_flags and bio->bi_rw */
        rq->cmd_flags |= (bio->bi_rw & 3);
        ...

Now this is no longer true and is a bug.
Second bit of bio->bi_rw defined in bio.h is:
#define BIO_RW_AHEAD  1
but
Second bit of rq->cmd_flags is __REQ_FAILFAST

so maybe we are getting FAILFAST in the wrong places?

(I will look for an old patch I sent a year ago that fixes
this bug)

Boaz
 



Re: 2.6.24-rc3-mm1

2007-12-12 Thread Jens Axboe
On Wed, Dec 12 2007, Boaz Harrosh wrote:
> On Tue, Dec 11 2007 at 18:33 +0200, James Bottomley <[EMAIL PROTECTED]> wrote:
> > On Mon, 2007-11-26 at 22:15 -0800, Andrew Morton wrote:
> > > OK, thanks.  I'll assume that James and Hannes have this in hand (or will
> > > have, by mid-week) and I won't do anything here.
> >
> > Just to confirm what I think I'm going to be doing:  rebasing the
> > scsi-misc tree to remove this commit:
> >
> > commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> > Author: Hannes Reinecke <[EMAIL PROTECTED]>
> > Date:   Tue Nov 6 09:23:40 2007 +0100
> >
> > [SCSI] Do not requeue requests if REQ_FAILFAST is set
> >
> > And its allied fix ups:
> >
> > commit 983289045faa96fba8841d3c51b98bb8623d9504
> > Author: James Bottomley <[EMAIL PROTECTED]>
> > Date:   Sat Nov 24 19:47:25 2007 +0200
> >
> > [SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE
> >
> > commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
> > Author: James Bottomley <[EMAIL PROTECTED]>
> > Date:   Sat Nov 24 19:55:53 2007 +0200
> >
> > [SCSI] fix domain validation to work again
> >
> > James
> >
>
> The problems caused by this patch were nagging me at the back of my head
> from the beginning. Why should we fail on a check of FAIL_FAST in all kinds
> of weird places like boot, when the only place that should ever set the
> flag should be one of the multi-path drivers? Finally it struck me:
>
> It might be a bug in ll_rw_blk. In blk_rq_bio_prep() there is this:
>
> static void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
> 			    struct bio *bio)
> {
> 	/* first two bits are identical in rq->cmd_flags and bio->bi_rw */
> 	rq->cmd_flags |= (bio->bi_rw & 3);
> 	...
>
> Now this is no longer true and is a bug.
> The second bit of bio->bi_rw, defined in bio.h, is:
> #define BIO_RW_AHEAD	1
> but
> the second bit of rq->cmd_flags is __REQ_FAILFAST
>
> so maybe we are getting FAILFAST in the wrong places?

But that's actually on purpose, though the comment is pretty much crap.
We don't want to be retrying readahead requests, those should always
just be tossable.
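
To make the aliasing discussed here concrete, a minimal standalone sketch
follows (ordinary userspace C, not kernel code). The DEMO_* bit numbers are
stand-ins chosen to match what the thread states, namely that bit 1 is
BIO_RW_AHEAD in bio->bi_rw and __REQ_FAILFAST in rq->cmd_flags. That bit 0
carries the read/write direction on both sides is an assumption of this
sketch, and the helper names are hypothetical.

```c
/* Stand-in bit numbers; the names only echo the kernel ones quoted above. */
enum { DEMO_BIO_RW = 0, DEMO_BIO_RW_AHEAD = 1 };   /* bits of bio->bi_rw */
enum { DEMO_REQ_RW = 0, DEMO_REQ_FAILFAST = 1 };   /* bits of rq->cmd_flags */

/* Mirrors `rq->cmd_flags |= (bio->bi_rw & 3)`: copy the two low bits. */
static unsigned long demo_prep_cmd_flags(unsigned long cmd_flags,
					 unsigned long bi_rw)
{
	return cmd_flags | (bi_rw & 3);
}

/* Returns 1 if the resulting request flags mark the request fail-fast. */
static int demo_is_failfast(unsigned long cmd_flags)
{
	return (int)((cmd_flags >> DEMO_REQ_FAILFAST) & 1);
}
```

With these definitions, a bio flagged as readahead comes out of
demo_prep_cmd_flags() with the fail-fast bit set, while a plain write does
not, which is consistent with the point that readahead requests are
deliberately not retried.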

-- 
Jens Axboe



Re: 2.6.24-rc3-mm1

2007-12-11 Thread James Bottomley

On Mon, 2007-11-26 at 22:15 -0800, Andrew Morton wrote:
> OK, thanks.  I'll assume that James and Hannes have this in hand (or will
> have, by mid-week) and I won't do anything here.

Just to confirm what I think I'm going to be doing:  rebasing the
scsi-misc tree to remove this commit:

commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
Author: Hannes Reinecke <[EMAIL PROTECTED]>
Date:   Tue Nov 6 09:23:40 2007 +0100

[SCSI] Do not requeue requests if REQ_FAILFAST is set

And its allied fix ups:

commit 983289045faa96fba8841d3c51b98bb8623d9504
Author: James Bottomley <[EMAIL PROTECTED]>
Date:   Sat Nov 24 19:47:25 2007 +0200

[SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE

commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
Author: James Bottomley <[EMAIL PROTECTED]>
Date:   Sat Nov 24 19:55:53 2007 +0200

[SCSI] fix domain validation to work again

James




Re: 2.6.24-rc3-mm1 make headers_check fails

2007-12-02 Thread Avi Kivity

Andrew Morton wrote:
> On Wed, 21 Nov 2007 12:17:14 +0200 Avi Kivity <[EMAIL PROTECTED]> wrote:
>
>> Avi Kivity wrote:
>>>> The make headers_check fails,
>>>>
>>>>  CHECK   include/linux/usb/gadgetfs.h
>>>>  CHECK   include/linux/usb/ch9.h
>>>>  CHECK   include/linux/usb/cdc.h
>>>>  CHECK   include/linux/usb/audio.h
>>>>  CHECK   include/linux/kvm.h
>>>> /root/kernels/linux-2.6.24-rc3/usr/include/linux/kvm.h requires
>>>> asm/kvm.h, which does not exist in exported headers
>>>>
>>> hm, works for me, on i386 and x86_64.  What's different over there?
>>>
>>  Hi Andrew,
>>
>>  It fails on the powerpc box, with allyesconfig option.
>>
>>>>> How do we fix this?  Export linux/kvm.h only on x86?  Seems ugly.
>>>>
>>>> Is kvm x86 specific? Then move the .h file to asm-x86.
>>>> Otherwise no good idea...
>>>
>>> kvm.h is x86 specific today, but will be s390, ppc, ia64, and x86
>>> specific tomorrow.
>>>
>>> What about having an asm-generic/kvm.h with a nice #error?  Would
>>> that suit?
>>
>> headers_check continues to complain.  Is the only recourse to add
>> asm/kvm.h for all archs?
>
> That would work.
>
> Meanwhile my recourse is to drop the kvm tree ;)

Since you put it this way...

I committed the attached (sorry) patch to kvm.git.   Rather than 
touching 2*($NARCH - 1) file, I changed include/linux/Kbuild to only 
export kvm.h if the arch actually supports it.  Currently that's just x86.



--
error compiling committee.c: too many arguments to function

From a393444c97f6d7355a6d7d6d7aeb80f1e72472b1 Mon Sep 17 00:00:00 2001
From: Avi Kivity <[EMAIL PROTECTED]>
Date: Sun, 2 Dec 2007 10:50:06 +0200
Subject: [PATCH] KVM: Export include/linux/kvm.h only if $ARCH actually supports KVM

Currently, make headers_check barfs due to <asm/kvm.h>, which <linux/kvm.h>
includes, not existing.  Rather than add a zillion <asm/kvm.h>s, export kvm.h
only if the arch actually supports it.

Signed-off-by: Avi Kivity <[EMAIL PROTECTED]>
---
 arch/x86/Kconfig |3 +++
 drivers/kvm/Kconfig  |4 ++--
 include/linux/Kbuild |2 +-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 368864d..eded44e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -112,6 +112,9 @@ config GENERIC_TIME_VSYSCALL
 	bool
 	default X86_64
 
+config ARCH_SUPPORTS_KVM
+	bool
+	default y
 
 
 
diff --git a/drivers/kvm/Kconfig b/drivers/kvm/Kconfig
index 6569206..4086080 100644
--- a/drivers/kvm/Kconfig
+++ b/drivers/kvm/Kconfig
@@ -3,7 +3,7 @@
 #
 menuconfig VIRTUALIZATION
 	bool "Virtualization"
-	depends on X86
+	depends on ARCH_SUPPORTS_KVM || X86
 	default y
 	---help---
 	  Say Y here to get to see options for using your Linux host to run other
@@ -16,7 +16,7 @@ if VIRTUALIZATION
 
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
-	depends on X86 && EXPERIMENTAL
+	depends on ARCH_SUPPORTS_KVM && EXPERIMENTAL
 	select PREEMPT_NOTIFIERS
 	select ANON_INODES
 	---help---
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 105c5d6..397197f 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -254,7 +254,7 @@ unifdef-y += kd.h
 unifdef-y += kernelcapi.h
 unifdef-y += kernel.h
 unifdef-y += keyboard.h
-unifdef-y += kvm.h
+unifdef-$(CONFIG_ARCH_SUPPORTS_KVM) += kvm.h
 unifdef-y += llc.h
 unifdef-y += loop.h
 unifdef-y += lp.h
-- 
1.5.3



Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-28 Thread Laurent Riffard
On 25.11.2007 21:39, Laurent Riffard wrote:
> On 25.11.2007 08:37, James Bottomley wrote:
>> On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote:
>>> On 24.11.2007 14:26, James Bottomley wrote:
>>>> OK, could you post dmesgs again, please.  I actually tested this with an
>>>> aic79xx card, and for me it does cause Domain Validation to succeed
>>>> again.
>>> James,
>>>
>>> Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the
>>> BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).
>>>
[...]
>>> [   25.521256] scsi0 : pata_via
>>> [   25.521711] scsi1 : pata_via
>>> [   25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xb800 irq 14
>>> [   25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xb808 irq 15
>>> [   25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100
>>> [   25.683208] ata1.00: 78165360 sectors, multi 16: LBA
>>> [   25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133
>>> [   25.684116] ata1.01: 160086528 sectors, multi 16: LBA
>>> [   25.691127] ata1.00: configured for UDMA/100
>>> [   25.699142] ata1.01: configured for UDMA/100
>>> [   26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max UDMA/33
>>> [   26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr
>>> [   26.330839] ata2.00: configured for UDMA/33
>>> [   26.490828] ata2.01: configured for MWDMA2
>>> [   26.503014] scsi 0:0:0:0: Direct-Access ATA  ST340016A 3.75 PQ: 0 ANSI: 5
>>> [   26.504670] scsi 0:0:1:0: Direct-Access ATA  Maxtor 6Y080L0 YAR4 PQ: 0 ANSI: 5
>>> [   26.509842] scsi 1:0:0:0: CD-ROM HL-DT-ST DVDRAM GSA-4165B DL05 PQ: 0 ANSI: 5
>>> [   26.511673] scsi 1:0:1:0: CD-ROM E-IDE CD-950E/AKU A4Q PQ: 0 ANSI: 5
>> [...]
>>> [   60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>>> [   60.216124] end_request: I/O error, dev sda, sector 16460
>> I think this one's quite easy:  PATA devices in libata are queue depth 1
>> (since they don't do NCQ).  Thus, they're peculiarly sensitive to the
>> bug where we fail over queue depth requests.
>>
>> On the other hand, I don't see how a filesystem request is getting
>> REQ_FAILFAST ... unless there's a bio or readahead issue involved.
>> Anyway, could you try this patch:
>>
>> http://marc.info/?l=linux-scsi&m=119592627425498
>>
>> Which should fix the queue depth issue, and see if the errors go away?
> 
> No, this one doesn't help...
 
still happens with 2.6.24-rc3-mm2...
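
The queue-depth explanation quoted above can be pictured with a small
model. This is a sketch under stated assumptions, not the SCSI midlayer
code: demo_dev, demo_dispatch() and the fail_busy_failfast switch are
hypothetical names, and the model assumes only that a request arriving
while the device is at its queue depth is normally requeued, whereas the
patch under discussion fails it when it is marked fail-fast.

```c
#include <stdbool.h>

enum demo_result { DEMO_DISPATCHED, DEMO_REQUEUED, DEMO_FAILED };

struct demo_dev {
	int queue_depth;	/* commands the device accepts at once */
	int in_flight;		/* commands currently outstanding */
};

/*
 * Hypothetical dispatch decision. When fail_busy_failfast is true, a
 * fail-fast request that finds the device full is failed instead of being
 * requeued, which models the behaviour described as the bug.
 */
static enum demo_result demo_dispatch(struct demo_dev *dev, bool failfast,
				      bool fail_busy_failfast)
{
	if (dev->in_flight < dev->queue_depth) {
		dev->in_flight++;
		return DEMO_DISPATCHED;
	}
	if (failfast && fail_busy_failfast)
		return DEMO_FAILED;
	return DEMO_REQUEUED;
}
```

With queue_depth set to 1, as for the PATA disks in the dmesg above, the
second fail-fast request issued while one is outstanding is failed rather
than requeued; with a deeper queue the same request would simply be
dispatched.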
-- 
laurent


Re: 2.6.24-rc3-mm1 make headers_check fails

2007-11-27 Thread Andrew Morton
On Wed, 21 Nov 2007 12:17:14 +0200 Avi Kivity <[EMAIL PROTECTED]> wrote:

> Avi Kivity wrote:
> >   
> >> The make headers_check fails,
> >>
> >>  CHECK   include/linux/usb/gadgetfs.h
> >>  CHECK   include/linux/usb/ch9.h
> >>  CHECK   include/linux/usb/cdc.h
> >>  CHECK   include/linux/usb/audio.h
> >>  CHECK   include/linux/kvm.h
> >> /root/kernels/linux-2.6.24-rc3/usr/include/linux/kvm.h requires 
> >> asm/kvm.h, which does not exist in exported headers
> >>
> > hm, works for me, on i386 and x86_64.  What's different over there?
> >
>  Hi Andrew,
> 
>  It fails on the powerpc box, with allyesconfig option.
> 
>   
>    
> >>> How do we fix this?  Export linux/kvm.h only on x86?  Seems ugly.
> >>> 
> >>
> >> Is kvm x86 specific? Then move the .h file to asm-x86.
> >> Otherwise no good idea...
> >>
> >>   
> >
> > kvm.h is x86 specific today, but will be s390, ppc, ia64, and x86 
> > specific tomorrow.
> >
> > What about having an asm-generic/kvm.h with a nice #error?  Would
> > that suit?
> >
> 
> headers_check continues to complain.  Is the only recourse to add 
> asm/kvm.h for all archs?
> 

That would work.

Meanwhile my recourse is to drop the kvm tree ;)


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-27 Thread Rik van Riel
On Mon, 26 Nov 2007 15:28:32 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Tue, 27 Nov 2007 00:14:17 +0100
> Jiri Slaby <[EMAIL PROTECTED]> wrote:
> 
> > On 11/26/2007 11:17 PM, Andrew Morton wrote:
> > >> Maybe if you can emit a broken-out with the fresh pull to test?
> > > 
> > > http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
> > > compile. 
> > 
> > Yes it did :). And it worked. Both in qemu and on my desktop...
> 
> boggle.  Let's slap 2.6.25 on it and take the rest of the year off.

No worries, the mmotm compiling issue seems to have been fixed:

  CC [M]  drivers/scsi/libsas/sas_ata.o
drivers/scsi/libsas/sas_ata.c:39: error: field ‘rphy’ has incomplete type
drivers/scsi/libsas/sas_ata.c: In function ‘sas_discover_sata’:
drivers/scsi/libsas/sas_ata.c:773: error: implicit declaration of function ‘ata_sas_rphy_alloc’
drivers/scsi/libsas/sas_ata.c:775: error: dereferencing pointer to incomplete type
drivers/scsi/libsas/sas_ata.c:775: warning: assignment makes pointer from integer without a cast
drivers/scsi/libsas/sas_ata.c:781: error: dereferencing pointer to incomplete type
drivers/scsi/libsas/sas_ata.c:782: error: dereferencing pointer to incomplete type
drivers/scsi/libsas/sas_ata.c:784: warning: type defaults to ‘int’ in declaration of ‘__mptr’
drivers/scsi/libsas/sas_ata.c:784: warning: initialization from incompatible pointer type
drivers/scsi/libsas/sas_ata.c:791: error: implicit declaration of function ‘ata_sas_rphy_add’
drivers/scsi/libsas/sas_ata.c:807: error: implicit declaration of function ‘ata_sas_rphy_delete’
drivers/scsi/libsas/sas_ata.c:809: error: implicit declaration of function ‘ata_sas_rphy_free’
make[3]: *** [drivers/scsi/libsas/sas_ata.o] Error 1
make[2]: *** [drivers/scsi/libsas] Error 2
make[1]: *** [drivers/scsi] Error 2
make: *** [drivers] Error 2

So much for continuing the bisect with that tree, to find the
cause of the second bug :)

Guess I'll extract an x86 tree changeset first, to place into
the 2.6.23-rc3-mm1 broken out tree and work from there...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


Re: 2.6.24-rc3-mm1 -- arch/x86/xen/enlighten.c:591: error: 'TLB_FLUSH_ALL' undeclared (first use in this function)

2007-11-27 Thread Jeremy Fitzhardinge
Miles Lane wrote:
> arch/x86/xen/enlighten.c: In function 'xen_flush_tlb_others':
> arch/x86/xen/enlighten.c:591: error: 'TLB_FLUSH_ALL' undeclared (first
> use in this function)
> arch/x86/xen/enlighten.c:591: error: (Each undeclared identifier is
> reported only once
> arch/x86/xen/enlighten.c:591: error: for each function it appears in.)
> make[1]: *** [arch/x86/xen/enlighten.o] Error 1
>   

Hm, I can't reproduce this in current git with your .config.  Is there
something in -mm which touches the tlb headers?

I do have a stack of tglx's x86 unification patches applied as well. 
Perhaps they help.

J


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-27 Thread Ingo Molnar

* Andrew Morton <[EMAIL PROTECTED]> wrote:

> Otherwise, please proceed to work out which diff I need to drop and 
> hope like hell that it isn't git-x86..

hm? x86.git is fully bisectable - so a more accurate statement would be 
"and hope that it's x86.git, so that it can be properly bisected" :-) 
For x86.git bisection, pull the 'mm' branch from:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

Ingo


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-27 Thread Valdis . Kletnieks
On Tue, 27 Nov 2007 16:25:22 +0800, Dave Young said:

> Does boot_delay help?

It might, if the kernel lived long enough to output a first printk for
us to delay after.  :)

Shooting this one would be *easy* if the problem was a boot-time oops that
would otherwise scroll off the screen without a boot_delay...



Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-27 Thread Dave Young
On Nov 27, 2007 3:16 PM,  <[EMAIL PROTECTED]> wrote:
> On Tue, 20 Nov 2007 20:45:25 PST, Andrew Morton said:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>
> Finally got both time and motivation to at least start a bisect..
>
> 2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200)
>
> 24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub
> prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam*
> dead. No serial console output, no pair of penguins on the monitor, no
> netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does
> *anything* is "hold the power button for 5 seconds".  Whatever it is, it
> happens *very* early (before we get as far as the 'Linux version 2.6.mumble'
> banner), and happens *hard*.
>
> I've bisected it down this far:
>
> git-ipwireless_cs.patch GOOD
> git-x86.patch
> git-x86-fixup.patch
> git-x86-thread_order-borkage.patch
> git-x86-thread_order-borkage-fix.patch
> git-x86-identify_cpu-fix.patch
> git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
> git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
> git-x86-inlining-borkage.patch
> x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
> x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD
>
> Anybody got any good debugging ideas before I go through and do the final
> 3 or 4 bisects?  I suspect I'll need them once I find the offending patch
> to tell *why* said patch dies on my box - I've seen enough traffic regarding
> -rc3-mm1 dying *later* to know it's probably a subtle issue and not one
> that will be obvious once I finger a specific patch.  For example, it's
> probably not the IO-APIC panic that people are seeing, because their kernels
> live long enough to panic. ;)

Hi,
Does boot_delay help?

Regards
dave


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-27 Thread Andrew Morton
On Tue, 27 Nov 2007 02:54:56 -0500 [EMAIL PROTECTED] wrote:

> On Mon, 26 Nov 2007 23:27:03 PST, Andrew Morton said:
> 
> > > git-x86.patch
> > > git-x86-fixup.patch
> > > git-x86-thread_order-borkage.patch
> > > git-x86-thread_order-borkage-fix.patch
> > > git-x86-identify_cpu-fix.patch
> > > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
> > > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
> > > git-x86-inlining-borkage.patch
> > > x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
> > > x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD
> 
> > You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already
> > fixed it.
> 
> I suspect that trying -rc3-mm1 but refreshing just the 10 patches above
> from -mmotm would be far less likely to pull in other heartburn?

All the above are no longer in -mm.  They got merged, dropped,
otherwise-fixed, etc.

> > Otherwise, please proceed to work out which diff I need to drop and hope
> > like hell that it isn't git-x86..
> 
> That's a 41,240 line diff, the rest *total* to about 400 lines.  I don't have
> warm-n-fuzzies about my odds here. ;)

No.

> I'm a git-idiot, but *do* know how to git-bisect through Linus tree - what
> would I need to do to git-bisect through git-x86.patch? (I do *not* know how
> to deal with more than 1 source git tree, so if the magic is just "get a
> linus tree, merge git-x86, then bisect as usual", I'm stuck on "merge
> git-x86")..

umm, I'm minimally git-afflicted hence am the wrong person to ask. 
Something like:


- checkout Linus's tree

- echo 'git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git#mm' > .git/branches/git-x86

- git-fetch git-x86

- git-checkout git-x86

- start bisecting.
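
For reference, bisecting an ordered patch series is a binary search for the
first patch at which the tree turns bad. The sketch below is an
illustration only: demo_first_bad(), demo_is_bad_fn and demo_bad_from() are
made-up names, and each callback invocation stands for one build-and-boot
test. It assumes the tree is good before the first candidate and bad once
the last candidate is applied.

```c
#include <stdbool.h>

/* Test callback: true if the tree with patches 0..idx applied is bad. */
typedef bool (*demo_is_bad_fn)(int idx, void *ctx);

/*
 * Return the index of the first bad patch among n candidates. *steps
 * receives the number of tests (callback invocations) that were needed.
 */
static int demo_first_bad(int n, demo_is_bad_fn is_bad, void *ctx, int *steps)
{
	int lo = 0, hi = n - 1;	/* hi is known bad by assumption */

	*steps = 0;
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		(*steps)++;
		if (is_bad(mid, ctx))
			hi = mid;	/* culprit is at mid or earlier */
		else
			lo = mid + 1;	/* culprit is after mid */
	}
	return lo;
}

/* Example callback for experiments: ctx points at the culprit's index. */
static bool demo_bad_from(int idx, void *ctx)
{
	return idx >= *(int *)ctx;
}
```

For the ten candidate patches listed above, this search needs three or four
tests, depending on where the culprit sits in the series.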


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-27 Thread Andrew Morton
On Tue, 27 Nov 2007 02:54:56 -0500 [EMAIL PROTECTED] wrote:

 On Mon, 26 Nov 2007 23:27:03 PST, Andrew Morton said:
 
   git-x86.patch
   git-x86-fixup.patch
   git-x86-thread_order-borkage.patch
   git-x86-thread_order-borkage-fix.patch
   git-x86-identify_cpu-fix.patch
   git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
   git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
   git-x86-inlining-borkage.patch
   x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
   x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD
 
  You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already
  fixed it.
 
 I suspect that trying -rc3-mm1 but refreshing just the 10 patches above
 from -mmotm would be far less likely to pull in other heartburn?

All the above are no longer in -mm.  They got merged, dropped,
otherwise-fixed, etc.

  Otherwise, please proceed to work out which diff I need to drop and hope 
  like
  hell that it isn't git-x86..
 
 That's a 41,240 line diff, the rest *total* to about 400 lines.  I don't have
 warm-n-fuzzies about my odds here. ;)

No.

 I'm a git-idiot, but *do* know how to git-bisect through Linus tree - what
 would I need to do to git-bisect through git-x86.patch? (I do *not* know how
 to deal with more than 1 source git tree, so if the magic is just 'get a
 linus tree, merge git-x86, then bisect as usual, I'm stuck on merge 
 git-x86)..

umm, I'm minimally git-afflicted hence am the wrong person to ask. 
Something like:


- checkout Linus's tree

- echo 'git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git#mm' 
 .git/branches/git-x86

- git-fetch git-x86

- git-checkout git-x86

- start bisecting.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-27 Thread Dave Young
On Nov 27, 2007 3:16 PM,  [EMAIL PROTECTED] wrote:
 On Tue, 20 Nov 2007 20:45:25 PST, Andrew Morton said:
 
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/

 Finally got both time and motivation to at least start a bisect..

 2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200)

 24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub
 prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam*
 dead. No serial console output, no pair of penguins on the monitor, no
 netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does
 *anything* is hold the power button for 5 seconds.  Whatever it is, it
 happens *very* early (before we get as far as the 'Linux version 2.6.mumble'
 banner), and happens *hard*.

 I've bisected it down this far:

 git-ipwireless_cs.patch GOOD
 git-x86.patch
 git-x86-fixup.patch
 git-x86-thread_order-borkage.patch
 git-x86-thread_order-borkage-fix.patch
 git-x86-identify_cpu-fix.patch
 git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
 git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
 git-x86-inlining-borkage.patch
 x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
 x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD

 Anybody got any good debugging ideas before I go through and do the final
 3 or 4 bisects?  I suspect I'll need them once I find the offending patch
 to tell *why* said patch dies on my box - I've seen enough traffic regarding
 -rc3-mm1 dying *later* to know it's probably a subtle issue and not one
 that will be obvious once I finger a specific patch.  For example, it's
 probably not the IO-APIC panic that people are seeing, because their kernels
 live long enough to panic. ;)

Hi,
does boot_delay helps?

Regards
dave
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-27 Thread Valdis . Kletnieks
On Tue, 27 Nov 2007 16:25:22 +0800, Dave Young said:

 does boot_delay helps?

It might, if the kernel lived long enough to output a first printk for
us to delay after.  :)

Shooting this one would be *easy* if the problem was a boot-time oops that
would otherwise scroll off the screen without a boot_delay...




Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-27 Thread Ingo Molnar

* Andrew Morton [EMAIL PROTECTED] wrote:

 Otherwise, please proceed to work out which diff I need to drop and 
 hope like hell that it isn't git-x86..

hm? x86.git is fully bisectable - so a more accurate statement would be 
and hope that it's x86.git, so that it can be properly bisected :-) 
For x86.git bisection, pull the 'mm' branch from:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

Ingo


Re: 2.6.24-rc3-mm1 -- arch/x86/xen/enlighten.c:591: error: 'TLB_FLUSH_ALL' undeclared (first use in this function)

2007-11-27 Thread Jeremy Fitzhardinge
Miles Lane wrote:
 arch/x86/xen/enlighten.c: In function 'xen_flush_tlb_others':
 arch/x86/xen/enlighten.c:591: error: 'TLB_FLUSH_ALL' undeclared (first
 use in this function)
 arch/x86/xen/enlighten.c:591: error: (Each undeclared identifier is
 reported only once
 arch/x86/xen/enlighten.c:591: error: for each function it appears in.)
 make[1]: *** [arch/x86/xen/enlighten.o] Error 1
   

Hm, I can't reproduce this in current git with your .config.  Is there
something in -mm which touches the tlb headers?

I do have a stack of tglx's x86 unification patches applied as well. 
Perhaps they help.

J


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-27 Thread Rik van Riel
On Mon, 26 Nov 2007 15:28:32 -0800
Andrew Morton [EMAIL PROTECTED] wrote:

 On Tue, 27 Nov 2007 00:14:17 +0100
 Jiri Slaby [EMAIL PROTECTED] wrote:
 
  On 11/26/2007 11:17 PM, Andrew Morton wrote:
   Maybe if you can emit a broken-out with the fresh pull to test?
   
   http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
   compile. 
  
  Yes it did :). And it worked. Both in qemu and on my desktop...
 
 boggle.  Let's slap 2.6.25 on it and take the rest of the year off.

No worries, the mmotm compiling issue seems to have been fixed:

  CC [M]  drivers/scsi/libsas/sas_ata.o
drivers/scsi/libsas/sas_ata.c:39: error: field ‘rphy’ has incomplete type
drivers/scsi/libsas/sas_ata.c: In function ‘sas_discover_sata’:
drivers/scsi/libsas/sas_ata.c:773: error: implicit declaration of function 
‘ata_sas_rphy_alloc’
drivers/scsi/libsas/sas_ata.c:775: error: dereferencing pointer to incomplete 
type
drivers/scsi/libsas/sas_ata.c:775: warning: assignment makes pointer from 
integer without a cast
drivers/scsi/libsas/sas_ata.c:781: error: dereferencing pointer to incomplete 
type
drivers/scsi/libsas/sas_ata.c:782: error: dereferencing pointer to incomplete 
type
drivers/scsi/libsas/sas_ata.c:784: warning: type defaults to ‘int’ in 
declaration of ‘__mptr’
drivers/scsi/libsas/sas_ata.c:784: warning: initialization from incompatible 
pointer type
drivers/scsi/libsas/sas_ata.c:791: error: implicit declaration of function 
‘ata_sas_rphy_add’
drivers/scsi/libsas/sas_ata.c:807: error: implicit declaration of function 
‘ata_sas_rphy_delete’
drivers/scsi/libsas/sas_ata.c:809: error: implicit declaration of function 
‘ata_sas_rphy_free’
make[3]: *** [drivers/scsi/libsas/sas_ata.o] Error 1
make[2]: *** [drivers/scsi/libsas] Error 2
make[1]: *** [drivers/scsi] Error 2
make: *** [drivers] Error 2

So much for continuing the bisect with that tree, to find the
cause of the second bug :)

Guess I'll extract an x86 tree changeset first, to place into
the 2.6.24-rc3-mm1 broken out tree and work from there...

-- 
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it. - Brian W. Kernighan


Re: 2.6.24-rc3-mm1 make headers_check fails

2007-11-27 Thread Andrew Morton
On Wed, 21 Nov 2007 12:17:14 +0200 Avi Kivity [EMAIL PROTECTED] wrote:

 Avi Kivity wrote:

  The make headers_check fails,
 
   CHECK   include/linux/usb/gadgetfs.h
   CHECK   include/linux/usb/ch9.h
   CHECK   include/linux/usb/cdc.h
   CHECK   include/linux/usb/audio.h
   CHECK   include/linux/kvm.h
  /root/kernels/linux-2.6.24-rc3/usr/include/linux/kvm.h requires 
  asm/kvm.h, which does not exist in exported headers
 
  hm, works for me, on i386 and x86_64.  What's different over there?
 
  Hi Andrew,
 
  It fails on the powerpc box, with allyesconfig option.
 
   

  How do we fix this?  Export linux/kvm.h only on x86?  Seems ugly.
  
 
  Is kvm x86 specific? Then move the .h file to asm-x86.
  Otherwise no good idea...
 

 
  kvm.h is x86 specific today, but will be s390, ppc, ia64, and x86 
  specific tomorrow.
 
  What about having an asm-generic/kvm.h with a nice #error?  Would 
  that suit?
 
 
 headers_check continues to complain.  Is the only recourse to add 
 asm/kvm.h for all archs?
 

That would work.

Meanwhile my recourse is to drop the kvm tree ;)


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-26 Thread Valdis . Kletnieks
On Mon, 26 Nov 2007 23:27:03 PST, Andrew Morton said:

> > git-x86.patch
> > git-x86-fixup.patch
> > git-x86-thread_order-borkage.patch
> > git-x86-thread_order-borkage-fix.patch
> > git-x86-identify_cpu-fix.patch
> > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
> > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
> > git-x86-inlining-borkage.patch
> > x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
> > x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD

> You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already
> fixed it.

I suspect that trying -rc3-mm1 but refreshing just the 10 patches above
from -mmotm would be far less likely to pull in other heartburn?

> Otherwise, please proceed to work out which diff I need to drop and hope like
> hell that it isn't git-x86..

That's a 41,240 line diff, the rest *total* to about 400 lines.  I don't have
warm-n-fuzzies about my odds here. ;)

I'm a git-idiot, but *do* know how to git-bisect through Linus's tree - what
would I need to do to git-bisect through git-x86.patch? (I do *not* know how
to deal with more than 1 source git tree, so if the magic is just "get a
Linus tree, merge git-x86, then bisect as usual", I'm stuck on "merge 
git-x86")..





Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-26 Thread Andrew Morton
On Tue, 27 Nov 2007 02:16:26 -0500 [EMAIL PROTECTED] wrote:

> On Tue, 20 Nov 2007 20:45:25 PST, Andrew Morton said:
> > 
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> 
> Finally got both time and motivation to at least start a bisect..
> 
> 2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200)
> 
> 24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub
> prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam*
> dead. No serial console output, no pair of penguins on the monitor, no
> netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does
> *anything* is "hold the power button for 5 seconds".  Whatever it is, it
> happens *very* early (before we get as far as the 'Linux version 2.6.mumble'
> banner), and happens *hard*.
> 
> I've bisected it down this far:
> 
> git-ipwireless_cs.patch GOOD
> git-x86.patch
> git-x86-fixup.patch
> git-x86-thread_order-borkage.patch
> git-x86-thread_order-borkage-fix.patch
> git-x86-identify_cpu-fix.patch
> git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
> git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
> git-x86-inlining-borkage.patch
> x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
> x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD
> 
> Anybody got any good debugging ideas before I go through and do the final
> 3 or 4 bisects?  I suspect I'll need them once I find the offending patch
> to tell *why* said patch dies on my box - I've seen enough traffic regarding
> -rc3-mm1 dying *later* to know it's probably a subtle issue and not one
> that will be obvious once I finger a specific patch.  For example, it's
> probably not the IO-APIC panic that people are seeing, because their kernels
> live long enough to panic. ;)
> 

You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already
fixed it.

Otherwise, please proceed to work out which diff I need to drop and hope like
hell that it isn't git-x86..


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-26 Thread Valdis . Kletnieks
On Tue, 20 Nov 2007 20:45:25 PST, Andrew Morton said:
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/

Finally got both time and motivation to at least start a bisect..

2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200)

24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub
prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam*
dead. No serial console output, no pair of penguins on the monitor, no
netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does
*anything* is "hold the power button for 5 seconds".  Whatever it is, it
happens *very* early (before we get as far as the 'Linux version 2.6.mumble'
banner), and happens *hard*.

I've bisected it down this far:

git-ipwireless_cs.patch GOOD
git-x86.patch
git-x86-fixup.patch
git-x86-thread_order-borkage.patch
git-x86-thread_order-borkage-fix.patch
git-x86-identify_cpu-fix.patch
git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
git-x86-inlining-borkage.patch
x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD

Anybody got any good debugging ideas before I go through and do the final
3 or 4 bisects?  I suspect I'll need them once I find the offending patch
to tell *why* said patch dies on my box - I've seen enough traffic regarding
-rc3-mm1 dying *later* to know it's probably a subtle issue and not one
that will be obvious once I finger a specific patch.  For example, it's
probably not the IO-APIC panic that people are seeing, because their kernels
live long enough to panic. ;)





Re: 2.6.24-rc3-mm1

2007-11-26 Thread Andrew Morton
On Fri, 23 Nov 2007 06:55:41 +0100 Gabriel C <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Fri, 23 Nov 2007 02:39:08 +0100 Gabriel C <[EMAIL PROTECTED]> wrote:
> > 
> >> I have some warnings on each SCSI disc:
> >>
> >>
> >> ...
> >>
> >> [   30.724410] scsi 0:0:0:0: Direct-Access SEAGATE  ST318406LW   
> >> 0109 PQ: 0 ANSI: 3
> >> [   30.724419] scsi0:A:0:0: Tagged Queuing enabled.  Depth 32
> >> [   30.724435]  target0:0:0: Beginning Domain Validation
> >> [   30.724446]  target0:0:0: Domain Validation Initial Inquiry Failed <--
> >> [   30.724572]  target0:0:0: Ending Domain Validation
> >> [   30.729747] scsi 0:0:1:0: Direct-Access FUJITSU  MAH3182MP
> >> 0114 PQ: 0 ANSI: 4
> >> [   30.729754] scsi0:A:1:0: Tagged Queuing enabled.  Depth 32
> >> [   30.729771]  target0:0:1: Beginning Domain Validation
> >> [   30.729780]  target0:0:1: Domain Validation Initial Inquiry Failed <--
> >> [   30.729908]  target0:0:1: Ending Domain Validation
> >>
> > 
> > Don't know what would have caused that.  But yes, something is wrong in
> > scsi land.
> 
> Actually I'm lucky the author didn't fix that FIXME in scsi_transport_spi.c 
> and I still can boot ;)
> 
> > 
> >> no idea whether this is related but buffered disk reads are 2.XX MB/sec 
> >> and the box is somewhat laggy.
> >>
> >> hdparm -t on sda and sdb reports :
> >>
> >> /dev/sda:
> >>  Timing buffered disk reads:8 MB in  3.26 seconds =   2.46 MB/sec
> >>
> >> /dev/sdb:
> >>  Timing buffered disk reads:8 MB in  3.56 seconds =   2.25 MB/sec
> >>
> >> My IDE discs are fine.
> >>
> >> Please let me know if you need my config or any other information.
> >>
> > 
> > And you're the second to report very slow scsi throughput in 2.6.24-rc3-mm1.
> > 
> 
> I found the commit which causes these problems; it is in the git-scsi-misc patch, 
> and reverting it fixes both problems for me.
> 
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff_plain;h=8655a546c83fc43f0a73416bbd126d02de7ad6c0;hp=5bc717b6bdaaf52edf365eb7d9d8c89fec79df5d
> 

OK, thanks.  I'll assume that James and Hannes have this in hand (or will
have, by mid-week) and I won't do anything here.



Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Andrew Morton
On Tue, 27 Nov 2007 00:14:17 +0100
Jiri Slaby <[EMAIL PROTECTED]> wrote:

> On 11/26/2007 11:17 PM, Andrew Morton wrote:
> >> Maybe if you can emit a broken-out with the fresh pull to test?
> > 
> > http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
> > compile. 
> 
> Yes it did :). And it worked. Both in qemu and on my desktop...

boggle.  Let's slap 2.6.25 on it and take the rest of the year off.

> qemu output at:
> http://www.fi.muni.cz/~xslaby/sklad/qemu-output.txt

Thanks for testing.



Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Jiri Slaby
On 11/26/2007 11:17 PM, Andrew Morton wrote:
>> Maybe if you can emit a broken-out with the fresh pull to test?
> 
> http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
> compile. 

Yes it did :). And it worked. Both in qemu and on my desktop...

qemu output at:
http://www.fi.muni.cz/~xslaby/sklad/qemu-output.txt

thanks,
-- 
Jiri Slaby ([EMAIL PROTECTED])
Faculty of Informatics, Masaryk University


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Jiri Slaby
On 11/26/2007 11:17 PM, Andrew Morton wrote:
> http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
> compile.  I'd suggest bisecting 2.6.24-rc3-mm1 would be easier.  

Yes, I've bisected this and it pointed to git-x86.patch + 2 pushed fixes from
the series. Then I tried x86 git, but its HEAD was OK.


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Andrew Morton
On Mon, 26 Nov 2007 23:08:33 +0100
Jiri Slaby <[EMAIL PROTECTED]> wrote:

> On 11/26/2007 09:45 PM, Ingo Molnar wrote:
> > * Andrew Morton <[EMAIL PROTECTED]> wrote:
> > 
> >> On Mon, 26 Nov 2007 14:39:43 -0500
> >> Rik van Riel <[EMAIL PROTECTED]> wrote:
> >>
> >>> On Tue, 20 Nov 2007 22:18:39 -0800
> >>> Andrew Morton <[EMAIL PROTECTED]> wrote:
> >>>
> >>>>> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> >>>>> Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 
> >>>>> 'noapic' kernel parameter
> >>>> ACPI or x86 breakage, I guess.
> >>>>
> >>>> Did 'noapic' work?
> >>> I got the same bug as above, 'noapic' gets past that point 
> >> We still don't know what caused this, afaik.
> > 
> > yes. Is it a regression? If yes, could someone try to bisect it so that 
> > we can fix it? If it's caused by x86.git then the 'mm' branch of the x86 
> > git tree can be used for bisection:
> > 
> >git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
> 
> I did, but it's hard, if you don't know the BAD point. HEAD boots fine and 
> 'x86: randomize brk' too (the top of git-x86.patch).

So the bug wasn't in git-x86 in 2.6.24-rc3-mm1.

But it might be in there now, as some patches got moved over.

Or it could be git-acpi.  Or lots of other things.

> Andrew, how do you pull it, git
> #mm doesn't fit to the ids from the patch.

The -mm git tree reimports the plain git-foo.patch files back into a new
git tree, so the commit IDs won't line up.

The way to find the culprit patch in 2.6.24-rc3-mm1 is
http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt.  It
will be quite quick.
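
As an illustration of why this is quick: bisecting a broken-out series is a
binary search over how many patches of the series get applied. The sketch
below is a minimal Python model under stated assumptions, not the procedure
from the linked document; `first_bad_patch`, `boots_with` and the placeholder
patch names are hypothetical. For ten suspect patches it needs 3 or 4 test
boots, which agrees with the "3 or 4 bisects" estimate made elsewhere in this
thread.

```python
from typing import Callable, List


def first_bad_patch(series: List[str], boots_with: Callable[[int], bool]) -> str:
    """Return the first patch in `series` whose application breaks the boot.

    boots_with(k) reports whether a tree with the first k patches applied
    boots. Assumes boots_with(0) is True, boots_with(len(series)) is False,
    and that the tree stays broken once the bad patch is applied.
    """
    good, bad = 0, len(series)  # patch counts known to boot / known to fail
    while bad - good > 1:
        mid = (good + bad) // 2
        if boots_with(mid):
            good = mid
        else:
            bad = mid
    return series[bad - 1]


# Placeholder names; the culprit index is arbitrary, chosen only for the demo.
series = [f"suspect-{i:02d}.patch" for i in range(1, 11)]
culprit_index = 6          # 0-based, i.e. the seventh patch breaks the boot
tried: List[int] = []      # records every simulated test boot


def boots_with(k: int) -> bool:
    tried.append(k)
    return k <= culprit_index  # the first k patches exclude the culprit


found = first_bad_patch(series, boots_with)
print(found, "found after", len(tried), "test boots")
```

In real use `boots_with` would be a human applying the first k patches,
building, and rebooting, so the number of calls is what matters.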

> Maybe if you can emit a broken-out with the fresh pull to test?

http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
compile.  I'd suggest bisecting 2.6.24-rc3-mm1 would be easier.  


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Jiri Slaby
On 11/26/2007 09:45 PM, Ingo Molnar wrote:
> * Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
>> On Mon, 26 Nov 2007 14:39:43 -0500
>> Rik van Riel <[EMAIL PROTECTED]> wrote:
>>
>>> On Tue, 20 Nov 2007 22:18:39 -0800
>>> Andrew Morton <[EMAIL PROTECTED]> wrote:
>>>
>>>>> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
>>>>> Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 
>>>>> 'noapic' kernel parameter
>>>> ACPI or x86 breakage, I guess.
>>>>
>>>> Did 'noapic' work?
>>> I got the same bug as above, 'noapic' gets past that point 
>> We still don't know what caused this, afaik.
> 
> yes. Is it a regression? If yes, could someone try to bisect it so that 
> we can fix it? If it's caused by x86.git then the 'mm' branch of the x86 
> git tree can be used for bisection:
> 
>git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

I did, but it's hard, if you don't know the BAD point. HEAD boots fine and 'x86:
randomize brk' too (the top of git-x86.patch). Andrew, how do you pull it, git
#mm doesn't fit to the ids from the patch.

Maybe if you can emit a broken-out with the fresh pull to test?

regards,
-- 
Jiri Slaby ([EMAIL PROTECTED])
Faculty of Informatics, Masaryk University


Re: 2.6.24-rc3-mm1

2007-11-26 Thread Christoph Lameter
On Mon, 26 Nov 2007, Randy Dunlap wrote:

> ARCH_SELECT_MEMORY_MODEL depends on X86_32.  Is that too restrictive?

No. X86_64 only has one memory model.


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Christoph Lameter
On Mon, 26 Nov 2007, Andrew Morton wrote:

> hm.  This smells like a startup ordering problem, but everything which
> refresh_zone_stat_thresholds() uses should be set up by the time we run
> initcalls.  Maybe the zone lists are bad?

refresh_zone_stat_thresholds goes through each zone and updates
the stat threshold for every per cpu structure in each zone.

So this could be a processor marked online where the pcp structures have 
not been allocated or a zone NULL pointer.
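
A toy Python model of the loop described above may make the first failure
case concrete. It is only a sketch: the class names, the dictionary layout
and the fixed `threshold` argument are assumptions for illustration, not the
kernel's data structures. An online CPU whose per-cpu structure was never
allocated shows up here as `None`, and touching it raises `AttributeError`,
the Python analogue of a NULL pointer dereference.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PerCpuPageset:
    stat_threshold: int = 0


@dataclass
class Zone:
    name: str
    # cpu id -> per-cpu structure; a missing entry models "never allocated"
    pageset: Dict[int, Optional[PerCpuPageset]] = field(default_factory=dict)


def refresh_thresholds(zones: List[Zone], online_cpus: List[int], threshold: int) -> None:
    """Store `threshold` in every online CPU's per-cpu structure of every zone."""
    for zone in zones:
        for cpu in online_cpus:
            pcp = zone.pageset.get(cpu)      # None if nothing was allocated
            pcp.stat_threshold = threshold   # fails here when pcp is None


# Both CPUs have a structure: the update succeeds.
ok_zone = Zone("Normal", {0: PerCpuPageset(), 1: PerCpuPageset()})
refresh_thresholds([ok_zone], online_cpus=[0, 1], threshold=8)

# CPU 1 is marked online but has no structure: the update blows up.
bad_zone = Zone("DMA", {0: PerCpuPageset()})
try:
    refresh_thresholds([bad_zone], online_cpus=[0, 1], threshold=8)
    failed = False
except AttributeError:
    failed = True
```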



Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Rik van Riel
On Mon, 26 Nov 2007 12:33:19 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> > Unable to handle kernel NULL pointer dereference at 0021 RIP:
> >  [] refresh_zone_stat_thresholds+0x6d/0x90
> > PGD 0
> > Oops: 0002 [1] SMP
> > last sysfs file:
> > CPU 0
> > Modules linked in:
> > Pid: 1, comm: swapper Not tainted 2.6.24-rc3-mm1 #2
> > RIP: 0010:[]  [] 
> > refresh_zone_stat_thresholds+0x6d/0x90
> > RSP: :81007fb59ec0  EFLAGS: 00010293
> > RAX:  RBX: 0004 RCX: 0001
> > RDX: 0001 RSI: 8146fb38 RDI: 0001
> > RBP: 8100c000 R08:  R09: 
> > R10: 81007fb59e60 R11: 0028 R12: 814d4558
> > R13:  R14: 814b62c0 R15: 
> > FS:  () GS:813d9000() knlGS:
> > CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> > CR2: 0021 CR3: 00201000 CR4: 06a0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: 0ff0 DR7: 0400
> > Process swapper (pid: 1, threadinfo 81007FB58000, task 81007FB56000)
> > Stack:     814a3839
> >   8148e626 81007fb56000 8126d36a
> >    8105786b 
> > Call Trace:
> >  [] setup_vmstat+0x6/0x40
> >  [] kernel_init+0x169/0x2d8
> >  [] trace_hardirqs_on_thunk+0x35/0x3a
> >  [] trace_hardirqs_on+0x115/0x138
> >  [] child_rip+0xa/0x12
> >  [] restore_args+0x0/0x30
> >  [] kernel_init+0x0/0x2d8
> >  [] child_rip+0x0/0x12
> > 
> > INFO: lockdep is turned off.
> 
> hm.  This smells like a startup ordering problem, but everything which
> refresh_zone_stat_thresholds() uses should be set up by the time we run
> initcalls.  Maybe the zone lists are bad?

Or the CPU array. Look at the oops Kamalesh got a few mails upthread...

I guess I'll have to start a bisect - can't port the VM code to a kernel
that doesn't boot...

-- 
All Rights Reversed


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Ingo Molnar

* Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Mon, 26 Nov 2007 14:39:43 -0500
> Rik van Riel <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, 20 Nov 2007 22:18:39 -0800
> > Andrew Morton <[EMAIL PROTECTED]> wrote:
> > 
> > > > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > > > Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 
> > > > 'noapic' kernel parameter
> > > 
> > > ACPI or x86 breakage, I guess.
> > > 
> > > Did 'noapic' work?
> > 
> > I got the same bug as above, 'noapic' gets past that point 
> 
> We still don't know what caused this, afaik.

yes. Is it a regression? If yes, could someone try to bisect it so that 
we can fix it? If it's caused by x86.git then the 'mm' branch of the x86 
git tree can be used for bisection:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

it's supposed to build and boot fine at every bisection point. The 
bisection run can be cut significantly by narrowing the bisection to the 
arch/x86 changes only:

  git-bisect start arch/x86 include/asm-x86/

(and if it finds a nonsensical commit, i.e. the breakage is not caused 
by the x86 commits, save the "git-bisect log" output into a file, 
restart the git bisection and use "git-bisect replay" to insert all the 
test points into a fuller bisection run - this saves quite some time.)
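
Why the replay saves time can be seen in a small Python model. This is a
sketch under simplifying assumptions (a linear history, hypothetical commit
names `c0`..`c15`, a made-up culprit `c9` that does not touch arch/x86), not
git's actual algorithm. The path-limited run blames the nearest x86 commit
after the real culprit; feeding its verdicts into the full run narrows the
range before any new kernel has to be built.

```python
from typing import Callable, Dict, List, Optional, Tuple


def bisect(commits: List[str],
           is_bad: Callable[[str], bool],
           log: Optional[Dict[str, bool]] = None) -> Tuple[str, Dict[str, bool], int]:
    """Binary-search `commits` (oldest first) for the first bad one.

    `log` maps already-tested commits to a verdict (True = bad) and plays
    the role of a replayed bisect log. Assumes the newest commit is bad and
    that every commit after the first bad one is bad too.
    Returns (first bad commit, updated log, number of new tests).
    """
    log = dict(log or {})
    lo, hi = 0, len(commits) - 1
    # Replay: known verdicts shrink the range before any new test is run.
    for i, commit in enumerate(commits):
        if commit in log:
            if log[commit]:
                hi = min(hi, i)
            else:
                lo = max(lo, i + 1)
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        bad = is_bad(commits[mid])
        log[commits[mid]] = bad
        tests += 1
        if bad:
            hi = mid
        else:
            lo = mid + 1
    return commits[lo], log, tests


def is_bad(commit: str) -> bool:
    return int(commit[1:]) >= 9   # pretend c9 introduced the breakage


history = [f"c{i}" for i in range(16)]
touches_x86 = {"c2", "c5", "c8", "c11", "c14"}   # hypothetical arch/x86 commits
x86_only = [c for c in history if c in touches_x86]

suspect, log, n1 = bisect(x86_only, is_bad)      # path-limited run
culprit, _, n2 = bisect(history, is_bad, log)    # full run seeded with the log
_, _, n_cold = bisect(history, is_bad)           # full run from scratch
print(suspect, n1, culprit, n2, n_cold)
```

In this model the seeded full run needs fewer new tests than starting over,
because the good and bad marks from the first run are reused rather than
re-tested.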

Ingo


Re: 2.6.24-rc3-mm1

2007-11-26 Thread Randy Dunlap
On Mon, 26 Nov 2007 11:34:15 -0800 (PST) Christoph Lameter wrote:

> On Mon, 26 Nov 2007, Randy Dunlap wrote:
> 
> > On Tue, 20 Nov 2007 20:45:25 -0800 Andrew Morton wrote:
> > 
> > > 
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> > 
> > allnoconfig on x86_64 gives:
> > 
> > arch/x86/mm/init_64.c:84: error: implicit declaration of function 
> > 'pfn_valid'
> > mm/page_alloc.c:2533: error: implicit declaration of function 'pfn_valid'
> > mm/vmstat.c:518: error: implicit declaration of function 'pfn_valid'
> > mm/memory.c:400: error: implicit declaration of function 'pfn_valid'
> > drivers/char/mem.c:312: error: implicit declaration of function 'pfn_valid'
> 
> Hmmm... CONFIG_SPARSEMEM is not set if you do allnoconfig
> 
> config SPARSEMEM
> def_bool y
> depends on SPARSEMEM_MANUAL
> 
> So I guess we need to set SPARSEMEM_MANUAL
> 
> But arch/x86/Kconfig has
> 
> config SPARSEMEM_MANUAL
> bool "Sparse Memory"
> depends on ARCH_SPARSEMEM_ENABLE
> help
>   This will be the only option for some systems, including
>   memory hotplug systems.  This is normal.
> 
> It needs to be not deselectable for x86_64. 
> 
> Inserting
> 
>   def_bool y if X86_64
> 
> did not help
> 
> Somehow make menuconfig did not give me an ability to even enable this 
> again.

Thanks for the hint.

ARCH_SELECT_MEMORY_MODEL depends on X86_32.  Is that too restrictive?

config ARCH_SELECT_MEMORY_MODEL
def_bool y
depends on X86_32 && ARCH_SPARSEMEM_ENABLE

---
~Randy


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Andrew Morton
On Mon, 26 Nov 2007 14:39:43 -0500
Rik van Riel <[EMAIL PROTECTED]> wrote:

> On Tue, 20 Nov 2007 22:18:39 -0800
> Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > > Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 
> > > 'noapic' kernel parameter
> > 
> > ACPI or x86 breakage, I guess.
> > 
> > Did 'noapic' work?
> 
> I got the same bug as above, 'noapic' gets past that point 

We still don't know what caused this, afaik.

> and right to the
> next oops.  I'm posting it here because this one is different from the others
> in the thread, yet looks vaguely related:
> 
> Unable to handle kernel NULL pointer dereference at 0021 RIP:
>  [] refresh_zone_stat_thresholds+0x6d/0x90
> PGD 0
> Oops: 0002 [1] SMP
> last sysfs file:
> CPU 0
> Modules linked in:
> Pid: 1, comm: swapper Not tainted 2.6.24-rc3-mm1 #2
> RIP: 0010:[]  [] 
> refresh_zone_stat_thresholds+0x6d/0x90
> RSP: :81007fb59ec0  EFLAGS: 00010293
> RAX:  RBX: 0004 RCX: 0001
> RDX: 0001 RSI: 8146fb38 RDI: 0001
> RBP: 8100c000 R08:  R09: 
> R10: 81007fb59e60 R11: 0028 R12: 814d4558
> R13:  R14: 814b62c0 R15: 
> FS:  () GS:813d9000() knlGS:
> CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> CR2: 0021 CR3: 00201000 CR4: 06a0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process swapper (pid: 1, threadinfo 81007FB58000, task 81007FB56000)
> Stack:     814a3839
>   8148e626 81007fb56000 8126d36a
>    8105786b 
> Call Trace:
>  [] setup_vmstat+0x6/0x40
>  [] kernel_init+0x169/0x2d8
>  [] trace_hardirqs_on_thunk+0x35/0x3a
>  [] trace_hardirqs_on+0x115/0x138
>  [] child_rip+0xa/0x12
>  [] restore_args+0x0/0x30
>  [] kernel_init+0x0/0x2d8
>  [] child_rip+0x0/0x12
> 
> INFO: lockdep is turned off.

hm.  This smells like a startup ordering problem, but everything which
refresh_zone_stat_thresholds() uses should be set up by the time we run
initcalls.  Maybe the zone lists are bad?



Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Rik van Riel
On Tue, 20 Nov 2007 22:18:39 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the 
> > 'noapic' kernel parameter
> 
> ACPI or x86 breakage, I guess.
> 
> Did 'noapic' work?

I got the same bug as above, 'noapic' gets past that point and right to the
next oops.  I'm posting it here because this one is different from the others
in the thread, yet looks vaguely related:

Unable to handle kernel NULL pointer dereference at 0021 RIP:
 [] refresh_zone_stat_thresholds+0x6d/0x90
PGD 0
Oops: 0002 [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.24-rc3-mm1 #2
RIP: 0010:[]  [] 
refresh_zone_stat_thresholds+0x6d/0x90
RSP: :81007fb59ec0  EFLAGS: 00010293
RAX:  RBX: 0004 RCX: 0001
RDX: 0001 RSI: 8146fb38 RDI: 0001
RBP: 8100c000 R08:  R09: 
R10: 81007fb59e60 R11: 0028 R12: 814d4558
R13:  R14: 814b62c0 R15: 
FS:  () GS:813d9000() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 0021 CR3: 00201000 CR4: 06a0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process swapper (pid: 1, threadinfo 81007FB58000, task 81007FB56000)
Stack:     814a3839
  8148e626 81007fb56000 8126d36a
   8105786b 
Call Trace:
 [] setup_vmstat+0x6/0x40
 [] kernel_init+0x169/0x2d8
 [] trace_hardirqs_on_thunk+0x35/0x3a
 [] trace_hardirqs_on+0x115/0x138
 [] child_rip+0xa/0x12
 [] restore_args+0x0/0x30
 [] kernel_init+0x0/0x2d8
 [] child_rip+0x0/0x12

INFO: lockdep is turned off.

-- 
All Rights Reversed


Re: 2.6.24-rc3-mm1

2007-11-26 Thread Christoph Lameter
On Mon, 26 Nov 2007, Randy Dunlap wrote:

> On Tue, 20 Nov 2007 20:45:25 -0800 Andrew Morton wrote:
> 
> > 
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> 
> allnoconfig on x86_64 gives:
> 
> arch/x86/mm/init_64.c:84: error: implicit declaration of function 'pfn_valid'
> mm/page_alloc.c:2533: error: implicit declaration of function 'pfn_valid'
> mm/vmstat.c:518: error: implicit declaration of function 'pfn_valid'
> mm/memory.c:400: error: implicit declaration of function 'pfn_valid'
> drivers/char/mem.c:312: error: implicit declaration of function 'pfn_valid'

Hmmm... CONFIG_SPARSEMEM is not set if you do allnoconfig

config SPARSEMEM
def_bool y
depends on SPARSEMEM_MANUAL

So I guess we need to set SPARSEMEM_MANUAL

But arch/x86/Kconfig has

config SPARSEMEM_MANUAL
bool "Sparse Memory"
depends on ARCH_SPARSEMEM_ENABLE
help
  This will be the only option for some systems, including
  memory hotplug systems.  This is normal.

It needs to be not deselectable for x86_64. 

Inserting

def_bool y if X86_64

did not help

Somehow make menuconfig did not give me an ability to even enable this 
again.



Re: 2.6.24-rc3-mm1

2007-11-26 Thread Jiri Slaby
On 11/26/2007 07:48 PM, Rik van Riel wrote:
> > > > ERROR: "empty_zero_page" [drivers/kvm/kvm.ko] undefined!
[...]
> FYI, x86_64 has the exact same issue.

yes:
hot-fixes/git-x86-dont-unexport-empty_zero_page.patch

regards,
-- 
Jiri Slaby ([EMAIL PROTECTED])
Faculty of Informatics, Masaryk University


Re: 2.6.24-rc3-mm1

2007-11-26 Thread Randy Dunlap
On Tue, 20 Nov 2007 20:45:25 -0800 Andrew Morton wrote:

> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/

allnoconfig on x86_64 gives:

arch/x86/mm/init_64.c:84: error: implicit declaration of function 'pfn_valid'
mm/page_alloc.c:2533: error: implicit declaration of function 'pfn_valid'
mm/vmstat.c:518: error: implicit declaration of function 'pfn_valid'
mm/memory.c:400: error: implicit declaration of function 'pfn_valid'
drivers/char/mem.c:312: error: implicit declaration of function 'pfn_valid'


---
~Randy


Re: 2.6.24-rc3-mm1

2007-11-26 Thread Rik van Riel
On Wed, 21 Nov 2007 14:03:34 +0800
"Dave Young" <[EMAIL PROTECTED]> wrote:
> On Nov 21, 2007 2:00 PM, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > On Wed, 21 Nov 2007 13:51:47 +0800 "Dave Young" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, andrew
> > >
> > > modpost failed for me:
> > >   MODPOST 360 modules
> > > ERROR: "empty_zero_page" [drivers/kvm/kvm.ko] undefined!
> > > make[1]: *** [__modpost] Error 1
> > > make: *** [modules] Error 2
> > >
> >
> > You're a victim of the hasty unexporting fad.  Which architecture?
> > x86_64 I guess?

> ia32 instead.

FYI, x86_64 has the exact same issue.


KVM needs the empty_zero_page export reinstated.

Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>

diff -up linux-2.6.24-rc3-mm1/arch/x86/kernel/x8664_ksyms_64.c.export-empty-zero-page linux-2.6.24-rc3-mm1/arch/x86/kernel/x8664_ksyms_64.c
--- linux-2.6.24-rc3-mm1/arch/x86/kernel/x8664_ksyms_64.c.export-empty-zero-page	2007-11-26 13:47:53.0 -0500
+++ linux-2.6.24-rc3-mm1/arch/x86/kernel/x8664_ksyms_64.c	2007-11-26 13:41:32.0 -0500
@@ -33,6 +33,7 @@ EXPORT_SYMBOL(__copy_from_user_inatomic)
 
 EXPORT_SYMBOL(copy_page);
 EXPORT_SYMBOL(clear_page);
+EXPORT_SYMBOL(empty_zero_page);
 
 /* Export string functions. We normally rely on gcc builtin for most of these,
    but gcc sometimes decides not to inline them. */


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Rik van Riel
On Tue, 20 Nov 2007 22:18:39 -0800
Andrew Morton [EMAIL PROTECTED] wrote:

> > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
> > 'noapic' kernel parameter
>
> ACPI or x86 breakage, I guess.
>
> Did 'noapic' work?

I got the same bug as above, 'noapic' gets past that point and right to the
next oops.  I'm posting it here because this one is different from the others
in the thread, yet looks vaguely related:

Unable to handle kernel NULL pointer dereference at 0021 RIP:
 [8108382a] refresh_zone_stat_thresholds+0x6d/0x90
PGD 0
Oops: 0002 [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.24-rc3-mm1 #2
RIP: 0010:[8108382a]  [8108382a] 
refresh_zone_stat_thresholds+0x6d/0x90
RSP: :81007fb59ec0  EFLAGS: 00010293
RAX:  RBX: 0004 RCX: 0001
RDX: 0001 RSI: 8146fb38 RDI: 0001
RBP: 8100c000 R08:  R09: 
R10: 81007fb59e60 R11: 0028 R12: 814d4558
R13:  R14: 814b62c0 R15: 
FS:  () GS:813d9000() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 0021 CR3: 00201000 CR4: 06a0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process swapper (pid: 1, threadinfo 81007FB58000, task 81007FB56000)
Stack:     814a3839
  8148e626 81007fb56000 8126d36a
   8105786b 
Call Trace:
 [814a3839] setup_vmstat+0x6/0x40
 [8148e626] kernel_init+0x169/0x2d8
 [8126d36a] trace_hardirqs_on_thunk+0x35/0x3a
 [8105786b] trace_hardirqs_on+0x115/0x138
 [8100ce48] child_rip+0xa/0x12
 [8100c55f] restore_args+0x0/0x30
 [8148e4bd] kernel_init+0x0/0x2d8
 [8100ce3e] child_rip+0x0/0x12

INFO: lockdep is turned off.

-- 
All Rights Reversed


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Andrew Morton
On Mon, 26 Nov 2007 14:39:43 -0500
Rik van Riel [EMAIL PROTECTED] wrote:

> On Tue, 20 Nov 2007 22:18:39 -0800
> Andrew Morton [EMAIL PROTECTED] wrote:
>
> > > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > > Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
> > > 'noapic' kernel parameter
> >
> > ACPI or x86 breakage, I guess.
> >
> > Did 'noapic' work?
>
> I got the same bug as above, 'noapic' gets past that point

We still don't know what caused this, afaik.

> and right to the
> next oops.  I'm posting it here because this one is different from the others
> in the thread, yet looks vaguely related:
>
> Unable to handle kernel NULL pointer dereference at 0021 RIP:
>  [8108382a] refresh_zone_stat_thresholds+0x6d/0x90
> PGD 0
> Oops: 0002 [1] SMP
> last sysfs file:
> CPU 0
> Modules linked in:
> Pid: 1, comm: swapper Not tainted 2.6.24-rc3-mm1 #2
> RIP: 0010:[8108382a]  [8108382a] refresh_zone_stat_thresholds+0x6d/0x90
> RSP: :81007fb59ec0  EFLAGS: 00010293
> RAX:  RBX: 0004 RCX: 0001
> RDX: 0001 RSI: 8146fb38 RDI: 0001
> RBP: 8100c000 R08:  R09:
> R10: 81007fb59e60 R11: 0028 R12: 814d4558
> R13:  R14: 814b62c0 R15:
> FS:  () GS:813d9000() knlGS:
> CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> CR2: 0021 CR3: 00201000 CR4: 06a0
> DR0:  DR1:  DR2:
> DR3:  DR6: 0ff0 DR7: 0400
> Process swapper (pid: 1, threadinfo 81007FB58000, task 81007FB56000)
> Stack:     814a3839
>   8148e626 81007fb56000 8126d36a
>    8105786b
> Call Trace:
>  [814a3839] setup_vmstat+0x6/0x40
>  [8148e626] kernel_init+0x169/0x2d8
>  [8126d36a] trace_hardirqs_on_thunk+0x35/0x3a
>  [8105786b] trace_hardirqs_on+0x115/0x138
>  [8100ce48] child_rip+0xa/0x12
>  [8100c55f] restore_args+0x0/0x30
>  [8148e4bd] kernel_init+0x0/0x2d8
>  [8100ce3e] child_rip+0x0/0x12
>
> INFO: lockdep is turned off.

hm.  This smells like a startup ordering problem, but everything which
refresh_zone_stat_thresholds() uses should be set up by the time we run
initcalls.  Maybe the zone lists are bad?



Re: 2.6.24-rc3-mm1

2007-11-26 Thread Randy Dunlap
On Mon, 26 Nov 2007 11:34:15 -0800 (PST) Christoph Lameter wrote:

> On Mon, 26 Nov 2007, Randy Dunlap wrote:
>
> > On Tue, 20 Nov 2007 20:45:25 -0800 Andrew Morton wrote:
> >
> > >
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> >
> > allnoconfig on x86_64 gives:
> >
> > arch/x86/mm/init_64.c:84: error: implicit declaration of function 'pfn_valid'
> > mm/page_alloc.c:2533: error: implicit declaration of function 'pfn_valid'
> > mm/vmstat.c:518: error: implicit declaration of function 'pfn_valid'
> > mm/memory.c:400: error: implicit declaration of function 'pfn_valid'
> > drivers/char/mem.c:312: error: implicit declaration of function 'pfn_valid'
>
> Hmmm... CONFIG_SPARSEMEM is not set if you do allnoconfig
>
> config SPARSEMEM
> def_bool y
> depends on SPARSEMEM_MANUAL
>
> So I guess we need to set SPARSEMEM_MANUAL
>
> But arch/x86/Kconfig has
>
> config SPARSEMEM_MANUAL
> bool "Sparse Memory"
> depends on ARCH_SPARSEMEM_ENABLE
> help
>   This will be the only option for some systems, including
>   memory hotplug systems.  This is normal.
>
> It needs to be not deselectable for x86_64.
>
> Inserting
>
>   def_bool y if X86_64
>
> did not help
>
> Somehow make menuconfig did not give me an ability to even enable this
> again.

Thanks for the hint.

ARCH_SELECT_MEMORY_MODEL depends on X86_32.  Is that too restrictive?

config ARCH_SELECT_MEMORY_MODEL
def_bool y
depends on X86_32 && ARCH_SPARSEMEM_ENABLE

---
~Randy


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Ingo Molnar

* Andrew Morton [EMAIL PROTECTED] wrote:

> On Mon, 26 Nov 2007 14:39:43 -0500
> Rik van Riel [EMAIL PROTECTED] wrote:
>
> > On Tue, 20 Nov 2007 22:18:39 -0800
> > Andrew Morton [EMAIL PROTECTED] wrote:
> >
> > > > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > > > Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
> > > > 'noapic' kernel parameter
> > >
> > > ACPI or x86 breakage, I guess.
> > >
> > > Did 'noapic' work?
> >
> > I got the same bug as above, 'noapic' gets past that point
>
> We still don't know what caused this, afaik.

yes. Is it a regression? If yes, could someone try to bisect it so that 
we can fix it? If it's caused by x86.git then the 'mm' branch of the x86 
git tree can be used for bisection:

   git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

it's supposed to build and boot fine at every bisection point. The 
bisection run can be cut significantly by narrowing the bisection to the 
arch/x86 changes only:

  git-bisect start arch/x86 include/asm-x86/

(and if it finds a nonsensical commit, i.e. the breakage is not caused 
by the x86 commits, save the git-bisect log output into a file, 
restart the git bisection and use git-bisect replay to insert all the 
test points into a fuller bisection run - this saves quite some time.)

Ingo


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Rik van Riel
On Mon, 26 Nov 2007 12:33:19 -0800
Andrew Morton [EMAIL PROTECTED] wrote:

> > Unable to handle kernel NULL pointer dereference at 0021 RIP:
> >  [8108382a] refresh_zone_stat_thresholds+0x6d/0x90
> > PGD 0
> > Oops: 0002 [1] SMP
> > last sysfs file:
> > CPU 0
> > Modules linked in:
> > Pid: 1, comm: swapper Not tainted 2.6.24-rc3-mm1 #2
> > RIP: 0010:[8108382a]  [8108382a] refresh_zone_stat_thresholds+0x6d/0x90
> > RSP: :81007fb59ec0  EFLAGS: 00010293
> > RAX:  RBX: 0004 RCX: 0001
> > RDX: 0001 RSI: 8146fb38 RDI: 0001
> > RBP: 8100c000 R08:  R09:
> > R10: 81007fb59e60 R11: 0028 R12: 814d4558
> > R13:  R14: 814b62c0 R15:
> > FS:  () GS:813d9000() knlGS:
> > CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> > CR2: 0021 CR3: 00201000 CR4: 06a0
> > DR0:  DR1:  DR2:
> > DR3:  DR6: 0ff0 DR7: 0400
> > Process swapper (pid: 1, threadinfo 81007FB58000, task 81007FB56000)
> > Stack:     814a3839
> >   8148e626 81007fb56000 8126d36a
> >    8105786b
> > Call Trace:
> >  [814a3839] setup_vmstat+0x6/0x40
> >  [8148e626] kernel_init+0x169/0x2d8
> >  [8126d36a] trace_hardirqs_on_thunk+0x35/0x3a
> >  [8105786b] trace_hardirqs_on+0x115/0x138
> >  [8100ce48] child_rip+0xa/0x12
> >  [8100c55f] restore_args+0x0/0x30
> >  [8148e4bd] kernel_init+0x0/0x2d8
> >  [8100ce3e] child_rip+0x0/0x12
> >
> > INFO: lockdep is turned off.
>
> hm.  This smells like a startup ordering problem, but everything which
> refresh_zone_stat_thresholds() uses should be set up by the time we run
> initcalls.  Maybe the zone lists are bad?

Or the CPU array. Look at the oops Kamalesh got a few mails upthread...

I guess I'll have to start a bisect - can't port the VM code to a kernel
that doesn't boot...

-- 
All Rights Reversed


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Christoph Lameter
On Mon, 26 Nov 2007, Andrew Morton wrote:

> hm.  This smells like a startup ordering problem, but everything which
> refresh_zone_stat_thresholds() uses should be set up by the time we run
> initcalls.  Maybe the zone lists are bad?

refresh_zone_stat_thresholds goes through each zone and updates
the stat threshold for every per cpu structure in each zone.

So this could be a processor marked online where the pcp structures have 
not been allocated or a zone NULL pointer.
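
The two candidates named above (an online processor whose pcp structures were never allocated, or a NULL zone pointer) can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel's code: `struct zone_model`, `struct pcp_model`, `MODEL_NR_CPUS` and `refresh_thresholds_checked()` are hypothetical names invented for illustration, and the checks shown here are additions for the model only.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for a per-cpu pageset; not the kernel structure. */
struct pcp_model {
    int stat_threshold;
};

/* Hypothetical stand-in for a zone holding one pcp pointer per cpu. */
struct zone_model {
    struct pcp_model *pcp[MODEL_NR_CPUS];
};

enum refresh_result {
    REFRESH_OK = 0,
    REFRESH_NULL_ZONE,   /* a zone pointer was NULL */
    REFRESH_MISSING_PCP  /* cpu marked online, but its pcp was never allocated */
};

/*
 * Walk every zone and every online cpu and store the threshold in each
 * per-cpu structure. The model reports the first bad pointer it meets
 * instead of dereferencing it.
 */
enum refresh_result
refresh_thresholds_checked(struct zone_model **zones, size_t nr_zones,
                           const int *cpu_online, int threshold)
{
    assert(zones != NULL && cpu_online != NULL);

    for (size_t z = 0; z < nr_zones; z++) {
        if (zones[z] == NULL)
            return REFRESH_NULL_ZONE;
        for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
            if (!cpu_online[cpu])
                continue;
            if (zones[z]->pcp[cpu] == NULL)
                return REFRESH_MISSING_PCP;
            zones[z]->pcp[cpu]->stat_threshold = threshold;
        }
    }
    return REFRESH_OK;
}
```

Calling the function with an online map that marks a cpu whose `pcp` slot is NULL returns `REFRESH_MISSING_PCP`; without the check, the store to `stat_threshold` would be a write through a near-NULL address, which would be consistent with the small faulting address in the oops.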



Re: 2.6.24-rc3-mm1

2007-11-26 Thread Christoph Lameter
On Mon, 26 Nov 2007, Randy Dunlap wrote:

> ARCH_SELECT_MEMORY_MODEL depends on X86_32.  Is that too restrictive?

No. X86_64 only has one memory model.


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Jiri Slaby
On 11/26/2007 09:45 PM, Ingo Molnar wrote:
> * Andrew Morton [EMAIL PROTECTED] wrote:
>
> > On Mon, 26 Nov 2007 14:39:43 -0500
> > Rik van Riel [EMAIL PROTECTED] wrote:
> >
> > > On Tue, 20 Nov 2007 22:18:39 -0800
> > > Andrew Morton [EMAIL PROTECTED] wrote:
> > >
> > > > > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > > > > Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
> > > > > 'noapic' kernel parameter
> > > > ACPI or x86 breakage, I guess.
> > > >
> > > > Did 'noapic' work?
> > > I got the same bug as above, 'noapic' gets past that point
> > We still don't know what caused this, afaik.
>
> yes. Is it a regression? If yes, could someone try to bisect it so that
> we can fix it? If it's caused by x86.git then the 'mm' branch of the x86
> git tree can be used for bisection:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git

I did, but it's hard, if you don't know the BAD point. HEAD boots fine and 'x86:
randomize brk' too (the top of git-x86.patch). Andrew, how do you pull it, git
#mm doesn't fit to the ids from the patch.

Maybe if you can emit a broken-out with the fresh pull to test?

regards,
-- 
Jiri Slaby ([EMAIL PROTECTED])
Faculty of Informatics, Masaryk University
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Andrew Morton
On Mon, 26 Nov 2007 23:08:33 +0100
Jiri Slaby [EMAIL PROTECTED] wrote:

> On 11/26/2007 09:45 PM, Ingo Molnar wrote:
> > * Andrew Morton [EMAIL PROTECTED] wrote:
> >
> > > On Mon, 26 Nov 2007 14:39:43 -0500
> > > Rik van Riel [EMAIL PROTECTED] wrote:
> > >
> > > > On Tue, 20 Nov 2007 22:18:39 -0800
> > > > Andrew Morton [EMAIL PROTECTED] wrote:
> > > >
> > > > > > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > > > > > Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
> > > > > > 'noapic' kernel parameter
> > > > > ACPI or x86 breakage, I guess.
> > > > >
> > > > > Did 'noapic' work?
> > > > I got the same bug as above, 'noapic' gets past that point
> > > We still don't know what caused this, afaik.
> >
> > yes. Is it a regression? If yes, could someone try to bisect it so that
> > we can fix it? If it's caused by x86.git then the 'mm' branch of the x86
> > git tree can be used for bisection:
> >
> >    git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
>
> I did, but it's hard, if you don't know the BAD point. HEAD boots fine and 'x86:
> randomize brk' too (the top of git-x86.patch).

So the bug wasn't in git-x86 in 2.6.24-rc3-mm1.

But it might be in there now, as some patches got moved over.

Or it could be git-acpi.  Or lots of other things.

> Andrew, how do you pull it, git
> #mm doesn't fit to the ids from the patch.

The -mm git tree reimports the plain git-foo.patch files back into a new
git tree, so the commit IDs won't line up.

The way to find the culprit patch in 2.6.24-rc3-mm1 is
http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt.  It
will be quite quick.

> Maybe if you can emit a broken-out with the fresh pull to test?

http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
compile.  I'd suggest bisecting 2.6.24-rc3-mm1 would be easier.  


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Jiri Slaby
On 11/26/2007 11:17 PM, Andrew Morton wrote:
> http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
> compile.  I'd suggest bisecting 2.6.24-rc3-mm1 would be easier.

Yes, I've bisected this and it pointed to git-x86.patch + 2 pushed fixes from
the series. Then I tried x86 git, but its HEAD was OK.


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Jiri Slaby
On 11/26/2007 11:17 PM, Andrew Morton wrote:
> > Maybe if you can emit a broken-out with the fresh pull to test?
>
> http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
> compile.

Yes it did :). And it worked. Both in qemu and on my desktop...

qemu output at:
http://www.fi.muni.cz/~xslaby/sklad/qemu-output.txt

thanks,
-- 
Jiri Slaby ([EMAIL PROTECTED])
Faculty of Informatics, Masaryk University


Re: 2.6.24-rc3-mm1 - Kernel Panic on IO-APIC

2007-11-26 Thread Andrew Morton
On Tue, 27 Nov 2007 00:14:17 +0100
Jiri Slaby [EMAIL PROTECTED] wrote:

> On 11/26/2007 11:17 PM, Andrew Morton wrote:
> > > Maybe if you can emit a broken-out with the fresh pull to test?
> >
> > http://userweb.kernel.org/~akpm/mmotm/ is current.  But it probably won't
> > compile.
>
> Yes it did :). And it worked. Both in qemu and on my desktop...

boggle.  Let's slap 2.6.25 on it and take the rest of the year off.

> qemu output at:
> http://www.fi.muni.cz/~xslaby/sklad/qemu-output.txt

Thanks for testing.



Re: 2.6.24-rc3-mm1

2007-11-26 Thread Andrew Morton
On Fri, 23 Nov 2007 06:55:41 +0100 Gabriel C [EMAIL PROTECTED] wrote:

> Andrew Morton wrote:
> > On Fri, 23 Nov 2007 02:39:08 +0100 Gabriel C [EMAIL PROTECTED] wrote:
> >
> > > I have some warnings on each SCSI disc:
> > >
> > > ...
> > >
> > > [   30.724410] scsi 0:0:0:0: Direct-Access SEAGATE  ST318406LW   0109 PQ: 0 ANSI: 3
> > > [   30.724419] scsi0:A:0:0: Tagged Queuing enabled.  Depth 32
> > > [   30.724435]  target0:0:0: Beginning Domain Validation
> > > [   30.724446]  target0:0:0: Domain Validation Initial Inquiry Failed --
> > > [   30.724572]  target0:0:0: Ending Domain Validation
> > > [   30.729747] scsi 0:0:1:0: Direct-Access FUJITSU  MAH3182MP 0114 PQ: 0 ANSI: 4
> > > [   30.729754] scsi0:A:1:0: Tagged Queuing enabled.  Depth 32
> > > [   30.729771]  target0:0:1: Beginning Domain Validation
> > > [   30.729780]  target0:0:1: Domain Validation Initial Inquiry Failed --
> > > [   30.729908]  target0:0:1: Ending Domain Validation
> >
> > Don't know what would have caused that.  But yes, something is wrong in
> > scsi land.
>
> Actually I'm lucky the author didn't fix that FIXME in scsi_transport_spi.c
> and I still can boot ;)
>
> > > no idea whether this is related but buffered disk reads are 2.XX MB/sec
> > > and the box is somewhat laggy.
> > >
> > > hdparm -t on sda and sdb reports :
> > >
> > > /dev/sda:
> > >  Timing buffered disk reads: 8 MB in  3.26 seconds =   2.46 MB/sec
> > >
> > > /dev/sdb:
> > >  Timing buffered disk reads: 8 MB in  3.56 seconds =   2.25 MB/sec
> > >
> > > My IDE discs are fine.
> > >
> > > Please let me know if you need my config or any other information.
> >
> > And you're the second to report very slow scsi throughput in 2.6.24-rc3-mm1.
>
> I found the commit which causes these problems, it is in git-scsi-misc patch
> and reverting it fixes both problems for me.
>
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff_plain;h=8655a546c83fc43f0a73416bbd126d02de7ad6c0;hp=5bc717b6bdaaf52edf365eb7d9d8c89fec79df5d
 

OK, thanks.  I'll assume that James and Hannes have this in hand (or will
have, by mid-week) and I won't do anything here.



Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-26 Thread Valdis . Kletnieks
On Tue, 20 Nov 2007 20:45:25 PST, Andrew Morton said:
 
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/

Finally got both time and motivation to at least start a bisect..

2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200)

24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub
prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam*
dead. No serial console output, no pair of penguins on the monitor, no
netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does
*anything* is hold the power button for 5 seconds.  Whatever it is, it
happens *very* early (before we get as far as the 'Linux version 2.6.mumble'
banner), and happens *hard*.

I've bisected it down this far:

git-ipwireless_cs.patch GOOD
git-x86.patch
git-x86-fixup.patch
git-x86-thread_order-borkage.patch
git-x86-thread_order-borkage-fix.patch
git-x86-identify_cpu-fix.patch
git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
git-x86-inlining-borkage.patch
x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD

Anybody got any good debugging ideas before I go through and do the final
3 or 4 bisects?  I suspect I'll need them once I find the offending patch
to tell *why* said patch dies on my box - I've seen enough traffic regarding
-rc3-mm1 dying *later* to know it's probably a subtle issue and not one
that will be obvious once I finger a specific patch.  For example, it's
probably not the IO-APIC panic that people are seeing, because their kernels
live long enough to panic. ;)
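
The "final 3 or 4 bisects" estimate follows from halving the remaining range: the list above has a known-good entry at position 0 and a known-bad entry at position 10, which leaves ten candidates. A minimal sketch of that search in C, assuming a single culprit and that every intermediate point builds; `bisect_series()` and `fails_from()` are hypothetical helpers written for this illustration and are not part of any kernel, quilt or git tooling:

```c
#include <assert.h>
#include <stdbool.h>

/* Callback answering "does the tree fail with patches 0..index applied?" */
typedef bool (*fails_fn)(int index, void *ctx);

/*
 * Bisect an ordered patch series. 'good' is the index of the last entry
 * known to work, 'bad' the index of an entry known to fail. Returns the
 * index of the first failing entry and stores the number of test boots
 * in *tests.
 */
int bisect_series(int good, int bad, fails_fn fails, void *ctx, int *tests)
{
    assert(good < bad);
    *tests = 0;
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        (*tests)++;
        if (fails(mid, ctx))
            bad = mid;   /* culprit is at mid or earlier */
        else
            good = mid;  /* culprit is after mid */
    }
    return bad;
}

/* Hypothetical stand-in for "build, boot, observe": fails from *ctx onward. */
bool fails_from(int index, void *ctx)
{
    return index >= *(const int *)ctx;
}
```

With `good = 0` and `bad = 10` the loop needs three or four test boots, depending on where the culprit sits in the series.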





Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-26 Thread Andrew Morton
On Tue, 27 Nov 2007 02:16:26 -0500 [EMAIL PROTECTED] wrote:

> On Tue, 20 Nov 2007 20:45:25 PST, Andrew Morton said:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>
> Finally got both time and motivation to at least start a bisect..
>
> 2.6.23-mm1 works on my D820 (x86_64 kernel, Core2 Duo T7200)
>
> 24-rc3-mm1 (plus 3 patches from hotfixes/) bricks *instantly* at boot - grub
> prints its 3 or 4 lines saying what it loaded, the screen clears, and *blam*
> dead. No serial console output, no pair of penguins on the monitor, no
> netconsole, no earlyprintk=vga output, no alt-sysrq, only thing that does
> *anything* is hold the power button for 5 seconds.  Whatever it is, it
> happens *very* early (before we get as far as the 'Linux version 2.6.mumble'
> banner), and happens *hard*.
>
> I've bisected it down this far:
>
> git-ipwireless_cs.patch GOOD
> git-x86.patch
> git-x86-fixup.patch
> git-x86-thread_order-borkage.patch
> git-x86-thread_order-borkage-fix.patch
> git-x86-identify_cpu-fix.patch
> git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
> git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
> git-x86-inlining-borkage.patch
> x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
> x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD
>
> Anybody got any good debugging ideas before I go through and do the final
> 3 or 4 bisects?  I suspect I'll need them once I find the offending patch
> to tell *why* said patch dies on my box - I've seen enough traffic regarding
> -rc3-mm1 dying *later* to know it's probably a subtle issue and not one
> that will be obvious once I finger a specific patch.  For example, it's
> probably not the IO-APIC panic that people are seeing, because their kernels
> live long enough to panic. ;)

You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already
fixed it.

Otherwise, please proceed to work out which diff I need to drop and hope like
hell that it isn't git-x86..


Re: 2.6.24-rc3-mm1 - brick my Dell Latitude D820

2007-11-26 Thread Valdis . Kletnieks
On Mon, 26 Nov 2007 23:27:03 PST, Andrew Morton said:

> > git-x86.patch
> > git-x86-fixup.patch
> > git-x86-thread_order-borkage.patch
> > git-x86-thread_order-borkage-fix.patch
> > git-x86-identify_cpu-fix.patch
> > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko.patch
> > git-x86-memory_add_physaddr_to_nid-export-for-acpi-memhotplugko-checkpatch-fixes.patch
> > git-x86-inlining-borkage.patch
> > x86_64-set-cpu_index-to-nr_cpus-instead-of-0.patch
> > x86_64-make-sparsemem-vmemmap-the-default-memory-model-v2.patch BAD

> You could try http://userweb.kernel.org/~akpm/mmotm/ - we might have already
> fixed it.

I suspect that trying -rc3-mm1 but refreshing just the 10 patches above
from -mmotm would be far less likely to pull in other heartburn?

> Otherwise, please proceed to work out which diff I need to drop and hope like
> hell that it isn't git-x86..

That's a 41,240 line diff, the rest *total* to about 400 lines.  I don't have
warm-n-fuzzies about my odds here. ;)

I'm a git-idiot, but *do* know how to git-bisect through Linus tree - what
would I need to do to git-bisect through git-x86.patch? (I do *not* know how
to deal with more than 1 source git tree, so if the magic is just 'get a
linus tree, merge git-x86, then bisect as usual', I'm stuck on 'merge
git-x86')..





Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-25 Thread Hannes Reinecke
On Sat, Nov 24, 2007 at 07:44:13PM +0200, James Bottomley wrote:
> Probing intermittent failures in Domain Validation, even with the fixes
> applied leads me to the conclusion that there are further problems with
> this commit:
> 
> commit fc5eb4facedbd6d7117905e775cee1975f894e79
> Author: Hannes Reinecke <[EMAIL PROTECTED]>
> Date:   Tue Nov 6 09:23:40 2007 +0100
> 
> [SCSI] Do not requeue requests if REQ_FAILFAST is set
>  
> The essence of the problem is that you're causing REQ_FAILFAST to
> terminate commands with an error on requeuing conditions, some of which are
> relatively common on most SCSI devices.  While this may be the correct
> behaviour for multi-path, it's certainly wrong for the previously
> understood meaning of REQ_FAILFAST, which was "don't retry on error".
> That is why domain validation and other applications that use it to control
> error handling, but don't expect to get failures for a simple requeue,
> are now spitting errors.
> 
> I honestly can't see that, even for the multi-path case, returning an
> error when we're over queue depth is the correct thing to do (it may not
> matter to something like a symmetrix, but an array that has a non-zero
> cost associated with a path change, like a CPQ HSV or the AVT
> controllers, will show fairly large slow downs if you do this).  Even if
> this is the desired behaviour (and I think that's a policy issue),
> DID_NO_CONNECT is almost certainly the wrong error to be sending back.
> 
> This patch fixes up domain validation to work again correctly, however,
> I really think it's just a bandaid.  Do you want to rethink the above
> commit?
> 
Given the amount of errors, yes, I'll have to.
But we still face the initial problem that requeued requests will be
stuck in the queue forever (ie until the timeout catches it), causing
failover to be painfully slow.

Anyway, I'll think it over.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
[EMAIL PROTECTED] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
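
The two readings of REQ_FAILFAST argued over in this message can be put side by side in a small toy model. This is a Python sketch, not kernel code: `Request`, `decide` and the `fail_on_requeue` switch are names invented here for illustration, and only REQ_FAILFAST, DID_NO_CONNECT and the "over queue depth" condition come from the thread itself.

```python
from dataclasses import dataclass

# Toy model only: these names are hypothetical and do not mirror kernel APIs.
DISPATCH = "dispatch"                # hand the request to the device
REQUEUE = "requeue"                  # put it back on the queue and try later
FAIL_NO_CONNECT = "DID_NO_CONNECT"   # complete the request with an error


@dataclass
class Request:
    failfast: bool  # stands in for REQ_FAILFAST


def decide(req, in_flight, queue_depth, fail_on_requeue):
    """Decide what happens to req while in_flight commands are outstanding.

    fail_on_requeue=False models the previously understood meaning
    ("don't retry on error"): a full device is not an error, so requeue.
    fail_on_requeue=True models the behaviour criticised above: a failfast
    request that would have been requeued is failed instead.
    """
    if in_flight < queue_depth:
        return DISPATCH
    if req.failfast and fail_on_requeue:
        return FAIL_NO_CONNECT
    return REQUEUE
```

With the device full (`in_flight == queue_depth`), the same failfast request is requeued under the old reading and failed under the new one, which is the slow-failover versus spurious-error trade-off both sides describe.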


Re: 2.6.24-rc3-mm1 (sync is slow ?)

2007-11-25 Thread KAMEZAWA Hiroyuki
On Sat, 24 Nov 2007 19:04:34 +0100
Gabriel C <[EMAIL PROTECTED]> wrote:
> >> It seems OK here from a quick test (i386, ext3-on-IDE).
> >>
> >> Maybe device driver/block breakage?
> 
> Try revert
> 
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff_plain;h=8655a546c83fc43f0a73416bbd126d02de7ad6c0;hp=5bc717b6bdaaf52edf365eb7d9d8c89fec79df5d
> 
> See also :
> http://lkml.org/lkml/2007/11/23/5
> 
> and search for '2.6.24-rc3-mm1: I/O error, system hangs' on LKML
> 

Thank you!
The problem was fixed by reverting the patch you pointed out.

-Kame



Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-25 Thread Laurent Riffard
Le 25.11.2007 08:37, James Bottomley a écrit :
> On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote:
>> Le 24.11.2007 14:26, James Bottomley a écrit :
>>> OK, could you post dmesgs again, please.  I actually tested this with an
>>> aic79xx card, and for me it does cause Domain Validation to succeed
>>> again.
>> James, 
>>
>> Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the
>> BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).
>>
>> How to reproduce:
>> - boot
>> - switch to a text console
>> - capture dmesg in a file, sync, etc. There are 3 I/O errors, but the
>>   system does work.
>> - switch to X console, log in to the GNOME desktop; the system partially
>>   hangs.
>> - switch back to a text console: dmesg(1) still works, it shows some
>>   additional I/O errors. At this point, any disk access makes the system
>>   hang completely.
>>
>> Additional data:
>> - the I/O errors always happen on the same blocks.
>>
>> plain text document attachment (dmesg-2.6.24-rc3-mm1-patched)
> [...]
>> [   25.521256] scsi0 : pata_via
>> [   25.521711] scsi1 : pata_via
>> [   25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xb800 irq 14
>> [   25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xb808 irq 15
>> [   25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100
>> [   25.683208] ata1.00: 78165360 sectors, multi 16: LBA 
>> [   25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133
>> [   25.684116] ata1.01: 160086528 sectors, multi 16: LBA 
>> [   25.691127] ata1.00: configured for UDMA/100
>> [   25.699142] ata1.01: configured for UDMA/100
>> [   26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max UDMA/33
>> [   26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr
>> [   26.330839] ata2.00: configured for UDMA/33
>> [   26.490828] ata2.01: configured for MWDMA2
>> [   26.503014] scsi 0:0:0:0: Direct-Access ATA  ST340016A 3.75 PQ: 0 ANSI: 5
>> [   26.504670] scsi 0:0:1:0: Direct-Access ATA  Maxtor 6Y080L0 YAR4 PQ: 0 ANSI: 5
>> [   26.509842] scsi 1:0:0:0: CD-ROM HL-DT-ST DVDRAM GSA-4165B DL05 PQ: 0 ANSI: 5
>> [   26.511673] scsi 1:0:1:0: CD-ROM E-IDE CD-950E/AKU A4Q  PQ: 0 ANSI: 5
> [...]
>> [   60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
>> [   60.216124] end_request: I/O error, dev sda, sector 16460
> 
> I think this one's quite easy:  PATA devices in libata are queue depth 1
> (since they don't do NCQ).  Thus, they're peculiarly sensitive to the
> bug where we fail over queue depth requests.
> 
> On the other hand, I don't see how a filesystem request is getting
> REQ_FAILFAST ... unless there's a bio or readahead issue involved.
> Anyway, could you try this patch:
> 
> http://marc.info/?l=linux-scsi&m=119592627425498
> 
> Which should fix the queue depth issue, and see if the errors go away?

No, this one doesn't help...

-- 
laurent


Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley
On Sat, 2007-11-24 at 23:59 +0100, Laurent Riffard wrote:
> Le 24.11.2007 14:26, James Bottomley a écrit :
> > OK, could you post dmesgs again, please.  I actually tested this with an
> > aic79xx card, and for me it does cause Domain Validation to succeed
> > again.
> 
> James, 
> 
> Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the
> BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).
> 
> How to reproduce:
> - boot
> - switch to a text console
> - capture dmesg in a file, sync, etc. There are 3 I/O errors, but the
>   system does work.
> - switch to X console, log in to the GNOME desktop; the system partially
>   hangs.
> - switch back to a text console: dmesg(1) still works, it shows some
>   additional I/O errors. At this point, any disk access makes the system
>   hang completely.
> 
> Additional data:
> - the I/O errors always happen on the same blocks.
> 
> plain text document attachment (dmesg-2.6.24-rc3-mm1-patched)
[...]
> [   25.521256] scsi0 : pata_via
> [   25.521711] scsi1 : pata_via
> [   25.524089] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xb800 irq 14
> [   25.524176] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xb808 irq 15
> [   25.683141] ata1.00: ATA-5: ST340016A, 3.75, max UDMA/100
> [   25.683208] ata1.00: 78165360 sectors, multi 16: LBA 
> [   25.683475] ata1.01: ATA-7: Maxtor 6Y080L0, YAR41BW0, max UDMA/133
> [   25.684116] ata1.01: 160086528 sectors, multi 16: LBA 
> [   25.691127] ata1.00: configured for UDMA/100
> [   25.699142] ata1.01: configured for UDMA/100
> [   26.170860] ata2.00: ATAPI: HL-DT-ST DVDRAM GSA-4165B, DL05, max UDMA/33
> [   26.171562] ata2.01: ATAPI: CD-950E/AKU, A4Q, max MWDMA2, CDB intr
> [   26.330839] ata2.00: configured for UDMA/33
> [   26.490828] ata2.01: configured for MWDMA2
> [   26.503014] scsi 0:0:0:0: Direct-Access ATA  ST340016A 3.75 PQ: 0 ANSI: 5
> [   26.504670] scsi 0:0:1:0: Direct-Access ATA  Maxtor 6Y080L0 YAR4 PQ: 0 ANSI: 5
> [   26.509842] scsi 1:0:0:0: CD-ROM HL-DT-ST DVDRAM GSA-4165B DL05 PQ: 0 ANSI: 5
> [   26.511673] scsi 1:0:1:0: CD-ROM E-IDE CD-950E/AKU A4Q  PQ: 0 ANSI: 5
[...]
> [   60.216113] sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
> [   60.216124] end_request: I/O error, dev sda, sector 16460

I think this one's quite easy:  PATA devices in libata are queue depth 1
(since they don't do NCQ).  Thus, they're peculiarly sensitive to the
bug where we fail over queue depth requests.

On the other hand, I don't see how a filesystem request is getting
REQ_FAILFAST ... unless there's a bio or readahead issue involved.
Anyway, could you try this patch:

http://marc.info/?l=linux-scsi&m=119592627425498

Which should fix the queue depth issue, and see if the errors go away?

Thanks,

James
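
Why a queue depth of 1 makes a device so sensitive to failing over-queue-depth requests can be seen from a rough count. The Python sketch below is a toy model, not kernel code; `failed_in_burst` is a made-up helper, and it assumes every request in the burst is failfast and that no command completes while the burst is being submitted.

```python
def failed_in_burst(burst_size, queue_depth):
    """Count requests in a back-to-back burst that find the device full.

    Assumptions (illustration only): all requests are failfast, none of the
    in-flight commands completes during the burst, and a request that finds
    the device full is failed rather than requeued.
    """
    in_flight = 0
    failed = 0
    for _ in range(burst_size):
        if in_flight < queue_depth:
            in_flight += 1   # the device still has room, so dispatch
        else:
            failed += 1      # over queue depth, so the request is failed
    return failed
```

Under these assumptions a burst of four requests loses three of them at depth 1 and none at depth 32, so a PATA disk without NCQ would be expected to show the errors long before a device with a deep queue does.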




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Laurent Riffard
Le 24.11.2007 14:26, James Bottomley a écrit :
> On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
>> Le 24.11.2007 07:42, James Bottomley a écrit :
>>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
 Le 23.11.2007 12:38, Hannes Reinecke a écrit :
[snip]
 I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
 does fix the problem.

>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
>> error where
>> I shouldn't. Checking ...
>>
> Ok, found it. We are blocking even special commands (ie requests with 
> PREEMPT not set)
> when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
> this.
 Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
 errors.
>>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
>>> is the state that the domain validation uses and which we cannot kill
>>> fastfail on).  It's definitely wrong to kill fastfail requests when the
>>> state is QUIESCE.
>>>
>>> This patch (which is applied on top of Hannes original) separates the
>>> BLOCK and QUIESCE states correctly ... does this fix the problem?
>>
>> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
> 
> OK, could you post dmesgs again, please.  I actually tested this with an
> aic79xx card, and for me it does cause Domain Validation to succeed
> again.

James, 

Here is a dmesg produced by 2.6.24-rc3-mm1 + your patch "separates the 
BLOCK and QUIESCE states correctly" (http://lkml.org/lkml/2007/11/24/8).

How to reproduce:
- boot
- switch to a text console
- capture dmesg in a file, sync, etc. There are 3 I/O errors, but the 
  system does work.
- switch to X console, log in to the GNOME desktop; the system partially 
  hangs.
- switch back to a text console: dmesg(1) still works, it shows some 
  additional I/O errors. At this point, any disk access makes the system 
  hang completely.

Additional data:
- the I/O errors always happen on the same blocks.

-- 
laurent
[0.00] Linux version 2.6.24-rc3-mm1 ([EMAIL PROTECTED]) (gcc version 
4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)) #122 PREEMPT Fri Nov 23 
18:47:58 CET 2007
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820:  - 0009fc00 (usable)
[0.00]  BIOS-e820: 0009fc00 - 000a (reserved)
[0.00]  BIOS-e820: 000f - 0010 (reserved)
[0.00]  BIOS-e820: 0010 - 1ffec000 (usable)
[0.00]  BIOS-e820: 1ffec000 - 1ffef000 (ACPI data)
[0.00]  BIOS-e820: 1ffef000 - 1000 (reserved)
[0.00]  BIOS-e820: 1000 - 2000 (ACPI NVS)
[0.00]  BIOS-e820:  - 0001 (reserved)
[0.00] 511MB LOWMEM available.
[0.00] Entering add_active_range(0, 0, 131052) 0 entries of 256 used
[0.00] sizeof(struct page) = 32
[0.00] Zone PFN ranges:
[0.00]   DMA 0 -> 4096
[0.00]   Normal   4096 ->   131052
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[1] active PFN ranges
[0.00] 0:0 ->   131052
[0.00] On node 0 totalpages: 131052
[0.00] Node 0 memmap at 0xC100 size 4194304 first pfn 0xC100
[0.00]   DMA zone: 32 pages used for memmap
[0.00]   DMA zone: 0 pages reserved
[0.00]   DMA zone: 4064 pages, LIFO batch:0
[0.00]   Normal zone: 991 pages used for memmap
[0.00]   Normal zone: 125965 pages, LIFO batch:31
[0.00]   Movable zone: 0 pages used for memmap
[0.00] DMI 2.3 present.
[0.00] ACPI: RSDP 000F6A80, 0014 (r0 ASUS  )
[0.00] ACPI: RSDT 1FFEC000, 002C (r1 ASUS   A7V133-C 30303031 MSFT 
31313031)
[0.00] ACPI: FACP 1FFEC080, 0074 (r1 ASUS   A7V133-C 30303031 MSFT 
31313031)
[0.00] ACPI: DSDT 1FFEC100, 2CE1 (r1   ASUS A7V133-C 1000 MSFT  
10B)
[0.00] ACPI: FACS 1000, 0040
[0.00] ACPI: BOOT 1FFEC040, 0028 (r1 ASUS   A7V133-C 30303031 MSFT 
31313031)
[0.00] ACPI: PM-Timer IO Port: 0xe408
[0.00] Allocating PCI resources starting at 3000 (gap: 
2000:dfff)
[0.00] swsusp: Registered nosave memory region: 0009f000 - 
000a
[0.00] swsusp: Registered nosave memory region: 000a - 
000f
[0.00] swsusp: Registered nosave memory region: 000f - 
0010
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 130029
[0.00] Kernel command line: root=/dev/mapper/vglinux1-lv_ubuntu2 ro 
locale=fr_FR video=radeonfb:[EMAIL PROTECTED] resume=/dev/mapper/vglinux1-lvswap
[0.00] Local APIC disabled by BIOS -- you can enable it with "lapic"
[0.00] mapped APIC to b000 

Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
Gabriel C wrote:
> James Bottomley wrote:
>> On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
>>> James Bottomley wrote:
 On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
> Le 24.11.2007 07:42, James Bottomley a écrit :
>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>>> Le 23.11.2007 12:38, Hannes Reinecke a écrit :
 Hannes Reinecke wrote:
> Laurent Riffard wrote:
>> Le 21.11.2007 23:41, Andrew Morton a écrit :
>>> On Wed, 21 Nov 2007 22:45:22 +0100
>>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>>
 Le 21.11.2007 05:45, Andrew Morton a écrit :
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
 Hello, 

 My system hangs shortly after I logged in Gnome desktop. SysRq-W 
 shows
 that a bunch of task are blocked in "D" state, they seem to wait 
 for
 some I/O completion. I can try to hand-copy some data if requested.

 I found these messages in dmesg:

 ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
 EXT3-fs: mounted filesystem with ordered data mode.
 sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sda, sector 16460
 ReiserFS: sda7: found reiserfs format "3.6" with standard journal
 ReiserFS: sda7: using ordered data mode
 --
 ReiserFS: sda7: Using r5 hash to sort names
 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sdb, sector 19632
 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sdb, sector 40037363
 Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
 extents:1 across:1048568k
 lp0: using parport0 (interrupt-driven).

 These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
 reproducible.
 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.

 Maybe something is broken in pata_via driver ?

>>> Could be - 
>>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>>> and 
>>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>>> touch pata_via.c.
>> None of the above...
>>
>> I did a bisection, it spotted git-scsi-misc.patch. 
>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works 
>> fine.
>>
>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do 
>> not 
>> requeue requests if REQ_FAILFAST is set" is the real culprit. The 
>> other 
>> commits are touching documentation or drivers I don't use. I'll try 
>> to revert only this one this evening.
>>> I can confirm : reverting commit 
>>> 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
>>> does fix the problem.
>>>
> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
> error where
> I shouldn't. Checking ...
>
 Ok, found it. We are blocking even special commands (ie requests with 
 PREEMPT not set)
 when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
 this.
>>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with 
>>> I/O errors.
>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
>> is the state that the domain validation uses and which we cannot kill
>> fastfail on).  It's definitely wrong to kill fastfail requests when the
>> state is QUIESCE.
>>
>> This patch (which is applied on top of Hannes original) separates the
>> BLOCK and QUIESCE states correctly ... does this fix the problem?
> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
 OK, could you post dmesgs again, please.  I actually tested this with an
 aic79xx card, and for me it does cause Domain Validation to succeed
 again.

>>> Are the patches intended to fix that problem as well?
>>>
>>> http://lkml.org/lkml/2007/11/23/5
>> That dmesg is from an unknown SCSI card exhibiting Domain Validation
>> problems, so it's a reasonable probability, yes ... but you'll need the
>> additional hack I just did to prevent further intermittent failures.
> 
> My controller is:
> 
> 03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] 
> (rev 02)
> 
> I'll try the patches in a bit.

With your patches my problem(s) are solved. Domain Validation works again.
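
A rough way to picture the BLOCK versus QUIESCE distinction quoted above is the toy Python model below. It is not the kernel's state check: `prep` and the `separate_states` switch are invented for illustration, and the pass/defer outcomes for non-failfast requests are an assumption of this sketch, as is the idea that domain validation issues its commands with both failfast and preempt set while the device is quiesced.

```python
# Toy model; the state and flag names echo the thread, the rest is hypothetical.
RUNNING, BLOCK, QUIESCE = "RUNNING", "BLOCK", "QUIESCE"
PASS, DEFER, KILL = "pass", "defer", "kill"


def prep(state, failfast, preempt, separate_states):
    """Return what happens to a request for a device in the given state.

    separate_states=False lumps BLOCK and QUIESCE together, so a failfast
    request is killed in either state.
    separate_states=True kills failfast requests only in BLOCK; in QUIESCE
    a special (preempt) request passes and anything else is deferred.
    """
    if state == RUNNING:
        return PASS
    kill_failfast = state == BLOCK or not separate_states
    if failfast and kill_failfast:
        return KILL
    if state == QUIESCE and preempt:
        return PASS
    return DEFER
```

In this model a quiesced device kills a failfast, preempt request when the two states are lumped together and lets it through once they are separated, which matches the report that Domain Validation works again with the patches applied.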

...

[   32.179521] scsi 

Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
James Bottomley wrote:
> On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
>> James Bottomley wrote:
>>> On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
 Le 24.11.2007 07:42, James Bottomley a écrit :
> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>> Le 23.11.2007 12:38, Hannes Reinecke a écrit :
>>> Hannes Reinecke wrote:
 Laurent Riffard wrote:
> Le 21.11.2007 23:41, Andrew Morton a écrit :
>> On Wed, 21 Nov 2007 22:45:22 +0100
>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>
>>> Le 21.11.2007 05:45, Andrew Morton a écrit :
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>>> Hello, 
>>>
>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W 
>>> shows
>>> that a bunch of task are blocked in "D" state, they seem to wait for
>>> some I/O completion. I can try to hand-copy some data if requested.
>>>
>>> I found these messages in dmesg:
>>>
>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
>>> EXT3-fs: mounted filesystem with ordered data mode.
>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sda, sector 16460
>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>>> ReiserFS: sda7: using ordered data mode
>>> --
>>> ReiserFS: sda7: Using r5 hash to sort names
>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sdb, sector 19632
>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>> end_request: I/O error, dev sdb, sector 40037363
>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
>>> extents:1 across:1048568k
>>> lp0: using parport0 (interrupt-driven).
>>>
>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
>>> reproducible.
>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>>
>>> Maybe something is broken in pata_via driver ?
>>>
>> Could be - 
>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>> and 
>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>> touch pata_via.c.
> None of the above...
>
> I did a bisection, it spotted git-scsi-misc.patch. 
> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works 
> fine.
>
> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do 
> not 
> requeue requests if REQ_FAILFAST is set" is the real culprit. The 
> other 
> commits are touching documentation or drivers I don't use. I'll try 
> to revert only this one this evening.
>> I can confirm : reverting commit 
>> 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
>> does fix the problem.
>>
 Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
 error where
 I shouldn't. Checking ...

>>> Ok, found it. We are blocking even special commands (ie requests with 
>>> PREEMPT not set)
>>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
>>> this.
>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
>> errors.
> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
> is the state that the domain validation uses and which we cannot kill
> fastfail on).  It's definitely wrong to kill fastfail requests when the
> state is QUIESCE.
>
> This patch (which is applied on top of Hannes original) separates the
> BLOCK and QUIESCE states correctly ... does this fix the problem?
 No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
>>> OK, could you post dmesgs again, please.  I actually tested this with an
>>> aic79xx card, and for me it does cause Domain Validation to succeed
>>> again.
>>>
>> Are the patches intended to fix that problem as well?
>>
>> http://lkml.org/lkml/2007/11/23/5
> 
> That dmesg is from an unknown SCSI card exhibiting Domain Validation
> problems, so it's a reasonable probability, yes ... but you'll need the
> additional hack I just did to prevent further intermittent failures.

My controller is:

03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] 
(rev 02)

I'll try the patches in a bit.

> 
> James
> 

Gabriel

Re: 2.6.24-rc3-mm1 (sync is slow ?)

2007-11-24 Thread Gabriel C
kosaki wrote:
> Hi, Andrew 
> 
>>> Hi, Andrew
>>>
>>> I got following result in 'sync' command.
>>> It was too slow. (memory controller config is off ;)
>>> I attaches my .config.
>>> ==
>  (snip)
>> Well I wonder how we did that.
>>
>> It seems OK here from a quick test (i386, ext3-on-IDE).
>>
>> Maybe device driver/block breakage?

Try revert

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff_plain;h=8655a546c83fc43f0a73416bbd126d02de7ad6c0;hp=5bc717b6bdaaf52edf365eb7d9d8c89fec79df5d

See also :
http://lkml.org/lkml/2007/11/23/5

and search for '2.6.24-rc3-mm1: I/O error, system hangs' on LKML

> 
> I tested x86, ext3-on-SATA(/dev/sda).
> It seems to work well.
> 
> Hmm...

IDE/SATA is fine here as well; only SCSI broke.


Regards,

Gabriel 


Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley

On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
> James Bottomley wrote:
> > On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
> >> On 24.11.2007 07:42, James Bottomley wrote:
> >>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
> >>>> On 23.11.2007 12:38, Hannes Reinecke wrote:
> >>>>> Hannes Reinecke wrote:
> >>>>>> Laurent Riffard wrote:
> >>>>>>> On 21.11.2007 23:41, Andrew Morton wrote:
> >>>>>>>> On Wed, 21 Nov 2007 22:45:22 +0100
> >>>>>>>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
> >>>>>>>>
> >>>>>>>>> On 21.11.2007 05:45, Andrew Morton wrote:
> >>>>>>>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W
> >>>>>>>>> shows
> >>>>>>>>> that a bunch of task are blocked in "D" state, they seem to wait for
> >>>>>>>>> some I/O completion. I can try to hand-copy some data if requested.
> >>>>>>>>>
> >>>>>>>>> I found these messages in dmesg:
> >>>>>>>>>
> >>>>>>>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
> >>>>>>>>> EXT3-fs: mounted filesystem with ordered data mode.
> >>>>>>>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT
> >>>>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
> >>>>>>>>> end_request: I/O error, dev sda, sector 16460
> >>>>>>>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> >>>>>>>>> ReiserFS: sda7: using ordered data mode
> >>>>>>>>> --
> >>>>>>>>> ReiserFS: sda7: Using r5 hash to sort names
> >>>>>>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT
> >>>>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
> >>>>>>>>> end_request: I/O error, dev sdb, sector 19632
> >>>>>>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT
> >>>>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
> >>>>>>>>> end_request: I/O error, dev sdb, sector 40037363
> >>>>>>>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1
> >>>>>>>>> extents:1 across:1048568k
> >>>>>>>>> lp0: using parport0 (interrupt-driven).
> >>>>>>>>>
> >>>>>>>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100%
> >>>>>>>>> reproducible.
> >>>>>>>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
> >>>>>>>>>
> >>>>>>>>> Maybe something is broken in pata_via driver ?
> >>>>>>>>>
> >>>>>>>> Could be -
> >>>>>>>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
> >>>>>>>> and
> >>>>>>>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
> >>>>>>>> touch pata_via.c.
> >>>>>>> None of the above...
> >>>>>>>
> >>>>>>> I did a bisection, it spotted git-scsi-misc.patch.
> >>>>>>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works
> >>>>>>> fine.
> >>>>>>>
> >>>>>>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do
> >>>>>>> not
> >>>>>>> requeue requests if REQ_FAILFAST is set" is the real culprit. The
> >>>>>>> other
> >>>>>>> commits are touching documentation or drivers I don't use. I'll try
> >>>>>>> to revert only this one this evening.
> >>>> I can confirm : reverting commit
> >>>> 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> >>>> does fix the problem.
> >>>>
> >>>>>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an
> >>>>>> error where
> >>>>>> I shouldn't. Checking ...
> >>>>>>
> >>>>> Ok, found it. We are blocking even special commands (ie requests with
> >>>>> PREEMPT not set)
> >>>>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes
> >>>>> this.
> >>>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O
> >>>> errors.
> >>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
> >>> is the state that the domain validation uses and which we cannot kill
> >>> fastfail on).  It's definitely wrong to kill fastfail requests when the
> >>> state is QUIESCE.
> >>>
> >>> This patch (which is applied on top of Hannes original) separates the
> >>> BLOCK and QUIESCE states correctly ... does this fix the problem?
> >>
> >> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
> > 
> > OK, could you post dmesgs again, please.  I actually tested this with an
> > aic79xx card, and for me it does cause Domain Validation to succeed
> > again.
> > 
> 
> Are the patches intended to fix that problem as well?
> 
> http://lkml.org/lkml/2007/11/23/5

That dmesg is from an unknown SCSI card exhibiting Domain Validation
problems, so it's a reasonable probability, yes ... but you'll need the
additional hack I just did to prevent further intermittent failures.

James




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
James Bottomley wrote:
> On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
>> On 24.11.2007 07:42, James Bottomley wrote:
>>> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>>>> On 23.11.2007 12:38, Hannes Reinecke wrote:
>>>>> Hannes Reinecke wrote:
>>>>>> Laurent Riffard wrote:
>>>>>>> On 21.11.2007 23:41, Andrew Morton wrote:
>>>>>>>> On Wed, 21 Nov 2007 22:45:22 +0100
>>>>>>>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> On 21.11.2007 05:45, Andrew Morton wrote:
>>>>>>>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
>>>>>>>>> that a bunch of task are blocked in "D" state, they seem to wait for
>>>>>>>>> some I/O completion. I can try to hand-copy some data if requested.
>>>>>>>>>
>>>>>>>>> I found these messages in dmesg:
>>>>>>>>>
>>>>>>>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
>>>>>>>>> EXT3-fs: mounted filesystem with ordered data mode.
>>>>>>>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT
>>>>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>>>>>>>> end_request: I/O error, dev sda, sector 16460
>>>>>>>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>>>>>>>>> ReiserFS: sda7: using ordered data mode
>>>>>>>>> --
>>>>>>>>> ReiserFS: sda7: Using r5 hash to sort names
>>>>>>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT
>>>>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>>>>>>>> end_request: I/O error, dev sdb, sector 19632
>>>>>>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT
>>>>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>>>>>>>> end_request: I/O error, dev sdb, sector 40037363
>>>>>>>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1
>>>>>>>>> extents:1 across:1048568k
>>>>>>>>> lp0: using parport0 (interrupt-driven).
>>>>>>>>>
>>>>>>>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100%
>>>>>>>>> reproducible.
>>>>>>>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>>>>>>>>
>>>>>>>>> Maybe something is broken in pata_via driver ?
>>>>>>>>>
>>>>>>>> Could be -
>>>>>>>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>>>>>>>> and
>>>>>>>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>>>>>>>> touch pata_via.c.
>>>>>>> None of the above...
>>>>>>>
>>>>>>> I did a bisection, it spotted git-scsi-misc.patch.
>>>>>>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works
>>>>>>> fine.
>>>>>>>
>>>>>>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
>>>>>>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other
>>>>>>> commits are touching documentation or drivers I don't use. I'll try
>>>>>>> to revert only this one this evening.
>>>> I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
>>>> does fix the problem.
>>>>
>>>>>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an
>>>>>> error where
>>>>>> I shouldn't. Checking ...
>>>>>>
>>>>> Ok, found it. We are blocking even special commands (ie requests with
>>>>> PREEMPT not set)
>>>>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes
>>>>> this.
>>>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O
>>>> errors.
>>> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
>>> is the state that the domain validation uses and which we cannot kill
>>> fastfail on).  It's definitely wrong to kill fastfail requests when the
>>> state is QUIESCE.
>>>
>>> This patch (which is applied on top of Hannes original) separates the
>>> BLOCK and QUIESCE states correctly ... does this fix the problem?
>>
>> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
> 
> OK, could you post dmesgs again, please.  I actually tested this with an
> aic79xx card, and for me it does cause Domain Validation to succeed
> again.
> 

Are the patches intended to fix that problem as well?

http://lkml.org/lkml/2007/11/23/5

> James

Gabriel 



Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley
Probing intermittent failures in Domain Validation, even with the fixes
applied, leads me to the conclusion that there are further problems with
this commit:

commit fc5eb4facedbd6d7117905e775cee1975f894e79
Author: Hannes Reinecke <[EMAIL PROTECTED]>
Date:   Tue Nov 6 09:23:40 2007 +0100

[SCSI] Do not requeue requests if REQ_FAILFAST is set
 
The essence of the problem is that you're causing REQ_FAILFAST to
terminate commands with an error on requeuing conditions, some of which
are relatively common on most SCSI devices.  While this may be the
correct behaviour for multi-path, it's certainly wrong for the previously
understood meaning of REQ_FAILFAST, which was "don't retry on error".
That is why domain validation and other applications, which use the flag
to control error handling but don't expect to get failures for a simple
requeue, are now spitting errors.

I honestly can't see that, even for the multi-path case, returning an
error when we're over queue depth is the correct thing to do (it may not
matter to something like a Symmetrix, but an array that has a non-zero
cost associated with a path change, like a CPQ HSV or the AVT
controllers, will show fairly large slowdowns if you do this).  Even if
this is the desired behaviour (and I think that's a policy issue),
DID_NO_CONNECT is almost certainly the wrong error to be sending back.

This patch makes domain validation work correctly again; however, I
really think it's just a band-aid.  Do you want to rethink the above
commit?

James

Index: BUILD-2.6/drivers/scsi/scsi_lib.c
===
--- BUILD-2.6.orig/drivers/scsi/scsi_lib.c  2007-11-24 11:25:20.0 -0600
+++ BUILD-2.6/drivers/scsi/scsi_lib.c   2007-11-24 11:26:22.0 -0600
@@ -1552,7 +1552,8 @@ static void scsi_request_fn(struct reque
 			break;
 
 		if (!scsi_dev_queue_ready(q, sdev)) {
-			if (req->cmd_flags & REQ_FAILFAST) {
+			if ((req->cmd_flags & REQ_FAILFAST) &&
+			    !(req->cmd_flags & REQ_PREEMPT)) {
 				scsi_kill_request(req, q);
 				continue;
 			}
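
For readers following the hunk above, here is a minimal sketch of the
decision it changes, written as standalone C rather than kernel code.
The flag names and values (SKETCH_REQ_FAILFAST, SKETCH_REQ_PREEMPT) and
both helper functions are hypothetical stand-ins, and the queue_ready
argument stands for the result of scsi_dev_queue_ready(); none of this
is the real kernel definition.

```c
#include <stdbool.h>

/* Stand-in flag bits; the values are arbitrary and NOT the kernel's. */
enum {
	SKETCH_REQ_FAILFAST = 1u << 0,
	SKETCH_REQ_PREEMPT  = 1u << 1,
};

/*
 * Before the patch: any failfast request is killed as soon as the
 * device queue is not ready.
 */
bool kill_before_patch(unsigned int cmd_flags, bool queue_ready)
{
	return !queue_ready && (cmd_flags & SKETCH_REQ_FAILFAST);
}

/*
 * With the patch: a failfast request that also carries PREEMPT is no
 * longer killed; it is left to the normal "queue not ready" handling.
 */
bool kill_after_patch(unsigned int cmd_flags, bool queue_ready)
{
	return !queue_ready &&
	       (cmd_flags & SKETCH_REQ_FAILFAST) &&
	       !(cmd_flags & SKETCH_REQ_PREEMPT);
}
```

Under these assumptions, a request carrying both bits is killed by the
first function but not by the second, while a plain failfast request is
killed by both whenever the queue is not ready.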




Re: 2.6.24-rc3-mm1 make headers_check fails

2007-11-24 Thread Adrian Bunk
On Wed, Nov 21, 2007 at 10:58:21AM +0100, Sam Ravnborg wrote:
> On Wed, Nov 21, 2007 at 10:44:40AM +0200, Avi Kivity wrote:
> > Kamalesh Babulal wrote:
> > >Andrew Morton wrote:
> > >  
> > >>On Wed, 21 Nov 2007 13:54:50 +0530 Kamalesh Babulal 
> > >><[EMAIL PROTECTED]> wrote:
> > >>
> > >>
> > >>>The make headers_check fails,
> > >>>
> > >>>  CHECK   include/linux/usb/gadgetfs.h
> > >>>  CHECK   include/linux/usb/ch9.h
> > >>>  CHECK   include/linux/usb/cdc.h
> > >>>  CHECK   include/linux/usb/audio.h
> > >>>  CHECK   include/linux/kvm.h
> > >>>/root/kernels/linux-2.6.24-rc3/usr/include/linux/kvm.h requires 
> > >>>asm/kvm.h, which does not exist in exported headers
> > >>>  
> > >>hm, works for me, on i386 and x86_64.  What's different over there?
> > >>
> > >Hi Andrew,
> > >
> > >It fails on the powerpc box, with allyesconfig option.
> > >
> > >  
> > 
> > How do we fix this?  Export linux/kvm.h only on x86?  Seems ugly.
> 
> Is kvm x86 specific? Then move the .h file to asm-x86.
> Otherwise no good idea...

What about adding a whitelist in hdrcheck.sh?

For all practical purposes in userspace the compile error due to the 
non-existing asm header should be fine, so there's no reason to change 
the code in such cases. 

>   Sam

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley
On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
> On 24.11.2007 07:42, James Bottomley wrote:
> > On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
> >> On 23.11.2007 12:38, Hannes Reinecke wrote:
> >>> Hannes Reinecke wrote:
> >>>> Laurent Riffard wrote:
> >>>>> On 21.11.2007 23:41, Andrew Morton wrote:
> >>>>>> On Wed, 21 Nov 2007 22:45:22 +0100
> >>>>>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
> >>>>>>
> >>>>>>> On 21.11.2007 05:45, Andrew Morton wrote:
> >>>>>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
> >>>>>>> that a bunch of task are blocked in "D" state, they seem to wait for
> >>>>>>> some I/O completion. I can try to hand-copy some data if requested.
> >>>>>>>
> >>>>>>> I found these messages in dmesg:
> >>>>>>>
> >>>>>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
> >>>>>>> EXT3-fs: mounted filesystem with ordered data mode.
> >>>>>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT
> >>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
> >>>>>>> end_request: I/O error, dev sda, sector 16460
> >>>>>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
> >>>>>>> ReiserFS: sda7: using ordered data mode
> >>>>>>> --
> >>>>>>> ReiserFS: sda7: Using r5 hash to sort names
> >>>>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT
> >>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
> >>>>>>> end_request: I/O error, dev sdb, sector 19632
> >>>>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT
> >>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
> >>>>>>> end_request: I/O error, dev sdb, sector 40037363
> >>>>>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1
> >>>>>>> extents:1 across:1048568k
> >>>>>>> lp0: using parport0 (interrupt-driven).
> >>>>>>>
> >>>>>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100%
> >>>>>>> reproducible.
> >>>>>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
> >>>>>>>
> >>>>>>> Maybe something is broken in pata_via driver ?
> >>>>>>>
> >>>>>> Could be -
> >>>>>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
> >>>>>> and
> >>>>>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
> >>>>>> touch pata_via.c.
> >>>>> None of the above...
> >>>>>
> >>>>> I did a bisection, it spotted git-scsi-misc.patch.
> >>>>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works
> >>>>> fine.
> >>>>>
> >>>>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
> >>>>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other
> >>>>> commits are touching documentation or drivers I don't use. I'll try
> >>>>> to revert only this one this evening.
> >> I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> >> does fix the problem.
> >>
> >>>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an
> >>>> error where
> >>>> I shouldn't. Checking ...
> 
> >>> Ok, found it. We are blocking even special commands (ie requests with 
> >>> PREEMPT not set)
> >>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
> >>> this.
> >> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
> >> errors.
> > 
> > I think the problem is the way we treat BLOCKED and QUIESCED (the latter
> > is the state that the domain validation uses and which we cannot kill
> > fastfail on).  It's definitely wrong to kill fastfail requests when the
> > state is QUIESCE.
> > 
> > This patch (which is applied on top of Hannes original) separates the
> > BLOCK and QUIESCE states correctly ... does this fix the problem?
> 
> 
> No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)

OK, could you post dmesgs again, please.  I actually tested this with an
aic79xx card, and for me it does cause Domain Validation to succeed
again.

James




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Laurent Riffard


On 24.11.2007 07:42, James Bottomley wrote:
> On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
>> On 23.11.2007 12:38, Hannes Reinecke wrote:
>>> Hannes Reinecke wrote:
>>>> Laurent Riffard wrote:
>>>>> On 21.11.2007 23:41, Andrew Morton wrote:
>>>>>> On Wed, 21 Nov 2007 22:45:22 +0100
>>>>>> Laurent Riffard <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> On 21.11.2007 05:45, Andrew Morton wrote:
>>>>>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
>>>>>>> Hello,
>>>>>>>
>>>>>>> My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
>>>>>>> that a bunch of task are blocked in "D" state, they seem to wait for
>>>>>>> some I/O completion. I can try to hand-copy some data if requested.
>>>>>>>
>>>>>>> I found these messages in dmesg:
>>>>>>>
>>>>>>> ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1
>>>>>>> EXT3-fs: mounted filesystem with ordered data mode.
>>>>>>> sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT
>>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>>>>>> end_request: I/O error, dev sda, sector 16460
>>>>>>> ReiserFS: sda7: found reiserfs format "3.6" with standard journal
>>>>>>> ReiserFS: sda7: using ordered data mode
>>>>>>> --
>>>>>>> ReiserFS: sda7: Using r5 hash to sort names
>>>>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT
>>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>>>>>> end_request: I/O error, dev sdb, sector 19632
>>>>>>> sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT
>>>>>>> driverbyte=DRIVER_OK,SUGGEST_OK
>>>>>>> end_request: I/O error, dev sdb, sector 40037363
>>>>>>> Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1
>>>>>>> extents:1 across:1048568k
>>>>>>> lp0: using parport0 (interrupt-driven).
>>>>>>>
>>>>>>> These errors occur *only* with 2.6.24-rc3-mm1, they are 100%
>>>>>>> reproducible.
>>>>>>> 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.
>>>>>>>
>>>>>>> Maybe something is broken in pata_via driver ?
>>>>>>>
>>>>>> Could be -
>>>>>> libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
>>>>>> and
>>>>>> pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
>>>>>> touch pata_via.c.
>>>>> None of the above...
>>>>>
>>>>> I did a bisection, it spotted git-scsi-misc.patch.
>>>>> I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works fine.
>>>>>
>>>>> I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 "[SCSI] Do not
>>>>> requeue requests if REQ_FAILFAST is set" is the real culprit. The other
>>>>> commits are touching documentation or drivers I don't use. I'll try
>>>>> to revert only this one this evening.
>> I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
>> does fix the problem.
>>
>>>> Hmm. Weird. I'll have a look into it. Apparently I'll be returning an
>>>> error where
>>>> I shouldn't. Checking ...

>>> Ok, found it. We are blocking even special commands (ie requests with 
>>> PREEMPT not set)
>>> when FAILFAST is set. Which is clearly wrong. The attached patch fixes this.
>> Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
>> errors.
> 
> I think the problem is the way we treat BLOCKED and QUIESCED (the latter
> is the state that the domain validation uses and which we cannot kill
> fastfail on).  It's definitely wrong to kill fastfail requests when the
> state is QUIESCE.
> 
> This patch (which is applied on top of Hannes original) separates the
> BLOCK and QUIESCE states correctly ... does this fix the problem?


No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)


> James
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 13e7e09..a7cf23a 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1279,18 +1279,21 @@ int scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
>  				    "rejecting I/O to dead device\n");
>  			ret = BLKPREP_KILL;
>  			break;
> -		case SDEV_QUIESCE:
>  		case SDEV_BLOCK:
>  			/*
> -			 * If the devices is blocked we defer normal commands.
> -			 */
> -			if (!(req->cmd_flags & REQ_PREEMPT))
> -				ret = BLKPREP_DEFER;
> -			/*
>  			 * Return failfast requests immediately
>  			 */
>  			if (req->cmd_flags & REQ_FAILFAST)
>  				ret = BLKPREP_KILL;
> +
> +			/* fall through */
> +
> +		case SDEV_QUIESCE:
> +			/*
> +			 * If the devices is blocked we defer normal commands.
> +			 */
> +			if (!(req->cmd_flags & REQ_PREEMPT))
> +				ret = BLKPREP_DEFER;
>  			break;
>  		default:
>  			/*
> 
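
To make the effect of the quoted hunk easier to follow, here is a small
standalone C sketch that models the two versions of the BLOCK/QUIESCE
handling side by side. Every name in it (the SK_ enums, prep_before,
prep_after) is a hypothetical stand-in, the numeric values are
arbitrary, and the initial "OK" return value is an assumption; it only
mirrors the statements visible in the diff.

```c
/* Stand-in types; names echo the hunk but values are arbitrary. */
enum sketch_state { SK_RUNNING, SK_BLOCK, SK_QUIESCE };
enum sketch_prep  { SK_PREP_OK, SK_PREP_DEFER, SK_PREP_KILL };
enum { SK_FAILFAST = 1, SK_PREEMPT = 2 };

/* '-' side of the hunk: QUIESCE and BLOCK share one case body. */
enum sketch_prep prep_before(enum sketch_state st, unsigned int flags)
{
	enum sketch_prep ret = SK_PREP_OK;	/* assumed starting value */

	switch (st) {
	case SK_QUIESCE:
	case SK_BLOCK:
		if (!(flags & SK_PREEMPT))
			ret = SK_PREP_DEFER;
		if (flags & SK_FAILFAST)
			ret = SK_PREP_KILL;
		break;
	default:
		break;
	}
	return ret;
}

/* '+' side of the hunk: BLOCK runs first, then falls into QUIESCE. */
enum sketch_prep prep_after(enum sketch_state st, unsigned int flags)
{
	enum sketch_prep ret = SK_PREP_OK;	/* assumed starting value */

	switch (st) {
	case SK_BLOCK:
		if (flags & SK_FAILFAST)
			ret = SK_PREP_KILL;
		/* fall through */
	case SK_QUIESCE:
		if (!(flags & SK_PREEMPT))
			ret = SK_PREP_DEFER;
		break;
	default:
		break;
	}
	return ret;
}
```

In this model a failfast request with PREEMPT set is killed in the
QUIESCE state by the old code and passed through by the new code, which
matches the stated goal of the patch. Read literally, the fall-through
also means that a failfast request without PREEMPT on a blocked device
ends up deferred rather than killed, because the later assignment
overwrites the earlier one; the thread does not say whether that was
intended.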

Re: 2.6.24-rc3-mm1 (sync is slow ?)

2007-11-24 Thread kosaki
Hi, Andrew 

> > Hi, Andrew
> > 
> > I got following result in 'sync' command.
> > It was too slow. (memory controller config is off ;)
> > I attaches my .config.
> > ==
 (snip)
> 
> Well I wonder how we did that.
> 
> It seems OK here from a quick test (i386, ext3-on-IDE).
> 
> Maybe device driver/block breakage?

I tested x86, ext3-on-SATA (/dev/sda).
It seems to work well.

Hmm...


-- 
kosaki




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley
Probing intermittent failures in Domain Validation, even with the fixes
applied leads me to the conclusion that there are further problems with
this commit:

commit fc5eb4facedbd6d7117905e775cee1975f894e79
Author: Hannes Reinecke [EMAIL PROTECTED]
Date:   Tue Nov 6 09:23:40 2007 +0100

[SCSI] Do not requeue requests if REQ_FAILFAST is set
 
The essence of the problems is that you're causing REQ_FAILFAST to
terminate commands with error on requeuing conditions, some of which are
relatively common on most SCSI devices.  While this may be the correct
behaviour for multi-path, it's certainly wrong for the previously
understood meaning of REQ_FAILFAST, which was don't retry on error,
which is why domain validation and other applications use it to control
error handling, but don't expect to get failures for a simple requeue
are now spitting errors.

I honestly can't see that, even for the multi-path case, returning an
error when we're over queue depth is the correct thing to do (it may not
matter to something like a symmetrix, but an array that has a non-zero
cost associated with a path change, like a CPQ HSV or the AVT
controllers, will show fairly large slow downs if you do this).  Even if
this is the desired behaviour (and I think that's a policy issue),
DID_NO_CONNECT is almost certainly the wrong error to be sending back.

This patch fixes up domain validation to work again correctly, however,
I really think it's just a bandaid.  Do you want to rethink the above
commit?

James

Index: BUILD-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- BUILD-2.6.orig/drivers/scsi/scsi_lib.c	2007-11-24 11:25:20.0 -0600
+++ BUILD-2.6/drivers/scsi/scsi_lib.c	2007-11-24 11:26:22.0 -0600
@@ -1552,7 +1552,8 @@ static void scsi_request_fn(struct reque
 			break;
 
 		if (!scsi_dev_queue_ready(q, sdev)) {
-			if (req->cmd_flags & REQ_FAILFAST) {
+			if ((req->cmd_flags & REQ_FAILFAST) &&
+			    !(req->cmd_flags & REQ_PREEMPT)) {
 				scsi_kill_request(req, q);
 				continue;
 			}
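
The condition this patch introduces can be read in isolation as a small
predicate. The sketch below is a userspace illustration only: the
SKETCH_REQ_* bit values and the helper name are assumptions made for the
example, not the kernel's definitions. It shows that a request is killed on
a not-ready queue only when the failfast bit is set and the preempt bit is
clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values standing in for REQ_FAILFAST and REQ_PREEMPT;
 * the real kernel values differ. */
enum {
	SKETCH_REQ_FAILFAST = 1 << 0,
	SKETCH_REQ_PREEMPT  = 1 << 1,
};

static_assert((SKETCH_REQ_FAILFAST & SKETCH_REQ_PREEMPT) == 0,
	      "sketch flags must be distinct bits");

/* Mirrors the patched test: fail the request immediately only if it is
 * marked failfast and is not a preempt (special) request. */
static bool sketch_kill_when_not_ready(unsigned int cmd_flags)
{
	return (cmd_flags & SKETCH_REQ_FAILFAST) &&
	       !(cmd_flags & SKETCH_REQ_PREEMPT);
}
```

Under this reading, a request carrying both bits, as the patch assumes
domain validation commands do, is left on the queue rather than killed.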




Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
James Bottomley wrote:
 On Sat, 2007-11-24 at 13:57 +0100, Laurent Riffard wrote:
 Le 24.11.2007 07:42, James Bottomley a écrit :
 On Fri, 2007-11-23 at 18:52 +0100, Laurent Riffard wrote:
 Le 23.11.2007 12:38, Hannes Reinecke a écrit :
 Hannes Reinecke wrote:
 Laurent Riffard wrote:
 Le 21.11.2007 23:41, Andrew Morton a écrit :
 On Wed, 21 Nov 2007 22:45:22 +0100
 Laurent Riffard [EMAIL PROTECTED] wrote:

 Le 21.11.2007 05:45, Andrew Morton a écrit :
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc3/2.6.24-rc3-mm1/
 Hello, 

 My system hangs shortly after I logged in Gnome desktop. SysRq-W shows
 that a bunch of task are blocked in D state, they seem to wait for
 some I/O completion. I can try to hand-copy some data if requested.

 I found these messages in dmesg:

 ~$ grep -C2 end_request dmesg-2.6.24-rc3-mm1 
 EXT3-fs: mounted filesystem with ordered data mode.
 sd 0:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sda, sector 16460
 ReiserFS: sda7: found reiserfs format 3.6 with standard journal
 ReiserFS: sda7: using ordered data mode
 --
 ReiserFS: sda7: Using r5 hash to sort names
 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sdb, sector 19632
 sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT 
 driverbyte=DRIVER_OK,SUGGEST_OK
 end_request: I/O error, dev sdb, sector 40037363
 Adding 1048568k swap on /dev/mapper/vglinux1-lvswap.  Priority:-1 
 extents:1 across:1048568k
 lp0: using parport0 (interrupt-driven).

 These errors occur *only* with 2.6.24-rc3-mm1, they are 100% 
 reproducible.
 2.6.24-rc3 and 2.6.24-rc2-mm1 are fine.

 Maybe something is broken in pata_via driver ?

 Could be - 
 libata-reimplement-ata_acpi_cbl_80wire-using-ata_acpi_gtm_xfermask.patch
 and 
 pata_amd-pata_via-de-couple-programming-of-pio-mwdma-and-udma-timings.patch
 touch pata_via.c.
 None of the above...

 I did a bisection, it spotted git-scsi-misc.patch. 
 I just run 2.6.24-rc3-mm1 + revert-git-scsi-misc.patch, and it works 
 fine.

 I guess commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 [SCSI] Do not 
 requeue requests if REQ_FAILFAST is set is the real culprit. The other 
 commits are touching documentation or drivers I don't use. I'll try 
 to revert only this one this evening.
 I can confirm : reverting commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0 
 does fix the problem.

 Hmm. Weird. I'll have a look into it. Apparently I'll be returning an 
 error where
 I shouldn't. Checking ...

 Ok, found it. We are blocking even special commands (ie requests with 
 PREEMPT not set)
 when FAILFAST is set. Which is clearly wrong. The attached patch fixes 
 this.
 Sorry, it's not enough. 2.6.24-rc3-mm1 + your patch still hangs with I/O 
 errors.
 I think the problem is the way we treat BLOCKED and QUIESCED (the latter
 is the state that the domain validation uses and which we cannot kill
 fastfail on).  It's definitely wrong to kill fastfail requests when the
 state is QUIESCE.

 This patch (which is applied on top of Hannes original) separates the
 BLOCK and QUIESCE states correctly ... does this fix the problem?

 No, it doesn't help... (2.6.24-rc3-mm1 + your patch still has problems)
 
 OK, could you post dmesgs again, please.  I actually tested this with an
 aic79xx card, and for me it does cause Domain Validation to succeed
 again.
 

Are the patches intended to fix that problem as well?

http://lkml.org/lkml/2007/11/23/5

 James

Gabriel 



Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread James Bottomley

On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
 
 Are the patches intended to fix that problem as well?
 
 http://lkml.org/lkml/2007/11/23/5

That dmesg is from an unknown SCSI card exhibiting Domain Validation
problems, so it's a reasonable probability, yes ... but you'll need the
additional hack I just did to prevent further intermittent failures.

James




Re: 2.6.24-rc3-mm1 (sync is slow ?)

2007-11-24 Thread Gabriel C
kosaki wrote:
 Hi, Andrew 
 
 Hi, Andrew

 I got following result in 'sync' command.
 It was too slow. (memory controller config is off ;)
 I attaches my .config.
 ==
  (snip)
 Well I wonder how we did that.

 It seems OK here from a quick test (i386, ext3-on-IDE).

 Maybe device driver/block breakage?

Try reverting:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff_plain;h=8655a546c83fc43f0a73416bbd126d02de7ad6c0;hp=5bc717b6bdaaf52edf365eb7d9d8c89fec79df5d

See also :
http://lkml.org/lkml/2007/11/23/5

and search for '2.6.24-rc3-mm1: I/O error, system hangs' on LKML

 
 I tested x86, ext3-on-SATA(/dev/sda).
 It seems works well.
 
 Hmm...

IDE/SATA is fine here as well; only SCSI broke.


Regards,

Gabriel 


Re: 2.6.24-rc3-mm1: I/O error, system hangs

2007-11-24 Thread Gabriel C
James Bottomley wrote:
 On Sat, 2007-11-24 at 18:54 +0100, Gabriel C wrote:
 Are the patches intended to fix that problem as well?

 http://lkml.org/lkml/2007/11/23/5
 
 That dmesg is from an unknown SCSI card exhibiting Domain Validation
 problems, so it's a reasonable probability, yes ... but you'll need the
 additional hack I just did to prevent further intermittent failures.

My controller is:

03:0e.0 SCSI storage controller [0100]: Adaptec AIC-7892P U160/m [9005:008f] 
(rev 02)

I'll try the patches in a bit.

 
 James
 

Gabriel

