Re: Notes on kern/57133

2023-10-12 Thread Brian Buhrow
Hello.  As a followup on this bug I now know what's causing the panic 
to get triggered.
The issue is that if a request gets requeued requeud, resid gets set to 0, 
which causes the
KASSERT to fire.  So the question is, is it always the case that if a request 
gets requeued,
resid gets set to 0, or is that only in certain cases?  A related question is 
has it always
been the case that resid gets reset to 0 on a requeue, but the mpii(4) driver 
didn't used to
check for this condition?  Or, is the resetting of resid to 0 a new phenomenon? 
 Note that I
did a search of all the scsi/atapi drivers in the tree and I couldn't find 
another instance
where this check is done.  So, I'm inclined to think this check is relatively 
new and it's just
that no one has run into it before.
While it would be interesting to know why these particular requests are getting 
requeued, and
I'll try to figure that out, if it is true that resid gets set to 0 on requeue, 
then this check
is just wrong in the mpii(4) driver.

Any thoughts from anyone?

-Brian


Re: Notes on kern/57133

2023-10-06 Thread Edgar Fuß
> One change I can try is to put the diagnostic printf higher in the
> scsipi_request function
As the problem is so simple to reproduce, I'd put it just below xs = arg.

Or set a breakpoint on scsipi_get_opcodeinfo(), then, when hitting it, one on 
mpii_scsipi_request() (provided you find it's address) and then step through it.

Maybe dump the whole xs to see whether other fields are corrupted, too.

But I'd bet the problem is the ccb_done routine being called prematurely.

You could introduce a mpii_nothing_done() routine that prints something when 
called, replace ccb->ccb_done = mpii_scsi_cmd_done with = mpii_nothing_done 
and move the real assignement just before the mpii_start() call.
Or so I think.


Re: Notes on kern/57133

2023-10-06 Thread Brian Buhrow
hello.  Following up on this thread, I have more diagnostic output.  

The problem occurs when a call to scsipi_get_opcodeinfo() is made for a device. 
 By the time
the request hits the adapter, as shown in the below output, the discrepancy 
between xs->resid
and xs->datalen has already occurred.  I can see where xs->resid is set equal 
to xs->datalen in
scsipi_execute_xs(), but I haven't found where xs->resid gets zeroed out before 
it makes it to
the mpii(4) driver.  Clearly I'm missing something, so I'll keep looking.  I 
don't think it's
corruption or a threading problem, but, rather, something that's different 
about this
particular request path. One change I can try is to put the diagnostic printf 
higher in the
scsipi_request function, in case all the loading of data into the IOC's ccb 
buffers is
corrupting something.
In the mean time, if anyone has ideas, I'm all ears.

-thanks
-Brian


The below output is a snippet from the probing of sd0 on mpii0.
See the e-mail in kern/57133 for a full dmesg output on this card.


[  10.5750642] sd0: tagged queueing
[  10.7650647] mpii0(scsipi_request): resid = 0, datalen = 16384
[  10.7750642] mpii0(scsipi_request): SCSI command info is: 0xa3 0c 80 00 00 00 
00 00 40 00 00 00
[  10.7750642] mpii0(scsipi_request): sx_control = 0x1020, XS_CTL_POLL = 0x2
[  10.7850642] mpii0: resid = 0, datalen = 16384
[  10.7950642] mpii0: SCSI command info is: 0xa3 0c 80 00 00 00 00 00 40 00 00 
00


Re: Notes on kern/57133

2023-10-04 Thread Brian Buhrow
hello Edgar.  I agree with your analysis.  I'll put the check in and 
recompile my test kernel
and see what we get.  I suspect it is an error race in the mpii(4) driver.  I 
notice the call
to scsipi_get_opcodeinfo only happens if the device in question reports as a 
SCSI V3 device.
Is there a way to easily check the SCSI version that a given device reports?  
scsictl identify
doesn't seem to do it.
-thanks
-Brian


Re: Notes on kern/57133

2023-10-04 Thread Edgar Fuß
> provide details on what this command is?
A3h/0Ch is REPORT SUPPORTED OPERATION CODES

The call is most probably from dev/scsipi:scsipi_get_opcodeinfo().

I'm still unsure how resid can be 0 at that point. scsipi_enqueue_xs() 
sets resid to datalen (which is undocumented). Apart from the path 
interpreting sense info nobody tampers with resid.

Can you check for resid != datalen in mpii_scsi_request() just before the
xs->xs_control & XS_CTL_POLL test (and, if it fires, print whether it's 
a polled request)?

Otherwise, I suspect ccb_done set to mpii_scsi_cmd_done where it shouldn't 
(i.e. some race/error for a mpt command that's not plain SCSI).


Notes on kern/57133

2023-10-04 Thread Brian Buhrow
hello everyone!  Recently I ran across the problem described in 
kern/57133 with the
mpii(4) driver.  I patched sys/dev/pci/mpii.c to provide more diagnostic 
information when the
condition occurs and to avoid the panic.  I also provided output showing the 
SCSI command that
triggers the condition. Could someone look at the output below, or checkout the 
bug report in
gnats and provide details on what this command is?  I also looked through the 
kernel sources
and don't see any other drivers that implement the assertion that triggered the 
original panic
which created the bug report.  Any reason I shouldn't commit the work around 
for the mpii(4)
driver?  I realize it doesn't fix the actual problem, but it will allow folks 
to use systems
with NetBSD that present the problem and will provide diagnostic details which 
can be reported
back to the community for further analysis.


Thoughts?
-Brian


[  10.5574554] sd0 at scsibus0 target 0 lun 0:  
disk fixed
[  10.5574554] sd0: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 
7814037168 sectors
[  10.5681363] dk0 at sd0: "1adc589e-4f32-11ee-b97c-00259036fd2e", 7814033005 
blocks at 34, type: raidframe
[  10.5853388] sd0: tagged queueing
[  10.7574556] mpii0: resid = 0, datalen = 16384
[  10.7574556] mpii0: SCSI command info is: 0xa3 0c 80 00 00 00 00 00 40 00 00 
00
sd1 at scsibus0 target 1 lun 0:  disk fixed
[  10.7774550] sd1: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 
7814037168 sectors
[  10.8374548] mssd1: tagged queueing
[  11.0074552] mpii0: resid = 0, datalen = 16384
[  11.0074552] mpii0: SCSI command info is: 0xa3 0c 80 00 00 00 00 00 40 00 00 
00
sd2 at scsibus0 target 2 lun 0:  disk fixed
[  11.0274552] sd2: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 
7814037168 sectors
[  11.0574559] sd2: 3907019088 trailing sectors not covered by disklabel
[  11.0674553] sd2: tagged queueing
[  11.2574559] mpii0: resid = 0, datalen = 16384
[  11.2574559] mpii0: SCSI command info is: 0xa3 0c 80 00 00 00 00 00 40 00 00 
00
sd3 at scsibus0 target 5 lun 0:  disk fixed
[  11.2774557] sd3: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 
7814037168 sectors
[  11.2874554] dk1 at sd3: "549e5298-5113-11ee-910e-00259036fd2e", 7814033005 
blocks at 34, type: raidframe
[  11.2974554] sd3: tagged queueing