Re: Notes on kern/57133
Hello. As a followup on this bug I now know what's causing the panic to get triggered. The issue is that if a request gets requeued requeud, resid gets set to 0, which causes the KASSERT to fire. So the question is, is it always the case that if a request gets requeued, resid gets set to 0, or is that only in certain cases? A related question is has it always been the case that resid gets reset to 0 on a requeue, but the mpii(4) driver didn't used to check for this condition? Or, is the resetting of resid to 0 a new phenomenon? Note that I did a search of all the scsi/atapi drivers in the tree and I couldn't find another instance where this check is done. So, I'm inclined to think this check is relatively new and it's just that no one has run into it before. While it would be interesting to know why these particular requests are getting requeued, and I'll try to figure that out, if it is true that resid gets set to 0 on requeue, then this check is just wrong in the mpii(4) driver. Any thoughts from anyone? -Brian
Re: Notes on kern/57133
> One change I can try is to put the diagnostic printf higher in the > scsipi_request function As the problem is so simple to reproduce, I'd put it just below xs = arg. Or set a breakpoint on scsipi_get_opcodeinfo(), then, when hitting it, one on mpii_scsipi_request() (provided you find it's address) and then step through it. Maybe dump the whole xs to see whether other fields are corrupted, too. But I'd bet the problem is the ccb_done routine being called prematurely. You could introduce a mpii_nothing_done() routine that prints something when called, replace ccb->ccb_done = mpii_scsi_cmd_done with = mpii_nothing_done and move the real assignement just before the mpii_start() call. Or so I think.
Re: Notes on kern/57133
hello. Following up on this thread, I have more diagnostic output. The problem occurs when a call to scsipi_get_opcodeinfo() is made for a device. By the time the request hits the adapter, as shown in the below output, the discrepancy between xs->resid and xs->datalen has already occurred. I can see where xs->resid is set equal to xs->datalen in scsipi_execute_xs(), but I haven't found where xs->resid gets zeroed out before it makes it to the mpii(4) driver. Clearly I'm missing something, so I'll keep looking. I don't think it's corruption or a threading problem, but, rather, something that's different about this particular request path. One change I can try is to put the diagnostic printf higher in the scsipi_request function, in case all the loading of data into the IOC's ccb buffers is corrupting something. In the mean time, if anyone has ideas, I'm all ears. -thanks -Brian The below output is a snippet from the probing of sd0 on mpii0. See the e-mail in kern/57133 for a full dmesg output on this card. [ 10.5750642] sd0: tagged queueing [ 10.7650647] mpii0(scsipi_request): resid = 0, datalen = 16384 [ 10.7750642] mpii0(scsipi_request): SCSI command info is: 0xa3 0c 80 00 00 00 00 00 40 00 00 00 [ 10.7750642] mpii0(scsipi_request): sx_control = 0x1020, XS_CTL_POLL = 0x2 [ 10.7850642] mpii0: resid = 0, datalen = 16384 [ 10.7950642] mpii0: SCSI command info is: 0xa3 0c 80 00 00 00 00 00 40 00 00 00
Re: Notes on kern/57133
hello Edgar. I agree with your analysis. I'll put the check in and recompile my test kernel and see what we get. I suspect it is an error race in the mpii(4) driver. I notice the call to scsipi_get_opcodeinfo only happens if the device in question reports as a SCSI V3 device. Is there a way to easily check the SCSI version that a given device reports? scsictl identify doesn't seem to do it. -thanks -Brian
Re: Notes on kern/57133
> provide details on what this command is? A3h/0Ch is REPORT SUPPORTED OPERATION CODES The call is most probably from dev/scsipi:scsipi_get_opcodeinfo(). I'm still unsure how resid can be 0 at that point. scsipi_enqueue_xs() sets resid to datalen (which is undocumented). Apart from the path interpreting sense info nobody tampers with resid. Can you check for resid != datalen in mpii_scsi_request() just before the xs->xs_control & XS_CTL_POLL test (and, if it fires, print whether it's a polled request)? Otherwise, I suspect ccb_done set to mpii_scsi_cmd_done where it shouldn't (i.e. some race/error for a mpt command that's not plain SCSI).
Notes on kern/57133
hello everyone! Recently I ran across the problem described in kern/57133 with the mpii(4) driver. I patched sys/dev/pci/mpii.c to provide more diagnostic information when the condition occurs and to avoid the panic. I also provided output showing the SCSI command that triggers the condition. Could someone look at the output below, or checkout the bug report in gnats and provide details on what this command is? I also looked through the kernel sources and don't see any other drivers that implement the assertion that triggered the original panic which created the bug report. Any reason I shouldn't commit the work around for the mpii(4) driver? I realize it doesn't fix the actual problem, but it will allow folks to use systems with NetBSD that present the problem and will provide diagnostic details which can be reported back to the community for further analysis. Thoughts? -Brian [ 10.5574554] sd0 at scsibus0 target 0 lun 0: disk fixed [ 10.5574554] sd0: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 7814037168 sectors [ 10.5681363] dk0 at sd0: "1adc589e-4f32-11ee-b97c-00259036fd2e", 7814033005 blocks at 34, type: raidframe [ 10.5853388] sd0: tagged queueing [ 10.7574556] mpii0: resid = 0, datalen = 16384 [ 10.7574556] mpii0: SCSI command info is: 0xa3 0c 80 00 00 00 00 00 40 00 00 00 sd1 at scsibus0 target 1 lun 0: disk fixed [ 10.7774550] sd1: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 7814037168 sectors [ 10.8374548] mssd1: tagged queueing [ 11.0074552] mpii0: resid = 0, datalen = 16384 [ 11.0074552] mpii0: SCSI command info is: 0xa3 0c 80 00 00 00 00 00 40 00 00 00 sd2 at scsibus0 target 2 lun 0: disk fixed [ 11.0274552] sd2: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 7814037168 sectors [ 11.0574559] sd2: 3907019088 trailing sectors not covered by disklabel [ 11.0674553] sd2: tagged queueing [ 11.2574559] mpii0: resid = 0, datalen = 16384 [ 11.2574559] mpii0: SCSI command info is: 0xa3 0c 80 00 00 00 00 00 40 00 00 00 sd3 at scsibus0 target 5 lun 0: disk fixed [ 11.2774557] sd3: 3726 GB, 3815448 cyl, 16 head, 127 sec, 512 bytes/sect x 7814037168 sectors [ 11.2874554] dk1 at sd3: "549e5298-5113-11ee-910e-00259036fd2e", 7814033005 blocks at 34, type: raidframe [ 11.2974554] sd3: tagged queueing