Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Mon, Jun 13, 2016 at 12:04 AM, Hannes Reineckewrote: > > And we have been running the very patch in SLES for over a year now, > without a single issue being reported. Oh, ok. So it's not "all qemu kvm instances are broken", it was a very unusual issue, and the patch has actually gotten wider testing. That makes me much happier about it. Linus
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Mon, Jun 13, 2016 at 12:04 AM, Hannes Reinecke wrote: > > And we have been running the very patch in SLES for over a year now, > without a single issue being reported. Oh, ok. So it's not "all qemu kvm instances are broken", it was a very unusual issue, and the patch has actually gotten wider testing. That makes me much happier about it. Linus
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On 06/11/2016 11:03 PM, James Bottomley wrote: > On Sat, 2016-06-11 at 13:25 -0700, Linus Torvalds wrote: >> On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley >>wrote: >>> >>> The QEMU people have accepted it as their bug and are fixing it. >> >> Of course they are. Somebody found a bug in their device model, I'd >> expect nothing else. >> >> But I'm not worried about qemu. I'm worried about all the other >> random devices that have never been tested. > > Most of the other devices that are likely to misbehave don't advertise > high levels of SCSI conformance, so we seem to be mostly covered. > And we have been running the very patch in SLES for over a year now, without a single issue being reported. >>> There's no other course of action, really because we can't stop >>> people >>> sending this command using the BLOCK_PC interface from user space, >>> so >>> it's now a known and easy to use way of stopping the device from >>> responding. >> >> Bah. That's not an argument from kernel space. We've had that >> forever. Broken device that hangs up when you try to read past the >> end? If you can open the raw device for reading, you can still do a >> SCSI_IOCTL_SEND_COMMAND to send that read command past the end. >> >> The fact that you can craft special commands that can cause problems >> for specific devices (if you have access to the raw device) does >> *not* at all argue that the kernel should then do those accesses of >> its own volition. >> >> My worry basically comes down to: we're clearly now doing something >> that has never ever been tested by anybody before. >> Not quite. See above. The reported issue came from someone who has been running the very latest linux kernel in a VM which was hosted on an ancient version of QEMU. Hardly a common scenario. >> And I think that the assumption that the bug would magically be >> limited to qemu is a *big* assumption. > > How do we ever find out if we don't test it, though? I'm sure some > obscure minor celebrity trying to get on the chat show circuit once > said "what is userspace except a test case for the kernel?" > > If this is the only problem that turns up, I think we're done. If we > get any more we can consider either blacklisting all CD type devices or > raising the conformance bar to SPC-3. > I'm fully with James here. The alternative would be to whitelist _every_ conformant device, resulting in lots of unhappy customers until we've got the whitelist settled. Having to discuss with customers why Linux doesn't follow the specs is infinitely harder than discussing with customers whose _hardware_ doesn't follow the specs. Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On 06/11/2016 11:03 PM, James Bottomley wrote: > On Sat, 2016-06-11 at 13:25 -0700, Linus Torvalds wrote: >> On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley >> wrote: >>> >>> The QEMU people have accepted it as their bug and are fixing it. >> >> Of course they are. Somebody found a bug in their device model, I'd >> expect nothing else. >> >> But I'm not worried about qemu. I'm worried about all the other >> random devices that have never been tested. > > Most of the other devices that are likely to misbehave don't advertise > high levels of SCSI conformance, so we seem to be mostly covered. > And we have been running the very patch in SLES for over a year now, without a single issue being reported. >>> There's no other course of action, really because we can't stop >>> people >>> sending this command using the BLOCK_PC interface from user space, >>> so >>> it's now a known and easy to use way of stopping the device from >>> responding. >> >> Bah. That's not an argument from kernel space. We've had that >> forever. Broken device that hangs up when you try to read past the >> end? If you can open the raw device for reading, you can still do a >> SCSI_IOCTL_SEND_COMMAND to send that read command past the end. >> >> The fact that you can craft special commands that can cause problems >> for specific devices (if you have access to the raw device) does >> *not* at all argue that the kernel should then do those accesses of >> its own volition. >> >> My worry basically comes down to: we're clearly now doing something >> that has never ever been tested by anybody before. >> Not quite. See above. The reported issue came from someone who has been running the very latest linux kernel in a VM which was hosted on an ancient version of QEMU. Hardly a common scenario. >> And I think that the assumption that the bug would magically be >> limited to qemu is a *big* assumption. > > How do we ever find out if we don't test it, though? I'm sure some > obscure minor celebrity trying to get on the chat show circuit once > said "what is userspace except a test case for the kernel?" > > If this is the only problem that turns up, I think we're done. If we > get any more we can consider either blacklisting all CD type devices or > raising the conformance bar to SPC-3. > I'm fully with James here. The alternative would be to whitelist _every_ conformant device, resulting in lots of unhappy customers until we've got the whitelist settled. Having to discuss with customers why Linux doesn't follow the specs is infinitely harder than discussing with customers whose _hardware_ doesn't follow the specs. Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, 2016-06-11 at 13:25 -0700, Linus Torvalds wrote: > On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley >wrote: > > > > The QEMU people have accepted it as their bug and are fixing it. > > Of course they are. Somebody found a bug in their device model, I'd > expect nothing else. > > But I'm not worried about qemu. I'm worried about all the other > random devices that have never been tested. Most of the other devices that are likely to misbehave don't advertise high levels of SCSI conformance, so we seem to be mostly covered. > > There's no other course of action, really because we can't stop > > people > > sending this command using the BLOCK_PC interface from user space, > > so > > it's now a known and easy to use way of stopping the device from > > responding. > > Bah. That's not an argument from kernel space. We've had that > forever. Broken device that hangs up when you try to read past the > end? If you can open the raw device for reading, you can still do a > SCSI_IOCTL_SEND_COMMAND to send that read command past the end. > > The fact that you can craft special commands that can cause problems > for specific devices (if you have access to the raw device) does > *not* at all argue that the kernel should then do those accesses of > its own volition. > > My worry basically comes down to: we're clearly now doing something > that has never ever been tested by anybody before. > > And I think that the assumption that the bug would magically be > limited to qemu is a *big* assumption. How do we ever find out if we don't test it, though? I'm sure some obscure minor celebrity trying to get on the chat show circuit once said "what is userspace except a test case for the kernel?" If this is the only problem that turns up, I think we're done. If we get any more we can consider either blacklisting all CD type devices or raising the conformance bar to SPC-3. James
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, 2016-06-11 at 13:25 -0700, Linus Torvalds wrote: > On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley > wrote: > > > > The QEMU people have accepted it as their bug and are fixing it. > > Of course they are. Somebody found a bug in their device model, I'd > expect nothing else. > > But I'm not worried about qemu. I'm worried about all the other > random devices that have never been tested. Most of the other devices that are likely to misbehave don't advertise high levels of SCSI conformance, so we seem to be mostly covered. > > There's no other course of action, really because we can't stop > > people > > sending this command using the BLOCK_PC interface from user space, > > so > > it's now a known and easy to use way of stopping the device from > > responding. > > Bah. That's not an argument from kernel space. We've had that > forever. Broken device that hangs up when you try to read past the > end? If you can open the raw device for reading, you can still do a > SCSI_IOCTL_SEND_COMMAND to send that read command past the end. > > The fact that you can craft special commands that can cause problems > for specific devices (if you have access to the raw device) does > *not* at all argue that the kernel should then do those accesses of > its own volition. > > My worry basically comes down to: we're clearly now doing something > that has never ever been tested by anybody before. > > And I think that the assumption that the bug would magically be > limited to qemu is a *big* assumption. How do we ever find out if we don't test it, though? I'm sure some obscure minor celebrity trying to get on the chat show circuit once said "what is userspace except a test case for the kernel?" If this is the only problem that turns up, I think we're done. If we get any more we can consider either blacklisting all CD type devices or raising the conformance bar to SPC-3. James
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, 2016-06-11 at 12:57 -0700, Linus Torvalds wrote: > On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley >wrote: > > > > It looks like there's a hole where the emulation should be for the > > VPD > > inquiry, which is what cause the whole hang up and never speak to > > us > > again problem. > > So? What makes you think real hardware doesn't have those kinds of > issues? It does. USB SCSI emulation chips are notorious for crashing when they see the "wrong" command. However, our fix is mostly to blacklist the command from ever being seen by those devices as well. That's what the usb unusual_devs.h file is full of. > This is no different from all the various real pieces of hardware > that lock up when you ask for a particular mode page that they don't > support. We are *very* careful not to randomly read mode pages > because of that issue. Several things are done only if the device > claims it is of a sufficiently new SCSI revision etc. > > None of this is new. Right, we've got a way of coping, it's either blacklist or programmatic compensation if we can recognise the characteristic. > The fact that Windows apparently doesn't do the VPD inquiry - since > qemu works work windows - should make you very very worried. Instead, > you completely ignore this and say "oh, it's just a qemu bug". Windows does do VPD inquiries, but not, apparently for CD ROMS. It is feasible for us to blacklist all CD ROMS (type ROM and WORM) like windows apparently does, but then I'll get complaints from people whose high end devices are no-longer optimally configured by default and whose burn speeds have dropped as a result. > Software that isn't tested doesn't work. That's simply a fact that > all programmers should be intimately familiar with. I'm hoping you > know that. > > Buit why the hell do you then believe that hardware is any different? > Because I can most definitely tell you it isn't. I did describe how we build heuristics to cope with buggy hardware. A Blacklist is part of that. > Really. You have entirely ignored the "nobody has apparently ever > asked for vpd information" argument. OK, so historically everyone mostly ignored the VPD pages because the limits they describe only mattered for a few devices and most I/O stacks couldn't be programmed to the characteristics they were describing anyway. Then came SSDs and a high performance penalty in the FTL for not sending optimally sized and aligned writes and suddenly we all had to rewrite our I/O stacks to cope and be able to receive this wisdom from the device via the block characteristics VPD pages so our users could get the expected IOPS out of their shiny new and expensive devices ... and by we I mean Linux, Windows and a host of other operating systems. Once we did it for SSDs, everyone wanted go faster stripes for their devices as well (array vendors and even some spinning rust type vendors), so we got sucked into a whole cycle. > You also don't say what it was that we did to start asking for vpd > information. What's the commit this fixes? What is it that actually > introduced this problem? And why can it not - like our mode page > things - look at a lot of other fields first to judge whether the > known problematic access is really worth it. I think it's this one: ommit e4c36ad756ec2de36b05c66bb50ed4ff47b0fdb0 Author: Hannes Reinecke Date: Fri Apr 1 08:57:37 2016 +0200 scsi: vpd pages are mandatory for SPC-2 The QEMU device seems to identify itself as SCSI 3 SPC-2 (SPC means SCSI Primary Command Conformance, it's a flag the device can set to show which standard revision it claims to comply with). If we raise the bar to SPC-3 then we'd not ask it for VPD pages, but we'd miss a lot of conforming SPC-2 devices which we'd then have to whitelist. > Look at all those other blacklists we have. They are for real > devices. Things like "don't do mode page 3f" etc. All the checks we > do before we decide it's even worth it asking for caching > information. > > We have a lot of "let's be cautious" with mode page accesses. What > did we do that is different for vpd? Maybe that wasn't cautious > enough. We tied them to a particular level of advertised conformance. Initially it was SPC-3 and above, but with the above commit we lowered it to SPC-2 and above. > Because we do have *other* devices that already have skip_vpd_pages > set. So this whole thing was clearly never limited to qemu > emulation. > > So answer me already: what was the change that actually made us break > recently? See above. > And *why* do you seem to continue to believe that hardware cannot be > broken, despite all the evidence that said hardware has never ever > been tested? We don't. We tried the blacklist everything and then conservatively whitelist it with discard and that didn't really work out so well for us either, so now we're trying the probe the device for what they say the support in the
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, 2016-06-11 at 12:57 -0700, Linus Torvalds wrote: > On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley > wrote: > > > > It looks like there's a hole where the emulation should be for the > > VPD > > inquiry, which is what cause the whole hang up and never speak to > > us > > again problem. > > So? What makes you think real hardware doesn't have those kinds of > issues? It does. USB SCSI emulation chips are notorious for crashing when they see the "wrong" command. However, our fix is mostly to blacklist the command from ever being seen by those devices as well. That's what the usb unusual_devs.h file is full of. > This is no different from all the various real pieces of hardware > that lock up when you ask for a particular mode page that they don't > support. We are *very* careful not to randomly read mode pages > because of that issue. Several things are done only if the device > claims it is of a sufficiently new SCSI revision etc. > > None of this is new. Right, we've got a way of coping, it's either blacklist or programmatic compensation if we can recognise the characteristic. > The fact that Windows apparently doesn't do the VPD inquiry - since > qemu works work windows - should make you very very worried. Instead, > you completely ignore this and say "oh, it's just a qemu bug". Windows does do VPD inquiries, but not, apparently for CD ROMS. It is feasible for us to blacklist all CD ROMS (type ROM and WORM) like windows apparently does, but then I'll get complaints from people whose high end devices are no-longer optimally configured by default and whose burn speeds have dropped as a result. > Software that isn't tested doesn't work. That's simply a fact that > all programmers should be intimately familiar with. I'm hoping you > know that. > > Buit why the hell do you then believe that hardware is any different? > Because I can most definitely tell you it isn't. I did describe how we build heuristics to cope with buggy hardware. A Blacklist is part of that. > Really. You have entirely ignored the "nobody has apparently ever > asked for vpd information" argument. OK, so historically everyone mostly ignored the VPD pages because the limits they describe only mattered for a few devices and most I/O stacks couldn't be programmed to the characteristics they were describing anyway. Then came SSDs and a high performance penalty in the FTL for not sending optimally sized and aligned writes and suddenly we all had to rewrite our I/O stacks to cope and be able to receive this wisdom from the device via the block characteristics VPD pages so our users could get the expected IOPS out of their shiny new and expensive devices ... and by we I mean Linux, Windows and a host of other operating systems. Once we did it for SSDs, everyone wanted go faster stripes for their devices as well (array vendors and even some spinning rust type vendors), so we got sucked into a whole cycle. > You also don't say what it was that we did to start asking for vpd > information. What's the commit this fixes? What is it that actually > introduced this problem? And why can it not - like our mode page > things - look at a lot of other fields first to judge whether the > known problematic access is really worth it. I think it's this one: ommit e4c36ad756ec2de36b05c66bb50ed4ff47b0fdb0 Author: Hannes Reinecke Date: Fri Apr 1 08:57:37 2016 +0200 scsi: vpd pages are mandatory for SPC-2 The QEMU device seems to identify itself as SCSI 3 SPC-2 (SPC means SCSI Primary Command Conformance, it's a flag the device can set to show which standard revision it claims to comply with). If we raise the bar to SPC-3 then we'd not ask it for VPD pages, but we'd miss a lot of conforming SPC-2 devices which we'd then have to whitelist. > Look at all those other blacklists we have. They are for real > devices. Things like "don't do mode page 3f" etc. All the checks we > do before we decide it's even worth it asking for caching > information. > > We have a lot of "let's be cautious" with mode page accesses. What > did we do that is different for vpd? Maybe that wasn't cautious > enough. We tied them to a particular level of advertised conformance. Initially it was SPC-3 and above, but with the above commit we lowered it to SPC-2 and above. > Because we do have *other* devices that already have skip_vpd_pages > set. So this whole thing was clearly never limited to qemu > emulation. > > So answer me already: what was the change that actually made us break > recently? See above. > And *why* do you seem to continue to believe that hardware cannot be > broken, despite all the evidence that said hardware has never ever > been tested? We don't. We tried the blacklist everything and then conservatively whitelist it with discard and that didn't really work out so well for us either, so now we're trying the probe the device for what they say the support in the standard and use it, with blacklists for devices that
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, Jun 11, 2016 at 12:41 PM, James Bottomleywrote: > > The QEMU people have accepted it as their bug and are fixing it. Of course they are. Somebody found a bug in their device model, I'd expect nothing else. But I'm not worried about qemu. I'm worried about all the other random devices that have never been tested. > There's no other course of action, really because we can't stop people > sending this command using the BLOCK_PC interface from user space, so > it's now a known and easy to use way of stopping the device from > responding. Bah. That's not an argument from kernel space. We've had that forever. Broken device that hangs up when you try to read past the end? If you can open the raw device for reading, you can still do a SCSI_IOCTL_SEND_COMMAND to send that read command past the end. The fact that you can craft special commands that can cause problems for specific devices (if you have access to the raw device) does *not* at all argue that the kernel should then do those accesses of its own volition. My worry basically comes down to: we're clearly now doing something that has never ever been tested by anybody before. And I think that the assumption that the bug would magically be limited to qemu is a *big* assumption. Linus
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley wrote: > > The QEMU people have accepted it as their bug and are fixing it. Of course they are. Somebody found a bug in their device model, I'd expect nothing else. But I'm not worried about qemu. I'm worried about all the other random devices that have never been tested. > There's no other course of action, really because we can't stop people > sending this command using the BLOCK_PC interface from user space, so > it's now a known and easy to use way of stopping the device from > responding. Bah. That's not an argument from kernel space. We've had that forever. Broken device that hangs up when you try to read past the end? If you can open the raw device for reading, you can still do a SCSI_IOCTL_SEND_COMMAND to send that read command past the end. The fact that you can craft special commands that can cause problems for specific devices (if you have access to the raw device) does *not* at all argue that the kernel should then do those accesses of its own volition. My worry basically comes down to: we're clearly now doing something that has never ever been tested by anybody before. And I think that the assumption that the bug would magically be limited to qemu is a *big* assumption. Linus
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, Jun 11, 2016 at 12:41 PM, James Bottomleywrote: > > It looks like there's a hole where the emulation should be for the VPD > inquiry, which is what cause the whole hang up and never speak to us > again problem. So? What makes you think real hardware doesn't have those kinds of issues? This is no different from all the various real pieces of hardware that lock up when you ask for a particular mode page that they don't support. We are *very* careful not to randomly read mode pages because of that issue. Several things are done only if the device claims it is of a sufficiently new SCSI revision etc. None of this is new. The fact that Windows apparently doesn't do the VPD inquiry - since qemu works work windows - should make you very very worried. Instead, you completely ignore this and say "oh, it's just a qemu bug". Software that isn't tested doesn't work. That's simply a fact that all programmers should be intimately familiar with. I'm hoping you know that. Buit why the hell do you then believe that hardware is any different? Because I can most definitely tell you it isn't. Really. You have entirely ignored the "nobody has apparently ever asked for vpd information" argument. You also don't say what it was that we did to start asking for vpd information. What's the commit this fixes? What is it that actually introduced this problem? And why can it not - like our mode page things - look at a lot of other fields first to judge whether the known problematic access is really worth it. Look at all those other blacklists we have. They are for real devices. Things like "don't do mode page 3f" etc. All the checks we do before we decide it's even worth it asking for caching information. We have a lot of "let's be cautious" with mode page accesses. What did we do that is different for vpd? Maybe that wasn't cautious enough. Because we do have *other* devices that already have skip_vpd_pages set. So this whole thing was clearly never limited to qemu emulation. So answer me already: what was the change that actually made us break recently? And *why* do you seem to continue to believe that hardware cannot be broken, despite all the evidence that said hardware has never ever been tested? Linus
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley wrote: > > It looks like there's a hole where the emulation should be for the VPD > inquiry, which is what cause the whole hang up and never speak to us > again problem. So? What makes you think real hardware doesn't have those kinds of issues? This is no different from all the various real pieces of hardware that lock up when you ask for a particular mode page that they don't support. We are *very* careful not to randomly read mode pages because of that issue. Several things are done only if the device claims it is of a sufficiently new SCSI revision etc. None of this is new. The fact that Windows apparently doesn't do the VPD inquiry - since qemu works work windows - should make you very very worried. Instead, you completely ignore this and say "oh, it's just a qemu bug". Software that isn't tested doesn't work. That's simply a fact that all programmers should be intimately familiar with. I'm hoping you know that. Buit why the hell do you then believe that hardware is any different? Because I can most definitely tell you it isn't. Really. You have entirely ignored the "nobody has apparently ever asked for vpd information" argument. You also don't say what it was that we did to start asking for vpd information. What's the commit this fixes? What is it that actually introduced this problem? And why can it not - like our mode page things - look at a lot of other fields first to judge whether the known problematic access is really worth it. Look at all those other blacklists we have. They are for real devices. Things like "don't do mode page 3f" etc. All the checks we do before we decide it's even worth it asking for caching information. We have a lot of "let's be cautious" with mode page accesses. What did we do that is different for vpd? Maybe that wasn't cautious enough. Because we do have *other* devices that already have skip_vpd_pages set. So this whole thing was clearly never limited to qemu emulation. So answer me already: what was the change that actually made us break recently? And *why* do you seem to continue to believe that hardware cannot be broken, despite all the evidence that said hardware has never ever been tested? Linus
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, 2016-06-11 at 12:12 -0700, Linus Torvalds wrote: > On Sat, Jun 11, 2016 at 11:54 AM, Linus Torvalds >wrote: > > > > Is there some reason to believe that the qemu CD-ROM emulation is > > the only one with this problem? > > Side note:the one thing that makes the qemu cd-rom emulator "special" > is not that it's not real hardware: it's that it's a lot more likely > to be tested than just about any other actual cd-rom out there, > especially in environments that test new kernels. Lots of developers > tend to have rather modern machines (and I haven't had a CD-ROM in my > machine for the last couple of years, I think), or alternatively they > end up booting things in emulation because it makes for easy testing. > > So I really don't think that "oh, it happened only with a broken > emulated device" is a very strong argument for saying that that > emulated device was the problem. It looks like there's a hole where the emulation should be for the VPD inquiry, which is what cause the whole hang up and never speak to us again problem. > I really think it's likely that the whole "require VPD" is garbage. > The whole "everybody and their dog has used qemu, and the qemu cd-rom > emulation worked perfectly fine before" is a damn strong argument > that it's the new kernel doing something wrong. > > So please figure our what the real breakage was, and fix *that* > instead of blaming qemu. The QEMU people have accepted it as their bug and are fixing it. There's no other course of action, really because we can't stop people sending this command using the BLOCK_PC interface from user space, so it's now a known and easy to use way of stopping the device from responding. Fortunately, the effects seem to be confined to the CD only ... but it could have been worse (*cough* venom floppy driver *cough*) James
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, 2016-06-11 at 12:12 -0700, Linus Torvalds wrote: > On Sat, Jun 11, 2016 at 11:54 AM, Linus Torvalds > wrote: > > > > Is there some reason to believe that the qemu CD-ROM emulation is > > the only one with this problem? > > Side note:the one thing that makes the qemu cd-rom emulator "special" > is not that it's not real hardware: it's that it's a lot more likely > to be tested than just about any other actual cd-rom out there, > especially in environments that test new kernels. Lots of developers > tend to have rather modern machines (and I haven't had a CD-ROM in my > machine for the last couple of years, I think), or alternatively they > end up booting things in emulation because it makes for easy testing. > > So I really don't think that "oh, it happened only with a broken > emulated device" is a very strong argument for saying that that > emulated device was the problem. It looks like there's a hole where the emulation should be for the VPD inquiry, which is what cause the whole hang up and never speak to us again problem. > I really think it's likely that the whole "require VPD" is garbage. > The whole "everybody and their dog has used qemu, and the qemu cd-rom > emulation worked perfectly fine before" is a damn strong argument > that it's the new kernel doing something wrong. > > So please figure our what the real breakage was, and fix *that* > instead of blaming qemu. The QEMU people have accepted it as their bug and are fixing it. There's no other course of action, really because we can't stop people sending this command using the BLOCK_PC interface from user space, so it's now a known and easy to use way of stopping the device from responding. Fortunately, the effects seem to be confined to the CD only ... but it could have been worse (*cough* venom floppy driver *cough*) James
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, 2016-06-11 at 11:54 -0700, Linus Torvalds wrote: > On Sat, Jun 11, 2016 at 11:09 AM, James Bottomley >wrote: > > Two current fixes: one affects Qemu CD ROM emulation, which stopped > > working after the updates in SCSI to require VPD pages from all > > conformant devices. Fix temporarily by blacklisting Qemu (we can > > relax > > later when they come into compliance). > > Is there some reason to believe that the qemu CD-ROM emulation is the > only one with this problem? CD ROM manufacturers are usually reasonably good at standards compliance, so yes, I expect real devices to work. The reason the QEMU one is failing is because it's emulated not produced by a manufacturer. > When some device is known to break because the SCSI layer made some > check stricter, why didn't you just relax the check again? > > In other words, you already have one known device that behaves a way > that the new code doesn't like, why do you think the new code is > correct? This isn't a stricter check, it's probing the device for more information which we then use to set block parameters. If we don't set them, we can still use the device, but possibly not as efficiently (even for a fast 16x or 32x DVD, this will affect the burn speed). The device is allowed to say no. The problem with the QEMU device is that it wasn't saying anything. It was hanging up and never responding again, leading to a complete boot failure. > And don't answer "specs". Specs are just so much toilet paper. I actually ran out of SCSI specs a while ago. Fortunately the kernel ABI doc is soft, strong, and very, very long. > If the Qemu CD-ROM emulation has worked with not just older versions > of Linux, but presumably Windows and other OS's too, then it's likely > that nobody has ever cared about the VPD pages before, and thus the > new code that requires VPD is likely to break other things too. > > So give a reason why you think qemu is _sop_ special, and why the new > and clearly broken "require VPD" code is _so_ important. > > And really, if the reason is "specs", then somebody needs to get > their head examined. We've had so many devices that only glance > quickly at the specs that people need to realize that paperwork is > one thing, and reality is another, and the two have only a very > tenuous connection. > > The commit that adds this "no VPD" entry doesn't even talk about when > we broke this. What change exactly was it that broke things? > > I've pulled this, but I really think that this was completely bogus. > Even if blacklisting ends up being the right thing to do in the end, > the lack of explanations is awful, and the assumption that it's just > one particular device is very very suspect. So on the specs thing, we don't expect full compliance and we do code around problem responses or things missing programmatically (mostly). However, the problem here is the device effectively committing suicide when it sees a VPD inquiry. We do expect things like this sometimes (mosly from USB manufacturers who seem to regard use of specs the same way you do) but usually we have to blacklist them too to prevent the problem command reaching the device. James
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, 2016-06-11 at 11:54 -0700, Linus Torvalds wrote: > On Sat, Jun 11, 2016 at 11:09 AM, James Bottomley > wrote: > > Two current fixes: one affects Qemu CD ROM emulation, which stopped > > working after the updates in SCSI to require VPD pages from all > > conformant devices. Fix temporarily by blacklisting Qemu (we can > > relax > > later when they come into compliance). > > Is there some reason to believe that the qemu CD-ROM emulation is the > only one with this problem? CD ROM manufacturers are usually reasonably good at standards compliance, so yes, I expect real devices to work. The reason the QEMU one is failing is because it's emulated not produced by a manufacturer. > When some device is known to break because the SCSI layer made some > check stricter, why didn't you just relax the check again? > > In other words, you already have one known device that behaves a way > that the new code doesn't like, why do you think the new code is > correct? This isn't a stricter check, it's probing the device for more information which we then use to set block parameters. If we don't set them, we can still use the device, but possibly not as efficiently (even for a fast 16x or 32x DVD, this will affect the burn speed). The device is allowed to say no. The problem with the QEMU device is that it wasn't saying anything. It was hanging up and never responding again, leading to a complete boot failure. > And don't answer "specs". Specs are just so much toilet paper. I actually ran out of SCSI specs a while ago. Fortunately the kernel ABI doc is soft, strong, and very, very long. > If the Qemu CD-ROM emulation has worked with not just older versions > of Linux, but presumably Windows and other OS's too, then it's likely > that nobody has ever cared about the VPD pages before, and thus the > new code that requires VPD is likely to break other things too. > > So give a reason why you think qemu is _sop_ special, and why the new > and clearly broken "require VPD" code is _so_ important. > > And really, if the reason is "specs", then somebody needs to get > their head examined. We've had so many devices that only glance > quickly at the specs that people need to realize that paperwork is > one thing, and reality is another, and the two have only a very > tenuous connection. > > The commit that adds this "no VPD" entry doesn't even talk about when > we broke this. What change exactly was it that broke things? > > I've pulled this, but I really think that this was completely bogus. > Even if blacklisting ends up being the right thing to do in the end, > the lack of explanations is awful, and the assumption that it's just > one particular device is very very suspect. So on the specs thing, we don't expect full compliance and we do code around problem responses or things missing programmatically (mostly). However, the problem here is the device effectively committing suicide when it sees a VPD inquiry. We do expect things like this sometimes (mosly from USB manufacturers who seem to regard use of specs the same way you do) but usually we have to blacklist them too to prevent the problem command reaching the device. James
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, Jun 11, 2016 at 11:54 AM, Linus Torvaldswrote: > > Is there some reason to believe that the qemu CD-ROM emulation is the > only one with this problem? Side note:the one thing that makes the qemu cd-rom emulator "special" is not that it's not real hardware: it's that it's a lot more likely to be tested than just about any other actual cd-rom out there, especially in environments that test new kernels. Lots of developers tend to have rather modern machines (and I haven't had a CD-ROM in my machine for the last couple of years, I think), or alternatively they end up booting things in emulation because it makes for easy testing. So I really don't think that "oh, it happened only with a broken emulated device" is a very strong argument for saying that that emulated device was the problem. I really think it's likely that the whole "require VPD" is garbage. The whole "everybody and their dog has used qemu, and the qemu cd-rom emulation worked perfectly fine before" is a damn strong argument that it's the new kernel doing something wrong. So please figure our what the real breakage was, and fix *that* instead of blaming qemu. Linus
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, Jun 11, 2016 at 11:54 AM, Linus Torvalds wrote: > > Is there some reason to believe that the qemu CD-ROM emulation is the > only one with this problem? Side note:the one thing that makes the qemu cd-rom emulator "special" is not that it's not real hardware: it's that it's a lot more likely to be tested than just about any other actual cd-rom out there, especially in environments that test new kernels. Lots of developers tend to have rather modern machines (and I haven't had a CD-ROM in my machine for the last couple of years, I think), or alternatively they end up booting things in emulation because it makes for easy testing. So I really don't think that "oh, it happened only with a broken emulated device" is a very strong argument for saying that that emulated device was the problem. I really think it's likely that the whole "require VPD" is garbage. The whole "everybody and their dog has used qemu, and the qemu cd-rom emulation worked perfectly fine before" is a damn strong argument that it's the new kernel doing something wrong. So please figure our what the real breakage was, and fix *that* instead of blaming qemu. Linus
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, Jun 11, 2016 at 11:09 AM, James Bottomleywrote: > Two current fixes: one affects Qemu CD ROM emulation, which stopped > working after the updates in SCSI to require VPD pages from all > conformant devices. Fix temporarily by blacklisting Qemu (we can relax > later when they come into compliance). Is there some reason to believe that the qemu CD-ROM emulation is the only one with this problem? When some device is known to break because the SCSI layer made some check stricter, why didn't you just relax the check again? In other words, you already have one known device that behaves a way that the new code doesn't like, why do you think the new code is correct? And don't answer "specs". Specs are just so much toilet paper. If the Qemu CD-ROM emulation has worked with not just older versions of Linux, but presumably Windows and other OS's too, then it's likely that nobody has ever cared about the VPD pages before, and thus the new code that requires VPD is likely to break other things too. So give a reason why you think qemu is _sop_ special, and why the new and clearly broken "require VPD" code is _so_ important. And really, if the reason is "specs", then somebody needs to get their head examined. We've had so many devices that only glance quickly at the specs that people need to realize that paperwork is one thing, and reality is another, and the two have only a very tenuous connection. The commit that adds this "no VPD" entry doesn't even talk about when we broke this. What change exactly was it that broke things? I've pulled this, but I really think that this was completely bogus. Even if blacklisting ends up being the right thing to do in the end, the lack of explanations is awful, and the assumption that it's just one particular device is very very suspect. Linus
Re: [GIT PULL] SCSI fixes for 4.7-rc2
On Sat, Jun 11, 2016 at 11:09 AM, James Bottomley wrote: > Two current fixes: one affects Qemu CD ROM emulation, which stopped > working after the updates in SCSI to require VPD pages from all > conformant devices. Fix temporarily by blacklisting Qemu (we can relax > later when they come into compliance). Is there some reason to believe that the qemu CD-ROM emulation is the only one with this problem? When some device is known to break because the SCSI layer made some check stricter, why didn't you just relax the check again? In other words, you already have one known device that behaves a way that the new code doesn't like, why do you think the new code is correct? And don't answer "specs". Specs are just so much toilet paper. If the Qemu CD-ROM emulation has worked with not just older versions of Linux, but presumably Windows and other OS's too, then it's likely that nobody has ever cared about the VPD pages before, and thus the new code that requires VPD is likely to break other things too. So give a reason why you think qemu is _sop_ special, and why the new and clearly broken "require VPD" code is _so_ important. And really, if the reason is "specs", then somebody needs to get their head examined. We've had so many devices that only glance quickly at the specs that people need to realize that paperwork is one thing, and reality is another, and the two have only a very tenuous connection. The commit that adds this "no VPD" entry doesn't even talk about when we broke this. What change exactly was it that broke things? I've pulled this, but I really think that this was completely bogus. Even if blacklisting ends up being the right thing to do in the end, the lack of explanations is awful, and the assumption that it's just one particular device is very very suspect. Linus
[GIT PULL] SCSI fixes for 4.7-rc2
Two current fixes: one affects Qemu CD ROM emulation, which stopped working after the updates in SCSI to require VPD pages from all conformant devices. Fix temporarily by blacklisting Qemu (we can relax later when they come into compliance). The other is a fix to the optimal transfer size. We set up a minefield for ourselves by being confused about whether the limits are in bytes or sectors (SCSI optimal is in blocks and the queue parameter is in bytes). This tries to fix the problem (wrong setting for queue limits max_sectors) and make the problem more obvious by introducing a wrapper function. The patch is available here: git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes The short changelog is: Ewan D. Milne (1): scsi: Add QEMU CD-ROM to VPD Inquiry Blacklist Martin K. Petersen (1): sd: Fix rw_max for devices that report an optimal xfer size And the diffstat: drivers/scsi/scsi_devinfo.c | 1 + drivers/scsi/sd.c | 8 drivers/scsi/sd.h | 5 + 3 files changed, 10 insertions(+), 4 deletions(-) With full diff below. James --- diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c index 3408578..ff41c31 100644 --- a/drivers/scsi/scsi_devinfo.c +++ b/drivers/scsi/scsi_devinfo.c @@ -230,6 +230,7 @@ static struct { {"PIONEER", "CD-ROM DRM-624X", NULL, BLIST_FORCELUN | BLIST_SINGLELUN}, {"Promise", "VTrak E610f", NULL, BLIST_SPARSELUN | BLIST_NO_RSOC}, {"Promise", "", NULL, BLIST_SPARSELUN}, + {"QEMU", "QEMU CD-ROM", NULL, BLIST_SKIP_VPD_PAGES}, {"QNAP", "iSCSI Storage", NULL, BLIST_MAX_1024}, {"SYNOLOGY", "iSCSI Storage", NULL, BLIST_MAX_1024}, {"QUANTUM", "XP34301", "1071", BLIST_NOTQ}, diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index f459dff..60bff78 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -2867,10 +2867,10 @@ static int sd_revalidate_disk(struct gendisk *disk) if (sdkp->opt_xfer_blocks && sdkp->opt_xfer_blocks <= dev_max && sdkp->opt_xfer_blocks <= SD_DEF_XFER_BLOCKS && - sdkp->opt_xfer_blocks * sdp->sector_size >= PAGE_SIZE) - rw_max = q->limits.io_opt = - sdkp->opt_xfer_blocks * sdp->sector_size; - else + logical_to_bytes(sdp, sdkp->opt_xfer_blocks) >= PAGE_SIZE) { + q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks); + rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks); + } else rw_max = BLK_DEF_MAX_SECTORS; /* Combine with controller limits */ diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 654630b..765a6f1 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -151,6 +151,11 @@ static inline sector_t logical_to_sectors(struct scsi_device *sdev, sector_t blo return blocks << (ilog2(sdev->sector_size) - 9); } +static inline unsigned int logical_to_bytes(struct scsi_device *sdev, sector_t blocks) +{ + return blocks * sdev->sector_size; +} + /* * A DIF-capable target device can be formatted with different * protection schemes. Currently 0 through 3 are defined:
[GIT PULL] SCSI fixes for 4.7-rc2
Two current fixes: one affects Qemu CD ROM emulation, which stopped working after the updates in SCSI to require VPD pages from all conformant devices. Fix temporarily by blacklisting Qemu (we can relax later when they come into compliance). The other is a fix to the optimal transfer size. We set up a minefield for ourselves by being confused about whether the limits are in bytes or sectors (SCSI optimal is in blocks and the queue parameter is in bytes). This tries to fix the problem (wrong setting for queue limits max_sectors) and make the problem more obvious by introducing a wrapper function. The patch is available here: git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes The short changelog is: Ewan D. Milne (1): scsi: Add QEMU CD-ROM to VPD Inquiry Blacklist Martin K. Petersen (1): sd: Fix rw_max for devices that report an optimal xfer size And the diffstat: drivers/scsi/scsi_devinfo.c | 1 + drivers/scsi/sd.c | 8 drivers/scsi/sd.h | 5 + 3 files changed, 10 insertions(+), 4 deletions(-) With full diff below. James --- diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c index 3408578..ff41c31 100644 --- a/drivers/scsi/scsi_devinfo.c +++ b/drivers/scsi/scsi_devinfo.c @@ -230,6 +230,7 @@ static struct { {"PIONEER", "CD-ROM DRM-624X", NULL, BLIST_FORCELUN | BLIST_SINGLELUN}, {"Promise", "VTrak E610f", NULL, BLIST_SPARSELUN | BLIST_NO_RSOC}, {"Promise", "", NULL, BLIST_SPARSELUN}, + {"QEMU", "QEMU CD-ROM", NULL, BLIST_SKIP_VPD_PAGES}, {"QNAP", "iSCSI Storage", NULL, BLIST_MAX_1024}, {"SYNOLOGY", "iSCSI Storage", NULL, BLIST_MAX_1024}, {"QUANTUM", "XP34301", "1071", BLIST_NOTQ}, diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index f459dff..60bff78 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -2867,10 +2867,10 @@ static int sd_revalidate_disk(struct gendisk *disk) if (sdkp->opt_xfer_blocks && sdkp->opt_xfer_blocks <= dev_max && sdkp->opt_xfer_blocks <= SD_DEF_XFER_BLOCKS && - sdkp->opt_xfer_blocks * sdp->sector_size >= PAGE_SIZE) - rw_max = q->limits.io_opt = - sdkp->opt_xfer_blocks * sdp->sector_size; - else + logical_to_bytes(sdp, sdkp->opt_xfer_blocks) >= PAGE_SIZE) { + q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks); + rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks); + } else rw_max = BLK_DEF_MAX_SECTORS; /* Combine with controller limits */ diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h index 654630b..765a6f1 100644 --- a/drivers/scsi/sd.h +++ b/drivers/scsi/sd.h @@ -151,6 +151,11 @@ static inline sector_t logical_to_sectors(struct scsi_device *sdev, sector_t blo return blocks << (ilog2(sdev->sector_size) - 9); } +static inline unsigned int logical_to_bytes(struct scsi_device *sdev, sector_t blocks) +{ + return blocks * sdev->sector_size; +} + /* * A DIF-capable target device can be formatted with different * protection schemes. Currently 0 through 3 are defined: