subject:"\[GIT PULL\] SCSI fixes for 4.7\-rc2"

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-13 Thread Linus Torvalds

On Mon, Jun 13, 2016 at 12:04 AM, Hannes Reinecke  wrote:
>
> And we have been running the very patch in SLES for over a year now,
> without a single issue being reported.

Oh, ok. So it's not "all qemu kvm instances are broken", it was a very
unusual issue, and the patch has actually gotten wider testing.

That makes me much happier about it.

Linus

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-13 Thread Linus Torvalds

On Mon, Jun 13, 2016 at 12:04 AM, Hannes Reinecke  wrote:
>
> And we have been running the very patch in SLES for over a year now,
> without a single issue being reported.

Oh, ok. So it's not "all qemu kvm instances are broken", it was a very
unusual issue, and the patch has actually gotten wider testing.

That makes me much happier about it.

Linus

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-13 Thread Hannes Reinecke

On 06/11/2016 11:03 PM, James Bottomley wrote:
> On Sat, 2016-06-11 at 13:25 -0700, Linus Torvalds wrote:
>> On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley
>>  wrote:
>>>
>>> The QEMU people have accepted it as their bug and are fixing it.
>>
>> Of course they are. Somebody found a bug in their device model, I'd
>> expect nothing else.
>>
>> But I'm not worried about qemu. I'm worried about all the other 
>> random devices that have never been tested.
> 
> Most of the other devices that are likely to misbehave don't advertise
> high levels of SCSI conformance, so we seem to be mostly covered.
> 
And we have been running the very patch in SLES for over a year now,
without a single issue being reported.

>>>  There's no other course of action, really because we can't stop
>>> people
>>> sending this command using the BLOCK_PC interface from user space,
>>> so
>>> it's now a known and easy to use way of stopping the device from
>>> responding.
>>
>> Bah. That's not an argument from kernel space. We've had that 
>> forever. Broken device that hangs up when you try to read past the 
>> end? If you can open the raw device for reading, you can still do a
>> SCSI_IOCTL_SEND_COMMAND to send that read command past the end.
>>
>> The fact that you can craft special commands that can cause problems
>> for specific devices (if you have access to the raw device) does 
>> *not* at all argue that the kernel should then do those accesses of 
>> its own volition.
>>
>> My worry basically comes down to: we're clearly now doing something
>> that has never ever been tested by anybody before.
>>
Not quite. See above.

The reported issue came from someone who has been running the very
latest linux kernel in a VM which was hosted on an ancient version of
QEMU. Hardly a common scenario.

>> And I think that the assumption that the bug would magically be
>> limited to qemu is a *big* assumption.
> 
> How do we ever find out if we don't test it, though?  I'm sure some
> obscure minor celebrity trying to get on the chat show circuit once
> said "what is userspace except a test case for the kernel?"
> 
> If this is the only problem that turns up, I think we're done.  If we
> get any more we can consider either blacklisting all CD type devices or
> raising the conformance bar to SPC-3.
> 
I'm fully with James here.
The alternative would be to whitelist _every_ conformant device,
resulting in lots of unhappy customers until we've got the whitelist
settled.

Having to discuss with customers why Linux doesn't follow the specs is
infinitely harder than discussing with customers whose _hardware_
doesn't follow the specs.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-13 Thread Hannes Reinecke

On 06/11/2016 11:03 PM, James Bottomley wrote:
> On Sat, 2016-06-11 at 13:25 -0700, Linus Torvalds wrote:
>> On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley
>>  wrote:
>>>
>>> The QEMU people have accepted it as their bug and are fixing it.
>>
>> Of course they are. Somebody found a bug in their device model, I'd
>> expect nothing else.
>>
>> But I'm not worried about qemu. I'm worried about all the other 
>> random devices that have never been tested.
> 
> Most of the other devices that are likely to misbehave don't advertise
> high levels of SCSI conformance, so we seem to be mostly covered.
> 
And we have been running the very patch in SLES for over a year now,
without a single issue being reported.

>>>  There's no other course of action, really because we can't stop
>>> people
>>> sending this command using the BLOCK_PC interface from user space,
>>> so
>>> it's now a known and easy to use way of stopping the device from
>>> responding.
>>
>> Bah. That's not an argument from kernel space. We've had that 
>> forever. Broken device that hangs up when you try to read past the 
>> end? If you can open the raw device for reading, you can still do a
>> SCSI_IOCTL_SEND_COMMAND to send that read command past the end.
>>
>> The fact that you can craft special commands that can cause problems
>> for specific devices (if you have access to the raw device) does 
>> *not* at all argue that the kernel should then do those accesses of 
>> its own volition.
>>
>> My worry basically comes down to: we're clearly now doing something
>> that has never ever been tested by anybody before.
>>
Not quite. See above.

The reported issue came from someone who has been running the very
latest linux kernel in a VM which was hosted on an ancient version of
QEMU. Hardly a common scenario.

>> And I think that the assumption that the bug would magically be
>> limited to qemu is a *big* assumption.
> 
> How do we ever find out if we don't test it, though?  I'm sure some
> obscure minor celebrity trying to get on the chat show circuit once
> said "what is userspace except a test case for the kernel?"
> 
> If this is the only problem that turns up, I think we're done.  If we
> get any more we can consider either blacklisting all CD type devices or
> raising the conformance bar to SPC-3.
> 
I'm fully with James here.
The alternative would be to whitelist _every_ conformant device,
resulting in lots of unhappy customers until we've got the whitelist
settled.

Having to discuss with customers why Linux doesn't follow the specs is
infinitely harder than discussing with customers whose _hardware_
doesn't follow the specs.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread James Bottomley

On Sat, 2016-06-11 at 13:25 -0700, Linus Torvalds wrote:
> On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley
>  wrote:
> > 
> > The QEMU people have accepted it as their bug and are fixing it.
> 
> Of course they are. Somebody found a bug in their device model, I'd
> expect nothing else.
> 
> But I'm not worried about qemu. I'm worried about all the other 
> random devices that have never been tested.

Most of the other devices that are likely to misbehave don't advertise
high levels of SCSI conformance, so we seem to be mostly covered.

> >  There's no other course of action, really because we can't stop
> > people
> > sending this command using the BLOCK_PC interface from user space,
> > so
> > it's now a known and easy to use way of stopping the device from
> > responding.
> 
> Bah. That's not an argument from kernel space. We've had that 
> forever. Broken device that hangs up when you try to read past the 
> end? If you can open the raw device for reading, you can still do a
> SCSI_IOCTL_SEND_COMMAND to send that read command past the end.
> 
> The fact that you can craft special commands that can cause problems
> for specific devices (if you have access to the raw device) does 
> *not* at all argue that the kernel should then do those accesses of 
> its own volition.
> 
> My worry basically comes down to: we're clearly now doing something
> that has never ever been tested by anybody before.
> 
> And I think that the assumption that the bug would magically be
> limited to qemu is a *big* assumption.

How do we ever find out if we don't test it, though?  I'm sure some
obscure minor celebrity trying to get on the chat show circuit once
said "what is userspace except a test case for the kernel?"

If this is the only problem that turns up, I think we're done.  If we
get any more we can consider either blacklisting all CD type devices or
raising the conformance bar to SPC-3.

James

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread James Bottomley

On Sat, 2016-06-11 at 13:25 -0700, Linus Torvalds wrote:
> On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley
>  wrote:
> > 
> > The QEMU people have accepted it as their bug and are fixing it.
> 
> Of course they are. Somebody found a bug in their device model, I'd
> expect nothing else.
> 
> But I'm not worried about qemu. I'm worried about all the other 
> random devices that have never been tested.

Most of the other devices that are likely to misbehave don't advertise
high levels of SCSI conformance, so we seem to be mostly covered.

> >  There's no other course of action, really because we can't stop
> > people
> > sending this command using the BLOCK_PC interface from user space,
> > so
> > it's now a known and easy to use way of stopping the device from
> > responding.
> 
> Bah. That's not an argument from kernel space. We've had that 
> forever. Broken device that hangs up when you try to read past the 
> end? If you can open the raw device for reading, you can still do a
> SCSI_IOCTL_SEND_COMMAND to send that read command past the end.
> 
> The fact that you can craft special commands that can cause problems
> for specific devices (if you have access to the raw device) does 
> *not* at all argue that the kernel should then do those accesses of 
> its own volition.
> 
> My worry basically comes down to: we're clearly now doing something
> that has never ever been tested by anybody before.
> 
> And I think that the assumption that the bug would magically be
> limited to qemu is a *big* assumption.

How do we ever find out if we don't test it, though?  I'm sure some
obscure minor celebrity trying to get on the chat show circuit once
said "what is userspace except a test case for the kernel?"

If this is the only problem that turns up, I think we're done.  If we
get any more we can consider either blacklisting all CD type devices or
raising the conformance bar to SPC-3.

James

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread James Bottomley

On Sat, 2016-06-11 at 12:57 -0700, Linus Torvalds wrote:
> On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley
>  wrote:
> > 
> > It looks like there's a hole where the emulation should be for the
> > VPD
> > inquiry, which is what cause the whole hang up and never speak to
> > us
> > again problem.
> 
> So? What makes you think real hardware doesn't have those kinds of
> issues?

It does.  USB SCSI emulation chips are notorious for crashing when they
see the "wrong" command.  However, our fix is mostly to blacklist the
command from ever being seen by those devices as well.  That's what the
usb unusual_devs.h file is full of.

> This is no different from all the various real pieces of hardware 
> that lock up when you ask for a particular mode page that they don't
> support. We are *very* careful not to randomly read mode pages 
> because of that issue. Several things are done only if the device 
> claims it is of a sufficiently new SCSI revision etc.
> 
> None of this is new.

Right, we've got a way of coping, it's either blacklist or programmatic
compensation if we can recognise the characteristic.

> The fact that Windows apparently doesn't do the VPD inquiry - since
> qemu works work windows - should make you very very worried. Instead,
> you completely ignore this and say "oh, it's just a qemu bug".

Windows does do VPD inquiries, but not, apparently for CD ROMS.  It is
feasible for us to blacklist all CD ROMS (type ROM and WORM) like
windows apparently does, but then I'll get complaints from people whose
high end devices are no-longer optimally configured by default and
whose burn speeds have dropped as a result.

> Software that isn't tested doesn't work. That's simply a fact that
> all programmers should be intimately familiar with. I'm hoping you
> know that.
> 
> Buit why the hell do you then believe that hardware is any different?
> Because I can most definitely tell you it isn't.

I did describe how we build heuristics to cope with buggy hardware.  A
Blacklist is part of that.

> Really. You have entirely ignored the "nobody has apparently ever
> asked for vpd information" argument.

OK, so historically everyone mostly ignored the VPD pages because the
limits they describe only mattered for a few devices and most I/O
stacks couldn't be programmed to the characteristics they were
describing anyway.  Then came SSDs and a high performance penalty in
the FTL for not sending optimally sized and aligned writes and suddenly
we all had to rewrite our I/O stacks to cope and be able to receive
this wisdom from the device via the block characteristics VPD pages so
our users could get the expected IOPS out of their shiny new and
expensive devices ... and by we I mean Linux, Windows and a host of
other operating systems.

Once we did it for SSDs, everyone wanted go faster stripes for their
devices as well (array vendors and even some spinning rust type
vendors), so we got sucked into a whole cycle.

> You also don't say what it was that we did to start asking for vpd
> information. What's the commit this fixes? What is it that actually
> introduced this problem? And why can it not - like our mode page
> things - look at a lot of other fields first to judge whether the
> known problematic access is really worth it.

I think it's this one:

ommit e4c36ad756ec2de36b05c66bb50ed4ff47b0fdb0
Author: Hannes Reinecke 
Date:   Fri Apr 1 08:57:37 2016 +0200

scsi: vpd pages are mandatory for SPC-2

The QEMU device seems to identify itself as SCSI 3 SPC-2 (SPC means
SCSI Primary Command Conformance, it's a flag the device can set to
show which standard revision it claims to comply with).  If we raise
the bar to SPC-3 then we'd not ask it for VPD pages, but we'd miss a
lot of conforming SPC-2 devices which we'd then have to whitelist.

> Look at all those other blacklists we have. They are for real 
> devices. Things like "don't do mode page 3f" etc. All the checks we 
> do before we decide it's even worth it asking for caching
> information.
> 
> We have a lot of "let's be cautious" with mode page accesses. What 
> did we do that is different for vpd? Maybe that wasn't cautious
> enough.

We tied them to a particular level of advertised conformance. 
 Initially it was SPC-3 and above, but with the above commit we lowered
it to SPC-2 and above.

> Because we do have *other* devices that already have skip_vpd_pages
> set.  So this whole thing was clearly never limited to qemu
> emulation.
> 
> So answer me already: what was the change that actually made us break
> recently?

See above.

> And *why* do you seem to continue to believe that hardware cannot be
> broken, despite all the evidence that said hardware has never ever
> been tested?

We don't.  We tried the blacklist everything and then conservatively
whitelist it with discard and that didn't really work out so well for
us either, so now we're trying the probe the device for what they say
the support in the

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread James Bottomley

On Sat, 2016-06-11 at 12:57 -0700, Linus Torvalds wrote:
> On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley
>  wrote:
> > 
> > It looks like there's a hole where the emulation should be for the
> > VPD
> > inquiry, which is what cause the whole hang up and never speak to
> > us
> > again problem.
> 
> So? What makes you think real hardware doesn't have those kinds of
> issues?

It does.  USB SCSI emulation chips are notorious for crashing when they
see the "wrong" command.  However, our fix is mostly to blacklist the
command from ever being seen by those devices as well.  That's what the
usb unusual_devs.h file is full of.

> This is no different from all the various real pieces of hardware 
> that lock up when you ask for a particular mode page that they don't
> support. We are *very* careful not to randomly read mode pages 
> because of that issue. Several things are done only if the device 
> claims it is of a sufficiently new SCSI revision etc.
> 
> None of this is new.

Right, we've got a way of coping, it's either blacklist or programmatic
compensation if we can recognise the characteristic.

> The fact that Windows apparently doesn't do the VPD inquiry - since
> qemu works work windows - should make you very very worried. Instead,
> you completely ignore this and say "oh, it's just a qemu bug".

Windows does do VPD inquiries, but not, apparently for CD ROMS.  It is
feasible for us to blacklist all CD ROMS (type ROM and WORM) like
windows apparently does, but then I'll get complaints from people whose
high end devices are no-longer optimally configured by default and
whose burn speeds have dropped as a result.

> Software that isn't tested doesn't work. That's simply a fact that
> all programmers should be intimately familiar with. I'm hoping you
> know that.
> 
> Buit why the hell do you then believe that hardware is any different?
> Because I can most definitely tell you it isn't.

I did describe how we build heuristics to cope with buggy hardware.  A
Blacklist is part of that.

> Really. You have entirely ignored the "nobody has apparently ever
> asked for vpd information" argument.

OK, so historically everyone mostly ignored the VPD pages because the
limits they describe only mattered for a few devices and most I/O
stacks couldn't be programmed to the characteristics they were
describing anyway.  Then came SSDs and a high performance penalty in
the FTL for not sending optimally sized and aligned writes and suddenly
we all had to rewrite our I/O stacks to cope and be able to receive
this wisdom from the device via the block characteristics VPD pages so
our users could get the expected IOPS out of their shiny new and
expensive devices ... and by we I mean Linux, Windows and a host of
other operating systems.

Once we did it for SSDs, everyone wanted go faster stripes for their
devices as well (array vendors and even some spinning rust type
vendors), so we got sucked into a whole cycle.

> You also don't say what it was that we did to start asking for vpd
> information. What's the commit this fixes? What is it that actually
> introduced this problem? And why can it not - like our mode page
> things - look at a lot of other fields first to judge whether the
> known problematic access is really worth it.

I think it's this one:

ommit e4c36ad756ec2de36b05c66bb50ed4ff47b0fdb0
Author: Hannes Reinecke 
Date:   Fri Apr 1 08:57:37 2016 +0200

scsi: vpd pages are mandatory for SPC-2

The QEMU device seems to identify itself as SCSI 3 SPC-2 (SPC means
SCSI Primary Command Conformance, it's a flag the device can set to
show which standard revision it claims to comply with).  If we raise
the bar to SPC-3 then we'd not ask it for VPD pages, but we'd miss a
lot of conforming SPC-2 devices which we'd then have to whitelist.

> Look at all those other blacklists we have. They are for real 
> devices. Things like "don't do mode page 3f" etc. All the checks we 
> do before we decide it's even worth it asking for caching
> information.
> 
> We have a lot of "let's be cautious" with mode page accesses. What 
> did we do that is different for vpd? Maybe that wasn't cautious
> enough.

We tied them to a particular level of advertised conformance. 
 Initially it was SPC-3 and above, but with the above commit we lowered
it to SPC-2 and above.

> Because we do have *other* devices that already have skip_vpd_pages
> set.  So this whole thing was clearly never limited to qemu
> emulation.
> 
> So answer me already: what was the change that actually made us break
> recently?

See above.

> And *why* do you seem to continue to believe that hardware cannot be
> broken, despite all the evidence that said hardware has never ever
> been tested?

We don't.  We tried the blacklist everything and then conservatively
whitelist it with discard and that didn't really work out so well for
us either, so now we're trying the probe the device for what they say
the support in the standard and use it, with blacklists for devices
that

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread Linus Torvalds

On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley
 wrote:
>
> The QEMU people have accepted it as their bug and are fixing it.

Of course they are. Somebody found a bug in their device model, I'd
expect nothing else.

But I'm not worried about qemu. I'm worried about all the other random
devices that have never been tested.

>  There's no other course of action, really because we can't stop people
> sending this command using the BLOCK_PC interface from user space, so
> it's now a known and easy to use way of stopping the device from
> responding.

Bah. That's not an argument from kernel space. We've had that forever.
Broken device that hangs up when you try to read past the end? If you
can open the raw device for reading, you can still do a
SCSI_IOCTL_SEND_COMMAND to send that read command past the end.

The fact that you can craft special commands that can cause problems
for specific devices (if you have access to the raw device) does *not*
at all argue that the kernel should then do those accesses of its own
volition.

My worry basically comes down to: we're clearly now doing something
that has never ever been tested by anybody before.

And I think that the assumption that the bug would magically be
limited to qemu is a *big* assumption.

Linus

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread Linus Torvalds

On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley
 wrote:
>
> The QEMU people have accepted it as their bug and are fixing it.

Of course they are. Somebody found a bug in their device model, I'd
expect nothing else.

But I'm not worried about qemu. I'm worried about all the other random
devices that have never been tested.

>  There's no other course of action, really because we can't stop people
> sending this command using the BLOCK_PC interface from user space, so
> it's now a known and easy to use way of stopping the device from
> responding.

Bah. That's not an argument from kernel space. We've had that forever.
Broken device that hangs up when you try to read past the end? If you
can open the raw device for reading, you can still do a
SCSI_IOCTL_SEND_COMMAND to send that read command past the end.

The fact that you can craft special commands that can cause problems
for specific devices (if you have access to the raw device) does *not*
at all argue that the kernel should then do those accesses of its own
volition.

My worry basically comes down to: we're clearly now doing something
that has never ever been tested by anybody before.

And I think that the assumption that the bug would magically be
limited to qemu is a *big* assumption.

Linus

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread Linus Torvalds

On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley
 wrote:
>
> It looks like there's a hole where the emulation should be for the VPD
> inquiry, which is what cause the whole hang up and never speak to us
> again problem.

So? What makes you think real hardware doesn't have those kinds of issues?

This is no different from all the various real pieces of hardware that
lock up when you ask for a particular mode page that they don't
support. We are *very* careful not to randomly read mode pages because
of that issue. Several things are done only if the device claims it is
of a sufficiently new SCSI revision etc.

None of this is new.

The fact that Windows apparently doesn't do the VPD inquiry - since
qemu works work windows - should make you very very worried. Instead,
you completely ignore this and say "oh, it's just a qemu bug".

Software that isn't tested doesn't work. That's simply a fact that all
programmers should be intimately familiar with. I'm hoping you know
that.

Buit why the hell do you then believe that hardware is any different?
Because I can most definitely tell you it isn't.

Really. You have entirely ignored the "nobody has apparently ever
asked for vpd information" argument.

You also don't say what it was that we did to start asking for vpd
information. What's the commit this fixes? What is it that actually
introduced this problem? And why can it not - like our mode page
things - look at a lot of other fields first to judge whether the
known problematic access is really worth it.

Look at all those other blacklists we have. They are for real devices.
Things like "don't do mode page 3f" etc. All the checks we do before
we decide it's even worth it asking for caching information.

We have a lot of "let's be cautious" with mode page accesses. What did
we do that is different for vpd? Maybe that wasn't cautious enough.

Because we do have *other* devices that already have skip_vpd_pages
set.  So this whole thing was clearly never limited to qemu emulation.

So answer me already: what was the change that actually made us break recently?

And *why* do you seem to continue to believe that hardware cannot be
broken, despite all the evidence that said hardware has never ever
been tested?

Linus

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread Linus Torvalds

On Sat, Jun 11, 2016 at 12:41 PM, James Bottomley
 wrote:
>
> It looks like there's a hole where the emulation should be for the VPD
> inquiry, which is what cause the whole hang up and never speak to us
> again problem.

So? What makes you think real hardware doesn't have those kinds of issues?

This is no different from all the various real pieces of hardware that
lock up when you ask for a particular mode page that they don't
support. We are *very* careful not to randomly read mode pages because
of that issue. Several things are done only if the device claims it is
of a sufficiently new SCSI revision etc.

None of this is new.

The fact that Windows apparently doesn't do the VPD inquiry - since
qemu works work windows - should make you very very worried. Instead,
you completely ignore this and say "oh, it's just a qemu bug".

Software that isn't tested doesn't work. That's simply a fact that all
programmers should be intimately familiar with. I'm hoping you know
that.

Buit why the hell do you then believe that hardware is any different?
Because I can most definitely tell you it isn't.

Really. You have entirely ignored the "nobody has apparently ever
asked for vpd information" argument.

You also don't say what it was that we did to start asking for vpd
information. What's the commit this fixes? What is it that actually
introduced this problem? And why can it not - like our mode page
things - look at a lot of other fields first to judge whether the
known problematic access is really worth it.

Look at all those other blacklists we have. They are for real devices.
Things like "don't do mode page 3f" etc. All the checks we do before
we decide it's even worth it asking for caching information.

We have a lot of "let's be cautious" with mode page accesses. What did
we do that is different for vpd? Maybe that wasn't cautious enough.

Because we do have *other* devices that already have skip_vpd_pages
set.  So this whole thing was clearly never limited to qemu emulation.

So answer me already: what was the change that actually made us break recently?

And *why* do you seem to continue to believe that hardware cannot be
broken, despite all the evidence that said hardware has never ever
been tested?

Linus

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread James Bottomley

On Sat, 2016-06-11 at 12:12 -0700, Linus Torvalds wrote:
> On Sat, Jun 11, 2016 at 11:54 AM, Linus Torvalds
>  wrote:
> > 
> > Is there some reason to believe that the qemu CD-ROM emulation is 
> > the only one with this problem?
> 
> Side note:the one thing that makes the qemu cd-rom emulator "special"
> is not that it's not real hardware: it's that it's a lot more likely
> to be tested than just about any other actual cd-rom out there,
> especially in environments that test new kernels. Lots of developers
> tend to have rather modern machines (and I haven't had a CD-ROM in my
> machine for the last couple of years, I think), or alternatively they
> end up booting things in emulation because it makes for easy testing.
> 
> So I really don't think that "oh, it happened only with a broken
> emulated device" is a very strong argument for saying that that
> emulated device was the problem.

It looks like there's a hole where the emulation should be for the VPD
inquiry, which is what cause the whole hang up and never speak to us
again problem.

> I really think it's likely that the whole "require VPD" is garbage.
> The whole "everybody and their dog has used qemu, and the qemu cd-rom
> emulation worked perfectly fine before" is a damn strong argument 
> that it's the new kernel doing something wrong.
> 
> So please figure our what the real breakage was, and fix *that*
> instead of blaming qemu.

The QEMU people have accepted it as their bug and are fixing it. 
 There's no other course of action, really because we can't stop people
sending this command using the BLOCK_PC interface from user space, so
it's now a known and easy to use way of stopping the device from
responding.  Fortunately, the effects seem to be confined to the CD
only ... but it could have been worse (*cough* venom floppy driver
*cough*)

James

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread James Bottomley

On Sat, 2016-06-11 at 12:12 -0700, Linus Torvalds wrote:
> On Sat, Jun 11, 2016 at 11:54 AM, Linus Torvalds
>  wrote:
> > 
> > Is there some reason to believe that the qemu CD-ROM emulation is 
> > the only one with this problem?
> 
> Side note:the one thing that makes the qemu cd-rom emulator "special"
> is not that it's not real hardware: it's that it's a lot more likely
> to be tested than just about any other actual cd-rom out there,
> especially in environments that test new kernels. Lots of developers
> tend to have rather modern machines (and I haven't had a CD-ROM in my
> machine for the last couple of years, I think), or alternatively they
> end up booting things in emulation because it makes for easy testing.
> 
> So I really don't think that "oh, it happened only with a broken
> emulated device" is a very strong argument for saying that that
> emulated device was the problem.

It looks like there's a hole where the emulation should be for the VPD
inquiry, which is what cause the whole hang up and never speak to us
again problem.

> I really think it's likely that the whole "require VPD" is garbage.
> The whole "everybody and their dog has used qemu, and the qemu cd-rom
> emulation worked perfectly fine before" is a damn strong argument 
> that it's the new kernel doing something wrong.
> 
> So please figure our what the real breakage was, and fix *that*
> instead of blaming qemu.

The QEMU people have accepted it as their bug and are fixing it. 
 There's no other course of action, really because we can't stop people
sending this command using the BLOCK_PC interface from user space, so
it's now a known and easy to use way of stopping the device from
responding.  Fortunately, the effects seem to be confined to the CD
only ... but it could have been worse (*cough* venom floppy driver
*cough*)

James

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread James Bottomley

On Sat, 2016-06-11 at 11:54 -0700, Linus Torvalds wrote:
> On Sat, Jun 11, 2016 at 11:09 AM, James Bottomley
>  wrote:
> > Two current fixes: one affects Qemu CD ROM emulation, which stopped
> > working after the updates in SCSI to require VPD pages from all
> > conformant devices.  Fix temporarily by blacklisting Qemu (we can
> > relax
> > later when they come into compliance).
> 
> Is there some reason to believe that the qemu CD-ROM emulation is the
> only one with this problem?

CD ROM manufacturers are usually reasonably good at standards
compliance, so yes, I expect real devices to work.  The reason the QEMU
one is failing is because it's emulated not produced by a manufacturer.

> When some device is known to break because the SCSI layer made some
> check stricter, why didn't you just relax the check again?
> 
> In other words, you already have one known device that behaves a way
> that the new code doesn't like, why do you think the new code is
> correct?

This isn't a stricter check, it's probing the device for more
information which we then use to set block parameters.  If we don't set
them, we can still use the device, but possibly not as efficiently
(even for a fast 16x or 32x DVD, this will affect the burn speed).

The device is allowed to say no.  The problem with the QEMU device is
that it wasn't saying anything.  It was hanging up and never responding
again, leading to a complete boot failure.

> And don't answer "specs". Specs are just so much toilet paper.

I actually ran out of SCSI specs a while ago.  Fortunately the kernel
ABI doc is soft, strong, and very, very long.

> If the Qemu CD-ROM emulation has worked with not just older versions
> of Linux, but presumably Windows and other OS's too, then it's likely
> that nobody has ever cared about the VPD pages before, and thus the
> new code that requires VPD is likely to break other things too.
> 
> So give a reason why you think qemu is _sop_ special, and why the new
> and clearly broken "require VPD" code is _so_ important.
> 
> And really, if the reason is "specs", then somebody needs to get 
> their head examined. We've had so many devices that only glance 
> quickly at the specs that people need to realize that paperwork is 
> one thing, and reality is another, and the two have only a very
> tenuous connection.
> 
> The commit that adds this "no VPD" entry doesn't even talk about when
> we broke this.  What change exactly was it that broke things?
> 
> I've pulled this, but I really think that this was completely bogus.
> Even if blacklisting ends up being the right thing to do in the end,
> the lack of explanations is awful, and the assumption that it's just
> one particular device is very very suspect.

So on the specs thing, we don't expect full compliance and we do code
around problem responses or things missing programmatically (mostly). 
 However, the problem here is the device effectively committing suicide
when it sees a VPD inquiry.  We do expect things like this sometimes
(mosly from USB manufacturers who seem to regard use of specs the same
way you do) but usually we have to blacklist them too to prevent the
problem command reaching the device.

James

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread James Bottomley

On Sat, 2016-06-11 at 11:54 -0700, Linus Torvalds wrote:
> On Sat, Jun 11, 2016 at 11:09 AM, James Bottomley
>  wrote:
> > Two current fixes: one affects Qemu CD ROM emulation, which stopped
> > working after the updates in SCSI to require VPD pages from all
> > conformant devices.  Fix temporarily by blacklisting Qemu (we can
> > relax
> > later when they come into compliance).
> 
> Is there some reason to believe that the qemu CD-ROM emulation is the
> only one with this problem?

CD ROM manufacturers are usually reasonably good at standards
compliance, so yes, I expect real devices to work.  The reason the QEMU
one is failing is because it's emulated not produced by a manufacturer.

> When some device is known to break because the SCSI layer made some
> check stricter, why didn't you just relax the check again?
> 
> In other words, you already have one known device that behaves a way
> that the new code doesn't like, why do you think the new code is
> correct?

This isn't a stricter check, it's probing the device for more
information which we then use to set block parameters.  If we don't set
them, we can still use the device, but possibly not as efficiently
(even for a fast 16x or 32x DVD, this will affect the burn speed).

The device is allowed to say no.  The problem with the QEMU device is
that it wasn't saying anything.  It was hanging up and never responding
again, leading to a complete boot failure.

> And don't answer "specs". Specs are just so much toilet paper.

I actually ran out of SCSI specs a while ago.  Fortunately the kernel
ABI doc is soft, strong, and very, very long.

> If the Qemu CD-ROM emulation has worked with not just older versions
> of Linux, but presumably Windows and other OS's too, then it's likely
> that nobody has ever cared about the VPD pages before, and thus the
> new code that requires VPD is likely to break other things too.
> 
> So give a reason why you think qemu is _sop_ special, and why the new
> and clearly broken "require VPD" code is _so_ important.
> 
> And really, if the reason is "specs", then somebody needs to get 
> their head examined. We've had so many devices that only glance 
> quickly at the specs that people need to realize that paperwork is 
> one thing, and reality is another, and the two have only a very
> tenuous connection.
> 
> The commit that adds this "no VPD" entry doesn't even talk about when
> we broke this.  What change exactly was it that broke things?
> 
> I've pulled this, but I really think that this was completely bogus.
> Even if blacklisting ends up being the right thing to do in the end,
> the lack of explanations is awful, and the assumption that it's just
> one particular device is very very suspect.

So on the specs thing, we don't expect full compliance and we do code
around problem responses or things missing programmatically (mostly). 
 However, the problem here is the device effectively committing suicide
when it sees a VPD inquiry.  We do expect things like this sometimes
(mosly from USB manufacturers who seem to regard use of specs the same
way you do) but usually we have to blacklist them too to prevent the
problem command reaching the device.

James

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread Linus Torvalds

On Sat, Jun 11, 2016 at 11:54 AM, Linus Torvalds
 wrote:
>
> Is there some reason to believe that the qemu CD-ROM emulation is the
> only one with this problem?

Side note:the one thing that makes the qemu cd-rom emulator "special"
is not that it's not real hardware: it's that it's a lot more likely
to be tested than just about any other actual cd-rom out there,
especially in environments that test new kernels. Lots of developers
tend to have rather modern machines (and I haven't had a CD-ROM in my
machine for the last couple of years, I think), or alternatively they
end up booting things in emulation because it makes for easy testing.

So I really don't think that "oh, it happened only with a broken
emulated device" is a very strong argument for saying that that
emulated device was the problem.

I really think it's likely that the whole "require VPD" is garbage.
The whole "everybody and their dog has used qemu, and the qemu cd-rom
emulation worked perfectly fine before" is a damn strong argument that
it's the new kernel doing something wrong.

So please figure our what the real breakage was, and fix *that*
instead of blaming qemu.

Linus

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread Linus Torvalds

On Sat, Jun 11, 2016 at 11:54 AM, Linus Torvalds
 wrote:
>
> Is there some reason to believe that the qemu CD-ROM emulation is the
> only one with this problem?

Side note:the one thing that makes the qemu cd-rom emulator "special"
is not that it's not real hardware: it's that it's a lot more likely
to be tested than just about any other actual cd-rom out there,
especially in environments that test new kernels. Lots of developers
tend to have rather modern machines (and I haven't had a CD-ROM in my
machine for the last couple of years, I think), or alternatively they
end up booting things in emulation because it makes for easy testing.

So I really don't think that "oh, it happened only with a broken
emulated device" is a very strong argument for saying that that
emulated device was the problem.

I really think it's likely that the whole "require VPD" is garbage.
The whole "everybody and their dog has used qemu, and the qemu cd-rom
emulation worked perfectly fine before" is a damn strong argument that
it's the new kernel doing something wrong.

So please figure our what the real breakage was, and fix *that*
instead of blaming qemu.

Linus

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread Linus Torvalds

On Sat, Jun 11, 2016 at 11:09 AM, James Bottomley
 wrote:
> Two current fixes: one affects Qemu CD ROM emulation, which stopped
> working after the updates in SCSI to require VPD pages from all
> conformant devices.  Fix temporarily by blacklisting Qemu (we can relax
> later when they come into compliance).

Is there some reason to believe that the qemu CD-ROM emulation is the
only one with this problem?

When some device is known to break because the SCSI layer made some
check stricter, why didn't you just relax the check again?

In other words, you already have one known device that behaves a way
that the new code doesn't like, why do you think the new code is
correct?

And don't answer "specs". Specs are just so much toilet paper.

If the Qemu CD-ROM emulation has worked with not just older versions
of Linux, but presumably Windows and other OS's too, then it's likely
that nobody has ever cared about the VPD pages before, and thus the
new code that requires VPD is likely to break other things too.

So give a reason why you think qemu is _sop_ special, and why the new
and clearly broken "require VPD" code is _so_ important.

And really, if the reason is "specs", then somebody needs to get their
head examined. We've had so many devices that only glance quickly at
the specs that people need to realize that paperwork is one thing, and
reality is another, and the two have only a very tenuous connection.

The commit that adds this "no VPD" entry doesn't even talk about when
we broke this.  What change exactly was it that broke things?

I've pulled this, but I really think that this was completely bogus.
Even if blacklisting ends up being the right thing to do in the end,
the lack of explanations is awful, and the assumption that it's just
one particular device is very very suspect.

Linus

Re: [GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread Linus Torvalds

On Sat, Jun 11, 2016 at 11:09 AM, James Bottomley
 wrote:
> Two current fixes: one affects Qemu CD ROM emulation, which stopped
> working after the updates in SCSI to require VPD pages from all
> conformant devices.  Fix temporarily by blacklisting Qemu (we can relax
> later when they come into compliance).

Is there some reason to believe that the qemu CD-ROM emulation is the
only one with this problem?

When some device is known to break because the SCSI layer made some
check stricter, why didn't you just relax the check again?

In other words, you already have one known device that behaves a way
that the new code doesn't like, why do you think the new code is
correct?

And don't answer "specs". Specs are just so much toilet paper.

If the Qemu CD-ROM emulation has worked with not just older versions
of Linux, but presumably Windows and other OS's too, then it's likely
that nobody has ever cared about the VPD pages before, and thus the
new code that requires VPD is likely to break other things too.

So give a reason why you think qemu is _sop_ special, and why the new
and clearly broken "require VPD" code is _so_ important.

And really, if the reason is "specs", then somebody needs to get their
head examined. We've had so many devices that only glance quickly at
the specs that people need to realize that paperwork is one thing, and
reality is another, and the two have only a very tenuous connection.

The commit that adds this "no VPD" entry doesn't even talk about when
we broke this.  What change exactly was it that broke things?

I've pulled this, but I really think that this was completely bogus.
Even if blacklisting ends up being the right thing to do in the end,
the lack of explanations is awful, and the assumption that it's just
one particular device is very very suspect.

Linus

[GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread James Bottomley

Two current fixes: one affects Qemu CD ROM emulation, which stopped
working after the updates in SCSI to require VPD pages from all
conformant devices.  Fix temporarily by blacklisting Qemu (we can relax
later when they come into compliance).  The other is a fix to the
optimal transfer size.  We set up a minefield for ourselves by being
confused about whether the limits are in bytes or sectors (SCSI optimal
is in blocks and the queue parameter is in bytes).  This tries to fix
the problem (wrong setting for queue limits max_sectors) and make the
problem more obvious by introducing a wrapper function.

The patch is available here:

git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes

The short changelog is:

Ewan D. Milne (1):
  scsi: Add QEMU CD-ROM to VPD Inquiry Blacklist

Martin K. Petersen (1):
  sd: Fix rw_max for devices that report an optimal xfer size

And the diffstat:

 drivers/scsi/scsi_devinfo.c | 1 +
 drivers/scsi/sd.c   | 8 
 drivers/scsi/sd.h   | 5 +
 3 files changed, 10 insertions(+), 4 deletions(-)

With full diff below.

James

---

diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
index 3408578..ff41c31 100644
--- a/drivers/scsi/scsi_devinfo.c
+++ b/drivers/scsi/scsi_devinfo.c
@@ -230,6 +230,7 @@ static struct {
{"PIONEER", "CD-ROM DRM-624X", NULL, BLIST_FORCELUN | BLIST_SINGLELUN},
{"Promise", "VTrak E610f", NULL, BLIST_SPARSELUN | BLIST_NO_RSOC},
{"Promise", "", NULL, BLIST_SPARSELUN},
+   {"QEMU", "QEMU CD-ROM", NULL, BLIST_SKIP_VPD_PAGES},
{"QNAP", "iSCSI Storage", NULL, BLIST_MAX_1024},
{"SYNOLOGY", "iSCSI Storage", NULL, BLIST_MAX_1024},
{"QUANTUM", "XP34301", "1071", BLIST_NOTQ},
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index f459dff..60bff78 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2867,10 +2867,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
if (sdkp->opt_xfer_blocks &&
sdkp->opt_xfer_blocks <= dev_max &&
sdkp->opt_xfer_blocks <= SD_DEF_XFER_BLOCKS &&
-   sdkp->opt_xfer_blocks * sdp->sector_size >= PAGE_SIZE)
-   rw_max = q->limits.io_opt =
-   sdkp->opt_xfer_blocks * sdp->sector_size;
-   else
+   logical_to_bytes(sdp, sdkp->opt_xfer_blocks) >= PAGE_SIZE) {
+   q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks);
+   rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks);
+   } else
rw_max = BLK_DEF_MAX_SECTORS;
 
/* Combine with controller limits */
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 654630b..765a6f1 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -151,6 +151,11 @@ static inline sector_t logical_to_sectors(struct 
scsi_device *sdev, sector_t blo
return blocks << (ilog2(sdev->sector_size) - 9);
 }
 
+static inline unsigned int logical_to_bytes(struct scsi_device *sdev, sector_t 
blocks)
+{
+   return blocks * sdev->sector_size;
+}
+
 /*
  * A DIF-capable target device can be formatted with different
  * protection schemes.  Currently 0 through 3 are defined:

[GIT PULL] SCSI fixes for 4.7-rc2

2016-06-11 Thread James Bottomley

Two current fixes: one affects Qemu CD ROM emulation, which stopped
working after the updates in SCSI to require VPD pages from all
conformant devices.  Fix temporarily by blacklisting Qemu (we can relax
later when they come into compliance).  The other is a fix to the
optimal transfer size.  We set up a minefield for ourselves by being
confused about whether the limits are in bytes or sectors (SCSI optimal
is in blocks and the queue parameter is in bytes).  This tries to fix
the problem (wrong setting for queue limits max_sectors) and make the
problem more obvious by introducing a wrapper function.

The patch is available here:

git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-fixes

The short changelog is:

Ewan D. Milne (1):
  scsi: Add QEMU CD-ROM to VPD Inquiry Blacklist

Martin K. Petersen (1):
  sd: Fix rw_max for devices that report an optimal xfer size

And the diffstat:

 drivers/scsi/scsi_devinfo.c | 1 +
 drivers/scsi/sd.c   | 8 
 drivers/scsi/sd.h   | 5 +
 3 files changed, 10 insertions(+), 4 deletions(-)

With full diff below.

James

---

diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
index 3408578..ff41c31 100644
--- a/drivers/scsi/scsi_devinfo.c
+++ b/drivers/scsi/scsi_devinfo.c
@@ -230,6 +230,7 @@ static struct {
{"PIONEER", "CD-ROM DRM-624X", NULL, BLIST_FORCELUN | BLIST_SINGLELUN},
{"Promise", "VTrak E610f", NULL, BLIST_SPARSELUN | BLIST_NO_RSOC},
{"Promise", "", NULL, BLIST_SPARSELUN},
+   {"QEMU", "QEMU CD-ROM", NULL, BLIST_SKIP_VPD_PAGES},
{"QNAP", "iSCSI Storage", NULL, BLIST_MAX_1024},
{"SYNOLOGY", "iSCSI Storage", NULL, BLIST_MAX_1024},
{"QUANTUM", "XP34301", "1071", BLIST_NOTQ},
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index f459dff..60bff78 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2867,10 +2867,10 @@ static int sd_revalidate_disk(struct gendisk *disk)
if (sdkp->opt_xfer_blocks &&
sdkp->opt_xfer_blocks <= dev_max &&
sdkp->opt_xfer_blocks <= SD_DEF_XFER_BLOCKS &&
-   sdkp->opt_xfer_blocks * sdp->sector_size >= PAGE_SIZE)
-   rw_max = q->limits.io_opt =
-   sdkp->opt_xfer_blocks * sdp->sector_size;
-   else
+   logical_to_bytes(sdp, sdkp->opt_xfer_blocks) >= PAGE_SIZE) {
+   q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks);
+   rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks);
+   } else
rw_max = BLK_DEF_MAX_SECTORS;
 
/* Combine with controller limits */
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 654630b..765a6f1 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -151,6 +151,11 @@ static inline sector_t logical_to_sectors(struct 
scsi_device *sdev, sector_t blo
return blocks << (ilog2(sdev->sector_size) - 9);
 }
 
+static inline unsigned int logical_to_bytes(struct scsi_device *sdev, sector_t 
blocks)
+{
+   return blocks * sdev->sector_size;
+}
+
 /*
  * A DIF-capable target device can be formatted with different
  * protection schemes.  Currently 0 through 3 are defined:

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

Re: [GIT PULL] SCSI fixes for 4.7-rc2

[GIT PULL] SCSI fixes for 4.7-rc2

[GIT PULL] SCSI fixes for 4.7-rc2

22 matches

Site Navigation

Mail list logo

Footer information