Re: [BUG] Regression introduced with "block: split bios to max possible length"

2016-01-22 Thread Ming Lei
On Sat, Jan 23, 2016 at 1:03 AM, Linus Torvalds wrote:
> On Fri, Jan 22, 2016 at 7:06 AM, Ming Lei  wrote:
>>
>> Yes, it is one problem, something like below does fix my test
>> with 4K block size.
>
> It just doesn't look very legible.

OK, I will try to make it better.

>
> Also, how could this
>
>> -   goto split;
>> +   if (sectors)
>> +   goto split;
>
> ever matter? If sectors is 0, something is seriously wrong afaik.

I guess that is possible because blk_max_size_offset() depends on the
offset and may return a size smaller than the logical block size.
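
As a rough illustration, here is a standalone sketch (not the kernel code
itself) of that arithmetic, assuming the behaviour at the time where a
non-zero chunk_sectors limit makes blk_max_size_offset() return the number
of sectors left before the next chunk boundary:

#include <stdio.h>

/* Sketch of the per-offset limit: with a chunk size set, the limit is
 * however many sectors remain before the next chunk boundary. */
static unsigned int max_size_at(unsigned long long sector,
                                unsigned int chunk_sectors,
                                unsigned int max_sectors)
{
        if (!chunk_sectors)
                return max_sectors;
        return chunk_sectors - (sector & (chunk_sectors - 1));
}

int main(void)
{
        unsigned int lbs_sectors = 8;       /* 4KB logical blocks */
        unsigned int chunk = 256;           /* 128KB chunk */
        unsigned long long sector = 250;    /* 6 sectors before a boundary */

        unsigned int max = max_size_at(sector, chunk, 65535);
        unsigned int aligned = max & ~(lbs_sectors - 1);

        /* Prints "max=6 aligned=0": less than one logical block remains,
         * so after rounding down to the logical block size the splittable
         * "sectors" value can end up being zero. */
        printf("max=%u aligned=%u\n", max, aligned);
        return 0;
}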

--
Ming Lei


Re: [BUG] Regression introduced with "block: split bios to max possible length"

2016-01-22 Thread Jens Axboe

On 01/22/2016 07:56 AM, Keith Busch wrote:
> On Thu, Jan 21, 2016 at 08:15:37PM -0800, Linus Torvalds wrote:
>> For the case of nvme, for example, I think the max sector number is so
>> high that you'll never hit that anyway, and you'll only ever hit the
>> chunk limit. No?
>
> The device's max transfer and chunk size are not very large, both fixed
> at 128KB. We can lose ~70% of potential throughput when IO isn't aligned,
> and end users reported this when the block layer stopped splitting on
> alignment for the NVMe drive.
>
> So it's a big deal for this h/w, but now I feel awkward defending a
> device specific feature for the generic block layer.

Honestly, the splitting code is the real piece of crap here; we never
should have gone down that route. Hopefully we can get rid of it soon.
In the meantime, this does need to work. It's an odd hw construct
(basically two devices bolted together), but it's not really an esoteric
thing to support.

> Anyway, the patch was developed with incorrect assumptions. I'd still
> like to try again after reconciling the queue limit constraints, but
> I defer to Jens for the near term.

Instead of scrambling for -rc1, I'd suggest we just revert again and
ensure what we merge for -rc2 is clean and passes the test cases.


--
Jens Axboe



Re: [BUG] Regression introduced with "block: split bios to max possible length"

2016-01-22 Thread Linus Torvalds
On Fri, Jan 22, 2016 at 7:06 AM, Ming Lei  wrote:
>
> Yes, it is one problem, something like below does fix my test
> with 4K block size.

It just doesn't look very legible.

Also, how could this

> -   goto split;
> +   if (sectors)
> +   goto split;

ever matter? If sectors is 0, something is seriously wrong afaik.

  Linus


Re: [BUG] Regression introduced with "block: split bios to max possible length"

2016-01-22 Thread Ming Lei
On Thu, 21 Jan 2016 20:15:37 -0800
Linus Torvalds  wrote:

> On Thu, Jan 21, 2016 at 7:21 PM, Keith Busch  wrote:
> > On Thu, Jan 21, 2016 at 05:12:13PM -0800, Linus Torvalds wrote:
> >>
> >> I assume that in this case it's simply that
> >>
> >>  - max_sectors is some odd number in sectors (ie 65535)
> >>
> >>  - the block size is larger than a sector (ie 4k)
> >
> > Wouldn't that make max sectors non-sensical? Or am I mistaken to think max
> > sectors is supposed to be a valid transfer in multiples of the physical
> > sector size?
> 
> If the controller interface is some 16-bit register, then the maximum
> number of sectors you can specify is 65535.
> 
> But if the disk then doesn't like 512-byte accesses, but wants 4kB or
> whatever, then clearly you can't actually *feed* it that maximum
> number. Not because it's the maximum, but because it's not aligned.
> 
> But that doesn't mean that it's non-sensical. It just means that you
> have to take both things into account.  There may be two totally
> independent things that cause the two (very different) rules on what
> the IO can look like.
> 
> Obviously there are probably games we could play, like always limiting
> the maximum sector number to a multiple of the sector size. That would
> presumably work for Stefan's case, by simply "artificially" making
> max_sectors be 65528 instead.

Yes, it is one problem, something like below does fix my test
with 4K block size.

--
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1699df5..49e0394 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -81,6 +81,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
         unsigned front_seg_size = bio->bi_seg_front_size;
         bool do_split = true;
         struct bio *new = NULL;
+        unsigned max_sectors;
 
         bio_for_each_segment(bv, bio, iter) {
                 /*
@@ -90,20 +91,23 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
                 if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
                         goto split;
 
-                if (sectors + (bv.bv_len >> 9) >
-                    blk_max_size_offset(q, bio->bi_iter.bi_sector)) {
+                /* I/O should be aligned to logical block size */
+                max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
+                max_sectors = ((max_sectors << 9) &
+                               ~(queue_logical_block_size(q) - 1)) >> 9;
+                if (sectors + (bv.bv_len >> 9) > max_sectors) {
                         /*
                          * Consider this a new segment if we're splitting in
                          * the middle of this vector.
                          */
                         if (nsegs < queue_max_segments(q) &&
-                            sectors < blk_max_size_offset(q,
-                                        bio->bi_iter.bi_sector)) {
+                            sectors < max_sectors) {
                                 nsegs++;
-                                sectors = blk_max_size_offset(q,
-                                        bio->bi_iter.bi_sector);
+                                sectors = max_sectors;
                         }
-                        goto split;
+                        if (sectors)
+                                goto split;
+                        /* It is OK to put single bvec into one segment */
                 }
 
                 if (bvprvp && blk_queue_cluster(q)) {


> 
> But I do think it's better to consider them independent issues, and
> just make sure that we always honor those things independently.
> 
> That "honor things independently" used to happen automatically before,
> simply because we'd never split in the middle of a bio segment. And
> since each bio segment was created with the limitations of the device
> in mind, that all worked.
> 
> Now that it splits in the middle of a vector entry, that splitting
> just needs to honor _all_ the rules. Not just the max sector one.
> 
> >> What I think it _should_ do is:
> >>
> >>  (a) check against max sectors like it used to do:
> >>
> >> if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
> >> goto split;
> >
> > This can create less optimal splits for h/w that advertises a chunk size.
> > I know it's a quirky feature (wasn't my idea), but the h/w is very slow
> > when we don't split at the necessary alignments, and we used to handle
> > this split correctly.
> 
> I suspect few high-performance controllers will really have big issues
> with the max_sectors thing. If you have big enough IO that you could
> hit the maximum sector number, you're already pretty well off, you
> might as well split at that point.
> 
> So I think it's ok to split at the max sector case early.
> 
> For the case of nvme, for example, I think the max sector number is so
> high that you'll never hit that anyway, and you'll only ever hit the
> chunk limit. No?
> 
> So in practice

Re: [BUG] Regression introduced with "block: split bios to max possible length"

2016-01-22 Thread Keith Busch
On Thu, Jan 21, 2016 at 08:15:37PM -0800, Linus Torvalds wrote:
> For the case of nvme, for example, I think the max sector number is so
> high that you'll never hit that anyway, and you'll only ever hit the
> chunk limit. No?

The device's max transfer and chunk size are not very large, both fixed
at 128KB. We can lose ~70% of potential throughput when IO isn't aligned,
and end users reported this when the block layer stopped splitting on
alignment for the NVMe drive.

So it's a big deal for this h/w, but now I feel awkward defending a
device specific feature for the generic block layer.

Anyway, the patch was developed with incorrect assumptions. I'd still
like to try again after reconciling the queue limit constraints, but
I defer to Jens for the near term.

Thanks!


Re: [BUG] Regression introduced with "block: split bios to max possible length"

2016-01-21 Thread Linus Torvalds
On Thu, Jan 21, 2016 at 7:21 PM, Keith Busch  wrote:
> On Thu, Jan 21, 2016 at 05:12:13PM -0800, Linus Torvalds wrote:
>>
>> I assume that in this case it's simply that
>>
>>  - max_sectors is some odd number in sectors (ie 65535)
>>
>>  - the block size is larger than a sector (ie 4k)
>
> Wouldn't that make max sectors non-sensical? Or am I mistaken to think max
> sectors is supposed to be a valid transfer in multiples of the physical
> sector size?

If the controller interface is some 16-bit register, then the maximum
number of sectors you can specify is 65535.

But if the disk then doesn't like 512-byte accesses, but wants 4kB or
whatever, then clearly you can't actually *feed* it that maximum
number. Not because it's the maximum, but because it's not aligned.

But that doesn't mean that it's non-sensical. It just means that you
have to take both things into account.  There may be two totally
independent things that cause the two (very different) rules on what
the IO can look like.

Obviously there are probably games we could play, like always limiting
the maximum sector number to a multiple of the sector size. That would
presumably work for Stefan's case, by simply "artificially" making
max_sectors be 65528 instead.
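
A minimal sketch of that rounding (align_max_sectors is a hypothetical
helper, not an existing block-layer function), assuming the logical block
size is a power of two:

#include <stdio.h>

/* Round a sector limit down to a whole number of logical blocks,
 * e.g. 65535 -> 65528 when the logical block size is 4KB. */
static unsigned int align_max_sectors(unsigned int max_sectors,
                                      unsigned int logical_block_size)
{
        unsigned int lbs_sectors = logical_block_size >> 9;

        return max_sectors & ~(lbs_sectors - 1);
}

int main(void)
{
        printf("%u\n", align_max_sectors(65535, 4096));   /* prints 65528 */
        return 0;
}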

But I do think it's better to consider them independent issues, and
just make sure that we always honor those things independently.

That "honor things independently" used to happen automatically before,
simply because we'd never split in the middle of a bio segment. And
since each bio segment was created with the limitations of the device
in mind, that all worked.

Now that it splits in the middle of a vector entry, that splitting
just needs to honor _all_ the rules. Not just the max sector one.

>> What I think it _should_ do is:
>>
>>  (a) check against max sectors like it used to do:
>>
>> if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
>> goto split;
>
> This can create less optimal splits for h/w that advertises a chunk size.
> I know it's a quirky feature (wasn't my idea), but the h/w is very slow
> when we don't split at the necessary alignments, and we used to handle
> this split correctly.

I suspect few high-performance controllers will really have big issues
with the max_sectors thing. If you have big enough IO that you could
hit the maximum sector number, you're already pretty well off, you
might as well split at that point.

So I think it's ok to split at the max sector case early.

For the case of nvme, for example, I think the max sector number is so
high that you'll never hit that anyway, and you'll only ever hit the
chunk limit. No?

So in practice it won't matter, I suspect.

 Linus


Re: [BUG] Regression introduced with "block: split bios to max possible length"

2016-01-21 Thread Keith Busch
On Thu, Jan 21, 2016 at 05:12:13PM -0800, Linus Torvalds wrote:
> On Thu, Jan 21, 2016 at 2:51 PM, Keith Busch  wrote:
> >
> > My apologies for the trouble. I trust it really is broken, but I don't
> > quite see how. The patch supposedly splits the transfer to the max size
> > the request queue says it allows. How does the max allowed size end up
> > an invalid multiple?
> 
> I assume that in this case it's simply that
> 
>  - max_sectors is some odd number in sectors (ie 65535)
> 
>  - the block size is larger than a sector (ie 4k)

Wouldn't that make max sectors non-sensical? Or am I mistaken to think max
sectors is supposed to be a valid transfer in multiples of the physical
sector size?

I do think this is what's happening, though. A recent commit (ca369d51b)
caps max_sectors at 255 by default, which isn't right for 4k. A driver
has to override the queue's limits.max_dev_sectors first to get the
desired limit for its storage.

I'm not sure if that was the intention. There are lots of drivers
requesting more than 255 that are probably unaware they're not getting
it, DASD included. I don't think we'd have seen this problem if the
requested setting hadn't been overridden.
 
> What I think it _should_ do is:
> 
>  (a) check against max sectors like it used to do:
> 
> if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
> goto split;

This can create less optimal splits for h/w that advertises a chunk size.
I know it's a quirky feature (wasn't my idea), but the h/w is very slow
when we don't split at the necessary alignments, and we used to handle
this split correctly.

Point taken, though. The implementation needs some cleaning up.


Re: [BUG] Regression introduced with "block: split bios to max possible length"

2016-01-21 Thread Linus Torvalds
On Thu, Jan 21, 2016 at 2:51 PM, Keith Busch  wrote:
>
> My apologies for the trouble. I trust it really is broken, but I don't
> quite see how. The patch supposedly splits the transfer to the max size
> the request queue says it allows. How does the max allowed size end up
> an invalid multiple?

I assume that in this case it's simply that

 - max_sectors is some odd number in sectors (ie 65535)

 - the block size is larger than a sector (ie 4k)

 - the device probably doesn't even have any silly chunk size issue
(so chunk_sectors is zero).

and because the block size is 4k, no valid IO can ever generate
anything but 4k-aligned IO's, and everything is fine.

Except now the "split bios" patch will split blindly at the
max_sectors size, which is pure and utter garbage, since it doesn't
take the minimum block size into account.
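
To make the arithmetic concrete: 65535 sectors is 65535 * 512 = 33,553,920
bytes, which is not a multiple of 4096 (65535 = 8 * 8191 + 7). So a blind
split at max_sectors hands the driver a piece that is not a whole number of
4k blocks and leaves a misaligned remainder behind, which is exactly the
kind of request DASD rejects.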

Also, quite frankly, I think that whole "split bios" patch is garbage *anyway*.

The thing is, the whole "blk_max_size_offset()" use there is broken.
What I think it _should_ do is:

 (a) check against max sectors like it used to do:

if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
goto split;

 (b) completely separately, and _independently_ of that max sector
check, it should check against the "chunk_sectors" limit if it exists.

instead, it uses that nasty blk_max_size_offset() crap, which is
broken because it's confusing, but also because it doesn't honor
max_sectors AT ALL if there is a chunking size.

So I think chunking size should be independent of max_sectors. I could
see some device that has some absolute max sector size, but that
_also_ wants to split so that the bio never crosses a particular chunk
size (perhaps due to RAID, perhaps due to some internal device block
handling rules).

Trying to mix the two things with those "blk_max_size_offset()" games
is just wrong.
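
A minimal sketch of those two independent checks (need_split is a
hypothetical helper using plain parameters instead of the request_queue;
chunk_sectors == 0 means there is no chunk limit):

#include <stdbool.h>

/* Split when either limit, taken on its own, would be exceeded.
 * new_sectors is the size of the bio being built up plus the next bvec,
 * in 512-byte sectors. */
static bool need_split(unsigned long long start_sector,
                       unsigned int new_sectors,
                       unsigned int max_sectors,
                       unsigned int chunk_sectors)
{
        /* (a) absolute transfer-size limit of the device */
        if (new_sectors > max_sectors)
                return true;

        /* (b) independently, never cross a chunk boundary if one is set */
        if (chunk_sectors &&
            new_sectors > chunk_sectors -
                          (start_sector & (chunk_sectors - 1)))
                return true;

        return false;
}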

   Linus


Re: [BUG] Regression introduced with "block: split bios to max possible length"

2016-01-21 Thread Keith Busch
On Thu, Jan 21, 2016 at 02:34:28PM -0700, Jens Axboe wrote:
> On 01/21/2016 07:57 AM, Stefan Haberland wrote:
> > Hi,
> >
> > Unfortunately commit e36f62042880 "block: split bios to max possible length"
> > breaks the DASD driver on s390. We expect the block requests to be
> > multiples of 4k in size. With the patch applied I see the requests split
> > up in multiples of 512 bytes, and therefore the requests get rejected and
> > lots of I/Os fail.
> 
> Sigh, that's definitely a bug. I'll take a look at it and see if I
> can get a tested fix in for 4.5-rc1. If not, we'll revert it, again.

My apologies for the trouble. I trust it really is broken, but I don't
quite see how. The patch supposedly splits the transfer to the max size
the request queue says it allows. How does the max allowed size end up
an invalid multiple?


Re: [BUG] Regression introduced with "block: split bios to max possible length"

2016-01-21 Thread Jens Axboe

On 01/21/2016 07:57 AM, Stefan Haberland wrote:
> Hi,
>
> Unfortunately commit e36f62042880 "block: split bios to max possible length"
> breaks the DASD driver on s390. We expect the block requests to be
> multiples of 4k in size. With the patch applied I see the requests split
> up in multiples of 512 bytes, and therefore the requests get rejected and
> lots of I/Os fail.

Sigh, that's definitely a bug. I'll take a look at it and see if I can
get a tested fix in for 4.5-rc1. If not, we'll revert it, again.



--
Jens Axboe



[BUG] Regression introduced with "block: split bios to max possible length"

2016-01-21 Thread Stefan Haberland

Hi,

Unfortunately commit e36f62042880 "block: split bios to max possible length"
breaks the DASD driver on s390. We expect the block requests to be multiples
of 4k in size. With the patch applied I see the requests split up in
multiples of 512 bytes, and therefore the requests get rejected and lots of
I/Os fail.

Could you please fix or revert the patch?

We have the following values set for the block queue:

hw_sector_size         4096
logical_block_size     4096
minimum_io_size        4096
physical_block_size    4096
max_hw_sectors_kb      760
max_sectors_kb         760
max_segment_size       4096
max_segments           65535

Regards,
Stefan