Re: [BUG] Regression introduced with "block: split bios to max possible length"
On Sat, Jan 23, 2016 at 1:03 AM, Linus Torvalds wrote:
> On Fri, Jan 22, 2016 at 7:06 AM, Ming Lei wrote:
>>
>> Yes, it is one problem, something like below does fix my test
>> with 4K block size.
>
> It just doesn't look very legible.

OK, I will try to make it better.

> Also, how could this
>
>> -		goto split;
>> +		if (sectors)
>> +			goto split;
>
> ever matter? If sectors is 0, something is seriously wrong afaik.

I guess that is possible because blk_max_size_offset() depends on the
offset and may return a size smaller than the logical block size.

--
Ming Lei
Re: [BUG] Regression introduced with "block: split bios to max possible length"
On 01/22/2016 07:56 AM, Keith Busch wrote:
> On Thu, Jan 21, 2016 at 08:15:37PM -0800, Linus Torvalds wrote:
>> For the case of nvme, for example, I think the max sector number is so
>> high that you'll never hit that anyway, and you'll only ever hit the
>> chunk limit. No?
>
> The device's max transfer and chunk size are not very large, both fixed
> at 128KB. We can lose ~70% of potential throughput when IO isn't
> aligned, and end users reported this when the block layer stopped
> splitting on alignment for the NVMe drive. So it's a big deal for this
> h/w, but now I feel awkward defending a device specific feature for the
> generic block layer.

Honestly, the splitting code is what is a piece of crap, we never should
have gone down that route. Hopefully we can get rid of it soon. In the
mean time, this does need to work. It's an odd hw construct (basically
two devices bolted together), but it's not really an esoteric thing to
support.

> Anyway, the patch was developed with incorrect assumptions. I'd still
> like to try again after reconciling the queue limit constraints, but I
> defer to Jens for the near term.

Instead of scrambling for -rc1, I'd suggest we just revert again and
ensure what we merge for -rc2 is clean and passes the test cases.

--
Jens Axboe
Re: [BUG] Regression introduced with "block: split bios to max possible length"
On Fri, Jan 22, 2016 at 7:06 AM, Ming Lei wrote:
>
> Yes, it is one problem, something like below does fix my test
> with 4K block size.

It just doesn't look very legible.

Also, how could this

> -		goto split;
> +		if (sectors)
> +			goto split;

ever matter? If sectors is 0, something is seriously wrong afaik.

Linus
Re: [BUG] Regression introduced with "block: split bios to max possible length"
On Thu, 21 Jan 2016 20:15:37 -0800 Linus Torvalds wrote:
> On Thu, Jan 21, 2016 at 7:21 PM, Keith Busch wrote:
> > On Thu, Jan 21, 2016 at 05:12:13PM -0800, Linus Torvalds wrote:
> >>
> >> I assume that in this case it's simply that
> >>
> >>  - max_sectors is some odd number in sectors (ie 65535)
> >>
> >>  - the block size is larger than a sector (ie 4k)
> >
> > Wouldn't that make max sectors non-sensical? Or am I mistaken to think
> > max sectors is supposed to be a valid transfer in multiples of the
> > physical sector size?
>
> If the controller interface is some 16-bit register, then the maximum
> number of sectors you can specify is 65535.
>
> But if the disk then doesn't like 512-byte accesses, but wants 4kB or
> whatever, then clearly you can't actually *feed* it that maximum
> number. Not because it's a maximal, but because it's not aligned.
>
> But that doesn't mean that it's non-sensical. It just means that you
> have to take both things into account. There may be two totally
> independent things that cause the two (very different) rules on what
> the IO can look like.
>
> Obviously there are probably games we could play, like always limiting
> the maximum sector number to a multiple of the sector size. That would
> presumably work for Stefan's case, by simply "artificially" making
> max_sectors be 65528 instead.

Yes, it is one problem, something like below does fix my test
with 4K block size.

---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1699df5..49e0394 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -81,6 +81,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	unsigned front_seg_size = bio->bi_seg_front_size;
 	bool do_split = true;
 	struct bio *new = NULL;
+	unsigned max_sectors;
 
 	bio_for_each_segment(bv, bio, iter) {
 		/*
@@ -90,20 +91,23 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 		if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
 			goto split;
 
-		if (sectors + (bv.bv_len >> 9) >
-				blk_max_size_offset(q, bio->bi_iter.bi_sector)) {
+		/* I/O should be aligned to logical block size */
+		max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
+		max_sectors = ((max_sectors << 9) &
+				~(queue_logical_block_size(q) - 1)) >> 9;
+		if (sectors + (bv.bv_len >> 9) > max_sectors) {
 			/*
 			 * Consider this a new segment if we're splitting in
 			 * the middle of this vector.
 			 */
 			if (nsegs < queue_max_segments(q) &&
-					sectors < blk_max_size_offset(q,
-						bio->bi_iter.bi_sector)) {
+					sectors < max_sectors) {
 				nsegs++;
-				sectors = blk_max_size_offset(q,
-						bio->bi_iter.bi_sector);
+				sectors = max_sectors;
 			}
-			goto split;
+			if (sectors)
+				goto split;
+			/* It is OK to put single bvec into one segment */
 		}
 
 		if (bvprvp && blk_queue_cluster(q)) {

> But I do think it's better to consider them independent issues, and
> just make sure that we always honor those things independently.
>
> That "honor things independently" used to happen automatically before,
> simply because we'd never split in the middle of a bio segment. And
> since each bio segment was created with the limitations of the device
> in mind, that all worked.
>
> Now that it splits in the middle of a vector entry, that splitting
> just needs to honor _all_ the rules. Not just the max sector one.
>
> >> What I think it _should_ do is:
> >>
> >>  (a) check against max sectors like it used to do:
> >>
> >>         if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
> >>                 goto split;
> >
> > This can create less optimal splits for h/w that advertise chunk size.
> > I know it's a quirky feature (wasn't my idea), but the h/w is very slow
> > to not split at the necessary alignments, and we used to handle this
> > split correctly.
>
> I suspect few high-performance controllers will really have big issues
> with the max_sectors thing. If you have big enough IO that you could
> hit the maximum sector number, you're already pretty well off, you
> might as well split at that point.
>
> So I think it's ok to split at the max sector case early.
>
> For the case of nvme, for example, I think the max sector number is so
> high that you'll never hit that anyway, and you'll only ever hit the
> chunk limit. No?
>
> So in practice it won't matter, I suspect.
Re: [BUG] Regression introduced with "block: split bios to max possible length"
On Thu, Jan 21, 2016 at 08:15:37PM -0800, Linus Torvalds wrote:
> For the case of nvme, for example, I think the max sector number is so
> high that you'll never hit that anyway, and you'll only ever hit the
> chunk limit. No?

The device's max transfer and chunk size are not very large, both fixed
at 128KB. We can lose ~70% of potential throughput when IO isn't
aligned, and end users reported this when the block layer stopped
splitting on alignment for the NVMe drive. So it's a big deal for this
h/w, but now I feel awkward defending a device specific feature for the
generic block layer.

Anyway, the patch was developed with incorrect assumptions. I'd still
like to try again after reconciling the queue limit constraints, but I
defer to Jens for the near term.

Thanks!
Re: [BUG] Regression introduced with "block: split bios to max possible length"
On Thu, Jan 21, 2016 at 7:21 PM, Keith Busch wrote:
> On Thu, Jan 21, 2016 at 05:12:13PM -0800, Linus Torvalds wrote:
>>
>> I assume that in this case it's simply that
>>
>>  - max_sectors is some odd number in sectors (ie 65535)
>>
>>  - the block size is larger than a sector (ie 4k)
>
> Wouldn't that make max sectors non-sensical? Or am I mistaken to think
> max sectors is supposed to be a valid transfer in multiples of the
> physical sector size?

If the controller interface is some 16-bit register, then the maximum
number of sectors you can specify is 65535.

But if the disk then doesn't like 512-byte accesses, but wants 4kB or
whatever, then clearly you can't actually *feed* it that maximum
number. Not because it's a maximal, but because it's not aligned.

But that doesn't mean that it's non-sensical. It just means that you
have to take both things into account. There may be two totally
independent things that cause the two (very different) rules on what
the IO can look like.

Obviously there are probably games we could play, like always limiting
the maximum sector number to a multiple of the sector size. That would
presumably work for Stefan's case, by simply "artificially" making
max_sectors be 65528 instead.

But I do think it's better to consider them independent issues, and
just make sure that we always honor those things independently.

That "honor things independently" used to happen automatically before,
simply because we'd never split in the middle of a bio segment. And
since each bio segment was created with the limitations of the device
in mind, that all worked.

Now that it splits in the middle of a vector entry, that splitting
just needs to honor _all_ the rules. Not just the max sector one.

>> What I think it _should_ do is:
>>
>>  (a) check against max sectors like it used to do:
>>
>>         if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
>>                 goto split;
>
> This can create less optimal splits for h/w that advertise chunk size.
> I know it's a quirky feature (wasn't my idea), but the h/w is very slow
> to not split at the necessary alignments, and we used to handle this
> split correctly.

I suspect few high-performance controllers will really have big issues
with the max_sectors thing. If you have big enough IO that you could
hit the maximum sector number, you're already pretty well off, you
might as well split at that point.

So I think it's ok to split at the max sector case early.

For the case of nvme, for example, I think the max sector number is so
high that you'll never hit that anyway, and you'll only ever hit the
chunk limit. No?

So in practice it won't matter, I suspect.

Linus
Re: [BUG] Regression introduced with "block: split bios to max possible length"
On Thu, Jan 21, 2016 at 05:12:13PM -0800, Linus Torvalds wrote:
> On Thu, Jan 21, 2016 at 2:51 PM, Keith Busch wrote:
> >
> > My apologies for the trouble. I trust it really is broken, but I don't
> > quite see how. The patch supposedly splits the transfer to the max size
> > the request queue says it allows. How does the max allowed size end up
> > an invalid multiple?
>
> I assume that in this case it's simply that
>
>  - max_sectors is some odd number in sectors (ie 65535)
>
>  - the block size is larger than a sector (ie 4k)

Wouldn't that make max sectors non-sensical? Or am I mistaken to think
max sectors is supposed to be a valid transfer in multiples of the
physical sector size?

I do think this is what's happening though. A recent commit (ca369d51b)
limits the max_sectors to 255 max by default, which isn't right for 4k.
A driver has to override the queue's limits.max_dev_sectors first to
get the desired limit for their storage. I'm not sure if that was the
intention. There are lots of drivers requesting more than 255 and
probably unaware they're not getting it, DASD included. I don't think
we'd have seen this problem if the requested setting wasn't overridden.

> What I think it _should_ do is:
>
>  (a) check against max sectors like it used to do:
>
>         if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
>                 goto split;

This can create less optimal splits for h/w that advertise chunk size.
I know it's a quirky feature (wasn't my idea), but the h/w is very slow
to not split at the necessary alignments, and we used to handle this
split correctly.

Point taken, though. The implementation needs some cleaning up.
Re: [BUG] Regression introduced with "block: split bios to max possible length"
On Thu, Jan 21, 2016 at 2:51 PM, Keith Busch wrote:
>
> My apologies for the trouble. I trust it really is broken, but I don't
> quite see how. The patch supposedly splits the transfer to the max size
> the request queue says it allows. How does the max allowed size end up
> an invalid multiple?

I assume that in this case it's simply that

 - max_sectors is some odd number in sectors (ie 65535)

 - the block size is larger than a sector (ie 4k)

 - the device probably doesn't even have any silly chunk size issue
   (so chunk_sectors is zero).

and because the block size is 4k, no valid IO can ever generate anything
but 4k-aligned IO's, and everything is fine.

Except now the "split bios" patch will split blindly at the max_sectors
size, which is pure and utter garbage, since it doesn't take the minimum
block size into account.

Also, quite frankly, I think that whole "split bios" patch is garbage
*anyway*. The thing is, the whole "blk_max_size_offset()" use there is
broken.

What I think it _should_ do is:

 (a) check against max sectors like it used to do:

        if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
                goto split;

 (b) completely separately, and _independently_ of that max sector
     check, it should check against the "chunk_sectors" limit if it
     exists.

instead, it uses that nasty blk_max_size_offset() crap, which is broken
because it's confusing, but also because it doesn't honor max_sectors
AT ALL if there is a chunking size.

So I think chunking size should be independent of max_sectors. I could
see some device that has some absolute max sector size, but that _also_
wants to split so that the bio never crosses a particular chunk size
(perhaps due to RAID, perhaps due to some internal device block handling
rules).

Trying to mix the two things with those "blk_max_size_offset()" games is
just wrong.

Linus
Re: [BUG] Regression introduced with "block: split bios to max possible length"
On Thu, Jan 21, 2016 at 02:34:28PM -0700, Jens Axboe wrote:
> On 01/21/2016 07:57 AM, Stefan Haberland wrote:
> > Hi,
> >
> > unfortunately commit e36f62042880 "block: split bios to max possible
> > length" breaks the DASD driver on s390. We expect the block requests
> > to be multiple of 4k in size. With the patch applied I see the
> > requests split up in multiple of 512 byte and therefore the requests
> > get rejected and lots of I/Os fail.
>
> Sigh, that's definitely a bug. I'll take a look at it and see if I
> can get a tested fix in for 4.5-rc1. If not, we'll revert it, again.

My apologies for the trouble. I trust it really is broken, but I don't
quite see how. The patch supposedly splits the transfer to the max size
the request queue says it allows. How does the max allowed size end up
an invalid multiple?
Re: [BUG] Regression introduced with "block: split bios to max possible length"
On 01/21/2016 07:57 AM, Stefan Haberland wrote:
> Hi,
>
> unfortunately commit e36f62042880 "block: split bios to max possible
> length" breaks the DASD driver on s390. We expect the block requests
> to be multiple of 4k in size. With the patch applied I see the
> requests split up in multiple of 512 byte and therefore the requests
> get rejected and lots of I/Os fail.

Sigh, that's definitely a bug. I'll take a look at it and see if I can
get a tested fix in for 4.5-rc1. If not, we'll revert it, again.

--
Jens Axboe
[BUG] Regression introduced with "block: split bios to max possible length"
Hi,

unfortunately commit e36f62042880 "block: split bios to max possible
length" breaks the DASD driver on s390. We expect the block requests to
be multiples of 4k in size. With the patch applied I see the requests
split up in multiples of 512 bytes and therefore the requests get
rejected and lots of I/Os fail. Could you please fix or revert the
patch?

We have the following values set for the block queue:

hw_sector_size        4096
logical_block_size    4096
minimum_io_size       4096
physical_block_size   4096
max_hw_sectors_kb     760
max_sectors_kb        760
max_segment_size      4096
max_segments          65535

Regards,
Stefan