The idea behind this is simple:
1) for the none scheduler, the driver tag has to be borrowed for the
flush rq, otherwise we may run out of tags and an IO hang is caused.
Since get/put of the driver tag is actually a nop in this case,
reordering tags isn't necessary at all.
2) for a real I/O scheduler, we needn't allocate a driver tag.
With an I/O scheduler, no request should have a tag
assigned before dispatch, so add a warning to catch
possible bugs.
Signed-off-by: Ming Lei
---
block/blk-mq-sched.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq-sched.c
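For the scheduler case, the guard can be as small as one assertion;
a minimal sketch, assuming the v4.13-era blk-mq convention that
rq->tag == -1 means "no driver tag held":

	/* Hedged sketch, not the patch itself: with a real I/O
	 * scheduler, a request must not own a driver tag before
	 * it is dispatched. */
	WARN_ON_ONCE(rq->tag != -1);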
We need this helper to put the driver tag of the flush rq, since
we will not share the tag in the flush request sequence in case
of an I/O scheduler.
Also, the driver tag needs to be released before requeuing.
Signed-off-by: Ming Lei
---
block/blk-mq.c | 32
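A sketch of what such a helper might look like against the v4.13-era
tag interfaces (illustrative, not the patch's exact code):

static void blk_mq_put_driver_tag(struct request *rq)
{
	struct blk_mq_hw_ctx *hctx;

	if (rq->tag == -1)		/* no driver tag held */
		return;

	hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
	rq->tag = -1;
}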
Now we always preallocate one driver tag before blk_insert_flush(),
and the flush request is marked as RQF_FLUSH_SEQ once it enters the
flush machinery.
So if RQF_FLUSH_SEQ isn't set, we call blk_insert_flush()
to handle the request; otherwise the flush request is
dispatched to the ->dispatch list directly.
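The insert-path decision then reads roughly as follows (a sketch;
blk_mq_request_bypass_insert()'s run_queue parameter comes from the
next patch in the series):

	if (op_is_flush(rq->cmd_flags)) {
		if (!(rq->rq_flags & RQF_FLUSH_SEQ))
			/* not yet sequenced: enter the flush machinery */
			blk_insert_flush(rq);
		else
			/* already sequenced: straight to ->dispatch,
			 * no queue run needed here */
			blk_mq_request_bypass_insert(rq, false);
	}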
The block flush code needs this function without running the queue,
so introduce a parameter for that.
Signed-off-by: Ming Lei
---
block/blk-core.c | 2 +-
block/blk-mq.c | 5 +++--
block/blk-mq.h | 2 +-
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/block/blk-core.c
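A sketch of the changed signature, based on the v4.13-era blk-mq
internals (illustrative, not the patch itself):

/* Hedged sketch: insert the rq into the hctx dispatch list, and only
 * run the queue when the caller asks for it (the new parameter). */
void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
{
	struct blk_mq_ctx *ctx = rq->mq_ctx;
	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);

	spin_lock(&hctx->lock);
	list_add_tail(&rq->queuelist, &hctx->dispatch);
	spin_unlock(&hctx->lock);

	if (run_queue)
		blk_mq_run_hw_queue(hctx, false);
}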
In the following patch, we will use RQF_FLUSH_SEQ
to decide:
- if the flag isn't set, the flush rq needs to be inserted
via blk_insert_flush()
- otherwise, the flush rq needs to be dispatched directly,
since it is in the flush machinery now.
So we use blk_mq_request_bypass_insert() for requests
of
Hi,
This patchset avoids allocating a driver tag beforehand for the flush
rq in case of an I/O scheduler; then the flush rq isn't treated
specially wrt. get/put of the driver tag, and the code gets cleaned
up a lot, e.g., reorder_tags_to_front() is removed, and we needn't
worry about request order in the dispatch list for
blk_insert_flush() should only insert the request, since a queue
run always follows it.
For the case of bypassing the flush, we don't need to run the
queue, since every blk_insert_flush() is followed by one queue run.
Signed-off-by: Ming Lei
---
block/blk-flush.c | 2 +-
1 file changed, 1
On Sat, 2017-09-16 at 07:35 +0900, Damien Le Moal wrote:
> rw16 is mandatory for ZBC drives. So it has to be set to true. If the
> HBA does not support rw16 (why would that happen ?), then the disk
> should not be used.
It's good that all HBAs support rw16. But it's nontrivial to analyze whether
On Fri, Sep 15, 2017 at 12:06:41PM -0600, Jens Axboe wrote:
> On 09/15/2017 08:29 AM, Jens Axboe wrote:
> > On 09/14/2017 08:20 PM, Ming Lei wrote:
> >> On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
> >>> On 09/14/2017 10:42 AM, Ming Lei wrote:
> Hi,
>
> This patchset
On 9/15/17 19:44, Hannes Reinecke wrote:
> On 09/15/2017 12:06 PM, Damien Le Moal wrote:
>> Fix comments style (do not use documented comment style) and add some
>> comments to clarify some functions. Also fix some functions signature
>> indentation and remove a useless blank line in
On 9/16/17 02:45, Christoph Hellwig wrote:
> On Fri, Sep 15, 2017 at 07:06:34PM +0900, Damien Le Moal wrote:
>> __blk_mq_debugfs_rq_show() and blk_mq_debugfs_rq_show() are exported
>> symbols but ar eonly declared in the block internal file
>> block/blk-mq-debugfs.h. which is not cleanly
On 9/16/17 06:02, Bart Van Assche wrote:
> On Fri, 2017-09-15 at 19:51 +0200, h...@lst.de wrote:
>> On Fri, Sep 15, 2017 at 02:51:03PM +, Bart Van Assche wrote:
>>> On Fri, 2017-09-15 at 19:06 +0900, Damien Le Moal wrote:
Rearrange sd_zbc_setup() to include use_16_for_rw and use_10_for_rw
On Sat, 2017-09-16 at 00:44 +0800, Ming Lei wrote:
> +static void save_path_queue_depth(struct pgpath *p)
> +{
> + struct request_queue *q = bdev_get_queue(p->path.dev->bdev);
> +
> + p->old_nr_requests = q->nr_requests;
> + p->queue_depth = q->queue_depth;
> +
> + /* one extra
On Sat, 2017-09-16 at 00:44 +0800, Ming Lei wrote:
> 1) lpfc.lpfc_lun_queue_depth=3, so that it is same with .cmd_per_lun
Nobody I know uses such a low queue depth for the lpfc driver. Please also
include performance results for a more realistic queue depth.
Thanks,
Bart.
On Fri, 2017-09-15 at 19:51 +0200, h...@lst.de wrote:
> On Fri, Sep 15, 2017 at 02:51:03PM +, Bart Van Assche wrote:
> > On Fri, 2017-09-15 at 19:06 +0900, Damien Le Moal wrote:
> > > Rearrange sd_zbc_setup() to include use_16_for_rw and use_10_for_rw
> > > assignments and move the calculation
On Sat, 2017-09-16 at 00:44 +0800, Ming Lei wrote:
> ---------------------------------------------------
>         | v4.13+        | v4.13+
>         | +scsi_mq_perf | +scsi_mq_perf+patches
> ---------------------------------------------------
> IOPS(K) | MQ-DEADLINE   | MQ-DEADLINE
>
On Fri, 2017-09-15 at 16:06 -0400, Mike Snitzer wrote:
> The problem is that multipath_clone_and_map() is now treated as common
> code (thanks to both blk-mq and old .request_fn now enjoying the use of
> blk_get_request) BUT: Ming please understand that this code is used by
> old .request_fn too.
On Fri, Sep 15 2017 at 12:44pm -0400,
Ming Lei wrote:
> The actual I/O schedule is done in dm-mpath layer, and the
> underlying I/O schedule is simply bypassed.
>
> This patch sets underlying queue's nr_requests as its queue's
> queue_depth, then we can get its queue busy
On Fri, Sep 15 2017 at 1:29pm -0400,
Bart Van Assche wrote:
> On Sat, 2017-09-16 at 00:44 +0800, Ming Lei wrote:
> > blk-mq will rerun queue via RESTART after one request is completion,
> > so not necessary to wait random time for requeuing, it should trust
> > blk-mq to
Hi Anish,
I looked over the code a bit, and I'm rather confused by the newly
added commands. Which controller supports them? Also the NVMe
working group went down a very different way with the ALUA approach,
which uses different grouping concepts and doesn't require path
activations - for Linux
On 09/15/2017 08:29 AM, Jens Axboe wrote:
> On 09/14/2017 08:20 PM, Ming Lei wrote:
>> On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
>>> On 09/14/2017 10:42 AM, Ming Lei wrote:
Hi,
This patchset avoids allocating a driver tag beforehand for the flush
rq in case of an I/O
On Sat, 2017-09-16 at 00:44 +0800, Ming Lei wrote:
> If .queue_rq() returns BLK_STS_RESOURCE, blk-mq will rerun
> the queue in the three situations:
>
> 1) if BLK_MQ_S_SCHED_RESTART is set
> - queue is rerun after one rq is completed, see blk_mq_sched_restart()
> which is run from
Looks fine,
Reviewed-by: Christoph Hellwig
Looks fine,
Reviewed-by: Christoph Hellwig
On Fri, Sep 15, 2017 at 02:51:03PM +, Bart Van Assche wrote:
> On Fri, 2017-09-15 at 19:06 +0900, Damien Le Moal wrote:
> > Rearrange sd_zbc_setup() to include use_16_for_rw and use_10_for_rw
> > assignments and move the calculation of sdkp->zone_shift together
> > with the assignment of the
Looks fine,
Reviewed-by: Christoph Hellwig
Looks fine,
Reviewed-by: Christoph Hellwig
> +struct blk_zoned {
> +	unsigned int		nr_zones;
> +	unsigned long		*seq_zones;
> +};
> +
>  struct blk_zone_report_hdr {
>  	unsigned int		nr_zones;
>  	u8			padding[60];
> @@ -492,6 +497,10 @@ struct request_queue {
>  	struct blk_integrity	integrity;
>
Same as for patch 1: this should stay local to block/ - we don't
want random drivers to grow I/O schedulers.
On Fri, Sep 15, 2017 at 07:06:34PM +0900, Damien Le Moal wrote:
> __blk_mq_debugfs_rq_show() and blk_mq_debugfs_rq_show() are exported
> symbols but ar eonly declared in the block internal file
> block/blk-mq-debugfs.h. which is not cleanly accessible to files outside
> of the block directory.
>
On Sat, 2017-09-16 at 00:44 +0800, Ming Lei wrote:
> blk-mq will rerun queue via RESTART after one request is completion,
> so not necessary to wait random time for requeuing, it should trust
> blk-mq to do it.
>
> Signed-off-by: Ming Lei
> ---
> drivers/md/dm-mpath.c | 2
It is quite normal to see allocation failures here, so there is
no need to dump a warning and annoy people.
Signed-off-by: Ming Lei
---
drivers/md/dm-mpath.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index f5a1088a6e79..f57ad8621c4c
The actual I/O scheduling is done in the dm-mpath layer, and the
underlying I/O scheduler is simply bypassed.
This patch sets the underlying queue's nr_requests to its own
queue's queue_depth; then we can get the queue's busy feedback by
simply checking whether blk_get_request() returns successfully.
In this way,
dm-mpath needs this API for improving IO scheduling.
The IO scheduling is actually done on the dm-rq (mpath) queue,
instead of on the underlying devices.
If we set q->nr_requests to q->queue_depth on the underlying
devices, we can get the queue's busy feedback by simply
checking whether blk_get_request() returns
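The busy check then becomes trivial; dm-mpath's clone path already
allocates a request this way. A sketch, assuming the v4.13
blk_get_request() signature:

	/* Hedged sketch: with q->nr_requests == q->queue_depth on the
	 * underlying queue, a failed non-blocking allocation means the
	 * device is busy, so back off and let the scheduler merge more. */
	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
	if (IS_ERR(clone))
		return DM_MAPIO_DELAY_REQUEUE;	/* busy feedback */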
blk-mq will rerun the queue via RESTART after one request is
completed, so it isn't necessary to wait a random amount of time
before requeuing; we should trust blk-mq to do it.
Signed-off-by: Ming Lei
---
drivers/md/dm-mpath.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git
Hi,
We depend on the I/O scheduler in the dm-mpath layer, and the
underlying I/O scheduler is basically bypassed.
The I/O scheduler depends on the queue-busy condition to
trigger I/O merging; unfortunately, inside dm-mpath,
the underlying queue's busy feedback is not accurate
enough, and we just allocate one request and
On Fri, 2017-09-15 at 19:06 +0900, Damien Le Moal wrote:
> instead of open coding, use the min() macro to calculate a report zones
> reply buffer length in sd_zbc_check_zone_size() and the round_up()
> macro for calculating the number of zones in sd_zbc_setup().
Reviewed-by: Bart Van Assche
On Fri, 2017-09-15 at 19:06 +0900, Damien Le Moal wrote:
> + * There is no write constraints on conventional zones. So any write
^^^
Should this have been "There are no"?
> - if (sdkp->zones_wlock &&
> - test_and_set_bit(zno, sdkp->zones_wlock))
> + if
On Thu, Sep 14, 2017 at 02:02:04PM -0700, Shaohua Li wrote:
> From: Shaohua Li
>
> kthread usually runs jobs on behalf of other threads. The jobs should be
> charged to cgroup of original threads. But the jobs run in a kthread,
> where we lose the cgroup context of original threads.
On Fri, 2017-09-15 at 19:06 +0900, Damien Le Moal wrote:
> Rearrange sd_zbc_setup() to include use_16_for_rw and use_10_for_rw
> assignments and move the calculation of sdkp->zone_shift together
> with the assignment of the verified zone_blocks value in
> sd_zbc_check_zone_size().
Both functions
On Fri, 2017-09-15 at 19:06 +0900, Damien Le Moal wrote:
> @@ -492,6 +497,10 @@ struct request_queue {
>  	struct blk_integrity	integrity;
>  #endif /* CONFIG_BLK_DEV_INTEGRITY */
>
> +#ifdef CONFIG_BLK_DEV_ZONED
> +	struct blk_zoned	zoned;
> +#endif
> +
>  #ifdef CONFIG_PM
On Fri, 2017-09-15 at 19:06 +0900, Damien Le Moal wrote:
> The functions blk_mq_sched_free_hctx_data(), blk_mq_sched_try_merge(),
> blk_mq_sched_try_insert_merge() and blk_mq_sched_request_inserted() are
> all exported symbols but are declared only internally in
> block/blk-mq-sched.h. Move these
On Fri, 2017-09-15 at 19:06 +0900, Damien Le Moal wrote:
> __blk_mq_debugfs_rq_show() and blk_mq_debugfs_rq_show() are exported
> symbols but ar eonly declared in the block internal file
are only?
> block/blk-mq-debugfs.h. which is not cleanly accessible to
On 09/14/2017 08:20 PM, Ming Lei wrote:
> On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
>> On 09/14/2017 10:42 AM, Ming Lei wrote:
>>> Hi,
>>>
>>> This patchset avoids to allocate driver tag beforehand for flush rq
>>> in case of I/O scheduler, then flush rq isn't treated specially
On Fri, Sep 01, 2017 at 10:16:45PM +0800, weiping zhang wrote:
> The first patch is the V2 of [PATCH] blkcg: check pol->cpd_free_fn
> before free cpd, it fixs a checking before free cpd.
>
> The second patch add some wrappers for struct blkcg_policy->xxx_fn, because
> not
> every block cgroup
A partial read I/O in pblk is an I/O where some sectors reside in the
write buffer in main memory and some are persisted on the device. Such
an I/O must contain at least 2 lbas, therefore checking for the case
where a single lba is mapped is not necessary.
Signed-off-by: Javier González
As part of pblk's recovery scheme, we store the lba mapped to each
physical sector in the device's out-of-band (OOB) area.
On the read path, we can use this information to validate that the data
being delivered to the upper layers corresponds to the lba being
requested. The cost of this check is
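A minimal sketch of such a check; meta_list and blba are hypothetical
names for the per-sector OOB metadata array and the base lba of the
read, not taken from the patch:

	/* Hedged sketch: compare the lba recorded in the OOB metadata
	 * of each sector against the lba the upper layer asked for. */
	for (i = 0; i < nr_secs; i++) {
		u64 lba = le64_to_cpu(meta_list[i].lba);

		if (lba != ADDR_EMPTY && lba != blba + i)
			WARN_ONCE(1, "pblk: corrupted read lba\n");
	}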
pblk schedules user I/O, metadata I/O and erases on the write path in
order to minimize collisions at the media level. Until now, there has
been a dependency between user and metadata I/Os that could lead to a
deadlock as both take the per-LUN semaphore to schedule submission.
This patch removes
Make sure that the variable controlling the block threshold for
allocating extra metadata sectors, in case of a line with bad blocks,
does not get a negative value. Otherwise, the line will be marked as
corrupted and wasted.
Signed-off-by: Javier González
---
Metadata I/Os are scheduled to minimize their impact on user data I/Os.
When there are enough LUNs instantiated (i.e., enough bandwidth), it is
easy to interleave metadata and data one after the other so that
metadata I/Os are the ones being blocked and not vice versa.
We do this by calculating
When a line is recycled during garbage collection, reads can still be
issued to the line. If the line is freed in the middle of this process,
data corruption might occur.
This patch guarantees that lines are not freed in the middle of reads
that target them (lines). Specifically, we use the
Use a constant to set the maximum number of inflight GC requests
allowed.
Signed-off-by: Javier González
---
drivers/lightnvm/pblk-gc.c | 4 ++--
drivers/lightnvm/pblk.h | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/lightnvm/pblk-gc.c
Normalize the way we name ppa variables to improve code readability.
Signed-off-by: Javier González
---
drivers/lightnvm/pblk-core.c | 48 +++-
1 file changed, 25 insertions(+), 23 deletions(-)
diff --git
Refactor lba sanity check on read path to avoid code duplication.
Signed-off-by: Javier González
---
drivers/lightnvm/pblk-read.c | 29 ++---
1 file changed, 10 insertions(+), 19 deletions(-)
diff --git a/drivers/lightnvm/pblk-read.c
When a line is selected for recycling by the garbage collector (GC), the
line state changes and the invalid bitmap is frozen, preventing
invalidations from happening. Throughout the GC, the L2P map is checked
to verify that no data being recycled has been updated. The last check
is done before
Simplify the bio put by doing it in the bio's end_io callback instead
of manually putting it on the completion path.
Signed-off-by: Javier González
---
drivers/lightnvm/pblk-core.c | 10 +++---
drivers/lightnvm/pblk-read.c | 1 -
drivers/lightnvm/pblk-recovery.c | 1 -
Simplify the part of the garbage collector where data is read from the
line being recycled and moved into an internal queue before being copied
to the memory buffer. This allows us to get rid of a dedicated
function that introduces an unnecessary dependency in the code.
Signed-off-by: Javier
On REQ_PREFLUSH, directly tag the I/O context flags to signal a flush in
the write to cache path, instead of finding the correct entry context
and imposing a memory barrier. This simplifies the code and might
potentially prevent race conditions when adding functionality to the
write path.
Each request type sent to the LightNVM subsystem requires different
metadata. Until now, we have tailored this metadata based on write, read
and erase commands. However, pblk uses different metadata for internal
writes that do not hit the write buffer. Instead of abusing the metadata
for reads,
Refactor the rqd allocation and free functions so that all I/O types can
use these helper functions.
Signed-off-by: Javier González
---
drivers/lightnvm/pblk-core.c | 40 ++--
drivers/lightnvm/pblk-read.c | 2 --
Wait until we know the exact number of ppas to be sent to the device,
before allocating the bio.
Signed-off-by: Javier González
---
drivers/lightnvm/pblk-rb.c | 5 +++--
drivers/lightnvm/pblk-write.c | 20 ++--
drivers/lightnvm/pblk.h | 4 ++--
3
For consistency with the rest of pblk, use rqd->end_io to point to the
function taking care of ending the request on the completion path.
Signed-off-by: Javier González
---
drivers/lightnvm/pblk-core.c | 7 ---
drivers/lightnvm/pblk-read.c | 5 ++---
2 files changed, 2
This patchset is a general cleanup to improve code readability.
Javier González (11):
lightnvm: pblk: use constant for GC max inflight
lightnvm: pblk: normalize ppa namings
lightnvm: pblk: refactor read lba sanity check
lightnvm: pblk: simplify data validity check on GC
lightnvm: pblk:
On 09/15/2017 12:06 PM, Damien Le Moal wrote:
> The zoned I/O scheduler is mostly identical to mq-deadline and retains
> the same configuration attributes. The main difference is that the
> zoned scheduler will ensure that at any time at most one write request
> per sequential zone is in flight
On 09/15/2017 12:06 PM, Damien Le Moal wrote:
> In the case of a ZBC disk used with scsi-mq, zone write locking does
> not prevent write reordering in sequential zones. Unlike the legacy
> case, zone locking is done after the command request is removed from
> the scheduler dispatch queue. That is,
On 09/15/2017 12:06 PM, Damien Le Moal wrote:
> Zoned block devices have no write constraints for conventional zones.
> So write locking of conventional zones is not necessary and can even
> hurt performance by unnecessarily operating the disk under low queue
> depth. To avoid this, use the disk
On 09/15/2017 12:06 PM, Damien Le Moal wrote:
> Allocate and initialize the disk request queue zoned structure on disk
> revalidate. As the bitmap allocation for the seq_zones field of the
> zoned structure is identical to the allocation of the zones write lock
> bitmap, introduce the helper
On 09/15/2017 12:06 PM, Damien Le Moal wrote:
> The three values starting at byte 8 of the Zoned Block Device
> Characteristics VPD page B6h are 32 bits values, not 64bits. So use
> get_unaligned_be32() to retrieve the values and not get_unaligned_be64()
>
> Fixes: 89d947561077 ("sd: Implement
On 09/15/2017 12:06 PM, Damien Le Moal wrote:
> instead of open coding, use the min() macro to calculate a report zones
> reply buffer length in sd_zbc_check_zone_size() and the round_up()
> macro for calculating the number of zones in sd_zbc_setup().
>
> No functional change is introduced by
On 09/15/2017 12:06 PM, Damien Le Moal wrote:
> Rearrange sd_zbc_setup() to include use_16_for_rw and use_10_for_rw
> assignments and move the calculation of sdkp->zone_shift together
> with the assignment of the verified zone_blocks value in
> sd_zbc_check_zone_size().
>
> No functional change
On 09/15/2017 12:06 PM, Damien Le Moal wrote:
> Fix comments style (do not use documented comment style) and add some
> comments to clarify some functions. Also fix some functions signature
> indentation and remove a useless blank line in sd_zbc_read_zones().
>
> No functional change is
The zoned I/O scheduler is mostly identical to mq-deadline and retains
the same configuration attributes. The main difference is that the
zoned scheduler will ensure that at any time at most one write request
per sequential zone is in flight (has been dispatched to the disk) in
order to protect
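A sketch of the dispatch-time gate this implies, with hypothetical
helper names (the real patch works inside the mq-deadline dispatch
logic):

	/* Hedged sketch, hypothetical helpers: allow a write to a
	 * sequential zone to be dispatched only if no other write to
	 * that zone is in flight; the bitmap bit acts as the per-zone
	 * write lock. */
	if (op_is_write(req_op(rq)) && rq_zone_is_seq(rq) &&
	    test_and_set_bit(rq_zone_no(rq), zd->zones_wlock))
		return NULL;	/* zone locked: try another rq */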
Allocate and initialize the disk request queue zoned structure on disk
revalidate. As the bitmap allocation for the seq_zones field of the
zoned structure is identical to the allocation of the zones write lock
bitmap, introduce the helper sd_zbc_alloc_zone_bitmap().
Using this helper, wait for the
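A sketch of what the helper might look like (one bit per zone,
allocated on the queue's home node; illustrative):

static unsigned long *sd_zbc_alloc_zone_bitmap(struct scsi_disk *sdkp)
{
	struct request_queue *q = sdkp->disk->queue;

	/* one bit per zone, NUMA-local to the request queue */
	return kzalloc_node(BITS_TO_LONGS(sdkp->nr_zones) *
			    sizeof(unsigned long), GFP_KERNEL, q->node);
}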
In the case of a ZBC disk used with scsi-mq, zone write locking does
not prevent write reordering in sequential zones. Unlike the legacy
case, zone locking is done after the command request is removed from
the scheduler dispatch queue. That is, at the time of zone locking,
the write command may
Zoned block devices have no write constraints for conventional zones.
So write locking of conventional zones is not necessary and can even
hurt performance by unnecessarily operating the disk under low queue
depth. To avoid this, use the disk request queue seq_zones bitmap to
allow any write to be
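In code, the bypass could reduce to a bitmap test along these lines
(a hedged sketch with a hypothetical helper, using the struct
blk_zoned fields quoted earlier in this series):

static bool sd_zbc_write_needs_lock(struct request *rq, unsigned int zno)
{
	struct request_queue *q = rq->q;

	/* lock only writes to zones marked sequential; fall back to
	 * locking everything if no bitmap is available */
	return !q->zoned.seq_zones || test_bit(zno, q->zoned.seq_zones);
}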
Instead of open coding, use the min() macro to calculate a report zones
reply buffer length in sd_zbc_check_zone_size() and the round_up()
macro for calculating the number of zones in sd_zbc_setup().
No functional change is introduced by this patch.
Signed-off-by: Damien Le Moal
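Roughly, the two conversions look like this; the expressions are
illustrative (a REPORT ZONES reply is a 64-byte header plus 64-byte
descriptors), the exact operands are in the patch:

	/* Hedged sketch: bound the reply buffer with min() ... */
	buf_len = min(nr_zones * 64 + 64, SD_ZBC_BUF_SIZE);
	/* ... and count zones with round_up() instead of open-coded math */
	sdkp->nr_zones = round_up(sdkp->capacity, zone_blocks)
			 >> sdkp->zone_shift;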
The three values starting at byte 8 of the Zoned Block Device
Characteristics VPD page B6h are 32-bit values, not 64-bit. So use
get_unaligned_be32() to retrieve the values, not get_unaligned_be64().
Fixes: 89d947561077 ("sd: Implement support for ZBC devices")
Cc:
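For reference, the corrected reads presumably look like this
(offsets per the Zoned Block Device Characteristics VPD page layout;
field names are a sketch):

	/* 32-bit fields at bytes 8, 12 and 16 of VPD page B6h */
	sdkp->zones_optimal_open = get_unaligned_be32(&buf[8]);
	sdkp->zones_optimal_nonseq = get_unaligned_be32(&buf[12]);
	sdkp->zones_max_open = get_unaligned_be32(&buf[16]);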
Rearrange sd_zbc_setup() to include use_16_for_rw and use_10_for_rw
assignments and move the calculation of sdkp->zone_shift together
with the assignment of the verified zone_blocks value in
sd_zbc_check_zone_size().
No functional change is introduced by this patch.
Signed-off-by: Damien Le Moal
__blk_mq_debugfs_rq_show() and blk_mq_debugfs_rq_show() are exported
symbols but are only declared in the block internal file
block/blk-mq-debugfs.h, which is not cleanly accessible to files
outside of the block directory.
Move the declaration of these functions to the new file
Fix comments style (do not use documented comment style) and add some
comments to clarify some functions. Also fix some functions signature
indentation and remove a useless blank line in sd_zbc_read_zones().
No functional change is introduced by this patch.
Signed-off-by: Damien Le Moal
Components relying only on the request_queue structure for managing
and controlling block devices (e.g. I/O schedulers) have a limited
view/knowledge of the device being controlled. For instance, the device
capacity cannot be known easily, which for a zoned block device also
results in the
Move standard macro definitions for the zone types and zone conditions
to scsi_proto.h together with the definitions related to the
REPORT ZONES command. While at it, define all values in the enums to
be clear.
Also remove unnecessary includes in sd_zbc.c.
No functional change is introduced by
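The moved definitions plausibly end up along these lines in
scsi_proto.h (a subset; values per the ZBC zone descriptor format):

	/* Zone types of REPORT ZONES zone descriptors */
	enum zbc_zone_type {
		ZBC_ZONE_TYPE_CONV		= 0x1,
		ZBC_ZONE_TYPE_SEQWRITE_REQ	= 0x2,
		ZBC_ZONE_TYPE_SEQWRITE_PREF	= 0x3,
		/* 0x4 to 0xf are reserved */
	};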
This series implements support for ZBC disks used through the scsi-mq I/O path.
The current scsi level support of ZBC disks guarantees write request ordering
using a per-zone write lock which prevents issuing simultaneously multiple
write commands to a zone, doing so avoids reordering of
The functions blk_mq_sched_free_hctx_data(), blk_mq_sched_try_merge(),
blk_mq_sched_try_insert_merge() and blk_mq_sched_request_inserted() are
all exported symbols but are declared only internally in
block/blk-mq-sched.h. Move these declarations to the new file
include/linux/blk-mq-sched.h to make
When accounting nr_phys_segments while merging bios into a rq, we
only consider segment merging within an individual bio, not across
all the bios in the rq. This leads to a bigger nr_phys_segments for
the rq than the real one when the segments of the bios in the rq are
contiguous and mergeable. The nr_phys_segments of the rq
If bio_integrity_merge_rq() returns false or nr_phys_segments exceeds
max_segments, the merging fails, but bi_front/back_seg_size may
already have been modified. To avoid this, move the sanity checks
ahead.
Signed-off-by: Jianchao Wang
---
block/blk-merge.c | 16
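A sketch of the reordering, using the v4.13-era blk-merge.c names
(illustrative):

	/* Hedged sketch: fail-able checks first, so the front/back
	 * segment sizes are never touched on a rejected merge. */
	if (req->nr_phys_segments + bio->bi_phys_segments >
	    queue_max_segments(req->q))
		return 0;
	if (!blk_integrity_merge_bio(req->q, req, bio))
		return 0;
	/* only now recompute front/back segment sizes and merge */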