Re: [PATCH 1/8] block: add WRITE_BG
> * non-volatile media on completion.
> + * WRITE_BG		Background write. This is for background activity like
> + *			the periodic flush and background threshold writeback
> *
> */
> #define RW_MASK			REQ_OP_WRITE
> @@ -202,6 +204,7 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> #define WRITE_FLUSH		(REQ_SYNC | REQ_NOIDLE | REQ_PREFLUSH)
> #define WRITE_FUA		(REQ_SYNC | REQ_NOIDLE | REQ_FUA)
> #define WRITE_FLUSH_FUA	(REQ_SYNC | REQ_NOIDLE | REQ_PREFLUSH | REQ_FUA)
> +#define WRITE_BG		(REQ_NOIDLE | REQ_BG)

I've been trying to kill off these WRITE_ flags as they aren't exactly helpful; see my branch here, which I'm waiting for the previous series to go in before posting:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-flags

Which also raises the question of why you add the REQ_NOIDLE flag above, as it's only applied to synchronous queues in cfq as far as I can tell.

And while I'm nitpicking about the most trivial patch of the series anyway: any good reason not to just spell out "BACKGROUND"?

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 06/12] blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()
On 10/27/2016 12:53 AM, Bart Van Assche wrote:
> Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
> are followed by kicking the requeue list. Hence add an argument to
> these two functions that allows the caller to kick the requeue list.
> This was proposed by Christoph Hellwig.
>
> Signed-off-by: Bart Van Assche
> Cc: Christoph Hellwig
> Cc: Hannes Reinecke
> Cc: Sagi Grimberg
> Cc: Johannes Thumshirn
> ---
>  block/blk-flush.c            |  5 +
>  block/blk-mq.c               | 10 +++---
>  drivers/block/xen-blkfront.c |  2 +-
>  drivers/md/dm-rq.c           |  2 +-
>  drivers/nvme/host/core.c     |  2 +-
>  drivers/scsi/scsi_lib.c      |  4 +---
>  include/linux/blk-mq.h       |  5 +++--
>  7 files changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 6a14b68..a834aed 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -134,10 +134,7 @@ static void blk_flush_restore_request(struct request *rq)
>  static bool blk_flush_queue_rq(struct request *rq, bool add_front)
>  {
>  	if (rq->q->mq_ops) {
> -		struct request_queue *q = rq->q;
> -
> -		blk_mq_add_to_requeue_list(rq, add_front);
> -		blk_mq_kick_requeue_list(q);
> +		blk_mq_add_to_requeue_list(rq, add_front, true);
>  		return false;
>  	} else {
>  		if (add_front)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4945437..688bcc3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -492,12 +492,12 @@ static void __blk_mq_requeue_request(struct request *rq)
>  	}
>  }
>
> -void blk_mq_requeue_request(struct request *rq)
> +void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
>  {
>  	__blk_mq_requeue_request(rq);
>
>  	BUG_ON(blk_queued_rq(rq));
> -	blk_mq_add_to_requeue_list(rq, true);
> +	blk_mq_add_to_requeue_list(rq, true, kick_requeue_list);
>  }
>  EXPORT_SYMBOL(blk_mq_requeue_request);
>
> @@ -535,7 +535,8 @@ static void blk_mq_requeue_work(struct work_struct *work)
>  	blk_mq_start_hw_queues(q);
>  }
>
> -void blk_mq_add_to_requeue_list(struct request *rq, bool at_head)
> +void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
> +				bool kick_requeue_list)
>  {
>  	struct request_queue *q = rq->q;
>  	unsigned long flags;
> @@ -554,6 +555,9 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head)
>  		list_add_tail(&rq->queuelist, &q->requeue_list);
>  	}
>  	spin_unlock_irqrestore(&q->requeue_lock, flags);
> +
> +	if (kick_requeue_list)
> +		blk_mq_kick_requeue_list(q);
>  }
>  EXPORT_SYMBOL(blk_mq_add_to_requeue_list);
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 9908597..1ca702d 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -2043,7 +2043,7 @@ static int blkif_recover(struct blkfront_info *info)
>  		/* Requeue pending requests (flush or discard) */
>  		list_del_init(&req->queuelist);
>  		BUG_ON(req->nr_phys_segments > segs);
> -		blk_mq_requeue_request(req);
> +		blk_mq_requeue_request(req, false);
>  	}
>  	blk_mq_kick_requeue_list(info->rq);
>
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index dc75bea..fbd37b4 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -354,7 +354,7 @@ EXPORT_SYMBOL(dm_mq_kick_requeue_list);
>
>  static void dm_mq_delay_requeue_request(struct request *rq, unsigned long msecs)
>  {
> -	blk_mq_requeue_request(rq);
> +	blk_mq_requeue_request(rq, false);
>  	__dm_mq_kick_requeue_list(rq->q, msecs);
>  }

Hmm. __dm_mq_kick_requeue_list() does essentially the same. Have you
checked if you can use 'true' here and drop the call to it?

However, it does take the queue_lock when doing so. Is that required?
None of the other drivers do it ...

Cheers,

Hannes
--
Dr. Hannes Reinecke		      Teamlead Storage & Networking
h...@suse.de			      +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 07/12] dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code
On 10/27/2016 12:54 AM, Bart Van Assche wrote:
> Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
> in the dm start and stop queue functions, only manipulate the latter
> flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
>
> Signed-off-by: Bart Van Assche
> Reviewed-by: Christoph Hellwig
> Cc: Mike Snitzer
> ---
>  drivers/md/dm-rq.c | 18 ++
>  1 file changed, 2 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index fbd37b4..d47a504 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -75,12 +75,6 @@ static void dm_old_start_queue(struct request_queue *q)
>
>  static void dm_mq_start_queue(struct request_queue *q)
>  {
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(q->queue_lock, flags);
> -	queue_flag_clear(QUEUE_FLAG_STOPPED, q);
> -	spin_unlock_irqrestore(q->queue_lock, flags);
> -
>  	blk_mq_start_stopped_hw_queues(q, true);
>  	blk_mq_kick_requeue_list(q);
>  }
> @@ -105,16 +99,8 @@ static void dm_old_stop_queue(struct request_queue *q)
>
>  static void dm_mq_stop_queue(struct request_queue *q)
>  {
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(q->queue_lock, flags);
> -	if (blk_queue_stopped(q)) {
> -		spin_unlock_irqrestore(q->queue_lock, flags);
> +	if (blk_mq_queue_stopped(q))
>  		return;
> -	}
> -
> -	queue_flag_set(QUEUE_FLAG_STOPPED, q);
> -	spin_unlock_irqrestore(q->queue_lock, flags);
>
>  	/* Avoid that requeuing could restart the queue. */
>  	blk_mq_cancel_requeue_work(q);
> @@ -341,7 +327,7 @@ static void __dm_mq_kick_requeue_list(struct request_queue *q, unsigned long mse
>  	unsigned long flags;
>
>  	spin_lock_irqsave(q->queue_lock, flags);
> -	if (!blk_queue_stopped(q))
> +	if (!blk_mq_queue_stopped(q))
>  		blk_mq_delay_kick_requeue_list(q, msecs);
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  }

Ah. Right. That answers my previous question.

Reviewed-by: Hannes Reinecke

Cheers,

Hannes
--
Dr. Hannes Reinecke		      Teamlead Storage & Networking
h...@suse.de			      +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: A question regarding "multiple SGL"
Hi Robert,

please explain your use case that isn't handled. The one and only
reason to set MSDBD to 1 is to make the code a lot simpler, given that
there is no real use case for supporting more.

RDMA uses memory registrations to register large and possibly
discontiguous data regions for a single rkey, aka a single SGL
descriptor in NVMe terms. There would be two reasons to support
multiple SGL descriptors: a) to support a larger I/O size than
supported by a single MR, or b) to support a data region format not
mappable by a single MR.

iSER only supports a single rkey (or stag in IETF terminology) and has
been doing fine on a) and mostly fine on b). There are a few possible
data layouts not supported by the traditional IB/iWarp FR WRs, but the
limit is in fact exactly the same as imposed by the NVMe PRPs used for
PCIe NVMe devices, so the Linux block layer has support to not generate
them. Also, with modern Mellanox IB/RoCE hardware we can actually
register completely arbitrary SGLs. iSER supports this registration
mode already with a trivial code addition, but for NVMe we didn't have
a pressing need yet.
Re: A question regarding "multiple SGL"
Hi Robert,

There is no feature called "Multiple SGL in one NVMe capsule". The NVMe
over Fabrics specification allows a controller to advertise how many
SGL descriptors it supports using the MSDBD Identify field:

"Maximum SGL Data Block Descriptors (MSDBD): This field indicates the
maximum number of (Keyed) SGL Data Block descriptors that a host is
allowed to place in a capsule. A value of 0h indicates no limit."

Setting this value to 1 is perfectly valid. Similarly, a host is free
to choose any number of SGL descriptors between 0 (only for commands
that don't transfer data) and the limit imposed by the controller using
the MSDBD field.

There are no plans to support an MSDBD value larger than 1 in the Linux
NVMe target, and there are no plans to ever submit commands with
multiple SGLs from the host driver either.

Cheers,
	Christoph
Re: A question regarding "multiple SGL"
Hi Christoph,

Thanks, got it. Could you please do me a favor and let me know the
background for why we ONLY support "MSDBD == 1"? I am NOT trying to
resist or oppose anything, I just want to know the reason. You know, it
is a little weird for me, as "MSDBD == 1" does not fulfill all the use
cases which are depicted in the spec.

Best,

Robert Qiuxin

Qiuxin Robert
华为技术有限公司 Huawei Technologies Co., Ltd.
Phone: +86-755-28420357
Mobile: +86 15986638429
Email: qiu...@huawei.com
Huawei Technologies Co., Ltd.
Bantian, Longgang District, Shenzhen 518129, P.R.China
http://www.huawei.com

-----Original Message-----
From: Christoph Hellwig [mailto:h...@lst.de]
Sent: October 27, 2016 14:41
To: 鑫愿
Cc: Bart Van Assche; Jens Axboe; linux-block@vger.kernel.org; James Bottomley; Martin K. Petersen; Mike Snitzer; linux-r...@vger.kernel.org; Ming Lei; linux-n...@lists.infradead.org; Keith Busch; Doug Ledford; linux-s...@vger.kernel.org; Laurence Oberman; Christoph Hellwig; Tiger zhao; Qiuxin (robert)
Subject: Re: A question regarding "multiple SGL"

Hi Robert,

There is no feature called "Multiple SGL in one NVMe capsule". The NVMe
over Fabrics specification allows a controller to advertise how many
SGL descriptors it supports using the MSDBD Identify field:

"Maximum SGL Data Block Descriptors (MSDBD): This field indicates the
maximum number of (Keyed) SGL Data Block descriptors that a host is
allowed to place in a capsule. A value of 0h indicates no limit."

Setting this value to 1 is perfectly valid. Similarly, a host is free
to choose any number of SGL descriptors between 0 (only for commands
that don't transfer data) and the limit imposed by the controller using
the MSDBD field.

There are no plans to support an MSDBD value larger than 1 in the Linux
NVMe target, and there are no plans to ever submit commands with
multiple SGLs from the host driver either.

Cheers,
	Christoph
Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
On Thu, Oct 27, 2016 at 11:26 AM, Jan Kara wrote:
> On Wed 26-10-16 10:12:38, Jens Axboe wrote:
>> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>>>
>>>> Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe ha scritto:
>>>>
>>>> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>>>>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>>>>> The question to ask first is whether to actually have pluggable
>>>>>> schedulers on blk-mq at all, or just have one that is meant to
>>>>>> do the right thing in every case (and possibly can be bypassed
>>>>>> completely).
>>>>>
>>>>> That would be my preference. Have a BFQ-variant for blk-mq as an
>>>>> option (default to off unless opted in by the driver or user), and
>>>>> no other scheduler for blk-mq. Don't bother with bfq for non
>>>>> blk-mq. It's not like there is any advantage in the legacy-request
>>>>> device even for slow devices, except for the option of having I/O
>>>>> scheduling.
>>>>
>>>> It's the only right way forward. blk-mq might not offer any substantial
>>>> advantages to rotating storage, but with scheduling, it won't offer a
>>>> downside either. And it'll take us towards the real goal, which is to
>>>> have just one IO path.
>>>
>>> ok
>>>
>>>> Adding a new scheduler for the legacy IO path makes no sense.
>>>
>>> I would fully agree if effective and stable I/O scheduling were
>>> available in blk-mq in one or two months. But I guess that it will
>>> take at least one year optimistically, given the current status of the
>>> needed infrastructure, and given the great difficulties of doing
>>> effective scheduling at the high parallelism and extreme target speeds
>>> of blk-mq. Of course, this holds true unless little clever scheduling
>>> is performed.
>>>
>>> So, what's the point in forcing a lot of users to wait another year or
>>> more, for a solution that has yet to be even defined, while they could
>>> enjoy a much better system, and then switch to an even better system
>>> when scheduling is ready in blk-mq too?
>>
>> That same argument could have been made 2 years ago. Saying no to a new
>> scheduler for the legacy framework goes back roughly that long. We could
>> have had BFQ for mq NOW, if we didn't keep coming back to this very
>> point.
>>
>> I'm hesitant to add a new scheduler because it's very easy to add, very
>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
>> it'll take us years and years to get rid of it again. We should be
>> moving towards LESS moving parts in the legacy path, not more.
>>
>> We can keep having this discussion every few years, but I think we'd
>> both prefer to make some actual progress here. It's perfectly fine to
>> add an interface for a single queue interface for an IO scheduler for
>> blk-mq, since we don't care too much about scalability there. And that
>> won't take years, that should be a few weeks. Retrofitting BFQ on top of
>> that should not be hard either. That can co-exist with a real multiqueue
>> scheduler as well, something that's geared towards some fairness for
>> faster devices.
>
> OK, so some solution like having a variant of blk_sq_make_request() that
> will consume requests, do IO scheduling decisions on them, and feed them
> into the HW queue as it sees fit would be acceptable? That will provide the
> IO scheduler the global view that it needs for complex scheduling decisions,
> so it should indeed be relatively easy to port BFQ to work like that.
>
> 								Honza
> --
> Jan Kara
> SUSE Labs, CR

Hello,

Let me first say that I'm in no way associated with Paolo Valente or
any other BFQ developer. I'm a mere user who has had great experience
using BFQ.

My workload is one that takes my disks to their limits. I often use
large files like raw Blu-ray streams which I then remux to mkvs while
at the same time streaming at least 2 movies to various devices in the
house and using my system as I normally do while the remuxing process
is going on. At times, I'm also pushing video files to my NAS at close
to Gbps speed while the stuff I mentioned is in progress.

My experience with BFQ is that it has never resulted in the video
streams being interrupted due to disk thrashing. I've extensively used
all the other Linux disk schedulers in the past, and what I've observed
is that whenever I start the remuxing (and copying) process, the
streams will begin to hiccup, stutter and often multi-second-long
"waits" will occur. It gets even worse: when I do this kind of
workload, the whole system will come to almost a halt and interactivity
goes out the window. Impossible to start an app in a reasonable amount
of time. Loading a visited website makes Chrome hang
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
You do realise that local filesystems can silently change the location
of file data at any point in time, so there is no such thing as a
"stable mapping" of file data to block device addresses in userspace?

If you want remote access to the blocks owned and controlled by a
filesystem, then you need to use a filesystem with a remote locking
mechanism to allow co-ordinated, coherent access to the data in those
blocks. Anything else is just asking for ongoing, unfixable filesystem
corruption or data leakage problems (i.e. security issues).

And at least for XFS we have such a mechanism :) E.g. I have a
prototype of a pNFS layout that uses XFS+DAX to allow clients to do
RDMA directly to XFS files, with the same locking mechanism we use for
the current block and scsi layout in xfs_pnfs.c.

Christoph, did you manage to leap to the future and solve the RDMA
persistency hole? :)

e.g. what happens with O_DSYNC in this model? Or did you do a message
exchange for commits?
[PATCH 6/7] lightnvm: add ECC error codes
Add ECC error codes to enable the appropriate handling in the target.

Signed-off-by: Javier González
---
 include/linux/lightnvm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index e3ccaff..33643ae 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -107,6 +107,8 @@ enum {
 	NVM_RSP_NOT_CHANGEABLE	= 0x1,
 	NVM_RSP_ERR_FAILWRITE	= 0x40ff,
 	NVM_RSP_ERR_EMPTYPAGE	= 0x42ff,
+	NVM_RSP_ERR_FAILECC	= 0x4281,
+	NVM_RSP_WARN_HIGHECC	= 0x4700,

 	/* Device opcodes */
 	NVM_OP_HBREAD		= 0x02,
--
2.7.4
[PATCH 2/7] lightnvm: do not decide on device blocks
Device blocks should be marked by the device and considered as bad
blocks by the media manager. Thus, do not make assumptions about which
blocks are going to be used by the device. In doing so we might lose
valid blocks from the free list.

Signed-off-by: Javier González
---
 drivers/lightnvm/gennvm.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index 730d736..a7e17fa 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -371,12 +371,6 @@ static int gen_blocks_init(struct nvm_dev *dev, struct gen_dev *gn)
 			block->lun = &lun->vlun;
 			block->id = cur_block_id++;

-			/* First block is reserved for device */
-			if (unlikely(lun_iter == 0 && blk_iter == 0)) {
-				lun->vlun.nr_free_blocks--;
-				continue;
-			}
-
 			list_add_tail(&block->list, &lun->free_list);
 		}
--
2.7.4
[PATCH 5/7] lightnvm: export set bad block table
Bad blocks should be managed by block owners. This would be either
targets for data blocks or sysblk for system blocks. In order to
support this, export two functions: one to mark a block as a specific
type (e.g., bad block) and another to update the bad block table on the
device. Move bad block management to rrpc.

Signed-off-by: Javier González
---
 drivers/lightnvm/core.c   | 27 +++
 drivers/lightnvm/gennvm.c | 25 +
 drivers/lightnvm/rrpc.c   | 34 +-
 drivers/lightnvm/sysblk.c | 29 +
 include/linux/lightnvm.h  | 17 -
 5 files changed, 82 insertions(+), 50 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 4be3879..a81ed1c 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -197,6 +197,33 @@ void nvm_mark_blk(struct nvm_dev *dev, struct ppa_addr ppa, int type)
 }
 EXPORT_SYMBOL(nvm_mark_blk);

+int nvm_set_bb_tbl(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas,
+								int type)
+{
+	struct nvm_rq rqd;
+	int ret;
+
+	if (nr_ppas > dev->ops->max_phys_sect) {
+		pr_err("nvm: unable to update all sysblocks atomically\n");
+		return -EINVAL;
+	}
+
+	memset(&rqd, 0, sizeof(struct nvm_rq));
+
+	nvm_set_rqd_ppalist(dev, &rqd, ppas, nr_ppas, 1);
+	nvm_generic_to_addr_mode(dev, &rqd);
+
+	ret = dev->ops->set_bb_tbl(dev, &rqd.ppa_addr, rqd.nr_ppas, type);
+	nvm_free_rqd_ppalist(dev, &rqd);
+	if (ret) {
+		pr_err("nvm: sysblk failed bb mark\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(nvm_set_bb_tbl);
+
 int nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
 {
 	return dev->mt->submit_io(dev, rqd);

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index 575afc4..ae19a61 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -611,34 +611,11 @@ static void gen_mark_blk(struct nvm_dev *dev, struct ppa_addr ppa, int type)
 	blk->state = type;
 }

-/*
- * mark block bad in gen. It is expected that the target recovers separately
- */
-static void gen_mark_blk_bad(struct nvm_dev *dev, struct nvm_rq *rqd)
-{
-	int bit = -1;
-	int max_secs = dev->ops->max_phys_sect;
-	void *comp_bits = &rqd->ppa_status;
-
-	nvm_addr_to_generic_mode(dev, rqd);
-
-	/* look up blocks and mark them as bad */
-	if (rqd->nr_ppas == 1) {
-		gen_mark_blk(dev, rqd->ppa_addr, NVM_BLK_ST_BAD);
-		return;
-	}
-
-	while ((bit = find_next_bit(comp_bits, max_secs, bit + 1)) < max_secs)
-		gen_mark_blk(dev, rqd->ppa_list[bit], NVM_BLK_ST_BAD);
-}
-
 static void gen_end_io(struct nvm_rq *rqd)
 {
 	struct nvm_tgt_instance *ins = rqd->ins;

-	if (rqd->error == NVM_RSP_ERR_FAILWRITE)
-		gen_mark_blk_bad(rqd->dev, rqd);
-
+	/* Write failures and bad blocks are managed within the target */
 	ins->tt->end_io(rqd);
 }

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index cb30ccf..8deef2e 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -716,6 +716,34 @@ static void rrpc_run_gc(struct rrpc *rrpc, struct rrpc_block *rblk)
 	queue_work(rrpc->kgc_wq, &rblk->ws_gc);
 }

+static void __rrpc_mark_bad_block(struct nvm_dev *dev, struct ppa_addr *ppa)
+{
+	nvm_mark_blk(dev, *ppa, NVM_BLK_ST_BAD);
+	nvm_set_bb_tbl(dev, ppa, 1, NVM_BLK_T_GRWN_BAD);
+}
+
+static void rrpc_mark_bad_block(struct rrpc *rrpc, struct nvm_rq *rqd)
+{
+	struct nvm_dev *dev = rrpc->dev;
+	void *comp_bits = &rqd->ppa_status;
+	struct ppa_addr ppa, prev_ppa;
+	int nr_ppas = rqd->nr_ppas;
+	int bit;
+
+	if (rqd->nr_ppas == 1)
+		__rrpc_mark_bad_block(dev, &rqd->ppa_addr);
+
+	ppa_set_empty(&prev_ppa);
+	bit = -1;
+	while ((bit = find_next_bit(comp_bits, nr_ppas, bit + 1)) < nr_ppas) {
+		ppa = rqd->ppa_list[bit];
+		if (ppa_cmp_blk(ppa, prev_ppa))
+			continue;
+
+		__rrpc_mark_bad_block(dev, &ppa);
+	}
+}
+
 static void rrpc_end_io_write(struct rrpc *rrpc, struct rrpc_rq *rrqd,
						sector_t laddr, uint8_t npages)
 {
@@ -742,8 +770,12 @@ static void rrpc_end_io(struct nvm_rq *rqd)
 	uint8_t npages = rqd->nr_ppas;
 	sector_t laddr = rrpc_get_laddr(rqd->bio) - npages;

-	if (bio_data_dir(rqd->bio) == WRITE)
+	if (bio_data_dir(rqd->bio) == WRITE) {
+		if (rqd->error == NVM_RSP_ERR_FAILWRITE)
+			rrpc_mark_bad_block(rrpc, rqd);
+
 		rrpc_end_io_write(rrpc, rrqd, laddr, npages);
+	}

 	bio_put(rqd->bio);

diff --git a/drivers/lightnvm/sysblk.c
Re: [PATCH 05/12] blk-mq: Introduce blk_mq_quiesce_queue()
On 10/26/2016 10:52 PM, Hannes Reinecke wrote:
> Hmm. Can't say I like having to have two RCU structures for the same
> purpose, but I guess that can't be helped.

Hello Hannes,

There are two RCU structures because of BLK_MQ_F_BLOCKING. Today only
the nbd driver sets that flag. If the nbd driver were modified such
that it doesn't set that flag, then the BLK_MQ_F_BLOCKING flag and also
queue_rq_srcu could be eliminated from the blk-mq core.

Bart.
Re: [PATCH 0/3] iopmem : A block device for PCIe memory
On Thu, Oct 27, 2016 at 01:22:49PM +0300, Sagi Grimberg wrote:
> Christoph, did you manage to leap to the future and solve the
> RDMA persistency hole? :)
>
> e.g. what happens with O_DSYNC in this model? Or did you do
> a message exchange for commits?

Yes, pNFS calls this the layoutcommit. That being said, once we get an
RDMA commit or flush operation we could easily make the layoutcommit
optional for some operations. There already is a precedent for this in
the flexfiles layout specification.
Re: [PATCH 05/12] blk-mq: Introduce blk_mq_quiesce_queue()
Looks good,

Reviewed-by: Christoph Hellwig
[PATCH 4/7] lightnvm: drop reserve and release LUN callbacks
On target initialization, targets use callbacks to the media manager to
configure the LUNs they use. In order to simplify the flow, drop these
callbacks and manage everything internally in the media manager. By
making use of the newly introduced LUN management structure, the media
manager knows which target exclusively owns each LUN and can therefore
allocate and free all the necessary structures before initializing the
target. LUNs that are not exclusively owned belong to the media manager
in any case.

Adapt rrpc to not use the reserve_lun/release_lun callback functions.

Signed-off-by: Javier González
---
 drivers/lightnvm/gennvm.c | 62 +++
 drivers/lightnvm/rrpc.c   | 12 +
 include/linux/lightnvm.h  |  5 +---
 3 files changed, 49 insertions(+), 30 deletions(-)

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index 8bff725..575afc4 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -35,6 +35,30 @@ static const struct block_device_operations gen_fops = {
 	.owner		= THIS_MODULE,
 };

+static int gen_reserve_luns(struct nvm_dev *dev, int lun_begin, int lun_end,
+							struct nvm_target *t)
+{
+	struct gen_dev *gn = dev->mp;
+	struct gen_lun *lun;
+	int i;
+
+	for (i = lun_begin; i <= lun_end; i++) {
+		if (test_and_set_bit(i, dev->lun_map)) {
+			pr_err("gennvm: lun %d is already allocated\n", i);
+			goto fail;
+		}
+
+		lun = &gn->luns[i];
+	}
+
+	return 0;
+fail:
+	while (--i > lun_begin)
+		clear_bit(i, dev->lun_map);
+
+	return 1;
+}
+
 static int gen_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 {
 	struct gen_dev *gn = dev->mp;
@@ -80,6 +104,9 @@ static int gen_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 	tdisk->fops = &gen_fops;
 	tdisk->queue = tqueue;

+	if (tt->exclusive && gen_reserve_luns(dev, s->lun_begin, s->lun_end, t))
+		goto err_init;
+
 	targetdata = tt->init(dev, tdisk, s->lun_begin, s->lun_end);
 	if (IS_ERR(targetdata))
 		goto err_init;
@@ -110,7 +137,23 @@ err_t:
 	return -ENOMEM;
 }

-static void __gen_remove_target(struct nvm_target *t)
+static void gen_release_luns(struct nvm_dev *dev, struct nvm_target *t)
+{
+	struct gen_dev *gn = dev->mp;
+	struct gen_lun *lun;
+	int lunid;
+	int i;
+
+	gen_for_each_lun(gn, lun, i) {
+		if (lun->tgt != t)
+			continue;
+
+		lunid = lun->vlun.id;
+		WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+	}
+}
+
+static void __gen_remove_target(struct nvm_dev *dev, struct nvm_target *t)
 {
 	struct nvm_tgt_type *tt = t->type;
 	struct gendisk *tdisk = t->disk;
@@ -122,6 +165,7 @@ static void __gen_remove_target(struct nvm_target *t)
 	if (tt->exit)
 		tt->exit(tdisk->private_data);

+	gen_release_luns(dev, t);
 	put_disk(tdisk);

 	list_del(&t->list);
@@ -152,7 +196,7 @@ static int gen_remove_tgt(struct nvm_dev *dev, struct nvm_ioctl_remove *remove)
 		mutex_unlock(&gn->lock);
 		return 1;
 	}
-	__gen_remove_target(t);
+	__gen_remove_target(dev, t);
 	mutex_unlock(&gn->lock);

 	return 0;
@@ -474,7 +518,7 @@ static void gen_unregister(struct nvm_dev *dev)
 	list_for_each_entry_safe(t, tmp, &gn->targets, list) {
 		if (t->dev != dev)
 			continue;
-		__gen_remove_target(t);
+		__gen_remove_target(dev, t);
 	}
 	mutex_unlock(&gn->lock);

@@ -618,16 +662,6 @@ static int gen_erase_blk(struct nvm_dev *dev, struct nvm_block *blk, int flags)
 	return nvm_erase_ppa(dev, &addr, 1, flags);
 }

-static int gen_reserve_lun(struct nvm_dev *dev, int lunid)
-{
-	return test_and_set_bit(lunid, dev->lun_map);
-}
-
-static void gen_release_lun(struct nvm_dev *dev, int lunid)
-{
-	WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
-}
-
 static struct nvm_lun *gen_get_lun(struct nvm_dev *dev, int lunid)
 {
 	struct gen_dev *gn = dev->mp;
@@ -674,8 +708,6 @@ static struct nvmm_type gen = {
 	.mark_blk	= gen_mark_blk,

 	.get_lun	= gen_get_lun,
-	.reserve_lun	= gen_reserve_lun,
-	.release_lun	= gen_release_lun,
 	.lun_info_print	= gen_lun_info_print,

 	.get_area	= gen_get_area,

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index f293d00..cb30ccf 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -1167,8 +1167,6 @@ static void rrpc_core_free(struct rrpc *rrpc)

 static void rrpc_luns_free(struct rrpc *rrpc)
 {
-	struct nvm_dev *dev = rrpc->dev;
-
[PATCH 7/7] lightnvm: rrpc: split bios of size > 256kb
rrpc cannot handle bios of size > 256KB due to NVMe's 64-bit completion
bitmap. If a larger bio comes in, split it explicitly.

Signed-off-by: Javier González
---
 drivers/lightnvm/rrpc.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 8deef2e..0b8251f 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -984,6 +984,12 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, struct bio *bio)
 	struct nvm_rq *rqd;
 	int err;

+	/*
+	 * Multipage is supported up until 256kb due to NVMe's 64 bit completion
+	 * bitmap.
+	 */
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (bio_op(bio) == REQ_OP_DISCARD) {
 		rrpc_discard(rrpc, bio);
 		return BLK_QC_T_NONE;
--
2.7.4
[PATCH 1/7] lightnvm: enable to send hint to erase command
Erases might be subject to host hints. An example is multi-plane programming to erase blocks in parallel. Enable targets to specify these hints.

Signed-off-by: Javier González
---
 drivers/lightnvm/core.c      | 9 ++---
 drivers/lightnvm/gennvm.c    | 5 ++---
 drivers/lightnvm/rrpc.c      | 2 +-
 drivers/lightnvm/sysblk.c    | 4 ++--
 drivers/nvme/host/lightnvm.c | 1 +
 include/linux/lightnvm.h     | 7 +++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index a2393e1..f752087 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -204,9 +204,9 @@ int nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_submit_io);

-int nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blk)
+int nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blk, int flags)
 {
-	return dev->mt->erase_blk(dev, blk, 0);
+	return dev->mt->erase_blk(dev, blk, flags);
 }
 EXPORT_SYMBOL(nvm_erase_blk);

@@ -287,7 +287,8 @@ void nvm_free_rqd_ppalist(struct nvm_dev *dev, struct nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_free_rqd_ppalist);

-int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas)
+int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas,
+		  int flags)
 {
 	struct nvm_rq rqd;
 	int ret;
@@ -303,6 +304,8 @@ int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas)

 	nvm_generic_to_addr_mode(dev, &rqd);

+	rqd.flags = flags;
+
 	ret = dev->ops->erase_block(dev, &rqd);

 	nvm_free_rqd_ppalist(dev, &rqd);

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index b74174c..730d736 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -593,12 +593,11 @@ static int gen_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
 	return dev->ops->submit_io(dev, rqd);
 }

-static int gen_erase_blk(struct nvm_dev *dev, struct nvm_block *blk,
-							unsigned long flags)
+static int gen_erase_blk(struct nvm_dev *dev, struct nvm_block *blk, int flags)
 {
 	struct ppa_addr addr = block_to_ppa(dev, blk);

-	return nvm_erase_ppa(dev, &addr, 1);
+	return nvm_erase_ppa(dev, &addr, 1, flags);
 }

 static int gen_reserve_lun(struct nvm_dev *dev, int lunid)

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 37fcaad..067e890 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -404,7 +404,7 @@ static void rrpc_block_gc(struct work_struct *work)
 	if (rrpc_move_valid_pages(rrpc, rblk))
 		goto put_back;

-	if (nvm_erase_blk(dev, rblk->parent))
+	if (nvm_erase_blk(dev, rblk->parent, 0))
 		goto put_back;

 	rrpc_put_blk(rrpc, rblk);

diff --git a/drivers/lightnvm/sysblk.c b/drivers/lightnvm/sysblk.c
index a75bd28..d229067 100644
--- a/drivers/lightnvm/sysblk.c
+++ b/drivers/lightnvm/sysblk.c
@@ -379,7 +379,7 @@ static int nvm_prepare_new_sysblks(struct nvm_dev *dev, struct sysblk_scan *s)
 		ppa = &s->ppas[scan_ppa_idx(i, nxt_blk)];
 		ppa->g.pg = ppa_to_slc(dev, 0);

-		ret = nvm_erase_ppa(dev, ppa, 1);
+		ret = nvm_erase_ppa(dev, ppa, 1, 0);
 		if (ret)
 			return ret;

@@ -725,7 +725,7 @@ int nvm_dev_factory(struct nvm_dev *dev, int flags)
 	/* continue to erase until list of blks until empty */
 	while ((ppa_cnt = nvm_fact_get_blks(dev, ppas, max_ppas, blk_bitmap)) > 0)
-		nvm_erase_ppa(dev, ppas, ppa_cnt);
+		nvm_erase_ppa(dev, ppas, ppa_cnt, 0);

 	/* mark host reserved blocks free */
 	if (flags & NVM_FACTORY_RESET_HOST_BLKS) {

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index f5e3011..9470d51 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -543,6 +543,7 @@ static int nvme_nvm_erase_block(struct nvm_dev *dev, struct nvm_rq *rqd)
 	c.erase.nsid = cpu_to_le32(ns->ns_id);
 	c.erase.spba = cpu_to_le64(rqd->ppa_addr.ppa);
 	c.erase.length = cpu_to_le16(rqd->nr_ppas - 1);
+	c.erase.control = cpu_to_le16(rqd->flags);

 	return nvme_submit_sync_cmd(q, (struct nvme_command *)&c, NULL, 0);
 }

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index d190786..d7da953 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -472,8 +472,7 @@
 typedef int (nvmm_open_blk_fn)(struct nvm_dev *, struct nvm_block *);
 typedef int (nvmm_close_blk_fn)(struct nvm_dev *, struct nvm_block *);
 typedef void (nvmm_flush_blk_fn)(struct nvm_dev *, struct nvm_block *);
 typedef int (nvmm_submit_io_fn)(struct nvm_dev *, struct nvm_rq *);
-typedef int (nvmm_erase_blk_fn)(struct nvm_dev *, struct nvm_block *,
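The flag plumbing above is small but easy to lose in the diff noise. As a rough userspace model (struct and function names are invented, not the kernel's): the target passes a hint in `flags`, the core stores it in the request, and the device backend copies it verbatim into the erase command's control field, as `c.erase.control = cpu_to_le16(rqd->flags)` does for NVMe.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_HINT_MULTIPLANE 0x1  /* hypothetical multi-plane erase hint */

struct mock_rqd {
	int flags;
};

/* Core entry point: the target's hint travels in 'flags' and is stored
 * on the request, mirroring "rqd.flags = flags;" in the patch. */
static void mock_erase_ppa(struct mock_rqd *rqd, int flags)
{
	rqd->flags = flags;
}

/* Device driver: copy the hint into the command's 16-bit control field. */
static uint16_t mock_build_control(const struct mock_rqd *rqd)
{
	return (uint16_t)rqd->flags;
}
```

Callers that have no hint simply pass 0, which is what the rrpc and sysblk call sites above do.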
RE: A question regarding "multiple SGL"
> > Hi Robert,
>
> Hey Robert, Christoph,
>
> > please explain your use cases that aren't handled. The one and only
> > reason to set MSDBD to 1 is to make the code a lot simpler given that
> > there is no real use case for supporting more.
> >
> > RDMA uses memory registrations to register large and possibly
> > discontiguous data regions for a single rkey, aka a single SGL
> > descriptor in NVMe terms. There would be two reasons to support
> > multiple SGL descriptors: a) to support a larger I/O size than
> > supported by a single MR, or b) to support a data region format not
> > mappable by a single MR.
> >
> > iSER only supports a single rkey (or stag in IETF terminology) and has
> > been doing fine on a) and mostly fine on b). There are a few possible
> > data layouts not supported by the traditional IB/iWarp FR WRs, but the
> > limit is in fact exactly the same as imposed by the NVMe PRPs used for
> > PCIe NVMe devices, so the Linux block layer has support to not generate
> > them. Also with modern Mellanox IB/RoCE hardware we can actually
> > register completely arbitrary SGLs. iSER supports using this
> > registration mode already with a trivial code addition, but for NVMe
> > we didn't have a pressing need yet.
>
> Good explanation :)
>
> The IO transfer size is a bit more pressing on some devices (e.g.
> cxgb3/4) where the number of pages per MR can indeed be lower than
> a reasonable transfer size (Steve can correct me if I'm wrong).
>
Currently, cxgb4 supports 128KB REG_MR operations on a host with 4K page size, via a max MR page list depth of 32. Soon it will be bumped up from 32 to 128 and life will be better...
> However, if there is a real demand for this we'll happily accept
> patches :)
>
> Just a note, having this feature in place can bring unexpected behavior
> depending on how we implement it:
> - If we can use multiple MRs per IO (for multiple SGLs) we can either
>   prepare for the worst case and allocate enough MRs to satisfy the
>   various IO patterns. This will be much heavier in terms of resource
>   allocation and can limit the scalability of the host driver.
> - Or we can implement a shared MR pool with a reasonable number of MRs.
>   In this case each IO can consume one or more MRs at the expense of
>   other IOs. In this case we may need to requeue the IO later when we
>   have enough available MRs to satisfy the IO. This can yield some
>   unexpected performance gaps for some workloads.
>
I would like to see the storage protocols deal with lack of resources for the worst case. This allows much smaller resource usage for both MRs and SQ resources, at the expense of adding flow-control logic to deal with the lack of available MRs and/or SQ slots to process the next IO. I think it can be implemented efficiently, such that when in flow-control mode the code drives new IO submissions off of SQ completions, which will free up SQ slots and most likely MRs from the QP's MR pool.

Steve.
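The flow-control scheme discussed here can be sketched in a few lines of userspace C (all names hypothetical, counters standing in for real MR and SQ bookkeeping): a fixed-size shared pool, submissions that queue when the pool is empty, and completions that drive the next queued submission.

```c
#define POOL_SIZE 4  /* illustrative pool depth */

struct mr_pool {
	int free_mrs;	/* MRs currently available */
	int queued;	/* I/Os waiting for an MR (flow-control mode) */
};

/* Try to start an I/O; returns 1 if started, 0 if requeued for later. */
static int io_submit(struct mr_pool *p)
{
	if (p->free_mrs == 0) {
		p->queued++;	/* no MR available: wait for a completion */
		return 0;
	}
	p->free_mrs--;
	return 1;
}

/* Completion path: the freed MR immediately drives the next queued I/O,
 * which is the "submissions driven off of SQ completions" idea above. */
static void io_complete(struct mr_pool *p)
{
	p->free_mrs++;
	if (p->queued > 0 && io_submit(p))
		p->queued--;
}
```

A real implementation would also have to bound SQ slot usage the same way; the single counter here only models the MR side.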
[PATCH 3/7] lightnvm: manage block list on LUN owner
LUNs can be exclusively owned by a target through the media manager's reserve_lun function. In this case, the target should implement its own provisioning and manage the free/used/bad block lists internally.

This patch introduces a LUN management structure that can be passed on to LUN exclusive owners. The media manager is the default owner. On boot up, it populates the lists based on the device's bad block table and sysblk metadata. Then, if the LUN is owned by a target, the management structure is passed on to it. From this moment, the target is responsible for list management. Since such LUNs are managed strictly by the target, there is no need for the media manager to reserve GC blocks.

As an FTL, rrpc owns its LUNs exclusively. Therefore, adapt rrpc to use the new management interface.

Signed-off-by: Javier González
---
 drivers/lightnvm/core.c   | 5 ++--
 drivers/lightnvm/gennvm.c | 76 +++
 drivers/lightnvm/gennvm.h | 19 ++--
 drivers/lightnvm/rrpc.c   | 74 ++---
 drivers/lightnvm/rrpc.h   | 4 +++
 include/linux/lightnvm.h  | 21 +
 6 files changed, 145 insertions(+), 54 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index f752087..4be3879 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -178,10 +178,9 @@ static struct nvm_dev *nvm_find_nvm_dev(const char *name)
 	return NULL;
 }

-struct nvm_block *nvm_get_blk(struct nvm_dev *dev, struct nvm_lun *lun,
-							unsigned long flags)
+struct nvm_block *nvm_get_blk(struct nvm_dev *dev, struct nvm_lun *lun)
 {
-	return dev->mt->get_blk(dev, lun, flags);
+	return dev->mt->get_blk(dev, lun);
 }
 EXPORT_SYMBOL(nvm_get_blk);

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index a7e17fa..8bff725 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -243,6 +243,7 @@ static void gen_luns_free(struct nvm_dev *dev)
 static int gen_luns_init(struct nvm_dev *dev, struct gen_dev *gn)
 {
 	struct gen_lun *lun;
+	struct nvm_lun_mgmt *mgmt;
 	int i;

 	gn->luns = kcalloc(dev->nr_luns, sizeof(struct gen_lun), GFP_KERNEL);
@@ -250,18 +251,31 @@ static int gen_luns_init(struct nvm_dev *dev, struct gen_dev *gn)
 		return -ENOMEM;

 	gen_for_each_lun(gn, lun, i) {
+		mgmt = kmalloc(sizeof(struct nvm_lun_mgmt), GFP_KERNEL);
+		if (!mgmt)
+			goto free;
+
+		lun->mgmt = mgmt;
+		lun->tgt = NULL;
+
 		spin_lock_init(&lun->vlun.lock);

-		INIT_LIST_HEAD(&lun->free_list);
-		INIT_LIST_HEAD(&lun->used_list);
-		INIT_LIST_HEAD(&lun->bb_list);
+		INIT_LIST_HEAD(&lun->mgmt->free_list);
+		INIT_LIST_HEAD(&lun->mgmt->used_list);
+		INIT_LIST_HEAD(&lun->mgmt->bb_list);
+		lun->mgmt->nr_free_blocks = dev->blks_per_lun;

-		lun->reserved_blocks = 2; /* for GC only */
 		lun->vlun.id = i;
 		lun->vlun.lun_id = i % dev->luns_per_chnl;
 		lun->vlun.chnl_id = i / dev->luns_per_chnl;
-		lun->vlun.nr_free_blocks = dev->blks_per_lun;
+		lun->vlun.priv = NULL;
 	}

 	return 0;
+
+free:
+	gen_for_each_lun(gn, lun, i)
+		kfree(lun->mgmt);
+
+	return -ENOMEM;
 }

 static int gen_block_bb(struct gen_dev *gn, struct ppa_addr ppa,
@@ -279,12 +293,13 @@ static int gen_block_bb(struct gen_dev *gn, struct ppa_addr ppa,
 	lun = &gn->luns[(dev->luns_per_chnl * ppa.g.ch) + ppa.g.lun];

 	for (i = 0; i < nr_blks; i++) {
-		if (blks[i] == 0)
+		if (blks[i] == NVM_BLK_T_FREE && i > 0)
 			continue;

 		blk = &lun->vlun.blocks[i];
-		list_move_tail(&blk->list, &lun->bb_list);
-		lun->vlun.nr_free_blocks--;
+		list_move_tail(&blk->list, &lun->mgmt->bb_list);
+		blk->state = NVM_BLK_ST_BAD;
+		lun->mgmt->nr_free_blocks--;
 	}

 	return 0;
@@ -333,9 +348,9 @@ static int gen_block_map(u64 slba, u32 nlb, __le64 *entries, void *private)
 			 * block. It's up to the FTL on top to re-etablish the
 			 * block state. The block is assumed to be open.
 			 */
-			list_move_tail(&blk->list, &lun->used_list);
+			list_move_tail(&blk->list, &lun->mgmt->used_list);
 			blk->state = NVM_BLK_ST_TGT;
-			lun->vlun.nr_free_blocks--;
+			lun->mgmt->nr_free_blocks--;
 		}
 	}

@@ -371,7 +386,7 @@ static int gen_blocks_init(struct nvm_dev *dev, struct gen_dev *gn)
 		block->lun = &lun->vlun;
 		block->id = cur_block_id++;
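The list bookkeeping the patch introduces can be modeled in plain userspace C (simplified, names invented, not kernel code): blocks move between free/used/bad states, and `nr_free_blocks` counts only the free list, decrementing both when a block is handed to a target and when a free block is marked bad.

```c
enum blk_state { BLK_FREE, BLK_USED, BLK_BAD };

struct lun_mgmt {
	int nr_free_blocks;  /* mirrors lun->mgmt->nr_free_blocks */
};

/* Allocation: free -> used, mirroring list_move_tail() to used_list. */
static int get_blk(struct lun_mgmt *m, enum blk_state *st)
{
	if (m->nr_free_blocks == 0 || *st != BLK_FREE)
		return -1;
	*st = BLK_USED;
	m->nr_free_blocks--;
	return 0;
}

/* Bad-block marking: only blocks leaving the free list shrink the count,
 * mirroring the move to bb_list plus NVM_BLK_ST_BAD in gen_block_bb(). */
static void mark_bad(struct lun_mgmt *m, enum blk_state *st)
{
	if (*st == BLK_FREE)
		m->nr_free_blocks--;
	*st = BLK_BAD;
}
```

When ownership of the LUN is handed to a target, it is this whole structure (lists plus counter) that changes hands, so there is a single bookkeeper at any time.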
[PATCH] null_blk: Add notes to use LightNVM
If CONFIG_NVM is disabled, loading the null_blk module with use_lightnvm=1 fails, but no message is printed and the documentation does not mention the requirement. So the patch adds notes on what is needed to use LightNVM.

Signed-off-by: Yasuaki Ishimatsu
Cc: Jens Axboe
---
 Documentation/block/null_blk.txt | 2 +-
 drivers/block/null_blk.c         | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index d8880ca..0365a26 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -72,4 +72,4 @@ use_per_node_hctx=[0/1]: Default: 0
 queue for each CPU node in the system.

 use_lightnvm=[0/1]: Default: 0
-  Register device with LightNVM. Requires blk-mq to be used.
+  Register device with LightNVM. Requires blk-mq and CONFIG_NVM=y to be used.

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index ba6f4a2e..4943ee2 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -577,6 +577,7 @@ static void null_nvm_unregister(struct nullb *nullb)
 #else
 static int null_nvm_register(struct nullb *nullb)
 {
+	pr_err("null_blk: CONFIG_NVM needs to be enabled for LightNVM\n");
 	return -EINVAL;
 }

 static void null_nvm_unregister(struct nullb *nullb) {}
--
1.8.3.1
[PATCH 2/7] lightnvm: do not decide on device blocks
Device blocks should be marked by the device and considered bad blocks by the media manager. Thus, do not make assumptions about which blocks are going to be used by the device; in doing so we might lose valid blocks from the free list.

Signed-off-by: Javier González
---
 drivers/lightnvm/gennvm.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index 730d736..a7e17fa 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -371,12 +371,6 @@ static int gen_blocks_init(struct nvm_dev *dev, struct gen_dev *gn)
 		block->lun = &lun->vlun;
 		block->id = cur_block_id++;

-		/* First block is reserved for device */
-		if (unlikely(lun_iter == 0 && blk_iter == 0)) {
-			lun->vlun.nr_free_blocks--;
-			continue;
-		}
-
 		list_add_tail(&block->list, &lun->free_list);
 	}
--
2.7.4
Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
On Thu, Oct 27, 2016 at 08:41:27PM +0100, Mark Brown wrote:
> Plus the benchmarking to verify that it works well of course, especially
> initially where it'll also be a new queue infrastructure as well as the
> blk-mq conversion itself. It does feel like something that's going to
> take at least a couple of kernel releases to get through.

Or to put it the other way around: it could have been done long ago if people had started it the first time it was suggested. Instead you guys keep arguing and nothing gets done. Get started now; waiting won't make anything go faster.

> I think there's also value in having improvements there for people who
> benefit from them while queue infrastructure for blk-mq is being worked
> on.

Well, apply it to your vendor tree then and maintain it yourself if you disagree with our direction.
[PATCH 0/7] LightNVM patchset V2
V2: Rebase patches on Matias' for-next

Javier González (7):
  lightnvm: enable to send hint to erase command
  lightnvm: do not decide on device blocks
  lightnvm: manage block list on LUN owner
  lightnvm: drop reserve and release LUN callbacks
  lightnvm: export set bad block table
  lightnvm: add ECC error codes
  lightnvm: rrpc: split bios of size > 256kb

 drivers/lightnvm/core.c      |  41 --
 drivers/lightnvm/gennvm.c    | 180 +--
 drivers/lightnvm/gennvm.h    |  19 ++---
 drivers/lightnvm/rrpc.c      | 128 --
 drivers/lightnvm/rrpc.h      |   4 +
 drivers/lightnvm/sysblk.c    |  33 ++--
 drivers/nvme/host/lightnvm.c |   1 +
 include/linux/lightnvm.h     |  52 +
 8 files changed, 305 insertions(+), 153 deletions(-)

--
2.7.4
Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
[...]
>> Instead, what I can tell, as we have been looking into converting mmc
>> (which I maintain), is that it is indeed a significant amount of work.
>> We will need to rip out all of the mmc request management, and most
>> likely we also need to extend the blkmq interface - as to be able to
>> re-implement all the current request optimizations. We are looking
>> into this, but it just takes time.
>
> It's usually as much work as you make it into, for most cases it's
> pretty straight forward and usually removes more code than it adds.
> Hence the end result is better for it as well - less code in a driver
> is better.

From a scalability and maintenance point of view, converting to blkmq makes perfect sense. Although, personally, I don't want to sacrifice performance (even a little) just for the sake of gaining scalability/maintainability. I would rather strive to adapt the blkmq framework to also suit my needs; that simply takes more time. For example, in the mmc case we have implemented an asynchronous request path, which greatly improves performance on some systems.

>> I can imagine that it's not always a straightforward "convert to blk
>> mq" patch for every block device driver.
>
> Well, I've actually done a few conversions, and it's not difficult at
> all. The grunt of the work is usually around converting to using some of
> the blk-mq features for parts of the driver that it had implemented
> privately, like timeout handling, etc.
>
> I'm always happy to help people with converting drivers.

Great, we'll ping you if we need some help! Thanks!

> 3) While we work on scheduling in blkmq (at least for single queue
> devices), it's of course important that we set high goals. Having
> BFQ (and the other schedulers) in the legacy blk provides a good
> reference for what we could aim for.
>>>
>>> Sure, but you don't need BFQ to be included in the kernel for that.
>>
>> Perhaps not.
>>
>> But does that mean you expect Paolo to maintain an up-to-date BFQ
>> tree for you?
>
> I don't expect anything. If Paolo or others want to compare with BFQ on
> the legacy IO path, then they can do that however way they want. If you
> (and others) want to have that reference point, it's up to you how to
> accomplish that.

Do I get this right? You personally don't care about using BFQ as a reference when evolving blkmq for single queue devices? Paolo and lots of other Linux users certainly do care about this.

Moreover, I am still trying to understand why you say no to BFQ as a legacy scheduler. Ideally it shouldn't cause you any maintenance burden, and it doesn't make the removal of the legacy blk layer any more difficult, right?

Kind regards
Ulf Hansson
Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
On Thu, Oct 27, 2016 at 12:21:06PM -0600, Jens Axboe wrote:
> On 10/27/2016 12:13 PM, Ulf Hansson wrote:
> > I can imagine, that it's not always a straight forward "convert to blk
> > mq" patch for every block device driver.
> Well, I've actually done a few conversions, and it's not difficult at
> all. The grunt of the work is usually around converting to using some of
> the blk-mq features for parts of the driver that it had implemented
> privately, like timeout handling, etc.

Plus the benchmarking to verify that it works well of course, especially initially where it'll also be a new queue infrastructure as well as the blk-mq conversion itself. It does feel like something that's going to take at least a couple of kernel releases to get through.

> > > > 3) While we work on scheduling in blkmq (at least for single queue
> > > > devices), it's of course important that we set high goals. Having BFQ
> > > > (and the other schedulers) in the legacy blk, provides a good
> > > > reference for what we could aim for.
> > > Sure, but you don't need BFQ to be included in the kernel for that.
> > Perhaps not.
> > But does that mean, you expect Paolo to maintain an up to date BFQ
> > tree for you?
> I don't expect anything. If Paolo or others want to compare with BFQ on
> the legacy IO path, then they can do that however way they want. If you
> (and others) want to have that reference point, it's up to you how to
> accomplish that.

I think there's also value in having improvements there for people who benefit from them while queue infrastructure for blk-mq is being worked on.
[PATCH 7/7] lightnvm: rrpc: split bios of size > 256kb
rrpc cannot handle bios of size > 256KB, because NVMe uses a 64-bit bitmap to signal I/O completion. If a larger bio arrives, split it explicitly.

Signed-off-by: Javier González
---
 drivers/lightnvm/rrpc.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 8deef2e..0b8251f 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -984,6 +984,12 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, struct bio *bio)
 	struct nvm_rq *rqd;
 	int err;

+	/*
+	 * Multipage is supported up until 256kb due to NVMe's 64 bit
+	 * completion bitmap.
+	 */
+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (bio_op(bio) == REQ_OP_DISCARD) {
 		rrpc_discard(rrpc, bio);
 		return BLK_QC_T_NONE;
--
2.7.4
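A quick sketch of where the 256 KB figure comes from, assuming 4 KB pages (the constants here are illustrative, not the kernel's): one completion bit per page in a 64-bit bitmap caps a request at 64 pages, and anything larger must be split into pieces no bigger than that.

```c
#define MOCK_PAGE_SIZE  4096u
#define COMPLETION_BITS 64u  /* bits in the per-request completion bitmap */

/* Largest bio the completion bitmap can track: 64 * 4 KB = 256 KB. */
static unsigned max_bio_bytes(void)
{
	return COMPLETION_BITS * MOCK_PAGE_SIZE;
}

/* Number of pieces a split (as blk_queue_split() performs) would need
 * for a bio of the given size, rounding up. */
static unsigned nr_splits(unsigned bio_bytes)
{
	unsigned max = max_bio_bytes();

	return (bio_bytes + max - 1) / max;
}
```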
[PATCH 6/7] lightnvm: add ECC error codes
Add ECC error codes to enable the appropriate handling in the target.

Signed-off-by: Javier González
---
 include/linux/lightnvm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index e3ccaff..33643ae 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -107,6 +107,8 @@ enum {
 	NVM_RSP_NOT_CHANGEABLE	= 0x1,
 	NVM_RSP_ERR_FAILWRITE	= 0x40ff,
 	NVM_RSP_ERR_EMPTYPAGE	= 0x42ff,
+	NVM_RSP_ERR_FAILECC	= 0x4281,
+	NVM_RSP_WARN_HIGHECC	= 0x4700,

 	/* Device opcodes */
 	NVM_OP_HBREAD		= 0x02,
--
2.7.4
[PATCH 3/7] lightnvm: manage block list on LUN owner
From: Javier González

LUNs can be exclusively owned by a target through the media manager's reserve_lun function. In this case, the target should implement its own provisioning and manage the free/used/bad block lists internally.

This patch introduces a LUN management structure that can be passed on to LUN exclusive owners. The media manager is the default owner. On boot up, it populates the lists based on the device's bad block table and sysblk metadata. Then, if the LUN is owned by a target, the management structure is passed on to it. From this moment, the target is responsible for list management. Since such LUNs are managed strictly by the target, there is no need for the media manager to reserve GC blocks.

As an FTL, rrpc owns its LUNs exclusively. Therefore, adapt rrpc to use the new management interface.

Signed-off-by: Javier González
---
 drivers/lightnvm/core.c   | 5 ++--
 drivers/lightnvm/gennvm.c | 76 +++
 drivers/lightnvm/gennvm.h | 19 ++--
 drivers/lightnvm/rrpc.c   | 74 ++---
 drivers/lightnvm/rrpc.h   | 4 +++
 include/linux/lightnvm.h  | 21 +
 6 files changed, 145 insertions(+), 54 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 1a6a7e6..4c6577e 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -178,10 +178,9 @@ static struct nvm_dev *nvm_find_nvm_dev(const char *name)
 	return NULL;
 }

-struct nvm_block *nvm_get_blk(struct nvm_dev *dev, struct nvm_lun *lun,
-							unsigned long flags)
+struct nvm_block *nvm_get_blk(struct nvm_dev *dev, struct nvm_lun *lun)
 {
-	return dev->mt->get_blk(dev, lun, flags);
+	return dev->mt->get_blk(dev, lun);
 }
 EXPORT_SYMBOL(nvm_get_blk);

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index a7e17fa..8bff725 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -243,6 +243,7 @@ static void gen_luns_free(struct nvm_dev *dev)
 static int gen_luns_init(struct nvm_dev *dev, struct gen_dev *gn)
 {
 	struct gen_lun *lun;
+	struct nvm_lun_mgmt *mgmt;
 	int i;

 	gn->luns = kcalloc(dev->nr_luns, sizeof(struct gen_lun), GFP_KERNEL);
@@ -250,18 +251,31 @@ static int gen_luns_init(struct nvm_dev *dev, struct gen_dev *gn)
 		return -ENOMEM;

 	gen_for_each_lun(gn, lun, i) {
+		mgmt = kmalloc(sizeof(struct nvm_lun_mgmt), GFP_KERNEL);
+		if (!mgmt)
+			goto free;
+
+		lun->mgmt = mgmt;
+		lun->tgt = NULL;
+
 		spin_lock_init(&lun->vlun.lock);

-		INIT_LIST_HEAD(&lun->free_list);
-		INIT_LIST_HEAD(&lun->used_list);
-		INIT_LIST_HEAD(&lun->bb_list);
+		INIT_LIST_HEAD(&lun->mgmt->free_list);
+		INIT_LIST_HEAD(&lun->mgmt->used_list);
+		INIT_LIST_HEAD(&lun->mgmt->bb_list);
+		lun->mgmt->nr_free_blocks = dev->blks_per_lun;

-		lun->reserved_blocks = 2; /* for GC only */
 		lun->vlun.id = i;
 		lun->vlun.lun_id = i % dev->luns_per_chnl;
 		lun->vlun.chnl_id = i / dev->luns_per_chnl;
-		lun->vlun.nr_free_blocks = dev->blks_per_lun;
+		lun->vlun.priv = NULL;
 	}

 	return 0;
+
+free:
+	gen_for_each_lun(gn, lun, i)
+		kfree(lun->mgmt);
+
+	return -ENOMEM;
 }

 static int gen_block_bb(struct gen_dev *gn, struct ppa_addr ppa,
@@ -279,12 +293,13 @@ static int gen_block_bb(struct gen_dev *gn, struct ppa_addr ppa,
 	lun = &gn->luns[(dev->luns_per_chnl * ppa.g.ch) + ppa.g.lun];

 	for (i = 0; i < nr_blks; i++) {
-		if (blks[i] == 0)
+		if (blks[i] == NVM_BLK_T_FREE && i > 0)
 			continue;

 		blk = &lun->vlun.blocks[i];
-		list_move_tail(&blk->list, &lun->bb_list);
-		lun->vlun.nr_free_blocks--;
+		list_move_tail(&blk->list, &lun->mgmt->bb_list);
+		blk->state = NVM_BLK_ST_BAD;
+		lun->mgmt->nr_free_blocks--;
 	}

 	return 0;
@@ -333,9 +348,9 @@ static int gen_block_map(u64 slba, u32 nlb, __le64 *entries, void *private)
 			 * block. It's up to the FTL on top to re-etablish the
 			 * block state. The block is assumed to be open.
 			 */
-			list_move_tail(&blk->list, &lun->used_list);
+			list_move_tail(&blk->list, &lun->mgmt->used_list);
 			blk->state = NVM_BLK_ST_TGT;
-			lun->vlun.nr_free_blocks--;
+			lun->mgmt->nr_free_blocks--;
 		}
 	}

@@ -371,7 +386,7 @@ static int gen_blocks_init(struct nvm_dev *dev, struct gen_dev *gn)
 		block->lun = &lun->vlun;
[PATCH 4/7] lightnvm: drop reserve and release LUN callbacks
From: Javier González

On target initialization, targets use callbacks to the media manager to configure the LUNs they use. In order to simplify the flow, drop these callbacks and manage everything internally in the media manager. By making use of the newly introduced LUN management structure, the media manager knows which target exclusively owns each LUN and can therefore allocate and free all the necessary structures before initializing the target. LUNs that are not exclusively owned belong to the media manager in any case.

Adapt rrpc to not use the reserve_lun/release_lun callback functions.

Signed-off-by: Javier González
---
 drivers/lightnvm/gennvm.c | 68 ---
 drivers/lightnvm/rrpc.c   | 12 +
 include/linux/lightnvm.h  | 5 +---
 3 files changed, 55 insertions(+), 30 deletions(-)

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index 8bff725..a340685 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -35,6 +35,50 @@ static const struct block_device_operations gen_fops = {
 	.owner = THIS_MODULE,
 };

+static int gen_reserve_luns(struct nvm_dev *dev, int lun_begin, int lun_end,
+			    struct nvm_target *t)
+{
+	struct gen_dev *gn = dev->mp;
+	struct gen_lun *lun;
+	int i;
+
+	for (i = lun_begin; i <= lun_end; i++) {
+		if (test_and_set_bit(i, dev->lun_map)) {
+			pr_err("gennvm: lun %d is already allocated\n", i);
+			goto fail;
+		}
+
+		lun = &gn->luns[i];
+		lun->tgt = t;
+		lun->vlun.priv = lun->mgmt;
+	}
+
+	return 0;
+fail:
+	while (--i > lun_begin)
+		clear_bit(i, dev->lun_map);
+
+	return 1;
+}
+
+static void gen_release_luns(struct nvm_dev *dev, struct nvm_target *t)
+{
+	struct gen_dev *gn = dev->mp;
+	struct gen_lun *lun;
+	int lunid;
+	int i;
+
+	gen_for_each_lun(gn, lun, i) {
+		if (lun->tgt != t)
+			continue;
+
+		lunid = lun->vlun.id;
+		WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+		lun->vlun.priv = NULL;
+		lun->tgt = NULL;
+	}
+}
+
 static int gen_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 {
 	struct gen_dev *gn = dev->mp;
@@ -80,6 +124,9 @@ static int gen_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 	tdisk->fops = &gen_fops;
 	tdisk->queue = tqueue;

+	if (tt->exclusive && gen_reserve_luns(dev, s->lun_begin, s->lun_end, t))
+		goto err_reserve;
+
 	targetdata = tt->init(dev, tdisk, s->lun_begin, s->lun_end);
 	if (IS_ERR(targetdata))
 		goto err_init;
@@ -102,6 +149,8 @@ static int gen_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 	return 0;
 err_init:
+	gen_release_luns(dev, t);
+err_reserve:
 	put_disk(tdisk);
 err_queue:
 	blk_cleanup_queue(tqueue);
@@ -110,7 +159,7 @@ static int gen_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 	return -ENOMEM;
 }

-static void __gen_remove_target(struct nvm_target *t)
+static void __gen_remove_target(struct nvm_dev *dev, struct nvm_target *t)
 {
 	struct nvm_tgt_type *tt = t->type;
 	struct gendisk *tdisk = t->disk;
@@ -122,6 +171,7 @@ static void __gen_remove_target(struct nvm_target *t)
 	if (tt->exit)
 		tt->exit(tdisk->private_data);

+	gen_release_luns(dev, t);
 	put_disk(tdisk);

 	list_del(&t->list);
@@ -152,7 +202,7 @@ static int gen_remove_tgt(struct nvm_dev *dev, struct nvm_ioctl_remove *remove)
 		mutex_unlock(&gn->lock);
 		return 1;
 	}
-	__gen_remove_target(t);
+	__gen_remove_target(dev, t);
 	mutex_unlock(&gn->lock);

 	return 0;
@@ -474,7 +524,7 @@ static void gen_unregister(struct nvm_dev *dev)
 	list_for_each_entry_safe(t, tmp, &gn->targets, list) {
 		if (t->dev != dev)
 			continue;
-		__gen_remove_target(t);
+		__gen_remove_target(dev, t);
 	}
 	mutex_unlock(&gn->lock);

@@ -618,16 +668,6 @@ static int gen_erase_blk(struct nvm_dev *dev, struct nvm_block *blk, int flags)
 	return nvm_erase_ppa(dev, &addr, 1, flags);
 }

-static int gen_reserve_lun(struct nvm_dev *dev, int lunid)
-{
-	return test_and_set_bit(lunid, dev->lun_map);
-}
-
-static void gen_release_lun(struct nvm_dev *dev, int lunid)
-{
-	WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
-}
-
 static struct nvm_lun *gen_get_lun(struct nvm_dev *dev, int lunid)
 {
 	struct gen_dev *gn = dev->mp;
@@ -674,8 +714,6 @@ static struct nvmm_type gen = {
 	.mark_blk	= gen_mark_blk,
 	.get_lun	= gen_get_lun,
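The reserve/rollback pattern in gen_reserve_luns() can be modeled in plain C, with an int array standing in for the kernel bitmap and its test_and_set_bit()/clear_bit() (names invented). One review-style note: this sketch rolls back down to `begin` inclusive (`--i >= begin`); the posted hunk's `--i > lun_begin` would leave the first claimed bit set on a mid-range conflict.

```c
#define NR_LUNS 8

struct lun_map {
	int bit[NR_LUNS];  /* stands in for dev->lun_map */
};

/* Claim LUNs [begin, end]; on conflict, release everything claimed so
 * far (but not the conflicting bit, which belongs to someone else). */
static int reserve_luns(struct lun_map *m, int begin, int end)
{
	int i;

	for (i = begin; i <= end; i++) {
		if (m->bit[i]) {		/* already owned: roll back */
			while (--i >= begin)
				m->bit[i] = 0;
			return -1;
		}
		m->bit[i] = 1;
	}
	return 0;
}
```

A failed reservation thus leaves the map exactly as it was, so a later target can still claim any LUNs this attempt touched.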
[PATCH 1/7] lightnvm: enable to send hint to erase command
Erases might be subject to host hints. An example is multi-plane programming to erase blocks in parallel. Enable targets to specify these hints.

Signed-off-by: Javier González
---
 drivers/lightnvm/core.c      | 9 ++---
 drivers/lightnvm/gennvm.c    | 5 ++---
 drivers/lightnvm/rrpc.c      | 2 +-
 drivers/lightnvm/sysblk.c    | 4 ++--
 drivers/nvme/host/lightnvm.c | 1 +
 include/linux/lightnvm.h     | 7 +++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 1cac0f8..1a6a7e6 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -204,9 +204,9 @@ int nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_submit_io);

-int nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blk)
+int nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blk, int flags)
 {
-	return dev->mt->erase_blk(dev, blk, 0);
+	return dev->mt->erase_blk(dev, blk, flags);
 }
 EXPORT_SYMBOL(nvm_erase_blk);

@@ -287,7 +287,8 @@ void nvm_free_rqd_ppalist(struct nvm_dev *dev, struct nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_free_rqd_ppalist);

-int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas)
+int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas,
+		  int flags)
 {
 	struct nvm_rq rqd;
 	int ret;
@@ -303,6 +304,8 @@ int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas)

 	nvm_generic_to_addr_mode(dev, &rqd);

+	rqd.flags = flags;
+
 	ret = dev->ops->erase_block(dev, &rqd);

 	nvm_free_rqd_ppalist(dev, &rqd);

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index b74174c..730d736 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -593,12 +593,11 @@ static int gen_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
 	return dev->ops->submit_io(dev, rqd);
 }

-static int gen_erase_blk(struct nvm_dev *dev, struct nvm_block *blk,
-							unsigned long flags)
+static int gen_erase_blk(struct nvm_dev *dev, struct nvm_block *blk, int flags)
 {
 	struct ppa_addr addr = block_to_ppa(dev, blk);

-	return nvm_erase_ppa(dev, &addr, 1);
+	return nvm_erase_ppa(dev, &addr, 1, flags);
 }

 static int gen_reserve_lun(struct nvm_dev *dev, int lunid)

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 37fcaad..067e890 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -404,7 +404,7 @@ static void rrpc_block_gc(struct work_struct *work)
 	if (rrpc_move_valid_pages(rrpc, rblk))
 		goto put_back;

-	if (nvm_erase_blk(dev, rblk->parent))
+	if (nvm_erase_blk(dev, rblk->parent, 0))
 		goto put_back;

 	rrpc_put_blk(rrpc, rblk);

diff --git a/drivers/lightnvm/sysblk.c b/drivers/lightnvm/sysblk.c
index a75bd28..d229067 100644
--- a/drivers/lightnvm/sysblk.c
+++ b/drivers/lightnvm/sysblk.c
@@ -379,7 +379,7 @@ static int nvm_prepare_new_sysblks(struct nvm_dev *dev, struct sysblk_scan *s)
 		ppa = &s->ppas[scan_ppa_idx(i, nxt_blk)];
 		ppa->g.pg = ppa_to_slc(dev, 0);

-		ret = nvm_erase_ppa(dev, ppa, 1);
+		ret = nvm_erase_ppa(dev, ppa, 1, 0);
 		if (ret)
 			return ret;

@@ -725,7 +725,7 @@ int nvm_dev_factory(struct nvm_dev *dev, int flags)
 	/* continue to erase until list of blks until empty */
 	while ((ppa_cnt = nvm_fact_get_blks(dev, ppas, max_ppas, blk_bitmap)) > 0)
-		nvm_erase_ppa(dev, ppas, ppa_cnt);
+		nvm_erase_ppa(dev, ppas, ppa_cnt, 0);

 	/* mark host reserved blocks free */
 	if (flags & NVM_FACTORY_RESET_HOST_BLKS) {

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index f5e3011..9470d51 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -543,6 +543,7 @@ static int nvme_nvm_erase_block(struct nvm_dev *dev, struct nvm_rq *rqd)
 	c.erase.nsid = cpu_to_le32(ns->ns_id);
 	c.erase.spba = cpu_to_le64(rqd->ppa_addr.ppa);
 	c.erase.length = cpu_to_le16(rqd->nr_ppas - 1);
+	c.erase.control = cpu_to_le16(rqd->flags);

 	return nvme_submit_sync_cmd(q, (struct nvme_command *)&c, NULL, 0);
 }

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index d190786..d7da953 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -472,8 +472,7 @@
 typedef int (nvmm_open_blk_fn)(struct nvm_dev *, struct nvm_block *);
 typedef int (nvmm_close_blk_fn)(struct nvm_dev *, struct nvm_block *);
 typedef void (nvmm_flush_blk_fn)(struct nvm_dev *, struct nvm_block *);
 typedef int (nvmm_submit_io_fn)(struct nvm_dev *, struct nvm_rq *);
-typedef int (nvmm_erase_blk_fn)(struct nvm_dev *, struct nvm_block *,
Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
On 10/27/2016 01:34 PM, Ulf Hansson wrote:

[...]

> Instead, what I can tell, as we have been looking into converting mmc
> (which I maintain), is that it is indeed a significant amount of work.
> We will need to rip out all of the mmc request management, and most
> likely we will also need to extend the blkmq interface to be able to
> re-implement all the current request optimizations. We are looking
> into this, but it just takes time.

It's usually as much work as you make it into; for most cases it's
pretty straightforward and usually removes more code than it adds.
Hence the end result is better for it as well - less code in a driver
is better.

> From a scalability and maintenance point of view, converting to blkmq
> makes perfect sense. Although, personally I don't want to sacrifice
> performance (or at most very little), just for the sake of gaining in
> scalability/maintainability.

Nobody has said anything about sacrificing performance. And whether you
like it or not, maintainability is always the most important aspect.
Even performance takes a backseat to maintainability.

> I would rather strive to adapt the blkmq framework to also suit my
> needs, even if that simply takes more time. For example, in the mmc
> case we have implemented an asynchronous request path, which greatly
> improves performance on some systems.

blk-mq has evolved to support a variety of devices; there's nothing
special about mmc that can't work well within that framework.

>>> 3) While we work on scheduling in blkmq (at least for single queue
>>> devices), it's of course important that we set high goals. Having
>>> BFQ (and the other schedulers) in the legacy blk provides a good
>>> reference for what we could aim for.
>>
>> Sure, but you don't need BFQ to be included in the kernel for that.
>
> Perhaps not. But does that mean you expect Paolo to maintain an up to
> date BFQ tree for you?

I don't expect anything. If Paolo or others want to compare with BFQ on
the legacy IO path, then they can do that however way they want.

>> If you (and others) want to have that reference point, it's up to you
>> how to accomplish that.
>
> Do I get this right? You personally don't care about using BFQ as a
> reference when evolving blkmq for single queue devices? Paolo and lots
> of other Linux users certainly do care about this.

I'm getting a little tired of this putting words in my mouth... That is
not what I'm saying at all. What I'm saying is that the people working
on BFQ can do what they need to do to have a reference implementation
to compare against. You don't need BFQ in the kernel for that. I said
it's up to YOU, with the "you" here meaning the people that want to
work on it, how that goes down.

> Moreover, I am still trying to understand what's the big deal about
> why you say no to BFQ as a legacy scheduler. Ideally it shouldn't
> cause you any maintenance burden, and it doesn't make the removal of
> the legacy blk layer any more difficult, right?

Not sure I can state it much clearer. It's a new scheduler, and a
complicated one at that. It WILL carry a maintenance burden. And I'm
really not that interested in adding such a burden for something that
will be defunct as soon as the single queue blk-mq version is done.

Additionally, if we put BFQ in right now, the motivation to do the real
work will be gone. The path forward is clear. It'd be a lot better to
put some work behind that, rather than continue this email thread.

--
Jens Axboe
--
To unsubscribe from this list: send the line "unsubscribe linux-block"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/12] dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code
On Wed, Oct 26, 2016 at 03:54:07PM -0700, Bart Van Assche wrote:
> Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
> in the dm start and stop queue functions, only manipulate the latter
> flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
>
> Signed-off-by: Bart Van Assche
> Reviewed-by: Christoph Hellwig

Looks good,
Reviewed-by: Johannes Thumshirn

--
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 04/12] blk-mq: Move more code into blk_mq_direct_issue_request()
Looks good:
Reviewed-by: Christoph Hellwig
Re: [PATCH 11/12] nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code
Looks good,
Reviewed-by: Sagi Grimberg
Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
On 10/27/2016 08:34 AM, Grozdan wrote:

On Thu, Oct 27, 2016 at 11:26 AM, Jan Kara wrote:

On Wed 26-10-16 10:12:38, Jens Axboe wrote:

On 10/26/2016 10:04 AM, Paolo Valente wrote:

On 26 Oct 2016, at 17:32, Jens Axboe wrote:

On 10/26/2016 09:29 AM, Christoph Hellwig wrote:

On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:

The question to ask first is whether to actually have pluggable
schedulers on blk-mq at all, or just have one that is meant to do the
right thing in every case (and possibly can be bypassed completely).

That would be my preference. Have a BFQ-variant for blk-mq as an option
(default to off unless opted in by the driver or user), and no other
scheduler for blk-mq. Don't bother with bfq for non blk-mq. It's not
like there is any advantage in the legacy-request device even for slow
devices, except for the option of having I/O scheduling.

It's the only right way forward. blk-mq might not offer any substantial
advantages to rotating storage, but with scheduling, it won't offer a
downside either. And it'll take us towards the real goal, which is to
have just one IO path.

ok

Adding a new scheduler for the legacy IO path makes no sense.

I would fully agree if effective and stable I/O scheduling would be
available in blk-mq in one or two months. But I guess that it will take
at least one year, optimistically, given the current status of the
needed infrastructure, and given the great difficulties of doing
effective scheduling at the high parallelism and extreme target speeds
of blk-mq. Of course, this holds true unless little clever scheduling
is performed.

So, what's the point in forcing a lot of users to wait another year or
more for a solution that has yet to be even defined, while they could
enjoy a much better system now, and then switch to an even better
system when scheduling is ready in blk-mq too?

That same argument could have been made 2 years ago. Saying no to a new
scheduler for the legacy framework goes back roughly that long.
We could have had BFQ for mq NOW, if we didn't keep coming back to this
very point.

I'm hesitant to add a new scheduler because it's very easy to add, very
difficult to get rid of. If we do add BFQ as a legacy scheduler now,
it'll take us years and years to get rid of it again. We should be
moving towards LESS moving parts in the legacy path, not more.

We can keep having this discussion every few years, but I think we'd
both prefer to make some actual progress here. It's perfectly fine to
add a single queue interface for an IO scheduler for blk-mq, since we
don't care too much about scalability there. And that won't take years,
that should be a few weeks. Retrofitting BFQ on top of that should not
be hard either. That can co-exist with a real multiqueue scheduler as
well, something that's geared towards some fairness for faster devices.

OK, so some solution like having a variant of blk_sq_make_request()
that will consume requests, do IO scheduling decisions on them, and
feed them into the HW queue as it sees fit would be acceptable? That
will provide the IO scheduler the global view that it needs for complex
scheduling decisions, so it should indeed be relatively easy to port
BFQ to work like that.

Honza
--
Jan Kara SUSE Labs, CR

Hello,

Let me first say that I'm in no way associated with Paolo Valente or
any other BFQ developer. I'm a mere user who has had great experience
using BFQ.

My workload is one that takes my disks to their limits. I often use
large files like raw Blu-ray streams which I then remux to mkvs while
at the same time streaming at least 2 movies to various devices in the
house, and using my system as I do while the remuxing process is going
on.
At times, I'm also pushing video files to my NAS at close to Gbps speed
while the stuff I mentioned is in progress.

My experience with BFQ is that it has never resulted in the video
streams being interrupted due to disk thrashing. I've extensively used
all the other Linux disk schedulers in the past, and what I've observed
is that whenever I start the remuxing (and copying) process, the
streams will begin to hiccup, stutter, and often multi-second-long
"waits" will occur.

It gets even worse: when I do this kind of workload, the whole system
will come to almost a halt and interactivity goes out the window.
Impossible to start an app in a reasonable amount of time. Loading a
visited website makes Chrome hang while trying to get the contents from
its cache, etc.

BFQ has greatly helped to have a responsive system during such
operations and, as I said, I have never experienced any interruption of
the video streams. Do I think BFQ is the best
Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
On 10/27/2016 03:26 AM, Jan Kara wrote:

On Wed 26-10-16 10:12:38, Jens Axboe wrote:

On 10/26/2016 10:04 AM, Paolo Valente wrote:

On 26 Oct 2016, at 17:32, Jens Axboe wrote:

On 10/26/2016 09:29 AM, Christoph Hellwig wrote:

On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:

The question to ask first is whether to actually have pluggable
schedulers on blk-mq at all, or just have one that is meant to do the
right thing in every case (and possibly can be bypassed completely).

That would be my preference. Have a BFQ-variant for blk-mq as an option
(default to off unless opted in by the driver or user), and no other
scheduler for blk-mq. Don't bother with bfq for non blk-mq. It's not
like there is any advantage in the legacy-request device even for slow
devices, except for the option of having I/O scheduling.

It's the only right way forward. blk-mq might not offer any substantial
advantages to rotating storage, but with scheduling, it won't offer a
downside either. And it'll take us towards the real goal, which is to
have just one IO path.

ok

Adding a new scheduler for the legacy IO path makes no sense.

I would fully agree if effective and stable I/O scheduling would be
available in blk-mq in one or two months. But I guess that it will take
at least one year, optimistically, given the current status of the
needed infrastructure, and given the great difficulties of doing
effective scheduling at the high parallelism and extreme target speeds
of blk-mq. Of course, this holds true unless little clever scheduling
is performed.

So, what's the point in forcing a lot of users to wait another year or
more for a solution that has yet to be even defined, while they could
enjoy a much better system now, and then switch to an even better
system when scheduling is ready in blk-mq too?

That same argument could have been made 2 years ago. Saying no to a new
scheduler for the legacy framework goes back roughly that long.
We could have had BFQ for mq NOW, if we didn't keep coming back to this
very point.

I'm hesitant to add a new scheduler because it's very easy to add, very
difficult to get rid of. If we do add BFQ as a legacy scheduler now,
it'll take us years and years to get rid of it again. We should be
moving towards LESS moving parts in the legacy path, not more.

We can keep having this discussion every few years, but I think we'd
both prefer to make some actual progress here. It's perfectly fine to
add a single queue interface for an IO scheduler for blk-mq, since we
don't care too much about scalability there. And that won't take years,
that should be a few weeks. Retrofitting BFQ on top of that should not
be hard either. That can co-exist with a real multiqueue scheduler as
well, something that's geared towards some fairness for faster devices.

OK, so some solution like having a variant of blk_sq_make_request()
that will consume requests, do IO scheduling decisions on them, and
feed them into the HW queue as it sees fit would be acceptable? That
will provide the IO scheduler the global view that it needs for complex
scheduling decisions, so it should indeed be relatively easy to port
BFQ to work like that.

I'd probably start off Omar's base [1] that switches the software
queues to store bios instead of requests, since that lifts the 1:1
mapping between what we can queue up and what we can dispatch. Without
that, the IO scheduler won't have too much to work with. And with that
in place, it'll be a "bio in, request out" type of setup, which is
similar to what we have in the legacy path.

I'd keep the software queues, but as a starting point, mandate 1
hardware queue to keep that as the per-device view of the state. The IO
scheduler would be responsible for moving one or more bios from the
software queues to the hardware queue, when they are ready to dispatch.
[1] https://github.com/osandov/linux/commit/8ef3508628b6cf7c4712cd3d8084ee11ef5d2530

--
Jens Axboe
Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
[...]

> I'm hesitant to add a new scheduler because it's very easy to add,
> very difficult to get rid of. If we do add BFQ as a legacy scheduler
> now, it'll take us years and years to get rid of it again. We should
> be moving towards LESS moving parts in the legacy path, not more.

Jens, I think you are wrong here and let me try to elaborate on why.

1) We already have legacy schedulers like CFQ, DEADLINE, etc - and most
block device drivers are still using the legacy blk interface.

To be able to remove the legacy blk layer, all block device drivers
must be converted to blkmq - of course. So to reach that goal, we will
not only need to evolve blkmq to allow scheduling (at least for single
queue devices), but we also need to convert *all* block device drivers
to blkmq. For sure this will take *years* and not months.

More important, when the transition to blkmq has been completed, then
there is absolutely no difference (from an effort point of view) in
removing the legacy blk layer - no matter if we have BFQ in there or
not.

I do understand if you have concerns from a maintenance point of view,
as I assume you would rather focus on evolving blkmq than care about
legacy blk code. So, would it help if Paolo volunteers to maintain the
BFQ code in the meantime?

2) While we work on evolving blkmq and convert block device drivers to
it, BFQ could, as a separate legacy scheduler, help *lots* of Linux
users to get a significantly improved experience. Should we really
prevent them from that? I think you block maintainer guys really need
to consider this fact.

3) While we work on scheduling in blkmq (at least for single queue
devices), it's of course important that we set high goals. Having BFQ
(and the other schedulers) in the legacy blk provides a good reference
for what we could aim for.
> We can keep having this discussion every few years, but I think we'd
> both prefer to make some actual progress here. It's perfectly fine to
> add a single queue interface for an IO scheduler for blk-mq, since we
> don't care too much about scalability there. And that won't take
> years, that should be a few weeks. Retrofitting BFQ on top of that
> should not be hard either. That can co-exist with a real multiqueue
> scheduler as well, something that's geared towards some fairness for
> faster devices.

That's really great news! I hope we get a possibility to meet and
discuss the plans for this at Kernel Summit/Linux Plumbers next week!

> --
> Jens Axboe

Kind regards
Ulf Hansson
Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
On 10/27/2016 11:32 AM, Ulf Hansson wrote:

[...]

>> I'm hesitant to add a new scheduler because it's very easy to add,
>> very difficult to get rid of. If we do add BFQ as a legacy scheduler
>> now, it'll take us years and years to get rid of it again. We should
>> be moving towards LESS moving parts in the legacy path, not more.
>
> Jens, I think you are wrong here and let me try to elaborate on why.
>
> 1) We already have legacy schedulers like CFQ, DEADLINE, etc - and
> most block device drivers are still using the legacy blk interface.

I don't think that's an accurate statement. In terms of coverage, most
drivers do support blk-mq. Anything SCSI, nvme, virtio-blk, SATA runs
on (or can run on) top of blk-mq.

> To be able to remove the legacy blk layer, all block device drivers
> must be converted to blkmq - of course.

That's a given.

> So to reach that goal, we will not only need to evolve blkmq to allow
> scheduling (at least for single queue devices), but we also need to
> convert *all* block device drivers to blkmq. For sure this will take
> *years* and not months.

Correct.

> More important, when the transition to blkmq has been completed, then
> there is absolutely no difference (from an effort point of view) in
> removing the legacy blk layer - no matter if we have BFQ in there or
> not.
>
> I do understand if you have concerns from a maintenance point of view,
> as I assume you would rather focus on evolving blkmq than care about
> legacy blk code. So, would it help if Paolo volunteers to maintain the
> BFQ code in the meantime?

We're obviously still maintaining the legacy IO path. But we don't want
to actively develop it, and we haven't, for a long time. And Paolo
maintaining it is a strict requirement for inclusion, legacy or blk-mq
aside. That would go for both. I'd never accept a major feature from an
individual or company if they weren't willing and capable of
maintaining it. Throwing submissions over the wall is not viable.
> 2) While we work on evolving blkmq and convert block device drivers
> to it, BFQ could, as a separate legacy scheduler, help *lots* of Linux
> users to get a significantly improved experience. Should we really
> prevent them from that? I think you block maintainer guys really need
> to consider this fact.

You still seem to be basing that assumption on the notion that we have
to convert tons of drivers for BFQ to make sense under the blk-mq
umbrella. That's not the case.

> 3) While we work on scheduling in blkmq (at least for single queue
> devices), it's of course important that we set high goals. Having BFQ
> (and the other schedulers) in the legacy blk provides a good reference
> for what we could aim for.

Sure, but you don't need BFQ to be included in the kernel for that.

>> We can keep having this discussion every few years, but I think we'd
>> both prefer to make some actual progress here. It's perfectly fine
>> to add a single queue interface for an IO scheduler for blk-mq, since
>> we don't care too much about scalability there. And that won't take
>> years, that should be a few weeks. Retrofitting BFQ on top of that
>> should not be hard either. That can co-exist with a real multiqueue
>> scheduler as well, something that's geared towards some fairness for
>> faster devices.
>
> That's really great news! I hope we get a possibility to meet and
> discuss the plans for this at Kernel Summit/Linux Plumbers next week!

I'll be there.

--
Jens Axboe
Re: [PATCH 02/12] blk-mq: Introduce blk_mq_hctx_stopped()
Looks fine,
Reviewed-by: Christoph Hellwig
Re: A question regarding "multiple SGL"
Hi Robert,

Hey Robert,

Christoph, please explain your use cases that aren't handled.

The one and only reason to set MSDBD to 1 is to make the code a lot
simpler given that there is no real use case for supporting more.

RDMA uses memory registrations to register large and possibly
discontiguous data regions for a single rkey, aka a single SGL
descriptor in NVMe terms. There would be two reasons to support
multiple SGL descriptors: a) to support a larger I/O size than
supported by a single MR, or b) to support a data region format not
mappable by a single MR.

iSER only supports a single rkey (or stag in IETF terminology) and has
been doing fine on a) and mostly fine on b). There are a few possible
data layouts not supported by the traditional IB/iWarp FR WRs, but the
limit is in fact exactly the same as imposed by the NVMe PRPs used for
PCIe NVMe devices, so the Linux block layer has support to not generate
them. Also, with modern Mellanox IB/RoCE hardware we can actually
register completely arbitrary SGLs. iSER supports using this
registration mode already with a trivial code addition, but for NVMe we
didn't have a pressing need yet.

Good explanation :)

The IO transfer size is a bit more pressing on some devices (e.g.
cxgb3/4) where the number of pages per MR can indeed be lower than a
reasonable transfer size (Steve can correct me if I'm wrong). However,
if there is a real demand for this we'll happily accept patches :)

Just a note: having this feature in place can bring unexpected behavior
depending on how we implement it.

- If we can use multiple MRs per IO (for multiple SGLs) we can either
prepare for the worst case and allocate enough MRs to satisfy the
various IO patterns. This will be much heavier in terms of resource
allocation and can limit the scalability of the host driver.

- Or we can implement a shared MR pool with a reasonable number of MRs.
In this case each IO can consume one or more MRs at the expense of
other IOs. In this case we may need to requeue the IO later when we
have enough available MRs to satisfy the IO. This can yield some
unexpected performance gaps for some workloads.

Cheers,
Sagi.
Re: [PATCH 06/12] blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()
On Wed, Oct 26, 2016 at 03:53:39PM -0700, Bart Van Assche wrote:
> Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
> are followed by kicking the requeue list. Hence add an argument to
> these two functions that allows to kick the requeue list. This was
> proposed by Christoph Hellwig.

Looks good,
Reviewed-by: Christoph Hellwig
Re: [PATCH 02/12] blk-mq: Introduce blk_mq_hctx_stopped()
Looks fine,
Reviewed-by: Sagi Grimberg
Re: [PATCH 09/12] SRP transport: Move queuecommand() wait code to SCSI core
On Wed, Oct 26, 2016 at 03:55:00PM -0700, Bart Van Assche wrote:
> Additionally, rename srp_wait_for_queuecommand() into
> scsi_wait_for_queuecommand() and add a comment about the
> queuecommand() call from scsi_send_eh_cmnd().
>
> Signed-off-by: Bart Van Assche
> Reviewed-by: Christoph Hellwig

Looks good,
Reviewed-by: Johannes Thumshirn
Re: [PATCH 07/12] dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code
On Wed, Oct 26 2016 at 6:54pm -0400, Bart Van Assche wrote:
> Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
> in the dm start and stop queue functions, only manipulate the latter
> flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
>
> Signed-off-by: Bart Van Assche
> Reviewed-by: Christoph Hellwig
> Cc: Mike Snitzer

Acked-by: Mike Snitzer
Re: [PATCH 10/12] SRP transport, scsi-mq: Wait for .queue_rq() if necessary
On Wed, Oct 26, 2016 at 03:55:34PM -0700, Bart Van Assche wrote:
> Ensure that if scsi-mq is enabled that scsi_wait_for_queuecommand()
> waits until ongoing shost->hostt->queuecommand() calls have finished.
>
> Signed-off-by: Bart Van Assche
> Reviewed-by: Christoph Hellwig

Looks good,
Reviewed-by: Johannes Thumshirn
Re: [PATCH 06/12] blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()
On Wed, Oct 26, 2016 at 03:53:39PM -0700, Bart Van Assche wrote:
> Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
> are followed by kicking the requeue list. Hence add an argument to
> these two functions that allows to kick the requeue list. This was
> proposed by Christoph Hellwig.
>
> Signed-off-by: Bart Van Assche
> Cc: Christoph Hellwig

Looks good,
Reviewed-by: Johannes Thumshirn
Re: [PATCH 10/12] SRP transport, scsi-mq: Wait for .queue_rq() if necessary
Thanks for moving it,
Reviewed-by: Sagi Grimberg
Re: [PATCH 04/12] blk-mq: Move more code into blk_mq_direct_issue_request()
Looks good,
Reviewed-by: Sagi Grimberg
Re: [PATCH 09/12] SRP transport: Move queuecommand() wait code to SCSI core
Reviewed-by: Sagi Grimberg
Re: [PATCH 04/12] blk-mq: Move more code into blk_mq_direct_issue_request()
On Wed, Oct 26, 2016 at 03:52:35PM -0700, Bart Van Assche wrote:
> Move the "hctx stopped" test and the insert request calls into
> blk_mq_direct_issue_request(). Rename that function into
> blk_mq_try_issue_directly() to reflect its new semantics. Pass
> the hctx pointer to that function instead of looking it up a
> second time. These changes avoid that code has to be duplicated
> in the next patch.
>
> Signed-off-by: Bart Van Assche
> Cc: Christoph Hellwig

Looks good,
Reviewed-by: Johannes Thumshirn
Re: [PATCH 01/12] blk-mq: Do not invoke .queue_rq() for a stopped queue
On Wed, Oct 26, 2016 at 03:50:44PM -0700, Bart Van Assche wrote:
> The meaning of the BLK_MQ_S_STOPPED flag is "do not call
> .queue_rq()". Hence modify blk_mq_make_request() such that requests
> are queued instead of issued if a queue has been stopped.
>
> Reported-by: Ming Lei
> Signed-off-by: Bart Van Assche
> Reviewed-by: Christoph Hellwig

Looks good,
Reviewed-by: Johannes Thumshirn
Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
On Wed 26-10-16 10:12:38, Jens Axboe wrote:
> On 10/26/2016 10:04 AM, Paolo Valente wrote:
> >
> >> On 26 Oct 2016, at 17:32, Jens Axboe wrote:
> >>
> >> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
> >>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
> >>>> The question to ask first is whether to actually have pluggable
> >>>> schedulers on blk-mq at all, or just have one that is meant to
> >>>> do the right thing in every case (and possibly can be bypassed
> >>>> completely).
> >>>
> >>> That would be my preference. Have a BFQ-variant for blk-mq as an
> >>> option (default to off unless opted in by the driver or user), and
> >>> no other scheduler for blk-mq. Don't bother with bfq for non
> >>> blk-mq. It's not like there is any advantage in the legacy-request
> >>> device even for slow devices, except for the option of having I/O
> >>> scheduling.
> >>
> >> It's the only right way forward. blk-mq might not offer any substantial
> >> advantages to rotating storage, but with scheduling, it won't offer a
> >> downside either. And it'll take us towards the real goal, which is to
> >> have just one IO path.
> >
> > ok
> >
> >> Adding a new scheduler for the legacy IO path
> >> makes no sense.
> >
> > I would fully agree if effective and stable I/O scheduling would be
> > available in blk-mq in one or two months. But I guess that it will
> > take at least one year optimistically, given the current status of the
> > needed infrastructure, and given the great difficulties of doing
> > effective scheduling at the high parallelism and extreme target speeds
> > of blk-mq. Of course, this holds true unless little clever scheduling
> > is performed.
> >
> > So, what's the point in forcing a lot of users wait another year or
> > more, for a solution that has yet to be even defined, while they could
> > enjoy a much better system, and then switch an even better system when
> > scheduling is ready in blk-mq too?
>
> That same argument could have been made 2 years ago. Saying no to a new
> scheduler for the legacy framework goes back roughly that long. We could
> have had BFQ for mq NOW, if we didn't keep coming back to this very
> point.
>
> I'm hesitant to add a new scheduler because it's very easy to add, very
> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> it'll take us years and years to get rid of it again. We should be
> moving towards LESS moving parts in the legacy path, not more.
>
> We can keep having this discussion every few years, but I think we'd
> both prefer to make some actual progress here. It's perfectly fine to
> add an interface for a single queue interface for an IO scheduler for
> blk-mq, since we don't care too much about scalability there. And that
> won't take years, that should be a few weeks. Retrofitting BFQ on top of
> that should not be hard either. That can co-exist with a real multiqueue
> scheduler as well, something that's geared towards some fairness for
> faster devices.

OK, so some solution like having a variant of blk_sq_make_request() that
will consume requests, do IO scheduling decisions on them, and feed them
into the HW queue as it sees fit would be acceptable? That will provide
the IO scheduler the global view that it needs for complex scheduling
decisions, so it should indeed be relatively easy to port BFQ to work
like that.

								Honza
--
Jan Kara
SUSE Labs, CR