RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq
> -Original Message- > From: Ming Lei [mailto:ming@redhat.com] > Sent: Tuesday, February 13, 2018 6:11 AM > To: Kashyap Desai > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph > Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar Sandoval; > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; Peter > Rivera; Paolo Bonzini; Laurence Oberman > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce > force_blk_mq > > Hi Kashyap, > > On Tue, Feb 13, 2018 at 12:05:14AM +0530, Kashyap Desai wrote: > > > -Original Message- > > > From: Ming Lei [mailto:ming@redhat.com] > > > Sent: Sunday, February 11, 2018 11:01 AM > > > To: Kashyap Desai > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; > > > Christoph Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun > > > Easi; Omar > > Sandoval; > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; > > Peter > > > Rivera; Paolo Bonzini; Laurence Oberman > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > introduce force_blk_mq > > > > > > On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote: > > > > Hi Kashyap, > > > > > > > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote: > > > > > > -Original Message- > > > > > > From: Ming Lei [mailto:ming@redhat.com] > > > > > > Sent: Friday, February 9, 2018 11:01 AM > > > > > > To: Kashyap Desai > > > > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; > > > > > > Christoph Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; > > > > > > Arun Easi; Omar > > > > > Sandoval; > > > > > > Martin K . 
Petersen; James Bottomley; Christoph Hellwig; Don > > > > > > Brace; > > > > > Peter > > > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & > > > > > > introduce force_blk_mq > > > > > > > > > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote: > > > > > > > > -Original Message- > > > > > > > > From: Ming Lei [mailto:ming@redhat.com] > > > > > > > > Sent: Thursday, February 8, 2018 10:23 PM > > > > > > > > To: Hannes Reinecke > > > > > > > > Cc: Kashyap Desai; Jens Axboe; > > > > > > > > linux-block@vger.kernel.org; Christoph Hellwig; Mike > > > > > > > > Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar > > > > > > > Sandoval; > > > > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; > > > > > > > > Don Brace; > > > > > > > Peter > > > > > > > > Rivera; Paolo Bonzini; Laurence Oberman > > > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global > > > > > > > > tags & introduce force_blk_mq > > > > > > > > > > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke > > wrote: > > > > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote: > > > > > > > > > >> -Original Message- > > > > > > > > > >> From: Ming Lei [mailto:ming@redhat.com] > > > > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM > > > > > > > > > >> To: Hannes Reinecke > > > > > > > > > >> Cc: Kashyap Desai; Jens Axboe; > > > > > > > > > >> linux-block@vger.kernel.org; Christoph Hellwig; Mike > > > > > > > > > >> Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar > > > > > > > > > > Sandoval; > > > > > > > > > >> Martin K . 
Petersen; James Bottomley; Christoph > > > > > > > > > >> Hellwig; Don Brace; > > > > > > > > > > Peter > > > > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman > > > > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support > > > > > > > > > >> global tags & introduce force_blk_mq > > > > > > > > > >> > > > > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes > > > > > > > > > >> Reinecke > > > > > wrote: > > > > > > > > > >>> Hi all, > > > > > > > > > >>> > > > > > > > > > >>> [ .. ] > > > > > > > > > > > > > > > > > > > > Could you share us your patch for enabling > > > > > > > > > > global_tags/MQ on > > > > > > > > > megaraid_sas > > > > > > > > > > so that I can reproduce your test? > > > > > > > > > > > > > > > > > > > >> See below perf top data. "bt_iter" is consuming 4 > > > > > > > > > >> times more > > > > > > > CPU. > > > > > > > > > > > > > > > > > > > > Could you share us what the IOPS/CPU utilization > > > > > > > > > > effect is after > > > > > > > > > applying the > > > > > > > > > > patch V2? And your test script? > > > > > > > > > Regarding CPU utilization, I need to test one more > > time. > > > > > > > > > Currently system is in used. > > > > > > > > > > > > > > > > > > I run below fio test on total 24 SSDs expander > > attached. > > > > > > > > > > > > > > > > > > numactl -N 1 fio jbod.fio --rw=randread > > > > > > > > > --iodepth=64 --bs=4k --ioengine=libaio > > > > > > > > > --rw=randread > > > > > > > > > > > > > > > > > > Performance dropped from
[PATCH RESEND] blk-throttle: avoid double counted
If a bio is split after it has been counted in stat_bytes and stat_ios
in blkcg_bio_issue_check(), the bio can be resubmitted and enter the
block throttle layer again, so part of the bio is counted twice.

The flag BIO_THROTTLED cannot be used to fix this problem, considering
the following two cases.
1. The bio is throttled and resubmitted to the block throttle layer. It
has the flag BIO_THROTTLED and should be counted.
2. The bio can be dispatched and has been counted, then it is split
and resubmitted to the block throttle layer. It also has the flag
BIO_THROTTLED but should not be counted again.

So we add another flag, BIO_THROTL_COUNTED, to avoid double counting.

Signed-off-by: Jiufei Xue
---
 block/bio.c                | 2 ++
 include/linux/bio.h        | 6 ++++--
 include/linux/blk-cgroup.h | 3 ++-
 include/linux/blk_types.h  | 1 +
 4 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 9ef6cf3..4594c2e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -601,6 +601,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio_set_flag(bio, BIO_CLONED);
 	if (bio_flagged(bio_src, BIO_THROTTLED))
 		bio_set_flag(bio, BIO_THROTTLED);
+	if (bio_flagged(bio_src, BIO_THROTL_COUNTED))
+		bio_set_flag(bio, BIO_THROTL_COUNTED);
 	bio->bi_opf = bio_src->bi_opf;
 	bio->bi_write_hint = bio_src->bi_write_hint;
 	bio->bi_iter = bio_src->bi_iter;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 23d29b3..aefc24c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -492,8 +492,10 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,

 #define bio_set_dev(bio, bdev) 			\
 do {						\
-	if ((bio)->bi_disk != (bdev)->bd_disk)	\
-		bio_clear_flag(bio, BIO_THROTTLED);\
+	if ((bio)->bi_disk != (bdev)->bd_disk) {	\
+		bio_clear_flag(bio, BIO_THROTTLED);	\
+		bio_clear_flag(bio, BIO_THROTL_COUNTED);\
+	}					\
 	(bio)->bi_disk = (bdev)->bd_disk;	\
 	(bio)->bi_partno = (bdev)->bd_partno;	\
 } while (0)
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index e9825ff..c151bc9 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -701,11 +701,12 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,

 	throtl = blk_throtl_bio(q, blkg, bio);

-	if (!throtl) {
+	if (!throtl && !bio_flagged(bio, BIO_THROTL_COUNTED)) {
 		blkg = blkg ?: q->root_blkg;
 		blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
 				bio->bi_iter.bi_size);
 		blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
+		bio_set_flag(bio, BIO_THROTL_COUNTED);
 	}

 	rcu_read_unlock();
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 9e7d8bd..7a3890a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -135,6 +135,7 @@ struct bio {
 				 * throttling rules. Don't do it again. */
 #define BIO_TRACE_COMPLETION 10	/* bio_endio() should trace the final completion
 				 * of this bio. */
+#define BIO_THROTL_COUNTED 11	/* This bio has already counted to rwstat. */
 /* See BVEC_POOL_OFFSET below before adding new flags */

 /*
--
1.8.3.1
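The idea behind the extra flag can be tried outside the kernel. The following is only a minimal userspace sketch, assuming toy stand-ins (toy_bio, toy_stats, toy_account, toy_split and the TOY_* bits are all hypothetical names, not kernel API): a bio is accounted once, the flag is inherited when the bio is split, and a second pass through the accounting path is skipped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for bio flags; the bit values are illustrative only. */
enum { TOY_THROTTLED = 1u << 0, TOY_COUNTED = 1u << 1 };

struct toy_bio { uint32_t flags; uint32_t size; };
struct toy_stats { uint64_t bytes; uint64_t ios; };

/* Account a bio once; later passes see TOY_COUNTED and skip it. */
static bool toy_account(struct toy_stats *st, struct toy_bio *bio)
{
	if (bio->flags & TOY_COUNTED)
		return false;
	st->bytes += bio->size;
	st->ios += 1;
	bio->flags |= TOY_COUNTED;
	return true;
}

/* Split off the first 'front' bytes; the new piece inherits the flags,
 * mirroring what the patch does in __bio_clone_fast(). */
static struct toy_bio toy_split(struct toy_bio *bio, uint32_t front)
{
	struct toy_bio head = { .flags = bio->flags, .size = front };

	bio->size -= front;
	return head;
}
```

With this shape, resubmitting either half of a split bio leaves the byte and I/O counters unchanged, which is the behaviour the patch description asks for.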
Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices
On 2/13/2018 11:30 AM, Or Gerlitz wrote:

On Tue, Feb 6, 2018 at 11:45 AM, Tal Gilboa wrote:

On 2/6/2018 11:34 AM, Sagi Grimberg wrote:

Hi Tal,

I think Tal has idea/s on how the existing library can be changed to
support more modes/models

What I was thinking is allowing DIM algorithm to disregard data which is
0. Currently if bytes == 0 we return "SAME" immediately. We can change it
to simply move to the packets check (which may be renamed to
"completions"). This way you could use DIM while only optimizing to (P1)
high packet rate and (P2) low interrupt rate.

That was exactly where I started from. But unfortunately it did not work
well :(

From my experiments, the moderation was all over the place failing to
converge. At least the workloads that I've tested with, it was more
successful to have a stricter step policy and pulling towards latency if
we are consistently catching single completion per event.

I'm not an expert here at all, but at this point, based on my attempts so
far, I'm not convinced the current net_dim scheme could work.

I do believe we can make it work. I see your addition of the cpe part to
stats compare. Might not be a bad idea for networking devices. Overall, it
seems to me like this would be a private case of the general DIM
optimization, since it doesn't need to account for aggregation, for
instance, which breaks the "more packets == more data" ratio.

Did U2 came to agreement/lead on how to re-use the upstream library for
the matter Sagi is pushing for?

I don't think so (yet).

Sagi, I would like to avoid having 2 "net DIM"s if possible. You mentioned
you tried implementing over net DIM lib and it wasn't working well. Can
you share this code with me?
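The "skip the bytes check when bytes == 0" idea from the thread can be sketched as a small comparison function. This is not the net_dim code: the struct, the field names, the verdict enum and the 10% significance threshold are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical sample; field names are illustrative, not the net_dim structs. */
struct dim_sample {
	uint32_t bytes_per_ms;   /* 0 for devices that do not report bytes */
	uint32_t comps_per_ms;   /* packets or completions per ms */
	uint32_t events_per_ms;  /* interrupts per ms */
};

enum dim_verdict { DIM_WORSE = -1, DIM_SAME = 0, DIM_BETTER = 1 };

/* Assumed threshold: a change of more than 10% of the previous value. */
static int dim_significant(uint32_t cur, uint32_t prev)
{
	uint32_t diff = cur > prev ? cur - prev : prev - cur;

	return (uint64_t)diff * 100 > (uint64_t)prev * 10;
}

static enum dim_verdict dim_compare(const struct dim_sample *cur,
				    const struct dim_sample *prev)
{
	/* Byte rate is only considered when at least one sample carries it. */
	if (cur->bytes_per_ms || prev->bytes_per_ms) {
		if (dim_significant(cur->bytes_per_ms, prev->bytes_per_ms))
			return cur->bytes_per_ms > prev->bytes_per_ms ?
				DIM_BETTER : DIM_WORSE;
	}

	/* (P1) a higher completion rate is better. */
	if (dim_significant(cur->comps_per_ms, prev->comps_per_ms))
		return cur->comps_per_ms > prev->comps_per_ms ?
			DIM_BETTER : DIM_WORSE;

	/* (P2) a lower interrupt rate is better. */
	if (dim_significant(cur->events_per_ms, prev->events_per_ms))
		return cur->events_per_ms < prev->events_per_ms ?
			DIM_BETTER : DIM_WORSE;

	return DIM_SAME;
}
```

Whether a comparison like this converges for storage workloads is exactly the open question in the discussion; the sketch only shows the control flow being proposed.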
Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling
Hello, Bart.

Sorry about the delay.

On Thu, Feb 08, 2018 at 04:31:43PM +, Bart Van Assche wrote:
> The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The
> instruction at that address tries to dereference scsi_cmnd.device (%rax). The
> register dump shows that that pointer has the value NULL. The only function I
> know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The only
> caller of that function in the SCSI core is scsi_initialize_rq(). That function
> has two callers, namely scsi_init_command() and blk_get_request(). However,
> the scsi_cmnd.device pointer is not cleared when a request finishes. This is
> why I think that the above crash report indicates that scsi_times_out() was
> called for a request that was being reinitialized and not by device
> hotplugging.

Can you please give the following patch a shot?  While the timeout path
synchronizes against the completion path (and the following re-init)
when taking back control of a timed-out request, it wasn't doing that
when giving it back, so the timer registration could race against
completion and re-issue.

I'm still not quite sure how that can lead to the oops tho.  Anyways,
we need something like this one way or the other.

This isn't the final patch.  We should add batching-up of rcu
synchronize calls similar to the abort path.

Thanks.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..b66aec3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -816,7 +816,8 @@ struct blk_mq_timeout_data {
 	unsigned int nr_expired;
 };

-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req,
+				bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -836,8 +837,12 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		 * ->aborted_gstate is set, this may lead to ignored
 		 * completions and further spurious timeouts.
 		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
+		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+			synchronize_rcu();
+		else
+			synchronize_srcu(hctx->srcu);
+		blk_mq_rq_update_aborted_gstate(req, 0);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -893,7 +898,7 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 	 */
 	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
 	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
-		blk_mq_rq_timed_out(rq, reserved);
+		blk_mq_rq_timed_out(hctx, rq, reserved);
 }

 static void blk_mq_timeout_work(struct work_struct *work)
Re: v4.16-rc1 + dm-mpath + BFQ
On Tue, 2018-02-13 at 19:38 +0100, Paolo Valente wrote:
> as a first attempt, I've followed your steps, but got:
> Error: could not find sg_reset

Please install the sg3_utils package. Every Linux distro I know of supports
that package. And in case you would like to install it from source, the
source code of that package is available from
http://sg.danny.cz/sg/sg3_utils.html.

> For ib_srp-backport, I get a lot of warnings like the following one,
> at "make install" (preceded by corresponding warnings at the end of
> the compilation):
> depmod: WARNING: /lib/modules/4.16.0-rc1+/extra/ib_srp.ko needs unknown
> symbol rdma_resolve_addr
>
> Unfortunately, it gets worse while executing "make scst srpt":

Please neither install the ib_srp-backport driver nor SCST. These drivers
have not yet been tested against kernel v4.16-rc1. I provided you a kernel
tree in which both the SRP initiator and target drivers support RoCE such
that you don't need to install these out-of-tree drivers.

I think all that you need from the srp-test README document are the
instructions to configure /etc/multipath.conf and the instructions for
installing the required packages. From that README document:

Install the following software packages if these have not yet been installed:
fio, gcc-c++, make, multipath-tools or device-mapper-multipath, sg3_utils,
srptools, e2fsprogs and xfsprogs.

Thanks,

Bart.
Re: v4.16-rc1 + dm-mpath + BFQ
> On 12 Feb 2018, at 17:31, Bart Van Assche wrote:
>
> On 02/11/18 23:35, Paolo Valente wrote:
>> Also this smells a little bit like some spurious elevator call.
>> Unfortunately I have no clue on the cause. To go on, I need at least
>> to reproduce it. In this respect: Bart, could you please tell me how
>> to setup the offending configuration, and to cause the failure?
>> Possibly with just one, or at most two PCs. I don't have fancier hw
>> at the moment.
>
> Hello Paolo,
>
> Although I expect that it is possible to reproduce this with an unmodified
> v4.16-rc1 kernel, this is how I ran into this issue:
> * Clone the for-next branch of https://github.com/bvanassche/linux.
> * Build and install that kernel in a virtual machine.
> * Clone https://github.com/bvanassche/srp-test.
> * Run the following command:
>   srp-test/run_tests -c -d -r 10 -t 02-mq -e bfq
>

Hi Bart,
as a first attempt, I've followed your steps, but got:
Error: could not find sg_reset
which is to be expected, given the dependencies that your steps take for
granted.

So, I have followed the instructions in the srp-test README for the
case "Running the Tests on an Ethernet Setup", directly on a 4.16-rc1.

For ib_srp-backport, I get a lot of warnings like the following one,
at "make install" (preceded by corresponding warnings at the end of
the compilation):
depmod: WARNING: /lib/modules/4.16.0-rc1+/extra/ib_srp.ko needs unknown symbol rdma_resolve_addr

Unfortunately, it gets worse while executing "make scst srpt":
  CC [M]  /home/paolo/scst/srpt/src/ib_srpt.o
In file included from /home/paolo/scst/srpt/src/ib_srpt.c:62:0:
/home/paolo/scst/srpt/src/ib_srpt.h:481:8: error: redefinition of ‘struct srp_login_req_rdma’
 struct srp_login_req_rdma {
        ^~
In file included from /home/paolo/scst/srpt/src/ib_srpt.h:44:0,
                 from /home/paolo/scst/srpt/src/ib_srpt.c:62:
/mnt/linux-dev/linux/include/scsi/srp.h:139:8: note: originally defined here
 struct srp_login_req_rdma {
        ^~

Could you please give me some help, so as to not get lost among these
issues?

Thanks,
Paolo

> Thanks,
>
> Bart.
Re: [PATCH] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into
On Tue, 2018-02-13 at 17:54 +0900, Sergey Senozhatsky wrote:
> On (02/12/18 11:05), Bart Van Assche wrote:
> [..]
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index ac4740cf74be..cf17626604c2 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1026,14 +1026,25 @@ static inline int blk_rq_cur_bytes(const struct request *rq)
> >
> >  extern unsigned int blk_rq_err_bytes(const struct request *rq);
> >
> > +/*
> > + * Variables of type sector_t represent an offset or size that is a multiple of
> > + * 2**9 bytes. Hence these two constants.
> > + */
> > +#ifndef SECTOR_SHIFT
> > +enum { SECTOR_SHIFT = 9 };
> > +#endif
> > +#ifndef SECTOR_SIZE
> > +enum { SECTOR_SIZE = 512 };
> > +#endif
>
> Shouldn't SECTOR_SIZE depend on SECTOR_SHIFT?
>
> 	1 << SECTOR_SHIFT

Not sure if that change will really make a difference. Anyway, I will make
that change.

Bart.
Re: [PATCH] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into
On Tue, 2018-02-13 at 09:43 +0100, Johannes Thumshirn wrote:
> On Mon, 2018-02-12 at 11:05 -0800, Bart Van Assche wrote:
> > +/*
> > + * Variables of type sector_t represent an offset or size that is a multiple of
> > + * 2**9 bytes. Hence these two constants.
> > + */
> > +#ifndef SECTOR_SHIFT
> > +enum { SECTOR_SHIFT = 9 };
> > +#endif
> > +#ifndef SECTOR_SIZE
> > +enum { SECTOR_SIZE = 512 };
> > +#endif
>
> Can you please make a #define out of these enums? I know gdb can cope
> better with enums than defines but IIRC adding -ggdb3 to the CFLAGS
> solves this issue.
>
> Apart from that:
> Reviewed-by: Johannes Thumshirn

OK, I will change the enums into defines. Thanks for the review.

Bart.
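For illustration, the two review requests (use #define, and derive the size from the shift) might combine as follows. This is only a guess at the shape of a revised patch, not the posted follow-up; the two conversion helpers are hypothetical and are not part of the patch.

```c
#include <stdint.h>

/* Assumed shape after both review comments: macros, size derived from shift. */
#ifndef SECTOR_SHIFT
#define SECTOR_SHIFT 9
#endif
#ifndef SECTOR_SIZE
#define SECTOR_SIZE (1 << SECTOR_SHIFT)
#endif

/* Hypothetical helpers showing how the two constants relate. */
static inline uint64_t bytes_to_sectors(uint64_t bytes)
{
	return bytes >> SECTOR_SHIFT;
}

static inline uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors << SECTOR_SHIFT;
}
```

Deriving SECTOR_SIZE from SECTOR_SHIFT keeps the two values from drifting apart if one of them is ever edited.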
Re: [PATCH v3] blk: optimization for classic polling
On 2/13/18 8:48 AM, Nitesh Shetty wrote:
> This removes the dependency on interrupts to wake up task. Set task
> state as TASK_RUNNING, if need_resched() returns true,
> while polling for IO completion.
> Earlier, polling task used to sleep, relying on interrupt to wake it up.
> This made some IO take very long when interrupt-coalescing is enabled in
> NVMe.

Thanks, applied.

--
Jens Axboe
[PATCH v3] blk: optimization for classic polling
This removes the dependency on interrupts to wake up task. Set task
state as TASK_RUNNING, if need_resched() returns true,
while polling for IO completion.
Earlier, polling task used to sleep, relying on interrupt to wake it up.
This made some IO take very long when interrupt-coalescing is enabled in
NVMe.

Reference:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html

Changes since v2->v3:
-using __set_current_state() instead of set_current_state()

Changes since v1->v2:
-setting task state once in blk_poll, instead of multiple callers.

Signed-off-by: Nitesh Shetty
---
 block/blk-mq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..3574927 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3164,6 +3164,7 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
 		cpu_relax();
 	}

+	__set_current_state(TASK_RUNNING);
 	return false;
 }

--
2.7.4
Re: [PATCH] blk-throttle: avoid multiple counting for same bio
On Tue, Feb 13, 2018 at 02:45:50PM +0800, Chengguang Xu wrote:
> In current throttling/upper limit policy of blkio cgroup
> blkio.throttle.io_service_bytes does not exactly represent
> the number of bytes issued to the disk by the group, sometimes
> this number could be counted multiple times of real bytes.
> This fix introduces BIO_COUNTED flag to avoid multiple counting
> for same bio.
>
> Signed-off-by: Chengguang Xu

We had a series of fixes / changes for this problem during the last
cycle.  Can you please see whether the current linus master has the
same problem.

Thanks.

--
tejun
Re: [PATCH v2 RESENT] blk: optimization for classic polling
On 2/13/18 11:56 AM, Nitesh Shetty wrote:
> This removes the dependency on interrupts to wake up task. Set task
> state as TASK_RUNNING, if need_resched() returns true,
> while polling for IO completion.
> Earlier, polling task used to sleep, relying on interrupt to wake it up.
> This made some IO take very long when interrupt-coalescing is enabled in
> NVMe.

__set_current_state() should suffice here.

--
Jens Axboe
[PATCH 3/8] lightnvm: add support for 2.0 address format
Add support for the 2.0 address format. Also, align the address bits for
1.2 and 2.0.

Signed-off-by: Javier González
---
 include/linux/lightnvm.h | 45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 6a567bd19b73..e035ae4c9acc 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -16,12 +16,21 @@ enum {
 	NVM_IOTYPE_GC = 1,
 };

-#define NVM_BLK_BITS (16)
-#define NVM_PG_BITS (16)
-#define NVM_SEC_BITS (8)
-#define NVM_PL_BITS (8)
-#define NVM_LUN_BITS (8)
-#define NVM_CH_BITS (7)
+/* 1.2 format */
+#define NVM_12_CH_BITS (8)
+#define NVM_12_LUN_BITS (8)
+#define NVM_12_BLK_BITS (16)
+#define NVM_12_PG_BITS (16)
+#define NVM_12_PL_BITS (4)
+#define NVM_12_SEC_BITS (4)
+#define NVM_12_RESERVED (8)
+
+/* 2.0 format */
+#define NVM_20_CH_BITS (8)
+#define NVM_20_LUN_BITS (8)
+#define NVM_20_CHK_BITS (16)
+#define NVM_20_SEC_BITS (24)
+#define NVM_20_RESERVED (8)

 enum {
 	NVM_OCSSD_SPEC_12 = 12,
@@ -31,16 +40,26 @@ enum {
 struct ppa_addr {
 	/* Generic structure for all addresses */
 	union {
+		/* 1.2 device format */
 		struct {
-			u64 blk		: NVM_BLK_BITS;
-			u64 pg		: NVM_PG_BITS;
-			u64 sec		: NVM_SEC_BITS;
-			u64 pl		: NVM_PL_BITS;
-			u64 lun		: NVM_LUN_BITS;
-			u64 ch		: NVM_CH_BITS;
-			u64 reserved	: 1;
+			u64 ch		: NVM_12_CH_BITS;
+			u64 lun		: NVM_12_LUN_BITS;
+			u64 blk		: NVM_12_BLK_BITS;
+			u64 pg		: NVM_12_PG_BITS;
+			u64 pl		: NVM_12_PL_BITS;
+			u64 sec		: NVM_12_SEC_BITS;
+			u64 reserved	: NVM_12_RESERVED;
 		} g;

+		/* 2.0 device format */
+		struct {
+			u64 ch		: NVM_20_CH_BITS;
+			u64 lun		: NVM_20_LUN_BITS;
+			u64 chk		: NVM_20_CHK_BITS;
+			u64 sec		: NVM_20_SEC_BITS;
+			u64 reserved	: NVM_20_RESERVED;
+		} m;
+
 		struct {
 			u64 line	: 63;
 			u64 is_cached	: 1;
--
2.7.4
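A quick way to sanity-check the new widths is to rebuild the union in userspace. The sketch below copies the bit widths from the patch but is otherwise an assumption: toy_ppa_addr is a hypothetical name, uint64_t bit-fields rely on compiler support (GCC and Clang accept them), and the exact bit placement is ABI-dependent, so only sizes and field overlap are checked.

```c
#include <stdint.h>

/* Field widths copied from the patch. */
#define NVM_12_CH_BITS   8
#define NVM_12_LUN_BITS  8
#define NVM_12_BLK_BITS  16
#define NVM_12_PG_BITS   16
#define NVM_12_PL_BITS   4
#define NVM_12_SEC_BITS  4
#define NVM_12_RESERVED  8

#define NVM_20_CH_BITS   8
#define NVM_20_LUN_BITS  8
#define NVM_20_CHK_BITS  16
#define NVM_20_SEC_BITS  24
#define NVM_20_RESERVED  8

/* Both formats are meant to fill exactly one 64-bit word. */
_Static_assert(NVM_12_CH_BITS + NVM_12_LUN_BITS + NVM_12_BLK_BITS +
	       NVM_12_PG_BITS + NVM_12_PL_BITS + NVM_12_SEC_BITS +
	       NVM_12_RESERVED == 64, "1.2 format must fill 64 bits");
_Static_assert(NVM_20_CH_BITS + NVM_20_LUN_BITS + NVM_20_CHK_BITS +
	       NVM_20_SEC_BITS + NVM_20_RESERVED == 64,
	       "2.0 format must fill 64 bits");

/* Hypothetical userspace copy of the two device formats. */
struct toy_ppa_addr {
	union {
		struct {
			uint64_t ch       : NVM_12_CH_BITS;
			uint64_t lun      : NVM_12_LUN_BITS;
			uint64_t blk      : NVM_12_BLK_BITS;
			uint64_t pg       : NVM_12_PG_BITS;
			uint64_t pl       : NVM_12_PL_BITS;
			uint64_t sec      : NVM_12_SEC_BITS;
			uint64_t reserved : NVM_12_RESERVED;
		} g;
		struct {
			uint64_t ch       : NVM_20_CH_BITS;
			uint64_t lun      : NVM_20_LUN_BITS;
			uint64_t chk      : NVM_20_CHK_BITS;
			uint64_t sec      : NVM_20_SEC_BITS;
			uint64_t reserved : NVM_20_RESERVED;
		} m;
		uint64_t ppa;
	};
};
```

Because ch, lun and blk/chk have the same widths and order in both formats, a value written through the 1.2 view reads back unchanged through the 2.0 view for those fields, which appears to be what the commit message means by aligning the address bits.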
[PATCH 1/8] lightnvm: exposed generic geometry to targets
With the inclusion of 2.0 support, we need a generic geometry that
describes the OCSSD independently of the specification that it
implements. Otherwise, geometry specific code is required, which
complicates targets and makes maintenance much more difficult.

This patch refactors the identify path and populates a generic geometry
that is then given to the targets on creation. Since the 2.0 geometry is
much more abstract than 1.2, the generic geometry resembles 2.0, but it
is not identical, as it needs to understand 1.2 abstractions too.

Signed-off-by: Javier González
---
 drivers/lightnvm/core.c          | 143 ++-
 drivers/lightnvm/pblk-core.c     |  16 +-
 drivers/lightnvm/pblk-gc.c       |   2 +-
 drivers/lightnvm/pblk-init.c     | 149 ---
 drivers/lightnvm/pblk-read.c     |   2 +-
 drivers/lightnvm/pblk-recovery.c |  14 +-
 drivers/lightnvm/pblk-rl.c       |   2 +-
 drivers/lightnvm/pblk-sysfs.c    |  39 ++--
 drivers/lightnvm/pblk-write.c    |   2 +-
 drivers/lightnvm/pblk.h          | 105 +--
 drivers/nvme/host/lightnvm.c     | 379 ---
 include/linux/lightnvm.h         | 220 +--
 12 files changed, 586 insertions(+), 487 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 9b1255b3e05e..80492fa6ee76 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -111,6 +111,7 @@ static void nvm_release_luns_err(struct nvm_dev *dev, int lun_begin,
 static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
 {
 	struct nvm_dev *dev = tgt_dev->parent;
+	struct nvm_dev_geo *dev_geo = &dev->dev_geo;
 	struct nvm_dev_map *dev_map = tgt_dev->map;
 	int i, j;

@@ -122,7 +123,7 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
 		if (clear) {
 			for (j = 0; j < ch_map->nr_luns; j++) {
 				int lun = j + lun_offs[j];
-				int lunid = (ch * dev->geo.nr_luns) + lun;
+				int lunid = (ch * dev_geo->num_lun) + lun;

 				WARN_ON(!test_and_clear_bit(lunid,
 							dev->lun_map));
@@ -143,19 +144,20 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
 					      u16 lun_begin, u16 lun_end,
 					      u16 op)
 {
+	struct nvm_dev_geo *dev_geo = &dev->dev_geo;
 	struct nvm_tgt_dev *tgt_dev = NULL;
 	struct nvm_dev_map *dev_rmap = dev->rmap;
 	struct nvm_dev_map *dev_map;
 	struct ppa_addr *luns;
 	int nr_luns = lun_end - lun_begin + 1;
 	int luns_left = nr_luns;
-	int nr_chnls = nr_luns / dev->geo.nr_luns;
-	int nr_chnls_mod = nr_luns % dev->geo.nr_luns;
-	int bch = lun_begin / dev->geo.nr_luns;
-	int blun = lun_begin % dev->geo.nr_luns;
+	int nr_chnls = nr_luns / dev_geo->num_lun;
+	int nr_chnls_mod = nr_luns % dev_geo->num_lun;
+	int bch = lun_begin / dev_geo->num_lun;
+	int blun = lun_begin % dev_geo->num_lun;
 	int lunid = 0;
 	int lun_balanced = 1;
-	int prev_nr_luns;
+	int sec_per_lun, prev_nr_luns;
 	int i, j;

 	nr_chnls = (nr_chnls_mod == 0) ? nr_chnls : nr_chnls + 1;
@@ -173,15 +175,15 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
 	if (!luns)
 		goto err_luns;

-	prev_nr_luns = (luns_left > dev->geo.nr_luns) ?
-					dev->geo.nr_luns : luns_left;
+	prev_nr_luns = (luns_left > dev_geo->num_lun) ?
+					dev_geo->num_lun : luns_left;
 	for (i = 0; i < nr_chnls; i++) {
 		struct nvm_ch_map *ch_rmap = &dev_rmap->chnls[i + bch];
 		int *lun_roffs = ch_rmap->lun_offs;
 		struct nvm_ch_map *ch_map = &dev_map->chnls[i];
 		int *lun_offs;
-		int luns_in_chnl = (luns_left > dev->geo.nr_luns) ?
-					dev->geo.nr_luns : luns_left;
+		int luns_in_chnl = (luns_left > dev_geo->num_lun) ?
+					dev_geo->num_lun : luns_left;

 		if (lun_balanced && prev_nr_luns != luns_in_chnl)
 			lun_balanced = 0;
@@ -215,18 +217,23 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
 	if (!tgt_dev)
 		goto err_ch;

-	memcpy(&tgt_dev->geo, &dev->geo, sizeof(struct nvm_geo));
 	/* Target device only owns a portion of the physical device */
-	tgt_dev->geo.nr_chnls = nr_chnls;
+	tgt_dev->geo.num_ch = nr_chnls;
+	tgt_dev->geo.num_lun = (lun_balanced) ? prev_nr_luns : -1;
 	tgt_dev->geo.all_luns = nr_luns;
-	tgt_dev->geo.nr_luns = (lun_balanced) ? prev_nr_luns : -1;
+
[PATCH 6/8] lightnvm: pblk: implement get log report chunk
From: Javier González

In preparation of pblk supporting 2.0, implement the get log report
chunk in pblk.

This patch only replicates the bad block functionality as the rest of
the metadata requires new pblk functionality (e.g., wear-index to
implement wear-leveling). This functionality will come in future patches.

Signed-off-by: Javier González
---
 drivers/lightnvm/pblk-core.c  | 118 +++
 drivers/lightnvm/pblk-init.c  | 186 +++---
 drivers/lightnvm/pblk-sysfs.c |  67 +++
 drivers/lightnvm/pblk.h       |  20 +
 4 files changed, 327 insertions(+), 64 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 519af8b9eab7..01b78ee5c0e0 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -44,11 +44,12 @@ static void pblk_line_mark_bb(struct work_struct *work)
 }

 static void pblk_mark_bb(struct pblk *pblk, struct pblk_line *line,
-			 struct ppa_addr *ppa)
+			 struct ppa_addr ppa_addr)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
 	struct nvm_geo *geo = &dev->geo;
-	int pos = pblk_ppa_to_pos(geo, *ppa);
+	struct ppa_addr *ppa;
+	int pos = pblk_ppa_to_pos(geo, ppa_addr);

 	pr_debug("pblk: erase failed: line:%d, pos:%d\n", line->id, pos);
 	atomic_long_inc(&pblk->erase_failed);
@@ -58,6 +59,15 @@ static void pblk_mark_bb(struct pblk *pblk, struct pblk_line *line,
 		pr_err("pblk: attempted to erase bb: line:%d, pos:%d\n",
 							line->id, pos);

+	/* Not necessary to mark bad blocks on 2.0 spec. */
+	if (geo->c.version == NVM_OCSSD_SPEC_20)
+		return;
+
+	ppa = kmalloc(sizeof(struct ppa_addr), GFP_ATOMIC);
+	if (!ppa)
+		return;
+
+	*ppa = ppa_addr;
 	pblk_gen_run_ws(pblk, NULL, ppa, pblk_line_mark_bb,
 						GFP_ATOMIC, pblk->bb_wq);
 }
@@ -69,16 +79,8 @@ static void __pblk_end_io_erase(struct pblk *pblk, struct nvm_rq *rqd)
 	line = &pblk->lines[pblk_ppa_to_line(rqd->ppa_addr)];
 	atomic_dec(&line->left_seblks);

-	if (rqd->error) {
-		struct ppa_addr *ppa;
-
-		ppa = kmalloc(sizeof(struct ppa_addr), GFP_ATOMIC);
-		if (!ppa)
-			return;
-
-		*ppa = rqd->ppa_addr;
-		pblk_mark_bb(pblk, line, ppa);
-	}
+	if (rqd->error)
+		pblk_mark_bb(pblk, line, rqd->ppa_addr);

 	atomic_dec(&pblk->inflight_io);
 }
@@ -92,6 +94,47 @@ static void pblk_end_io_erase(struct nvm_rq *rqd)
 	mempool_free(rqd, pblk->e_rq_pool);
 }

+/*
+ * Get information for all chunks from the device.
+ *
+ * The caller is responsible for freeing the returned structure
+ */
+struct nvm_chunk_log_page *pblk_chunk_get_info(struct pblk *pblk)
+{
+	struct nvm_tgt_dev *dev = pblk->dev;
+	struct nvm_geo *geo = &dev->geo;
+	struct nvm_chunk_log_page *log;
+	unsigned long len;
+	int ret;
+
+	len = geo->all_chunks * sizeof(*log);
+	log = kzalloc(len, GFP_KERNEL);
+	if (!log)
+		return ERR_PTR(-ENOMEM);
+
+	ret = nvm_get_chunk_log_page(dev, log, 0, len);
+	if (ret) {
+		pr_err("pblk: could not get chunk log page (%d)\n", ret);
+		kfree(log);
+		return ERR_PTR(-EIO);
+	}
+
+	return log;
+}
+
+struct nvm_chunk_log_page *pblk_chunk_get_off(struct pblk *pblk,
+					      struct nvm_chunk_log_page *lp,
+					      struct ppa_addr ppa)
+{
+	struct nvm_tgt_dev *dev = pblk->dev;
+	struct nvm_geo *geo = &dev->geo;
+	int ch_off = ppa.m.ch * geo->c.num_chk * geo->num_lun;
+	int lun_off = ppa.m.lun * geo->c.num_chk;
+	int chk_off = ppa.m.chk;
+
+	return lp + ch_off + lun_off + chk_off;
+}
+
 void __pblk_map_invalidate(struct pblk *pblk, struct pblk_line *line,
 			   u64 paddr)
 {
@@ -1094,10 +1137,38 @@ static int pblk_line_init_bb(struct pblk *pblk, struct pblk_line *line,
 	return 1;
 }

+static int pblk_prepare_new_line(struct pblk *pblk, struct pblk_line *line)
+{
+	struct pblk_line_meta *lm = &pblk->lm;
+	struct nvm_tgt_dev *dev = pblk->dev;
+	struct nvm_geo *geo = &dev->geo;
+	int blk_to_erase = atomic_read(&line->blk_in_line);
+	int i;
+
+	for (i = 0; i < lm->blk_per_line; i++) {
+		int state = line->chks[i].state;
+		struct pblk_lun *rlun = &pblk->luns[i];
+
+		/* Free chunks should not be erased */
+		if (state & NVM_CHK_ST_FREE) {
+			set_bit(pblk_ppa_to_pos(geo, rlun->chunk_bppa),
+							line->erase_bitmap);
+			blk_to_erase--;
+
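The index arithmetic in pblk_chunk_get_off() above treats the log as a flat array ordered by channel, then LUN, then chunk. A minimal standalone sketch of the same arithmetic, with hypothetical toy_* types standing in for the kernel geometry and address structures:

```c
#include <stddef.h>

/* Hypothetical miniature of the geometry fields used by the lookup. */
struct toy_geo { int num_ch; int num_lun; int num_chk; };
struct toy_chunk_addr { int ch; int lun; int chk; };

/* Index of a chunk's entry in a log laid out channel-major,
 * then by LUN, then by chunk. */
static size_t toy_chunk_index(const struct toy_geo *geo,
			      struct toy_chunk_addr a)
{
	size_t ch_off = (size_t)a.ch * geo->num_chk * geo->num_lun;
	size_t lun_off = (size_t)a.lun * geo->num_chk;

	return ch_off + lun_off + (size_t)a.chk;
}
```

For a geometry of 2 channels, 4 LUNs per channel and 100 chunks per LUN, the last chunk of the last LUN on the last channel lands at index 799, the final slot of an 800-entry log.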
[PATCH 4/8] lightnvm: convert address based on spec. version
Create the device ppa for both 1.2 and 2.0.

Signed-off-by: Javier González
---
 include/linux/lightnvm.h | 52 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 36 insertions(+), 16 deletions(-)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index e035ae4c9acc..1148b3f22b27 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -412,16 +412,26 @@ static inline struct ppa_addr generic_to_dev_addr(struct nvm_tgt_dev *tgt_dev,
 						  struct ppa_addr r)
 {
 	struct nvm_geo *geo = &tgt_dev->geo;
-	struct nvm_addr_format_12 *ppaf =
-				(struct nvm_addr_format_12 *)&geo->c.addrf;
 	struct ppa_addr l;

-	l.ppa = ((u64)r.g.ch) << ppaf->ch_offset;
-	l.ppa |= ((u64)r.g.lun) << ppaf->lun_offset;
-	l.ppa |= ((u64)r.g.blk) << ppaf->blk_offset;
-	l.ppa |= ((u64)r.g.pg) << ppaf->pg_offset;
-	l.ppa |= ((u64)r.g.pl) << ppaf->pln_offset;
-	l.ppa |= ((u64)r.g.sec) << ppaf->sec_offset;
+	if (geo->c.version == NVM_OCSSD_SPEC_12) {
+		struct nvm_addr_format_12 *ppaf =
+				(struct nvm_addr_format_12 *)&geo->c.addrf;
+
+		l.ppa = ((u64)r.g.ch) << ppaf->ch_offset;
+		l.ppa |= ((u64)r.g.lun) << ppaf->lun_offset;
+		l.ppa |= ((u64)r.g.blk) << ppaf->blk_offset;
+		l.ppa |= ((u64)r.g.pg) << ppaf->pg_offset;
+		l.ppa |= ((u64)r.g.pl) << ppaf->pln_offset;
+		l.ppa |= ((u64)r.g.sec) << ppaf->sec_offset;
+	} else {
+		struct nvm_addr_format *lbaf = &geo->c.addrf;
+
+		l.ppa = ((u64)r.m.ch) << lbaf->ch_offset;
+		l.ppa |= ((u64)r.m.lun) << lbaf->lun_offset;
+		l.ppa |= ((u64)r.m.chk) << lbaf->chk_offset;
+		l.ppa |= ((u64)r.m.sec) << lbaf->sec_offset;
+	}

 	return l;
 }
@@ -430,18 +440,28 @@ static inline struct ppa_addr dev_to_generic_addr(struct nvm_tgt_dev *tgt_dev,
 						  struct ppa_addr r)
 {
 	struct nvm_geo *geo = &tgt_dev->geo;
-	struct nvm_addr_format_12 *ppaf =
-				(struct nvm_addr_format_12 *)&geo->c.addrf;
 	struct ppa_addr l;

 	l.ppa = 0;

-	l.g.ch = (r.ppa & ppaf->ch_mask) >> ppaf->ch_offset;
-	l.g.lun = (r.ppa & ppaf->lun_mask) >> ppaf->lun_offset;
-	l.g.blk = (r.ppa & ppaf->blk_mask) >> ppaf->blk_offset;
-	l.g.pg = (r.ppa & ppaf->pg_mask) >> ppaf->pg_offset;
-	l.g.pl = (r.ppa & ppaf->pln_mask) >> ppaf->pln_offset;
-	l.g.sec = (r.ppa & ppaf->sec_mask) >> ppaf->sec_offset;
+	if (geo->c.version == NVM_OCSSD_SPEC_12) {
+		struct nvm_addr_format_12 *ppaf =
+				(struct nvm_addr_format_12 *)&geo->c.addrf;
+
+		l.g.ch = (r.ppa & ppaf->ch_mask) >> ppaf->ch_offset;
+		l.g.lun = (r.ppa & ppaf->lun_mask) >> ppaf->lun_offset;
+		l.g.blk = (r.ppa & ppaf->blk_mask) >> ppaf->blk_offset;
+		l.g.pg = (r.ppa & ppaf->pg_mask) >> ppaf->pg_offset;
+		l.g.pl = (r.ppa & ppaf->pln_mask) >> ppaf->pln_offset;
+		l.g.sec = (r.ppa & ppaf->sec_mask) >> ppaf->sec_offset;
+	} else {
+		struct nvm_addr_format *lbaf = &geo->c.addrf;
+
+		l.m.ch = (r.ppa & lbaf->ch_mask) >> lbaf->ch_offset;
+		l.m.lun = (r.ppa & lbaf->lun_mask) >> lbaf->lun_offset;
+		l.m.chk = (r.ppa & lbaf->chk_mask) >> lbaf->chk_offset;
+		l.m.sec = (r.ppa & lbaf->sec_mask) >> lbaf->sec_offset;
+	}

 	return l;
 }
--
2.7.4
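The 2.0 branch of the two helpers above is a plain shift-and-mask packing. The following userspace sketch round-trips an address through the same arithmetic; the toy_* names, the way the format is built from field widths, and the choice of putting the sector bits lowest are assumptions for illustration, since the real offsets and masks come from the device.

```c
#include <stdint.h>

/* Hypothetical 2.0-style address format: bit offset and mask per field. */
struct toy_addr_format {
	uint8_t ch_offset, lun_offset, chk_offset, sec_offset;
	uint64_t ch_mask, lun_mask, chk_mask, sec_mask;
};

struct toy_addr { uint32_t ch, lun, chk, sec; };

/* Build a format from field widths, sector bits lowest (assumed ordering).
 * Each width must be smaller than 64. */
static struct toy_addr_format toy_format(unsigned sec_len, unsigned chk_len,
					 unsigned lun_len, unsigned ch_len)
{
	struct toy_addr_format f;

	f.sec_offset = 0;
	f.chk_offset = (uint8_t)sec_len;
	f.lun_offset = (uint8_t)(f.chk_offset + chk_len);
	f.ch_offset = (uint8_t)(f.lun_offset + lun_len);
	f.sec_mask = ((1ULL << sec_len) - 1) << f.sec_offset;
	f.chk_mask = ((1ULL << chk_len) - 1) << f.chk_offset;
	f.lun_mask = ((1ULL << lun_len) - 1) << f.lun_offset;
	f.ch_mask = ((1ULL << ch_len) - 1) << f.ch_offset;
	return f;
}

/* Generic to device: shift each field to its offset, as in the patch. */
static uint64_t toy_to_dev(const struct toy_addr_format *f, struct toy_addr a)
{
	uint64_t ppa;

	ppa = (uint64_t)a.ch << f->ch_offset;
	ppa |= (uint64_t)a.lun << f->lun_offset;
	ppa |= (uint64_t)a.chk << f->chk_offset;
	ppa |= (uint64_t)a.sec << f->sec_offset;
	return ppa;
}

/* Device to generic: mask each field out and shift it back down. */
static struct toy_addr toy_from_dev(const struct toy_addr_format *f,
				    uint64_t ppa)
{
	struct toy_addr a;

	a.ch = (uint32_t)((ppa & f->ch_mask) >> f->ch_offset);
	a.lun = (uint32_t)((ppa & f->lun_mask) >> f->lun_offset);
	a.chk = (uint32_t)((ppa & f->chk_mask) >> f->chk_offset);
	a.sec = (uint32_t)((ppa & f->sec_mask) >> f->sec_offset);
	return a;
}
```

As in the patch, packing does not mask its inputs, so the round trip only holds when every field fits in the width the format reserves for it.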
[PATCH 5/8] lightnvm: implement get log report chunk helpers
From: Javier González

The 2.0 spec provides a report chunk log page that can be retrieved
using the standard nvme get log page. This replaces the dedicated
get/put bad block table in 1.2.

This patch implements the helper functions to allow targets to retrieve
the chunk metadata using get log page.

Signed-off-by: Javier González
---
 drivers/lightnvm/core.c      | 28 +
 drivers/nvme/host/lightnvm.c | 50
 include/linux/lightnvm.h     | 32
 3 files changed, 110 insertions(+)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 80492fa6ee76..6857a888544a 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -43,6 +43,8 @@ struct nvm_ch_map {
 struct nvm_dev_map {
 	struct nvm_ch_map *chnls;
 	int nr_chnls;
+	int bch;
+	int blun;
 };

 static struct nvm_target *nvm_find_target(struct nvm_dev *dev, const char *name)
@@ -171,6 +173,9 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
 	if (!dev_map->chnls)
 		goto err_chnls;

+	dev_map->bch = bch;
+	dev_map->blun = blun;
+
 	luns = kcalloc(nr_luns, sizeof(struct ppa_addr), GFP_KERNEL);
 	if (!luns)
 		goto err_luns;
@@ -561,6 +566,19 @@ static void nvm_unregister_map(struct nvm_dev *dev)
 	kfree(rmap);
 }

+static unsigned long nvm_log_off_tgt_to_dev(struct nvm_tgt_dev *tgt_dev)
+{
+	struct nvm_dev_map *dev_map = tgt_dev->map;
+	struct nvm_geo *geo = &tgt_dev->geo;
+	int lun_off;
+	unsigned long off;
+
+	lun_off = dev_map->blun + dev_map->bch * geo->num_lun;
+	off = lun_off * geo->c.num_chk * sizeof(struct nvm_chunk_log_page);
+
+	return off;
+}
+
 static void nvm_map_to_dev(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *p)
 {
 	struct nvm_dev_map *dev_map = tgt_dev->map;
@@ -720,6 +738,16 @@ static void nvm_free_rqd_ppalist(struct nvm_tgt_dev *tgt_dev,
 	nvm_dev_dma_free(tgt_dev->parent, rqd->ppa_list, rqd->dma_ppa_list);
 }

+int nvm_get_chunk_log_page(struct nvm_tgt_dev *tgt_dev,
+			   struct nvm_chunk_log_page *log,
+			   unsigned long off, unsigned long len)
+{
+	struct nvm_dev *dev = tgt_dev->parent;
+
+	off += nvm_log_off_tgt_to_dev(tgt_dev);
+
+	return dev->ops->get_chunk_log_page(tgt_dev->parent, log, off, len);
+}

 int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
 		       int nr_ppas, int type)
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 7bc75182c723..355d9b0cf084 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -35,6 +35,10 @@ enum nvme_nvm_admin_opcode {
 	nvme_nvm_admin_set_bb_tbl	= 0xf1,
 };

+enum nvme_nvm_log_page {
+	NVME_NVM_LOG_REPORT_CHUNK	= 0xCA,
+};
+
 struct nvme_nvm_ph_rw {
 	__u8			opcode;
 	__u8			flags;
@@ -553,6 +557,50 @@ static int nvme_nvm_set_bb_tbl(struct nvm_dev *nvmdev, struct ppa_addr *ppas,
 	return ret;
 }

+static int nvme_nvm_get_chunk_log_page(struct nvm_dev *nvmdev,
+				       struct nvm_chunk_log_page *log,
+				       unsigned long off,
+				       unsigned long total_len)
+{
+	struct nvme_ns *ns = nvmdev->q->queuedata;
+	struct nvme_command c = { };
+	unsigned long offset = off, left = total_len;
+	unsigned long len, len_dwords;
+	void *buf = log;
+	int ret;
+
+	/* The offset needs to be dword-aligned */
+	if (offset & 0x3)
+		return -EINVAL;
+
+	do {
+		/* Send 256KB at a time */
+		len = (1 << 18) > left ? left : (1 << 18);
+		len_dwords = (len >> 2) - 1;
+
+		c.get_log_page.opcode = nvme_admin_get_log_page;
+		c.get_log_page.nsid = cpu_to_le32(ns->head->ns_id);
+		c.get_log_page.lid = NVME_NVM_LOG_REPORT_CHUNK;
+		c.get_log_page.lpol = cpu_to_le32(offset & 0xffffffff);
+		c.get_log_page.lpou = cpu_to_le32(offset >> 32);
+		c.get_log_page.numdl = cpu_to_le16(len_dwords & 0xffff);
+		c.get_log_page.numdu = cpu_to_le16(len_dwords >> 16);
+
+		ret = nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, buf, len);
+		if (ret) {
+			dev_err(ns->ctrl->device,
+				"get chunk log page failed (%d)\n", ret);
+			break;
+		}
+
+		buf += len;
+		offset += len;
+		left -= len;
+	} while (left);
+
+	return ret;
+}
+
 static inline void nvme_nvm_rqtocmd(struct nvm_rq *rqd, struct nvme_ns *ns,
 				    struct
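The arithmetic in this patch has two parts: translating a target-relative log offset into a device offset, and splitting a long read into 256KB get-log-page commands whose offset and zero-based dword count are each divided into a lower and an upper field. The following is a user-space sketch of just that arithmetic. `struct log_cmd`, `tgt_log_offset()` and `split_log_read()` are hypothetical names, the descriptor size is passed in rather than assumed, and unlike the do/while loop in the patch this sketch rejects a zero or non-dword length up front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_XFER (1u << 18)	/* 256KB per command, as in the patch */

/* Hypothetical record of the fields one get-log-page command would carry. */
struct log_cmd {
	uint32_t lpol, lpou;	/* lower / upper 32 bits of the byte offset */
	uint16_t numdl, numdu;	/* lower / upper 16 bits of (dwords - 1) */
	uint64_t len;		/* bytes transferred by this command */
};

/* Byte offset of a target's first chunk descriptor within the device log.
 * entry_size stands in for sizeof(struct nvm_chunk_log_page). */
static uint64_t tgt_log_offset(unsigned int bch, unsigned int blun,
			       unsigned int num_lun, unsigned int num_chk,
			       size_t entry_size)
{
	uint64_t lun_off = blun + (uint64_t)bch * num_lun;

	return lun_off * num_chk * entry_size;
}

/* Split [off, off + total_len) into commands of at most MAX_XFER bytes.
 * Returns the number of commands, or -1 on bad input or too few slots. */
static int split_log_read(uint64_t off, uint64_t total_len,
			  struct log_cmd *cmds, int max_cmds)
{
	int n = 0;

	assert(cmds != NULL);

	/* Offset must be dword-aligned; the length check is this sketch's
	 * own assumption, not something the patch enforces. */
	if ((off & 0x3) || total_len == 0 || (total_len & 0x3))
		return -1;

	while (total_len) {
		uint64_t len = total_len < MAX_XFER ? total_len : MAX_XFER;
		uint64_t dwords = (len >> 2) - 1;	/* zero-based count */

		if (n == max_cmds)
			return -1;

		cmds[n].lpol = (uint32_t)(off & 0xffffffff);
		cmds[n].lpou = (uint32_t)(off >> 32);
		cmds[n].numdl = (uint16_t)(dwords & 0xffff);
		cmds[n].numdu = (uint16_t)(dwords >> 16);
		cmds[n].len = len;
		n++;

		off += len;
		total_len -= len;
	}

	return n;
}
```

With a 256KB cap the dword count never exceeds 0xffff, so the upper half is always zero in this sketch; the split only matters if the cap were raised.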
[PATCH 2/8] lightnvm: show generic geometry in sysfs
From: Javier González

Apart from showing the geometry returned by the different identify
commands, provide the generic geometry too, as this is the geometry that
targets will use to describe the device.

Signed-off-by: Javier González
---
 drivers/nvme/host/lightnvm.c | 146 ---
 1 file changed, 97 insertions(+), 49 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 97739e668602..7bc75182c723 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -944,8 +944,27 @@ static ssize_t nvm_dev_attr_show(struct device *dev,
 		return scnprintf(page, PAGE_SIZE, "%u.%u\n",
 						dev_geo->major_ver_id,
 						dev_geo->minor_ver_id);
-	} else if (strcmp(attr->name, "capabilities") == 0) {
-		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.cap);
+	} else if (strcmp(attr->name, "clba") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.clba);
+	} else if (strcmp(attr->name, "csecs") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.csecs);
+	} else if (strcmp(attr->name, "sos") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.sos);
+	} else if (strcmp(attr->name, "ws_min") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.ws_min);
+	} else if (strcmp(attr->name, "ws_opt") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.ws_opt);
+	} else if (strcmp(attr->name, "maxoc") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.maxoc);
+	} else if (strcmp(attr->name, "maxocpu") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.maxocpu);
+	} else if (strcmp(attr->name, "mw_cunits") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.mw_cunits);
+	} else if (strcmp(attr->name, "media_capabilities") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.mccap);
+	} else if (strcmp(attr->name, "max_phys_secs") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n",
+						ndev->ops->max_phys_sect);
 	} else if (strcmp(attr->name, "read_typ") == 0) {
 		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.trdt);
 	} else if (strcmp(attr->name, "read_max") == 0) {
@@ -984,19 +1003,8 @@ static ssize_t nvm_dev_attr_show_12(struct device *dev,

 	attr = &dattr->attr;

-	if (strcmp(attr->name, "vendor_opcode") == 0) {
-		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.vmnt);
-	} else if (strcmp(attr->name, "device_mode") == 0) {
-		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.dom);
-	/* kept for compatibility */
-	} else if (strcmp(attr->name, "media_manager") == 0) {
-		return scnprintf(page, PAGE_SIZE, "%s\n", "gennvm");
-	} else if (strcmp(attr->name, "ppa_format") == 0) {
+	if (strcmp(attr->name, "ppa_format") == 0) {
 		return nvm_dev_attr_show_ppaf((void *)&dev_geo->c.addrf, page);
-	} else if (strcmp(attr->name, "media_type") == 0) {	/* u8 */
-		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.mtype);
-	} else if (strcmp(attr->name, "flash_media_type") == 0) {
-		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.fmtype);
 	} else if (strcmp(attr->name, "num_channels") == 0) {
 		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->num_ch);
 	} else if (strcmp(attr->name, "num_luns") == 0) {
@@ -1011,8 +1019,6 @@ static ssize_t nvm_dev_attr_show_12(struct device *dev,
 		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.fpg_sz);
 	} else if (strcmp(attr->name, "hw_sector_size") == 0) {
 		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.csecs);
-	} else if (strcmp(attr->name, "oob_sector_size") == 0) {/* u32 */
-		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.sos);
 	} else if (strcmp(attr->name, "prog_typ") == 0) {
 		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.tprt);
 	} else if (strcmp(attr->name, "prog_max") == 0) {
@@ -1021,13 +1027,21 @@ static ssize_t nvm_dev_attr_show_12(struct device *dev,
 		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.tbet);
 	} else if (strcmp(attr->name, "erase_max") == 0) {
 		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.tbem);
+	} else if (strcmp(attr->name, "vendor_opcode") == 0) {
+		return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.vmnt);
+	} else if (strcmp(attr->name, "device_mode") == 0) {
+		return scnprintf(page, PAGE_SIZE,
[PATCH 8/8] lightnvm: pblk: implement 2.0 support
Implement 2.0 support in pblk. This includes the address formatting and
mapping paths, as well as the sysfs entries for them.

Signed-off-by: Javier González
---
 drivers/lightnvm/pblk-init.c  | 57 ++--
 drivers/lightnvm/pblk-sysfs.c | 36 ++--
 drivers/lightnvm/pblk.h       | 198 --
 3 files changed, 233 insertions(+), 58 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 04685f2d39d3..d5a31fc986cc 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -231,20 +231,63 @@ static int pblk_set_addrf_12(struct nvm_geo *geo,
 	return dst->blk_offset + src->blk_len;
 }

+static int pblk_set_addrf_20(struct nvm_geo *geo,
+			     struct nvm_addr_format *adst,
+			     struct pblk_addr_format *udst)
+{
+	struct nvm_addr_format *src = &geo->c.addrf;
+
+	adst->ch_len = get_count_order(geo->num_ch);
+	adst->lun_len = get_count_order(geo->num_lun);
+	adst->chk_len = src->chk_len;
+	adst->sec_len = src->sec_len;
+
+	adst->sec_offset = 0;
+	adst->ch_offset = adst->sec_len;
+	adst->lun_offset = adst->ch_offset + adst->ch_len;
+	adst->chk_offset = adst->lun_offset + adst->lun_len;
+
+	adst->sec_mask = ((1ULL << adst->sec_len) - 1) << adst->sec_offset;
+	adst->chk_mask = ((1ULL << adst->chk_len) - 1) << adst->chk_offset;
+	adst->lun_mask = ((1ULL << adst->lun_len) - 1) << adst->lun_offset;
+	adst->ch_mask = ((1ULL << adst->ch_len) - 1) << adst->ch_offset;
+
+	udst->sec_stripe = geo->c.ws_opt;
+	udst->ch_stripe = geo->num_ch;
+	udst->lun_stripe = geo->num_lun;
+
+	udst->sec_lun_stripe = udst->sec_stripe * udst->ch_stripe;
+	udst->sec_ws_stripe = udst->sec_lun_stripe * udst->lun_stripe;
+
+	return adst->chk_offset + adst->chk_len;
+}
+
 static int pblk_set_addrf(struct pblk *pblk)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
 	struct nvm_geo *geo = &dev->geo;
 	int mod;

-	div_u64_rem(geo->c.clba, pblk->min_write_pgs, &mod);
-	if (mod) {
-		pr_err("pblk: bad configuration of sectors/pages\n");
+	switch (geo->c.version) {
+	case NVM_OCSSD_SPEC_12:
+		div_u64_rem(geo->c.clba, pblk->min_write_pgs, &mod);
+		if (mod) {
+			pr_err("pblk: bad configuration of sectors/pages\n");
+			return -EINVAL;
+		}
+
+		pblk->addrf_len = pblk_set_addrf_12(geo, (void *)&pblk->addrf);
+		break;
+	case NVM_OCSSD_SPEC_20:
+		pblk->addrf_len = pblk_set_addrf_20(geo, (void *)&pblk->addrf,
+								&pblk->uaddrf);
+		break;
+	default:
+		pr_err("pblk: OCSSD revision not supported (%d)\n",
+								geo->c.version);
 		return -EINVAL;
 	}

-	pblk->addrf_len = pblk_set_addrf_12(geo, (void *)&pblk->addrf);
-
 	return 0;
 }

@@ -,7 +1154,9 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk,
 	struct pblk *pblk;
 	int ret;

-	if (geo->c.version != NVM_OCSSD_SPEC_12) {
+	/* pblk supports 1.2 and 2.0 versions */
+	if (!(geo->c.version == NVM_OCSSD_SPEC_12 ||
+					geo->c.version == NVM_OCSSD_SPEC_20)) {
 		pr_err("pblk: OCSSD version not supported (%u)\n",
 							geo->c.version);
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c
index 191af0c6591e..60b8d931e4ba 100644
--- a/drivers/lightnvm/pblk-sysfs.c
+++ b/drivers/lightnvm/pblk-sysfs.c
@@ -113,15 +113,16 @@ static ssize_t pblk_sysfs_ppaf(struct pblk *pblk, char *page)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
 	struct nvm_geo *geo = &dev->geo;
-	struct nvm_addr_format_12 *ppaf;
-	struct nvm_addr_format_12 *geo_ppaf;
 	ssize_t sz = 0;

-	ppaf = (struct nvm_addr_format_12 *)&pblk->addrf;
-	geo_ppaf = (struct nvm_addr_format_12 *)&geo->c.addrf;
+	if (geo->c.version == NVM_OCSSD_SPEC_12) {
+		struct nvm_addr_format_12 *ppaf =
+				(struct nvm_addr_format_12 *)&pblk->addrf;
+		struct nvm_addr_format_12 *geo_ppaf =
+				(struct nvm_addr_format_12 *)&geo->c.addrf;

-	sz = snprintf(page, PAGE_SIZE,
-		"pblk:(s:%d)ch:%d/%d,lun:%d/%d,blk:%d/%d,pg:%d/%d,pl:%d/%d,sec:%d/%d\n",
+		sz = snprintf(page, PAGE_SIZE,
+			"pblk:(s:%d)ch:%d/%d,lun:%d/%d,blk:%d/%d,pg:%d/%d,pl:%d/%d,sec:%d/%d\n",
 			pblk->addrf_len,
 			ppaf->ch_offset, ppaf->ch_len,
 			ppaf->lun_offset, ppaf->lun_len,
@@
[PATCH 7/8] lightnvm: pblk: refactor init/exit sequences
Refactor init and exit sequences to improve readability. Along the way,
fix the bad free ordering on the init error path.

Signed-off-by: Javier González
---
 drivers/lightnvm/pblk-init.c | 503 ++-
 1 file changed, 254 insertions(+), 249 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index dfc68718e27e..04685f2d39d3 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -103,7 +103,40 @@ static void pblk_l2p_free(struct pblk *pblk)
 	vfree(pblk->trans_map);
 }

-static int pblk_l2p_init(struct pblk *pblk)
+static int pblk_l2p_recover(struct pblk *pblk, bool factory_init)
+{
+	struct pblk_line *line = NULL;
+
+	if (factory_init) {
+		pblk_setup_uuid(pblk);
+	} else {
+		line = pblk_recov_l2p(pblk);
+		if (IS_ERR(line)) {
+			pr_err("pblk: could not recover l2p table\n");
+			return -EFAULT;
+		}
+	}
+
+#ifdef CONFIG_NVM_DEBUG
+	pr_info("pblk init: L2P CRC: %x\n", pblk_l2p_crc(pblk));
+#endif
+
+	/* Free full lines directly as GC has not been started yet */
+	pblk_gc_free_full_lines(pblk);
+
+	if (!line) {
+		/* Configure next line for user data */
+		line = pblk_line_get_first_data(pblk);
+		if (!line) {
+			pr_err("pblk: line list corrupted\n");
+			return -EFAULT;
+		}
+	}
+
+	return 0;
+}
+
+static int pblk_l2p_init(struct pblk *pblk, bool factory_init)
 {
 	sector_t i;
 	struct ppa_addr ppa;
@@ -119,7 +152,7 @@ static int pblk_l2p_init(struct pblk *pblk)
 	for (i = 0; i < pblk->rl.nr_secs; i++)
 		pblk_trans_map_set(pblk, i, ppa);

-	return 0;
+	return pblk_l2p_recover(pblk, factory_init);
 }

 static void pblk_rwb_free(struct pblk *pblk)
@@ -268,87 +301,114 @@ static int pblk_core_init(struct pblk *pblk)
 {
 	struct nvm_tgt_dev *dev = pblk->dev;
 	struct nvm_geo *geo = &dev->geo;
+	int max_write_ppas;
+
+	atomic64_set(&pblk->user_wa, 0);
+	atomic64_set(&pblk->pad_wa, 0);
+	atomic64_set(&pblk->gc_wa, 0);
+	pblk->user_rst_wa = 0;
+	pblk->pad_rst_wa = 0;
+	pblk->gc_rst_wa = 0;
+
+	atomic_long_set(&pblk->nr_flush, 0);
+	pblk->nr_flush_rst = 0;

 	pblk->pgs_in_buffer = geo->c.mw_cunits * geo->c.ws_opt * geo->all_luns;

+	pblk->min_write_pgs = geo->c.ws_opt * (geo->c.csecs / PAGE_SIZE);
+	max_write_ppas = pblk->min_write_pgs * geo->all_luns;
+	pblk->max_write_pgs = (max_write_ppas < nvm_max_phys_sects(dev)) ?
+				max_write_ppas : nvm_max_phys_sects(dev);
+	pblk_set_sec_per_write(pblk, pblk->min_write_pgs);
+
+	if (pblk->max_write_pgs > PBLK_MAX_REQ_ADDRS) {
+		pr_err("pblk: cannot support device max_phys_sect\n");
+		return -EINVAL;
+	}
+
+	pblk->pad_dist = kzalloc((pblk->min_write_pgs - 1) * sizeof(atomic64_t),
+								GFP_KERNEL);
+	if (!pblk->pad_dist)
+		return -ENOMEM;
+
 	if (pblk_init_global_caches(pblk))
-		return -ENOMEM;
+		goto fail_free_pad_dist;

 	/* Internal bios can be at most the sectors signaled by the device. */
 	pblk->page_bio_pool = mempool_create_page_pool(nvm_max_phys_sects(dev),
 								0);
 	if (!pblk->page_bio_pool)
-		goto free_global_caches;
+		goto fail_free_global_caches;

 	pblk->gen_ws_pool = mempool_create_slab_pool(PBLK_GEN_WS_POOL_SIZE,
 							pblk_ws_cache);
 	if (!pblk->gen_ws_pool)
-		goto free_page_bio_pool;
+		goto fail_free_page_bio_pool;

 	pblk->rec_pool = mempool_create_slab_pool(geo->all_luns,
 							pblk_rec_cache);
 	if (!pblk->rec_pool)
-		goto free_gen_ws_pool;
+		goto fail_free_gen_ws_pool;

 	pblk->r_rq_pool = mempool_create_slab_pool(geo->all_luns,
 							pblk_g_rq_cache);
 	if (!pblk->r_rq_pool)
-		goto free_rec_pool;
+		goto fail_free_rec_pool;

 	pblk->e_rq_pool = mempool_create_slab_pool(geo->all_luns,
 							pblk_g_rq_cache);
 	if (!pblk->e_rq_pool)
-		goto free_r_rq_pool;
+		goto fail_free_r_rq_pool;

 	pblk->w_rq_pool = mempool_create_slab_pool(geo->all_luns,
 							pblk_w_rq_cache);
 	if (!pblk->w_rq_pool)
-		goto free_e_rq_pool;
+		goto fail_free_e_rq_pool;
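The fail_free_* labels in pblk_core_init() follow the usual kernel unwinding pattern: each failure jumps to a label that releases what was already acquired, in reverse order of acquisition. A small user-space sketch of that pattern follows, with failure injection so the release order can be observed. The names `acquire()`, `release()`, `reset()`, `core_init()` and the trace buffer are all hypothetical and have no counterpart in pblk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical trace of acquire (upper case) and release (lower case). */
static char trace[16];
static size_t trace_len;
static int fail_at;	/* step number that should fail; 0 means none */

static void note(char c)
{
	assert(trace_len + 1 < sizeof(trace));
	trace[trace_len++] = c;
	trace[trace_len] = '\0';
}

/* Clear the trace and choose which step fails on the next run. */
static void reset(int fail)
{
	fail_at = fail;
	trace_len = 0;
	trace[0] = '\0';
}

/* Pretend to allocate resource number step; fails when step == fail_at. */
static int acquire(int step, char tag)
{
	if (step == fail_at)
		return -1;
	note(tag);
	return 0;
}

static void release(char tag)
{
	note(tag);
}

/* Acquire A, B, C in order; on failure unwind in reverse order. */
static int core_init(void)
{
	if (acquire(1, 'A'))
		return -1;
	if (acquire(2, 'B'))
		goto fail_free_a;
	if (acquire(3, 'C'))
		goto fail_free_b;

	return 0;

fail_free_b:
	release('b');
fail_free_a:
	release('a');
	return -1;
}
```

Placing a label in the wrong position in the chain is the kind of ordering mistake the commit message refers to: a jump would then skip a release or free something that was never set up.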
Re: [PATCH] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into
On (02/12/18 11:05), Bart Van Assche wrote:
[..]
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ac4740cf74be..cf17626604c2 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1026,14 +1026,25 @@ static inline int blk_rq_cur_bytes(const struct request *rq)
>
>  extern unsigned int blk_rq_err_bytes(const struct request *rq);
>
> +/*
> + * Variables of type sector_t represent an offset or size that is a multiple of
> + * 2**9 bytes. Hence these two constants.
> + */
> +#ifndef SECTOR_SHIFT
> +enum { SECTOR_SHIFT = 9 };
> +#endif
> +#ifndef SECTOR_SIZE
> +enum { SECTOR_SIZE = 512 };
> +#endif

Shouldn't SECTOR_SIZE depend on SECTOR_SHIFT?

	1 << SECTOR_SHIFT

	-ss
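A minimal sketch of the suggestion, keeping the enum form from the quoted hunk but deriving the size from the shift so the two constants cannot drift apart. The helpers `sectors_to_bytes()` and `bytes_to_sectors()` are hypothetical and only show the constants in use.

```c
#include <assert.h>
#include <stdint.h>

enum { SECTOR_SHIFT = 9 };
enum { SECTOR_SIZE = 1 << SECTOR_SHIFT };	/* derived, not repeated */

/* Convert a sector count to bytes. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors << SECTOR_SHIFT;
}

/* Convert a byte count to sectors; the count must be sector-aligned. */
static uint64_t bytes_to_sectors(uint64_t bytes)
{
	assert((bytes & (SECTOR_SIZE - 1)) == 0);
	return bytes >> SECTOR_SHIFT;
}
```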
Re: [PATCH] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into
On Mon, 2018-02-12 at 11:05 -0800, Bart Van Assche wrote:
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ac4740cf74be..cf17626604c2 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1026,14 +1026,25 @@ static inline int blk_rq_cur_bytes(const struct request *rq)
>
>  extern unsigned int blk_rq_err_bytes(const struct request *rq);
>
> +/*
> + * Variables of type sector_t represent an offset or size that is a multiple of
> + * 2**9 bytes. Hence these two constants.
> + */
> +#ifndef SECTOR_SHIFT
> +enum { SECTOR_SHIFT = 9 };
> +#endif
> +#ifndef SECTOR_SIZE
> +enum { SECTOR_SIZE = 512 };
> +#endif

Can you please make a #define out of these enums? I know gdb can cope
better with enums than defines but IIRC adding -ggdb3 to the CFLAGS
solves this issue.

Apart from that:
Reviewed-by: Johannes Thumshirn

-- 
Johannes Thumshirn                                          Storage
jthu msh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
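A sketch of what the #define form could look like, not the form that was eventually merged. One practical difference besides debugger support is that `#ifndef` tests only preprocessor macros, so the guards in the quoted hunk cannot detect an earlier enum with the same name, while they do work for macros. The second guarded block and the `sector_size()` helper are hypothetical and exist only to show the guard taking effect.

```c
#include <assert.h>

#ifndef SECTOR_SHIFT
#define SECTOR_SHIFT 9
#endif
#ifndef SECTOR_SIZE
#define SECTOR_SIZE (1 << SECTOR_SHIFT)
#endif

/* A later, conflicting definition (for example from a driver header) is
 * skipped because the macro is already defined. */
#ifndef SECTOR_SIZE
#define SECTOR_SIZE 4096
#endif

/* Returns the value the preprocessor settled on. */
static int sector_size(void)
{
	return SECTOR_SIZE;
}
```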
[PATCH v2 RESENT] blk: optimization for classic polling
This removes the dependency on interrupts to wake up the task. Set the
task state to TASK_RUNNING if need_resched() returns true while polling
for IO completion. Earlier, the polling task used to sleep, relying on
an interrupt to wake it up. This made some IO take very long when
interrupt coalescing is enabled in NVMe.

Reference:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html

Changes since v1:
- setting task state once in blk_poll, instead of multiple callers.

Signed-off-by: Nitesh Shetty
---
 block/blk-mq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..40285fe 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3164,6 +3164,7 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
 		cpu_relax();
 	}

+	set_current_state(TASK_RUNNING);
 	return false;
 }

-- 
2.7.4
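The effect of the one-line change can be illustrated with a small user-space model. The model assumes that the caller marks itself as sleeping before it polls and calls the scheduler when polling gives up, and it relies on the scheduler only putting a task to sleep if its state is not TASK_RUNNING. Everything here (`struct sim_task`, `sim_poll()`, `caller_would_sleep()`) is a hypothetical stand-in, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

enum task_state { TASK_RUNNING, TASK_UNINTERRUPTIBLE };

/* Hypothetical model of the polling task. */
struct sim_task {
	enum task_state state;
	int spins_until_resched;	/* spins before need_resched() is true */
};

/* Models a poll that never sees a completion and gives up once a
 * reschedule is needed. */
static bool sim_poll(struct sim_task *t, bool set_running_on_exit)
{
	while (t->spins_until_resched > 0)
		t->spins_until_resched--;	/* stands in for cpu_relax() */

	if (set_running_on_exit)
		t->state = TASK_RUNNING;	/* the line added by the patch */
	return false;
}

/* Models the caller: returns true if the task would be put to sleep and
 * so depend on an interrupt to wake it. */
static bool caller_would_sleep(bool patched)
{
	struct sim_task t = {
		.state = TASK_UNINTERRUPTIBLE,
		.spins_until_resched = 3,
	};

	if (!sim_poll(&t, patched))
		return t.state != TASK_RUNNING;
	return false;
}
```

In this model the unpatched path leaves the task marked as sleeping when polling gives up, while the patched path leaves it runnable so it can resume polling after being rescheduled.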