RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-13 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Tuesday, February 13, 2018 6:11 AM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 13, 2018 at 12:05:14AM +0530, Kashyap Desai wrote:
> > > -Original Message-
> > > From: Ming Lei [mailto:ming@redhat.com]
> > > Sent: Sunday, February 11, 2018 11:01 AM
> > > To: Kashyap Desai
> > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > Christoph Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun
> > > Easi; Omar
> > Sandoval;
> > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > Peter
> > > Rivera; Paolo Bonzini; Laurence Oberman
> > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > introduce force_blk_mq
> > >
> > > On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote:
> > > > Hi Kashyap,
> > > >
> > > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > > > > > -Original Message-
> > > > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > > > Sent: Friday, February 9, 2018 11:01 AM
> > > > > > To: Kashyap Desai
> > > > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > > > > Christoph Hellwig; Mike Snitzer; linux-s...@vger.kernel.org;
> > > > > > Arun Easi; Omar
> > > > > Sandoval;
> > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > > > Brace;
> > > > > Peter
> > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > > introduce force_blk_mq
> > > > > >
> > > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > > > > > -Original Message-
> > > > > > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > > > > > Sent: Thursday, February 8, 2018 10:23 PM
> > > > > > > > To: Hannes Reinecke
> > > > > > > > Cc: Kashyap Desai; Jens Axboe;
> > > > > > > > linux-block@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > > Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar
> > > > > > > Sandoval;
> > > > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig;
> > > > > > > > Don Brace;
> > > > > > > Peter
> > > > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global
> > > > > > > > tags & introduce force_blk_mq
> > > > > > > >
> > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke
> > wrote:
> > > > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > > > > > >> -Original Message-
> > > > > > > > > >> From: Ming Lei [mailto:ming@redhat.com]
> > > > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > > > > > > >> To: Hannes Reinecke
> > > > > > > > > >> Cc: Kashyap Desai; Jens Axboe;
> > > > > > > > > >> linux-block@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > > > >> Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar
> > > > > > > > > > Sandoval;
> > > > > > > > > >> Martin K . Petersen; James Bottomley; Christoph
> > > > > > > > > >> Hellwig; Don Brace;
> > > > > > > > > > Peter
> > > > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support
> > > > > > > > > >> global tags & introduce force_blk_mq
> > > > > > > > > >>
> > > > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes
> > > > > > > > > >> Reinecke
> > > > > wrote:
> > > > > > > > > >>> Hi all,
> > > > > > > > > >>>
> > > > > > > > > >>> [ .. ]
> > > > > > > > > >
> > > > > > > > > > Could you share us your patch for enabling
> > > > > > > > > > global_tags/MQ on
> > > > > > > > >  megaraid_sas
> > > > > > > > > > so that I can reproduce your test?
> > > > > > > > > >
> > > > > > > > > >> See below perf top data. "bt_iter" is consuming 4
> > > > > > > > > >> times more
> > > > > > > CPU.
> > > > > > > > > >
> > > > > > > > > > Could you share us what the IOPS/CPU utilization
> > > > > > > > > > effect is after
> > > > > > > > >  applying the
> > > > > > > > > > patch V2? And your test script?
> > > > > > > > >  Regarding CPU utilization, I need to test one more
> > time.
> > > > > > > > >  Currently system is in used.
> > > > > > > > > 
> > > > > > > > >  I run below fio test on total 24 SSDs expander
> > attached.
> > > > > > > > > 
> > > > > > > > >  numactl -N 1 fio jbod.fio --rw=randread
> > > > > > > > >  --iodepth=64 --bs=4k --ioengine=libaio
> > > > > > > > >  --rw=randread
> > > > > > > > > 
> > > > > > > > >  Performance dropped from 

[PATCH RESEND] blk-throttle: avoid double counted

2018-02-13 Thread xuejiufei
If a bio is split after it has been counted in stat_bytes and stat_ios in
blkcg_bio_issue_check(), the split bio can be resubmitted and enter the
block throttle layer again. As a result, part of the bio is counted
twice.

The BIO_THROTTLED flag cannot be used to fix this problem because of the
following two cases:
1. The bio is throttled and resubmitted to the block throttle layer. It
has the flag BIO_THROTTLED and should be counted.
2. The bio can be dispatched and has been counted, then it is split
and resubmitted to the block throttle layer. It also has the flag
BIO_THROTTLED but should not be counted again.

So add another flag, BIO_THROTL_COUNTED, to avoid double counting.

Signed-off-by: Jiufei Xue 
---
 block/bio.c                | 2 ++
 include/linux/bio.h        | 6 ++++--
 include/linux/blk-cgroup.h | 3 ++-
 include/linux/blk_types.h  | 1 +
 4 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 9ef6cf3..4594c2e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -601,6 +601,8 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
bio_set_flag(bio, BIO_CLONED);
if (bio_flagged(bio_src, BIO_THROTTLED))
bio_set_flag(bio, BIO_THROTTLED);
+   if (bio_flagged(bio_src, BIO_THROTL_COUNTED))
+   bio_set_flag(bio, BIO_THROTL_COUNTED);
bio->bi_opf = bio_src->bi_opf;
bio->bi_write_hint = bio_src->bi_write_hint;
bio->bi_iter = bio_src->bi_iter;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 23d29b3..aefc24c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -492,8 +492,10 @@ extern struct bio *bio_copy_user_iov(struct request_queue *,
 
 #define bio_set_dev(bio, bdev) \
 do {   \
-   if ((bio)->bi_disk != (bdev)->bd_disk)  \
-   bio_clear_flag(bio, BIO_THROTTLED);\
+   if ((bio)->bi_disk != (bdev)->bd_disk)  {   \
+   bio_clear_flag(bio, BIO_THROTTLED); \
+   bio_clear_flag(bio, BIO_THROTL_COUNTED);\
+   }   \
(bio)->bi_disk = (bdev)->bd_disk;   \
(bio)->bi_partno = (bdev)->bd_partno;   \
 } while (0)
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index e9825ff..c151bc9 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -701,11 +701,12 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 
throtl = blk_throtl_bio(q, blkg, bio);
 
-   if (!throtl) {
+   if (!throtl && !bio_flagged(bio, BIO_THROTL_COUNTED)) {
blkg = blkg ?: q->root_blkg;
blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
bio->bi_iter.bi_size);
blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
+   bio_set_flag(bio, BIO_THROTL_COUNTED);
}
 
rcu_read_unlock();
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 9e7d8bd..7a3890a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -135,6 +135,7 @@ struct bio {
 * throttling rules. Don't do it again. */
 #define BIO_TRACE_COMPLETION 10 /* bio_endio() should trace the final completion
 * of this bio. */
+#define BIO_THROTL_COUNTED 11  /* This bio has already counted to rwstat. */
 /* See BVEC_POOL_OFFSET below before adding new flags */
 
 /*
-- 
1.8.3.1
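
The count-once idea in the patch above can be sketched in plain userspace C.
This is only an illustration under stated assumptions: the sketch_* types,
flag values and helpers are hypothetical stand-ins, not the kernel's struct
bio or its real flag numbers. It shows that once a bio has been accounted,
neither the split-off part nor the remainder is accounted again on
resubmission, because both carry the "counted" flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the bio flags; not the kernel's values. */
enum {
	SKETCH_BIO_THROTTLED      = 1u << 0,
	SKETCH_BIO_THROTL_COUNTED = 1u << 1,
};

/* Hypothetical, heavily reduced bio: only what the accounting needs. */
struct sketch_bio {
	unsigned int flags;
	uint64_t size;		/* bytes still covered by this bio */
};

struct sketch_stats {
	uint64_t bytes;
	uint64_t ios;
};

/*
 * Account a bio that is being dispatched (i.e. not throttled).
 * Returns true if this call added the bio to the statistics.
 */
static bool sketch_account(struct sketch_stats *st, struct sketch_bio *bio)
{
	if (bio->flags & SKETCH_BIO_THROTL_COUNTED)
		return false;	/* already counted on an earlier pass */

	st->bytes += bio->size;
	st->ios += 1;
	bio->flags |= SKETCH_BIO_THROTL_COUNTED;
	return true;
}

/*
 * Split off the first 'front' bytes. The new bio inherits the flags of
 * its source, mirroring what the patch adds to __bio_clone_fast().
 */
static struct sketch_bio sketch_split(struct sketch_bio *bio, uint64_t front)
{
	struct sketch_bio clone = { .flags = bio->flags, .size = front };

	bio->size -= front;
	return clone;
}
```

With an 8192-byte bio that is accounted once and then split in half, both
halves are skipped when they pass through sketch_account() again, so the
totals stay at 8192 bytes and one I/O.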



Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-13 Thread Tal Gilboa

On 2/13/2018 11:30 AM, Or Gerlitz wrote:

On Tue, Feb 6, 2018 at 11:45 AM, Tal Gilboa  wrote:

On 2/6/2018 11:34 AM, Sagi Grimberg wrote:


Hi Tal,


I think Tal has idea/s on how the existing library can be changed to
support more modes/models


What I was thinking is allowing the DIM algorithm to disregard data that is
0. Currently, if bytes == 0, we return "SAME" immediately. We can change it to
simply move on to the packets check (which may be renamed to "completions").
This way you could use DIM while optimizing only for (P1) a high packet rate
and (P2) a low interrupt rate.



That was exactly where I started from. But unfortunately it did not work
well :(

From my experiments, the moderation was all over the place, failing to
converge. At least for the workloads that I've tested, it was more
successful to have a stricter step policy and to pull towards latency
if we are consistently catching a single completion per event.

I'm not an expert here at all, but at this point, based on my attempts
so far, I'm not convinced the current net_dim scheme could work.


I do believe we can make it work. I see your addition of the cpe part to
stats compare. Might not be a bad idea for networking devices. Overall, it
seems to me like this would be a special case of the general DIM
optimization, since it doesn't need to account for aggregation, for
instance, which breaks the "more packets == more data" ratio.


Did you two come to an agreement/lead on how to reuse the upstream library
for the matter Sagi is pushing for?


I don't think so (yet).
Sagi, I would like to avoid having 2 "net DIM"s if possible. You 
mentioned you tried implementing over net DIM lib and it wasn't working 
well. Can you share this code with me?
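
For readers following the thread, the change discussed above (disregard byte
data that is 0 and move on to the packets/completions check) could look
roughly like the sketch below. It is not the net_dim code: the sketch_*
names, the sample layout and the 10% significance threshold are all
assumptions made for illustration.

```c
#include <stdint.h>

enum sketch_cmp { SKETCH_SAME, SKETCH_BETTER, SKETCH_WORSE };

/* Hypothetical per-interval sample; field names are illustrative only. */
struct sketch_sample {
	uint64_t bytes;		/* 0 when no byte counts are available */
	uint64_t completions;	/* packets or completions */
	uint64_t events;	/* interrupts */
};

/* Assumed threshold: a change matters only if it exceeds 10% of prev. */
static int sketch_significant(uint64_t curr, uint64_t prev)
{
	uint64_t diff = curr > prev ? curr - prev : prev - curr;

	return diff * 100 > prev * 10;
}

static enum sketch_cmp sketch_compare(const struct sketch_sample *curr,
				      const struct sketch_sample *prev)
{
	/* Zero byte data is disregarded instead of ending the comparison. */
	if (prev->bytes && sketch_significant(curr->bytes, prev->bytes))
		return curr->bytes > prev->bytes ?
			SKETCH_BETTER : SKETCH_WORSE;

	/* (P1) prefer a higher packet/completion rate. */
	if (prev->completions &&
	    sketch_significant(curr->completions, prev->completions))
		return curr->completions > prev->completions ?
			SKETCH_BETTER : SKETCH_WORSE;

	/* (P2) prefer a lower interrupt rate. */
	if (prev->events && sketch_significant(curr->events, prev->events))
		return curr->events < prev->events ?
			SKETCH_BETTER : SKETCH_WORSE;

	return SKETCH_SAME;
}
```

Whether such a scheme converges for storage workloads is exactly the open
question in this thread; the sketch only shows the control flow, not a tuned
policy.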


Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-13 Thread t...@kernel.org
Hello, Bart.

Sorry about the delay.

On Thu, Feb 08, 2018 at 04:31:43PM +, Bart Van Assche wrote:
> The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The
> instruction at that address tries to dereference scsi_cmnd.device (%rax). The
> register dump shows that that pointer has the value NULL. The only function I
> know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The only
> caller of that function in the SCSI core is scsi_initialize_rq(). That function
> has two callers, namely scsi_init_command() and blk_get_request(). However,
> the scsi_cmnd.device pointer is not cleared when a request finishes. This is
> why I think that the above crash report indicates that scsi_times_out() was
> called for a request that was being reinitialized and not by device hotplugging.

Can you please give the following patch a shot?  While timeout path is
synchronizing against the completion path (and the following re-init)
while taking back control of a timed-out request, it wasn't doing that
while giving it back, so the timer registration could race against
completion and re-issue.  I'm still not quite sure how that can lead
to the oops tho.  Anyways, we need something like this one way or the
other.

This isn't the final patch.  We should add batching-up of rcu
synchronize calls similar to the abort path.

Thanks.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..b66aec3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -816,7 +816,8 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req,
+   bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -836,8 +837,12 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 * ->aborted_gstate is set, this may lead to ignored
 * completions and further spurious timeouts.
 */
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   synchronize_rcu();
+   else
+   synchronize_srcu(hctx->srcu);
+   blk_mq_rq_update_aborted_gstate(req, 0);
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -893,7 +898,7 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 */
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
-   blk_mq_rq_timed_out(rq, reserved);
+   blk_mq_rq_timed_out(hctx, rq, reserved);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)



Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-13 Thread Bart Van Assche
On Tue, 2018-02-13 at 19:38 +0100, Paolo Valente wrote:
> as a first attempt, I've followed your steps, but got:
> Error: could not find sg_reset

Please install the sg3_utils package. Every Linux distro I know of supports that
package. And in case you would like to install it from source, the source code
of that package is available from http://sg.danny.cz/sg/sg3_utils.html.

> For ib_srp-backport, I get a lot of warnings like the following one,
> at "make install" (preceded by corresponding warnings at the end of
> the compilation):
> depmod: WARNING: /lib/modules/4.16.0-rc1+/extra/ib_srp.ko needs unknown symbol rdma_resolve_addr
> 
> Unfortunately, it gets worse while executing "make scst srpt":

Please neither install the ib_srp-backport driver nor SCST. These drivers have
not yet been tested against kernel v4.16-rc1. I provided you a kernel tree in
which both the SRP initiator and target drivers support RoCE such that you don't
need to install these out-of-tree drivers. I think all that you need from the
srp-test README document are the instructions to configure /etc/multipath.conf
and the instructions for installing the required packages. From that README
document:

Install the following software packages if these have not yet been installed:
fio, gcc-c++, make, multipath-tools or device-mapper-multipath, sg3_utils,
srptools, e2fsprogs and xfsprogs.

Thanks,

Bart.




Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-13 Thread Paolo Valente


> On 12 Feb 2018, at 17:31, Bart Van Assche wrote:
> 
> On 02/11/18 23:35, Paolo Valente wrote:
>> Also this smells a little bit like some spurious elevator call.
>> Unfortunately I have no clue on the cause.  To go on, I need at least
>> to reproduce it.  In this respect: Bart, could you please tell me how
>> to setup the offending configuration, and to cause the failure?
>> Possibly with just one, or at most two PCs.  I don't have fancier hw
>> at the moment.
> 
> Hello Paolo,
> 
> Although I expect that it is possible to reproduce this with an unmodified
> v4.16-rc1 kernel, this is how I ran into this issue:
> * Clone the for-next branch of https://github.com/bvanassche/linux.
> * Build and install that kernel in a virtual machine.
> * Clone https://github.com/bvanassche/srp-test.
> * Run the following command:
>  srp-test/run_tests -c -d -r 10 -t 02-mq -e bfq
> 

Hi Bart,
as a first attempt, I've followed your steps, but got:
Error: could not find sg_reset
as expected, because of dependencies that your steps imply.

So, I have followed the instructions in the srp-test README for the
case "Running the Tests on an Ethernet Setup", directly on a 4.16-rc1.

For ib_srp-backport, I get a lot of warnings like the following one,
at "make install" (preceded by corresponding warnings at the end of
the compilation):
depmod: WARNING: /lib/modules/4.16.0-rc1+/extra/ib_srp.ko needs unknown symbol rdma_resolve_addr

Unfortunately, it gets worse while executing "make scst srpt":

  CC [M]  /home/paolo/scst/srpt/src/ib_srpt.o
In file included from /home/paolo/scst/srpt/src/ib_srpt.c:62:0:
/home/paolo/scst/srpt/src/ib_srpt.h:481:8: error: redefinition of ‘struct srp_login_req_rdma’
 struct srp_login_req_rdma {
^~
In file included from /home/paolo/scst/srpt/src/ib_srpt.h:44:0,
 from /home/paolo/scst/srpt/src/ib_srpt.c:62:
/mnt/linux-dev/linux/include/scsi/srp.h:139:8: note: originally defined here
 struct srp_login_req_rdma {
^~

Could you please give me some help, so as to not get lost among these issues?

Thanks,
Paolo

> Thanks,
> 
> Bart.



Re: [PATCH] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into

2018-02-13 Thread Bart Van Assche
On Tue, 2018-02-13 at 17:54 +0900, Sergey Senozhatsky wrote:
> On (02/12/18 11:05), Bart Van Assche wrote:
> [..]
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index ac4740cf74be..cf17626604c2 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1026,14 +1026,25 @@ static inline int blk_rq_cur_bytes(const struct request *rq)
> >  
> >  extern unsigned int blk_rq_err_bytes(const struct request *rq);
> >  
> > +/*
> > + * Variables of type sector_t represent an offset or size that is a multiple of
> > + * 2**9 bytes. Hence these two constants.
> > + */
> > +#ifndef SECTOR_SHIFT
> > +enum { SECTOR_SHIFT = 9 };
> > +#endif
> > +#ifndef SECTOR_SIZE
> > +enum { SECTOR_SIZE = 512 };
> > +#endif
> 
> Shouldn't SECTOR_SIZE depend on SECTOR_SHIFT?
> 
> 1 << SECTOR_SHIFT

Not sure if that change will really make a difference. Anyway, I will make that
change.

Bart.





Re: [PATCH] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into

2018-02-13 Thread Bart Van Assche
On Tue, 2018-02-13 at 09:43 +0100, Johannes Thumshirn wrote:
> On Mon, 2018-02-12 at 11:05 -0800, Bart Van Assche wrote:
> > +/*
> > + * Variables of type sector_t represent an offset or size that is a multiple of
> > + * 2**9 bytes. Hence these two constants.
> > + */
> > +#ifndef SECTOR_SHIFT
> > +enum { SECTOR_SHIFT = 9 };
> > +#endif
> > +#ifndef SECTOR_SIZE
> > +enum { SECTOR_SIZE = 512 };
> > +#endif
> 
> Can you please make a #define out of these enums? I know gdb can cope
> better with enums than defines but IIRC adding -ggdb3 to the CFLAGS
> solves this issue.
> 
> Apart from that:
> Reviewed-by: Johannes Thumshirn 

OK, I will change the enums into defines.

Thanks for the review.

Bart.
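
For reference, a minimal sketch of what the definitions could look like once
both review comments on this patch (defines instead of enums, SECTOR_SIZE
derived from SECTOR_SHIFT) are applied. This is an illustration, not the
posted follow-up patch; the two sketch_* helpers are hypothetical and only
show how the constants are typically used.

```c
#include <stdint.h>

/*
 * Variables of type sector_t represent an offset or size that is a
 * multiple of 2**9 bytes. Hence these two constants.
 */
#ifndef SECTOR_SHIFT
#define SECTOR_SHIFT 9
#endif
#ifndef SECTOR_SIZE
#define SECTOR_SIZE (1 << SECTOR_SHIFT)
#endif

/* Hypothetical helper: convert a byte count to whole 512-byte sectors. */
static inline uint64_t sketch_bytes_to_sectors(uint64_t bytes)
{
	return bytes >> SECTOR_SHIFT;
}

/* Hypothetical helper: convert a sector count back to bytes. */
static inline uint64_t sketch_sectors_to_bytes(uint64_t sectors)
{
	return sectors << SECTOR_SHIFT;
}
```

Deriving SECTOR_SIZE from SECTOR_SHIFT keeps the two values from drifting
apart if one of them is ever edited.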




Re: [PATCH v3] blk: optimization for classic polling

2018-02-13 Thread Jens Axboe
On 2/13/18 8:48 AM, Nitesh Shetty wrote:
> This removes the dependency on interrupts to wake up task. Set task
> state as TASK_RUNNING, if need_resched() returns true,
> while polling for IO completion.
> Earlier, polling task used to sleep, relying on interrupt to wake it up.
> This made some IO take very long when interrupt-coalescing is enabled in
> NVMe.

Thanks, applied.

-- 
Jens Axboe



[PATCH v3] blk: optimization for classic polling

2018-02-13 Thread Nitesh Shetty
This removes the dependency on interrupts to wake up task. Set task
state as TASK_RUNNING, if need_resched() returns true,
while polling for IO completion.
Earlier, polling task used to sleep, relying on interrupt to wake it up.
This made some IO take very long when interrupt-coalescing is enabled in
NVMe.

Reference:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html

Changes since v2->v3:
-using __set_current_state() instead of set_current_state()

Changes since v1->v2:
-setting task state once in blk_poll, instead of multiple
callers.
Signed-off-by: Nitesh Shetty 
---
 block/blk-mq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..3574927 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3164,6 +3164,7 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
cpu_relax();
}
 
+   __set_current_state(TASK_RUNNING);
return false;
 }
 
-- 
2.7.4



Re: [PATCH] blk-throttle: avoid multiple counting for same bio

2018-02-13 Thread Tejun Heo
On Tue, Feb 13, 2018 at 02:45:50PM +0800, Chengguang Xu wrote:
> In current throttling/upper limit policy of blkio cgroup
> blkio.throttle.io_service_bytes does not exactly represent
> the number of bytes issued to the disk by the group, sometimes
> this number could be counted multiple times of real bytes.
> This fix introduces BIO_COUNTED flag to avoid multiple counting
> for same bio.
> 
> Signed-off-by: Chengguang Xu 

We had a series of fixes / changes for this problem during the last
cycle.  Can you please see whether the current linus master has the
same problem.

Thanks.

-- 
tejun


Re: [PATCH v2 RESENT] blk: optimization for classic polling

2018-02-13 Thread Jens Axboe
On 2/13/18 11:56 AM, Nitesh Shetty wrote:
> This removes the dependency on interrupts to wake up task. Set task
> state as TASK_RUNNING, if need_resched() returns true,
> while polling for IO completion.
> Earlier, polling task used to sleep, relying on interrupt to wake it up.
> This made some IO take very long when interrupt-coalescing is enabled in
> NVMe.

__set_current_state() should suffice here.

-- 
Jens Axboe



[PATCH 3/8] lightnvm: add support for 2.0 address format

2018-02-13 Thread Javier González
Add support for the 2.0 address format. Also, align the address bits for 1.2
and 2.0.

Signed-off-by: Javier González 
---
 include/linux/lightnvm.h | 45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 6a567bd19b73..e035ae4c9acc 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -16,12 +16,21 @@ enum {
NVM_IOTYPE_GC = 1,
 };
 
-#define NVM_BLK_BITS (16)
-#define NVM_PG_BITS  (16)
-#define NVM_SEC_BITS (8)
-#define NVM_PL_BITS  (8)
-#define NVM_LUN_BITS (8)
-#define NVM_CH_BITS  (7)
+/* 1.2 format */
+#define NVM_12_CH_BITS  (8)
+#define NVM_12_LUN_BITS (8)
+#define NVM_12_BLK_BITS (16)
+#define NVM_12_PG_BITS  (16)
+#define NVM_12_PL_BITS  (4)
+#define NVM_12_SEC_BITS (4)
+#define NVM_12_RESERVED (8)
+
+/* 2.0 format */
+#define NVM_20_CH_BITS  (8)
+#define NVM_20_LUN_BITS (8)
+#define NVM_20_CHK_BITS (16)
+#define NVM_20_SEC_BITS (24)
+#define NVM_20_RESERVED (8)
 
 enum {
NVM_OCSSD_SPEC_12 = 12,
@@ -31,16 +40,26 @@ enum {
 struct ppa_addr {
/* Generic structure for all addresses */
union {
+   /* 1.2 device format */
struct {
-   u64 blk : NVM_BLK_BITS;
-   u64 pg  : NVM_PG_BITS;
-   u64 sec : NVM_SEC_BITS;
-   u64 pl  : NVM_PL_BITS;
-   u64 lun : NVM_LUN_BITS;
-   u64 ch  : NVM_CH_BITS;
-   u64 reserved: 1;
+   u64 ch  : NVM_12_CH_BITS;
+   u64 lun : NVM_12_LUN_BITS;
+   u64 blk : NVM_12_BLK_BITS;
+   u64 pg  : NVM_12_PG_BITS;
+   u64 pl  : NVM_12_PL_BITS;
+   u64 sec : NVM_12_SEC_BITS;
+   u64 reserved: NVM_12_RESERVED;
} g;
 
+   /* 2.0 device format */
+   struct {
+   u64 ch  : NVM_20_CH_BITS;
+   u64 lun : NVM_20_LUN_BITS;
+   u64 chk : NVM_20_CHK_BITS;
+   u64 sec : NVM_20_SEC_BITS;
+   u64 reserved: NVM_20_RESERVED;
+   } m;
+
struct {
u64 line: 63;
u64 is_cached   : 1;
-- 
2.7.4
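
As a quick sanity check of the layout in the patch above, the sketch below
restates the two bit-field formats in userspace C and verifies at compile
time that each one fills exactly 64 bits. The NVM_* widths are taken from the
patch; struct sketch_ppa_addr and the use of uint64_t in place of u64 are
assumptions for illustration. Bit-field layout is implementation-defined, so
this assumes GCC or Clang behaviour.

```c
#include <stdint.h>

/* 1.2 format (widths as in the patch) */
#define NVM_12_CH_BITS  (8)
#define NVM_12_LUN_BITS (8)
#define NVM_12_BLK_BITS (16)
#define NVM_12_PG_BITS  (16)
#define NVM_12_PL_BITS  (4)
#define NVM_12_SEC_BITS (4)
#define NVM_12_RESERVED (8)

/* 2.0 format (widths as in the patch) */
#define NVM_20_CH_BITS  (8)
#define NVM_20_LUN_BITS (8)
#define NVM_20_CHK_BITS (16)
#define NVM_20_SEC_BITS (24)
#define NVM_20_RESERVED (8)

/* Each format has to account for every bit of the 64-bit address. */
_Static_assert(NVM_12_CH_BITS + NVM_12_LUN_BITS + NVM_12_BLK_BITS +
	       NVM_12_PG_BITS + NVM_12_PL_BITS + NVM_12_SEC_BITS +
	       NVM_12_RESERVED == 64, "1.2 format must fill 64 bits");
_Static_assert(NVM_20_CH_BITS + NVM_20_LUN_BITS + NVM_20_CHK_BITS +
	       NVM_20_SEC_BITS + NVM_20_RESERVED == 64,
	       "2.0 format must fill 64 bits");

/* Hypothetical userspace mirror of the two device formats. */
struct sketch_ppa_addr {
	union {
		struct {	/* 1.2 device format */
			uint64_t ch       : NVM_12_CH_BITS;
			uint64_t lun      : NVM_12_LUN_BITS;
			uint64_t blk      : NVM_12_BLK_BITS;
			uint64_t pg       : NVM_12_PG_BITS;
			uint64_t pl       : NVM_12_PL_BITS;
			uint64_t sec      : NVM_12_SEC_BITS;
			uint64_t reserved : NVM_12_RESERVED;
		} g;
		struct {	/* 2.0 device format */
			uint64_t ch       : NVM_20_CH_BITS;
			uint64_t lun      : NVM_20_LUN_BITS;
			uint64_t chk      : NVM_20_CHK_BITS;
			uint64_t sec      : NVM_20_SEC_BITS;
			uint64_t reserved : NVM_20_RESERVED;
		} m;
		uint64_t ppa;	/* raw 64-bit view */
	};
};
```

Because ch and lun come first and have the same widths in both formats, a
channel or LUN written through the 2.0 view reads back unchanged through the
1.2 view, and the 16-bit chk field overlays the 16-bit blk field.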



[PATCH 1/8] lightnvm: exposed generic geometry to targets

2018-02-13 Thread Javier González
With the inclusion of 2.0 support, we need a generic geometry that
describes the OCSSD independently of the specification that it
implements. Otherwise, geometry-specific code is required, which
complicates targets and makes maintenance much more difficult.

This patch refactors the identify path and populates a generic geometry
that is then given to the targets on creation. Since the 2.0 geometry is
much more abstract than 1.2, the generic geometry resembles 2.0, but it
is not identical, as it needs to understand 1.2 abstractions too.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  | 143 ++-
 drivers/lightnvm/pblk-core.c |  16 +-
 drivers/lightnvm/pblk-gc.c   |   2 +-
 drivers/lightnvm/pblk-init.c | 149 ---
 drivers/lightnvm/pblk-read.c |   2 +-
 drivers/lightnvm/pblk-recovery.c |  14 +-
 drivers/lightnvm/pblk-rl.c   |   2 +-
 drivers/lightnvm/pblk-sysfs.c|  39 ++--
 drivers/lightnvm/pblk-write.c|   2 +-
 drivers/lightnvm/pblk.h  | 105 +--
 drivers/nvme/host/lightnvm.c | 379 ---
 include/linux/lightnvm.h | 220 +--
 12 files changed, 586 insertions(+), 487 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 9b1255b3e05e..80492fa6ee76 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -111,6 +111,7 @@ static void nvm_release_luns_err(struct nvm_dev *dev, int lun_begin,
 static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
 {
struct nvm_dev *dev = tgt_dev->parent;
+   struct nvm_dev_geo *dev_geo = &dev->dev_geo;
struct nvm_dev_map *dev_map = tgt_dev->map;
int i, j;
 
@@ -122,7 +123,7 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
if (clear) {
for (j = 0; j < ch_map->nr_luns; j++) {
int lun = j + lun_offs[j];
-   int lunid = (ch * dev->geo.nr_luns) + lun;
+   int lunid = (ch * dev_geo->num_lun) + lun;
 
WARN_ON(!test_and_clear_bit(lunid,
dev->lun_map));
@@ -143,19 +144,20 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
  u16 lun_begin, u16 lun_end,
  u16 op)
 {
+   struct nvm_dev_geo *dev_geo = &dev->dev_geo;
struct nvm_tgt_dev *tgt_dev = NULL;
struct nvm_dev_map *dev_rmap = dev->rmap;
struct nvm_dev_map *dev_map;
struct ppa_addr *luns;
int nr_luns = lun_end - lun_begin + 1;
int luns_left = nr_luns;
-   int nr_chnls = nr_luns / dev->geo.nr_luns;
-   int nr_chnls_mod = nr_luns % dev->geo.nr_luns;
-   int bch = lun_begin / dev->geo.nr_luns;
-   int blun = lun_begin % dev->geo.nr_luns;
+   int nr_chnls = nr_luns / dev_geo->num_lun;
+   int nr_chnls_mod = nr_luns % dev_geo->num_lun;
+   int bch = lun_begin / dev_geo->num_lun;
+   int blun = lun_begin % dev_geo->num_lun;
int lunid = 0;
int lun_balanced = 1;
-   int prev_nr_luns;
+   int sec_per_lun, prev_nr_luns;
int i, j;
 
nr_chnls = (nr_chnls_mod == 0) ? nr_chnls : nr_chnls + 1;
@@ -173,15 +175,15 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
if (!luns)
goto err_luns;
 
-   prev_nr_luns = (luns_left > dev->geo.nr_luns) ?
-   dev->geo.nr_luns : luns_left;
+   prev_nr_luns = (luns_left > dev_geo->num_lun) ?
+   dev_geo->num_lun : luns_left;
for (i = 0; i < nr_chnls; i++) {
struct nvm_ch_map *ch_rmap = &dev_rmap->chnls[i + bch];
int *lun_roffs = ch_rmap->lun_offs;
struct nvm_ch_map *ch_map = &dev_map->chnls[i];
int *lun_offs;
-   int luns_in_chnl = (luns_left > dev->geo.nr_luns) ?
-   dev->geo.nr_luns : luns_left;
+   int luns_in_chnl = (luns_left > dev_geo->num_lun) ?
+   dev_geo->num_lun : luns_left;
 
if (lun_balanced && prev_nr_luns != luns_in_chnl)
lun_balanced = 0;
@@ -215,18 +217,23 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct nvm_dev *dev,
if (!tgt_dev)
goto err_ch;
 
-   memcpy(&tgt_dev->geo, &dev->geo, sizeof(struct nvm_geo));
/* Target device only owns a portion of the physical device */
-   tgt_dev->geo.nr_chnls = nr_chnls;
+   tgt_dev->geo.num_ch = nr_chnls;
+   tgt_dev->geo.num_lun = (lun_balanced) ? prev_nr_luns : -1;
tgt_dev->geo.all_luns = nr_luns;
-   tgt_dev->geo.nr_luns = (lun_balanced) ? prev_nr_luns : -1;
+   

[PATCH 6/8] lightnvm: pblk: implement get log report chunk

2018-02-13 Thread Javier González
From: Javier González 

In preparation of pblk supporting 2.0, implement the get log report
chunk in pblk.

This patch only replicates the bad block functionality as the rest of the
metadata requires new pblk functionality (e.g., wear-index to implement
wear-leveling). This functionality will come in future patches.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-core.c  | 118 +++
 drivers/lightnvm/pblk-init.c  | 186 +++---
 drivers/lightnvm/pblk-sysfs.c |  67 +++
 drivers/lightnvm/pblk.h   |  20 +
 4 files changed, 327 insertions(+), 64 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 519af8b9eab7..01b78ee5c0e0 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -44,11 +44,12 @@ static void pblk_line_mark_bb(struct work_struct *work)
 }
 
 static void pblk_mark_bb(struct pblk *pblk, struct pblk_line *line,
-struct ppa_addr *ppa)
+struct ppa_addr ppa_addr)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
-   int pos = pblk_ppa_to_pos(geo, *ppa);
+   struct ppa_addr *ppa;
+   int pos = pblk_ppa_to_pos(geo, ppa_addr);
 
pr_debug("pblk: erase failed: line:%d, pos:%d\n", line->id, pos);
atomic_long_inc(&pblk->erase_failed);
@@ -58,6 +59,15 @@ static void pblk_mark_bb(struct pblk *pblk, struct pblk_line *line,
pr_err("pblk: attempted to erase bb: line:%d, pos:%d\n",
line->id, pos);
 
+   /* Not necessary to mark bad blocks on 2.0 spec. */
+   if (geo->c.version == NVM_OCSSD_SPEC_20)
+   return;
+
+   ppa = kmalloc(sizeof(struct ppa_addr), GFP_ATOMIC);
+   if (!ppa)
+   return;
+
+   *ppa = ppa_addr;
pblk_gen_run_ws(pblk, NULL, ppa, pblk_line_mark_bb,
GFP_ATOMIC, pblk->bb_wq);
 }
@@ -69,16 +79,8 @@ static void __pblk_end_io_erase(struct pblk *pblk, struct nvm_rq *rqd)
line = &pblk->lines[pblk_ppa_to_line(rqd->ppa_addr)];
atomic_dec(&line->left_seblks);
 
-   if (rqd->error) {
-   struct ppa_addr *ppa;
-
-   ppa = kmalloc(sizeof(struct ppa_addr), GFP_ATOMIC);
-   if (!ppa)
-   return;
-
-   *ppa = rqd->ppa_addr;
-   pblk_mark_bb(pblk, line, ppa);
-   }
+   if (rqd->error)
+   pblk_mark_bb(pblk, line, rqd->ppa_addr);
 
atomic_dec(&pblk->inflight_io);
 }
@@ -92,6 +94,47 @@ static void pblk_end_io_erase(struct nvm_rq *rqd)
mempool_free(rqd, pblk->e_rq_pool);
 }
 
+/*
+ * Get information for all chunks from the device.
+ *
+ * The caller is responsible for freeing the returned structure
+ */
+struct nvm_chunk_log_page *pblk_chunk_get_info(struct pblk *pblk)
+{
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   struct nvm_chunk_log_page *log;
+   unsigned long len;
+   int ret;
+
+   len = geo->all_chunks * sizeof(*log);
+   log = kzalloc(len, GFP_KERNEL);
+   if (!log)
+   return ERR_PTR(-ENOMEM);
+
+   ret = nvm_get_chunk_log_page(dev, log, 0, len);
+   if (ret) {
+   pr_err("pblk: could not get chunk log page (%d)\n", ret);
+   kfree(log);
+   return ERR_PTR(-EIO);
+   }
+
+   return log;
+}
+
+struct nvm_chunk_log_page *pblk_chunk_get_off(struct pblk *pblk,
+ struct nvm_chunk_log_page *lp,
+ struct ppa_addr ppa)
+{
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   int ch_off = ppa.m.ch * geo->c.num_chk * geo->num_lun;
+   int lun_off = ppa.m.lun * geo->c.num_chk;
+   int chk_off = ppa.m.chk;
+
+   return lp + ch_off + lun_off + chk_off;
+}
+
 void __pblk_map_invalidate(struct pblk *pblk, struct pblk_line *line,
   u64 paddr)
 {
@@ -1094,10 +1137,38 @@ static int pblk_line_init_bb(struct pblk *pblk, struct pblk_line *line,
return 1;
 }
 
+static int pblk_prepare_new_line(struct pblk *pblk, struct pblk_line *line)
+{
+   struct pblk_line_meta *lm = &pblk->lm;
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   int blk_to_erase = atomic_read(&line->blk_in_line);
+   int i;
+
+   for (i = 0; i < lm->blk_per_line; i++) {
+   int state = line->chks[i].state;
+   struct pblk_lun *rlun = &pblk->luns[i];
+
+   /* Free chunks should not be erased */
+   if (state & NVM_CHK_ST_FREE) {
+   set_bit(pblk_ppa_to_pos(geo, rlun->chunk_bppa),
+   line->erase_bitmap);
+   blk_to_erase--;
+  

[PATCH 4/8] lightnvm: convert address based on spec. version

2018-02-13 Thread Javier González
Create the device ppa for both 1.2 and 2.0.

Signed-off-by: Javier González 
---
 include/linux/lightnvm.h | 52 +---
 1 file changed, 36 insertions(+), 16 deletions(-)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index e035ae4c9acc..1148b3f22b27 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -412,16 +412,26 @@ static inline struct ppa_addr generic_to_dev_addr(struct nvm_tgt_dev *tgt_dev,
  struct ppa_addr r)
 {
struct nvm_geo *geo = &tgt_dev->geo;
-   struct nvm_addr_format_12 *ppaf =
-   (struct nvm_addr_format_12 *)&geo->c.addrf;
struct ppa_addr l;
 
-   l.ppa = ((u64)r.g.ch) << ppaf->ch_offset;
-   l.ppa |= ((u64)r.g.lun) << ppaf->lun_offset;
-   l.ppa |= ((u64)r.g.blk) << ppaf->blk_offset;
-   l.ppa |= ((u64)r.g.pg) << ppaf->pg_offset;
-   l.ppa |= ((u64)r.g.pl) << ppaf->pln_offset;
-   l.ppa |= ((u64)r.g.sec) << ppaf->sec_offset;
+   if (geo->c.version == NVM_OCSSD_SPEC_12) {
+   struct nvm_addr_format_12 *ppaf =
+   (struct nvm_addr_format_12 *)&geo->c.addrf;
+
+   l.ppa = ((u64)r.g.ch) << ppaf->ch_offset;
+   l.ppa |= ((u64)r.g.lun) << ppaf->lun_offset;
+   l.ppa |= ((u64)r.g.blk) << ppaf->blk_offset;
+   l.ppa |= ((u64)r.g.pg) << ppaf->pg_offset;
+   l.ppa |= ((u64)r.g.pl) << ppaf->pln_offset;
+   l.ppa |= ((u64)r.g.sec) << ppaf->sec_offset;
+   } else {
+   struct nvm_addr_format *lbaf = &geo->c.addrf;
+
+   l.ppa = ((u64)r.m.ch) << lbaf->ch_offset;
+   l.ppa |= ((u64)r.m.lun) << lbaf->lun_offset;
+   l.ppa |= ((u64)r.m.chk) << lbaf->chk_offset;
+   l.ppa |= ((u64)r.m.sec) << lbaf->sec_offset;
+   }
 
return l;
 }
@@ -430,18 +440,28 @@ static inline struct ppa_addr dev_to_generic_addr(struct 
nvm_tgt_dev *tgt_dev,
  struct ppa_addr r)
 {
struct nvm_geo *geo = &tgt_dev->geo;
-   struct nvm_addr_format_12 *ppaf =
-   (struct nvm_addr_format_12 *)&geo->c.addrf;
struct ppa_addr l;
 
l.ppa = 0;
 
-   l.g.ch = (r.ppa & ppaf->ch_mask) >> ppaf->ch_offset;
-   l.g.lun = (r.ppa & ppaf->lun_mask) >> ppaf->lun_offset;
-   l.g.blk = (r.ppa & ppaf->blk_mask) >> ppaf->blk_offset;
-   l.g.pg = (r.ppa & ppaf->pg_mask) >> ppaf->pg_offset;
-   l.g.pl = (r.ppa & ppaf->pln_mask) >> ppaf->pln_offset;
-   l.g.sec = (r.ppa & ppaf->sec_mask) >> ppaf->sec_offset;
+   if (geo->c.version == NVM_OCSSD_SPEC_12) {
+   struct nvm_addr_format_12 *ppaf =
+   (struct nvm_addr_format_12 *)&geo->c.addrf;
+
+   l.g.ch = (r.ppa & ppaf->ch_mask) >> ppaf->ch_offset;
+   l.g.lun = (r.ppa & ppaf->lun_mask) >> ppaf->lun_offset;
+   l.g.blk = (r.ppa & ppaf->blk_mask) >> ppaf->blk_offset;
+   l.g.pg = (r.ppa & ppaf->pg_mask) >> ppaf->pg_offset;
+   l.g.pl = (r.ppa & ppaf->pln_mask) >> ppaf->pln_offset;
+   l.g.sec = (r.ppa & ppaf->sec_mask) >> ppaf->sec_offset;
+   } else {
+   struct nvm_addr_format *lbaf = &geo->c.addrf;
+
+   l.m.ch = (r.ppa & lbaf->ch_mask) >> lbaf->ch_offset;
+   l.m.lun = (r.ppa & lbaf->lun_mask) >> lbaf->lun_offset;
+   l.m.chk = (r.ppa & lbaf->chk_mask) >> lbaf->chk_offset;
+   l.m.sec = (r.ppa & lbaf->sec_mask) >> lbaf->sec_offset;
+   }
 
return l;
 }
-- 
2.7.4
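
For readers following the conversion logic, here is a minimal user-space
sketch of the shift/mask round trip that the 2.0 branch performs. The
struct and function names (addr_format, generic_addr, to_dev_addr,
to_generic_addr) are hypothetical stand-ins, not the kernel's types, and
the layout is assumed to use non-overlapping fields.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the 2.0 address format. */
struct addr_format {
	uint8_t ch_offset, lun_offset, chk_offset, sec_offset;
	uint64_t ch_mask, lun_mask, chk_mask, sec_mask;
};

/* Hypothetical unpacked address: channel, LUN, chunk, sector. */
struct generic_addr {
	uint64_t ch, lun, chk, sec;
};

/* Pack each field into the device address by shifting it to its offset. */
static uint64_t to_dev_addr(const struct addr_format *f, struct generic_addr g)
{
	uint64_t ppa;

	ppa = g.ch << f->ch_offset;
	ppa |= g.lun << f->lun_offset;
	ppa |= g.chk << f->chk_offset;
	ppa |= g.sec << f->sec_offset;

	return ppa;
}

/* Unpack a device address by masking each field and shifting it back. */
static struct generic_addr to_generic_addr(const struct addr_format *f,
					   uint64_t ppa)
{
	struct generic_addr g;

	g.ch = (ppa & f->ch_mask) >> f->ch_offset;
	g.lun = (ppa & f->lun_mask) >> f->lun_offset;
	g.chk = (ppa & f->chk_mask) >> f->chk_offset;
	g.sec = (ppa & f->sec_mask) >> f->sec_offset;

	return g;
}
```

As long as every field value fits in its mask, packing followed by
unpacking should return the original fields.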



[PATCH 5/8] lightnvm: implement get log report chunk helpers

2018-02-13 Thread Javier González
From: Javier González 

The 2.0 spec provides a report chunk log page that can be retrieved
using the standard nvme get log page. This replaces the dedicated
get/put bad block table in 1.2.

This patch implements the helper functions to allow targets to retrieve
the chunk metadata using get log page.
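
The NVMe helper in this patch issues the transfer in segments of at most
256KB and encodes each length as a 0-based dword count. A minimal
user-space sketch of that loop follows. The callback type xfer_fn, the
xfer_stats struct and count_xfer are hypothetical stand-ins for one get
log page command; the check that total_len is dword-aligned is an
assumption added for the sketch, and a plain while loop is used so that
a zero-length request is a no-op.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_XFER ((size_t)1 << 18)	/* 256KB per command, as in the patch */

/* Hypothetical callback standing in for one get log page command. */
typedef int (*xfer_fn)(void *buf, uint64_t off, size_t len, uint32_t numd,
		       void *ctx);

/* Hypothetical bookkeeping used only to observe the segments issued. */
struct xfer_stats {
	unsigned int calls;
	size_t bytes;
	uint32_t last_numd;
};

static int count_xfer(void *buf, uint64_t off, size_t len, uint32_t numd,
		      void *ctx)
{
	struct xfer_stats *st = ctx;

	(void)buf;
	(void)off;
	st->calls++;
	st->bytes += len;
	st->last_numd = numd;

	return 0;
}

static int read_log_chunked(void *log, uint64_t off, size_t total_len,
			    xfer_fn xfer, void *ctx)
{
	unsigned char *buf = log;
	size_t left = total_len;
	int ret = 0;

	/* Offset must be dword-aligned; the length check is an assumption. */
	if ((off & 0x3) || (total_len & 0x3))
		return -EINVAL;

	while (left) {
		size_t len = left < MAX_XFER ? left : MAX_XFER;
		uint32_t numd = (uint32_t)(len >> 2) - 1; /* 0-based dwords */

		ret = xfer(buf, off, len, numd, ctx);
		if (ret)
			break;

		buf += len;
		off += len;
		left -= len;
	}

	return ret;
}
```

A request of 256KB + 4KB would be split into two segments under these
assumptions, the second one carrying a dword count of 4096 / 4 - 1.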

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  | 28 +
 drivers/nvme/host/lightnvm.c | 50 
 include/linux/lightnvm.h | 32 
 3 files changed, 110 insertions(+)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 80492fa6ee76..6857a888544a 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -43,6 +43,8 @@ struct nvm_ch_map {
 struct nvm_dev_map {
struct nvm_ch_map *chnls;
int nr_chnls;
+   int bch;
+   int blun;
 };
 
 static struct nvm_target *nvm_find_target(struct nvm_dev *dev, const char 
*name)
@@ -171,6 +173,9 @@ static struct nvm_tgt_dev *nvm_create_tgt_dev(struct 
nvm_dev *dev,
if (!dev_map->chnls)
goto err_chnls;
 
+   dev_map->bch = bch;
+   dev_map->blun = blun;
+
luns = kcalloc(nr_luns, sizeof(struct ppa_addr), GFP_KERNEL);
if (!luns)
goto err_luns;
@@ -561,6 +566,19 @@ static void nvm_unregister_map(struct nvm_dev *dev)
kfree(rmap);
 }
 
+static unsigned long nvm_log_off_tgt_to_dev(struct nvm_tgt_dev *tgt_dev)
+{
+   struct nvm_dev_map *dev_map = tgt_dev->map;
+   struct nvm_geo *geo = &tgt_dev->geo;
+   int lun_off;
+   unsigned long off;
+
+   lun_off = dev_map->blun + dev_map->bch * geo->num_lun;
+   off = lun_off * geo->c.num_chk * sizeof(struct nvm_chunk_log_page);
+
+   return off;
+}
+
 static void nvm_map_to_dev(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *p)
 {
struct nvm_dev_map *dev_map = tgt_dev->map;
@@ -720,6 +738,16 @@ static void nvm_free_rqd_ppalist(struct nvm_tgt_dev 
*tgt_dev,
nvm_dev_dma_free(tgt_dev->parent, rqd->ppa_list, rqd->dma_ppa_list);
 }
 
+int nvm_get_chunk_log_page(struct nvm_tgt_dev *tgt_dev,
+  struct nvm_chunk_log_page *log,
+  unsigned long off, unsigned long len)
+{
+   struct nvm_dev *dev = tgt_dev->parent;
+
+   off += nvm_log_off_tgt_to_dev(tgt_dev);
+
+   return dev->ops->get_chunk_log_page(tgt_dev->parent, log, off, len);
+}
 
 int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
   int nr_ppas, int type)
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 7bc75182c723..355d9b0cf084 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -35,6 +35,10 @@ enum nvme_nvm_admin_opcode {
nvme_nvm_admin_set_bb_tbl   = 0xf1,
 };
 
+enum nvme_nvm_log_page {
+   NVME_NVM_LOG_REPORT_CHUNK   = 0xCA,
+};
+
 struct nvme_nvm_ph_rw {
__u8opcode;
__u8flags;
@@ -553,6 +557,50 @@ static int nvme_nvm_set_bb_tbl(struct nvm_dev *nvmdev, 
struct ppa_addr *ppas,
return ret;
 }
 
+static int nvme_nvm_get_chunk_log_page(struct nvm_dev *nvmdev,
+  struct nvm_chunk_log_page *log,
+  unsigned long off,
+  unsigned long total_len)
+{
+   struct nvme_ns *ns = nvmdev->q->queuedata;
+   struct nvme_command c = { };
+   unsigned long offset = off, left = total_len;
+   unsigned long len, len_dwords;
+   void *buf = log;
+   int ret;
+
+   /* The offset needs to be dword-aligned */
+   if (offset & 0x3)
+   return -EINVAL;
+
+   do {
+   /* Send 256KB at a time */
+   len = (1 << 18) > left ? left : (1 << 18);
+   len_dwords = (len >> 2) - 1;
+
+   c.get_log_page.opcode = nvme_admin_get_log_page;
+   c.get_log_page.nsid = cpu_to_le32(ns->head->ns_id);
+   c.get_log_page.lid = NVME_NVM_LOG_REPORT_CHUNK;
+   c.get_log_page.lpol = cpu_to_le32(offset & 0xffffffff);
+   c.get_log_page.lpou = cpu_to_le32(offset >> 32);
+   c.get_log_page.numdl = cpu_to_le16(len_dwords & 0xffff);
+   c.get_log_page.numdu = cpu_to_le16(len_dwords >> 16);
+
+   ret = nvme_submit_sync_cmd(ns->ctrl->admin_q, &c, buf, len);
+   if (ret) {
+   dev_err(ns->ctrl->device,
+   "get chunk log page failed (%d)\n", ret);
+   break;
+   }
+
+   buf += len;
+   offset += len;
+   left -= len;
+   } while (left);
+
+   return ret;
+}
+
 static inline void nvme_nvm_rqtocmd(struct nvm_rq *rqd, struct nvme_ns *ns,
struct 

[PATCH 2/8] lightnvm: show generic geometry in sysfs

2018-02-13 Thread Javier González
From: Javier González 

Apart from showing the geometry returned by the different identify
commands, provide the generic geometry too, as this is the geometry that
targets will use to describe the device.

Signed-off-by: Javier González 
---
 drivers/nvme/host/lightnvm.c | 146 ---
 1 file changed, 97 insertions(+), 49 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 97739e668602..7bc75182c723 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -944,8 +944,27 @@ static ssize_t nvm_dev_attr_show(struct device *dev,
return scnprintf(page, PAGE_SIZE, "%u.%u\n",
dev_geo->major_ver_id,
dev_geo->minor_ver_id);
-   } else if (strcmp(attr->name, "capabilities") == 0) {
-   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.cap);
+   } else if (strcmp(attr->name, "clba") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.clba);
+   } else if (strcmp(attr->name, "csecs") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.csecs);
+   } else if (strcmp(attr->name, "sos") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.sos);
+   } else if (strcmp(attr->name, "ws_min") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.ws_min);
+   } else if (strcmp(attr->name, "ws_opt") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.ws_opt);
+   } else if (strcmp(attr->name, "maxoc") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.maxoc);
+   } else if (strcmp(attr->name, "maxocpu") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.maxocpu);
+   } else if (strcmp(attr->name, "mw_cunits") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.mw_cunits);
+   } else if (strcmp(attr->name, "media_capabilities") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.mccap);
+   } else if (strcmp(attr->name, "max_phys_secs") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n",
+   ndev->ops->max_phys_sect);
} else if (strcmp(attr->name, "read_typ") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.trdt);
} else if (strcmp(attr->name, "read_max") == 0) {
@@ -984,19 +1003,8 @@ static ssize_t nvm_dev_attr_show_12(struct device *dev,
 
attr = >attr;
 
-   if (strcmp(attr->name, "vendor_opcode") == 0) {
-   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.vmnt);
-   } else if (strcmp(attr->name, "device_mode") == 0) {
-   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.dom);
-   /* kept for compatibility */
-   } else if (strcmp(attr->name, "media_manager") == 0) {
-   return scnprintf(page, PAGE_SIZE, "%s\n", "gennvm");
-   } else if (strcmp(attr->name, "ppa_format") == 0) {
+   if (strcmp(attr->name, "ppa_format") == 0) {
return nvm_dev_attr_show_ppaf((void *)&dev_geo->c.addrf, page);
-   } else if (strcmp(attr->name, "media_type") == 0) { /* u8 */
-   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.mtype);
-   } else if (strcmp(attr->name, "flash_media_type") == 0) {
-   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.fmtype);
} else if (strcmp(attr->name, "num_channels") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->num_ch);
} else if (strcmp(attr->name, "num_luns") == 0) {
@@ -1011,8 +1019,6 @@ static ssize_t nvm_dev_attr_show_12(struct device *dev,
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.fpg_sz);
} else if (strcmp(attr->name, "hw_sector_size") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.csecs);
-   } else if (strcmp(attr->name, "oob_sector_size") == 0) {/* u32 */
-   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.sos);
} else if (strcmp(attr->name, "prog_typ") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.tprt);
} else if (strcmp(attr->name, "prog_max") == 0) {
@@ -1021,13 +1027,21 @@ static ssize_t nvm_dev_attr_show_12(struct device *dev,
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.tbet);
} else if (strcmp(attr->name, "erase_max") == 0) {
return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.tbem);
+   } else if (strcmp(attr->name, "vendor_opcode") == 0) {
+   return scnprintf(page, PAGE_SIZE, "%u\n", dev_geo->c.vmnt);
+   } else if (strcmp(attr->name, "device_mode") == 0) {
+   return scnprintf(page, PAGE_SIZE, 

[PATCH 8/8] lightnvm: pblk: implement 2.0 support

2018-02-13 Thread Javier González
Implement 2.0 support in pblk. This includes the address formatting and
mapping paths, as well as the sysfs entries for them.
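
As a rough illustration of the address formatting added here, the sketch
below derives field lengths, offsets and masks from a geometry, using
the same field order as pblk_set_addrf_20 (sector, channel, LUN, chunk).
The names addr_layout, count_order and build_layout are hypothetical,
and count_order is only a user-space approximation of the kernel's
get_count_order() for non-zero inputs.

```c
#include <stdint.h>

/* Hypothetical layout descriptor, loosely modelled on nvm_addr_format. */
struct addr_layout {
	uint8_t sec_len, ch_len, lun_len, chk_len;
	uint8_t sec_offset, ch_offset, lun_offset, chk_offset;
	uint64_t sec_mask, ch_mask, lun_mask, chk_mask;
};

/* Smallest order such that (1 << order) >= n; assumes n >= 1. */
static unsigned int count_order(unsigned int n)
{
	unsigned int order = 0;

	while ((1ULL << order) < n)
		order++;

	return order;
}

/* Mask covering len bits starting at bit offset. */
static uint64_t field_mask(unsigned int len, unsigned int offset)
{
	return ((1ULL << len) - 1) << offset;
}

/* Fill the layout and return the total number of address bits used. */
static unsigned int build_layout(struct addr_layout *l, unsigned int num_ch,
				 unsigned int num_lun, uint8_t chk_len,
				 uint8_t sec_len)
{
	l->ch_len = count_order(num_ch);
	l->lun_len = count_order(num_lun);
	l->chk_len = chk_len;
	l->sec_len = sec_len;

	/* Sectors occupy the low bits, then channel, LUN and chunk. */
	l->sec_offset = 0;
	l->ch_offset = l->sec_len;
	l->lun_offset = l->ch_offset + l->ch_len;
	l->chk_offset = l->lun_offset + l->lun_len;

	l->sec_mask = field_mask(l->sec_len, l->sec_offset);
	l->ch_mask = field_mask(l->ch_len, l->ch_offset);
	l->lun_mask = field_mask(l->lun_len, l->lun_offset);
	l->chk_mask = field_mask(l->chk_len, l->chk_offset);

	return l->chk_offset + l->chk_len;
}
```

Placing the channel and LUN bits directly above the sector bits means
that consecutive addresses move across parallel units before they move
to the next chunk, which seems to be the intent of the striping fields.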

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-init.c  |  57 ++--
 drivers/lightnvm/pblk-sysfs.c |  36 ++--
 drivers/lightnvm/pblk.h   | 198 --
 3 files changed, 233 insertions(+), 58 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 04685f2d39d3..d5a31fc986cc 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -231,20 +231,63 @@ static int pblk_set_addrf_12(struct nvm_geo *geo,
return dst->blk_offset + src->blk_len;
 }
 
+static int pblk_set_addrf_20(struct nvm_geo *geo,
+struct nvm_addr_format *adst,
+struct pblk_addr_format *udst)
+{
+   struct nvm_addr_format *src = &geo->c.addrf;
+
+   adst->ch_len = get_count_order(geo->num_ch);
+   adst->lun_len = get_count_order(geo->num_lun);
+   adst->chk_len = src->chk_len;
+   adst->sec_len = src->sec_len;
+
+   adst->sec_offset = 0;
+   adst->ch_offset = adst->sec_len;
+   adst->lun_offset = adst->ch_offset + adst->ch_len;
+   adst->chk_offset = adst->lun_offset + adst->lun_len;
+
+   adst->sec_mask = ((1ULL << adst->sec_len) - 1) << adst->sec_offset;
+   adst->chk_mask = ((1ULL << adst->chk_len) - 1) << adst->chk_offset;
+   adst->lun_mask = ((1ULL << adst->lun_len) - 1) << adst->lun_offset;
+   adst->ch_mask = ((1ULL << adst->ch_len) - 1) << adst->ch_offset;
+
+   udst->sec_stripe = geo->c.ws_opt;
+   udst->ch_stripe = geo->num_ch;
+   udst->lun_stripe = geo->num_lun;
+
+   udst->sec_lun_stripe = udst->sec_stripe * udst->ch_stripe;
+   udst->sec_ws_stripe = udst->sec_lun_stripe * udst->lun_stripe;
+
+   return adst->chk_offset + adst->chk_len;
+}
+
 static int pblk_set_addrf(struct pblk *pblk)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
int mod;
 
-   div_u64_rem(geo->c.clba, pblk->min_write_pgs, &mod);
-   if (mod) {
-   pr_err("pblk: bad configuration of sectors/pages\n");
+   switch (geo->c.version) {
+   case NVM_OCSSD_SPEC_12:
+   div_u64_rem(geo->c.clba, pblk->min_write_pgs, &mod);
+   if (mod) {
+   pr_err("pblk: bad configuration of sectors/pages\n");
+   return -EINVAL;
+   }
+
+   pblk->addrf_len = pblk_set_addrf_12(geo, (void *)&pblk->addrf);
+   break;
+   case NVM_OCSSD_SPEC_20:
+   pblk->addrf_len = pblk_set_addrf_20(geo, (void *)&pblk->addrf,
+   &pblk->uaddrf);
+   break;
+   default:
+   pr_err("pblk: OCSSD revision not supported (%d)\n",
+   geo->c.version);
return -EINVAL;
}
 
-   pblk->addrf_len = pblk_set_addrf_12(geo, (void *)&pblk->addrf);
-
return 0;
 }
 
@@ -,7 +1154,9 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
struct pblk *pblk;
int ret;
 
-   if (geo->c.version != NVM_OCSSD_SPEC_12) {
+   /* pblk supports 1.2 and 2.0 versions */
+   if (!(geo->c.version == NVM_OCSSD_SPEC_12 ||
+   geo->c.version == NVM_OCSSD_SPEC_20)) {
pr_err("pblk: OCSSD version not supported (%u)\n",
geo->c.version);
return ERR_PTR(-EINVAL);
diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c
index 191af0c6591e..60b8d931e4ba 100644
--- a/drivers/lightnvm/pblk-sysfs.c
+++ b/drivers/lightnvm/pblk-sysfs.c
@@ -113,15 +113,16 @@ static ssize_t pblk_sysfs_ppaf(struct pblk *pblk, char 
*page)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
-   struct nvm_addr_format_12 *ppaf;
-   struct nvm_addr_format_12 *geo_ppaf;
ssize_t sz = 0;
 
-   ppaf = (struct nvm_addr_format_12 *)&pblk->addrf;
-   geo_ppaf = (struct nvm_addr_format_12 *)&geo->c.addrf;
+   if (geo->c.version == NVM_OCSSD_SPEC_12) {
+   struct nvm_addr_format_12 *ppaf =
+   (struct nvm_addr_format_12 *)&pblk->addrf;
+   struct nvm_addr_format_12 *geo_ppaf =
+   (struct nvm_addr_format_12 *)&geo->c.addrf;
 
-   sz = snprintf(page, PAGE_SIZE,
-   
"pblk:(s:%d)ch:%d/%d,lun:%d/%d,blk:%d/%d,pg:%d/%d,pl:%d/%d,sec:%d/%d\n",
+   sz = snprintf(page, PAGE_SIZE,
+   
"pblk:(s:%d)ch:%d/%d,lun:%d/%d,blk:%d/%d,pg:%d/%d,pl:%d/%d,sec:%d/%d\n",
pblk->addrf_len,
ppaf->ch_offset, ppaf->ch_len,
ppaf->lun_offset, ppaf->lun_len,
@@ 

[PATCH 7/8] lightnvm: pblk: refactor init/exit sequences

2018-02-13 Thread Javier González
Refactor the init and exit sequences to improve readability. Along the
way, fix the bad free ordering on the init error path.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-init.c | 503 ++-
 1 file changed, 254 insertions(+), 249 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index dfc68718e27e..04685f2d39d3 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -103,7 +103,40 @@ static void pblk_l2p_free(struct pblk *pblk)
vfree(pblk->trans_map);
 }
 
-static int pblk_l2p_init(struct pblk *pblk)
+static int pblk_l2p_recover(struct pblk *pblk, bool factory_init)
+{
+   struct pblk_line *line = NULL;
+
+   if (factory_init) {
+   pblk_setup_uuid(pblk);
+   } else {
+   line = pblk_recov_l2p(pblk);
+   if (IS_ERR(line)) {
+   pr_err("pblk: could not recover l2p table\n");
+   return -EFAULT;
+   }
+   }
+
+#ifdef CONFIG_NVM_DEBUG
+   pr_info("pblk init: L2P CRC: %x\n", pblk_l2p_crc(pblk));
+#endif
+
+   /* Free full lines directly as GC has not been started yet */
+   pblk_gc_free_full_lines(pblk);
+
+   if (!line) {
+   /* Configure next line for user data */
+   line = pblk_line_get_first_data(pblk);
+   if (!line) {
+   pr_err("pblk: line list corrupted\n");
+   return -EFAULT;
+   }
+   }
+
+   return 0;
+}
+
+static int pblk_l2p_init(struct pblk *pblk, bool factory_init)
 {
sector_t i;
struct ppa_addr ppa;
@@ -119,7 +152,7 @@ static int pblk_l2p_init(struct pblk *pblk)
for (i = 0; i < pblk->rl.nr_secs; i++)
pblk_trans_map_set(pblk, i, ppa);
 
-   return 0;
+   return pblk_l2p_recover(pblk, factory_init);
 }
 
 static void pblk_rwb_free(struct pblk *pblk)
@@ -268,87 +301,114 @@ static int pblk_core_init(struct pblk *pblk)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
+   int max_write_ppas;
+
+   atomic64_set(&pblk->user_wa, 0);
+   atomic64_set(&pblk->pad_wa, 0);
+   atomic64_set(&pblk->gc_wa, 0);
+   pblk->user_rst_wa = 0;
+   pblk->pad_rst_wa = 0;
+   pblk->gc_rst_wa = 0;
+
+   atomic_long_set(&pblk->nr_flush, 0);
+   pblk->nr_flush_rst = 0;
 
pblk->pgs_in_buffer = geo->c.mw_cunits * geo->c.ws_opt * geo->all_luns;
 
+   pblk->min_write_pgs = geo->c.ws_opt * (geo->c.csecs / PAGE_SIZE);
+   max_write_ppas = pblk->min_write_pgs * geo->all_luns;
+   pblk->max_write_pgs = (max_write_ppas < nvm_max_phys_sects(dev)) ?
+   max_write_ppas : nvm_max_phys_sects(dev);
+   pblk_set_sec_per_write(pblk, pblk->min_write_pgs);
+
+   if (pblk->max_write_pgs > PBLK_MAX_REQ_ADDRS) {
+   pr_err("pblk: cannot support device max_phys_sect\n");
+   return -EINVAL;
+   }
+
+   pblk->pad_dist = kzalloc((pblk->min_write_pgs - 1) * sizeof(atomic64_t),
+   GFP_KERNEL);
+   if (!pblk->pad_dist)
+   return -ENOMEM;
+
if (pblk_init_global_caches(pblk))
-   return -ENOMEM;
+   goto fail_free_pad_dist;
 
/* Internal bios can be at most the sectors signaled by the device. */
pblk->page_bio_pool = mempool_create_page_pool(nvm_max_phys_sects(dev),
0);
if (!pblk->page_bio_pool)
-   goto free_global_caches;
+   goto fail_free_global_caches;
 
pblk->gen_ws_pool = mempool_create_slab_pool(PBLK_GEN_WS_POOL_SIZE,
pblk_ws_cache);
if (!pblk->gen_ws_pool)
-   goto free_page_bio_pool;
+   goto fail_free_page_bio_pool;
 
pblk->rec_pool = mempool_create_slab_pool(geo->all_luns,
pblk_rec_cache);
if (!pblk->rec_pool)
-   goto free_gen_ws_pool;
+   goto fail_free_gen_ws_pool;
 
pblk->r_rq_pool = mempool_create_slab_pool(geo->all_luns,
pblk_g_rq_cache);
if (!pblk->r_rq_pool)
-   goto free_rec_pool;
+   goto fail_free_rec_pool;
 
pblk->e_rq_pool = mempool_create_slab_pool(geo->all_luns,
pblk_g_rq_cache);
if (!pblk->e_rq_pool)
-   goto free_r_rq_pool;
+   goto fail_free_r_rq_pool;
 
pblk->w_rq_pool = mempool_create_slab_pool(geo->all_luns,
pblk_w_rq_cache);
if (!pblk->w_rq_pool)
-   goto free_e_rq_pool;
+   goto fail_free_e_rq_pool;

Re: [PATCH] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into

2018-02-13 Thread Sergey Senozhatsky
On (02/12/18 11:05), Bart Van Assche wrote:
[..]
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ac4740cf74be..cf17626604c2 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1026,14 +1026,25 @@ static inline int blk_rq_cur_bytes(const struct 
> request *rq)
>  
>  extern unsigned int blk_rq_err_bytes(const struct request *rq);
>  
> +/*
> + * Variables of type sector_t represent an offset or size that is a multiple 
> of
> + * 2**9 bytes. Hence these two constants.
> + */
> +#ifndef SECTOR_SHIFT
> +enum { SECTOR_SHIFT = 9 };
> +#endif
> +#ifndef SECTOR_SIZE
> +enum { SECTOR_SIZE = 512 };
> +#endif

Shouldn't SECTOR_SIZE depend on SECTOR_SHIFT?

1 << SECTOR_SHIFT

-ss
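
A minimal sketch of what deriving one constant from the other could look
like, written as a standalone C fragment rather than as a change to
blkdev.h. It uses #define instead of enum purely for illustration, and
the helper names sectors_to_bytes and bytes_to_sectors are hypothetical.

```c
#include <stdint.h>

/* Single source of truth: the shift; the size is derived from it. */
#ifndef SECTOR_SHIFT
#define SECTOR_SHIFT 9
#endif
#ifndef SECTOR_SIZE
#define SECTOR_SIZE (1 << SECTOR_SHIFT)
#endif

/* Catch a mismatch at compile time if both were defined elsewhere. */
_Static_assert(SECTOR_SIZE == (1 << SECTOR_SHIFT),
	       "SECTOR_SIZE must equal 1 << SECTOR_SHIFT");

/* Hypothetical helpers converting between sectors and bytes. */
static inline uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors << SECTOR_SHIFT;
}

static inline uint64_t bytes_to_sectors(uint64_t bytes)
{
	return bytes >> SECTOR_SHIFT;
}
```

With this form the two constants cannot drift apart unless another
header defines them inconsistently, which the static assertion would
flag.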


Re: [PATCH] block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into

2018-02-13 Thread Johannes Thumshirn
On Mon, 2018-02-12 at 11:05 -0800, Bart Van Assche wrote:
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ac4740cf74be..cf17626604c2 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1026,14 +1026,25 @@ static inline int blk_rq_cur_bytes(const
> struct request *rq)
>  
>  extern unsigned int blk_rq_err_bytes(const struct request *rq);
>  
> +/*
> + * Variables of type sector_t represent an offset or size that is a
> multiple of
> + * 2**9 bytes. Hence these two constants.
> + */
> +#ifndef SECTOR_SHIFT
> +enum { SECTOR_SHIFT = 9 };
> +#endif
> +#ifndef SECTOR_SIZE
> +enum { SECTOR_SIZE = 512 };
> +#endif

Can you please make a #define out of these enums? I know gdb can cope
better with enums than defines but IIRC adding -ggdb3 to the CFLAGS
solves this issue.

Apart from that:
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH v2 RESENT] blk: optimization for classic polling

2018-02-13 Thread Nitesh Shetty
This removes the dependency on interrupts to wake up the polling task.
Set the task state to TASK_RUNNING if need_resched() returns true while
polling for IO completion.
Previously, the polling task would sleep and rely on an interrupt to
wake it up. This made some IOs take very long when interrupt coalescing
is enabled in NVMe.

Reference:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html

Changes since v1:
- Set the task state once in blk_poll, instead of in multiple callers.

Signed-off-by: Nitesh Shetty 
---
 block/blk-mq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..40285fe 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3164,6 +3164,7 @@ static bool __blk_mq_poll(struct blk_mq_hw_ctx *hctx, 
struct request *rq)
cpu_relax();
}
 
+   set_current_state(TASK_RUNNING);
return false;
 }
 
-- 
2.7.4