Re: [PATCH V5 12/14] block: mq-deadline: Introduce zone locking support

2017-10-01 Thread Damien Le Moal
Bart,

On Mon, 2017-09-25 at 22:00 +, Bart Van Assche wrote:
> On Mon, 2017-09-25 at 15:14 +0900, Damien Le Moal wrote:
> > +static inline bool deadline_request_needs_zone_wlock(struct deadline_data
> > *dd,
> > +struct request *rq)
> > +{
> > +
> > +   if (!dd->zones_wlock)
> > +   return false;
> > +
> > +   if (blk_rq_is_passthrough(rq))
> > +   return false;
> > +
> > +   switch (req_op(rq)) {
> > +   case REQ_OP_WRITE_ZEROES:
> > +   case REQ_OP_WRITE_SAME:
> > +   case REQ_OP_WRITE:
> > +   return blk_rq_zone_is_seq(rq);
> > +   default:
> > +   return false;
> > +   }
> 
> If anyone ever adds a new write request type it will be easy to overlook this
> function. Should the 'default' case be left out and should all request types
> be mentioned in the switch/case statement such that the compiler will issue a
> warning if a new request operation type is added to enum req_opf?

I tried, but that does not work. The switch statement needs either a default
case or a return statement after it; otherwise I get a compilation warning
("control reaches end of non-void function").
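
For illustration only -- a toy example outside the kernel, not the mq-deadline
code: with no default case the compiler can warn about enum values missing
from the switch, but a return after the switch is still needed to silence the
"control reaches end of non-void function" warning:

        #include <stdbool.h>

        enum my_op { OP_READ, OP_WRITE, OP_FLUSH };

        static bool op_needs_wlock(enum my_op op)
        {
                switch (op) {
                case OP_WRITE:
                        return true;
                case OP_READ:
                case OP_FLUSH:
                        return false;
                }
                /* Not reachable for valid enum values, but required to
                 * avoid the "control reaches end of non-void function"
                 * warning. */
                return false;
        }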

> > +/*
> > + * Abuse the elv.priv[0] pointer to indicate if a request has write
> > + * locked its target zone. Only write request to a zoned block device
> > + * can own a zone write lock.
> > + */
> > +#define RQ_ZONE_WLOCKED((void *)1UL)
> > +static inline void deadline_set_request_zone_wlock(struct request *rq)
> > +{
> > +   rq->elv.priv[0] = RQ_ZONE_WLOCKED;
> > +}
> > +
> > +#define RQ_ZONE_NO_WLOCK   ((void *)0UL)
> > +static inline void deadline_clear_request_zone_wlock(struct request *rq)
> > +{
> > +   rq->elv.priv[0] = RQ_ZONE_NO_WLOCK;
> > +}
> 
> Should an enumeration type be introduced for RQ_ZONE_WLOCKED and
> RQ_ZONE_NO_WLOCK?

Sure. Added in V6.
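
For reference, something along these lines -- a sketch of what the V6 change
could look like, not necessarily the actual V6 code:

        enum rq_zone_wlock_state {
                RQ_ZONE_NO_WLOCK = 0,
                RQ_ZONE_WLOCKED  = 1,
        };

        static inline void deadline_set_request_zone_wlock(struct request *rq)
        {
                rq->elv.priv[0] = (void *)(unsigned long)RQ_ZONE_WLOCKED;
        }

        static inline void deadline_clear_request_zone_wlock(struct request *rq)
        {
                rq->elv.priv[0] = (void *)(unsigned long)RQ_ZONE_NO_WLOCK;
        }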

> > +/*
> > + * Write lock the target zone of a write request.
> > + */
> > +static void deadline_wlock_zone(struct deadline_data *dd,
> > +   struct request *rq)
> > +{
> > +   unsigned int zno = blk_rq_zone_no(rq);
> > +
> > +   WARN_ON_ONCE(deadline_request_has_zone_wlock(rq));
> > +   WARN_ON_ONCE(test_and_set_bit(zno, dd->zones_wlock));
> > +   deadline_set_request_zone_wlock(rq);
> > +}
> > +
> > +/*
> > + * Write unlock the target zone of a write request.
> > + */
> > +static void deadline_wunlock_zone(struct deadline_data *dd,
> > + struct request *rq)
> > +{
> > +   unsigned int zno = blk_rq_zone_no(rq);
> > +   unsigned long flags;
> > +
> > +   spin_lock_irqsave(&dd->zone_lock, flags);
> > +
> > +   WARN_ON_ONCE(!test_and_clear_bit(zno, dd->zones_wlock));
> > +   deadline_clear_request_zone_wlock(rq);
> > +
> > +   spin_unlock_irqrestore(&dd->zone_lock, flags);
> > +}
> 
> Why does deadline_wunlock_zone() protect modifications with dd->zone_lock but
> deadline_wlock_zone() not? If this code is correct, please add a
> lockdep_assert_held() statement in the first function.

Yes, that was a little confusing. In V6, I moved the introduction of the
zone_lock spinlock to where it is actually needed, that is, in the patch
following this one. I also added more comments in both the commit message and
in the code to explain why the spinlock is needed.

> > +/*
> > + * Test the write lock state of the target zone of a write request.
> > + */
> > +static inline bool deadline_zone_is_wlocked(struct deadline_data *dd,
> > +   struct request *rq)
> > +{
> > +   unsigned int zno = blk_rq_zone_no(rq);
> > +
> > +   return test_bit(zno, dd->zones_wlock);
> > +}
> 
> Do we really need the local variable 'zno'?

No, we don't. Fixed.
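
That is, the helper presumably collapses to something like this (a sketch, not
the actual V6 code):

        static inline bool deadline_zone_is_wlocked(struct deadline_data *dd,
                                                    struct request *rq)
        {
                return test_bit(blk_rq_zone_no(rq), dd->zones_wlock);
        }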

Best regards.

-- 
Damien Le Moal
Western Digital

Re: [PATCH V5 13/14] block: mq-deadline: Limit write request dispatch for zoned block devices

2017-10-01 Thread Damien Le Moal
On Mon, 2017-09-25 at 22:06 +, Bart Van Assche wrote:
> On Mon, 2017-09-25 at 15:14 +0900, Damien Le Moal wrote:
> > -   return rq_entry_fifo(dd->fifo_list[data_dir].next);
> > +   if (!dd->zones_wlock || data_dir == READ)
> > +   return rq_entry_fifo(dd->fifo_list[data_dir].next);
> > +
> > +   spin_lock_irqsave(&dd->zone_lock, flags);
> > +
> > +   list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
> > +   if (deadline_can_dispatch_request(dd, rq))
> > +   goto out;
> > +   }
> > +   rq = NULL;
> > +
> > +out:
> > +   spin_unlock_irqrestore(&dd->zone_lock, flags);
> 
> Is it documented somewhere what dd->zone_lock protects and when that lock
> should be acquired?

It was not well explained. I added comments in V6.

> 
> > /*
> >  * This may be a requeue of a request that has locked its
> > -* target zone. If this is the case, release the request zone
> > lock.
> > +* target zone. If this is the case, release the zone lock.
> >  */
> > if (deadline_request_has_zone_wlock(rq))
> > deadline_wunlock_zone(dd, rq);
> 
> Can this change be folded into the patch that introduced that comment?

Of course. Fixed in V6.

> > @@ -570,6 +621,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx
> > *hctx, struct request *rq,
> >  
> > blk_mq_sched_request_inserted(rq);
> >  
> > +   if (at_head && deadline_request_needs_zone_wlock(dd, rq))
> > +   pr_info(" Write at head !\n");
> > +
> > if (at_head || blk_rq_is_passthrough(rq)) {
> > if (at_head)
> > list_add(&rq->queuelist, &dd->dispatch);
> 
> Will it be easy to users who analyze a kernel log to figure out why that
> message has been generated? Should that message perhaps include the block
> device name, zone number and request sector number?

This was just a debug message that I forgot to remove. I removed it in V6.
Thanks for catching this.

Best regards.

-- 
Damien Le Moal
Western Digital

Re: [PATCH V5 10/14] block: mq-deadline: Add zoned block device data

2017-10-01 Thread Damien Le Moal
Bart,

On Mon, 2017-09-25 at 21:34 +, Bart Van Assche wrote:
> On Mon, 2017-09-25 at 15:14 +0900, Damien Le Moal wrote:
> > Modify mq-dealine init_queue and exit_queue elevator methods to handle
> 
>  ^^
>  mq-deadline ?
> 
> > +static int deadline_init_zones_wlock(struct request_queue *q,
> > +struct deadline_data *dd)
> > +{
> > +   /*
> > +* For regular drives or non-conforming zoned block device,
> > +* do not use zone write locking.
> > +*/
> > +   if (!blk_queue_nr_zones(q))
> > +   return 0;
> > +
> > +   /*
> > +* Treat host aware drives as regular disks.
> > +*/
> > +   if (blk_queue_zoned_model(q) != BLK_ZONED_HM)
> > +   return 0;
> > +
> > +   dd->zones_wlock =
> > kzalloc_node(BITS_TO_LONGS(blk_queue_nr_zones(q))
> > +  * sizeof(unsigned long),
> > +  GFP_KERNEL, q->node);
> 
> A request queue is created before disk validation occurs and before the
> number of zones is initialized (sd_probe_async()). If a scheduler is
> assigned to a ZBC drive through a udev rule, can it happen that
> deadline_init_zones_wlock() is called before the number of zones has been
> initialized?

Yes indeed. I am aware of this, hence the last patch of the series, which
disables setting a default scheduler for blk-mq devices.

That is however a little extreme. So for V6, I am looking at deferring the
default elevator setup until after the device information is gathered. I am
not sure where the right place to do that is, though. Looking.

Best regards.

-- 
Damien Le Moal
Western Digital

Re: [PATCH V5 07/14] scsi: sd_zbc: Initialize device request queue zoned data

2017-10-01 Thread Damien Le Moal
On Mon, 2017-09-25 at 21:17 +, Bart Van Assche wrote:
> On Mon, 2017-09-25 at 15:14 +0900, Damien Le Moal wrote:
> > +   return kzalloc_node(BITS_TO_LONGS(sdkp->nr_zones)
> > +   * sizeof(unsigned long),
> 
> Does this perhaps fit on one line?
> 
> > +/**
> > + * sd_zbc_get_seq_zones - Parse report zones reply to identify sequential
> > zones
> > + * @sdkp: disk used
> > + * @buf: report reply buffer
> > + * @seq_zone_bitamp: bitmap of sequential zones to set
> > + * @zno: Zone number of the first zone in the report
> 
> 'zno' is an input and output parameter but the above line only describes what
> happens with the value passed to sd_zbc_get_seq_zones(). I think we also need
> a description for the value of 'zno' upon return.

zno is not necessary since it can be calculated from the reported zone start
LBA. So I removed it in V6 to simplify.
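
For reference, since all zones of a conforming drive have the same size, the
zone number can be derived directly from the zone start LBA found in each
64-byte zone descriptor of the REPORT ZONES reply -- roughly like the sketch
below; the field names are illustrative, not the actual V6 code:

        /* Zone start LBA is at bytes 16..23 of each zone descriptor. */
        u64 zone_lba = get_unaligned_be64(&desc[16]);
        unsigned int zno = zone_lba >> ilog2(sdkp->zone_blocks);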

-- 
Damien Le Moal
Western Digital

Re: [PATCH v2] blk-throttle: fix possible io stall when upgrade to max

2017-10-01 Thread Shaohua Li
On Sat, Sep 30, 2017 at 02:38:49PM +0800, Joseph Qi wrote:
> From: Joseph Qi 
> 
> There is a case which will lead to io stall. The case is described as
> follows. 
> /test1
>   |-subtest1
> /test2
>   |-subtest2
> And subtest1 and subtest2 each has 32 queued bios already.
> 
> Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
> bios as follows:
> 1) tg=subtest1, do nothing;
> 2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
> left, no need to schedule next dispatch;
> 3) tg=subtest2, do nothing;
> 4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
> left, no need to schedule next dispatch;
> 5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
> test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
> to /; note that test1 and test2 each still has 16 queued bios left;
> 6) tg=/, try to schedule next dispatch, but since disptime is now
> (update in tg_update_disptime, wait=0), pending timer is not scheduled
> in fact;
> 7) In throtl_upgrade_state it totally dispatches 32 queued bios and with
> 32 left. test1 and test2 each has 16 queued bios;
> 8) throtl_pending_timer_fn sees the left over bios, but could do
> nothing, because throtl_select_dispatch returns 0, and test1/test2 has
> no pending tg.
> 
> The blktrace shows the following:
> 8,32   00 2.539007641 0  m   N throtl upgrade to max
> 8,32   00 2.539072267 0  m   N throtl /test2 dispatch 
> nr_queued=16 read=0 write=16
> 8,32   70 2.539077142 0  m   N throtl /test1 dispatch 
> nr_queued=16 read=0 write=16
> 
> So force schedule dispatch if there are pending children.
>
> Signed-off-by: Joseph Qi 
> ---
>  block/blk-throttle.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 0fea76a..17816a0 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1911,11 +1911,11 @@ static void throtl_upgrade_state(struct throtl_data 
> *td)
>  
>   tg->disptime = jiffies - 1;
>   throtl_select_dispatch(sq);
> - throtl_schedule_next_dispatch(sq, false);
> + throtl_schedule_next_dispatch(sq, true);
>   }
>   rcu_read_unlock();
>   throtl_select_dispatch(&td->service_queue);
> - throtl_schedule_next_dispatch(&td->service_queue, false);
> + throtl_schedule_next_dispatch(&td->service_queue, true);
>   queue_work(kthrotld_workqueue, &td->dispatch_work);
>  }

Reviewed-by: Shaohua Li  


Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

2017-10-01 Thread Michael Lyle
On Sun, Oct 1, 2017 at 10:23 AM, Coly Li  wrote:
> Hi Mike,
>
> Your data set is too small. Normally the bcache users I talk with use
> bcache for distributed storage clusters or commercial databases; their
> cache device is large and fast. It is possible we see different I/O
> behaviors because we use different configurations.

A small dataset is sufficient to tell whether the I/O subsystem is
successfully aggregating sequential writes or not.  :P  It doesn't
matter whether the test is 10 minutes or 10 hours...  The writeback
stuff walks the data in order.  :P

***We are measuring whether the cache and I/O scheduler can correctly
order up-to-64-outstanding writebacks from a chunk of 500 dirty
extents-- we do not need to do 12 hours of writes first to measure
this.***

It's important that there be actual contiguous data, though, or the
difference will be less significant.  If you write too much, there
will be a lot more holes in the data from writeback during the test
and from writes bypassing the cache.

Having all the data to writeback be sequential is an
artificial/synthetic condition that allows the difference to be
measured more easily.  It's about a 2x difference under these
conditions in my test environment.  I expect with real data that is
not purely sequential it's more like a few percent.
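
For reference, a workload of that shape can be generated with something like
the fio invocation below -- an approximation for illustration, not the exact
job used for the numbers above (the file name and queue depth are arbitrary):

        fio --name=prep-writeback --filename=/mnt/bcache0/testfile \
            --size=30g --bs=8k --rw=randwrite --ioengine=libaio \
            --iodepth=64 --direct=1

fio's default random map writes each 8k block exactly once, which is what
leaves fully contiguous dirty data behind for the writeback test.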

Mike


Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

2017-10-01 Thread Coly Li
On 2017/10/2 12:56 AM, Michael Lyle wrote:
> That's strange-- are you doing the same test scenario?  How much
> random I/O did you ask for?
> 
> My tests took 6-7 minutes to do the 30G of 8k not-repeating I/Os in a
> 30G file (about 9k IOPs for me-- it's actually significantly faster
> but then starves every few seconds-- not new with these patches)..
> your cache device is 3.8T, so to have a similar 12-13% of the cache
> you'd need to do 15x as much (90 mins if you're the same speed--- but
> your I/O subsystem is also much faster...)
> 
> If you're doing more like 3.8T of writes--  note that's not the same
> test.  (It will result in less contiguous stuff in the cache and it
> will be less repeatable / more volatile).

Hi Mike,

Your data set is too small. Normally the bcache users I talk with use
bcache for distributed storage clusters or commercial databases; their
cache device is large and fast. It is possible we see different I/O
behaviors because we use different configurations.

I use a 3.8T cache, and test two conditions:
1) dirty data is around 2.1TB; stop front-end writes and observe
background writeback.
2) dirty data is around the dirty target (in my test it was 305GB); then
stop front-end writes and observe background writeback.

It took a lot of time to get the dirty data ready, and then I recorded the
writeback rate and iostat output for hours. At the very beginning I can see
the write merge number is high (even more than 110 wrqm/s as a peak value
on a single disk), but a few minutes later there is almost no write merge.

When there is write merging, bcache works well both with and without your
patches 4,5. But when there is no write merging, bcache works badly both with
and without your patches 4,5. Even when writeback_rate_debug displays rate:
488.2M/sec, the real write throughput is 10MB/sec in total, that is 2~3MB/sec
on each hard disk; obviously the bottleneck is not the hard disks.

Right now I am collecting the last group of data, for bcache without your
patches 4,5 and with 2.1TB of dirty data, to observe how background
writeback works.

The progress is slower than I expected; tomorrow morning I will get the
data. I hope that 2 days later I may have a benchmark analysis to share.

I will update the results then, provided I didn't do anything wrong during my
performance testing.

Coly Li

> On Sat, Sep 30, 2017 at 9:51 PM, Coly Li  wrote:
>> On 2017/10/1 6:49 AM, Michael Lyle wrote:
>>> One final attempt to resend, because gmail has been giving me trouble
>>> sending plain text mail.
>>>
>>> Two instances of this.  Tested as above, with a big set of random I/Os
>>> that ultimately cover every block in a file (e.g. allowing sequential
>>> writeback).
>>>
>>> With the 5 patches, samsung 940 SSD cache + crummy 5400 RPM USB hard drive:
>>>
>>> Typical seconds look like:
>>>
>>> Reading 38232K from cache in 4809 IO.  38232/4809=7.95k per cache device IO.
>>>
>>> Writing 38112k to cache in 400 I/O = 95.28k -- or we are combining
>>> about 11.9 extents to a contiguous writeback.  Tracing, there are
>>> still contiguous things that are not getting merged well, but it's OK.
>>> (I'm hoping plugging makes this better).
>>>
>>> sda             4809.00      38232.00        446.00      38232        446
>>> sdb 400.00 0.00 38112.00  0  38112
>>>
>>> Without the 5 patches, a typical second--
>>>
>>> sda             2509.00      19968.00        316.00      19968        316
>>> sdb 502.00 0.00 19648.00  0  38112
>>>
>>> or we are combining about 4.9 extents to a contiguous writeback, and
>>> writing back at about half the rate.  All of these numbers are +/- 10%
>>> and obtained by eyeballing and grabbing representative seconds.
>>>
>>> Mike


-- 
Coly Li


Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

2017-10-01 Thread Michael Lyle
That's strange-- are you doing the same test scenario?  How much
random I/O did you ask for?

My tests took 6-7 minutes to do the 30G of 8k not-repeating I/Os in a
30G file (about 9k IOPs for me-- it's actually significantly faster
but then starves every few seconds-- not new with these patches)..
your cache device is 3.8T, so to have a similar 12-13% of the cache
you'd need to do 15x as much (90 mins if you're the same speed--- but
your I/O subsystem is also much faster...)

If you're doing more like 3.8T of writes--  note that's not the same
test.  (It will result in less contiguous stuff in the cache and it
will be less repeatable / more volatile).

Mike

On Sat, Sep 30, 2017 at 9:51 PM, Coly Li  wrote:
> On 2017/10/1 6:49 AM, Michael Lyle wrote:
>> One final attempt to resend, because gmail has been giving me trouble
>> sending plain text mail.
>>
>> Two instances of this.  Tested as above, with a big set of random I/Os
>> that ultimately cover every block in a file (e.g. allowing sequential
>> writeback).
>>
>> With the 5 patches, samsung 940 SSD cache + crummy 5400 RPM USB hard drive:
>>
>> Typical seconds look like:
>>
>> Reading 38232K from cache in 4809 IO.  38232/4809=7.95k per cache device IO.
>>
>> Writing 38112k to cache in 400 I/O = 95.28k -- or we are combining
>> about 11.9 extents to a contiguous writeback.  Tracing, there are
>> still contiguous things that are not getting merged well, but it's OK.
>> (I'm hoping plugging makes this better).
>>
>> sda             4809.00      38232.00        446.00      38232        446
>> sdb 400.00 0.00 38112.00  0  38112
>>
>> Without the 5 patches, a typical second--
>>
>> sda             2509.00      19968.00        316.00      19968        316
>> sdb 502.00 0.00 19648.00  0  38112
>>
>> or we are combining about 4.9 extents to a contiguous writeback, and
>> writing back at about half the rate.  All of these numbers are +/- 10%
>> and obtained by eyeballing and grabbing representative seconds.
>>
>> Mike
>>
>> On Sat, Sep 30, 2017 at 2:02 AM, Michael Lyle  wrote:
>>>
>>> Resent because I can't seem to get gmail to not send HTML mail.  And off to 
>>> sleep.
>>>
>>>
>>> On Sat, Sep 30, 2017 at 1:57 AM, Michael Lyle  wrote:
 Two instances of this.

 With the 5 patches, samsung 940 SSD cache + crummy 5400 RPM USB hard drive:

 Typical seconds look like:

 Reading 38232K from cache in 4809 IO.  38232/4809=7.95k per cache device 
 IO.

 Writing 38112k to cache in 400 I/O = 95.28k -- or we are combining about
 11.9 extents to a contiguous writeback.  Tracing, there are still 
 contiguous
 things that are not getting merged well, but it's OK.  (I'm hoping plugging
 makes this better).

 sda             4809.00      38232.00        446.00      38232        446
 sdb 400.00 0.00 38112.00  0  38112

 Without the 5 patches, a typical second--

 sda             2509.00      19968.00        316.00      19968        316
 sdb 502.00 0.00 19648.00  0  38112

 or we are combining about 4.9 extents to a contiguous writeback, and 
 writing
 back at about half the rate.
>
> Hi Mike,
>
> Get it. Now I am testing your patches (all 5 patches). It has been 12+
> hours, and it should be 12+ hours more. The backing device is a raid0
> composed of 4x1.8T 2.5-inch hard disks, and the cache device is a 3.8T NVMe
> SSD. Block size is 512K.
>
> Hope it works as you expected on my server.
>
> --
> Coly Li


[PATCH v2 1/1] bsg-lib: fix use-after-free under memory-pressure

2017-10-01 Thread Benjamin Block
When under memory pressure it is possible that the mempool which backs
the 'struct request_queue' will make use of up to BLKDEV_MIN_RQ emergency
buffers, in case it can't get a regular allocation. These buffers are
preallocated, and once they are also in use, the pool is re-supplied with old
finished requests from the same request_queue (see mempool_free()).
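
The relevant mempool behaviour, paraphrased -- a simplified sketch of the
logic in mm/mempool.c, not a literal copy:

        /* mempool_alloc(): pool->alloc() is only called for regular
         * allocations; under pressure a preallocated element is handed
         * out instead, without going through pool->alloc(). */
        element = pool->alloc(gfp_temp, pool->pool_data);
        if (!element && pool->curr_nr)
                element = remove_element(pool);

        /* mempool_free(): if the pool is below min_nr, the element is
         * put back into the pool as-is, with no re-initialization. */
        if (pool->curr_nr < pool->min_nr)
                add_element(pool, element);
        else
                pool->free(element, pool->pool_data);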

The bug is that, when re-supplying the emergency pool, the old requests are
not run through the callback mempool_t->alloc() again, and thus also not
through the callback bsg_init_rq(). So we skip initialization, and while the
sense-buffer should still be good, scsi_request->cmd might have become an
invalid pointer in the meantime: when the request is initialized in bsg.c and
the user's CDB is larger than BLK_MAX_CDB, bsg replaces it with a custom
allocated buffer, which is freed when the user's command finishes, so it
dangles afterwards. When the user next sends a command whose CDB is smaller
than or equal to BLK_MAX_CDB, bsg will assume that scsi_request->cmd is backed
by scsi_request->__cmd, will not make a custom allocation, and will write into
undefined memory.

Fix this by splitting bsg_init_rq() into two functions:
 - bsg_init_rq() is changed to only do the allocation of the
   sense-buffer, which is used to back the bsg job's reply buffer. This
   pointer should never change during the lifetime of a scsi_request, so
   it doesn't need re-initialization.
 - bsg_initialize_rq() is a new function that makes use of
   'struct request_queue's initialize_rq_fn callback (which was
   introduced in v4.12). This is always called before the request is
   given out via blk_get_request(). This function does the remaining
   initialization that was previously done in bsg_init_rq(), and will
   also do it when the request is taken from the emergency-pool of the
   backing mempool.

Fixes: 50b4d485528d ("bsg-lib: fix kernel panic resulting from missing 
allocation of reply-buffer")
Cc:  # 4.11+
Reviewed-by: Christoph Hellwig 
Signed-off-by: Benjamin Block 
---

Notes:
I did test this on zFCP with FC CT commands send via the ioctl() and
write() system-call. That did work fine. But I would very much
appreciate if anyone could run this against an other HBA or even an
other implementer of bsg-lib, such as now SAS, because I have no access
to such hardware here.

This should make no difference to the normal cases - where each request
is allocated via slab - with or without this patch, if I didn't miss
anything. Only the order is a bit mixed up - the memset is done after
the sense-allocation, so I have to buffer the sense-pointer for that.
But otherwise there is no difference I am aware of, so it should behave
the same (it does for me).

I could not reproduce the memory-pressure case here in the lab.. I
don't see any reason why it should work now, but I am open to
suggestions :)

Beste Grüße / Best regards,
  - Benjamin Block

Changes in v2:
 - Renamed function names to reflect the function-pointer names in
   request_queue

 block/bsg-lib.c | 29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index c82408c7cc3c..ff5a158b22fa 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -208,20 +208,34 @@ static int bsg_init_rq(struct request_queue *q, struct 
request *req, gfp_t gfp)
struct bsg_job *job = blk_mq_rq_to_pdu(req);
struct scsi_request *sreq = &job->sreq;
 
-   memset(job, 0, sizeof(*job));
+   /* called right after the request is allocated for the request_queue */
 
-   scsi_req_init(sreq);
-   sreq->sense_len = SCSI_SENSE_BUFFERSIZE;
-   sreq->sense = kzalloc(sreq->sense_len, gfp);
+   sreq->sense = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp);
if (!sreq->sense)
return -ENOMEM;
 
+   return 0;
+}
+
+static void bsg_initialize_rq(struct request *req)
+{
+   struct bsg_job *job = blk_mq_rq_to_pdu(req);
+   struct scsi_request *sreq = &job->sreq;
+   void *sense = sreq->sense;
+
+   /* called right before the request is given to the request_queue user */
+
+   memset(job, 0, sizeof(*job));
+
+   scsi_req_init(sreq);
+
+   sreq->sense = sense;
+   sreq->sense_len = SCSI_SENSE_BUFFERSIZE;
+
job->req = req;
-   job->reply = sreq->sense;
+   job->reply = sense;
job->reply_len = sreq->sense_len;
job->dd_data = job + 1;
-
-   return 0;
 }
 
 static void bsg_exit_rq(struct request_queue *q, struct request *req)
@@ -252,6 +266,7 @@ struct request_queue *bsg_setup_queue(struct device *dev, 
const char *name,
q->cmd_size = sizeof(struct bsg_job) + dd_job_size;
q->init_rq_fn = bsg_init_rq;
q->exit_rq_fn = 

Re: [PATCH 4/6] lightnvm: remove stable extern and unused exported symbols

2017-10-01 Thread Rakesh Pandit
On Sun, Oct 01, 2017 at 04:25:18PM +0300, Rakesh Pandit wrote:
> Not all exported symbols are being used outside core and there were
> some stable entries in lightnvm.h
> 

If you can replace 'stable' with 'stale' in both the subject and the body
while picking this up, that would be great.

Regards,


[PATCH 6/6] lightnvm: pblk: fix releases of kmem cache in error path

2017-10-01 Thread Rakesh Pandit
If pblk_core_init fails, let's destroy all global caches.

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/pblk-init.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 519e5cf..9f39800 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -246,7 +246,7 @@ static int pblk_core_init(struct pblk *pblk)
pblk->page_bio_pool = mempool_create_page_pool(nvm_max_phys_sects(dev),
0);
if (!pblk->page_bio_pool)
-   return -ENOMEM;
+   goto free_global_caches;
 
pblk->gen_ws_pool = mempool_create_slab_pool(PBLK_GEN_WS_POOL_SIZE,
pblk_ws_cache);
@@ -314,6 +314,12 @@ static int pblk_core_init(struct pblk *pblk)
mempool_destroy(pblk->gen_ws_pool);
 free_page_bio_pool:
mempool_destroy(pblk->page_bio_pool);
+free_global_caches:
+   kmem_cache_destroy(pblk_ws_cache);
+   kmem_cache_destroy(pblk_rec_cache);
+   kmem_cache_destroy(pblk_g_rq_cache);
+   kmem_cache_destroy(pblk_e_rq_cache);
+   kmem_cache_destroy(pblk_w_rq_cache);
return -ENOMEM;
 }
 
-- 
2.7.4



[PATCH 5/6] lightnvm: pblk: free up mempool allocation for erases correctly

2017-10-01 Thread Rakesh Pandit
While separating the read and erase mempools in 22da65a1b, pblk_g_rq_cache
was used twice to set aside memory for both erase and read requests.  Because
the same kmem cache is used repeatedly, a single call to kmem_cache_destroy
would not deallocate everything.  Repeatedly loading and unloading the pblk
module would eventually result in a leak.

The fix is to really use a separate kmem cache and track it
appropriately.

Fixes: 22da65a1b ("lightnvm: pblk: decouple read/erase mempools")
Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/pblk-init.c | 16 ++--
 drivers/lightnvm/pblk.h  |  1 +
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 9d9adcf..519e5cf 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -21,7 +21,7 @@
 #include "pblk.h"
 
 static struct kmem_cache *pblk_ws_cache, *pblk_rec_cache, *pblk_g_rq_cache,
-   *pblk_w_rq_cache;
+   *pblk_w_rq_cache, *pblk_e_rq_cache;
 static DECLARE_RWSEM(pblk_lock);
 struct bio_set *pblk_bio_set;
 
@@ -206,12 +206,23 @@ static int pblk_init_global_caches(struct pblk *pblk)
return -ENOMEM;
}
 
+   pblk_e_rq_cache = kmem_cache_create("pblk_e_rq", pblk_e_rq_size,
+   0, 0, NULL);
+   if (!pblk_e_rq_cache) {
+   kmem_cache_destroy(pblk_ws_cache);
+   kmem_cache_destroy(pblk_rec_cache);
+   kmem_cache_destroy(pblk_g_rq_cache);
+   up_write(&pblk_lock);
+   return -ENOMEM;
+   }
+
pblk_w_rq_cache = kmem_cache_create("pblk_w_rq", pblk_w_rq_size,
0, 0, NULL);
if (!pblk_w_rq_cache) {
kmem_cache_destroy(pblk_ws_cache);
kmem_cache_destroy(pblk_rec_cache);
kmem_cache_destroy(pblk_g_rq_cache);
+   kmem_cache_destroy(pblk_e_rq_cache);
up_write(&pblk_lock);
return -ENOMEM;
}
@@ -252,7 +263,7 @@ static int pblk_core_init(struct pblk *pblk)
goto free_rec_pool;
 
pblk->e_rq_pool = mempool_create_slab_pool(geo->nr_luns,
-   pblk_g_rq_cache);
+   pblk_e_rq_cache);
if (!pblk->e_rq_pool)
goto free_r_rq_pool;
 
@@ -327,6 +338,7 @@ static void pblk_core_free(struct pblk *pblk)
kmem_cache_destroy(pblk_ws_cache);
kmem_cache_destroy(pblk_rec_cache);
kmem_cache_destroy(pblk_g_rq_cache);
+   kmem_cache_destroy(pblk_e_rq_cache);
kmem_cache_destroy(pblk_w_rq_cache);
 }
 
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index fcac246..03834d1 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -651,6 +651,7 @@ struct pblk_line_ws {
 
 #define pblk_g_rq_size (sizeof(struct nvm_rq) + sizeof(struct pblk_g_ctx))
 #define pblk_w_rq_size (sizeof(struct nvm_rq) + sizeof(struct pblk_c_ctx))
+#define pblk_e_rq_size pblk_g_rq_size
 
 /*
  * pblk ring buffer operations
-- 
2.7.4



[PATCH 4/6] lightnvm: remove stable extern and unused exported symbols

2017-10-01 Thread Rakesh Pandit
Not all exported symbols are being used outside core and there were
some stable entries in lightnvm.h

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/core.c  | 129 +++
 include/linux/lightnvm.h |   7 ---
 2 files changed, 64 insertions(+), 72 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 9cd1c4b..2817b37 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -226,6 +226,24 @@ static const struct block_device_operations nvm_fops = {
.owner  = THIS_MODULE,
 };
 
+struct nvm_tgt_type *nvm_find_target_type(const char *name, int lock)
+{
+   struct nvm_tgt_type *tmp, *tt = NULL;
+
+   if (lock)
+   down_write(&nvm_tgtt_lock);
+
+   list_for_each_entry(tmp, &nvm_tgt_types, list)
+   if (!strcmp(name, tmp->name)) {
+   tt = tmp;
+   break;
+   }
+
+   if (lock)
+   up_write(&nvm_tgtt_lock);
+   return tt;
+}
+
 static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 {
struct nvm_ioctl_create_simple *s = &create->conf.s;
@@ -532,25 +550,6 @@ void nvm_part_to_tgt(struct nvm_dev *dev, sector_t 
*entries,
 }
 EXPORT_SYMBOL(nvm_part_to_tgt);
 
-struct nvm_tgt_type *nvm_find_target_type(const char *name, int lock)
-{
-   struct nvm_tgt_type *tmp, *tt = NULL;
-
-   if (lock)
-   down_write(&nvm_tgtt_lock);
-
-   list_for_each_entry(tmp, &nvm_tgt_types, list)
-   if (!strcmp(name, tmp->name)) {
-   tt = tmp;
-   break;
-   }
-
-   if (lock)
-   up_write(&nvm_tgtt_lock);
-   return tt;
-}
-EXPORT_SYMBOL(nvm_find_target_type);
-
 int nvm_register_tgt_type(struct nvm_tgt_type *tt)
 {
int ret = 0;
@@ -602,6 +601,52 @@ static struct nvm_dev *nvm_find_nvm_dev(const char *name)
return NULL;
 }
 
+static int nvm_set_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd,
+   const struct ppa_addr *ppas, int nr_ppas)
+{
+   struct nvm_dev *dev = tgt_dev->parent;
+   struct nvm_geo *geo = &tgt_dev->geo;
+   int i, plane_cnt, pl_idx;
+   struct ppa_addr ppa;
+
+   if (geo->plane_mode == NVM_PLANE_SINGLE && nr_ppas == 1) {
+   rqd->nr_ppas = nr_ppas;
+   rqd->ppa_addr = ppas[0];
+
+   return 0;
+   }
+
+   rqd->nr_ppas = nr_ppas;
+   rqd->ppa_list = nvm_dev_dma_alloc(dev, GFP_KERNEL, &rqd->dma_ppa_list);
+   if (!rqd->ppa_list) {
+   pr_err("nvm: failed to allocate dma memory\n");
+   return -ENOMEM;
+   }
+
+   plane_cnt = geo->plane_mode;
+   rqd->nr_ppas *= plane_cnt;
+
+   for (i = 0; i < nr_ppas; i++) {
+   for (pl_idx = 0; pl_idx < plane_cnt; pl_idx++) {
+   ppa = ppas[i];
+   ppa.g.pl = pl_idx;
+   rqd->ppa_list[(pl_idx * nr_ppas) + i] = ppa;
+   }
+   }
+
+   return 0;
+}
+
+static void nvm_free_rqd_ppalist(struct nvm_tgt_dev *tgt_dev,
+   struct nvm_rq *rqd)
+{
+   if (!rqd->ppa_list)
+   return;
+
+   nvm_dev_dma_free(tgt_dev->parent, rqd->ppa_list, rqd->dma_ppa_list);
+}
+
+
 int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct ppa_addr *ppas,
   int nr_ppas, int type)
 {
@@ -775,52 +820,6 @@ void nvm_put_area(struct nvm_tgt_dev *tgt_dev, sector_t 
begin)
 }
 EXPORT_SYMBOL(nvm_put_area);
 
-int nvm_set_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd,
-   const struct ppa_addr *ppas, int nr_ppas)
-{
-   struct nvm_dev *dev = tgt_dev->parent;
-   struct nvm_geo *geo = &tgt_dev->geo;
-   int i, plane_cnt, pl_idx;
-   struct ppa_addr ppa;
-
-   if (geo->plane_mode == NVM_PLANE_SINGLE && nr_ppas == 1) {
-   rqd->nr_ppas = nr_ppas;
-   rqd->ppa_addr = ppas[0];
-
-   return 0;
-   }
-
-   rqd->nr_ppas = nr_ppas;
-   rqd->ppa_list = nvm_dev_dma_alloc(dev, GFP_KERNEL, &rqd->dma_ppa_list);
-   if (!rqd->ppa_list) {
-   pr_err("nvm: failed to allocate dma memory\n");
-   return -ENOMEM;
-   }
-
-   plane_cnt = geo->plane_mode;
-   rqd->nr_ppas *= plane_cnt;
-
-   for (i = 0; i < nr_ppas; i++) {
-   for (pl_idx = 0; pl_idx < plane_cnt; pl_idx++) {
-   ppa = ppas[i];
-   ppa.g.pl = pl_idx;
-   rqd->ppa_list[(pl_idx * nr_ppas) + i] = ppa;
-   }
-   }
-
-   return 0;
-}
-EXPORT_SYMBOL(nvm_set_rqd_ppalist);
-
-void nvm_free_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd)
-{
-   if (!rqd->ppa_list)
-   return;
-
-   nvm_dev_dma_free(tgt_dev->parent, rqd->ppa_list, rqd->dma_ppa_list);
-}
-EXPORT_SYMBOL(nvm_free_rqd_ppalist);
-
 void 

[PATCH 3/6] lightnvm: remove unused argument from nvm_set_tgt_bb_tbl

2017-10-01 Thread Rakesh Pandit
vblk isn't being used anyway, and if we ever have a use case we can
introduce it again.  This makes the logic easier and removes
unnecessary checks.

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/core.c  | 29 -
 include/linux/lightnvm.h |  2 +-
 2 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index ddae430..9cd1c4b 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -616,7 +616,7 @@ int nvm_set_tgt_bb_tbl(struct nvm_tgt_dev *tgt_dev, struct 
ppa_addr *ppas,
 
memset(&rqd, 0, sizeof(struct nvm_rq));
 
-   nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas, 1);
+   nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas);
nvm_rq_tgt_to_dev(tgt_dev, &rqd);
 
ret = dev->ops->set_bb_tbl(dev, &rqd.ppa_addr, rqd.nr_ppas, type);
@@ -680,7 +680,7 @@ int nvm_erase_sync(struct nvm_tgt_dev *tgt_dev, struct 
ppa_addr *ppas,
rqd.private = &wait;
rqd.flags = geo->plane_mode >> 1;
 
-   ret = nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas, 1);
+   ret = nvm_set_rqd_ppalist(tgt_dev, &rqd, ppas, nr_ppas);
if (ret)
return ret;
 
@@ -776,14 +776,14 @@ void nvm_put_area(struct nvm_tgt_dev *tgt_dev, sector_t 
begin)
 EXPORT_SYMBOL(nvm_put_area);
 
 int nvm_set_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, struct nvm_rq *rqd,
-   const struct ppa_addr *ppas, int nr_ppas, int vblk)
+   const struct ppa_addr *ppas, int nr_ppas)
 {
struct nvm_dev *dev = tgt_dev->parent;
struct nvm_geo *geo = &tgt_dev->geo;
int i, plane_cnt, pl_idx;
struct ppa_addr ppa;
 
-   if ((!vblk || geo->plane_mode == NVM_PLANE_SINGLE) && nr_ppas == 1) {
+   if (geo->plane_mode == NVM_PLANE_SINGLE && nr_ppas == 1) {
rqd->nr_ppas = nr_ppas;
rqd->ppa_addr = ppas[0];
 
@@ -797,19 +797,14 @@ int nvm_set_rqd_ppalist(struct nvm_tgt_dev *tgt_dev, 
struct nvm_rq *rqd,
return -ENOMEM;
}
 
-   if (!vblk) {
-   for (i = 0; i < nr_ppas; i++)
-   rqd->ppa_list[i] = ppas[i];
-   } else {
-   plane_cnt = geo->plane_mode;
-   rqd->nr_ppas *= plane_cnt;
-
-   for (i = 0; i < nr_ppas; i++) {
-   for (pl_idx = 0; pl_idx < plane_cnt; pl_idx++) {
-   ppa = ppas[i];
-   ppa.g.pl = pl_idx;
-   rqd->ppa_list[(pl_idx * nr_ppas) + i] = ppa;
-   }
+   plane_cnt = geo->plane_mode;
+   rqd->nr_ppas *= plane_cnt;
+
+   for (i = 0; i < nr_ppas; i++) {
+   for (pl_idx = 0; pl_idx < plane_cnt; pl_idx++) {
+   ppa = ppas[i];
+   ppa.g.pl = pl_idx;
+   rqd->ppa_list[(pl_idx * nr_ppas) + i] = ppa;
}
}
 
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index 81b71c6d..c8c014b 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -481,7 +481,7 @@ extern int nvm_max_phys_sects(struct nvm_tgt_dev *);
 extern int nvm_submit_io(struct nvm_tgt_dev *, struct nvm_rq *);
 extern int nvm_erase_sync(struct nvm_tgt_dev *, struct ppa_addr *, int);
 extern int nvm_set_rqd_ppalist(struct nvm_tgt_dev *, struct nvm_rq *,
-   const struct ppa_addr *, int, int);
+   const struct ppa_addr *, int);
 extern void nvm_free_rqd_ppalist(struct nvm_tgt_dev *, struct nvm_rq *);
 extern int nvm_get_l2p_tbl(struct nvm_tgt_dev *, u64, u32, nvm_l2p_update_fn *,
   void *);
-- 
2.7.4



[PATCH 2/6] lightnvm: pblk: reduce arguments in __pblk_rb_update_l2p

2017-10-01 Thread Rakesh Pandit
We already pass the structure pointer so no need to pass the member.

Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/pblk-rb.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 05e6b2e..920ffac 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -201,9 +201,9 @@ unsigned int pblk_rb_read_commit(struct pblk_rb *rb, 
unsigned int nr_entries)
return subm;
 }
 
-static int __pblk_rb_update_l2p(struct pblk_rb *rb, unsigned int *l2p_upd,
-   unsigned int to_update)
+static int __pblk_rb_update_l2p(struct pblk_rb *rb, unsigned int to_update)
 {
+   unsigned int l2p_update = rb->l2p_update;
struct pblk *pblk = container_of(rb, struct pblk, rwb);
struct pblk_line *line;
struct pblk_rb_entry *entry;
@@ -213,7 +213,7 @@ static int __pblk_rb_update_l2p(struct pblk_rb *rb, 
unsigned int *l2p_upd,
int flags;
 
for (i = 0; i < to_update; i++) {
-   entry = &rb->entries[*l2p_upd];
+   entry = &rb->entries[l2p_update];
w_ctx = &entry->w_ctx;
 
flags = READ_ONCE(entry->w_ctx.flags);
@@ -230,7 +230,7 @@ static int __pblk_rb_update_l2p(struct pblk_rb *rb, 
unsigned int *l2p_upd,
line = &pblk->lines[pblk_tgt_ppa_to_line(w_ctx->ppa)];
kref_put(&line->ref, pblk_line_put);
clean_wctx(w_ctx);
-   *l2p_upd = (*l2p_upd + 1) & (rb->nr_entries - 1);
+   rb->l2p_update = (l2p_update + 1) & (rb->nr_entries - 1);
}
 
pblk_rl_out(&pblk->rl, user_io, gc_io);
@@ -258,7 +258,7 @@ static int pblk_rb_update_l2p(struct pblk_rb *rb, unsigned 
int nr_entries,
 
count = nr_entries - space;
/* l2p_update used exclusively under rb->w_lock */
-   ret = __pblk_rb_update_l2p(rb, &rb->l2p_update, count);
+   ret = __pblk_rb_update_l2p(rb, count);
 
 out:
return ret;
@@ -280,7 +280,7 @@ void pblk_rb_sync_l2p(struct pblk_rb *rb)
sync = smp_load_acquire(&rb->sync);
 
to_update = pblk_rb_ring_count(sync, rb->l2p_update, rb->nr_entries);
-   __pblk_rb_update_l2p(rb, &rb->l2p_update, to_update);
+   __pblk_rb_update_l2p(rb, to_update);
 
spin_unlock(&rb->w_lock);
 }
-- 
2.7.4



[PATCH 1/6] lightnvm: pblk: remove useless line

2017-10-01 Thread Rakesh Pandit
Signed-off-by: Rakesh Pandit 
---
 drivers/lightnvm/pblk-core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1f8aa94..4ffd1d6 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1053,7 +1053,6 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
/* Mark emeta metadata sectors as bad sectors. We need to consider bad
 * blocks to make sure that there are enough sectors to store emeta
 */
-   bit = lm->sec_per_line;
off = lm->sec_per_line - lm->emeta_sec[0];
bitmap_set(line->invalid_bitmap, off, lm->emeta_sec[0]);
while (nr_bb) {
-- 
2.7.4



[PATCH 0/6] misc cleanups and error path fixes

2017-10-01 Thread Rakesh Pandit
For 4.15.

These are most likely the last set of cleanups and error-path fixes, and they
also address a memory leak.  The series also removes stale symbols from
lightnvm.h, and stops exporting symbols which are used nowhere outside the
core (only locally within it).

Everything is based on top of:
https://github.com/OpenChannelSSD/linux branch for-4.15/pblk

So they can be queued on top of the patches already submitted earlier for
4.15.  Most of them are trivial fixes found while reading through the
algorithms used, during ALPSS yesterday or the day before.

Rakesh Pandit (6):
  lightnvm: pblk: remove useless line
  lightnvm: pblk: reduce arguments in __pblk_rb_update_l2p
  lightnvm: remove unused argument from nvm_set_tgt_bb_tbl
  lightnvm: remove stable extern and unused exported symbols
  lightnvm: pblk: free up mempool allocation for erases correctly
  lightnvm: pblk: fix releases of kmem cache in error path

 drivers/lightnvm/core.c  | 138 +--
 drivers/lightnvm/pblk-core.c |   1 -
 drivers/lightnvm/pblk-init.c |  24 +++-
 drivers/lightnvm/pblk-rb.c   |  12 ++--
 drivers/lightnvm/pblk.h  |   1 +
 include/linux/lightnvm.h |   7 ---
 6 files changed, 94 insertions(+), 89 deletions(-)

-- 
2.7.4



Re: [PATCH] blk-mq: wire up completion notifier for laptop mode

2017-10-01 Thread Christoph Hellwig
On Sat, Sep 30, 2017 at 10:06:45AM +0200, Jens Axboe wrote:
> For some reason, the laptop mode IO completion notifier was never wired
> up for blk-mq. Ensure that we trigger the callback appropriately, to arm
> the laptop mode flush timer.

Looks fine:

Reviewed-by: Christoph Hellwig 



Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems

2017-10-01 Thread Christoph Hellwig
On Fri, Sep 29, 2017 at 10:21:53PM +0800, Tony Yang wrote:
> Hi, All
> 
>  Because of my environment's requirements, the kernel must be 4.8.17.
> I would like to ask: how can I use nvme multi-path with kernel 4.8.17?
> I see that multi-path support is only in versions 4.13 and above.

In that case we simply can't help you.