Re: [PATCH 1/8] block: add WRITE_BG

2016-10-27 Thread Christoph Hellwig
>   *   non-volatile media on completion.
> + * WRITE_BG  Background write. This is for background activity like
> + *   the periodic flush and background threshold writeback
>   *
>   */
>  #define RW_MASK  REQ_OP_WRITE
> @@ -202,6 +204,7 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t 
> offset,
>  #define WRITE_FLUSH  (REQ_SYNC | REQ_NOIDLE | REQ_PREFLUSH)
>  #define WRITE_FUA(REQ_SYNC | REQ_NOIDLE | REQ_FUA)
>  #define WRITE_FLUSH_FUA  (REQ_SYNC | REQ_NOIDLE | REQ_PREFLUSH | 
> REQ_FUA)
> +#define WRITE_BG (REQ_NOIDLE | REQ_BG)

I've been trying to kill off these WRITE_ flags as they aren't exactly
helpful; see my branch here, which is waiting for the previous series to
go in:

   http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-flags

Which also raises the question of why you add the REQ_NOIDLE flag above, as
it's only applied to synchronous queues in CFQ as far as I can tell.

And while I'm at nitpicking about the most trivial patch of the
series anyway:  any good reason to not just spell out the "BACKGROUND" ?
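
For illustration only (not part of the patch), the rename being suggested
would look something like this; REQ_BACKGROUND here is just a hypothetical
spelled-out alias for the REQ_BG flag the series introduces, and REQ_NOIDLE
is dropped per the comment above:

#define REQ_BACKGROUND		REQ_BG			/* hypothetical spelled-out alias */
#define WRITE_BACKGROUND	(REQ_BACKGROUND)	/* background writeback, no REQ_NOIDLE */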


Re: [PATCH 06/12] blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()

2016-10-27 Thread Hannes Reinecke
On 10/27/2016 12:53 AM, Bart Van Assche wrote:
> Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
> are followed by kicking the requeue list. Hence add an argument to
> these two functions that allows the caller to kick the requeue list. This was
> proposed by Christoph Hellwig.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Sagi Grimberg 
> Cc: Johannes Thumshirn 
> ---
>  block/blk-flush.c|  5 +
>  block/blk-mq.c   | 10 +++---
>  drivers/block/xen-blkfront.c |  2 +-
>  drivers/md/dm-rq.c   |  2 +-
>  drivers/nvme/host/core.c |  2 +-
>  drivers/scsi/scsi_lib.c  |  4 +---
>  include/linux/blk-mq.h   |  5 +++--
>  7 files changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 6a14b68..a834aed 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -134,10 +134,7 @@ static void blk_flush_restore_request(struct request *rq)
>  static bool blk_flush_queue_rq(struct request *rq, bool add_front)
>  {
>   if (rq->q->mq_ops) {
> - struct request_queue *q = rq->q;
> -
> - blk_mq_add_to_requeue_list(rq, add_front);
> - blk_mq_kick_requeue_list(q);
> + blk_mq_add_to_requeue_list(rq, add_front, true);
>   return false;
>   } else {
>   if (add_front)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4945437..688bcc3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -492,12 +492,12 @@ static void __blk_mq_requeue_request(struct request *rq)
>   }
>  }
>  
> -void blk_mq_requeue_request(struct request *rq)
> +void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
>  {
>   __blk_mq_requeue_request(rq);
>  
>   BUG_ON(blk_queued_rq(rq));
> - blk_mq_add_to_requeue_list(rq, true);
> + blk_mq_add_to_requeue_list(rq, true, kick_requeue_list);
>  }
>  EXPORT_SYMBOL(blk_mq_requeue_request);
>  
> @@ -535,7 +535,8 @@ static void blk_mq_requeue_work(struct work_struct *work)
>   blk_mq_start_hw_queues(q);
>  }
>  
> -void blk_mq_add_to_requeue_list(struct request *rq, bool at_head)
> +void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
> + bool kick_requeue_list)
>  {
>   struct request_queue *q = rq->q;
>   unsigned long flags;
> @@ -554,6 +555,9 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool 
> at_head)
>   list_add_tail(&rq->queuelist, &q->requeue_list);
>   }
>   spin_unlock_irqrestore(&q->requeue_lock, flags);
> +
> + if (kick_requeue_list)
> + blk_mq_kick_requeue_list(q);
>  }
>  EXPORT_SYMBOL(blk_mq_add_to_requeue_list);
>  
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 9908597..1ca702d 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -2043,7 +2043,7 @@ static int blkif_recover(struct blkfront_info *info)
>   /* Requeue pending requests (flush or discard) */
>   list_del_init(&req->queuelist);
>   BUG_ON(req->nr_phys_segments > segs);
> - blk_mq_requeue_request(req);
> + blk_mq_requeue_request(req, false);
>   }
>   blk_mq_kick_requeue_list(info->rq);
>  
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index dc75bea..fbd37b4 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -354,7 +354,7 @@ EXPORT_SYMBOL(dm_mq_kick_requeue_list);
>  
>  static void dm_mq_delay_requeue_request(struct request *rq, unsigned long 
> msecs)
>  {
> - blk_mq_requeue_request(rq);
> + blk_mq_requeue_request(rq, false);
>   __dm_mq_kick_requeue_list(rq->q, msecs);
>  }
>  
Hmm. __dm_mq_kick_requeue_list() does essentially the same thing.
Have you checked whether you can pass 'true' here and drop the call to it?
However, __dm_mq_kick_requeue_list() does take the queue_lock when doing so.
Is that required? None of the other drivers do it ...
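
A minimal sketch of what the question above is getting at (not a proposed
patch, and the _alt name is made up): pass 'true' when no delay is wanted and
only keep the delayed-kick helper for the msecs != 0 case.  Note that the
current helper additionally checks blk_queue_stopped() under queue_lock, so
the two are only equivalent if that check turns out to be unnecessary.

static void dm_mq_delay_requeue_request_alt(struct request *rq,
					    unsigned long msecs)
{
	if (!msecs) {
		/* requeue and kick the requeue list immediately */
		blk_mq_requeue_request(rq, true);
	} else {
		/* requeue without kicking, then schedule a delayed kick */
		blk_mq_requeue_request(rq, false);
		blk_mq_delay_kick_requeue_list(rq->q, msecs);
	}
}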

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 07/12] dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-27 Thread Hannes Reinecke
On 10/27/2016 12:54 AM, Bart Van Assche wrote:
> Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
> in the dm start and stop queue functions, only manipulate the latter
> flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
> 
> Signed-off-by: Bart Van Assche 
> Reviewed-by: Christoph Hellwig 
> Cc: Mike Snitzer 
> ---
>  drivers/md/dm-rq.c | 18 ++
>  1 file changed, 2 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index fbd37b4..d47a504 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -75,12 +75,6 @@ static void dm_old_start_queue(struct request_queue *q)
>  
>  static void dm_mq_start_queue(struct request_queue *q)
>  {
> - unsigned long flags;
> -
> - spin_lock_irqsave(q->queue_lock, flags);
> - queue_flag_clear(QUEUE_FLAG_STOPPED, q);
> - spin_unlock_irqrestore(q->queue_lock, flags);
> -
>   blk_mq_start_stopped_hw_queues(q, true);
>   blk_mq_kick_requeue_list(q);
>  }
> @@ -105,16 +99,8 @@ static void dm_old_stop_queue(struct request_queue *q)
>  
>  static void dm_mq_stop_queue(struct request_queue *q)
>  {
> - unsigned long flags;
> -
> - spin_lock_irqsave(q->queue_lock, flags);
> - if (blk_queue_stopped(q)) {
> - spin_unlock_irqrestore(q->queue_lock, flags);
> + if (blk_mq_queue_stopped(q))
>   return;
> - }
> -
> - queue_flag_set(QUEUE_FLAG_STOPPED, q);
> - spin_unlock_irqrestore(q->queue_lock, flags);
>  
>   /* Avoid that requeuing could restart the queue. */
>   blk_mq_cancel_requeue_work(q);
> @@ -341,7 +327,7 @@ static void __dm_mq_kick_requeue_list(struct 
> request_queue *q, unsigned long mse
>   unsigned long flags;
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> - if (!blk_queue_stopped(q))
> + if (!blk_mq_queue_stopped(q))
>   blk_mq_delay_kick_requeue_list(q, msecs);
>   spin_unlock_irqrestore(q->queue_lock, flags);
>  }
> 
Ah. Right. That answers my previous question.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: A question regarding "multiple SGL"

2016-10-27 Thread Christoph Hellwig
Hi Robert,

please explain your use case that isn't handled.  The one and only
reason to set MSDBD to 1 is to make the code a lot simpler given that
there is no real use case for supporting more.

RDMA uses memory registrations to register large and possibly
discontiguous data regions for a single rkey, aka single SGL descriptor
in NVMe terms.  There would be two reasons to support multiple SGL
descriptors:  a) to support a larger I/O size than supported by a single
MR, or b) to support a data region format not mappable by a single
MR.

iSER only supports a single rkey (or stag in IETF terminology) and has
been doing fine on a) and mostly fine on b).   There are a few possible
data layouts not supported by the traditional IB/iWarp FR WRs, but the
limit is in fact exactly the same as imposed by the NVMe PRPs used for
PCIe NVMe devices, so the Linux block layer has support to not generate
them.  Also with modern Mellanox IB/RoCE hardware we can actually
register completely arbitrary SGLs.  iSER supports using this registration
mode already with a trivial code addition, but for NVMe we didn't have a
pressing need yet.
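
A minimal sketch of the block-layer knob being referred to: a driver can set
a virtual-boundary mask on its queue so that the SG lists it receives never
contain the gappy layouts a PRP-style (or single-MR) mapping cannot describe.
blk_queue_virt_boundary() is the real helper; dev_page_size is just a
stand-in for whatever granularity the transport registers with.

static void example_limit_sg_layout(struct request_queue *q,
				    unsigned int dev_page_size)
{
	/* only allow SG gaps at dev_page_size-aligned boundaries (PRP-like layout) */
	blk_queue_virt_boundary(q, dev_page_size - 1);
}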


Re: A question regarding "multiple SGL"

2016-10-27 Thread Christoph Hellwig
Hi Robert,

There is no feature called "Multiple SGL in one NVMe capsule".  The
NVMe over Fabrics specification allows a controller to advertise how
many SGL descriptors it supports using the MSDBD Identify field:

"Maximum SGL Data Block Descriptors (MSDBD): This field indicates the
maximum number of (Keyed) SGL Data Block descriptors that a host is allowed to
place in a capsule. A value of 0h indicates no limit."

Setting this value to 1 is perfectly valid.  Similarly a host is free
to choose any number of SGL descriptors from 0 (only for commands that
don't transfer data) up to the limit imposed by the controller using the
MSDBD field.

There are no plans to support a MSDBD value larger than 1 in the Linux
NVMe target, and there are no plans to ever submit commands with multiple
SGLs from the host driver either.
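
Purely illustrative (the names are placeholders, not the Linux host driver's
code, which simply always uses a single descriptor): interpreting the MSDBD
field quoted above on the host side.

static unsigned int example_max_sgl_descs(u8 msdbd)
{
	if (msdbd == 0)		/* 0h means the controller imposes no limit */
		return UINT_MAX;
	return msdbd;		/* otherwise at most MSDBD descriptors per capsule */
}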

Cheers,
Christoph


Re: A question regarding "multiple SGL"

2016-10-27 Thread Qiuxin (robert)
Hi Christoph,

Thanks, got it.

Could you please do me a favor and let me know the background on why we ONLY
support "MSDBD == 1"?  I am NOT trying to resist or oppose anything, I just
want to know the reason.  It is a little weird to me, as "MSDBD == 1" does
not fulfill all the use cases depicted in the spec.

Best,
Robert Qiuxin

Robert Qiuxin
Huawei Technologies Co., Ltd.
Phone: +86-755-28420357
Fax: 
Mobile: +86 15986638429
Email: qiu...@huawei.com
Address: Huawei Base, Bantian, Longgang District, Shenzhen 518129, P.R. China
http://www.huawei.com 

-----Original Message-----
From: Christoph Hellwig [mailto:h...@lst.de]
Sent: October 27, 2016 14:41
To: 鑫愿
Cc: Bart Van Assche; Jens Axboe; linux-block@vger.kernel.org; James Bottomley;
Martin K. Petersen; Mike Snitzer; linux-r...@vger.kernel.org; Ming Lei;
linux-n...@lists.infradead.org; Keith Busch; Doug Ledford;
linux-s...@vger.kernel.org; Laurence Oberman; Christoph Hellwig; Tiger zhao;
Qiuxin (robert)
Subject: Re: A question regarding "multiple SGL"

Hi Robert,

There is no feature called "Multiple SGL in one NVMe capsule".  The NVMe over 
Fabrics specification allows a controller to advertise how many SGL descriptors 
it supports using the MSDBD Identify field:

"Maximum SGL Data Block Descriptors (MSDBD): This field indicates the maximum 
number of (Keyed) SGL Data Block descriptors that a host is allowed to place in 
a capsule. A value of 0h indicates no limit."

Setting this value to 1 is perfectly valid.  Similarly a host is free to choose
any number of SGL descriptors from 0 (only for commands that don't transfer
data) up to the limit imposed by the controller using the MSDBD field.

There are no plans to support a MSDBD value larger than 1 in the Linux NVMe 
target, and there are no plans to ever submit commands with multiple SGLs from 
the host driver either.

Cheers,
Christoph


Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Grozdan
On Thu, Oct 27, 2016 at 11:26 AM, Jan Kara  wrote:
> On Wed 26-10-16 10:12:38, Jens Axboe wrote:
>> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>> >
>> >>On 26 Oct 2016, at 17:32, Jens Axboe wrote:
>> >>
>> >>On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>> >>>On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>> The question to ask first is whether to actually have pluggable
>> schedulers on blk-mq at all, or just have one that is meant to
>> do the right thing in every case (and possibly can be bypassed
>> completely).
>> >>>
>> >>>That would be my preference.  Have a BFQ-variant for blk-mq as an
>> >>>option (default to off unless opted in by the driver or user), and
>> >>>not other scheduler for blk-mq.  Don't bother with bfq for non
>> >>>blk-mq.  It's not like there is any advantage in the legacy-request
>> >>>device even for slow devices, except for the option of having I/O
>> >>>scheduling.
>> >>
>> >>It's the only right way forward. blk-mq might not offer any substantial
>> >>advantages to rotating storage, but with scheduling, it won't offer a
>> >>downside either. And it'll take us towards the real goal, which is to
>> >>have just one IO path.
>> >
>> >ok
>> >
>> >>Adding a new scheduler for the legacy IO path
>> >>makes no sense.
>> >
>> >I would fully agree if effective and stable I/O scheduling would be
>> >available in blk-mq in one or two months.  But I guess that it will
>> >take at least one year optimistically, given the current status of the
>> >needed infrastructure, and given the great difficulties of doing
>> >effective scheduling at the high parallelism and extreme target speeds
>> >of blk-mq.  Of course, this holds true unless little clever scheduling
>> >is performed.
>> >
>> >So, what's the point in forcing a lot of users wait another year or
>> >more, for a solution that has yet to be even defined, while they could
>> >enjoy a much better system, and then switch an even better system when
>> >scheduling is ready in blk-mq too?
>>
>> That same argument could have been made 2 years ago. Saying no to a new
>> scheduler for the legacy framework goes back roughly that long. We could
>> have had BFQ for mq NOW, if we didn't keep coming back to this very
>> point.
>>
>> I'm hesitant to add a new scheduler because it's very easy to add, very
>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
>> it'll take us years and years to get rid of it again. We should be
>> moving towards LESS moving parts in the legacy path, not more.
>>
>> We can keep having this discussion every few years, but I think we'd
>> both prefer to make some actual progress here. It's perfectly fine to
>> add an interface for a single queue interface for an IO scheduler for
>> blk-mq, since we don't care too much about scalability there. And that
>> won't take years, that should be a few weeks. Retrofitting BFQ on top of
>> that should not be hard either. That can co-exist with a real multiqueue
>> scheduler as well, something that's geared towards some fairness for
>> faster devices.
>
> OK, so some solution like having a variant of blk_sq_make_request() that
> will consume requests, do IO scheduling decisions on them, and feed them
> into the HW queue as it sees fit would be acceptable? That will provide the
> IO scheduler a global view that it needs for complex scheduling decisions
> so it should indeed be relatively easy to port BFQ to work like that.
>
> Honza
> --
> Jan Kara 
> SUSE Labs, CR


Hello,

Let me first say that I'm in no way associated with Paolo Valente or
any other BFQ developer. I'm a mere user who has had great experience
using BFQ

My workload is one that takes my disks to their limits. I often use
large files like raw Blu-ray streams which I then remux to mkv's while
at the same time streaming at least 2 movies to various devices in
house and using my system as I do while the remuxing process is going
on. At times, I'm also pushing video files to my NAS at close to Gbps
speed while the stuff I mentioned is in progress

My experience with BFQ is that it has never resulted in the video
streams being interrupted due to disk thrashing. I've extensively used
all the other Linux disk schedulers in the past and what I've observed
is that whenever I start the remuxing (and copying) process, the
streams will begin to hiccup, stutter, and often multi-second-long
"waits" will occur. It gets even worse: when I do this kind of
workload, the whole system will come to almost a halt and
interactivity goes out the window. Impossible to start an app in a
reasonable amount of time. Loading a visited website makes Chrome hang

Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-27 Thread Sagi Grimberg



You do realise that local filesystems can silently change the
location of file data at any point in time, so there is no such
thing as a "stable mapping" of file data to block device addresses
in userspace?

If you want remote access to the blocks owned and controlled by a
filesystem, then you need to use a filesystem with a remote locking
mechanism to allow co-ordinated, coherent access to the data in
those blocks. Anything else is just asking for ongoing, unfixable
filesystem corruption or data leakage problems (i.e.  security
issues).


And at least for XFS we have such a mechanism :)  E.g. I have a
prototype of a pNFS layout that uses XFS+DAX to allow clients to do
RDMA directly to XFS files, with the same locking mechanism we use
for the current block and scsi layout in xfs_pnfs.c.


Christoph, did you manage to leap to the future and solve the
RDMA persistency hole? :)

e.g. what happens with O_DSYNC in this model? Or you did
a message exchange for commits?


[PATCH 6/7] lightnvm: add ECC error codes

2016-10-27 Thread Javier González
Add ECC error codes to enable the appropriate handling in the target.

Signed-off-by: Javier González 
---
 include/linux/lightnvm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index e3ccaff..33643ae 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -107,6 +107,8 @@ enum {
NVM_RSP_NOT_CHANGEABLE  = 0x1,
NVM_RSP_ERR_FAILWRITE   = 0x40ff,
NVM_RSP_ERR_EMPTYPAGE   = 0x42ff,
+   NVM_RSP_ERR_FAILECC = 0x4281,
+   NVM_RSP_WARN_HIGHECC= 0x4700,
 
/* Device opcodes */
NVM_OP_HBREAD   = 0x02,
-- 
2.7.4



[PATCH 2/7] lightnvm: do not decide on device blocks

2016-10-27 Thread Javier González
Device blocks should be marked by the device and considered as bad
blocks by the media manager. Thus, do not make assumptions on which
blocks are going to be used by the device. In doing so we might lose
valid blocks from the free list.

Signed-off-by: Javier González 
---
 drivers/lightnvm/gennvm.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index 730d736..a7e17fa 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -371,12 +371,6 @@ static int gen_blocks_init(struct nvm_dev *dev, struct 
gen_dev *gn)
block->lun = &lun->vlun;
block->id = cur_block_id++;
 
-   /* First block is reserved for device */
-   if (unlikely(lun_iter == 0 && blk_iter == 0)) {
-   lun->vlun.nr_free_blocks--;
-   continue;
-   }
-
list_add_tail(&block->list, &lun->free_list);
}
 
-- 
2.7.4



[PATCH 5/7] lightnvm: export set bad block table

2016-10-27 Thread Javier González
Bad blocks should be managed by block owners. This would be either
targets for data blocks or sysblk for system blocks.

In order to support this, export two functions: One to mark a block as
a specific type (e.g., bad block) and another to update the bad block
table on the device.

Move bad block management to rrpc.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c   | 27 +++
 drivers/lightnvm/gennvm.c | 25 +
 drivers/lightnvm/rrpc.c   | 34 +-
 drivers/lightnvm/sysblk.c | 29 +
 include/linux/lightnvm.h  | 17 -
 5 files changed, 82 insertions(+), 50 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 4be3879..a81ed1c 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -197,6 +197,33 @@ void nvm_mark_blk(struct nvm_dev *dev, struct ppa_addr 
ppa, int type)
 }
 EXPORT_SYMBOL(nvm_mark_blk);
 
+int nvm_set_bb_tbl(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas,
+   int type)
+{
+   struct nvm_rq rqd;
+   int ret;
+
+   if (nr_ppas > dev->ops->max_phys_sect) {
+   pr_err("nvm: unable to update all sysblocks atomically\n");
+   return -EINVAL;
+   }
+
+   memset(&rqd, 0, sizeof(struct nvm_rq));
+
+   nvm_set_rqd_ppalist(dev, &rqd, ppas, nr_ppas, 1);
+   nvm_generic_to_addr_mode(dev, &rqd);
+
+   ret = dev->ops->set_bb_tbl(dev, &rqd.ppa_addr, rqd.nr_ppas, type);
+   nvm_free_rqd_ppalist(dev, &rqd);
+   if (ret) {
+   pr_err("nvm: sysblk failed bb mark\n");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL(nvm_set_bb_tbl);
+
 int nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
 {
return dev->mt->submit_io(dev, rqd);
diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index 575afc4..ae19a61 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -611,34 +611,11 @@ static void gen_mark_blk(struct nvm_dev *dev, struct 
ppa_addr ppa, int type)
blk->state = type;
 }
 
-/*
- * mark block bad in gen. It is expected that the target recovers separately
- */
-static void gen_mark_blk_bad(struct nvm_dev *dev, struct nvm_rq *rqd)
-{
-   int bit = -1;
-   int max_secs = dev->ops->max_phys_sect;
-   void *comp_bits = &rqd->ppa_status;
-
-   nvm_addr_to_generic_mode(dev, rqd);
-
-   /* look up blocks and mark them as bad */
-   if (rqd->nr_ppas == 1) {
-   gen_mark_blk(dev, rqd->ppa_addr, NVM_BLK_ST_BAD);
-   return;
-   }
-
-   while ((bit = find_next_bit(comp_bits, max_secs, bit + 1)) < max_secs)
-   gen_mark_blk(dev, rqd->ppa_list[bit], NVM_BLK_ST_BAD);
-}
-
 static void gen_end_io(struct nvm_rq *rqd)
 {
struct nvm_tgt_instance *ins = rqd->ins;
 
-   if (rqd->error == NVM_RSP_ERR_FAILWRITE)
-   gen_mark_blk_bad(rqd->dev, rqd);
-
+   /* Write failures and bad blocks are managed within the target */
ins->tt->end_io(rqd);
 }
 
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index cb30ccf..8deef2e 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -716,6 +716,34 @@ static void rrpc_run_gc(struct rrpc *rrpc, struct 
rrpc_block *rblk)
queue_work(rrpc->kgc_wq, &gcb->ws_gc);
 }
 
+static void __rrpc_mark_bad_block(struct nvm_dev *dev, struct ppa_addr *ppa)
+{
+   nvm_mark_blk(dev, *ppa, NVM_BLK_ST_BAD);
+   nvm_set_bb_tbl(dev, ppa, 1, NVM_BLK_T_GRWN_BAD);
+}
+
+static void rrpc_mark_bad_block(struct rrpc *rrpc, struct nvm_rq *rqd)
+{
+   struct nvm_dev *dev = rrpc->dev;
+   void *comp_bits = &rqd->ppa_status;
+   struct ppa_addr ppa, prev_ppa;
+   int nr_ppas = rqd->nr_ppas;
+   int bit;
+
+   if (rqd->nr_ppas == 1)
+   __rrpc_mark_bad_block(dev, &rqd->ppa_addr);
+
+   ppa_set_empty(&prev_ppa);
+   bit = -1;
+   while ((bit = find_next_bit(comp_bits, nr_ppas, bit + 1)) < nr_ppas) {
+   ppa = rqd->ppa_list[bit];
+   if (ppa_cmp_blk(ppa, prev_ppa))
+   continue;
+
+   __rrpc_mark_bad_block(dev, &ppa);
+   }
+}
+
 static void rrpc_end_io_write(struct rrpc *rrpc, struct rrpc_rq *rrqd,
sector_t laddr, uint8_t npages)
 {
@@ -742,8 +770,12 @@ static void rrpc_end_io(struct nvm_rq *rqd)
uint8_t npages = rqd->nr_ppas;
sector_t laddr = rrpc_get_laddr(rqd->bio) - npages;
 
-   if (bio_data_dir(rqd->bio) == WRITE)
+   if (bio_data_dir(rqd->bio) == WRITE) {
+   if (rqd->error == NVM_RSP_ERR_FAILWRITE)
+   rrpc_mark_bad_block(rrpc, rqd);
+
rrpc_end_io_write(rrpc, rrqd, laddr, npages);
+   }
 
bio_put(rqd->bio);
 
diff --git a/drivers/lightnvm/sysblk.c 

Re: [PATCH 05/12] blk-mq: Introduce blk_mq_quiesce_queue()

2016-10-27 Thread Bart Van Assche

On 10/26/2016 10:52 PM, Hannes Reinecke wrote:

Hmm. Can't say I like having to have two RCU structure for the same
purpose, but I guess that can't be helped.


Hello Hannes,

There are two RCU structures because of BLK_MQ_F_BLOCKING. Today only 
the nbd driver sets that flag. If the nbd driver would be modified such 
that it doesn't set that flag then the BLK_MQ_F_BLOCKING flag and also 
queue_rq_srcu could be eliminated from the blk-mq core.
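
A rough sketch of why the two flavours are needed (this is the idea, not the
actual patch): hardware contexts whose ->queue_rq() may sleep are marked
BLK_MQ_F_BLOCKING and have to be waited for with SRCU, while everything else
can use plain RCU.  If no driver set the flag, the SRCU half (and the srcu
structure behind it) could indeed be dropped.

static void example_quiesce(struct request_queue *q, struct srcu_struct *srcu)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned int i;
	bool need_rcu = false;

	queue_for_each_hw_ctx(q, hctx, i) {
		if (hctx->flags & BLK_MQ_F_BLOCKING)
			synchronize_srcu(srcu);		/* blocking ->queue_rq() */
		else
			need_rcu = true;
	}
	if (need_rcu)
		synchronize_rcu();			/* non-blocking hw contexts */
}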


Bart.


Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-27 Thread Christoph Hellwig
On Thu, Oct 27, 2016 at 01:22:49PM +0300, Sagi Grimberg wrote:
> Christoph, did you manage to leap to the future and solve the
> RDMA persistency hole? :)
> 
> e.g. what happens with O_DSYNC in this model? Or you did
> a message exchange for commits?

Yes, pNFS calls this the layoutcommit.  That being said once we get a RDMA
commit or flush operation we could easily make the layoutcommit optional
for some operations.  There already is a precedence for the in the
flexfiles layout specification.


Re: [PATCH 05/12] blk-mq: Introduce blk_mq_quiesce_queue()

2016-10-27 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


[PATCH 4/7] lightnvm: drop reserve and release LUN callbacks

2016-10-27 Thread Javier González
On target initialization, targets use callbacks to the media manager to
configure the LUNs they use. In order to simplify the flow, drop these
callbacks and manage everything internally in the media manager.

By making use of the newly introduced LUN management structure, the media
manager knows which target exclusively owns each LUN and can
therefore allocate and free all the necessary structures before
initializing the target. LUNs that are not exclusively owned belong to the
media manager in any case.

Adapt rrpc to not use the reserve_lun/release_lun callback functions.

Signed-off-by: Javier González 
---
 drivers/lightnvm/gennvm.c | 62 +++
 drivers/lightnvm/rrpc.c   | 12 +
 include/linux/lightnvm.h  |  5 +---
 3 files changed, 49 insertions(+), 30 deletions(-)

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index 8bff725..575afc4 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -35,6 +35,30 @@ static const struct block_device_operations gen_fops = {
.owner  = THIS_MODULE,
 };
 
+static int gen_reserve_luns(struct nvm_dev *dev, int lun_begin, int lun_end,
+   struct nvm_target *t)
+{
+   struct gen_dev *gn = dev->mp;
+   struct gen_lun *lun;
+   int i;
+
+   for (i = lun_begin; i <= lun_end; i++) {
+   if (test_and_set_bit(i, dev->lun_map)) {
+   pr_err("gennvm: lun %d is already allocated\n", i);
+   goto fail;
+   }
+
+   lun = &gn->luns[i];
+   }
+
+   return 0;
+fail:
+   while (--i > lun_begin)
+   clear_bit(i, dev->lun_map);
+
+   return 1;
+}
+
 static int gen_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 {
struct gen_dev *gn = dev->mp;
@@ -80,6 +104,9 @@ static int gen_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
tdisk->fops = &gen_fops;
tdisk->queue = tqueue;
 
+   if (tt->exclusive && gen_reserve_luns(dev, s->lun_begin, s->lun_end, t))
+   goto err_init;
+
targetdata = tt->init(dev, tdisk, s->lun_begin, s->lun_end);
if (IS_ERR(targetdata))
goto err_init;
@@ -110,7 +137,23 @@ err_t:
return -ENOMEM;
 }
 
-static void __gen_remove_target(struct nvm_target *t)
+static void gen_release_luns(struct nvm_dev *dev, struct nvm_target *t)
+{
+   struct gen_dev *gn = dev->mp;
+   struct gen_lun *lun;
+   int lunid;
+   int i;
+
+   gen_for_each_lun(gn, lun, i) {
+   if (lun->tgt != t)
+   continue;
+
+   lunid = lun->vlun.id;
+   WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+   }
+}
+
+static void __gen_remove_target(struct nvm_dev *dev, struct nvm_target *t)
 {
struct nvm_tgt_type *tt = t->type;
struct gendisk *tdisk = t->disk;
@@ -122,6 +165,7 @@ static void __gen_remove_target(struct nvm_target *t)
if (tt->exit)
tt->exit(tdisk->private_data);
 
+   gen_release_luns(dev, t);
put_disk(tdisk);
 
list_del(&t->list);
@@ -152,7 +196,7 @@ static int gen_remove_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_remove *remove)
mutex_unlock(&gn->lock);
return 1;
}
-   __gen_remove_target(t);
+   __gen_remove_target(dev, t);
mutex_unlock(&gn->lock);
 
return 0;
@@ -474,7 +518,7 @@ static void gen_unregister(struct nvm_dev *dev)
list_for_each_entry_safe(t, tmp, &gn->targets, list) {
if (t->dev != dev)
continue;
-   __gen_remove_target(t);
+   __gen_remove_target(dev, t);
}
mutex_unlock(&gn->lock);
 
@@ -618,16 +662,6 @@ static int gen_erase_blk(struct nvm_dev *dev, struct 
nvm_block *blk, int flags)
return nvm_erase_ppa(dev, &addr, 1, flags);
 }
 
-static int gen_reserve_lun(struct nvm_dev *dev, int lunid)
-{
-   return test_and_set_bit(lunid, dev->lun_map);
-}
-
-static void gen_release_lun(struct nvm_dev *dev, int lunid)
-{
-   WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
-}
-
 static struct nvm_lun *gen_get_lun(struct nvm_dev *dev, int lunid)
 {
struct gen_dev *gn = dev->mp;
@@ -674,8 +708,6 @@ static struct nvmm_type gen = {
.mark_blk   = gen_mark_blk,
 
.get_lun= gen_get_lun,
-   .reserve_lun= gen_reserve_lun,
-   .release_lun= gen_release_lun,
.lun_info_print = gen_lun_info_print,
 
.get_area   = gen_get_area,
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index f293d00..cb30ccf 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -1167,8 +1167,6 @@ static void rrpc_core_free(struct rrpc *rrpc)
 
 static void rrpc_luns_free(struct rrpc *rrpc)
 {
-   struct nvm_dev *dev = rrpc->dev;
-   

[PATCH 7/7] lightnvm: rrpc: split bios of size > 256kb

2016-10-27 Thread Javier González
rrpc cannot handle bios of size > 256kb due to NVMe's 64 bit completion
bitmap. If a larger bio comes in, split it explicitly.

Signed-off-by: Javier González 
---
 drivers/lightnvm/rrpc.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 8deef2e..0b8251f 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -984,6 +984,12 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, 
struct bio *bio)
struct nvm_rq *rqd;
int err;
 
+   /*
+* Multipage is supported up until 256kb due to NVME's 64 bit completion
+* bitmap.
+*/
+   blk_queue_split(q, &bio, q->bio_split);
+
if (bio_op(bio) == REQ_OP_DISCARD) {
rrpc_discard(rrpc, bio);
return BLK_QC_T_NONE;
-- 
2.7.4
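
For reference, the 256 KB figure follows directly from the bitmap width,
assuming the 4 KB exposed sector size rrpc operates on:

64 \text{ bitmap bits} \times 4\,\mathrm{KB} = 256\,\mathrm{KB}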



[PATCH 1/7] lightnvm: enable to send hint to erase command

2016-10-27 Thread Javier González
Erases might be subject to host hints. An example is multi-plane
programming to erase blocks in parallel. Enable targets to specify these
hints.
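
A hypothetical usage sketch (not from this series; the flag value below is
made up) of how a target could pass a plane-programming hint through the new
argument instead of today's hard-coded 0:

/* hypothetical hint value -- real values come from the device geometry */
#define EXAMPLE_ERASE_DUAL_PLANE	0x1

static int example_gc_erase(struct nvm_dev *dev, struct nvm_block *blk)
{
	/* flags == 0 keeps the old behaviour; a hint selects the plane mode */
	return nvm_erase_blk(dev, blk, EXAMPLE_ERASE_DUAL_PLANE);
}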

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  | 9 ++---
 drivers/lightnvm/gennvm.c| 5 ++---
 drivers/lightnvm/rrpc.c  | 2 +-
 drivers/lightnvm/sysblk.c| 4 ++--
 drivers/nvme/host/lightnvm.c | 1 +
 include/linux/lightnvm.h | 7 +++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index a2393e1..f752087 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -204,9 +204,9 @@ int nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_submit_io);
 
-int nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blk)
+int nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blk, int flags)
 {
-   return dev->mt->erase_blk(dev, blk, 0);
+   return dev->mt->erase_blk(dev, blk, flags);
 }
 EXPORT_SYMBOL(nvm_erase_blk);
 
@@ -287,7 +287,8 @@ void nvm_free_rqd_ppalist(struct nvm_dev *dev, struct 
nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_free_rqd_ppalist);
 
-int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas)
+int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas,
+   int flags)
 {
struct nvm_rq rqd;
int ret;
@@ -303,6 +304,8 @@ int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr 
*ppas, int nr_ppas)
 
nvm_generic_to_addr_mode(dev, &rqd);
 
+   rqd.flags = flags;
+
ret = dev->ops->erase_block(dev, &rqd);
 
nvm_free_rqd_ppalist(dev, &rqd);
diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index b74174c..730d736 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -593,12 +593,11 @@ static int gen_submit_io(struct nvm_dev *dev, struct 
nvm_rq *rqd)
return dev->ops->submit_io(dev, rqd);
 }
 
-static int gen_erase_blk(struct nvm_dev *dev, struct nvm_block *blk,
-   unsigned long flags)
+static int gen_erase_blk(struct nvm_dev *dev, struct nvm_block *blk, int flags)
 {
struct ppa_addr addr = block_to_ppa(dev, blk);
 
-   return nvm_erase_ppa(dev, &addr, 1);
+   return nvm_erase_ppa(dev, &addr, 1, flags);
 }
 
 static int gen_reserve_lun(struct nvm_dev *dev, int lunid)
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 37fcaad..067e890 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -404,7 +404,7 @@ static void rrpc_block_gc(struct work_struct *work)
if (rrpc_move_valid_pages(rrpc, rblk))
goto put_back;
 
-   if (nvm_erase_blk(dev, rblk->parent))
+   if (nvm_erase_blk(dev, rblk->parent, 0))
goto put_back;
 
rrpc_put_blk(rrpc, rblk);
diff --git a/drivers/lightnvm/sysblk.c b/drivers/lightnvm/sysblk.c
index a75bd28..d229067 100644
--- a/drivers/lightnvm/sysblk.c
+++ b/drivers/lightnvm/sysblk.c
@@ -379,7 +379,7 @@ static int nvm_prepare_new_sysblks(struct nvm_dev *dev, 
struct sysblk_scan *s)
ppa = &s->ppas[scan_ppa_idx(i, nxt_blk)];
ppa->g.pg = ppa_to_slc(dev, 0);
 
-   ret = nvm_erase_ppa(dev, ppa, 1);
+   ret = nvm_erase_ppa(dev, ppa, 1, 0);
if (ret)
return ret;
 
@@ -725,7 +725,7 @@ int nvm_dev_factory(struct nvm_dev *dev, int flags)
/* continue to erase until list of blks until empty */
while ((ppa_cnt =
nvm_fact_get_blks(dev, ppas, max_ppas, blk_bitmap)) > 0)
-   nvm_erase_ppa(dev, ppas, ppa_cnt);
+   nvm_erase_ppa(dev, ppas, ppa_cnt, 0);
 
/* mark host reserved blocks free */
if (flags & NVM_FACTORY_RESET_HOST_BLKS) {
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index f5e3011..9470d51 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -543,6 +543,7 @@ static int nvme_nvm_erase_block(struct nvm_dev *dev, struct 
nvm_rq *rqd)
c.erase.nsid = cpu_to_le32(ns->ns_id);
c.erase.spba = cpu_to_le64(rqd->ppa_addr.ppa);
c.erase.length = cpu_to_le16(rqd->nr_ppas - 1);
+   c.erase.control = cpu_to_le16(rqd->flags);
 
return nvme_submit_sync_cmd(q, (struct nvme_command *)&c, NULL, 0);
 }
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index d190786..d7da953 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -472,8 +472,7 @@ typedef int (nvmm_open_blk_fn)(struct nvm_dev *, struct 
nvm_block *);
 typedef int (nvmm_close_blk_fn)(struct nvm_dev *, struct nvm_block *);
 typedef void (nvmm_flush_blk_fn)(struct nvm_dev *, struct nvm_block *);
 typedef int (nvmm_submit_io_fn)(struct nvm_dev *, struct nvm_rq *);
-typedef int (nvmm_erase_blk_fn)(struct nvm_dev *, struct nvm_block *,
-  

RE: A question regarding "multiple SGL"

2016-10-27 Thread Steve Wise
> > Hi Robert,
> 
> Hey Robert, Christoph,
> 
> > please explain your use cases that isn't handled.  The one and only
> > reason to set MSDBD to 1 is to make the code a lot simpler given that
> > there is no real use case for supporting more.
> >
> > RDMA uses memory registrations to register large and possibly
> > discontiguous data regions for a single rkey, aka single SGL descriptor
> > in NVMe terms.  There would be two reasons to support multiple SGL
> > descriptors:  a) to support a larger I/O size than supported by a single
> > MR, or b) to support a data region format not mappable by a single
> > MR.
> >
> > iSER only supports a single rkey (or stag in IETF terminology) and has
> > been doing fine on a) and mostly fine on b).   There are a few possible
> > data layouts not supported by the traditional IB/iWarp FR WRs, but the
> > limit is in fact exactly the same as imposed by the NVMe PRPs used for
> > PCIe NVMe devices, so the Linux block layer has support to not generate
> > them.  Also with modern Mellanox IB/RoCE hardware we can actually
> > register completely arbitrary SGLs.  iSER supports using this registration
> > mode already with a trivial code addition, but for NVMe we didn't have a
> > pressing need yet.
> 
> Good explanation :)
> 
> The IO transfer size is a bit more pressing on some devices (e.g.
> cxgb3/4) where the number of pages per-MR can be indeed lower than
> a reasonable transfer size (Steve can correct me if I'm wrong).
>

Currently, cxgb4 supports 128KB REG_MR operations on a host with 4K page size,
via a max mr page list depth of 32.  Soon it will be bumped up from 32 to 128
and life will be better...

 
> However, if there is a real demand for this we'll happily accept
> patches :)
> 
> Just a note, having this feature in-place can bring unexpected behavior
> depending on how we implement it:
> - If we can use multiple MRs per IO (for multiple SGLs) we can either
> prepare for the worst-case and allocate enough MRs to satisfy the
> various IO patterns. This will be much heavier in terms of resource
> allocation and can limit the scalability of the host driver.
> - Or we can implement a shared MR pool with a reasonable number of MRs.
> In this case each IO can consume one or more MRs on the expense of
> other IOs. In this case we may need to requeue the IO later when we
> have enough available MRs to satisfy the IO. This can yield some
> unexpected performance gaps for some workloads.
> 

I would like to see the storage protocols deal with lack of resources for the
worst case.  This allows much smaller resource usage for both MRs, and SQ
resources, at the expense of adding flow control logic to deal with lack of
available MR and/or SQ slots to process the next IO.  I think it can be
implemented efficiently such that when in flow-control mode, the code is driving
new IO submissions off of SQ completions which will free up SQ slots and most
likely MRs from the QP's MR pool.

Steve.
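
A very rough sketch of the flow-control idea above (nothing works like this
today; all names are made up): a fixed, shared MR pool per queue pair, with
I/Os that cannot get resources parked and restarted from the send-completion
path once MRs and SQ slots are freed.

struct example_mr_pool {
	spinlock_t		lock;
	struct list_head	free_mrs;	/* pre-allocated, fixed-size pool */
	struct list_head	waiting_reqs;	/* I/Os parked for lack of resources */
};

static bool example_start_io(struct example_mr_pool *pool, struct request *rq)
{
	bool started = false;

	spin_lock(&pool->lock);
	if (!list_empty(&pool->free_mrs)) {
		/* take an MR, build and post the registration + send WRs ... */
		started = true;
	} else {
		/* out of MRs (or SQ slots): park the request for later */
		list_add_tail(&rq->queuelist, &pool->waiting_reqs);
	}
	spin_unlock(&pool->lock);

	return started;
}

/* Called from the SQ completion handler: a freed MR/SQ slot drives new I/O. */
static void example_sq_completion(struct example_mr_pool *pool)
{
	/* return the MR to free_mrs, then restart the head of waiting_reqs */
}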




[PATCH 3/7] lightnvm: manage block list on LUN owner

2016-10-27 Thread Javier González
LUNs can be exclusively owned by a target through the media manager's
reserve_lun function. In this case, the target should implement its own
provisioning and manage internally the free/used/bad block list.

This patch introduces a LUN management structure that can be passed on
to LUN exclusive owners. The media manager is the default owner. On boot
up, it populates the lists based on the device's bad block table
and sysblk metadata. Then, if the LUN is owned by a target, the
management structure is passed on to it. From this moment, the target
is responsible for list management. Thus, since LUNs are managed
strictly by the target, there is no need for the media manager to
reserve GC blocks.

As an FTL, rrpc exclusively owns its LUNs. Therefore, adapt rrpc to use
the new management interface.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c   |  5 ++--
 drivers/lightnvm/gennvm.c | 76 +++
 drivers/lightnvm/gennvm.h | 19 ++--
 drivers/lightnvm/rrpc.c   | 74 ++---
 drivers/lightnvm/rrpc.h   |  4 +++
 include/linux/lightnvm.h  | 21 +
 6 files changed, 145 insertions(+), 54 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index f752087..4be3879 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -178,10 +178,9 @@ static struct nvm_dev *nvm_find_nvm_dev(const char *name)
return NULL;
 }
 
-struct nvm_block *nvm_get_blk(struct nvm_dev *dev, struct nvm_lun *lun,
-   unsigned long flags)
+struct nvm_block *nvm_get_blk(struct nvm_dev *dev, struct nvm_lun *lun)
 {
-   return dev->mt->get_blk(dev, lun, flags);
+   return dev->mt->get_blk(dev, lun);
 }
 EXPORT_SYMBOL(nvm_get_blk);
 
diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index a7e17fa..8bff725 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -243,6 +243,7 @@ static void gen_luns_free(struct nvm_dev *dev)
 static int gen_luns_init(struct nvm_dev *dev, struct gen_dev *gn)
 {
struct gen_lun *lun;
+   struct nvm_lun_mgmt *mgmt;
int i;
 
gn->luns = kcalloc(dev->nr_luns, sizeof(struct gen_lun), GFP_KERNEL);
@@ -250,18 +251,31 @@ static int gen_luns_init(struct nvm_dev *dev, struct 
gen_dev *gn)
return -ENOMEM;
 
gen_for_each_lun(gn, lun, i) {
+   mgmt = kmalloc(sizeof(struct nvm_lun_mgmt), GFP_KERNEL);
+   if (!mgmt)
+   goto free;
+
+   lun->mgmt = mgmt;
+   lun->tgt = NULL;
+
spin_lock_init(&lun->vlun.lock);
-   INIT_LIST_HEAD(&lun->free_list);
-   INIT_LIST_HEAD(&lun->used_list);
-   INIT_LIST_HEAD(&lun->bb_list);
+   INIT_LIST_HEAD(&lun->mgmt->free_list);
+   INIT_LIST_HEAD(&lun->mgmt->used_list);
+   INIT_LIST_HEAD(&lun->mgmt->bb_list);
+   lun->mgmt->nr_free_blocks = dev->blks_per_lun;
 
-   lun->reserved_blocks = 2; /* for GC only */
lun->vlun.id = i;
lun->vlun.lun_id = i % dev->luns_per_chnl;
lun->vlun.chnl_id = i / dev->luns_per_chnl;
-   lun->vlun.nr_free_blocks = dev->blks_per_lun;
+   lun->vlun.priv = NULL;
}
return 0;
+
+free:
+   gen_for_each_lun(gn, lun, i)
+   kfree(lun->mgmt);
+
+   return -ENOMEM;
 }
 
 static int gen_block_bb(struct gen_dev *gn, struct ppa_addr ppa,
@@ -279,12 +293,13 @@ static int gen_block_bb(struct gen_dev *gn, struct 
ppa_addr ppa,
lun = &gn->luns[(dev->luns_per_chnl * ppa.g.ch) + ppa.g.lun];
 
for (i = 0; i < nr_blks; i++) {
-   if (blks[i] == 0)
+   if (blks[i] == NVM_BLK_T_FREE && i > 0)
continue;
 
blk = &lun->vlun.blocks[i];
-   list_move_tail(&blk->list, &lun->bb_list);
-   lun->vlun.nr_free_blocks--;
+   list_move_tail(&blk->list, &lun->mgmt->bb_list);
+   blk->state = NVM_BLK_ST_BAD;
+   lun->mgmt->nr_free_blocks--;
}
 
return 0;
@@ -333,9 +348,9 @@ static int gen_block_map(u64 slba, u32 nlb, __le64 
*entries, void *private)
 * block. It's up to the FTL on top to re-etablish the
 * block state. The block is assumed to be open.
 */
-   list_move_tail(&blk->list, &lun->used_list);
+   list_move_tail(&blk->list, &lun->mgmt->used_list);
blk->state = NVM_BLK_ST_TGT;
-   lun->vlun.nr_free_blocks--;
+   lun->mgmt->nr_free_blocks--;
}
}
 
@@ -371,7 +386,7 @@ static int gen_blocks_init(struct nvm_dev *dev, struct 
gen_dev *gn)
block->lun = &lun->vlun;
block->id = cur_block_id++;
 
-

[PATCH] null_blk: Add notes to use LightNVM

2016-10-27 Thread Yasuaki Ishimatsu

If CONFIG_NVM is disabled, loading the null_blk module with use_lightnvm=1
fails, but there are no messages or documentation related to the failure.

So this patch adds notes on using LightNVM.

Signed-off-by: Yasuaki Ishimatsu 
Cc: Jens Axboe 
---
 Documentation/block/null_blk.txt | 2 +-
 drivers/block/null_blk.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index d8880ca..0365a26 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -72,4 +72,4 @@ use_per_node_hctx=[0/1]: Default: 0
  queue for each CPU node in the system.

 use_lightnvm=[0/1]: Default: 0
-  Register device with LightNVM. Requires blk-mq to be used.
+  Register device with LightNVM. Requires blk-mq and CONFIG_NVM=y to be used.
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index ba6f4a2e..4943ee2 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -577,6 +577,7 @@ static void null_nvm_unregister(struct nullb *nullb)
 #else
 static int null_nvm_register(struct nullb *nullb)
 {
+   pr_err("null_blk: CONFIG_NVM needs to be enabled for LightNVM\n");
return -EINVAL;
 }
 static void null_nvm_unregister(struct nullb *nullb) {}
--
1.8.3.1



[PATCH 2/7] lightnvm: do not decide on device blocks

2016-10-27 Thread Javier González
Device blocks should be marked by the device and considered as bad
blocks by the media manager. Thus, do not make assumptions on which
blocks are going to be used by the device. In doing so we might lose
valid blocks from the free list.

Signed-off-by: Javier González 
---
 drivers/lightnvm/gennvm.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index 730d736..a7e17fa 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -371,12 +371,6 @@ static int gen_blocks_init(struct nvm_dev *dev, struct 
gen_dev *gn)
block->lun = &lun->vlun;
block->id = cur_block_id++;
 
-   /* First block is reserved for device */
-   if (unlikely(lun_iter == 0 && blk_iter == 0)) {
-   lun->vlun.nr_free_blocks--;
-   continue;
-   }
-
list_add_tail(&block->list, &lun->free_list);
}
 
-- 
2.7.4



Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Christoph Hellwig
On Thu, Oct 27, 2016 at 08:41:27PM +0100, Mark Brown wrote:
> Plus the benchmarking to verify that it works well of course, especially
> initially where it'll also be a new queue infrastructure as well as the
> blk-mq conversion itself.  It does feel like something that's going to
> take at least a couple of kernel releases to get through.

Or to put it the other way around:  it could have been done long ago
if people had started it the first time it was suggested.  Instead you guys
keep arguing and nothing gets done.  Get started now, waiting won't
make anything go faster.

> I think there's also value in having improvements there for people who
> benefit from them while queue infrastructure for blk-mq is being worked
> on.  

Well, apply it to your vendor tree then and maintain it yourself if you
disagree with our direction.


[PATCH 0/7] LightNVM patchset V2

2016-10-27 Thread Javier González
V2: Rebase patches on Matias' for-next

Javier González (7):
  lightnvm: enable to send hint to erase command
  lightnvm: do not decide on device blocks
  lightnvm: manage block list on LUN owner
  lightnvm: drop reserve and release LUN callbacks
  lightnvm: export set bad block table
  lightnvm: add ECC error codes
  lightnvm: rrpc: split bios of size > 256kb

 drivers/lightnvm/core.c  |  41 --
 drivers/lightnvm/gennvm.c| 180 +--
 drivers/lightnvm/gennvm.h|  19 ++---
 drivers/lightnvm/rrpc.c  | 128 --
 drivers/lightnvm/rrpc.h  |   4 +
 drivers/lightnvm/sysblk.c|  33 ++--
 drivers/nvme/host/lightnvm.c |   1 +
 include/linux/lightnvm.h |  52 +
 8 files changed, 305 insertions(+), 153 deletions(-)

-- 
2.7.4



Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Ulf Hansson
[...]

>> Instead, what I can tell is that we have been looking into converting mmc
>> (which I maintain), and that is indeed a significant amount of work.
>> We will need to rip out all of the mmc request management, and most
>> likely we also need to extend the blkmq interface - as to be able to
>> do re-implement all the current request optimizations. We are looking
>> into this, but it just takes time.
>
>
> It's usually as much work as you make it into, for most cases it's
> pretty straight forward and usually removes more code than it adds.
> Hence the end result is better for it as well - less code in a driver is
> better.

From a scalability and maintenance point of view, converting to blkmq
makes perfect sense.

Although, I personally don't want to sacrifice performance (or at
least only very little), just for the sake of gaining
scalability/maintainability.

I would rather strive to adapt the blkmq framework to also suit my
needs. Doing so simply takes more time.

For example, in the mmc case we have implemented an asynchronous
request path, which greatly improves performance on some systems.

>
>> I can imagine, that it's not always a straight forward "convert to blk
>> mq" patch for every block device driver.
>
>
> Well, I've actually done a few conversions, and it's not difficult at
> all. The grunt of the work is usually around converting to using some of
> the blk-mq features for parts of the driver that it had implemented
> privately, like timeout handling, etc.
>
> I'm always happy to help people with converting drivers.

Great, we'll ping you if we need some help! Thanks!

>
 3)
 While we work on scheduling in blkmq (at least for single queue
 devices), it's of course important that we set high goals. Having BFQ
 (and the other schedulers) in the legacy blk, provides a good
 reference for what we could aim for.
>>>
>>>
>>>
>>> Sure, but you don't need BFQ to be included in the kernel for that.
>>
>>
>> Perhaps not.
>>
>> But does that mean, you expect Paolo to maintain an up to date BFQ
>> tree for you?
>
>
> I don't expect anything. If Paolo or others want to compare with BFQ on
> the legacy IO path, then they can do that however way they want. If you
> (and others) want to have that reference point, it's up to you how to
> accomplish that.

Do I get this right? You personally don't care about using BFQ as
reference when evolving blkmq for single queue devices?

Paolo and lots of other Linux users certainly do care about this.

Moreover, I am still trying to understand what the big deal is with
saying no to BFQ as a legacy scheduler. Ideally it shouldn't cause
you any maintenance burden and it doesn't make the removal of the
legacy blk layer any more difficult, right?

Kind regards
Ulf Hansson


Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Mark Brown
On Thu, Oct 27, 2016 at 12:21:06PM -0600, Jens Axboe wrote:
> On 10/27/2016 12:13 PM, Ulf Hansson wrote:

> > I can imagine, that it's not always a straight forward "convert to blk
> > mq" patch for every block device driver.

> Well, I've actually done a few conversions, and it's not difficult at
> all. The grunt of the work is usually around converting to using some of
> the blk-mq features for parts of the driver that it had implemented
> privately, like timeout handling, etc.

Plus the benchmarking to verify that it works well of course, especially
initially where it'll also be a new queue infrastructure as well as the
blk-mq conversion itself.  It does feel like something that's going to
take at least a couple of kernel releases to get through.

> > > > 3)
> > > > While we work on scheduling in blkmq (at least for single queue
> > > > devices), it's of course important that we set high goals. Having BFQ
> > > > (and the other schedulers) in the legacy blk, provides a good
> > > > reference for what we could aim for.

> > > Sure, but you don't need BFQ to be included in the kernel for that.

> > Perhaps not.

> > But does that mean, you expect Paolo to maintain an up to date BFQ
> > tree for you?

> I don't expect anything. If Paolo or others want to compare with BFQ on
> the legacy IO path, then they can do that however way they want. If you
> (and others) want to have that reference point, it's up to you how to
> accomplish that.

I think there's also value in having improvements there for people who
benefit from them while queue infrastructure for blk-mq is being worked
on.  




[PATCH 7/7] lightnvm: rrpc: split bios of size > 256kb

2016-10-27 Thread Javier González
rrpc cannot handle bios of size > 256kb due to NVMe using a 64 bit
bitmap to signal I/O completion. If a larger bio comes, split it
explicitly.

Signed-off-by: Javier González 
---
 drivers/lightnvm/rrpc.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 8deef2e..0b8251f 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -984,6 +984,12 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, 
struct bio *bio)
struct nvm_rq *rqd;
int err;
 
+   /*
+* Multipage is supported up until 256kb due to NVME's 64 bit completion
+* bitmap.
+*/
+   blk_queue_split(q, &bio, q->bio_split);
+
if (bio_op(bio) == REQ_OP_DISCARD) {
rrpc_discard(rrpc, bio);
return BLK_QC_T_NONE;
-- 
2.7.4



[PATCH 6/7] lightnvm: add ECC error codes

2016-10-27 Thread Javier González
Add ECC error codes to enable the appropriate handling in the target.

Signed-off-by: Javier González 
---
 include/linux/lightnvm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index e3ccaff..33643ae 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -107,6 +107,8 @@ enum {
NVM_RSP_NOT_CHANGEABLE  = 0x1,
NVM_RSP_ERR_FAILWRITE   = 0x40ff,
NVM_RSP_ERR_EMPTYPAGE   = 0x42ff,
+   NVM_RSP_ERR_FAILECC = 0x4281,
+   NVM_RSP_WARN_HIGHECC= 0x4700,
 
/* Device opcodes */
NVM_OP_HBREAD   = 0x02,
-- 
2.7.4



[PATCH 3/7] lightnvm: manage block list on LUN owner

2016-10-27 Thread Javier González
From: Javier González 

LUNs can be exclusively owned by a target through the media manager's
reserve_lun function. In this case, the target should implement its own
provisioning and manage internally the free/used/bad block list.

This patch introduces a LUN management structure that can be passed on
to LUN exclusive owners. The media manager is the default owner. On boot
up, it populates the lists based on the device's bad block table
and sysblk metadata. Then, if the LUN is owned by a target, the
management structure is passed on to it. From this moment, the target
is responsible for list management. Thus, since LUNs are managed
strictly by the target, there is no need for the media manager to
reserve GC blocks.

As an FTL, rrpc exclusively owns its LUNs. Therefore, adapt rrpc to use
the new management interface.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c   |  5 ++--
 drivers/lightnvm/gennvm.c | 76 +++
 drivers/lightnvm/gennvm.h | 19 ++--
 drivers/lightnvm/rrpc.c   | 74 ++---
 drivers/lightnvm/rrpc.h   |  4 +++
 include/linux/lightnvm.h  | 21 +
 6 files changed, 145 insertions(+), 54 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 1a6a7e6..4c6577e 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -178,10 +178,9 @@ static struct nvm_dev *nvm_find_nvm_dev(const char *name)
return NULL;
 }
 
-struct nvm_block *nvm_get_blk(struct nvm_dev *dev, struct nvm_lun *lun,
-   unsigned long flags)
+struct nvm_block *nvm_get_blk(struct nvm_dev *dev, struct nvm_lun *lun)
 {
-   return dev->mt->get_blk(dev, lun, flags);
+   return dev->mt->get_blk(dev, lun);
 }
 EXPORT_SYMBOL(nvm_get_blk);
 
diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index a7e17fa..8bff725 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -243,6 +243,7 @@ static void gen_luns_free(struct nvm_dev *dev)
 static int gen_luns_init(struct nvm_dev *dev, struct gen_dev *gn)
 {
struct gen_lun *lun;
+   struct nvm_lun_mgmt *mgmt;
int i;
 
gn->luns = kcalloc(dev->nr_luns, sizeof(struct gen_lun), GFP_KERNEL);
@@ -250,18 +251,31 @@ static int gen_luns_init(struct nvm_dev *dev, struct 
gen_dev *gn)
return -ENOMEM;
 
gen_for_each_lun(gn, lun, i) {
+   mgmt = kmalloc(sizeof(struct nvm_lun_mgmt), GFP_KERNEL);
+   if (!mgmt)
+   goto free;
+
+   lun->mgmt = mgmt;
+   lun->tgt = NULL;
+
spin_lock_init(&lun->vlun.lock);
-   INIT_LIST_HEAD(&lun->free_list);
-   INIT_LIST_HEAD(&lun->used_list);
-   INIT_LIST_HEAD(&lun->bb_list);
+   INIT_LIST_HEAD(&lun->mgmt->free_list);
+   INIT_LIST_HEAD(&lun->mgmt->used_list);
+   INIT_LIST_HEAD(&lun->mgmt->bb_list);
+   lun->mgmt->nr_free_blocks = dev->blks_per_lun;
 
-   lun->reserved_blocks = 2; /* for GC only */
lun->vlun.id = i;
lun->vlun.lun_id = i % dev->luns_per_chnl;
lun->vlun.chnl_id = i / dev->luns_per_chnl;
-   lun->vlun.nr_free_blocks = dev->blks_per_lun;
+   lun->vlun.priv = NULL;
}
return 0;
+
+free:
+   gen_for_each_lun(gn, lun, i)
+   kfree(lun->mgmt);
+
+   return -ENOMEM;
 }
 
 static int gen_block_bb(struct gen_dev *gn, struct ppa_addr ppa,
@@ -279,12 +293,13 @@ static int gen_block_bb(struct gen_dev *gn, struct 
ppa_addr ppa,
lun = &gn->luns[(dev->luns_per_chnl * ppa.g.ch) + ppa.g.lun];
 
for (i = 0; i < nr_blks; i++) {
-   if (blks[i] == 0)
+   if (blks[i] == NVM_BLK_T_FREE && i > 0)
continue;
 
blk = &lun->vlun.blocks[i];
-   list_move_tail(&blk->list, &lun->bb_list);
-   lun->vlun.nr_free_blocks--;
+   list_move_tail(&blk->list, &lun->mgmt->bb_list);
+   blk->state = NVM_BLK_ST_BAD;
+   lun->mgmt->nr_free_blocks--;
}
 
return 0;
@@ -333,9 +348,9 @@ static int gen_block_map(u64 slba, u32 nlb, __le64 
*entries, void *private)
 * block. It's up to the FTL on top to re-etablish the
 * block state. The block is assumed to be open.
 */
-   list_move_tail(&blk->list, &lun->used_list);
+   list_move_tail(&blk->list, &lun->mgmt->used_list);
blk->state = NVM_BLK_ST_TGT;
-   lun->vlun.nr_free_blocks--;
+   lun->mgmt->nr_free_blocks--;
}
}
 
@@ -371,7 +386,7 @@ static int gen_blocks_init(struct nvm_dev *dev, struct 
gen_dev *gn)
block->lun = &lun->vlun;
   

[PATCH 4/7] lightnvm: drop reserve and release LUN callbacks

2016-10-27 Thread Javier González
From: Javier González 

On target initialization, targets use callbacks to the media manager to
configure the LUNs they use. In order to simplify the flow, drop these
callbacks and manage everything internally in the media manager.

By making use of the newly introduced LUN management structure, the media
manager knows which target exclusively owns each LUN and can
therefore allocate and free all the necessary structures before
initializing the target. LUNs that are not exclusively owned belong to the media
manager in any case.

Adapt rrpc to not use the reserve_lun/release_lun callback functions.

Signed-off-by: Javier González 
---
 drivers/lightnvm/gennvm.c | 68 ---
 drivers/lightnvm/rrpc.c   | 12 +
 include/linux/lightnvm.h  |  5 +---
 3 files changed, 55 insertions(+), 30 deletions(-)

diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index 8bff725..a340685 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -35,6 +35,50 @@ static const struct block_device_operations gen_fops = {
.owner  = THIS_MODULE,
 };
 
+static int gen_reserve_luns(struct nvm_dev *dev, int lun_begin, int lun_end,
+   struct nvm_target *t)
+{
+   struct gen_dev *gn = dev->mp;
+   struct gen_lun *lun;
+   int i;
+
+   for (i = lun_begin; i <= lun_end; i++) {
+   if (test_and_set_bit(i, dev->lun_map)) {
+   pr_err("gennvm: lun %d is already allocated\n", i);
+   goto fail;
+   }
+
+   lun = &gn->luns[i];
+   lun->tgt = t;
+   lun->vlun.priv = lun->mgmt;
+   }
+
+   return 0;
+fail:
+   while (--i > lun_begin)
+   clear_bit(i, dev->lun_map);
+
+   return 1;
+}
+
+static void gen_release_luns(struct nvm_dev *dev, struct nvm_target *t)
+{
+   struct gen_dev *gn = dev->mp;
+   struct gen_lun *lun;
+   int lunid;
+   int i;
+
+   gen_for_each_lun(gn, lun, i) {
+   if (lun->tgt != t)
+   continue;
+
+   lunid = lun->vlun.id;
+   WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+   lun->vlun.priv = NULL;
+   lun->tgt = NULL;
+   }
+}
+
 static int gen_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 {
struct gen_dev *gn = dev->mp;
@@ -80,6 +124,9 @@ static int gen_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
tdisk->fops = &gen_fops;
tdisk->queue = tqueue;
 
+   if (tt->exclusive && gen_reserve_luns(dev, s->lun_begin, s->lun_end, t))
+   goto err_reserve;
+
targetdata = tt->init(dev, tdisk, s->lun_begin, s->lun_end);
if (IS_ERR(targetdata))
goto err_init;
@@ -102,6 +149,8 @@ static int gen_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
 
return 0;
 err_init:
+   gen_release_luns(dev, t);
+err_reserve:
put_disk(tdisk);
 err_queue:
blk_cleanup_queue(tqueue);
@@ -110,7 +159,7 @@ static int gen_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
return -ENOMEM;
 }
 
-static void __gen_remove_target(struct nvm_target *t)
+static void __gen_remove_target(struct nvm_dev *dev, struct nvm_target *t)
 {
struct nvm_tgt_type *tt = t->type;
struct gendisk *tdisk = t->disk;
@@ -122,6 +171,7 @@ static void __gen_remove_target(struct nvm_target *t)
if (tt->exit)
tt->exit(tdisk->private_data);
 
+   gen_release_luns(dev, t);
put_disk(tdisk);
 
list_del(>list);
@@ -152,7 +202,7 @@ static int gen_remove_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_remove *remove)
mutex_unlock(&gn->lock);
return 1;
}
-   __gen_remove_target(t);
+   __gen_remove_target(dev, t);
mutex_unlock(&gn->lock);
 
return 0;
@@ -474,7 +524,7 @@ static void gen_unregister(struct nvm_dev *dev)
list_for_each_entry_safe(t, tmp, &gn->targets, list) {
if (t->dev != dev)
continue;
-   __gen_remove_target(t);
+   __gen_remove_target(dev, t);
}
mutex_unlock(&gn->lock);
 
@@ -618,16 +668,6 @@ static int gen_erase_blk(struct nvm_dev *dev, struct 
nvm_block *blk, int flags)
return nvm_erase_ppa(dev, &addr, 1, flags);
 }
 
-static int gen_reserve_lun(struct nvm_dev *dev, int lunid)
-{
-   return test_and_set_bit(lunid, dev->lun_map);
-}
-
-static void gen_release_lun(struct nvm_dev *dev, int lunid)
-{
-   WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
-}
-
 static struct nvm_lun *gen_get_lun(struct nvm_dev *dev, int lunid)
 {
struct gen_dev *gn = dev->mp;
@@ -674,8 +714,6 @@ static struct nvmm_type gen = {
.mark_blk   = gen_mark_blk,
 
.get_lun= gen_get_lun,
-   

[PATCH 1/7] lightnvm: enable to send hint to erase command

2016-10-27 Thread Javier González
Erases might be subject to host hints. An example is multi-plane
programming to erase blocks in parallel. Enable targets to specify these
hints.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  | 9 ++---
 drivers/lightnvm/gennvm.c| 5 ++---
 drivers/lightnvm/rrpc.c  | 2 +-
 drivers/lightnvm/sysblk.c| 4 ++--
 drivers/nvme/host/lightnvm.c | 1 +
 include/linux/lightnvm.h | 7 +++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 1cac0f8..1a6a7e6 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -204,9 +204,9 @@ int nvm_submit_io(struct nvm_dev *dev, struct nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_submit_io);
 
-int nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blk)
+int nvm_erase_blk(struct nvm_dev *dev, struct nvm_block *blk, int flags)
 {
-   return dev->mt->erase_blk(dev, blk, 0);
+   return dev->mt->erase_blk(dev, blk, flags);
 }
 EXPORT_SYMBOL(nvm_erase_blk);
 
@@ -287,7 +287,8 @@ void nvm_free_rqd_ppalist(struct nvm_dev *dev, struct 
nvm_rq *rqd)
 }
 EXPORT_SYMBOL(nvm_free_rqd_ppalist);
 
-int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas)
+int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr *ppas, int nr_ppas,
+   int flags)
 {
struct nvm_rq rqd;
int ret;
@@ -303,6 +304,8 @@ int nvm_erase_ppa(struct nvm_dev *dev, struct ppa_addr 
*ppas, int nr_ppas)
 
nvm_generic_to_addr_mode(dev, &rqd);
 
+   rqd.flags = flags;
+
ret = dev->ops->erase_block(dev, &rqd);
 
nvm_free_rqd_ppalist(dev, &rqd);
diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index b74174c..730d736 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -593,12 +593,11 @@ static int gen_submit_io(struct nvm_dev *dev, struct 
nvm_rq *rqd)
return dev->ops->submit_io(dev, rqd);
 }
 
-static int gen_erase_blk(struct nvm_dev *dev, struct nvm_block *blk,
-   unsigned long flags)
+static int gen_erase_blk(struct nvm_dev *dev, struct nvm_block *blk, int flags)
 {
struct ppa_addr addr = block_to_ppa(dev, blk);
 
-   return nvm_erase_ppa(dev, &addr, 1);
+   return nvm_erase_ppa(dev, &addr, 1, flags);
 }
 
 static int gen_reserve_lun(struct nvm_dev *dev, int lunid)
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 37fcaad..067e890 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -404,7 +404,7 @@ static void rrpc_block_gc(struct work_struct *work)
if (rrpc_move_valid_pages(rrpc, rblk))
goto put_back;
 
-   if (nvm_erase_blk(dev, rblk->parent))
+   if (nvm_erase_blk(dev, rblk->parent, 0))
goto put_back;
 
rrpc_put_blk(rrpc, rblk);
diff --git a/drivers/lightnvm/sysblk.c b/drivers/lightnvm/sysblk.c
index a75bd28..d229067 100644
--- a/drivers/lightnvm/sysblk.c
+++ b/drivers/lightnvm/sysblk.c
@@ -379,7 +379,7 @@ static int nvm_prepare_new_sysblks(struct nvm_dev *dev, 
struct sysblk_scan *s)
ppa = &s->ppas[scan_ppa_idx(i, nxt_blk)];
ppa->g.pg = ppa_to_slc(dev, 0);
 
-   ret = nvm_erase_ppa(dev, ppa, 1);
+   ret = nvm_erase_ppa(dev, ppa, 1, 0);
if (ret)
return ret;
 
@@ -725,7 +725,7 @@ int nvm_dev_factory(struct nvm_dev *dev, int flags)
/* continue to erase until list of blks until empty */
while ((ppa_cnt =
nvm_fact_get_blks(dev, ppas, max_ppas, blk_bitmap)) > 0)
-   nvm_erase_ppa(dev, ppas, ppa_cnt);
+   nvm_erase_ppa(dev, ppas, ppa_cnt, 0);
 
/* mark host reserved blocks free */
if (flags & NVM_FACTORY_RESET_HOST_BLKS) {
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index f5e3011..9470d51 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -543,6 +543,7 @@ static int nvme_nvm_erase_block(struct nvm_dev *dev, struct 
nvm_rq *rqd)
c.erase.nsid = cpu_to_le32(ns->ns_id);
c.erase.spba = cpu_to_le64(rqd->ppa_addr.ppa);
c.erase.length = cpu_to_le16(rqd->nr_ppas - 1);
+   c.erase.control = cpu_to_le16(rqd->flags);
 
return nvme_submit_sync_cmd(q, (struct nvme_command *)&c, NULL, 0);
 }
diff --git a/include/linux/lightnvm.h b/include/linux/lightnvm.h
index d190786..d7da953 100644
--- a/include/linux/lightnvm.h
+++ b/include/linux/lightnvm.h
@@ -472,8 +472,7 @@ typedef int (nvmm_open_blk_fn)(struct nvm_dev *, struct 
nvm_block *);
 typedef int (nvmm_close_blk_fn)(struct nvm_dev *, struct nvm_block *);
 typedef void (nvmm_flush_blk_fn)(struct nvm_dev *, struct nvm_block *);
 typedef int (nvmm_submit_io_fn)(struct nvm_dev *, struct nvm_rq *);
-typedef int (nvmm_erase_blk_fn)(struct nvm_dev *, struct nvm_block *,
-  
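As an illustration of how a target could use the new flags argument, a
caller might look roughly like the lines below, mirroring the rrpc_block_gc()
call site changed above. Treating NVM_IO_DUAL_ACCESS as an erase hint is
only an assumption for the example; the patch itself just plumbs the flags
through to the erase command:

    /* Hypothetical: request a two-plane erase (the flag choice is an
     * assumption, not something defined by this patch). */
    if (nvm_erase_blk(dev, rblk->parent, NVM_IO_DUAL_ACCESS))
            goto put_back;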

Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Jens Axboe

On 10/27/2016 01:34 PM, Ulf Hansson wrote:

[...]


Instead, what I can tell is that we have been looking into converting mmc
(which I maintain), and that is indeed a significant amount of work.
We will need to rip out all of the mmc request management, and most
likely we also need to extend the blkmq interface so as to be able to
re-implement all the current request optimizations. We are looking
into this, but it just takes time.



It's usually as much work as you make it into; for most cases it's
pretty straightforward and usually removes more code than it adds.
Hence the end result is better for it as well - less code in a driver is
better.


From a scalability and maintenance point of view, converting to blkmq
makes perfect sense.

Although I personally don't want to sacrifice performance (or at most
very little), just for the sake of gaining
scalability/maintainability.


Nobody has said anything about sacrificing performance. And whether you
like it or not, maintainability is always the most important aspect.
Even performance takes a backseat to maintainability.


I would rather strive to adapt the blkmq framework to also suit my
needs. That simply takes more time.

For example, in the mmc case we have implemented an asynchronous
request path, which greatly improves performance on some systems.


blk-mq has evolved to support a variety of devices, there's nothing
special about mmc that can't work well within that framework.


3)
While we work on scheduling in blkmq (at least for single queue
devices), it's of course important that we set high goals. Having BFQ
(and the other schedulers) in the legacy blk, provides a good
reference for what we could aim for.




Sure, but you don't need BFQ to be included in the kernel for that.



Perhaps not.

But does that mean, you expect Paolo to maintain an up to date BFQ
tree for you?



I don't expect anything. If Paolo or others want to compare with BFQ on
the legacy IO path, then they can do that however way they want. If you
(and others) want to have that reference point, it's up to you how to
accomplish that.


Do I get this right? You personally don't care about using BFQ as
reference when evolving blkmq for single queue devices?

Paolo and lots of other Linux users certainly do care about this.


I'm getting a little tired of this putting words in my mouth... That is
not what I'm saying at all. What I'm saying is that the people working
on BFQ can do what they need to do to have a reference implementation to
compare against. You don't need BFQ in the kernel for that. I said it's
up to YOU, with the you here meaning the people that want to work on it,
how that goes down.


Moreover, I am still trying to understand what the big deal is with
saying no to BFQ as a legacy scheduler. Ideally it shouldn't cause
you any maintenance burden and it doesn't make the removal of the
legacy blk layer any more difficult, right?


Not sure I can state it much clearer. It's a new scheduler, and a
complicated one at that. It WILL carry a maintenance burden. And I'm
really not that interested in adding such a burden for something that
will be defunct as soon as the single queue blk-mq version is done.
Additionally, if we put BFQ in right now, the motivation to do the real
work will be gone.

The path forward is clear. It'd be a lot better to put some work behind
that, rather than continue this email thread.

--
Jens Axboe



Re: [PATCH 07/12] dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-27 Thread Johannes Thumshirn
On Wed, Oct 26, 2016 at 03:54:07PM -0700, Bart Van Assche wrote:
> Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
> in the dm start and stop queue functions, only manipulate the latter
> flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
> 
> Signed-off-by: Bart Van Assche 
> Reviewed-by: Christoph Hellwig 

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 04/12] blk-mq: Move more code into blk_mq_direct_issue_request()

2016-10-27 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 


Re: [PATCH 11/12] nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-27 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Jens Axboe

On 10/27/2016 08:34 AM, Grozdan wrote:

On Thu, Oct 27, 2016 at 11:26 AM, Jan Kara  wrote:

On Wed 26-10-16 10:12:38, Jens Axboe wrote:

On 10/26/2016 10:04 AM, Paolo Valente wrote:



On 26 Oct 2016, at 17:32, Jens Axboe wrote:

On 10/26/2016 09:29 AM, Christoph Hellwig wrote:

On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:

The question to ask first is whether to actually have pluggable
schedulers on blk-mq at all, or just have one that is meant to
do the right thing in every case (and possibly can be bypassed
completely).


That would be my preference.  Have a BFQ-variant for blk-mq as an
option (default to off unless opted in by the driver or user), and
no other scheduler for blk-mq.  Don't bother with bfq for non
blk-mq.  It's not like there is any advantage in the legacy-request
device even for slow devices, except for the option of having I/O
scheduling.


It's the only right way forward. blk-mq might not offer any substantial
advantages to rotating storage, but with scheduling, it won't offer a
downside either. And it'll take us towards the real goal, which is to
have just one IO path.


ok


Adding a new scheduler for the legacy IO path
makes no sense.


I would fully agree if effective and stable I/O scheduling would be
available in blk-mq in one or two months.  But I guess that it will
take at least one year optimistically, given the current status of the
needed infrastructure, and given the great difficulties of doing
effective scheduling at the high parallelism and extreme target speeds
of blk-mq.  Of course, this holds true unless little clever scheduling
is performed.

So, what's the point in forcing a lot of users wait another year or
more, for a solution that has yet to be even defined, while they could
enjoy a much better system, and then switch an even better system when
scheduling is ready in blk-mq too?


That same argument could have been made 2 years ago. Saying no to a new
scheduler for the legacy framework goes back roughly that long. We could
have had BFQ for mq NOW, if we didn't keep coming back to this very
point.

I'm hesitant to add a new scheduler because it's very easy to add, very
difficult to get rid of. If we do add BFQ as a legacy scheduler now,
it'll take us years and years to get rid of it again. We should be
moving towards LESS moving parts in the legacy path, not more.

We can keep having this discussion every few years, but I think we'd
both prefer to make some actual progress here. It's perfectly fine to
add an interface for a single queue interface for an IO scheduler for
blk-mq, since we don't care too much about scalability there. And that
won't take years, that should be a few weeks. Retrofitting BFQ on top of
that should not be hard either. That can co-exist with a real multiqueue
scheduler as well, something that's geared towards some fairness for
faster devices.


OK, so some solution like having a variant of blk_sq_make_request() that
will consume requests, do IO scheduling decisions on them, and feed them
into the HW queue as it sees fit would be acceptable? That will provide the
IO scheduler a global view that it needs for complex scheduling decisions
so it should indeed be relatively easy to port BFQ to work like that.

Honza
--
Jan Kara 
SUSE Labs, CR



Hello,

Let me first say that I'm in no way associated with Paolo Valente or
any other BFQ developer. I'm a mere user who has had great experience
using BFQ

My workload is one that takes my disks to their limits. I often use
large files like raw Blu-ray streams which then I remux to mkv's while
at the same time streaming at least 2 movies to various devices in
house and using my system as I do while the remuxing process is going
on. At times, I'm also pushing video files to my NAS at close to Gbps
speed while the stuff I mentioned is in progress

My experience with BFQ is that it has never resulted in the video
streams being interrupted due to disk thrashing. I've extensively used
all the other Linux disk schedulers in the past and what I've observed
is that whenever I start the remuxing (and copying) process, the
streams will begin to hiccup, stutter, and often multi-second-long
"waits" will occur. It gets even worse: when I do this kind of
workload, the whole system will come to almost a halt and
interactivity goes out the window. Impossible to start an app in a
reasonable amount of time. Loading a visited website makes Chrome hang
while trying to get the contents from its cache, etc

BFQ has greatly helped to have a responsive system during such
operations and as I said, I have never experience any interruption of
the video streams. Do I think BFQ is the best 

Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Jens Axboe

On 10/27/2016 03:26 AM, Jan Kara wrote:

On Wed 26-10-16 10:12:38, Jens Axboe wrote:

On 10/26/2016 10:04 AM, Paolo Valente wrote:



On 26 Oct 2016, at 17:32, Jens Axboe wrote:

On 10/26/2016 09:29 AM, Christoph Hellwig wrote:

On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:

The question to ask first is whether to actually have pluggable
schedulers on blk-mq at all, or just have one that is meant to
do the right thing in every case (and possibly can be bypassed
completely).


That would be my preference.  Have a BFQ-variant for blk-mq as an
option (default to off unless opted in by the driver or user), and
no other scheduler for blk-mq.  Don't bother with bfq for non
blk-mq.  It's not like there is any advantage in the legacy-request
device even for slow devices, except for the option of having I/O
scheduling.


It's the only right way forward. blk-mq might not offer any substantial
advantages to rotating storage, but with scheduling, it won't offer a
downside either. And it'll take us towards the real goal, which is to
have just one IO path.


ok


Adding a new scheduler for the legacy IO path
makes no sense.


I would fully agree if effective and stable I/O scheduling would be
available in blk-mq in one or two months.  But I guess that it will
take at least one year optimistically, given the current status of the
needed infrastructure, and given the great difficulties of doing
effective scheduling at the high parallelism and extreme target speeds
of blk-mq.  Of course, this holds true unless little clever scheduling
is performed.

So, what's the point in forcing a lot of users wait another year or
more, for a solution that has yet to be even defined, while they could
enjoy a much better system, and then switch an even better system when
scheduling is ready in blk-mq too?


That same argument could have been made 2 years ago. Saying no to a new
scheduler for the legacy framework goes back roughly that long. We could
have had BFQ for mq NOW, if we didn't keep coming back to this very
point.

I'm hesitant to add a new scheduler because it's very easy to add, very
difficult to get rid of. If we do add BFQ as a legacy scheduler now,
it'll take us years and years to get rid of it again. We should be
moving towards LESS moving parts in the legacy path, not more.

We can keep having this discussion every few years, but I think we'd
both prefer to make some actual progress here. It's perfectly fine to
add an interface for a single queue interface for an IO scheduler for
blk-mq, since we don't care too much about scalability there. And that
won't take years, that should be a few weeks. Retrofitting BFQ on top of
that should not be hard either. That can co-exist with a real multiqueue
scheduler as well, something that's geared towards some fairness for
faster devices.


OK, so some solution like having a variant of blk_sq_make_request() that
will consume requests, do IO scheduling decisions on them, and feed them
into the HW queue as it sees fit would be acceptable? That will provide the
IO scheduler a global view that it needs for complex scheduling decisions
so it should indeed be relatively easy to port BFQ to work like that.


I'd probably start off of Omar's base [1], which switches the software queues
to store bios instead of requests, since that lifts the 1:1
mapping between what we can queue up and what we can dispatch. Without
that, the IO scheduler won't have too much to work with. And with that
in place, it'll be a "bio in, request out" type of setup, which is
similar to what we have in the legacy path.

I'd keep the software queues, but as a starting point, mandate 1
hardware queue to keep that as the per-device view of the state. The IO
scheduler would be responsible for moving one or more bios from the
software queues to the hardware queue, when they are ready to dispatch.

[1] 
https://github.com/osandov/linux/commit/8ef3508628b6cf7c4712cd3d8084ee11ef5d2530
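
To make the shape of that concrete, here is a minimal sketch of the dispatch
side under those assumptions (bios parked in the software queues, a single
hardware queue, the scheduler deciding when to build requests). Every helper
name below is a placeholder rather than existing blk-mq API:

    /*
     * Sketch only: the sched_* helpers and per-hctx scheduler data are
     * placeholders, not existing blk-mq interfaces.  The point is the
     * "bio in, request out" flow: the scheduler keeps bios under its own
     * control and only turns them into requests at dispatch time.
     */
    static void sched_dispatch(struct blk_mq_hw_ctx *hctx)
    {
            struct sched_data *sd = hctx_sched_data(hctx);  /* placeholder */
            struct bio *bio;

            while ((bio = sched_pick_next_bio(sd))) {       /* policy decision */
                    struct request *rq;

                    rq = sched_bio_to_request(hctx, bio);   /* alloc tag + init */
                    if (!rq) {
                            sched_push_back_bio(sd, bio);   /* out of tags, retry later */
                            break;
                    }
                    sched_queue_to_hw(hctx, rq);    /* feed the ->queue_rq() path */
            }
    }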


--
Jens Axboe



Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Ulf Hansson
[...]

>
> I'm hesistant to add a new scheduler because it's very easy to add, very
> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> it'll take us years and years to get rid of it again. We should be
> moving towards LESS moving parts in the legacy path, not more.

Jens, I think you are wrong here and let me try to elaborate on why.

1)
We already have legacy schedulers like CFQ, DEADLINE, etc - and most
block device drivers are still using the legacy blk interface.

To be able to remove the legacy blk layer, all block device drivers
must be converted to blkmq - of course.

So to reach that goal, we will not only need to evolve blkmq to allow
scheduling (at least for single queue devices), but we also need to
convert *all* block device drivers to blkmq. For sure this will take
*years* and not months.

More important, when the transition to blkmq has been completed, then
there is absolutely no difference (from effort point of view) in
removing the legacy blk layer - no matter if we have BFQ in there or
not.

I do understand if you have concern from maintenance point of view, as
I assume you would rather focus on evolving blkmq, than care about
legacy blk code. So, would it help if Paolo volunteers to maintain the
BFQ code in the meantime?

2)
While we work on evolving blkmq and convert block device drivers to
it, BFQ could, as a separate legacy scheduler, help *lots* of Linux
users get a significantly improved experience. Should we really
prevent them from that? I think you block maintainer guys really need
to consider this fact.

3)
While we work on scheduling in blkmq (at least for single queue
devices), it's of course important that we set high goals. Having BFQ
(and the other schedulers) in the legacy blk, provides a good
reference for what we could aim for.

>
> We can keep having this discussion every few years, but I think we'd
> both prefer to make some actual progress here. It's perfectly fine to
> add an interface for a single queue interface for an IO scheduler for
> blk-mq, since we don't care too much about scalability there. And that
> won't take years, that should be a few weeks. Retrofitting BFQ on top of
> that should not be hard either. That can co-exist with a real multiqueue
> scheduler as well, something that's geared towards some fairness for
> faster devices.

That's really great news!

I hope we get a possibility to meet and discuss the plans for this at
Kernel summit/Linux Plumbers the next week!

>
> --
> Jens Axboe

Kind regards
Ulf Hansson


Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Jens Axboe

On 10/27/2016 11:32 AM, Ulf Hansson wrote:

[...]



I'm hesistant to add a new scheduler because it's very easy to add, very
difficult to get rid of. If we do add BFQ as a legacy scheduler now,
it'll take us years and years to get rid of it again. We should be
moving towards LESS moving parts in the legacy path, not more.


Jens, I think you are wrong here and let me try to elaborate on why.

1)
We already have legacy schedulers like CFQ, DEADLINE, etc - and most
block device drivers are still using the legacy blk interface.


I don't think that's an accurate statement. In terms of coverage, most
drivers do support blk-mq. Anything SCSI, nvme, virtio-blk, SATA runs on
(or can run on) top of blk-mq.


To be able to remove the legacy blk layer, all block device drivers
must be converted to blkmq - of course.


That's a given.


So to reach that goal, we will not only need to evolve blkmq to allow
scheduling (at least for single queue devices), but we also need to
convert *all* block device drivers to blkmq. For sure this will take
*years* and not months.


Correct.


More important, when the transition to blkmq has been completed, then
there is absolutely no difference (from effort point of view) in
removing the legacy blk layer - no matter if we have BFQ in there or
not.

I do understand if you have concern from maintenance point of view, as
I assume you would rather focus on evolving blkmq, than care about
legacy blk code. So, would it help if Paolo volunteers to maintain the
BFQ code in the meantime?


We're obviously still maintaining the legacy IO path. But we don't want
to actively develop it, and we haven't, for a long time.

And Paolo maintaining it is a strict requirement for inclusion, legacy
or blk-mq aside. That would go for both. I'd never accept a major
feature from an individual or company if they weren't willing and
capable of maintaining it. Throwing submissions over the wall is not
viable.


2)
While we work on evolving blkmq and convert block device drivers to
it, BFQ could, as a separate legacy scheduler, help *lots* of Linux
users get a significantly improved experience. Should we really
prevent them from that? I think you block maintainer guys really need
to consider this fact.


You still seem to be basing that assumption on the notion that we have
to convert tons of drivers for BFQ to make sense under the blk-mq
umbrella. That's not the case.


3)
While we work on scheduling in blkmq (at least for single queue
devices), it's of course important that we set high goals. Having BFQ
(and the other schedulers) in the legacy blk, provides a good
reference for what we could aim for.


Sure, but you don't need BFQ to be included in the kernel for that.


We can keep having this discussion every few years, but I think we'd
both prefer to make some actual progress here. It's perfectly fine to
add an interface for a single queue interface for an IO scheduler for
blk-mq, since we don't care too much about scalability there. And that
won't take years, that should be a few weeks. Retrofitting BFQ on top of
that should not be hard either. That can co-exist with a real multiqueue
scheduler as well, something that's geared towards some fairness for
faster devices.


That's really great news!

I hope we get a possibility to meet and discuss the plans for this at
Kernel summit/Linux Plumbers the next week!


I'll be there.

--
Jens Axboe



Re: [PATCH 02/12] blk-mq: Introduce blk_mq_hctx_stopped()

2016-10-27 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: A question regarding "multiple SGL"

2016-10-27 Thread Sagi Grimberg



Hi Robert,


Hey Robert, Christoph,


please explain your use cases that aren't handled.  The one and only
reason to set MSDBD to 1 is to make the code a lot simpler given that
there is no real use case for supporting more.

RDMA uses memory registrations to register large and possibly
discontiguous data regions for a single rkey, aka single SGL descriptor
in NVMe terms.  There would be two reasons to support multiple SGL
descriptors:  a) to support a larger I/O size than supported by a single
MR, or b) to support a data region format not mappable by a single
MR.

iSER only supports a single rkey (or stag in IETF terminology) and has
been doing fine on a) and mostly fine on b).   There are a few possible
data layouts not supported by the traditional IB/iWarp FR WRs, but the
limit is in fact exactly the same as imposed by the NVMe PRPs used for
PCIe NVMe devices, so the Linux block layer has support to not generate
them.  Also with modern Mellanox IB/RoCE hardware we can actually
register completely arbitrary SGLs.  iSER supports using this registration
mode already with a trivial code addition, but for NVMe we didn't have a
pressing need yet.


Good explanation :)

The IO transfer size is a bit more pressing on some devices (e.g.
cxgb3/4) where the number of pages per-MR can be indeed lower than
a reasonable transfer size (Steve can correct me if I'm wrong).

However, if there is a real demand for this we'll happily accept
patches :)

Just a note, having this feature in-place can bring unexpected behavior
depending on how we implement it:
- If we can use multiple MRs per IO (for multiple SGLs) we can either
prepare for the worst-case and allocate enough MRs to satisfy the
various IO patterns. This will be much heavier in terms of resource
allocation and can limit the scalability of the host driver.
- Or we can implement a shared MR pool with a reasonable number of MRs.
In this case each IO can consume one or more MRs at the expense of
other IOs, and we may need to requeue the IO later when we
have enough available MRs to satisfy it. This can yield some
unexpected performance gaps for some workloads (a rough sketch of this
option follows below).
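
A rough sketch of the shared-pool option (all names are made up for
illustration; MR allocation and the actual .queue_rq() plumbing are
omitted):

    struct mr_pool {
            spinlock_t              lock;
            struct list_head        free;   /* pre-allocated MRs */
    };

    /* Take nr_mrs MRs for one I/O, all or nothing. */
    static int mr_pool_get(struct mr_pool *pool, struct list_head *out,
                           int nr_mrs)
    {
            int got = 0;

            spin_lock(&pool->lock);
            while (got < nr_mrs && !list_empty(&pool->free)) {
                    list_move_tail(pool->free.next, out);
                    got++;
            }
            if (got < nr_mrs) {
                    /* Not enough MRs right now: put them back and let the
                     * caller requeue the I/O. */
                    list_splice_init(out, &pool->free);
                    spin_unlock(&pool->lock);
                    return -EAGAIN;
            }
            spin_unlock(&pool->lock);
            return 0;
    }

In .queue_rq() the -EAGAIN case would then map to BLK_MQ_RQ_QUEUE_BUSY so
the request is retried once other I/Os release their MRs back to the pool.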

Cheers,
Sagi.


Re: [PATCH 06/12] blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()

2016-10-27 Thread Christoph Hellwig
On Wed, Oct 26, 2016 at 03:53:39PM -0700, Bart Van Assche wrote:
> Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
> are followed by kicking the requeue list. Hence add an argument to
> these two functions that allows to kick the requeue list. This was
> proposed by Christoph Hellwig.

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 02/12] blk-mq: Introduce blk_mq_hctx_stopped()

2016-10-27 Thread Sagi Grimberg

Looks fine,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 09/12] SRP transport: Move queuecommand() wait code to SCSI core

2016-10-27 Thread Johannes Thumshirn
On Wed, Oct 26, 2016 at 03:55:00PM -0700, Bart Van Assche wrote:
> Additionally, rename srp_wait_for_queuecommand() into
> scsi_wait_for_queuecommand() and add a comment about the
> queuecommand() call from scsi_send_eh_cmnd().
> 
> Signed-off-by: Bart Van Assche 
> Reviewed-by: Christoph Hellwig 

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 07/12] dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-27 Thread Mike Snitzer
On Wed, Oct 26 2016 at  6:54pm -0400,
Bart Van Assche  wrote:

> Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
> in the dm start and stop queue functions, only manipulate the latter
> flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
> 
> Signed-off-by: Bart Van Assche 
> Reviewed-by: Christoph Hellwig 
> Cc: Mike Snitzer 

Acked-by: Mike Snitzer 


Re: [PATCH 10/12] SRP transport, scsi-mq: Wait for .queue_rq() if necessary

2016-10-27 Thread Johannes Thumshirn
On Wed, Oct 26, 2016 at 03:55:34PM -0700, Bart Van Assche wrote:
> Ensure that if scsi-mq is enabled that scsi_wait_for_queuecommand()
> waits until ongoing shost->hostt->queuecommand() calls have finished.
> 
> Signed-off-by: Bart Van Assche 
> Reviewed-by: Christoph Hellwig 

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 06/12] blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()

2016-10-27 Thread Johannes Thumshirn
On Wed, Oct 26, 2016 at 03:53:39PM -0700, Bart Van Assche wrote:
> Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
> are followed by kicking the requeue list. Hence add an argument to
> these two functions that allows to kick the requeue list. This was
> proposed by Christoph Hellwig.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 10/12] SRP transport, scsi-mq: Wait for .queue_rq() if necessary

2016-10-27 Thread Sagi Grimberg

Thanks for moving it,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 04/12] blk-mq: Move more code into blk_mq_direct_issue_request()

2016-10-27 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 09/12] SRP transport: Move queuecommand() wait code to SCSI core

2016-10-27 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 04/12] blk-mq: Move more code into blk_mq_direct_issue_request()

2016-10-27 Thread Johannes Thumshirn
On Wed, Oct 26, 2016 at 03:52:35PM -0700, Bart Van Assche wrote:
> Move the "hctx stopped" test and the insert request calls into
> blk_mq_direct_issue_request(). Rename that function into
> blk_mq_try_issue_directly() to reflect its new semantics. Pass
> the hctx pointer to that function instead of looking it up a
> second time. These changes avoid that code has to be duplicated
> in the next patch.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 01/12] blk-mq: Do not invoke .queue_rq() for a stopped queue

2016-10-27 Thread Johannes Thumshirn
On Wed, Oct 26, 2016 at 03:50:44PM -0700, Bart Van Assche wrote:
> The meaning of the BLK_MQ_S_STOPPED flag is "do not call
> .queue_rq()". Hence modify blk_mq_make_request() such that requests
> are queued instead of issued if a queue has been stopped.
> 
> Reported-by: Ming Lei 
> Signed-off-by: Bart Van Assche 
> Reviewed-by: Christoph Hellwig 

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 04/12] blk-mq: Move more code into blk_mq_direct_issue_request()

2016-10-27 Thread Johannes Thumshirn
On Wed, Oct 26, 2016 at 03:52:35PM -0700, Bart Van Assche wrote:
> Move the "hctx stopped" test and the insert request calls into
> blk_mq_direct_issue_request(). Rename that function into
> blk_mq_try_issue_directly() to reflect its new semantics. Pass
> the hctx pointer to that function instead of looking it up a
> second time. These changes avoid that code has to be duplicated
> in the next patch.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler

2016-10-27 Thread Jan Kara
On Wed 26-10-16 10:12:38, Jens Axboe wrote:
> On 10/26/2016 10:04 AM, Paolo Valente wrote:
> >
> >>On 26 Oct 2016, at 17:32, Jens Axboe wrote:
> >>
> >>On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
> >>>On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
> The question to ask first is whether to actually have pluggable
> schedulers on blk-mq at all, or just have one that is meant to
> do the right thing in every case (and possibly can be bypassed
> completely).
> >>>
> >>>That would be my preference.  Have a BFQ-variant for blk-mq as an
> >>>option (default to off unless opted in by the driver or user), and
> >>>no other scheduler for blk-mq.  Don't bother with bfq for non
> >>>blk-mq.  It's not like there is any advantage in the legacy-request
> >>>device even for slow devices, except for the option of having I/O
> >>>scheduling.
> >>
> >>It's the only right way forward. blk-mq might not offer any substantial
> >>advantages to rotating storage, but with scheduling, it won't offer a
> >>downside either. And it'll take us towards the real goal, which is to
> >>have just one IO path.
> >
> >ok
> >
> >>Adding a new scheduler for the legacy IO path
> >>makes no sense.
> >
> >I would fully agree if effective and stable I/O scheduling would be
> >available in blk-mq in one or two months.  But I guess that it will
> >take at least one year optimistically, given the current status of the
> >needed infrastructure, and given the great difficulties of doing
> >effective scheduling at the high parallelism and extreme target speeds
> >of blk-mq.  Of course, this holds true unless little clever scheduling
> >is performed.
> >
> >So, what's the point in forcing a lot of users wait another year or
> >more, for a solution that has yet to be even defined, while they could
> >enjoy a much better system, and then switch an even better system when
> >scheduling is ready in blk-mq too?
> 
> That same argument could have been made 2 years ago. Saying no to a new
> scheduler for the legacy framework goes back roughly that long. We could
> have had BFQ for mq NOW, if we didn't keep coming back to this very
> point.
> 
> I'm hesitant to add a new scheduler because it's very easy to add, very
> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> it'll take us years and years to get rid of it again. We should be
> moving towards LESS moving parts in the legacy path, not more.
> 
> We can keep having this discussion every few years, but I think we'd
> both prefer to make some actual progress here. It's perfectly fine to
> add an interface for a single queue interface for an IO scheduler for
> blk-mq, since we don't care too much about scalability there. And that
> won't take years, that should be a few weeks. Retrofitting BFQ on top of
> that should not be hard either. That can co-exist with a real multiqueue
> scheduler as well, something that's geared towards some fairness for
> faster devices.

OK, so some solution like having a variant of blk_sq_make_request() that
will consume requests, do IO scheduling decisions on them, and feed them
into the HW queue as it sees fit would be acceptable? That will provide the
IO scheduler a global view that it needs for complex scheduling decisions
so it should indeed be relatively easy to port BFQ to work like that.

Honza
-- 
Jan Kara 
SUSE Labs, CR