Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-29 Thread Byungchul Park
On Wed, Aug 30, 2017 at 02:20:37PM +0900, Sergey Senozhatsky wrote:
> Byungchul, a quick question.

Hello Sergey,

> have you measured the performance impact? somehow my linux-next is

Yeah, it inevitably has some performance impact.

> notably slower than earlier 4.13 linux-next. (e.g. scrolling in vim
> is irritatingly slow)

To Ingo,

I cannot decide whether we should revert the CONFIG_LOCKDEP_CROSSRELEASE
dependency on CONFIG_PROVE_LOCKING in Kconfig. With them enabled,
lockdep detection becomes stronger but has a performance impact. But
it's a debug option anyway, so IMHO we don't have to take care of the
performance impact. Please let me know your decision.

> `time dmesg' shows some difference, but probably that's not a good
> test.
> 
>            !LOCKDEP    LOCKDEP     LOCKDEP -CROSSRELEASE -COMPLETIONS
>   real     0m0.661s    0m2.290s    0m1.920s
>   user     0m0.010s    0m0.105s    0m0.000s
>   sys      0m0.636s    0m2.224s    0m1.888s
> 
> anyone else "sees"/"can confirm" the slow down?
> 
> 
> it gets back to "usual normal" when I disable CROSSRELEASE and COMPLETIONS.
> 
> ---
> 
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index b19c491cbc4e..cdc30ef81c5e 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1091,8 +1091,6 @@ config PROVE_LOCKING
> select DEBUG_MUTEXES
> select DEBUG_RT_MUTEXES if RT_MUTEXES
> select DEBUG_LOCK_ALLOC
> -   select LOCKDEP_CROSSRELEASE
> -   select LOCKDEP_COMPLETIONS
> select TRACE_IRQFLAGS
> default n
> help
> 
> ---
> 
>   -ss


Re: possible circular locking dependency detected [was: linux-next: Tree for Aug 22]

2017-08-29 Thread Sergey Senozhatsky
On (08/23/17 09:03), Byungchul Park wrote:
[..]
> > Byungchul, did you add the crosslock checks to lockdep? Can you have a look
> > at the above report? That report namely doesn't make sense to me.
> 
> The report is talking about the following lockup:
> 
> A work in a worker                A task work on exit to user
> ------------------                ---------------------------
> mutex_lock(&bdev->bd_mutex)
>                                   mutex_lock(&bdev->bd_mutex)
> blk_execute_rq()
>                                   wait_for_completion_io_timeout()
>                                   complete()
> 
[..]
> To Peterz,
> 
> Anyway, I wanted to avoid lockdep reports in cases that use a timeout
> interface. Do you think this kind of lockup is still worth reporting?
> I'm OK if you do.

Byungchul, a quick question.
have you measured the performance impact? somehow my linux-next is
notably slower than earlier 4.13 linux-next. (e.g. scrolling in vim
is irritatingly slow)


`time dmesg' shows some difference, but probably that's not a good
test.

         !LOCKDEP    LOCKDEP     LOCKDEP -CROSSRELEASE -COMPLETIONS
real     0m0.661s    0m2.290s    0m1.920s
user     0m0.010s    0m0.105s    0m0.000s
sys      0m0.636s    0m2.224s    0m1.888s

anyone else "sees"/"can confirm" the slow down?


it gets back to "usual normal" when I disable CROSSRELEASE and COMPLETIONS.

---

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index b19c491cbc4e..cdc30ef81c5e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1091,8 +1091,6 @@ config PROVE_LOCKING
select DEBUG_MUTEXES
select DEBUG_RT_MUTEXES if RT_MUTEXES
select DEBUG_LOCK_ALLOC
-   select LOCKDEP_CROSSRELEASE
-   select LOCKDEP_COMPLETIONS
select TRACE_IRQFLAGS
default n
help

---

-ss


Re: I/O hangs after resuming from suspend-to-ram

2017-08-29 Thread Ming Lei
On Wed, Aug 30, 2017 at 10:15:37AM +0800, Ming Lei wrote:
> Hi,
> 
> On Tue, Aug 29, 2017 at 05:52:42PM +0200, Oleksandr Natalenko wrote:
> > Hello.
> > 
> > Re-tested with v4.13-rc7 + proposed patch and got the same result.
> 
> Maybe there is another issue, I didn't use dmcrypt on raid10, will
> test in your way to see if I can reproduce it.
> 
> BTW, could you share with us which blk-mq scheduler you are using on SATA?
> The patch I posted should address one issue with the "none" scheduler.

I can't reproduce it even with dmcrypt on top of raid10 after applying
my patch.

Could you apply the following debug patch and provide the dmesg log
after running the commands below?

# echo 9 > /proc/sys/kernel/printk
# echo devices > /sys/power/pm_test
# echo mem > /sys/power/state

BTW, it would be better to provide the names of the two SATA disks (behind the raid10).

diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index b44c1bb687a2..75b13248ea1c 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -53,17 +53,22 @@ static int scsi_dev_type_suspend(struct device *dev,
 {
const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
int err;
+   struct scsi_device *sdev = to_scsi_device(dev);
 
/* flush pending in-flight resume operations, suspend is synchronous */
	async_synchronize_full_domain(&scsi_sd_pm_domain);
 
-   err = scsi_device_quiesce(to_scsi_device(dev));
+   sdev_printk(KERN_WARNING, sdev, "%s: enter\n", __func__);
+   err = scsi_device_quiesce(sdev);
if (err == 0) {
+	sdev_printk(KERN_WARNING, sdev, "%s: before suspend\n", __func__);
err = cb(dev, pm);
+	sdev_printk(KERN_WARNING, sdev, "%s: after suspend\n", __func__);
if (err)
-   scsi_device_resume(to_scsi_device(dev));
+   scsi_device_resume(sdev);
}
dev_dbg(dev, "scsi suspend: %d\n", err);
+   sdev_printk(KERN_WARNING, sdev, "%s: exit\n", __func__);
return err;
 }
 
@@ -72,9 +77,13 @@ static int scsi_dev_type_resume(struct device *dev,
 {
const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
int err = 0;
+   struct scsi_device *sdev = to_scsi_device(dev);
 
+   sdev_printk(KERN_WARNING, sdev, "%s: enter\n", __func__);
+   sdev_printk(KERN_WARNING, sdev, "%s: before resume\n", __func__);
err = cb(dev, pm);
-   scsi_device_resume(to_scsi_device(dev));
+   sdev_printk(KERN_WARNING, sdev, "%s: after resume\n", __func__);
+   scsi_device_resume(sdev);
dev_dbg(dev, "scsi resume: %d\n", err);
 
if (err == 0) {
@@ -83,6 +92,7 @@ static int scsi_dev_type_resume(struct device *dev,
pm_runtime_enable(dev);
}
 
+   sdev_printk(KERN_WARNING, sdev, "%s: exit\n", __func__);
return err;
 }


-- 
Ming


Re: [RFC] block/loop: make loop cgroup aware

2017-08-29 Thread Shaohua Li
On Tue, Aug 29, 2017 at 08:28:09AM -0700, Tejun Heo wrote:
> Hello, Shaohua.
> 
> On Tue, Aug 29, 2017 at 08:22:36AM -0700, Shaohua Li wrote:
> > > Yeah, writeback tracks the most active cgroup and associates writeback
> > > ios with that cgroup.  For buffered loop devices, I think it'd be fine
> > > to make the loop driver forward the cgroup association and let the
> > > writeback layer determine the matching association as usual.
> > 
> > Doing this means we must forward cgroup info to page, not bio. I need to 
> > check
> > if we can make the mechanism work for memcg.
> 
> The association is already coming from the page.  We just need to make
> sure that going through loop driver doesn't get in the way of the
> membership getting propagated to the underlying device.

I think there is some confusion. The app writes files in an upper-layer fs on
top of loop. memcg establishes membership for the pages of these files. Then
writeback does the write, and loop writes these pages to the lower-layer fs.
That write is done in the loop thread and will allocate new page cache for the
lower-layer fs files. The issue is that we must forward the memcg info from the
upper-layer files' page cache to the lower-layer files' page cache.

> > > Hmm... why do we need double forwarding (original issuer -> aio cmd ->
> > > ios) in loop?  If we need this, doesn't this mean that we're missing
> > > ownership forwarding in aio paths and should make the forwarding
> > > happen there?
> > 
> > what do you mean double forwarding?
> 
> So, this looks like the loop driver is explicitly forwarding the
> association from the original issuer to the aio command and then from
> the aio command to the ios to the underlying device.  I'm wondering
> whether the right way to do this is making aio forward the association
> by default, instead of the loop driver doing it explicitly.  That way,
> all aio users can benefit from the forwarding instead of just loop.

That's possible. The downside of doing this in aio is that we must audit all
filesystems to make sure all bios have an association. I'd like to avoid that
if there is no other loop-like cgroup usage.

Thanks,
Shaohua


Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode

2017-08-29 Thread Shaohua Li
On Wed, Aug 30, 2017 at 10:51:21AM +0800, Ming Lei wrote:
> On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> > On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > > From: Shaohua Li 
> > > > 
> > > > Currently loop disables merge. While it makes sense for buffer IO mode,
> > > > directio mode can benefit from request merge. Without merge, loop could
> > > > send small size IO to underlayer disk and harm performance.
> > > 
> > > Hi Shaohua,
> > > 
> > > IMO, no matter whether merge is used, loop always sends pages one by one
> > > to the VFS in both dio and buffered I/O.
> > 
> > Why do you think so?
> 
> do_blockdev_direct_IO() still handles pages one by one from the iov_iter, and
> with bigger requests, I guess it might be the plug merge at work.

This is not true. directio sends big bios directly; it is not because of plug
merge. Please at least check the code before you complain.

> >  
> > > Also if merge is enabled on loop, that means merge is run
> > > on both loop and low level block driver, and not sure if we
> > > can benefit from that.
> > 
> > why does merge still happen in low level block driver?
> 
> Because the scheduler is still working on the low-level disk. My question
> is: why doesn't the scheduler on the low-level disk do the merging now,
> if the scheduler on loop can merge?

The low-level disk can still do merges, but since this is directio, the upper
layer already dispatches requests as big as possible. There is very little
chance the requests can be merged again.

> > 
> > > 
> > > So Could you provide some performance data about this patch?
> > 
> > In my virtual machine, a workload improves from ~20M/s to ~50M/s. And I
> > clearly see the request sizes become bigger.
> 
> Could you share with us what the low-level disk is?

It's a SATA ssd.

Thanks,
Shaohua


Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests

2017-08-29 Thread Ming Lei
On Tue, Aug 29, 2017 at 08:15:23PM +, Bart Van Assche wrote:
> On Tue, 2017-08-29 at 19:16 +0800, Ming Lei wrote:
> > Hi Bart,
> > 
> > Did you see perf regression on SRP with smaller jobs after applying
> > my patchset V3?
> > 
> > I just ran the test with 16 jobs (the system has 16 CPU cores) instead of 64,
> > and I don't see a perf regression on SRP with v4.13-rc6+blk-next (2nd column)
> > vs. v4.13-rc6+blk-next+patch V3 (3rd column):
> > 
> > 
> > -----------------------------------
> >  IOPS(K)   |   NONE    |   NONE
> > -----------------------------------
> > read       |  475.83   |  485.88
> > randread   |  142.86   |  141.96
> > write      |  483.9    |  492.39
> > randwrite  |  124.83   |  124.53
> > -----------------------------------
> > [ ... ]
> > -----------------------------------
> >  LAT(us)   |   NONE    |   NONE
> > -----------------------------------
> > read       |   2.15    |   2.11
> > randread   |   7.17    |   7.21
> > write      |   2.11    |   2.08
> > randwrite  |   8.2     |   8.22
> > -----------------------------------
> > [ ... ]
> 
> Hello Ming,
> 
> Although I would prefer to see measurement data against an SRP target system
> that supports a higher workload (>1M IOPS) and

Hi Bart,

For such a high workload, I guess it often requires increasing
.cmd_per_lun.

> also for a high-end NVMe drive,

My patch won't affect NVMe drives since the NVMe driver doesn't usually
become busy.


> I think the above data is sufficient to show that the performance impact of
> your patch series is most likely small enough even for high-end SCSI initiator
> drivers.

OK.

-- 
Ming


Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode

2017-08-29 Thread Ming Lei
On Tue, Aug 29, 2017 at 08:13:39AM -0700, Shaohua Li wrote:
> On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> > On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > > From: Shaohua Li 
> > > 
> > > Currently loop disables merge. While it makes sense for buffer IO mode,
> > > directio mode can benefit from request merge. Without merge, loop could
> > > send small size IO to underlayer disk and harm performance.
> > 
> > Hi Shaohua,
> > 
> > IMO, no matter whether merge is used, loop always sends pages one by one
> > to the VFS in both dio and buffered I/O.
> 
> Why do you think so?

do_blockdev_direct_IO() still handles pages one by one from the iov_iter, and
with bigger requests, I guess it might be the plug merge at work.

>  
> > Also if merge is enabled on loop, that means merge is run
> > on both loop and low level block driver, and not sure if we
> > can benefit from that.
> 
> why does merge still happen in low level block driver?

Because the scheduler is still working on the low-level disk. My question
is: why doesn't the scheduler on the low-level disk do the merging now,
if the scheduler on loop can merge?

> 
> > 
> > So Could you provide some performance data about this patch?
> 
> In my virtual machine, a workload improves from ~20M/s to ~50M/s. And I
> clearly see the request sizes become bigger.

Could you share with us what the low-level disk is?

-- 
Ming


Re: I/O hangs after resuming from suspend-to-ram

2017-08-29 Thread Ming Lei
Hi,

On Tue, Aug 29, 2017 at 05:52:42PM +0200, Oleksandr Natalenko wrote:
> Hello.
> 
> Re-tested with v4.13-rc7 + proposed patch and got the same result.

Maybe there is another issue, I didn't use dmcrypt on raid10, will
test in your way to see if I can reproduce it.

BTW, could you share with us which blk-mq scheduler you are using on SATA?
The patch I posted should address one issue with the "none" scheduler.


-- 
Ming


Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module

2017-08-29 Thread Ben Hutchings
On Tue, 2017-08-29 at 13:57 -0600, Jens Axboe wrote:
> On 08/29/2017 01:49 PM, Ben Hutchings wrote:
> > On Tue, 2017-08-29 at 10:46 -0600, Jens Axboe wrote:
> > > On 08/29/2017 10:34 AM, Ben Hutchings wrote:
> > > > On Tue, 2017-08-29 at 09:53 -0600, Jens Axboe wrote:
> > > > > On 08/29/2017 09:48 AM, Jens Axboe wrote:
> > > > > > On 08/29/2017 09:28 AM, Ben Hutchings wrote:
> > > > > > > On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
> > > > > > > > On Sun, Aug 13 2017, Ben Hutchings wrote:
> > > > > > > > > The block core requests modules with the "-iosched" name
> > > > > > > > > suffix, but
> > > > > > > > > bfq no longer has that suffix.  Add an alias.
> > > > > > > > 
> > > > > > > > I'd apply these two, but both patches are mangled when saved.
> > > > > > > > It's
> > > > > > > > turning == into =3D and so forth.
> > > > > > > > 
> > > > > > > > Care to check your settings and resend?
> > > > > > > 
> > > > > > > Just tried saving and applying with 'git am' successfully.  I
> > > > > > > think the
> > > > > > > problem is at your end.
> > > > > > 
> > > > > > Then yours is the only one, I apply patches people send me all day
> > > > > > long.
> > > > > > Was the case both in tbird and mutt, both of them showed the diffs
> > > > > > as mangled, and they showed up mangled when saved.
> > > > > 
> > > > > Here's your email in the archive:
> > > > > 
> > > > > https://marc.info/?l=linux-block&m=150264374920778&q=raw
> > > > > 
> > > > > Note this part:
> > > > > 
> > > > > Content-Transfer-Encoding: quoted-printable
> > > > 
> > > > What about it?  This is used for every mail with a non-ASCII name in
> > > > it, for example.  'git am' understands it.
> > > 
> > > What about it? It screws up the patch. Maybe git am understands it, but
> > > it's hard/impossible to read manually.
> > 
> > Where, other than the 'raw' link above, are you seeing the patch with
> > q-p encoding not decoded?
> 
> When I save it and view it.

I don't understand why you wouldn't do that in your mailer.  You're
going to need to switch back to the mailer when you reply, after all.
But if you want to display a quoted-printable file in a terminal, you
can use 'qprint -d' to do that.

> > > I'm not going to apply anything
> > > that I can't personally read/review easily. Fix your setup, if you are
> > > going to be sending patches.
> > 
> > So you won't accept patches from anyone with a non-ASCII name?
> 
> You're being ridiculous.

Am I?  It seems like they would trigger the same problem in your
current workflow.

Ben.

-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.



signature.asc
Description: This is a digitally signed message part


Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests

2017-08-29 Thread Jens Axboe
On 08/29/2017 02:57 PM, Bart Van Assche wrote:
> On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
>> All that can be done by clearing or setting a flag on the first call to
>> ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
>> that, but for SCSI we probablt can't use that given that it has more
>> meaning for the old request path.  But how about just adding a new
>> RQD_DRV_INITIALIZED or similar flag?
> 
> Hello Christoph,
> 
> More changes would have to be made to implement the above proposal
> than just setting a flag at the start of .queue_rq() / .queuecommand()
> for all filesystem requests. The numerous code paths that lead to a
> request not being executed immediately, e.g. a SCSI host being busy or
> a preparation status != BLKPREP_OK, would have to be inspected and
> code would have to be added to clear the "command initialized" flag to
> ensure that initialization occurs shortly before the first time a
> command is executed.
> 
> The choice we have to make is to add more state information and
> complicated code for keeping that state information up-to-date to the
> SCSI core or making a small and easy to maintain change to the block
> layer core that does not involve any new state information. That's why
> I'm asking you to reconsider the patch at the start of this e-mail
> thread.

I don't like this addition, and honestly I wasn't a huge fan of adding
->init() hooks to the alloc path when we initially added this earlier
this summer. The fact that we now have to make this more invasive
doesn't improve the situation at all.

So fwiw, I too would much rather see an implementation based on an RQF_
flag instead.

-- 
Jens Axboe
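
A minimal sketch of what the flag-based lazy initialization could look like in
a driver's .queue_rq() (illustrative only: RQD_DRV_INITIALIZED is the name
Christoph floated above and does not exist, and example_init_cmd() /
example_dispatch() are invented helpers):

	static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
					     const struct blk_mq_queue_data *bd)
	{
		struct request *req = bd->rq;

		/* One-time per-command setup, done lazily on first dispatch. */
		if (!(req->rq_flags & RQD_DRV_INITIALIZED)) {
			example_init_cmd(blk_mq_rq_to_pdu(req));
			req->rq_flags |= RQD_DRV_INITIALIZED;
		}

		return example_dispatch(req);
	}

The catch, per Bart's objection above, is that every path where dispatch is
deferred or requeued would also need code to clear the flag again if the
initialization must be redone shortly before actual execution.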



Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests

2017-08-29 Thread Bart Van Assche
On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> All that can be done by clearing or setting a flag on the first call to
> ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
> that, but for SCSI we probablt can't use that given that it has more
> meaning for the old request path.  But how about just adding a new
> RQD_DRV_INITIALIZED or similar flag?

Hello Christoph,

More changes would have to be made to implement the above proposal than just
setting a flag at the start of .queue_rq() / .queuecommand() for all
filesystem requests. The numerous code paths that lead to a request not being
executed immediately, e.g. a SCSI host being busy or a preparation status !=
BLKPREP_OK, would have to be inspected and code would have to be added to clear
the "command initialized" flag to ensure that initialization occurs shortly
before the first time a command is executed.

The choice we have to make is to add more state information and complicated
code for keeping that state information up-to-date to the SCSI core or making a
small and easy to maintain change to the block layer core that does not involve
any new state information. That's why I'm asking you to reconsider the patch at
the start of this e-mail thread.

Thanks,

Bart.

Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests

2017-08-29 Thread Bart Van Assche
On Tue, 2017-08-29 at 19:16 +0800, Ming Lei wrote:
> Hi Bart,
> 
> Did you see perf regression on SRP with smaller jobs after applying
> my patchset V3?
> 
> I just ran the test with 16 jobs (the system has 16 CPU cores) instead of 64,
> and I don't see a perf regression on SRP with v4.13-rc6+blk-next (2nd column)
> vs. v4.13-rc6+blk-next+patch V3 (3rd column):
> 
> 
> -----------------------------------
>  IOPS(K)   |   NONE    |   NONE
> -----------------------------------
> read       |  475.83   |  485.88
> randread   |  142.86   |  141.96
> write      |  483.9    |  492.39
> randwrite  |  124.83   |  124.53
> -----------------------------------
> [ ... ]
> -----------------------------------
>  LAT(us)   |   NONE    |   NONE
> -----------------------------------
> read       |   2.15    |   2.11
> randread   |   7.17    |   7.21
> write      |   2.11    |   2.08
> randwrite  |   8.2     |   8.22
> -----------------------------------
> [ ... ]

Hello Ming,

Although I would prefer to see measurement data against an SRP target system
that supports a higher workload (>1M IOPS) and also for a high-end NVMe drive,
I think the above data is sufficient to show that the performance impact of
your patch series is most likely small enough even for high-end SCSI initiator
drivers.

Bart.

Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module

2017-08-29 Thread Ben Hutchings
On Tue, 2017-08-29 at 10:46 -0600, Jens Axboe wrote:
> On 08/29/2017 10:34 AM, Ben Hutchings wrote:
> > On Tue, 2017-08-29 at 09:53 -0600, Jens Axboe wrote:
> > > On 08/29/2017 09:48 AM, Jens Axboe wrote:
> > > > On 08/29/2017 09:28 AM, Ben Hutchings wrote:
> > > > > On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
> > > > > > On Sun, Aug 13 2017, Ben Hutchings wrote:
> > > > > > > The block core requests modules with the "-iosched" name
> > > > > > > suffix, but
> > > > > > > bfq no longer has that suffix.  Add an alias.
> > > > > > 
> > > > > > I'd apply these two, but both patches are mangled when saved.
> > > > > > It's
> > > > > > turning == into =3D and so forth.
> > > > > > 
> > > > > > Care to check your settings and resend?
> > > > > 
> > > > > Just tried saving and applying with 'git am' successfully.  I
> > > > > think the
> > > > > problem is at your end.
> > > > 
> > > > Then yours is the only one, I apply patches people send me all day
> > > > long.
> > > > Was the case both in tbird and mutt, both of them showed the diffs
> > > > as mangled, and they showed up mangled when saved.
> > > 
> > > Here's your email in the archive:
> > > 
> > > https://marc.info/?l=linux-block&m=150264374920778&q=raw
> > > 
> > > Note this part:
> > > 
> > > Content-Transfer-Encoding: quoted-printable
> > 
> > What about it?  This is used for every mail with a non-ASCII name in
> > it, for example.  'git am' understands it.
> 
> What about it? It screws up the patch. Maybe git am understands it, but
> it's hard/impossible to read manually.

Where, other than the 'raw' link above, are you seeing the patch with
q-p encoding not decoded?

> I'm not going to apply anything
> that I can't personally read/review easily. Fix your setup, if you are
> going to be sending patches.

So you won't accept patches from anyone with a non-ASCII name?

Ben.

> > > Problem is definitely at your end.
> > 
> > Or perhaps in the middle?  Anyway, here are the patches again as an
> > mbox.
> 
> Feel free to browse other patches on the list, I don't see any that
> are quoted-printable. And I'd know, since I generally end up applying
> (or at least reviewing) most of them.
> 
-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.



signature.asc
Description: This is a digitally signed message part


Re: [PATCH] bsg: remove #if 0'ed code

2017-08-29 Thread Jens Axboe
On 08/29/2017 10:48 AM, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Applied, thanks.

-- 
Jens Axboe



[PATCH] bsg: remove #if 0'ed code

2017-08-29 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 block/bsg.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index 37663b664666..ee1335c68de7 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -932,15 +932,8 @@ static long bsg_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 
return ret;
}
-   /*
-* block device ioctls
-*/
default:
-#if 0
-   return ioctl_by_bdev(bd->bdev, cmd, arg);
-#else
return -ENOTTY;
-#endif
}
 }
 
-- 
2.11.0



Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module

2017-08-29 Thread Jens Axboe
On 08/29/2017 10:34 AM, Ben Hutchings wrote:
> On Tue, 2017-08-29 at 09:53 -0600, Jens Axboe wrote:
>> On 08/29/2017 09:48 AM, Jens Axboe wrote:
>>> On 08/29/2017 09:28 AM, Ben Hutchings wrote:
 On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
> On Sun, Aug 13 2017, Ben Hutchings wrote:
>> The block core requests modules with the "-iosched" name
>> suffix, but
>> bfq no longer has that suffix.  Add an alias.
>
> I'd apply these two, but both patches are mangled when saved.
> It's
> turning == into =3D and so forth.
>
> Care to check your settings and resend?

 Just tried saving and applying with 'git am' successfully.  I
 think the
 problem is at your end.
>>>
>>> Then yours is the only one, I apply patches people send me all day
>>> long.
>>> Was the case both in tbird and mutt, both of them showed the diffs
>>> as mangled, and they showed up mangled when saved.
>>
>> Here's your email in the archive:
>>
>> https://marc.info/?l=linux-block&m=150264374920778&q=raw
>>
>> Note this part:
>>
>> Content-Transfer-Encoding: quoted-printable
> 
> What about it?  This is used for every mail with a non-ASCII name in
> it, for example.  'git am' understands it.

What about it? It screws up the patch. Maybe git am understands it, but
it's hard/impossible to read manually. I'm not going to apply anything
that I can't personally read/review easily. Fix your setup, if you are
going to be sending patches.

>> Problem is definitely at your end.
> 
> Or perhaps in the middle?  Anyway, here are the patches again as an
> mbox.

Feel free to browse other patches on the list, I don't see any that
are quoted-printable. And I'd know, since I generally end up applying
(or at least reviewing) most of them.

-- 
Jens Axboe



Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module

2017-08-29 Thread Ben Hutchings
On Tue, 2017-08-29 at 09:53 -0600, Jens Axboe wrote:
> On 08/29/2017 09:48 AM, Jens Axboe wrote:
> > On 08/29/2017 09:28 AM, Ben Hutchings wrote:
> > > On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
> > > > On Sun, Aug 13 2017, Ben Hutchings wrote:
> > > > > The block core requests modules with the "-iosched" name
> > > > > suffix, but
> > > > > bfq no longer has that suffix.  Add an alias.
> > > > 
> > > > I'd apply these two, but both patches are mangled when saved.
> > > > It's
> > > > turning == into =3D and so forth.
> > > > 
> > > > Care to check your settings and resend?
> > > 
> > > Just tried saving and applying with 'git am' successfully.  I
> > > think the
> > > problem is at your end.
> > 
> > Then yours is the only one, I apply patches people send me all day
> > long.
> > Was the case both in tbird and mutt, both of them showed the diffs
> > as mangled, and they showed up mangled when saved.
> 
> Here's your email in the archive:
> 
> https://marc.info/?l=linux-block&m=150264374920778&q=raw
> 
> Note this part:
> 
> Content-Transfer-Encoding: quoted-printable

What about it?  This is used for every mail with a non-ASCII name in
it, for example.  'git am' understands it.

> Problem is definitely at your end.

Or perhaps in the middle?  Anyway, here are the patches again as an
mbox.

Ben.

-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.



block-sched-module-aliases.mbox
Description: application/mbox


signature.asc
Description: This is a digitally signed message part


[PATCH] blkcg: check pol->cpd_free_fn before free cpd

2017-08-29 Thread weiping zhang
Check pol->cpd_free_fn() instead of pol->cpd_alloc_fn() when freeing the cpd.

Signed-off-by: weiping zhang 
---
 block/blk-cgroup.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0480892..adcbc3e 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1044,7 +1044,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
	list_del(&blkcg->all_blkcgs_node);
 
for (i = 0; i < BLKCG_MAX_POLS; i++)
-   if (blkcg->cpd[i])
+   if (blkcg->cpd[i] && blkcg_policy[i]->cpd_free_fn)
blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
 
	mutex_unlock(&blkcg_pol_mutex);
@@ -1109,7 +1109,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 
 free_pd_blkcg:
for (i--; i >= 0; i--)
-   if (blkcg->cpd[i])
+   if (blkcg->cpd[i] && blkcg_policy[i]->cpd_free_fn)
blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
 free_blkcg:
kfree(blkcg);
@@ -1450,7 +1450,7 @@ int blkcg_policy_register(struct blkcg_policy *pol)
return 0;
 
 err_free_cpds:
-   if (pol->cpd_alloc_fn) {
+   if (pol->cpd_free_fn) {
	list_for_each_entry(blkcg, &all_blkcgs, all_blkcgs_node) {
if (blkcg->cpd[pol->plid]) {
pol->cpd_free_fn(blkcg->cpd[pol->plid]);
@@ -1490,7 +1490,7 @@ void blkcg_policy_unregister(struct blkcg_policy *pol)
/* remove cpds and unregister */
	mutex_lock(&blkcg_pol_mutex);
 
-   if (pol->cpd_alloc_fn) {
+   if (pol->cpd_free_fn) {
	list_for_each_entry(blkcg, &all_blkcgs, all_blkcgs_node) {
if (blkcg->cpd[pol->plid]) {
pol->cpd_free_fn(blkcg->cpd[pol->plid]);
-- 
2.9.4
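
To see why the check matters: a policy may legitimately provide a cpd_alloc_fn
without a cpd_free_fn, e.g. when the cpd needs no cleanup. A purely
illustrative sketch (all "example" names are invented, not from the kernel):

	static struct blkcg_policy_data example_cpd;

	static struct blkcg_policy_data *example_cpd_alloc(gfp_t gfp)
	{
		/* Hands out storage that must never be freed. */
		return &example_cpd;
	}

	static struct blkcg_policy example_pol = {
		.cpd_alloc_fn	= example_cpd_alloc,
		/* .cpd_free_fn intentionally NULL: nothing to release */
	};

Before this patch, the free paths guarded on cpd_alloc_fn (or only on
blkcg->cpd[i]) and would then call a NULL cpd_free_fn for such a policy.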



Re: [PATCH] block: Make blk_dequeue_request() static

2017-08-29 Thread Jens Axboe
On 08/28/2017 08:54 PM, Damien Le Moal wrote:
> The only caller of this function is blk_start_request() in the same
> file. Fix blk_start_request() description accordingly.

Applied, thanks.

-- 
Jens Axboe



Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module

2017-08-29 Thread Jens Axboe
On 08/29/2017 09:28 AM, Ben Hutchings wrote:
> On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
>> On Sun, Aug 13 2017, Ben Hutchings wrote:
>>> The block core requests modules with the "-iosched" name suffix, but
>>> bfq no longer has that suffix.  Add an alias.
>>
>> I'd apply these two, but both patches are mangled when saved. It's
>> turning == into =3D and so forth.
>>
>> Care to check your settings and resend?
> 
> Just tried saving and applying with 'git am' successfully.  I think the
> problem is at your end.

Then yours is the only one, I apply patches people send me all day long.
Was the case both in tbird and mutt, both of them showed the diffs
as mangled, and they showed up mangled when saved.

-- 
Jens Axboe



Re: [PATCH 0/2] skd: Two additional patches for kernel v4.14

2017-08-29 Thread Jens Axboe
On 08/29/2017 09:32 AM, Bart Van Assche wrote:
> Hello Jens,
> 
> Further code review of the skd driver resulted in two additional
> patches. Please consider these for kernel v4.14.

Looks good to me, applied, thanks.

-- 
Jens Axboe



Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-29 Thread Keith Busch
On Tue, Aug 29, 2017 at 04:55:59PM +0200, Christoph Hellwig wrote:
> On Tue, Aug 29, 2017 at 10:54:17AM -0400, Keith Busch wrote:
> > It also looks like new submissions will get a new path only from the
> > fact that the original/primary is being reset. The controller reset
> > itself seems a bit heavy-handed. Can we just set head->current_path to
> > the next active controller in the list?
> 
> For ANA we'll have to do that anyway, but if we got a failure
> that clearly indicates a path failure what benefit is there in not
> resetting the controller?  But yeah, maybe we can just switch the
> path for non-ANA controllers and wait for timeouts to do their work.

Okay, sounds reasonable.

Speaking of timeouts, nvme_req_needs_retry will fail the command
immediately rather than try the alternate path if it was cancelled due
to timeout handling. Should we create a new construct for a command's
total time, separate from the recovery timeout, so we may try an
alternate path?


[PATCH 1/2] skd: Remove blk_queue_bounce_limit() call

2017-08-29 Thread Bart Van Assche
Since sTec s1120 devices support 64-bit DMA it is not necessary
to request data buffer bouncing. Hence remove the
blk_queue_bounce_limit() call.

Suggested-by: Christoph Hellwig 
Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/block/skd_main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index 00a86252b3c5..f987ff601a4c 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2844,7 +2844,6 @@ static int skd_cons_disk(struct skd_device *skdev)
rc = PTR_ERR(q);
goto err_out;
}
-   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
q->queuedata = skdev;
q->nr_requests = skd_max_queue_depth / 2;
 
-- 
2.14.1



[PATCH 2/2] skd: Let the block layer core choose .nr_requests

2017-08-29 Thread Bart Van Assche
Since blk_mq_init_queue() initializes .nr_requests to the tag set
size and since that value is a good default for the skd driver, do
not overwrite the value set by blk_mq_init_queue(). This change
doubles the default value of .nr_requests.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/block/skd_main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index f987ff601a4c..7cedb4295e9d 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2845,7 +2845,6 @@ static int skd_cons_disk(struct skd_device *skdev)
goto err_out;
}
q->queuedata = skdev;
-   q->nr_requests = skd_max_queue_depth / 2;
 
skdev->queue = q;
disk->queue = q;
-- 
2.14.1



[PATCH 0/2] skd: Two additional patches for kernel v4.14

2017-08-29 Thread Bart Van Assche
Hello Jens,

Further code review of the skd driver resulted in two additional
patches. Please consider these for kernel v4.14.

Thanks,

Bart.

Bart Van Assche (2):
  skd: Remove blk_queue_bounce_limit() call
  skd: Let the block layer core choose .nr_requests

 drivers/block/skd_main.c | 2 --
 1 file changed, 2 deletions(-)

-- 
2.14.1



Re: [PATCH] blk-mq: Improvements to the hybrid polling sleep time calculation

2017-08-29 Thread Stephen Bates
>> From: Stephen Bates 
>> 
>> Hybrid polling currently uses half the average completion time as an
>> estimate of how long to poll for. We can improve upon this by noting
>> that polling before the minimum completion time makes no sense. Add a
>> sysfs entry to use this fact to improve CPU utilization in certain
>> cases.
>> 
>> At the same time the minimum is a bit too long to sleep for since we
>> must factor in OS wake time for the thread. For now allow the user to
>> set this via a second sysfs entry (in nanoseconds).
>> 
>> Testing this patch on Intel Optane SSDs showed that using the minimum
>> rather than half reduced CPU utilization from 59% to 38%. Tuning
>> this via the wake time adjustment allowed us to trade CPU load for
>> latency. For example
>> 
>> io_poll  delay  hyb_use_min  adjust  latency  CPU load
>> 1        -1     N/A          N/A     8.4      100%
>> 1         0     0            N/A     8.4      57%
>> 1         0     1            0       10.3     34%
>> 1         9     1            1000    9.9      37%
>> 1         0     1            2000    8.4      47%
>> 1         0     1            1       8.4      100%
>> 
>> Ideally we will extend this to auto-calculate the wake time rather
>> than have it set by the user.
>
> I don't like this, it's another weird knob that will exist but that
> no one will know how to use. For most of the testing I've done
> recently, hybrid is a win over busy polling - hence I think we should
> make that the default. 60% of mean has also, in testing, been shown
> to be a win. So that's an easy fix/change we can consider.

I do agree that this is a hard knob to tune. I am, however, not happy that
the current hybrid default may mean we are polling well before the minimum
completion time. That just seems like a waste of CPU resources to me. I do
agree that turning on hybrid as the default, and perhaps bumping up the
default, is a good idea.

> To go beyond that, I'd much rather see us tracking the time waste.
> If we consider the total completion time of an IO to be A+B+C, where:
>
> A Time needed to go to sleep
> B Sleep time
> C Time needed to wake up
>
> then we could feasibly track A+C. We already know how long the IO
> will take to complete, as we track that. At that point we'd have
> a full picture of how long we should sleep.

Yes, this is where I was thinking of taking this functionality in the long 
term. It seems like tracking C is something other parts of the kernel might 
need. Does anyone know of any existing code in this space?

> Bonus points for informing the lower level scheduler of this as
> well. If the CPU is going idle, we'll enter some sort of power
> state in the processor. If we were able to pass in how long we
> expect to sleep, we could be making better decisions here.

Yup. Again, this seems like something more general than just the block layer. I
will do some digging and see if anything is available to leverage here.

Cheers
Stephen
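
To make the A+B+C idea above concrete, a sketch of the sleep-time calculation
(names are invented for illustration; this is not the blk-mq code):

	/*
	 * Sleep for the expected completion time minus the measured cost of
	 * going to sleep (A) and waking back up (C). If the overhead already
	 * exceeds the expected completion time, don't sleep at all and
	 * busy-poll instead.
	 */
	static u64 hybrid_poll_sleep_ns(u64 expected_cmpl_ns, u64 sleep_wake_cost_ns)
	{
		if (expected_cmpl_ns <= sleep_wake_cost_ns)
			return 0;
		return expected_cmpl_ns - sleep_wake_cost_ns;
	}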





Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module

2017-08-29 Thread Ben Hutchings
On Tue, 2017-08-29 at 08:31 -0600, Jens Axboe wrote:
> On Sun, Aug 13 2017, Ben Hutchings wrote:
> > The block core requests modules with the "-iosched" name suffix, but
> > bfq no longer has that suffix.  Add an alias.
> 
> I'd apply these two, but both patches are mangled when saved. It's
> turning == into =3D and so forth.
> 
> Care to check your settings and resend?

Just tried saving and applying with 'git am' successfully.  I think the
problem is at your end.

Ben.

-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.



signature.asc
Description: This is a digitally signed message part


Re: [PATCH] block: Make blk_dequeue_request() static

2017-08-29 Thread Bart Van Assche
On Tue, 2017-08-29 at 11:54 +0900, Damien Le Moal wrote:
> The only caller of this function is blk_start_request() in the same
> file. Fix blk_start_request() description accordingly.

Reviewed-by: Bart Van Assche 

Re: [RFC] block/loop: make loop cgroup aware

2017-08-29 Thread Tejun Heo
Hello, Shaohua.

On Tue, Aug 29, 2017 at 08:22:36AM -0700, Shaohua Li wrote:
> > Yeah, writeback tracks the most active cgroup and associates writeback
> > ios with that cgroup.  For buffered loop devices, I think it'd be fine
> > to make the loop driver forward the cgroup association and let the
> > writeback layer determine the matching association as usual.
> 
> Doing this means we must forward cgroup info to page, not bio. I need to check
> if we can make the mechanism work for memcg.

The association is already coming from the page.  We just need to make
sure that going through loop driver doesn't get in the way of the
membership getting propagated to the underlying device.

> > Hmm... why do we need double forwarding (original issuer -> aio cmd ->
> > ios) in loop?  If we need this, doesn't this mean that we're missing
> > ownership forwarding in aio paths and should make the forwarding
> > happen there?
> 
> what do you mean double forwarding?

So, this looks like the loop driver is explicitly forwarding the
association from the original issuer to the aio command and then from
the aio command to the ios to the underlying device.  I'm wondering
whether the right way to do this is making aio forward the association
by default, instead of the loop driver doing it explicitly.  That way,
all aio users can benefit from the forwarding instead of just loop.

Thanks.

-- 
tejun


Re: [RFC] block/loop: make loop cgroup aware

2017-08-29 Thread Shaohua Li
On Mon, Aug 28, 2017 at 03:54:59PM -0700, Tejun Heo wrote:
> Hello, Shaohua.
> 
> On Wed, Aug 23, 2017 at 11:15:15AM -0700, Shaohua Li wrote:
> > loop block device handles IO in a separate thread. The actual IO
> > dispatched isn't cloned from the IO loop device received, so the
> > dispatched IO loses the cgroup context.
> 
> Ah, right.  Thanks for spotting this.
> 
> > I'm ignoring buffer IO case now, which is quite complicated.  Making the
> > loop thread aware of cgroup context doesn't really help. The loop device
> > only writes to a single file. In current writeback cgroup
> > implementation, the file can only belong to one cgroup.
> 
> Yeah, writeback tracks the most active cgroup and associates writeback
> ios with that cgroup.  For buffered loop devices, I think it'd be fine
> to make the loop driver forward the cgroup association and let the
> writeback layer determine the matching association as usual.

Doing this means we must forward cgroup info to page, not bio. I need to check
if we can make the mechanism work for memcg.
 
> > For direct IO case, we could workaround the issue in theory. For
> > example, say we assign cgroup1 5M/s BW for loop device and cgroup2
> > 10M/s. We can create a special cgroup for loop thread and assign at
> > least 15M/s for the underlayer disk. In this way, we correctly throttle
> > the two cgroups. But this is tricky to setup.
> > 
> > This patch tries to address the issue. When loop thread is handling IO,
> > it declares the IO is on behalf of the original task, then in block IO
> > we use the original task to find cgroup. The concept probably works for
> > other scenarios too, but right now I don't make it generic yet.
> 
> The overall approach looks sound to me.  Some comments below.
> 
> > @@ -2058,7 +2058,10 @@ int bio_associate_current(struct bio *bio)
> >  
> > get_io_context_active(ioc);
> > bio->bi_ioc = ioc;
> > -   bio->bi_css = task_get_css(current, io_cgrp_id);
> > +   if (current->cgroup_task)
> > +   bio->bi_css = task_get_css(current->cgroup_task, io_cgrp_id);
> > +   else
> > +   bio->bi_css = task_get_css(current, io_cgrp_id);
> 
> Do we need a full pointer field for this?  I think we should be able
> to mark lo writers with a flag (PF or whatever) and then to
> kthread_data() to get the lo struct which contains the target css.
> Oh, let's do csses instead of tasks for consistency, correctness
> (please see below) and performance (csses are cheaper to pin).

Forwarding css sounds better.

> > @@ -502,11 +504,16 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
> > 	cmd->iocb.ki_complete = lo_rw_aio_complete;
> > 	cmd->iocb.ki_flags = IOCB_DIRECT;
> >
> > +	if (cmd->cgroup_task)
> > +		current->cgroup_task = cmd->cgroup_task;
> > +
> > 	if (rw == WRITE)
> > 		ret = call_write_iter(file, &cmd->iocb, &iter);
> > 	else
> > 		ret = call_read_iter(file, &cmd->iocb, &iter);
> >
> > +	current->cgroup_task = NULL;
> > +
> > 	if (ret != -EIOCBQUEUED)
> > 		cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
> > 	return 0;
> 
> Hmm... why do we need double forwarding (original issuer -> aio cmd ->
> ios) in loop?  If we need this, doesn't this mean that we're missing
> ownership forwarding in aio paths and should make the forwarding
> happen there?

what do you mean double forwarding?
> 
> > @@ -1705,6 +1712,11 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> > break;
> > }
> >  
> > +   if (cmd->use_aio) {
> > +   cmd->cgroup_task = current;
> > +   get_task_struct(current);
> > +   } else
> > +   cmd->cgroup_task = NULL;
> 
> What if %current isn't the original issuer of the io?  It could be a
> writeback worker trying to flush to a loop device and we don't want to
> attribute those ios to the writeback worker.  We should forward the
> bi_css not the current task.

Makes sense.

Thanks,
Shaohua
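
To make the css forwarding concrete, a rough sketch (not the posted patch:
cmd->css is an invented field here, and backing_bio stands in for whatever bio
is eventually built for the backing device):

	/* In loop_queue_rq(): capture the issuing bio's css, taking a ref. */
	cmd->css = rq->bio->bi_css;
	if (cmd->css)
		css_get(cmd->css);

	/* Wherever a bio aimed at the backing device is assembled: */
	if (cmd->css)
		bio_associate_blkcg(backing_bio, cmd->css);

For the dio path there is no bio to tag on the loop side, so the css would
have to travel through the kiocb into the filesystem -- which is exactly the
open question in this thread.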


Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode

2017-08-29 Thread Shaohua Li
On Tue, Aug 29, 2017 at 05:56:05PM +0800, Ming Lei wrote:
> On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> > From: Shaohua Li 
> > 
> > Currently loop disables merge. While it makes sense for buffer IO mode,
> > directio mode can benefit from request merge. Without merge, loop could
> > send small size IO to underlayer disk and harm performance.
> 
> Hi Shaohua,
> 
> IMO, no matter whether merge is used, loop always sends pages one by one
> to the VFS in both dio and buffered I/O.

Why do you think so?
 
> Also if merge is enabled on loop, that means merge is run
> on both loop and low level block driver, and not sure if we
> can benefit from that.

why does merge still happen in low level block driver?

> 
> So Could you provide some performance data about this patch?

In my virtual machine, a workload improves from ~20M/s to ~50M/s. And I clearly
see the request sizes become bigger.

> > 
> > Reviewed-by: Omar Sandoval 
> > Signed-off-by: Shaohua Li 
> > ---
> >  drivers/block/loop.c | 66 
> > 
> >  drivers/block/loop.h |  1 +
> >  2 files changed, 52 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 428da07..75a8f6e 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
> >  */
> > blk_mq_freeze_queue(lo->lo_queue);
> > lo->use_dio = use_dio;
> > -   if (use_dio)
> > +   if (use_dio) {
> > +   queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
> > lo->lo_flags |= LO_FLAGS_DIRECT_IO;
> > -   else
> > +   } else {
> > +   queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
> > lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
> > +   }
> > blk_mq_unfreeze_queue(lo->lo_queue);
> >  }
> >  
> > @@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
> >  {
> > struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
> >  
> > +   kfree(cmd->bvec);
> > +   cmd->bvec = NULL;
> > cmd->ret = ret;
> > blk_mq_complete_request(cmd->rq);
> >  }
> > @@ -473,22 +478,50 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
> >  {
> > struct iov_iter iter;
> > struct bio_vec *bvec;
> > -   struct bio *bio = cmd->rq->bio;
> > +   struct request *rq = cmd->rq;
> > +   struct bio *bio = rq->bio;
> > struct file *file = lo->lo_backing_file;
> > +   unsigned int offset;
> > +   int segments = 0;
> > int ret;
> >  
> > -   /* nomerge for loop request queue */
> > -   WARN_ON(cmd->rq->bio != cmd->rq->biotail);
> > +   if (rq->bio != rq->biotail) {
> > +   struct req_iterator iter;
> > +   struct bio_vec tmp;
> > +
> > +   __rq_for_each_bio(bio, rq)
> > +   segments += bio_segments(bio);
> > +   bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL);
> 
> The allocation should have been GFP_NOIO.

Sounds good. Making this completely correct isn't easy though; we are calling
into the underlying fs operations, which can do allocations of their own.

Thanks,
Shaohua
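
One way to cover the allocations done deeper in the filesystem, as hinted at
above, would be the memalloc_noio machinery (a sketch only, not the posted
patch; rw, file, cmd and iter are the locals from lo_rw_aio() quoted above):

	unsigned int noio_flags;

	/* Force all nested allocations in the VFS call to behave as GFP_NOIO. */
	noio_flags = memalloc_noio_save();
	if (rw == WRITE)
		ret = call_write_iter(file, &cmd->iocb, &iter);
	else
		ret = call_read_iter(file, &cmd->iocb, &iter);
	memalloc_noio_restore(noio_flags);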


Re: [GIT PULL] nvme update for Linux 4.14, take 2

2017-08-29 Thread Jens Axboe
On 08/29/2017 09:05 AM, Christoph Hellwig wrote:
> Hi Jens,
> 
> below is the current set of NVMe updates for Linux 4.14, now against your
> postmerge branch, and with three more patches.
> 
> The biggest bit comes from Sagi and refactors the RDMA driver to prepare
> for more code sharing in the setup and teardown path.  But we have various
> features and bug fixes from a lot of people as well.

Pulled, thanks.

-- 
Jens Axboe



[GIT PULL] nvme update for Linux 4.14, take 2

2017-08-29 Thread Christoph Hellwig
Hi Jens,

below is the current set of NVMe updates for Linux 4.14, now against your
postmerge branch, and with three more patches.

The biggest bit comes from Sagi and refactors the RDMA driver to prepare
for more code sharing in the setup and teardown path.  But we have various
features and bug fixes from a lot of people as well.

The following changes since commit cd996fb47c360320cf25ac9503c16de085ea9cfc:

  Merge tag 'v4.13-rc7' into for-4.14/block-postmerge (2017-08-28 13:00:44 -0600)

are available in the git repository at:

  git://git.infradead.org/nvme.git nvme-4.14

for you to fetch changes up to 1d5df6af8c7469f9ae3e66e7bed0782cfe4f95db:

  nvme: don't blindly overwrite identifiers on disk revalidate (2017-08-29 10:23:04 +0200)


Arnav Dawn (2):
  nvme: add support for FW activation without reset
  nvme: define NVME_NSID_ALL

Christoph Hellwig (6):
  nvmet: use NVME_NSID_ALL
  nvme: report more detailed status codes to the block layer
  nvme: allow calling nvme_change_ctrl_state from irq context
  nvme: remove unused struct nvme_ns fields
  nvme: remove nvme_revalidate_ns
  nvme: don't blindly overwrite identifiers on disk revalidate

Guan Junxiong (2):
  nvmet: fix the return error code of target if host is not allowed
  nvme-fabrics: log a warning if hostid is invalid

James Smart (2):
  nvme-fc: Reattach to localports on re-registration
  nvmet-fc: simplify sg list handling

Jan H. Schönherr (1):
  nvme: fix uninitialized prp2 value on small transfers

Johannes Thumshirn (2):
  nvmet-fcloop: remove ALL_OPTS define
  nvme-rdma: remove NVME_RDMA_MAX_SEGMENT_SIZE

Jon Derrick (1):
  nvme: add support for NVMe 1.3 Timestamp Feature

Martin K. Petersen (1):
  nvme: honor RTD3 Entry Latency for shutdowns

Martin Wilck (2):
  string.h: add memcpy_and_pad()
  nvmet: use memcpy_and_pad for identify sn/fr

Max Gurtovoy (3):
  nvme: add symbolic constants for CC identifiers
  nvme: rename AMS symbolic constants to fit specification
  nvme-rdma: Use unlikely macro in the fast path

Sagi Grimberg (13):
  nvme-rdma: move nvme_rdma_configure_admin_queue code location
  nvme: Add admin_tagset pointer to nvme_ctrl
  nvme-rdma: move tagset allocation to a dedicated routine
  nvme-rdma: disable the controller on resets
  nvme-rdma: don't free tagset on resets
  nvme-rdma: reuse configure/destroy_admin_queue
  nvme-rdma: introduce configure/destroy io queues
  nvme-rdma: stop queues instead of simply flipping their state
  nvme-rdma: rename nvme_rdma_init_queue to nvme_rdma_alloc_queue
  nvme-rdma: introduce nvme_rdma_start_queue
  nvme-rdma: cleanup error path in controller reset
  nvme-rdma: call ops->reg_read64 instead of nvmf_reg_read64
  nvme: fix identify namespace logging

 drivers/nvme/host/core.c| 242 +
 drivers/nvme/host/fabrics.c |   1 +
 drivers/nvme/host/fc.c  | 145 ---
 drivers/nvme/host/nvme.h|  10 +-
 drivers/nvme/host/pci.c |   5 +-
 drivers/nvme/host/rdma.c| 564 
 drivers/nvme/target/admin-cmd.c |  16 +-
 drivers/nvme/target/configfs.c  |   2 +-
 drivers/nvme/target/core.c  |  15 +-
 drivers/nvme/target/fc.c|  48 +---
 drivers/nvme/target/fcloop.c|   3 -
 drivers/nvme/target/loop.c  |   1 +
 include/linux/nvme-fc-driver.h  |   2 +-
 include/linux/nvme.h|  37 ++-
 include/linux/string.h  |  30 +++
 15 files changed, 677 insertions(+), 444 deletions(-)


[PATCH] blk-mq-sched: remove the empty entry in token wait list

2017-08-29 Thread 查斌
From:  Bin Zha 


When kyber adjusts the sync and other-write depths down to the
minimum (1), there is a case that may cause requests to be
stalled in the kyber_hctx_data list.

The following example I have tested:


CPU7
block_rq_insert
add_wait_queue
__sbitmap_queue_get               CPU23
block_rq_issue                    block_rq_insert
block_rq_complete  ---------->    waiting token free
                                  block_rq_issue
        ^                         block_rq_complete
        |                              |
        |                              v
        |                         CPU29
        |                         block_rq_insert
        |                         waiting token free
        |                         block_rq_issue
        +------------------------ block_rq_complete

CPU1
block_rq_insert
waiting token free


The IO request completion on CPU29 will wake up CPU7, because
CPU7 had been added to the waitqueue in kyber_get_domain_token.
But CPU7 is an empty waiter (it already got its token), so the
wakeup is consumed without waking CPU1. If no other requests are
issued to push it along, the requests will stall in the
kyber_hctx_data list.


Signed-off-by: Bin Zha 

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index b9faabc..584bfd5 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -548,6 +548,10 @@ static int kyber_get_domain_token(struct kyber_queue_data *kqd,
		 * queue.
		 */
		nr = __sbitmap_queue_get(domain_tokens);
+		if (nr >= 0) {
+			remove_wait_queue(&ws->wait, wait);
+			INIT_LIST_HEAD(&wait->task_list);
+		}
}
return nr;
 }


Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-29 Thread Christoph Hellwig
On Tue, Aug 29, 2017 at 06:22:50PM +0800, Guan Junxiong wrote:
> Does this introduce conflicts with current DM-Multipath used for NVMe/NVMeF
> when path IO error occurs?  Such IO will be retried not only on the nvme-mpath
> internal path, but also on the dm-mpath path.

It will not reach back to dm-multipath if we fail over here.


Re: [GIT PULL] nvme update for Linux 4.14

2017-08-29 Thread Christoph Hellwig
On Tue, Aug 29, 2017 at 08:34:29AM -0600, Jens Axboe wrote:
> Let me know when it's ready.

I'll send you another pull request in a bit.


Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-29 Thread Keith Busch
On Wed, Aug 23, 2017 at 07:58:15PM +0200, Christoph Hellwig wrote:
> + /* Anything else could be a path failure, so should be retried */
> +	spin_lock_irqsave(&ns->head->requeue_lock, flags);
> +	blk_steal_bios(&ns->head->requeue_list, req);
> +	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> +
> +	nvme_reset_ctrl(ns->ctrl);
> +	kblockd_schedule_work(&ns->head->requeue_work);
> + return true;
> +}

It appears this isn't going to cause the path selection to failover for
the requeued work. The bio's bi_disk is unchanged from the failed path when the
requeue_work submits the bio again so it will use the same path, right? 

It also looks like new submissions will get a new path only from the
fact that the original/primary is being reset. The controller reset
itself seems a bit heavy-handed. Can we just set head->current_path to
the next active controller in the list?


> +static void nvme_requeue_work(struct work_struct *work)
> +{
> + struct nvme_ns_head *head =
> + container_of(work, struct nvme_ns_head, requeue_work);
> + struct bio *bio, *next;
> +
> +	spin_lock_irq(&head->requeue_lock);
> +	next = bio_list_get(&head->requeue_list);
> +	spin_unlock_irq(&head->requeue_lock);
> +
> + while ((bio = next) != NULL) {
> + next = bio->bi_next;
> + bio->bi_next = NULL;
> + generic_make_request_fast(bio);
> + }
> +}

Here, I think we need to reevaluate the path (nvme_find_path) and set
bio->bi_disk accordingly.
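
Concretely, the requeue loop above could re-pick the path before resubmitting,
along these lines (a sketch of the suggestion, assuming nvme_find_path()
returns an active ns for the head as in the rest of this series):

	while ((bio = next) != NULL) {
		struct nvme_ns *ns = nvme_find_path(head);

		next = bio->bi_next;
		bio->bi_next = NULL;
		if (ns)
			bio->bi_disk = ns->disk;	/* route to the newly selected path */
		generic_make_request_fast(bio);
	}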


Re: [GIT PULL] nvme update for Linux 4.14

2017-08-29 Thread Jens Axboe
On 08/28/2017 01:56 PM, Sagi Grimberg wrote:
>>>> Meh, I don't think that'll work out - the rdma refactoring from
>>>> Sagi deeply depends on the uninit_ctrl split..
>>>
>>> I can do a for-4.14/block-postmerge branch, which is for-4.14/block
>>> with -rc7 pulled in. Should apply on top of that.
>>>
>>> Did that branch, doesn't apply. But should be easier for you to
>>> rebase on top of that branch. Pushed it out.
>>
>> I've pushed out a nvme-4.14-rebase branch with the rebase, but I didn't
>> have time for non-trivial testing yet, that'll have to wait for
>> tomorrow.
>>
>> Sagi: can you double check the resolutions around blk_mq_reinit_tagset?
> 
> Thanks for doing it Christoph,
> 
> It looks good to me, there is one glitch in patch:
> "nvme-rdma: reuse configure/destroy_admin_queue"
> 
> The "nvme_rdma_configure_admin_queue" block needs to
> be squashed into "nvme-rdma: don't free tagset on resets".
> I'll do that now.

Let me know when it's ready.

-- 
Jens Axboe



Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module

2017-08-29 Thread Jens Axboe
On Sun, Aug 13 2017, Ben Hutchings wrote:
> The block core requests modules with the "-iosched" name suffix, but
> bfq no longer has that suffix.  Add an alias.

I'd apply these two, but both patches are mangled when saved. It's
turning == into =3D and so forth.

Care to check your settings and resend?

-- 
Jens Axboe



non-blocking buffered reads V5

2017-08-29 Thread Christoph Hellwig
This series resurrects the old patches from Milosz to implement
non-blocking buffered reads.  Thanks to the non-blocking AIO code from
Goldwyn the implementation becomes pretty much trivial.

I've also forward ported the test Milosz sent for recent xfsprogs to
verify that this series works properly, but I'll still have to address
the review comments for it. I'll also volunteer to work with Goldwyn to
properly document the RWF_NOWAIT flag in the man page including this
change.

Changes from V4:
 - improve conditionals in generic_file_buffered_read

Changes from V3:
 - forward ported to the latest kernel
 - fixed a compiler warning

Changes from V2:
 - keep returning -EOPNOTSUPP for the not supported buffered write case
 - add block device node support
 - rebase against current Linus' tree, which has all the requirements

Changes from V1:
 - fix btrfs to reject nowait buffered writes
 - tested btrfs and ext4 in addition to xfs this time


Here are additional details from the original cover letter from Milosz,
where the flag was still called RWF_NONBLOCK:


Background:

 Using a threadpool to emulate non-blocking operations on regular buffered
 files is a common pattern today (samba, libuv, etc...) Applications split the
 work between network bound threads (epoll) and IO threadpool. Not every
 application can use sendfile syscall (TLS / post-processing).

 This common pattern leads to increased request latency. Latency can be due to
 additional synchronization between the threads or a fast (cached data) request
 stuck behind a slow request (large / uncached data).

 The preadv2 syscall with RWF_NONBLOCK lets userspace applications bypass
 enqueuing the operation in the threadpool if the data is already available in
 the pagecache.
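
 To make the pattern concrete, the userspace side looks roughly like this
 (a sketch with error handling trimmed; RWF_NOWAIT is the flag name this
 series uses, and the call fails with EAGAIN when the data is not cached):

 #define _GNU_SOURCE
 #include <sys/uio.h>
 #include <errno.h>

 /* Try to serve a read from the page cache; -1 means "queue it instead". */
 static ssize_t try_fast_read(int fd, void *buf, size_t len, off_t off)
 {
 	struct iovec iov = { .iov_base = buf, .iov_len = len };
 	ssize_t ret = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

 	if (ret < 0 && errno == EAGAIN)
 		return -1;	/* uncached: hand off to the IO threadpool */
 	return ret;		/* served synchronously from the page cache */
 }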


Performance numbers (newer Samba):

 https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
 
https://docs.google.com/spreadsheets/d/1GGTivi-MfZU0doMzomG4XUo9ioWtRvOGQ5FId042L6s/edit?usp=sharing


Performance number (older):

 Some perf data generated using fio comparing the posix aio engine to a version
 of the posix AIO engine that attempts to perform "fast" reads before
 submitting the operations to the queue. This workflow is on an ext4 partition
 on raid0 (test / build-rig.) Simulating our database access pattern workload
 using 16kb read accesses. Our database uses a home-spun posix-aio-like queue
 (samba does the same thing.)

 f1: ~73% rand read over mostly cached data (zipf med-size dataset)
 f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
 f3: ~9% seq-read over large dataset

 before:

 f1:
 bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
 lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
 lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
 f2:
 bw (KB  /s): min=2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
 lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
 lat (msec) : >=2000=4.33%
 f3:
 bw (KB  /s): min=0, max=265568, per=99.95%, avg=174575.10,
  stdev=34526.89
 lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
 lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
 lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
 lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
 total:
READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
  mint=61msec, maxt=600113msec

 after (with fast read using preadv2 before submit):

 f1:
 bw (KB  /s): min=3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
 lat (usec) : 2=70.63%, 4=0.01%
 lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
 f2:
 bw (KB  /s): min=2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
 lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
 lat (msec) : >=2000=9.99%
 f3:
 bw (KB  /s): min=1, max=245448, per=100.00%, avg=177366.50,
  stdev=35995.60
 lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
 lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
 lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
 lat (msec) : 100=0.05%, 250=0.02%
 total:
READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
  mint=600020msec, maxt=600178msec

 Interpreting the results you can see total bandwidth stays the same but overall
 request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
 workloads. There is a slight bump in latency for f2 since it's random data
 that's unlikely to be cached but we're always trying the "fast read" first.

 In our application we have started keeping track of "fast read" hits/misses,
 and for files / requests that have a low hit ratio we don't do "fast reads",
 mostly getting rid of the extra latency in the uncached cases. In our real
 world workload we were able to reduce average response time by 20 to 30%.
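
 A sketch of that adaptive heuristic (names and thresholds are made up for
 illustration, not the values the application actually uses):

	struct fast_read_stats {
		unsigned long hits;
		unsigned long misses;
	};

	/* Keep attempting the non-blocking fast path only while it pays
	 * off often enough to be worth the extra syscall. */
	static int should_try_fast_read(const struct fast_read_stats *s)
	{
		unsigned long total = s->hits + s->misses;

		/* always sample a little, then require a ~25% hit rate */
		return total < 128 || s->hits * 4 >= total;
	}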

[PATCH 4/4] block_dev: support RWF_NOWAIT on block device nodes

2017-08-29 Thread Christoph Hellwig
All support is already there in the generic code, we just need to wire
it up.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/block_dev.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9941dc8342df..ea21d18d8e79 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1739,6 +1739,8 @@ static int blkdev_open(struct inode * inode, struct file 
* filp)
 */
filp->f_flags |= O_LARGEFILE;
 
+   filp->f_mode |= FMODE_NOWAIT;
+
if (filp->f_flags & O_NDELAY)
filp->f_mode |= FMODE_NDELAY;
if (filp->f_flags & O_EXCL)
@@ -1891,6 +1893,9 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct 
iov_iter *from)
if (iocb->ki_pos >= size)
return -ENOSPC;
 
+   if ((iocb->ki_flags & (IOCB_NOWAIT | IOCB_DIRECT)) == IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
iov_iter_truncate(from, size - iocb->ki_pos);
 
blk_start_plug(&plug);
-- 
2.11.0



[PATCH 3/4] fs: support RWF_NOWAIT for buffered reads

2017-08-29 Thread Christoph Hellwig
This is based on the old idea and code from Milosz Tanski.  With the aio
nowait code it becomes mostly trivial now.  Buffered writes continue to
return -EOPNOTSUPP if RWF_NOWAIT is passed.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 fs/aio.c   |  6 --
 fs/btrfs/file.c|  6 +-
 fs/ext4/file.c |  6 +++---
 fs/xfs/xfs_file.c  | 11 +--
 include/linux/fs.h |  6 +++---
 5 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index dcad3a66748c..d93daa076726 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1593,12 +1593,6 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
goto out_put_req;
}
 
-   if ((req->common.ki_flags & IOCB_NOWAIT) &&
-   !(req->common.ki_flags & IOCB_DIRECT)) {
-   ret = -EOPNOTSUPP;
-   goto out_put_req;
-   }
-
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9e75d8a39aac..e62dd55b4079 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1886,6 +1886,10 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
loff_t oldsize;
int clean_page = 0;
 
+   if (!(iocb->ki_flags & IOCB_DIRECT) &&
+   (iocb->ki_flags & IOCB_NOWAIT))
+   return -EOPNOTSUPP;
+
if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
return -EAGAIN;
@@ -3105,7 +3109,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t 
offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-   filp->f_mode |= FMODE_AIO_NOWAIT;
+   filp->f_mode |= FMODE_NOWAIT;
return generic_file_open(inode, filp);
 }
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0d7cf0cc9b87..f83521337b8f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -223,6 +223,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
if (IS_DAX(inode))
return ext4_dax_write_iter(iocb, from);
 #endif
+   if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
+   return -EOPNOTSUPP;
 
if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -448,9 +450,7 @@ static int ext4_file_open(struct inode * inode, struct file 
* filp)
return ret;
}
 
-   /* Set the flags to support nowait AIO */
-   filp->f_mode |= FMODE_AIO_NOWAIT;
-
+   filp->f_mode |= FMODE_NOWAIT;
return dquot_file_open(inode, filp);
 }
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c4893e226fd8..1a09104b3eb0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -259,7 +259,11 @@ xfs_file_buffered_aio_read(
 
trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
 
-   xfs_ilock(ip, XFS_IOLOCK_SHARED);
+   if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EAGAIN;
+   xfs_ilock(ip, XFS_IOLOCK_SHARED);
+   }
ret = generic_file_read_iter(iocb, to);
xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
@@ -636,6 +640,9 @@ xfs_file_buffered_aio_write(
int enospc = 0;
int iolock;
 
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EOPNOTSUPP;
+
 write_retry:
iolock = XFS_IOLOCK_EXCL;
xfs_ilock(ip, iolock);
@@ -912,7 +919,7 @@ xfs_file_open(
return -EFBIG;
if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
return -EIO;
-   file->f_mode |= FMODE_AIO_NOWAIT;
+   file->f_mode |= FMODE_NOWAIT;
return 0;
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e1fd5d21248..ded258edc425 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -146,8 +146,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t 
offset,
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY ((__force fmode_t)0x400)
 
-/* File is capable of returning -EAGAIN if AIO will block */
-#define FMODE_AIO_NOWAIT   ((__force fmode_t)0x800)
+/* File is capable of returning -EAGAIN if I/O will block */
+#define FMODE_NOWAIT   ((__force fmode_t)0x800)
 
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
@@ -3149,7 +3149,7 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, 
int flags)
return -EOPNOTSUPP;
 
if (flags & RWF_NOWAIT) {
-   if (!(ki->ki_filp->f_mode & FMODE_AIO_NOWAIT))
+   if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
return -EOPNOTSUPP;
ki->ki_flags |= IOCB_NOWAIT;
}
-- 
2.11.0



[PATCH 1/4] fs: pass iocb to do_generic_file_read

2017-08-29 Thread Christoph Hellwig
And rename it to the more descriptive generic_file_buffered_read while
at it.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
---
 mm/filemap.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a49702445ce0..4bcfa74ad802 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1886,9 +1886,8 @@ static void shrink_readahead_size_eio(struct file *filp,
 }
 
 /**
- * do_generic_file_read - generic file read routine
- * @filp:  the file to read
- * @ppos:  current file position
+ * generic_file_buffered_read - generic file read routine
+ * @iocb:  the iocb to read
  * @iter:  data destination
  * @written:   already copied
  *
@@ -1898,12 +1897,14 @@ static void shrink_readahead_size_eio(struct file *filp,
  * This is really ugly. But the goto's actually try to clarify some
  * of the logic when it comes to error handling etc.
  */
-static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
+static ssize_t generic_file_buffered_read(struct kiocb *iocb,
struct iov_iter *iter, ssize_t written)
 {
+   struct file *filp = iocb->ki_filp;
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
struct file_ra_state *ra = &filp->f_ra;
+   loff_t *ppos = &iocb->ki_pos;
pgoff_t index;
pgoff_t last_index;
pgoff_t prev_index;
@@ -2151,14 +2152,14 @@ static ssize_t do_generic_file_read(struct file *filp, 
loff_t *ppos,
 ssize_t
 generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
-   struct file *file = iocb->ki_filp;
-   ssize_t retval = 0;
size_t count = iov_iter_count(iter);
+   ssize_t retval = 0;
 
if (!count)
goto out; /* skip atime */
 
if (iocb->ki_flags & IOCB_DIRECT) {
+   struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
loff_t size;
@@ -2199,7 +2200,7 @@ generic_file_read_iter(struct kiocb *iocb, struct 
iov_iter *iter)
goto out;
}
 
-   retval = do_generic_file_read(file, &iocb->ki_pos, iter, retval);
+   retval = generic_file_buffered_read(iocb, iter, retval);
 out:
return retval;
 }
-- 
2.11.0



[PATCH 2/4] fs: support IOCB_NOWAIT in generic_file_buffered_read

2017-08-29 Thread Christoph Hellwig
From: Milosz Tanski 

Allow generic_file_buffered_read to bail out early instead of waiting for
the page lock or reading a page if IOCB_NOWAIT is specified.

Signed-off-by: Milosz Tanski 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jeff Moyer 
Acked-by: Sage Weil 
---
 mm/filemap.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4bcfa74ad802..eed394fd331c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1937,6 +1937,8 @@ static ssize_t generic_file_buffered_read(struct kiocb 
*iocb,
 
page = find_get_page(mapping, index);
if (!page) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   goto would_block;
page_cache_sync_readahead(mapping,
ra, filp,
index, last_index - index);
@@ -1950,6 +1952,11 @@ static ssize_t generic_file_buffered_read(struct kiocb 
*iocb,
index, last_index - index);
}
if (!PageUptodate(page)) {
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   put_page(page);
+   goto would_block;
+   }
+
/*
 * See comment in do_read_cache_page on why
 * wait_on_page_locked is used to avoid unnecessarily
@@ -2131,6 +2138,8 @@ static ssize_t generic_file_buffered_read(struct kiocb 
*iocb,
goto readpage;
}
 
+would_block:
+   error = -EAGAIN;
 out:
ra->prev_pos = prev_index;
ra->prev_pos <<= PAGE_SHIFT;
-- 
2.11.0



Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests

2017-08-29 Thread h...@lst.de
Hi Bart,

this isn't just about performance - it's about understandability of
the I/O path.

The legacy request path already has the prep_fn, which is intended
for exactly this sort of prep work, but even that led to a lot
of confusion.  For blk-mq we decided not to add it but to leave the
driver in control.  I really don't want to move away from that.

The passthrough requests using blk_get_request are a special case
as the caller allocates the requests, stuffs data into them and only
then hands control to the driver, and thus we need some way to
initialize the request before handing control to the driver in
this particular special case.  And if needed we could actually
do that with explicit calls instead of the callback, although you
changed it to save a bit of code.
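
For reference, the flag-based variant floated in this thread would look
roughly like the sketch below (RQD_DRV_INITIALIZED does not exist yet,
and foo_initialize_rq() stands in for whatever per-request setup the
driver needs):

	static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
					 const struct blk_mq_queue_data *bd)
	{
		struct request *rq = bd->rq;

		/* one-time setup on first dispatch, no indirect call */
		if (!(rq->rq_flags & RQD_DRV_INITIALIZED)) {
			foo_initialize_rq(rq);
			rq->rq_flags |= RQD_DRV_INITIALIZED;
		}

		/* ... normal dispatch ... */
		return BLK_STS_OK;
	}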


Re: [PATCH] block: Make blk_dequeue_request() static

2017-08-29 Thread Christoph Hellwig
On Tue, Aug 29, 2017 at 11:54:37AM +0900, Damien Le Moal wrote:
> The only caller of this function is blk_start_request() in the same
> file. Fix blk_start_request() description accordingly.
> 
> Signed-off-by: Damien Le Moal 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 0/4] loop: support different logical block sizes

2017-08-29 Thread Ming Lei
On Thu, Aug 24, 2017 at 12:03:40AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Hi, everyone,
> 
> Here's another attempt at supporting different logical block sizes for
> loop devices. First, some terminology:
> 
>  * Logical block size: the lowest possible block size that the storage
>device can address.
>  * Physical block size: the lowest possible sector size that the
>hardware can operate on without reverting to read-modify-write
>operations. Always greater than or equal to the logical block size.
> 
> These are characteristics of the device itself tracked in the
> request_queue. For loop devices, both are currently always 512 bytes.
> 
> struct block_device also has another block size, basically a "soft"
> block size which is the preferred I/O size and can be changed from
> userspace. For loop devices, this is PAGE_SIZE if the backing file is a
> regular file or otherwise the soft block size of the backing block
> device.
> 
> Patch 1 simplifies how the soft block size is set. I'm tempted to not
> set it at all and use the default of PAGE_SIZE, but maybe someone is
> depending on inheriting the soft block size from the backing block
> device...
> 
> Patch 2 sets the physical block size of loop devices to a more
> meaningful value for loop, PAGE_SIZE.
> 
> Patch 3 allows us to change the logical block size. It adds a new ioctl
> instead of the previous approach (which extends LOOP_{GET,SET}_STATUS).
> Hannes, I assume you had a specific reason for doing it the way you did,
> care to explain?
> 
> Patch 4 is a cleanup.
> 
> This is based on my patch reverting the original block size support,
> targeting 4.14. Thanks!
> 
> Omar Sandoval (4):
>   loop: get rid of lo_blocksize
>   loop: set physical block size to PAGE_SIZE
>   loop: add ioctl for changing logical block size
>   loop: fold loop_switch() into callers
> 
>  drivers/block/loop.c  | 112 
> --
>  drivers/block/loop.h  |   1 -
>  include/uapi/linux/loop.h |   1 +
>  3 files changed, 40 insertions(+), 74 deletions(-)
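
For context, with patch 3 applied userspace could then do something like
this (sketch; assuming the new ioctl from patch 3 ends up being called
LOOP_SET_BLOCK_SIZE):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/loop.h>

	int fd = open("/dev/loop0", O_RDWR);

	/* switch the loop device to 4096-byte logical blocks */
	if (fd >= 0)
		ioctl(fd, LOOP_SET_BLOCK_SIZE, 4096);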

Looks fine,

Reviewed-by: Ming Lei 

-- 
Ming


Re: [PATCH] block: Call .initialize_rq_fn() also for filesystem requests

2017-08-29 Thread Ming Lei
On Mon, Aug 28, 2017 at 06:31:33PM +, Bart Van Assche wrote:
> On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> > I still disagree that we should have an indirect function call like this
> > in the fast path.
> > 
> > All that can be done by clearing or setting a flag on the first call to
> > ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
> > that, but for SCSI we probably can't use that given that it has more
> > meaning for the old request path.  But how about just adding a new
> > RQD_DRV_INITIALIZED or similar flag?
> 
> Hello Christoph,
> 
> Sorry but I'm not enthusiastic about the proposal to introduce a
> RQD_DRV_INITIALIZED or similar flag. That approach involves an annoying
> behavior difference, namely that .initialize_rq_fn() would be called from
> inside blk_get_request() for pass-through requests and from inside the prep
> function for filesystem requests. Another disadvantage of that approach is
> that the block layer core never clears request.atomic_flags completely but
> only sets and clears individual flags. The SCSI core would have to follow
> that model and hence code for clearing RQD_DRV_INITIALIZED would have to be
> added to all request completion paths in the SCSI core.
> 
> Have you noticed that Ming Lei's patch series introduces several new atomic
> operations in the hot path? I'm referring here to the BLK_MQ_S_DISPATCH_BUSY
> manipulations. Have you noticed that for SCSI drivers these patches introduce
> an overhead between 0.1 and 1.0 microseconds per I/O request in the hot path?
> I have derived these numbers from the random write SRP performance numbers
> as follows: 1/142460 - 1/142990 = 0.026 microseconds. That number has to be
> multiplied with the number of I/O requests processed in parallel. The number
> of jobs in Ming Lei's test was 64 but that's probably way higher than the
> actual I/O parallelism.

Hi Bart,

Did you see a perf regression on SRP with smaller job counts after applying
my patchset V3?

I just ran the test with 16 jobs (the system has 16 CPU cores) instead of 64,
and I don't see a perf regression on SRP between v4.13-rc6+blk-next
(2nd column) and v4.13-rc6+blk-next+patch V3 (3rd column):


 -------------------------------------------------
  IOPS(K)   | blk-next   | blk-next+patch V3
 -------------------------------------------------
  read      |   475.83   |   485.88
  randread  |   142.86   |   141.96
  write     |   483.9    |   492.39
  randwrite |   124.83   |   124.53
 -------------------------------------------------

 -------------------------------------------------
  CPU(%)    | blk-next   | blk-next+patch V3
 -------------------------------------------------
  read      |    15.43   |    15.81
  randread  |     9.87   |     9.75
  write     |    17.67   |    18.34
  randwrite |    10.96   |    10.56
 -------------------------------------------------

 -------------------------------------------------
  LAT(us)   | blk-next   | blk-next+patch V3
 -------------------------------------------------
  read      |     2.15   |     2.11
  randread  |     7.17   |     7.21
  write     |     2.11   |     2.08
  randwrite |     8.2    |     8.22
 -------------------------------------------------

 -------------------------------------------------
  MERGE(K)  | blk-next   | blk-next+patch V3
 -------------------------------------------------
  read      |  8798.59   |  9064.09
  randread  |     0.02   |     0.19
  write     |  8847.73   |  9102.18
  randwrite |     0.03   |     0.13
 -------------------------------------------------



-- 
Ming


Re: [PATCH 2/2] mq-deadline: Enable auto-loading when built as module

2017-08-29 Thread Ming Lei
On Sun, Aug 13, 2017 at 06:03:15PM +0100, Ben Hutchings wrote:
> The block core requests modules with the "-iosched" name suffix, but
> mq-deadline does not have that suffix.  Add an alias.
> 
> Fixes: 945ffb60c11d ("mq-deadline: add blk-mq adaptation of the deadline ...")
> Signed-off-by: Ben Hutchings 
> ---
>  block/mq-deadline.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 1b964a387afe..78e7698fa4ed 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -660,6 +660,7 @@ static struct elevator_type mq_deadline = {
>   .elevator_name = "mq-deadline",
>   .elevator_owner = THIS_MODULE,
>  };
> +MODULE_ALIAS("mq-deadline-iosched");
>  
>  static int __init deadline_init(void)
>  {

Reviewed-by: Ming Lei 

-- 
Ming


Re: [PATCH 1/2] bfq: Re-enable auto-loading when built as a module

2017-08-29 Thread Ming Lei
On Sun, Aug 13, 2017 at 06:02:19PM +0100, Ben Hutchings wrote:
> The block core requests modules with the "-iosched" name suffix, but
> bfq no longer has that suffix.  Add an alias.
> 
> Fixes: ea25da48086d ("block, bfq: split bfq-iosched.c into multiple ...")
> Signed-off-by: Ben Hutchings 
> ---
>  block/bfq-iosched.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 436b6ca6b175..3a01563bd564 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4998,6 +4998,7 @@ static struct elevator_type iosched_bfq_mq = {
>   .elevator_name ="bfq",
>   .elevator_owner =   THIS_MODULE,
>  };
> +MODULE_ALIAS("bfq-iosched");
>  
>  static int __init bfq_init(void)
>  {
> 

Reviewed-by: Ming Lei 


-- 
Ming


Re: [PATCH 10/10] nvme: implement multipath access to nvme subsystems

2017-08-29 Thread Guan Junxiong


On 2017/8/24 1:58, Christoph Hellwig wrote:
> -static inline bool nvme_req_needs_retry(struct request *req)
> +static bool nvme_failover_rq(struct request *req)
>  {
> - if (blk_noretry_request(req))
> + struct nvme_ns *ns = req->q->queuedata;
> + unsigned long flags;
> +
> + /*
> +  * Only fail over commands that came in through the multipath
> +  * aware submission path.  Note that ns->head might not be set up
> +  * for commands used during controller initialization, but those
> +  * must never set REQ_FAILFAST_TRANSPORT.
> +  */
> + if (!(req->cmd_flags & REQ_FAILFAST_TRANSPORT))
> + return false;
> +
> + switch (nvme_req(req)->status & 0x7ff) {
> + /*
> +  * Generic command status:
> +  */
> + case NVME_SC_INVALID_OPCODE:
> + case NVME_SC_INVALID_FIELD:
> + case NVME_SC_INVALID_NS:
> + case NVME_SC_LBA_RANGE:
> + case NVME_SC_CAP_EXCEEDED:
> + case NVME_SC_RESERVATION_CONFLICT:
> + return false;
> +
> + /*
> +  * I/O command set specific error.  Unfortunately these values are
> +  * reused for fabrics commands, but those should never get here.
> +  */
> + case NVME_SC_BAD_ATTRIBUTES:
> + case NVME_SC_INVALID_PI:
> + case NVME_SC_READ_ONLY:
> + case NVME_SC_ONCS_NOT_SUPPORTED:
> + WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
> + nvme_fabrics_command);
> + return false;
> +
> + /*
> +  * Media and Data Integrity Errors:
> +  */
> + case NVME_SC_WRITE_FAULT:
> + case NVME_SC_READ_ERROR:
> + case NVME_SC_GUARD_CHECK:
> + case NVME_SC_APPTAG_CHECK:
> + case NVME_SC_REFTAG_CHECK:
> + case NVME_SC_COMPARE_FAILED:
> + case NVME_SC_ACCESS_DENIED:
> + case NVME_SC_UNWRITTEN_BLOCK:
>   return false;
> + }
> +
> + /* Anything else could be a path failure, so should be retried */
> + spin_lock_irqsave(&ns->head->requeue_lock, flags);
> + blk_steal_bios(&ns->head->requeue_list, req);
> + spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> +
> + nvme_reset_ctrl(ns->ctrl);
> + kblockd_schedule_work(&ns->head->requeue_work);
> + return true;
> +}
> +
> +static inline bool nvme_req_needs_retry(struct request *req)
> +{
>   if (nvme_req(req)->status & NVME_SC_DNR)
>   return false;
>   if (jiffies - req->start_time >= req->timeout)
>   return false;
>   if (nvme_req(req)->retries >= nvme_max_retries)
>   return false;
> + if (nvme_failover_rq(req))
> + return false;
> + if (blk_noretry_request(req))
> + return false;
>   return true;
>  }

Does this introduce conflicts with the current DM-Multipath used for
NVMe/NVMeF when a path IO error occurs?  Such IO will be retried not only
on the nvme-mpath internal path, but also on the dm-mpath path.

In general, I wonder whether nvme-mpath can co-exist with DM-multipath
in a well-defined fashion.



Re: [PATCH V2 2/2] block/loop: allow request merge for directio mode

2017-08-29 Thread Ming Lei
On Thu, Aug 24, 2017 at 12:24:53PM -0700, Shaohua Li wrote:
> From: Shaohua Li 
> 
> Currently loop disables merge. While it makes sense for buffered IO mode,
> directio mode can benefit from request merge. Without merge, loop could
> send small size IO to the underlying disk and harm performance.

Hi Shaohua,

IMO no matter if merge is used, loop always sends I/O page by page
to the VFS in both dio and buffered I/O modes.

Also if merge is enabled on loop, that means merge runs on both
loop and the low-level block driver, and I am not sure we can
benefit from that.

So could you provide some performance data for this patch?

> 
> Reviewed-by: Omar Sandoval 
> Signed-off-by: Shaohua Li 
> ---
>  drivers/block/loop.c | 66 
> 
>  drivers/block/loop.h |  1 +
>  2 files changed, 52 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 428da07..75a8f6e 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -213,10 +213,13 @@ static void __loop_update_dio(struct loop_device *lo, 
> bool dio)
>*/
>   blk_mq_freeze_queue(lo->lo_queue);
>   lo->use_dio = use_dio;
> - if (use_dio)
> + if (use_dio) {
> + queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
>   lo->lo_flags |= LO_FLAGS_DIRECT_IO;
> - else
> + } else {
> + queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
>   lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
> + }
>   blk_mq_unfreeze_queue(lo->lo_queue);
>  }
>  
> @@ -464,6 +467,8 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
> ret, long ret2)
>  {
>   struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
>  
> + kfree(cmd->bvec);
> + cmd->bvec = NULL;
>   cmd->ret = ret;
>   blk_mq_complete_request(cmd->rq);
>  }
> @@ -473,22 +478,50 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>  {
>   struct iov_iter iter;
>   struct bio_vec *bvec;
> - struct bio *bio = cmd->rq->bio;
> + struct request *rq = cmd->rq;
> + struct bio *bio = rq->bio;
>   struct file *file = lo->lo_backing_file;
> + unsigned int offset;
> + int segments = 0;
>   int ret;
>  
> - /* nomerge for loop request queue */
> - WARN_ON(cmd->rq->bio != cmd->rq->biotail);
> + if (rq->bio != rq->biotail) {
> + struct req_iterator iter;
> + struct bio_vec tmp;
> +
> + __rq_for_each_bio(bio, rq)
> + segments += bio_segments(bio);
> + bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_KERNEL);

The allocation should have been GFP_NOIO.
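
I.e. something like (sketch):

	bvec = kmalloc(sizeof(struct bio_vec) * segments, GFP_NOIO);

since this allocation sits in the I/O submission path, where GFP_KERNEL
may recurse into reclaim and thus back into the loop device itself.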

-- 
Ming


Re: [PATCH V2 1/2] block/loop: set hw_sectors

2017-08-29 Thread Ming Lei
On Thu, Aug 24, 2017 at 12:24:52PM -0700, Shaohua Li wrote:
> From: Shaohua Li 
> 
> Loop can handle any size of request. Limiting it to 255 sectors just
> burns CPU on bio splits and request merges for the underlying disk and
> also causes bad fs block allocation in directio mode.
> 
> Reviewed-by: Omar Sandoval 
> Signed-off-by: Shaohua Li 
> ---
>  drivers/block/loop.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index b55a1f8..428da07 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1799,6 +1799,7 @@ static int loop_add(struct loop_device **l, int i)
>   }
>   lo->lo_queue->queuedata = lo;
>  
> + blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
>   /*
>* It doesn't make sense to enable merge because the I/O
>* submitted to backing file is handled page by page.
> -- 
> 2.9.5
> 

Reviewed-by: Ming Lei 

-- 
Ming


Re: [PATCH 6/6] power: supply: make device_attribute const

2017-08-29 Thread Sebastian Reichel
Hi,

On Mon, Aug 21, 2017 at 05:13:12PM +0530, Bhumika Goyal wrote:
> Make these const as they are only passed as arguments to the
> functions device_create_file and device_remove_file, and the
> corresponding parameters are of type const.
> Done using Coccinelle.
> 
> Signed-off-by: Bhumika Goyal 
> ---

Thanks, queued.

-- Sebastian

>  drivers/power/supply/olpc_battery.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/power/supply/olpc_battery.c 
> b/drivers/power/supply/olpc_battery.c
> index fc20ca3..3bc2eea 100644
> --- a/drivers/power/supply/olpc_battery.c
> +++ b/drivers/power/supply/olpc_battery.c
> @@ -559,7 +559,7 @@ static ssize_t olpc_bat_error_read(struct device *dev,
>   return sprintf(buf, "%d\n", ec_byte);
>  }
>  
> -static struct device_attribute olpc_bat_error = {
> +static const struct device_attribute olpc_bat_error = {
>   .attr = {
>   .name = "error",
>   .mode = S_IRUGO,
> -- 
> 1.9.1
> 




Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-29 Thread Christoph Hellwig
On Tue, Aug 29, 2017 at 10:42:18AM +0800, Guan Junxiong wrote:
> As for the __nvme_find_ns_head function, can it look up the namespace
> globally, not just in the current subsystem?

No.

> Take hypermetro scenario for

Please define "hypermetro"

> example, two namespaces which should be viewed as the same namespace
> by the database application but exist in two different cities.
> Some vendors may specify those two namespaces with the same UUID.

Then these vendors are non-compliant IFF the controllers don't belong
to the same subsystem.

> In addition, could you add a switch to turn on/off finding namespaces at
> a subsystem-wide level or globally?

No.

> Can a namespace be shared between two subsystems?

No - if you share namespace access you are in the same subsystem.


Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-29 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 08:41:23PM +0800, Guan Junxiong wrote:
> If a namespace can be accessed by another subsystem, the above line
> will ignore such a namespace.
>
> Or does the NVMe/NVMf specification constrain that any namespace
> can only be accessed by a single subsystem?

A namespace is a part of an NVMe subsystem.  You must not reuse the
unique identifier outside the subsystem scope or your implementation
will be non-compliant.


Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-29 Thread Christoph Hellwig
On Mon, Aug 28, 2017 at 01:21:20PM -0700, J Freyensee wrote:
> > >  horrible.  One idea would be to rename the current struct nvme_ns
> > >  to struct nvme_ns_link or similar and use the nvme_ns name for the
> > >  new structure.  But that would involve a lot of churn.
> > 
> > maybe nvme_ns_primary?
> 
> Since it looks like it holds all unique identifier values and should hold
> other namespace characteristics later, maybe:
> 
> nvme_ns_item?
> Or nvme_ns_entry?
> Or nvme_ns_element?
> Or nvme_ns_unit?
> Or nvme_ns_entity?
> Or nvme_ns_container?

I hate them all (including the current ns_head name :)).

I suspect the only way that would make my taste happy is to call
this new one nvme_ns, but that would lead to a lot of churn.


Re: [PATCH 07/10] nvme: track shared namespaces

2017-08-29 Thread Guan Junxiong


On 2017/8/28 14:51, Sagi Grimberg wrote:
> +static int __nvme_check_ids(struct nvme_subsystem *subsys,
> +		struct nvme_ns_head *new)
> +{
> +	struct nvme_ns_head *h;
> +
> +	lockdep_assert_held(&subsys->lock);
> +
> +	list_for_each_entry(h, &subsys->nsheads, entry) {
> +		if ((!uuid_is_null(&new->uuid) &&
> +		     uuid_equal(&new->uuid, &h->uuid)) ||
> +		    (memchr_inv(new->nguid, 0, sizeof(new->nguid)) &&
> +		     memcmp(&new->nguid, &h->nguid, sizeof(new->nguid))) ||

memcmp() -> !memcmp

> +		    (memchr_inv(new->eui64, 0, sizeof(new->eui64)) &&
> +		     memcmp(&new->eui64, &h->eui64, sizeof(new->eui64))))

memcmp() -> !memcmp
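
With both comparisons negated, the duplicate-ID check would read (sketch,
reconstructing the quoted hunk):

	if ((!uuid_is_null(&new->uuid) &&
	     uuid_equal(&new->uuid, &h->uuid)) ||
	    (memchr_inv(new->nguid, 0, sizeof(new->nguid)) &&
	     !memcmp(&new->nguid, &h->nguid, sizeof(new->nguid))) ||
	    (memchr_inv(new->eui64, 0, sizeof(new->eui64)) &&
	     !memcmp(&new->eui64, &h->eui64, sizeof(new->eui64))))
		return -EINVAL;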

Otherwise, this patch looks good.
Reviewed-by: Guan Junxiong