Re: [PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-13 Thread Arnd Bergmann
On Mon, Feb 13, 2017 at 6:07 PM, David Laight  wrote:
> From: Arnd Bergmann
>> Sent: 13 February 2017 17:02
> ...
>> >> > +   ioctl_ptr = memdup_user(arg,  _IOC_SIZE(cmd));
>> >> > +   if (IS_ERR_OR_NULL(ioctl_ptr)) {
>> >> > +   ret = PTR_ERR(ioctl_ptr);
>> >> > +   goto out;
>> >> ...
>> >> > + out:
>> >> > +   kfree(ioctl_ptr);
>> >> > +   return ret;
> ...
>> >> That error path doesn't look quite right to me.
> ...
>> > good god, this is a mess this morning. Thanks for the catch, I'll review
>> > these more aggressively before sending out.
>>
>> memdup_user() never returns NULL, and generally speaking any use of
>> IS_ERR_OR_NULL() indicates that there is something wrong with the
>> interface, so aside from passing the right pointer (or NULL) into kfree()
>> I think using IS_ERR() is the correct solution.
>
> You missed the problem entirely.
> Code needs to be:
> if (IS_ERR(ioctl_ptr))
> return PTR_ERR(ioctl_ptr);

Sorry if that wasn't clear; I expected the part about the kfree(errptr) to be
obvious, but I was trying to avoid having Scott go through another revision
just to remove the IS_ERR_OR_NULL() after fixing the first problem.

Arnd


Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice

2017-02-13 Thread Jens Axboe
On 02/13/2017 03:28 PM, Jens Axboe wrote:
> On 02/13/2017 03:09 PM, Omar Sandoval wrote:
>> On Mon, Feb 13, 2017 at 10:01:07PM +0100, Paolo Valente wrote:
>>> If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq,
>>> or, vice versa, a blk-mq scheduler is chosen for a device using blk, then
>>> that scheduler is set and initialized without any check, driving the
>>> system into an inconsistent state. This commit addresses this issue by
>>> letting elevator_get fail for these wrong cross choices.
>>>
>>> Signed-off-by: Paolo Valente 
>>> ---
>>>  block/elevator.c | 26 ++
>>>  1 file changed, 18 insertions(+), 8 deletions(-)
>>
>> Hey, Paolo,
>>
>> How exactly are you triggering this? In __elevator_change(), we do check
>> for mq or not mq:
>>
>>  if (!e->uses_mq && q->mq_ops) {
>>  elevator_put(e);
>>  return -EINVAL;
>>  }
>>  if (e->uses_mq && !q->mq_ops) {
>>  elevator_put(e);
>>  return -EINVAL;
>>  }
>>
>> We don't ever appear to call elevator_init() with a specific scheduler
>> name, and for the default we switch off of q->mq_ops and use the
>> defaults from Kconfig:
>>
>>  if (q->mq_ops && q->nr_hw_queues == 1)
>>  e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false);
>>  else if (q->mq_ops)
>>  e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
>>  else
>>  e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
>>
>>  if (!e) {
>>  printk(KERN_ERR
>>  "Default I/O scheduler not found. " \
>>  "Using noop/none.\n");
>>  e = elevator_get("noop", false);
>>  }
>>
>> So I guess this could happen if someone manually changed those Kconfig
>> options, but I don't see what other case would make this happen, could
>> you please explain?
> 
> Was wondering the same - is it using the 'elevator=' boot parameter?
> Didn't look at that path just now, but that's the only one I could
> think of. If it is, I'd much prefer only using 'chosen_elevator' for
> the non-mq stuff, and the fix should be just that instead.
> 
> So instead of:
> 
>   if (!e && *chosen_elevator) {
> 
> do
> 
>   if (!e && !q->mq_ops && *chosen_elevator) {

Confirmed, that's what it seems to be, and here's a real diff of the
above example that works for me:

diff --git a/block/elevator.c b/block/elevator.c
index 27ff1ed5a6fa..699d10f71a2c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -207,11 +207,12 @@ int elevator_init(struct request_queue *q, char *name)
}
 
/*
-* Use the default elevator specified by config boot param or
-* config option.  Don't try to load modules as we could be running
-* off async and request_module() isn't allowed from async.
+* Use the default elevator specified by config boot param for
+* non-mq devices, or by config option. Don't try to load modules
+* as we could be running off async and request_module() isn't
+* allowed from async.
 */
-   if (!e && *chosen_elevator) {
+   if (!e && !q->mq_ops && *chosen_elevator) {
e = elevator_get(chosen_elevator, false);
if (!e)
printk(KERN_ERR "I/O scheduler %s not found\n",

-- 
Jens Axboe



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-13 Thread Paolo Valente

> On 10 Feb 2017, at 20:49, Paolo Valente wrote:
> 
>> 
>> On 10 Feb 2017, at 19:13, Bart Van Assche wrote:
>> 
>> On 02/10/2017 08:49 AM, Paolo Valente wrote:
 $ grep '^C.*_MQ_' .config
 CONFIG_BLK_MQ_PCI=y
 CONFIG_MQ_IOSCHED_BFQ=y
 CONFIG_MQ_IOSCHED_DEADLINE=y
 CONFIG_MQ_IOSCHED_NONE=y
 CONFIG_DEFAULT_MQ_BFQ_MQ=y
 CONFIG_DEFAULT_MQ_IOSCHED="bfq-mq"
 CONFIG_SCSI_MQ_DEFAULT=y
 CONFIG_DM_MQ_DEFAULT=y
 
>>> 
>>> Could you reconfigure with none or mq-deadline as default, check
>>> whether the system boots, and, it it does, switch manually to bfq-mq,
>>> check what happens, and, in the likely case of a failure, try to get
>>> the oops?
>> 
>> Hello Paolo,
>> 
>> I just finished performing that test with the following kernel config:
>> $ grep '^C.*_MQ_' .config
>> CONFIG_BLK_MQ_PCI=y
>> CONFIG_MQ_IOSCHED_BFQ=y
>> CONFIG_MQ_IOSCHED_DEADLINE=y
>> CONFIG_MQ_IOSCHED_NONE=y
>> CONFIG_DEFAULT_MQ_DEADLINE=y
>> CONFIG_DEFAULT_MQ_IOSCHED="mq-deadline"
>> CONFIG_SCSI_MQ_DEFAULT=y
>> CONFIG_DM_MQ_DEFAULT=y
>> 
>> After the system came up I logged in, switched to the bfq-mq scheduler
>> and ran several I/O tests against the boot disk.
> 
> Without any failure, right?
> 
> Unfortunately, as you can imagine, no boot failure occurred on
> any of my test systems so far :(
> 
> This version of bfq-mq can be configured to print all its activity in
> the kernel log, just by defining a macro.  This will of course slow
> the system down so much as to probably make it unusable, if bfq-mq is
> active from boot.  Yet the failure may still occur early enough for
> this approach to be useful in discovering where bfq-mq gets stuck.  As
> of now I have no better ideas.  Any suggestion is welcome.
> 

Hi Bart,
I have found a machine that crashes at boot, not only when bfq-mq is
chosen, but also when mq-deadline is chosen as the default
scheduler.  I have found, and just reported, the cause of the failure,
together with a fix.  This is probably not the cause of your failure,
but what do you think about trying the fix?  BTW, I have rebased the
branch [1] against the new commits in Jens' for-4.11/next.

Otherwise, if you have no news or suggestions, would you be willing to
try my micro-logging proposal?

Thanks,
Paolo

[1] https://github.com/Algodev-github/bfq-mq

> Thanks,
> Paolo
> 
>> Sorry but nothing
>> interesting appeared in the kernel log.
>> 
>> Bart.



Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice

2017-02-13 Thread Bart Van Assche
On Mon, 2017-02-13 at 22:01 +0100, Paolo Valente wrote:
> -static struct elevator_type *elevator_get(const char *name, bool try_loading)
> +static struct elevator_type *elevator_get(const char *name, bool try_loading,
> +   bool mq_ops)

Please choose a better name for that argument, e.g. "mq". To me the name
"mq_ops" means "a pointer to a data structure with operation function pointers".

> + if (e && (e->uses_mq != mq_ops)) {
> + pr_err("ERROR: attempted to choose %s %s I/O scheduler in blk%s",
> +name, e->uses_mq ? "blk-mq" : "legacy", mq_ops ? "-mq" : "");
> + e = NULL;
> + }

How about changing the above into:

+   if (e && e->uses_mq != mq) {
+   pr_err("ERROR: attempt to configure %s as I/O scheduler for a %s queue\n",
+  name, mq ? "blk-mq" : "legacy");
+   e = NULL;
+   }

Thanks,

Bart.


Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-13 Thread Paolo Valente

> On 7 Feb 2017, at 18:24, Paolo Valente wrote:
> 
> Hi,
> 
> I have finally pushed here [1] the current WIP branch of bfq for
> blk-mq, which I have tentatively named bfq-mq.
> 
> This branch *IS NOT* meant for merging into mainline, and contains code
> that may easily violate code style, and more, in many
> places. Commits implement the following main steps:
> 1) Add the last version of bfq for blk
> 2) Clone bfq source files into identical bfq-mq source files
> 3) Modify bfq-mq files to get a working version of bfq for blk-mq
> (cgroups support not yet functional)
> 
> In my intentions, the main goals of this branch are:
> 
> 1) Show, as soon as I could, the changes I made to let bfq-mq comply
> with the blk-mq-sched framework. I thought this could be particularly
> useful for Jens, BFQ being identical to CFQ in terms of hook
> interfaces and io-context handling, and almost identical in terms of
> request merging.
> 
> 2) Enable people to test this first version bfq-mq. Code is purposely
> overfull of log messages and invariant checks that halt the system on
> failure (lock assertions, BUG_ONs, ...).
> 
> To make it easier to revise commits, I'm sending the patches that
> transform bfq into bfq-mq (last four patches in the branch [1]). They
> work on two files, bfq-mq-iosched.c and bfq-mq.h, which, at the
> beginning, are just copies of bfq-iosched.c and bfq.h.
> 

Hi,
this is just to inform you that, as I just wrote to Bart, I have rebased
the branch [1] against the current content of for-4.11/next.

Jens, Omar, did you find the time to have a look at the main commits
or to run some test?

Thanks,
Paolo

[1] https://github.com/Algodev-github/bfq-mq

> Thanks,
> Paolo
> 
> [1] https://github.com/Algodev-github/bfq-mq
> 
> Paolo Valente (4):
>  blk-mq: pass bio to blk_mq_sched_get_rq_priv
>  Move thinktime from bic to bfqq
>  Embed bfq-ioc.c and add locking on request queue
>  Modify interface and operation to comply with blk-mq-sched
> 
> block/bfq-cgroup.c   |   4 -
> block/bfq-mq-iosched.c   | 852 +--
> block/bfq-mq.h   |  65 ++--
> block/blk-mq-sched.c |   8 +-
> block/blk-mq-sched.h |   5 +-
> include/linux/elevator.h |   2 +-
> 6 files changed, 567 insertions(+), 369 deletions(-)
> 
> --
> 2.10.0



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-13 Thread Bart Van Assche
On Mon, 2017-02-13 at 22:07 +, Bart Van Assche wrote:
> On Mon, 2017-02-13 at 22:07 +0100, Paolo Valente wrote:
> > but what do you think about trying this fix?
> 
> Sorry but with ... the same server I used for the previous test still
> didn't boot up properly. A screenshot is available at
> https://goo.gl/photos/Za9QVGCNe2BJBwxVA.
> 
> > Otherwise, if you have no news or suggestions, would you be willing to
> > try my micro-logging proposal https://github.com/Algodev-github/bfq-mq?
> 
> Sorry, but it's not clear to me what logging mechanism you are referring
> to or how to enable it. Are you perhaps referring to
> CONFIG_BFQ_REDIRECT_TO_CONSOLE?

Anyway, a second screenshot has been added to the same album after I had
applied the following patch:

diff --git a/block/Makefile b/block/Makefile
index 1c04fe19e825..bf472ac82c08 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -2,6 +2,8 @@
 # Makefile for the kernel block layer
 #
  
+KBUILD_CFLAGS += -DCONFIG_BFQ_REDIRECT_TO_CONSOLE
+

 obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
    blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
    blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \

Bart.


[PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice

2017-02-13 Thread Paolo Valente
If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq,
or, vice versa, a blk-mq scheduler is chosen for a device using blk, then
that scheduler is set and initialized without any check, driving the
system into an inconsistent state. This commit addresses this issue by
letting elevator_get fail for these wrong cross choices.

Signed-off-by: Paolo Valente 
---
 block/elevator.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 27ff1ed..a25bdd9 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -99,7 +99,8 @@ static void elevator_put(struct elevator_type *e)
module_put(e->elevator_owner);
 }
 
-static struct elevator_type *elevator_get(const char *name, bool try_loading)
+static struct elevator_type *elevator_get(const char *name, bool try_loading,
+ bool mq_ops)
 {
struct elevator_type *e;
 
@@ -113,6 +114,12 @@ static struct elevator_type *elevator_get(const char *name, bool try_loading)
e = elevator_find(name);
}
 
+   if (e && (e->uses_mq != mq_ops)) {
+   pr_err("ERROR: attempted to choose %s %s I/O scheduler in blk%s",
+  name, e->uses_mq ? "blk-mq" : "legacy", mq_ops ? "-mq" : "");
+   e = NULL;
+   }
+
if (e && !try_module_get(e->elevator_owner))
e = NULL;
 
@@ -201,7 +208,7 @@ int elevator_init(struct request_queue *q, char *name)
q->boundary_rq = NULL;
 
if (name) {
-   e = elevator_get(name, true);
+   e = elevator_get(name, true, q->mq_ops);
if (!e)
return -EINVAL;
}
@@ -212,7 +219,7 @@ int elevator_init(struct request_queue *q, char *name)
 * off async and request_module() isn't allowed from async.
 */
if (!e && *chosen_elevator) {
-   e = elevator_get(chosen_elevator, false);
+   e = elevator_get(chosen_elevator, false, q->mq_ops);
if (!e)
printk(KERN_ERR "I/O scheduler %s not found\n",
chosen_elevator);
@@ -220,17 +227,20 @@ int elevator_init(struct request_queue *q, char *name)
 
if (!e) {
if (q->mq_ops && q->nr_hw_queues == 1)
-   e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false);
+   e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false,
+q->mq_ops);
else if (q->mq_ops)
-   e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
+   e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false,
+q->mq_ops);
else
-   e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
+   e = elevator_get(CONFIG_DEFAULT_IOSCHED, false,
+q->mq_ops);
 
if (!e) {
printk(KERN_ERR
"Default I/O scheduler not found. " \
"Using noop/none.\n");
-   e = elevator_get("noop", false);
+   e = elevator_get("noop", false, q->mq_ops);
}
}
 
@@ -1051,7 +1061,7 @@ static int __elevator_change(struct request_queue *q, const char *name)
return elevator_switch(q, NULL);
 
strlcpy(elevator_name, name, sizeof(elevator_name));
-   e = elevator_get(strstrip(elevator_name), true);
+   e = elevator_get(strstrip(elevator_name), true, q->mq_ops);
if (!e) {
printk(KERN_ERR "elevator: type %s not found\n", elevator_name);
return -EINVAL;
-- 
2.10.0



[PATCH BUGFIX] attempt to fix wrong scheduler selection

2017-02-13 Thread Paolo Valente
Hi,
if, at boot, a legacy I/O scheduler is chosen for a device using
blk-mq, or, vice versa, a blk-mq scheduler is chosen for a device using
blk, then that scheduler is set and initialized without any check,
driving the system into an inconsistent state.

The purpose of this message is, first, to report this issue, and,
second, to propose a possible fix in case you do consider this as a
bug.

Thanks,
Paolo

Paolo Valente (1):
  block: make elevator_get robust against cross blk/blk-mq choice

 block/elevator.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

--
2.10.0


Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-13 Thread Bart Van Assche
On Mon, 2017-02-13 at 22:07 +0100, Paolo Valente wrote:
> but what do you think about trying this fix?

Sorry but with ... the same server I used for the previous test still
didn't boot up properly. A screenshot is available at
https://goo.gl/photos/Za9QVGCNe2BJBwxVA.

> Otherwise, if you have no news or suggestions, would you be willing to
> try my micro-logging proposal https://github.com/Algodev-github/bfq-mq?

Sorry, but it's not clear to me what logging mechanism you are referring
to or how to enable it. Are you perhaps referring to
CONFIG_BFQ_REDIRECT_TO_CONSOLE?

Thanks,

Bart.


Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice

2017-02-13 Thread Jens Axboe
On 02/13/2017 03:09 PM, Omar Sandoval wrote:
> On Mon, Feb 13, 2017 at 10:01:07PM +0100, Paolo Valente wrote:
>> If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq,
>> or, vice versa, a blk-mq scheduler is chosen for a device using blk, then
>> that scheduler is set and initialized without any check, driving the
>> system into an inconsistent state. This commit addresses this issue by
>> letting elevator_get fail for these wrong cross choices.
>>
>> Signed-off-by: Paolo Valente 
>> ---
>>  block/elevator.c | 26 ++
>>  1 file changed, 18 insertions(+), 8 deletions(-)
> 
> Hey, Paolo,
> 
> How exactly are you triggering this? In __elevator_change(), we do check
> for mq or not mq:
> 
>   if (!e->uses_mq && q->mq_ops) {
>   elevator_put(e);
>   return -EINVAL;
>   }
>   if (e->uses_mq && !q->mq_ops) {
>   elevator_put(e);
>   return -EINVAL;
>   }
> 
> We don't ever appear to call elevator_init() with a specific scheduler
> name, and for the default we switch off of q->mq_ops and use the
> defaults from Kconfig:
> 
>   if (q->mq_ops && q->nr_hw_queues == 1)
>   e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false);
>   else if (q->mq_ops)
>   e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
>   else
>   e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
> 
>   if (!e) {
>   printk(KERN_ERR
>   "Default I/O scheduler not found. " \
>   "Using noop/none.\n");
>   e = elevator_get("noop", false);
>   }
> 
> So I guess this could happen if someone manually changed those Kconfig
> options, but I don't see what other case would make this happen, could
> you please explain?

Was wondering the same - is it using the 'elevator=' boot parameter?
Didn't look at that path just now, but that's the only one I could
think of. If it is, I'd much prefer only using 'chosen_elevator' for
the non-mq stuff, and the fix should be just that instead.

So instead of:

if (!e && *chosen_elevator) {

do

if (!e && !q->mq_ops && *chosen_elevator) {

-- 
Jens Axboe



Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq

2017-02-13 Thread Jens Axboe
On 02/13/2017 02:09 PM, Paolo Valente wrote:
> 
>> On 7 Feb 2017, at 18:24, Paolo Valente wrote:
>>
>> Hi,
>>
>> I have finally pushed here [1] the current WIP branch of bfq for
>> blk-mq, which I have tentatively named bfq-mq.
>>
>> This branch *IS NOT* meant for merging into mainline, and contains code
>> that may easily violate code style, and more, in many
>> places. Commits implement the following main steps:
>> 1) Add the last version of bfq for blk
>> 2) Clone bfq source files into identical bfq-mq source files
>> 3) Modify bfq-mq files to get a working version of bfq for blk-mq
>> (cgroups support not yet functional)
>>
>> In my intentions, the main goals of this branch are:
>>
>> 1) Show, as soon as I could, the changes I made to let bfq-mq comply
>> with the blk-mq-sched framework. I thought this could be particularly
>> useful for Jens, BFQ being identical to CFQ in terms of hook
>> interfaces and io-context handling, and almost identical in terms of
>> request merging.
>>
>> 2) Enable people to test this first version bfq-mq. Code is purposely
>> overfull of log messages and invariant checks that halt the system on
>> failure (lock assertions, BUG_ONs, ...).
>>
>> To make it easier to revise commits, I'm sending the patches that
>> transform bfq into bfq-mq (last four patches in the branch [1]). They
>> work on two files, bfq-mq-iosched.c and bfq-mq.h, which, at the
>> beginning, are just copies of bfq-iosched.c and bfq.h.
>>
> 
> Hi,
> this is just to inform you that, as I just wrote to Bart, I have rebased
> the branch [1] against the current content of for-4.11/next.
> 
> Jens, Omar, did you find the time to have a look at the main commits
> or to run some test?

I only looked at the core change you proposed for passing in the
bio as well, and Omar fixed up the icq exit part and I also applied
that patch. I haven't looked at any of the bfq-mq patches at all yet.
Not sure what I can do with those, I don't think those are
particularly useful to anyone but you.

Might make more sense to post the conversion for review as a
whole.

-- 
Jens Axboe



Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice

2017-02-13 Thread Omar Sandoval
On Mon, Feb 13, 2017 at 10:01:07PM +0100, Paolo Valente wrote:
> If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq,
> or, vice versa, a blk-mq scheduler is chosen for a device using blk, then
> that scheduler is set and initialized without any check, driving the
> system into an inconsistent state. This commit addresses this issue by
> letting elevator_get fail for these wrong cross choices.
> 
> Signed-off-by: Paolo Valente 
> ---
>  block/elevator.c | 26 ++
>  1 file changed, 18 insertions(+), 8 deletions(-)

Hey, Paolo,

How exactly are you triggering this? In __elevator_change(), we do check
for mq or not mq:

if (!e->uses_mq && q->mq_ops) {
elevator_put(e);
return -EINVAL;
}
if (e->uses_mq && !q->mq_ops) {
elevator_put(e);
return -EINVAL;
}

We don't ever appear to call elevator_init() with a specific scheduler
name, and for the default we switch off of q->mq_ops and use the
defaults from Kconfig:

if (q->mq_ops && q->nr_hw_queues == 1)
e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false);
else if (q->mq_ops)
e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
else
e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);

if (!e) {
printk(KERN_ERR
"Default I/O scheduler not found. " \
"Using noop/none.\n");
e = elevator_get("noop", false);
}

So I guess this could happen if someone manually changed those Kconfig
options, but I don't see what other case would make this happen, could
you please explain?


[PATCH 4/7] nonblocking aio: return on congested block device

2017-02-13 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

A new flag, BIO_NONBLOCKING, is introduced to identify bios
originating from an iocb with IOCB_NONBLOCKING. struct request
allocations then use BLK_MQ_REQ_NOWAIT if BIO_NONBLOCKING is set.

Signed-off-by: Goldwyn Rodrigues 
---
 block/blk-core.c  | 13 +++--
 block/blk-mq.c| 18 --
 fs/direct-io.c| 11 +--
 include/linux/blk_types.h |  1 +
 4 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 14d7c07..9767573 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1257,6 +1257,11 @@ static struct request *get_request(struct request_queue *q, int op,
if (!IS_ERR(rq))
return rq;
 
+   if (bio_flagged(bio, BIO_NONBLOCKING)) {
+   blk_put_rl(rl);
+   return ERR_PTR(-EAGAIN);
+   }
+
if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
blk_put_rl(rl);
return rq;
@@ -2035,7 +2040,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-   if (likely(blk_queue_enter(q, false) == 0)) {
+   if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NONBLOCKING)) == 0)) {
ret = q->make_request_fn(q, bio);
 
blk_queue_exit(q);
@@ -2044,7 +2049,11 @@ blk_qc_t generic_make_request(struct bio *bio)
} else {
struct bio *bio_next = bio_list_pop(current->bio_list);
 
-   bio_io_error(bio);
+   if (unlikely(bio_flagged(bio, BIO_NONBLOCKING))) {
+   bio->bi_error = -EAGAIN;
+   bio_endio(bio);
+   } else
+   bio_io_error(bio);
bio = bio_next;
}
} while (bio);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 81caceb..7a7c674 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1213,6 +1213,8 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 
trace_block_getrq(q, bio, op);
blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
+   if (bio_flagged(bio, BIO_NONBLOCKING))
+   alloc_data.flags |= BLK_MQ_REQ_NOWAIT;
rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
 
data->hctx = alloc_data.hctx;
@@ -1286,8 +1288,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
return BLK_QC_T_NONE;
 
rq = blk_mq_map_request(q, bio, &data);
-   if (unlikely(!rq))
+   if (unlikely(!rq)) {
+   if (bio_flagged(bio, BIO_NONBLOCKING))
+   bio->bi_error = -EAGAIN;
+   else
+   bio->bi_error = -EIO;
+   bio_endio(bio);
return BLK_QC_T_NONE;
+   }
 
cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -1381,8 +1389,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
request_count = blk_plug_queued_count(q);
 
rq = blk_mq_map_request(q, bio, &data);
-   if (unlikely(!rq))
+   if (unlikely(!rq)) {
+   if (bio_flagged(bio, BIO_NONBLOCKING))
+   bio->bi_error = -EAGAIN;
+   else
+   bio->bi_error = -EIO;
+   bio_endio(bio);
return BLK_QC_T_NONE;
+   }
 
cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index fb9aa16..9997fed 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
else
bio->bi_end_io = dio_bio_end_io;
 
+   if (dio->iocb->ki_flags & IOCB_NONBLOCKING)
+   bio_set_flag(bio, BIO_NONBLOCKING);
+
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
@@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
unsigned i;
int err;
 
-   if (bio->bi_error)
-   dio->io_error = -EIO;
+   if (bio->bi_error) {
+   if (bio_flagged(bio, BIO_NONBLOCKING))
+   dio->io_error = bio->bi_error;
+   else
+   dio->io_error = -EIO;
+   }
 
if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) {
err = bio->bi_error;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index cd395ec..94855cf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -119,6 +119,7 @@ struct bio {
 #define BIO_QUIET  6   /* Make BIO Quiet */
 #define BIO_CHAIN  7   /* chained bio, ->bi_remaining in effect */
 #define BIO_REFFED   

Re: [PATCH v1 1/5] block: introduce bio_clone_bioset_partial()

2017-02-13 Thread Ming Lei
On Mon, Feb 13, 2017 at 9:46 PM, Christoph Hellwig  wrote:
> On Fri, Feb 10, 2017 at 06:56:13PM +0800, Ming Lei wrote:
>> md still need bio clone(not the fast version) for behind write,
>> and it is more efficient to use bio_clone_bioset_partial().
>>
>> The idea is simple and just copy the bvecs range specified from
>> parameters.
>
> Given how few users bio_clone_bioset has I wonder if we shouldn't
> simply add the two new arguments to it instead of adding another
> indirection.

For md write-behind, it looks like we have to provide the two arguments;
could you explain a bit how we can do that by adding another indirection?

>
> Otherwise looks fine:
>
> Reviewed-by: Christoph Hellwig 

Thanks!

-- 
Ming Lei


Re: [PATCH 4/7] nonblocking aio: return on congested block device

2017-02-13 Thread Ming Lei
On Tue, Feb 14, 2017 at 10:46 AM, Goldwyn Rodrigues  wrote:
> From: Goldwyn Rodrigues 
>
> A new flag, BIO_NONBLOCKING, is introduced to identify bios
> originating from an iocb with IOCB_NONBLOCKING. struct request
> allocations then use BLK_MQ_REQ_NOWAIT if BIO_NONBLOCKING is set.
>
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  block/blk-core.c  | 13 +++--
>  block/blk-mq.c| 18 --
>  fs/direct-io.c| 11 +--
>  include/linux/blk_types.h |  1 +
>  4 files changed, 37 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 14d7c07..9767573 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1257,6 +1257,11 @@ static struct request *get_request(struct request_queue *q, int op,
> if (!IS_ERR(rq))
> return rq;
>
> +   if (bio_flagged(bio, BIO_NONBLOCKING)) {
> +   blk_put_rl(rl);
> +   return ERR_PTR(-EAGAIN);
> +   }
> +
> if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
> blk_put_rl(rl);
> return rq;
> @@ -2035,7 +2040,7 @@ blk_qc_t generic_make_request(struct bio *bio)
> do {
> struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> -   if (likely(blk_queue_enter(q, false) == 0)) {
> +   if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NONBLOCKING)) == 0)) {
> ret = q->make_request_fn(q, bio);
>
> blk_queue_exit(q);
> @@ -2044,7 +2049,11 @@ blk_qc_t generic_make_request(struct bio *bio)
> } else {
> struct bio *bio_next = bio_list_pop(current->bio_list);
>
> -   bio_io_error(bio);
> +   if (unlikely(bio_flagged(bio, BIO_NONBLOCKING))) {
> +   bio->bi_error = -EAGAIN;
> +   bio_endio(bio);
> +   } else
> +   bio_io_error(bio);
> bio = bio_next;
> }
> } while (bio);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 81caceb..7a7c674 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1213,6 +1213,8 @@ static struct request *blk_mq_map_request(struct request_queue *q,
>
> trace_block_getrq(q, bio, op);
> blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
> +   if (bio_flagged(bio, BIO_NONBLOCKING))
> +   alloc_data.flags |= BLK_MQ_REQ_NOWAIT;
> rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
>
> data->hctx = alloc_data.hctx;
> @@ -1286,8 +1288,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> return BLK_QC_T_NONE;
>
> rq = blk_mq_map_request(q, bio, &data);
> -   if (unlikely(!rq))
> +   if (unlikely(!rq)) {
> +   if (bio_flagged(bio, BIO_NONBLOCKING))
> +   bio->bi_error = -EAGAIN;
> +   else
> +   bio->bi_error = -EIO;
> +   bio_endio(bio);
> return BLK_QC_T_NONE;
> +   }
>
> cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
>
> @@ -1381,8 +1389,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
> request_count = blk_plug_queued_count(q);
>
> rq = blk_mq_map_request(q, bio, &data);
> -   if (unlikely(!rq))
> +   if (unlikely(!rq)) {
> +   if (bio_flagged(bio, BIO_NONBLOCKING))
> +   bio->bi_error = -EAGAIN;
> +   else
> +   bio->bi_error = -EIO;
> +   bio_endio(bio);
> return BLK_QC_T_NONE;
> +   }

There are other places in which blocking may be triggered, such as
allocating a bio clone, wbt_wait(), and sleeping in .make_request(),
as md, dm and bcache do.

IMO it would be hard to deal with them all, so what is the expectation
for the BIO_NONBLOCKING flag?

>
> cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index fb9aa16..9997fed 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
> else
> bio->bi_end_io = dio_bio_end_io;
>
> +   if (dio->iocb->ki_flags & IOCB_NONBLOCKING)
> +   bio_set_flag(bio, BIO_NONBLOCKING);
> +
> sdio->bio = bio;
> sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
>  }
> @@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
> unsigned i;
> int err;
>
> -   if (bio->bi_error)
> -   dio->io_error = -EIO;
> +   if (bio->bi_error) {
> +   if (bio_flagged(bio, BIO_NONBLOCKING))
> +   

Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice

2017-02-13 Thread Omar Sandoval
On Tue, Feb 14, 2017 at 07:58:22AM +0100, Hannes Reinecke wrote:
> While we're at the topic:
> 
> Can't we use the same names for legacy and mq scheduler?
> It's quite an unnecessary complication to have
> 'noop', 'deadline', and 'cfq' for legacy, but 'none' and 'mq-deadline'
> for mq. If we could use 'noop' and 'deadline' for mq, too, the existing
> settings or udev rules will continue to work and we wouldn't get any
> annoying and pointless warnings here...

I mentioned this to Jens a little while ago but I didn't feel strongly
enough to push the issue. I also like this idea -- it makes the
transition to blk-mq a little more transparent.


Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice

2017-02-13 Thread Hannes Reinecke
On 02/13/2017 11:28 PM, Jens Axboe wrote:
> On 02/13/2017 03:09 PM, Omar Sandoval wrote:
>> On Mon, Feb 13, 2017 at 10:01:07PM +0100, Paolo Valente wrote:
>>> If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq,
>>> or, viceversa, a blk-mq scheduler is chosen for a device using blk, then
>>> that scheduler is set and initialized without any check, driving the
>>> system into an inconsistent state. This commit addresses this issue by
>>> letting elevator_get fail for these wrong cross choices.
>>>
>>> Signed-off-by: Paolo Valente 
>>> ---
>>>  block/elevator.c | 26 ++
>>>  1 file changed, 18 insertions(+), 8 deletions(-)
>>
>> Hey, Paolo,
>>
>> How exactly are you triggering this? In __elevator_change(), we do check
>> for mq or not mq:
>>
>>  if (!e->uses_mq && q->mq_ops) {
>>  elevator_put(e);
>>  return -EINVAL;
>>  }
>>  if (e->uses_mq && !q->mq_ops) {
>>  elevator_put(e);
>>  return -EINVAL;
>>  }
>>
>> We don't ever appear to call elevator_init() with a specific scheduler
>> name, and for the default we switch off of q->mq_ops and use the
>> defaults from Kconfig:
>>
>>  if (q->mq_ops && q->nr_hw_queues == 1)
>>  e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false);
>>  else if (q->mq_ops)
>>  e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false);
>>  else
>>  e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
>>
>>  if (!e) {
>>  printk(KERN_ERR
>>  "Default I/O scheduler not found. " \
>>  "Using noop/none.\n");
>>  e = elevator_get("noop", false);
>>  }
>>
>> So I guess this could happen if someone manually changed those Kconfig
>> options, but I don't see what other case would make this happen, could
>> you please explain?
> 
> Was wondering the same - is it using the 'elevator=' boot parameter?
> Didn't look at that path just now, but that's the only one I could
> think of. If it is, I'd much prefer only using 'chosen_elevator' for
> the non-mq stuff, and the fix should be just that instead.
> 
[ .. ]
While we're at the topic:

Can't we use the same names for legacy and mq scheduler?
It's quite an unnecessary complication to have
'noop', 'deadline', and 'cfq' for legacy, but 'none' and 'mq-deadline'
for mq. If we could use 'noop' and 'deadline' for mq, too, the existing
settings or udev rules will continue to work and we wouldn't get any
annoying and pointless warnings here...

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [BUG] Potential deadlock in the block layer

2017-02-13 Thread Jens Axboe
On 02/13/2017 07:14 AM, Thomas Gleixner wrote:
> Gabriel reported the lockdep splat below while investigating something
> different.
> 
> Explanation for the splat is in the function comment above
> del_timer_sync().
> 
> I can reproduce it as well and it's clearly broken.

Agree, the disable logic is broken. The below isn't super pretty, but it
should do the trick for 4.10.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c73a6fcaeb9d..838f07e2b64a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3758,7 +3758,7 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct 
cfq_queue *cfqq,
 }
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
-static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
+static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 {
struct cfq_data *cfqd = cic_to_cfqd(cic);
struct cfq_queue *cfqq;
@@ -3775,15 +3775,7 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, 
struct bio *bio)
 * spuriously on a newly created cic but there's no harm.
 */
if (unlikely(!cfqd) || likely(cic->blkcg_serial_nr == serial_nr))
-   return;
-
-   /*
-* If we have a non-root cgroup, we can depend on that to
-* do proper throttling of writes. Turn off wbt for that
-* case, if it was enabled by default.
-*/
-   if (nonroot_cg)
-   wbt_disable_default(cfqd->queue);
+   return nonroot_cg;
 
/*
 * Drop reference to queues.  New queues will be assigned in new
@@ -3804,9 +3796,13 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, 
struct bio *bio)
}
 
cic->blkcg_serial_nr = serial_nr;
+   return nonroot_cg;
 }
 #else
-static inline void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) 
{ }
+static inline bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
+{
+   return false;
+}
 #endif  /* CONFIG_CFQ_GROUP_IOSCHED */
 
 static struct cfq_queue **
@@ -4448,11 +4444,12 @@ cfq_set_request(struct request_queue *q, struct request 
*rq, struct bio *bio,
const int rw = rq_data_dir(rq);
const bool is_sync = rq_is_sync(rq);
struct cfq_queue *cfqq;
+   bool disable_wbt;
 
spin_lock_irq(q->queue_lock);
 
check_ioprio_changed(cic, bio);
-   check_blkcg_changed(cic, bio);
+   disable_wbt = check_blkcg_changed(cic, bio);
 new_queue:
cfqq = cic_to_cfqq(cic, is_sync);
if (!cfqq || cfqq == &cfqd->oom_cfqq) {
@@ -4488,6 +4485,10 @@ cfq_set_request(struct request_queue *q, struct request 
*rq, struct bio *bio,
rq->elv.priv[0] = cfqq;
rq->elv.priv[1] = cfqq->cfqg;
spin_unlock_irq(q->queue_lock);
+
+   if (disable_wbt)
+   wbt_disable_default(q);
+
return 0;
 }
 

-- 
Jens Axboe



Re: [PATCHv6 03/37] page-flags: relax page flag policy for few flags

2017-02-13 Thread Kirill A. Shutemov
On Wed, Feb 08, 2017 at 08:01:13PM -0800, Matthew Wilcox wrote:
> On Thu, Jan 26, 2017 at 02:57:45PM +0300, Kirill A. Shutemov wrote:
> > These flags are in use for filesystems with backing storage: PG_error,
> > PG_writeback and PG_readahead.
> 
> Oh ;-)  Then I amend my comment on patch 1 to be "patch 3 needs to go
> ahead of patch 1" ;-)

It doesn't really matter as long as both before patch 37 :P

-- 
 Kirill A. Shutemov


[BUG] Potential deadlock in the block layer

2017-02-13 Thread Thomas Gleixner
Gabriel reported the lockdep splat below while investigating something
different.

Explanation for the splat is in the function comment above
del_timer_sync().

I can reproduce it as well and it's clearly broken.

Thanks,

tglx

---

[   81.518032] ==
[   81.520151] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
[   81.522276] 4.10.0-rc8-debug-dirty #1 Tainted: G  I
[   81.524445] --
[   81.526560] (systemd)/1725 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[   81.528672]  (((>window_timer))){+.-...}, at: [] 
del_timer_sync+0x0/0xd0
[   81.530973]
and this task is already holding:
[   81.535399]  (&(>__queue_lock)->rlock){-.-...}, at: [] 
cfq_set_request+0x80/0x2c0
[   81.537637] which would create a new lock dependency:
[   81.539870]  (&(>__queue_lock)->rlock){-.-...} -> 
(((>window_timer))){+.-...}
[   81.542145]
but this new dependency connects a HARDIRQ-irq-safe lock:
[   81.546717]  (&(>__queue_lock)->rlock){-.-...}
[   81.546720]
... which became HARDIRQ-irq-safe at:
[   81.553643]   __lock_acquire+0x24f/0x19e0
[   81.555970]   lock_acquire+0xa5/0xd0
[   81.558302]   _raw_spin_lock_irqsave+0x54/0x90
[   81.560667]   ata_qc_schedule_eh+0x56/0x80 [libata]
[   81.563035]   ata_qc_complete+0x99/0x1c0 [libata]
[   81.565410]   ata_do_link_abort+0x93/0xd0 [libata]
[   81.567786]   ata_port_abort+0xb/0x10 [libata]
[   81.570150]   ahci_handle_port_interrupt+0x41e/0x570 [libahci]
[   81.572533]   ahci_handle_port_intr+0x74/0xb0 [libahci]
[   81.574918]   ahci_single_level_irq_intr+0x3b/0x60 [libahci]
[   81.577311]   __handle_irq_event_percpu+0x35/0x100
[   81.579696]   handle_irq_event_percpu+0x1e/0x50
[   81.582065]   handle_irq_event+0x34/0x60
[   81.584427]   handle_edge_irq+0x112/0x140
[   81.586776]   handle_irq+0x18/0x20
[   81.589109]   do_IRQ+0x87/0x110
[   81.591403]   ret_from_intr+0x0/0x20
[   81.593666]   cpuidle_enter_state+0x102/0x230
[   81.595939]   cpuidle_enter+0x12/0x20
[   81.598202]   call_cpuidle+0x38/0x40
[   81.600443]   do_idle+0x16e/0x1e0
[   81.602670]   cpu_startup_entry+0x5d/0x60
[   81.604890]   rest_init+0x12c/0x140
[   81.607080]   start_kernel+0x45f/0x46c
[   81.609252]   x86_64_start_reservations+0x2a/0x2c
[   81.611399]   x86_64_start_kernel+0xeb/0xf8
[   81.613505]   verify_cpu+0x0/0xfc
[   81.615574]
to a HARDIRQ-irq-unsafe lock:
[   81.619611]  (((>window_timer))){+.-...}
[   81.619614]
... which became HARDIRQ-irq-unsafe at:
[   81.625562] ...
[   81.625565]   __lock_acquire+0x2ba/0x19e0
[   81.629504]   lock_acquire+0xa5/0xd0
[   81.631467]   call_timer_fn+0x74/0x110
[   81.633397]   expire_timers+0xaa/0xd0
[   81.635287]   run_timer_softirq+0x72/0x140
[   81.637145]   __do_softirq+0x143/0x290
[   81.638974]   irq_exit+0x6a/0xd0
[   81.640818]   smp_trace_apic_timer_interrupt+0x79/0x90
[   81.642698]   smp_apic_timer_interrupt+0x9/0x10
[   81.644572]   apic_timer_interrupt+0x93/0xa0
[   81.646445]   cpuidle_enter_state+0x102/0x230
[   81.648332]   cpuidle_enter+0x12/0x20
[   81.650211]   call_cpuidle+0x38/0x40
[   81.652096]   do_idle+0x16e/0x1e0
[   81.653984]   cpu_startup_entry+0x5d/0x60
[   81.655855]   start_secondary+0x150/0x180
[   81.657696]   verify_cpu+0x0/0xfc
[   81.659497]
other info that might help us debug this:

[   81.664802]  Possible interrupt unsafe locking scenario:

[   81.668296]CPU0CPU1
[   81.670010]
[   81.671685]   lock(((>window_timer)));
[   81.673358]local_irq_disable();
[   81.675018]lock(&(>__queue_lock)->rlock);
[   81.676659]lock(((>window_timer)));
[   81.678275]   
[   81.679845] lock(&(>__queue_lock)->rlock);
[   81.681419]
 *** DEADLOCK ***

[   81.685989] 1 lock held by (systemd)/1725:
[   81.687518]  #0:  (&(>__queue_lock)->rlock){-.-...}, at: 
[] cfq_set_request+0x80/0x2c0
[   81.689125]
the dependencies between HARDIRQ-irq-safe lock and the holding 
lock:
[   81.692327] -> (&(>__queue_lock)->rlock){-.-...} ops: 70140 {
[   81.693971]IN-HARDIRQ-W at:
[   81.695606] __lock_acquire+0x24f/0x19e0
[   81.697262] lock_acquire+0xa5/0xd0
[   81.698906] _raw_spin_lock_irqsave+0x54/0x90
[   81.700568] ata_qc_schedule_eh+0x56/0x80 [libata]
[   81.702229] ata_qc_complete+0x99/0x1c0 [libata]
[   81.703884] ata_do_link_abort+0x93/0xd0 [libata]
[   81.705540] ata_port_abort+0xb/0x10 [libata]
[   81.707199] ahci_handle_port_interrupt+0x41e/0x570 
[libahci]
[   81.708881] ahci_handle_port_intr+0x74/0xb0 [libahci]
[   81.710563] 

Re: [PATCHv6 07/37] filemap: allocate huge page in page_cache_read(), if allowed

2017-02-13 Thread Kirill A. Shutemov
On Thu, Feb 09, 2017 at 01:18:35PM -0800, Matthew Wilcox wrote:
> On Thu, Jan 26, 2017 at 02:57:49PM +0300, Kirill A. Shutemov wrote:
> > Later we can add logic to accumulate information from shadow entires to
> > return to caller (average eviction time?).
> 
> I would say minimum rather than average.  That will become the refault
> time of the entire page, so minimum would probably have us making better
> decisions?

Yes, makes sense.

> > +   /* Wipe shadow entires */
> > +   radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
> > +   page->index) {
> > +   if (iter.index >= page->index + hpage_nr_pages(page))
> > +   break;
> >  
> > p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
> > -   if (!radix_tree_exceptional_entry(p))
> > +   if (!p)
> > +   continue;
> 
> Just FYI, this can't happen.  You're holding the tree lock so nobody
> else gets to remove things from the tree.  radix_tree_for_each_slot()
> only gives you the full slots; it skips the empty ones for you.  I'm
> OK if you want to leave it in out of an abundance of caution.

I'll drop it.

> > +   __radix_tree_replace(&mapping->page_tree, iter.node, slot, NULL,
> > +   workingset_update_node, mapping);
> 
> I may add an update_node argument to radix_tree_join at some point,
> so you can use it here.  Or maybe we don't need to do that, and what
> you have here works just fine.
> 
> > mapping->nrexceptional--;
> 
> ... because adjusting the exceptional count is going to be a pain.

Yeah..

> > +   error = __radix_tree_insert(&mapping->page_tree,
> > +   page->index, compound_order(page), page);
> > +   /* This shouldn't happen */
> > +   if (WARN_ON_ONCE(error))
> > +   return error;
> 
> A lesser man would have just ignored the return value from
> __radix_tree_insert.  I salute you.
> 
> > @@ -2078,18 +2155,34 @@ static int page_cache_read(struct file *file, 
> > pgoff_t offset, gfp_t gfp_mask)
> >  {
> > struct address_space *mapping = file->f_mapping;
> > struct page *page;
> > +   pgoff_t hoffset;
> > int ret;
> >  
> > do {
> > -   page = __page_cache_alloc(gfp_mask|__GFP_COLD);
> > +   page = page_cache_alloc_huge(mapping, offset, gfp_mask);
> > +no_huge:
> > +   if (!page)
> > +   page = __page_cache_alloc(gfp_mask|__GFP_COLD);
> > if (!page)
> > return -ENOMEM;
> >  
> > -   ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & 
> > GFP_KERNEL);
> > -   if (ret == 0)
> > +   if (PageTransHuge(page))
> > +   hoffset = round_down(offset, HPAGE_PMD_NR);
> > +   else
> > +   hoffset = offset;
> > +
> > +   ret = add_to_page_cache_lru(page, mapping, hoffset,
> > +   gfp_mask & GFP_KERNEL);
> > +
> > +   if (ret == -EEXIST && PageTransHuge(page)) {
> > +   put_page(page);
> > +   page = NULL;
> > +   goto no_huge;
> > +   } else if (ret == 0) {
> > ret = mapping->a_ops->readpage(file, page);
> > -   else if (ret == -EEXIST)
> > +   } else if (ret == -EEXIST) {
> > ret = 0; /* losing race to add is OK */
> > +   }
> >  
> > put_page(page);
> 
> If the filesystem returns AOP_TRUNCATED_PAGE, you'll go round this loop
> again trying the huge page again, even if the huge page didn't work
> the first time.  I would tend to think that if the huge page failed the
> first time, we shouldn't try it again, so I propose this:

AOP_TRUNCATED_PAGE is positive, so I don't see how you avoid try_huge on
the second iteration. Hm?

> 
> struct address_space *mapping = file->f_mapping;
> struct page *page;
> pgoff_t index;
> int ret;
> bool try_huge = true;
> 
> do {
> if (try_huge) {
> page = page_cache_alloc_huge(gfp_mask|__GFP_COLD);
> if (page)
> index = round_down(offset, HPAGE_PMD_NR);
> else
> try_huge = false;
> }
> 
> if (!try_huge) {
> page = __page_cache_alloc(gfp_mask|__GFP_COLD);
> index = offset;
> }
> 
> if (!page)
> return -ENOMEM;
> 
> ret = add_to_page_cache_lru(page, mapping, index,
> gfp_mask & 
> GFP_KERNEL);
> if (ret < 0) {
> if (try_huge) {
> try_huge = false;
> ret = AOP_TRUNCATED_PAGE;
> } else if (ret == -EEXIST)
> ret = 0; 

Re: [PATCHv6 08/37] filemap: handle huge pages in do_generic_file_read()

2017-02-13 Thread Kirill A. Shutemov
On Thu, Feb 09, 2017 at 01:55:05PM -0800, Matthew Wilcox wrote:
> On Thu, Jan 26, 2017 at 02:57:50PM +0300, Kirill A. Shutemov wrote:
> > +++ b/mm/filemap.c
> > @@ -1886,6 +1886,7 @@ static ssize_t do_generic_file_read(struct file 
> > *filp, loff_t *ppos,
> > if (unlikely(page == NULL))
> > goto no_cached_page;
> > }
> > +   page = compound_head(page);
> 
> We got this page from find_get_page(), which gets it from
> pagecache_get_page(), which gets it from find_get_entry() ... which
> (unless I'm lost in your patch series) returns the head page.  So this
> line is redundant, right?

No. pagecache_get_page() returns subpage. See description of the first
patch.

> But then down in filemap_fault, we have:
> 
> VM_BUG_ON_PAGE(page->index != offset, page);
> 
> ... again, maybe I'm lost somewhere in your patch series, but I don't see
> anywhere you remove that line (or modify it).

This should be fine as find_get_page() returns subpage.

> So are you not testing
> with VM debugging enabled, or are you not doing a test which includes
> mapping a file with huge pages, reading from it (to get the page in cache),
> then faulting on an address that is not in the first 4kB of that 2MB?
> 

-- 
 Kirill A. Shutemov


[PATCH] blk-mq: Call bio_io_error if request returned is NULL

2017-02-13 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

If blk_mq_map_request returns NULL, bio_endio() function is not
called. Call bio_io_error() in case request returned is NULL.

Signed-off-by: Goldwyn Rodrigues 
---
 block/blk-mq.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 81caceb..7cd8912 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1286,8 +1286,10 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
return BLK_QC_T_NONE;
 
rq = blk_mq_map_request(q, bio, &data);
-   if (unlikely(!rq))
+   if (unlikely(!rq)) {
+   bio_io_error(bio);
return BLK_QC_T_NONE;
+   }
 
cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -1381,8 +1383,10 @@ static blk_qc_t blk_sq_make_request(struct request_queue 
*q, struct bio *bio)
request_count = blk_plug_queued_count(q);
 
rq = blk_mq_map_request(q, bio, &data);
-   if (unlikely(!rq))
+   if (unlikely(!rq)) {
+   bio_io_error(bio);
return BLK_QC_T_NONE;
+   }
 
cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
-- 
2.10.2



Re: [PATCHv6 08/37] filemap: handle huge pages in do_generic_file_read()

2017-02-13 Thread Matthew Wilcox
On Thu, Jan 26, 2017 at 02:57:50PM +0300, Kirill A. Shutemov wrote:
> Most of the work happens on the head page. Only when we need to copy data
> to userspace do we find the relevant subpage.
> 
> We are still limited by PAGE_SIZE per iteration. Lifting this limitation
> would require some more work.

Now that I debugged that bit of my brain, here's a more sensible suggestion.

> @@ -1886,6 +1886,7 @@ static ssize_t do_generic_file_read(struct file *filp, 
> loff_t *ppos,
>   if (unlikely(page == NULL))
>   goto no_cached_page;
>   }
> + page = compound_head(page);
>   if (PageReadahead(page)) {
>   page_cache_async_readahead(mapping,
>   ra, filp, page,

We're going backwards and forwards a lot between subpages and page heads.
I'd like to see us do this:

static inline struct page *pagecache_get_page(struct address_space *mapping,
pgoff_t offset, int fgp_flags, gfp_t cache_gfp_mask)
{
struct page *page = pagecache_get_head(mapping, offset, fgp_flags,
cache_gfp_mask);
return page ? find_subpage(page, offset) : NULL;
}

static inline struct page *find_get_head(struct address_space *mapping,
pgoff_t offset)
{
return pagecache_get_head(mapping, offset, 0, 0);
}

and then we can switch do_generic_file_read() to call find_get_head(),
eliminating the conversion back and forth between subpages and head pages.



RE: [PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-13 Thread David Laight
From: Scott Bauer Sent: 13 February 2017 16:11
> When CONFIG_KASAN is enabled, compilation fails:
> 
> block/sed-opal.c: In function 'sed_ioctl':
> block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 
> 2048 bytes [-Werror=frame-
> larger-than=]
> 
> Moved all the ioctl structures off the stack and dynamically activate
> using _IOC_SIZE()

Think I'd note that this simplifies the code considerably.
AFAICT CONFIG_KASAN is just a brainf*ck.
It at least needs annotation that copy_from_user() has properties
similar to memset().
So if the size matches that of the type then no guard space (etc)
is required.

...
> + ioctl_ptr = memdup_user(arg,  _IOC_SIZE(cmd));
> + if (IS_ERR_OR_NULL(ioctl_ptr)) {
> + ret = PTR_ERR(ioctl_ptr);
> + goto out;
...
> + out:
> + kfree(ioctl_ptr);
> + return ret;
>  }

That error path doesn't look quite right to me.

David



[PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-13 Thread Scott Bauer
When CONFIG_KASAN is enabled, compilation fails:

block/sed-opal.c: In function 'sed_ioctl':
block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 
2048 bytes [-Werror=frame-larger-than=]

Moved all the ioctl structures off the stack and dynamically activate
using _IOC_SIZE()

Fixes: 455a7b238cd6 ("block: Add Sed-opal library")

Reported-by: Arnd Bergmann 
Signed-off-by: Scott Bauer 
---
 block/sed-opal.c | 130 +++
 1 file changed, 45 insertions(+), 85 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 2448d4a..5733248 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -2348,7 +2348,6 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, 
void __user *arg)
 {
void *ioctl_ptr;
int ret = -ENOTTY;
-   unsigned int cmd_size = _IOC_SIZE(cmd);
 
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -2357,94 +2356,55 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, 
void __user *arg)
return -ENOTSUPP;
}
 
-   switch (cmd) {
-   case IOC_OPAL_SAVE: {
-   struct opal_lock_unlock lk_unlk;
-
-   if (copy_from_user(&lk_unlk, arg, sizeof(lk_unlk)))
-   return -EFAULT;
-   return opal_save(dev, &lk_unlk);
-   }
-   case IOC_OPAL_LOCK_UNLOCK: {
-   struct opal_lock_unlock lk_unlk;
-
-   if (copy_from_user(&lk_unlk, arg, sizeof(lk_unlk)))
-   return -EFAULT;
-   return opal_lock_unlock(dev, &lk_unlk);
-   }
-   case IOC_OPAL_TAKE_OWNERSHIP: {
-   struct opal_key opal_key;
-
-   if (copy_from_user(&opal_key, arg, sizeof(opal_key)))
-   return -EFAULT;
-   return opal_take_ownership(dev, &opal_key);
-   }
-   case IOC_OPAL_ACTIVATE_LSP: {
-   struct opal_lr_act opal_lr_act;
-
-   if (copy_from_user(&opal_lr_act, arg, sizeof(opal_lr_act)))
-   return -EFAULT;
-   return opal_activate_lsp(dev, &opal_lr_act);
-   }
-   case IOC_OPAL_SET_PW: {
-   struct opal_new_pw opal_pw;
-
-   if (copy_from_user(&opal_pw, arg, sizeof(opal_pw)))
-   return -EFAULT;
-   return opal_set_new_pw(dev, &opal_pw);
-   }
-   case IOC_OPAL_ACTIVATE_USR: {
-   struct opal_session_info session;
-
-   if (copy_from_user(&session, arg, sizeof(session)))
-   return -EFAULT;
-   return opal_activate_user(dev, &session);
-   }
-   case IOC_OPAL_REVERT_TPR: {
-   struct opal_key opal_key;
-
-   if (copy_from_user(&opal_key, arg, sizeof(opal_key)))
-   return -EFAULT;
-   return opal_reverttper(dev, &opal_key);
-   }
-   case IOC_OPAL_LR_SETUP: {
-   struct opal_user_lr_setup lrs;
-
-   if (copy_from_user(&lrs, arg, sizeof(lrs)))
-   return -EFAULT;
-   return opal_setup_locking_range(dev, &lrs);
-   }
-   case IOC_OPAL_ADD_USR_TO_LR: {
-   struct opal_lock_unlock lk_unlk;
-
-   if (copy_from_user(&lk_unlk, arg, sizeof(lk_unlk)))
-   return -EFAULT;
-   return opal_add_user_to_lr(dev, &lk_unlk);
-   }
-   case IOC_OPAL_ENABLE_DISABLE_MBR: {
-   struct opal_mbr_data mbr;
-
-   if (copy_from_user(&mbr, arg, sizeof(mbr)))
-   return -EFAULT;
-   return opal_enable_disable_shadow_mbr(dev, &mbr);
-   }
-   case IOC_OPAL_ERASE_LR: {
-   struct opal_session_info session;
-
-   if (copy_from_user(&session, arg, sizeof(session)))
-   return -EFAULT;
-   return opal_erase_locking_range(dev, &session);
+   ioctl_ptr = memdup_user(arg,  _IOC_SIZE(cmd));
+   if (IS_ERR_OR_NULL(ioctl_ptr)) {
+   ret = PTR_ERR(ioctl_ptr);
+   goto out;
}
-   case IOC_OPAL_SECURE_ERASE_LR: {
-   struct opal_session_info session;
 
-   if (copy_from_user(&session, arg, sizeof(session)))
-   return -EFAULT;
-   return opal_secure_erase_locking_range(dev, &session);
-   }
+   switch (cmd) {
+   case IOC_OPAL_SAVE:
+   ret = opal_save(dev, ioctl_ptr);
+   break;
+   case IOC_OPAL_LOCK_UNLOCK:
+   ret = opal_lock_unlock(dev, ioctl_ptr);
+   break;
+   case IOC_OPAL_TAKE_OWNERSHIP:
+   ret = opal_take_ownership(dev, ioctl_ptr);
+   break;
+   case IOC_OPAL_ACTIVATE_LSP:
+   ret = opal_activate_lsp(dev, ioctl_ptr);
+   break;
+   case IOC_OPAL_SET_PW:
+   ret = opal_set_new_pw(dev, ioctl_ptr);
+   break;
+   case IOC_OPAL_ACTIVATE_USR:
+   ret = 

[PATCH V5 2/4] uapi: sed-opal fix IOW for activate lsp to use correct struct

2017-02-13 Thread Scott Bauer
The IOW for IOC_OPAL_ACTIVATE_LSP took the wrong structure, which
would give us the wrong size when using _IOC_SIZE; switch it to the
right structure.

Fixes: 058f8a2 ("Include: Uapi: Add user ABI for Sed/Opal")

Signed-off-by: Scott Bauer 
---
 include/uapi/linux/sed-opal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/sed-opal.h b/include/uapi/linux/sed-opal.h
index fc06e3a..c72e073 100644
--- a/include/uapi/linux/sed-opal.h
+++ b/include/uapi/linux/sed-opal.h
@@ -106,7 +106,7 @@ struct opal_mbr_data {
 #define IOC_OPAL_SAVE  _IOW('p', 220, struct opal_lock_unlock)
 #define IOC_OPAL_LOCK_UNLOCK   _IOW('p', 221, struct opal_lock_unlock)
 #define IOC_OPAL_TAKE_OWNERSHIP_IOW('p', 222, struct opal_key)
-#define IOC_OPAL_ACTIVATE_LSP   _IOW('p', 223, struct opal_key)
+#define IOC_OPAL_ACTIVATE_LSP   _IOW('p', 223, struct opal_lr_act)
 #define IOC_OPAL_SET_PW _IOW('p', 224, struct opal_new_pw)
 #define IOC_OPAL_ACTIVATE_USR   _IOW('p', 225, struct opal_session_info)
 #define IOC_OPAL_REVERT_TPR _IOW('p', 226, struct opal_key)
-- 
2.7.4



[PATCH V5 4/4] Maintainers: Modify SED list from nvme to block

2017-02-13 Thread Scott Bauer
Signed-off-by: Scott Bauer 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index e325373..b983b25 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11094,7 +11094,7 @@ SECURE ENCRYPTING DEVICE (SED) OPAL DRIVER
 M: Scott Bauer 
 M: Jonathan Derrick 
 M: Rafael Antognolli 
-L: linux-n...@lists.infradead.org
+L: linux-block@vger.kernel.org
 S: Supported
 F: block/sed*
 F: block/opal_proto.h
-- 
2.7.4



Re: [PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-13 Thread Scott Bauer
On Mon, Feb 13, 2017 at 04:30:36PM +, David Laight wrote:
> From: Scott Bauer Sent: 13 February 2017 16:11
> > When CONFIG_KASAN is enabled, compilation fails:
> > 
> > block/sed-opal.c: In function 'sed_ioctl':
> > block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 
> > 2048 bytes [-Werror=frame-
> > larger-than=]
> > 
> > Moved all the ioctl structures off the stack and dynamically activate
> > using _IOC_SIZE()
> 
> Think I'd note that this simplifies the code considerably.
> AFAICT CONFIG_KASAN is just a brainf*ck.
> It at least needs annotation that copy_from_user() has properties
> similar to memset().
> So if the size matches that of the type then no guard space (etc)
> is required.
>

I don't really follow what you're saying. Do you want me to just mention that
the patch cleans up the ioctl path a bit? I need to include the KASAN part
since there was build breakage and the series does fix it, even if it
simplifies the path as well. As for the memset part, we never copy back to
userland, so there's no chance of data leakage, which is what it seems you're
hinting at.

> ...
> > +   ioctl_ptr = memdup_user(arg,  _IOC_SIZE(cmd));
> > +   if (IS_ERR_OR_NULL(ioctl_ptr)) {
> > +   ret = PTR_ERR(ioctl_ptr);
> > +   goto out;
> ...
> > + out:
> > +   kfree(ioctl_ptr);
> > +   return ret;
> >  }


> 
> That error path doesn't look quite right to me.
> 
>   David
> 

good god, this is a mess this morning. Thanks for the catch, I'll review these
more aggressively before sending out. 


Re: [PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-13 Thread Arnd Bergmann
On Mon, Feb 13, 2017 at 5:29 PM, Scott Bauer  wrote:
> On Mon, Feb 13, 2017 at 04:30:36PM +, David Laight wrote:
>> From: Scott Bauer Sent: 13 February 2017 16:11
>> > When CONFIG_KASAN is enabled, compilation fails:
>> >
>> > block/sed-opal.c: In function 'sed_ioctl':
>> > block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger 
>> > than 2048 bytes [-Werror=frame-
>> > larger-than=]
>> >
>> > Moved all the ioctl structures off the stack and dynamically activate
>> > using _IOC_SIZE()
>>
>> Think I'd note that this simplifies the code considerably.
>> AFAICT CONFIG_KASAN is just a brainf*ck.
>> It at least needs annotation that copy_from_user() has properties
>> similar to memset().
>> So if the size matches that of the type then no guard space (etc)
>> is required.

I think it still would, as the pointer to the local variable gets passed through
dev->func_data[].

>> ...
>> > +   ioctl_ptr = memdup_user(arg,  _IOC_SIZE(cmd));
>> > +   if (IS_ERR_OR_NULL(ioctl_ptr)) {
>> > +   ret = PTR_ERR(ioctl_ptr);
>> > +   goto out;
>> ...
>> > + out:
>> > +   kfree(ioctl_ptr);
>> > +   return ret;
>> >  }
>
>
>>
>> That error path doesn't look quite right to me.
>>
>>   David
>>
>
> good god, this is a mess this morning. Thanks for the catch, I'll review these
> more aggressively before sending out.

memdup_user() never returns NULL, and generally speaking any use of
IS_ERR_OR_NULL() indicates that there is something wrong with the
interface, so aside from passing the right pointer (or NULL) into kfree()
I think using IS_ERR() is the correct solution.

 Arnd


Re: [PATCHv6 08/37] filemap: handle huge pages in do_generic_file_read()

2017-02-13 Thread Matthew Wilcox
On Mon, Feb 13, 2017 at 06:33:42PM +0300, Kirill A. Shutemov wrote:
> No. pagecache_get_page() returns subpage. See description of the first
> patch.

Your description says:

> We also change interface for page-cache lookup function:
> 
>   - functions that lookup for pages[1] would return subpages of THP
> relevant for requested indexes;
> 
>   - functions that lookup for entries[2] would return one entry per-THP
> and index will point to index of head page (basically, round down to
> HPAGE_PMD_NR);
> 
> This would provide balanced exposure of multi-order entires to the rest
> of the kernel.
> 
> [1] find_get_pages(), pagecache_get_page(), pagevec_lookup(), etc.
> [2] find_get_entry(), find_get_entries(), pagevec_lookup_entries(), etc.

I'm saying:

> > We got this page from find_get_page(), which gets it from
> > pagecache_get_page(), which gets it from find_get_entry() ... which
> > (unless I'm lost in your patch series) returns the head page.

Am I guilty of debugging documentation rather than code?



Re: [PATCHv6 08/37] filemap: handle huge pages in do_generic_file_read()

2017-02-13 Thread Matthew Wilcox
On Mon, Feb 13, 2017 at 08:01:17AM -0800, Matthew Wilcox wrote:
> On Mon, Feb 13, 2017 at 06:33:42PM +0300, Kirill A. Shutemov wrote:
> > No. pagecache_get_page() returns subpage. See description of the first
> > patch.

Oh, I re-read patch 1 and it made sense now.  I missed the bit where
pagecache_get_page() called page_subpage().


[PATCH V5 1/4] block: sed-opal: change ioctl to take user pointer instead of unsinged long

2017-02-13 Thread Scott Bauer
Signed-off-by: Scott Bauer 
---
 block/sed-opal.c | 6 --
 drivers/nvme/host/core.c | 3 ++-
 include/linux/sed-opal.h | 4 ++--
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index bf1406e..2448d4a 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -2344,9 +2344,11 @@ bool opal_unlock_from_suspend(struct opal_dev *dev)
 }
 EXPORT_SYMBOL(opal_unlock_from_suspend);
 
-int sed_ioctl(struct opal_dev *dev, unsigned int cmd, unsigned long ptr)
+int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg)
 {
-   void __user *arg = (void __user *)ptr;
+   void *ioctl_ptr;
+   int ret = -ENOTTY;
+   unsigned int cmd_size = _IOC_SIZE(cmd);
 
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d9f4903..04c48e7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -811,7 +811,8 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t 
mode,
return nvme_nvm_ioctl(ns, cmd, arg);
 #endif
if (is_sed_ioctl(cmd))
-   return sed_ioctl(&ns->ctrl->opal_dev, cmd, arg);
+   return sed_ioctl(&ns->ctrl->opal_dev, cmd,
+(void __user *) arg);
return -ENOTTY;
}
 }
diff --git a/include/linux/sed-opal.h b/include/linux/sed-opal.h
index af1a85e..205d520 100644
--- a/include/linux/sed-opal.h
+++ b/include/linux/sed-opal.h
@@ -132,7 +132,7 @@ struct opal_dev {
 #ifdef CONFIG_BLK_SED_OPAL
 bool opal_unlock_from_suspend(struct opal_dev *dev);
 void init_opal_dev(struct opal_dev *opal_dev, sec_send_recv *send_recv);
-int sed_ioctl(struct opal_dev *dev, unsigned int cmd, unsigned long ptr);
+int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *ioctl_ptr);
 
 static inline bool is_sed_ioctl(unsigned int cmd)
 {
@@ -160,7 +160,7 @@ static inline bool is_sed_ioctl(unsigned int cmd)
 }
 
 static inline int sed_ioctl(struct opal_dev *dev, unsigned int cmd,
-   unsigned long ptr)
+   void __user *ioctl_ptr)
 {
return 0;
 }
-- 
2.7.4



Re: [PATCH] blk-mq: Call bio_io_error if request returned is NULL

2017-02-13 Thread Jens Axboe
On 02/13/2017 09:04 AM, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> If blk_mq_map_request returns NULL, bio_endio() function is not
> called. Call bio_io_error() in case request returned is NULL.

That's currently not a possible condition, since the main
request mapper functions block on rq allocation. So we can
never return NULL there.

-- 
Jens Axboe



SED Opal Fixups

2017-02-13 Thread Scott Bauer
So we have a few patches here; they're pretty small. The first patch changes
the sed-opal ioctl functions to take a void __user * instead of
an unsigned long, which required a small cast in the nvme driver.
Patch 2 is a UAPI fixup for the _IOW to make an ioctl
the right size. Patch 3 fixes a compilation error when building with
KASAN due to the stack frame being too large. And lastly we move the
MAINTAINERS entry from the nvme list to linux-block.



Re: [PATCH V5 1/4] block: sed-opal: change ioctl to take user pointer instead of unsigned long

2017-02-13 Thread Scott Bauer
On Mon, Feb 13, 2017 at 09:11:09AM -0700, Scott Bauer wrote:
> Signed-off-by: Scott Bauer 
> ---
>  block/sed-opal.c | 6 --
>  drivers/nvme/host/core.c | 3 ++-
>  include/linux/sed-opal.h | 4 ++--
>  3 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/block/sed-opal.c b/block/sed-opal.c
> index bf1406e..2448d4a 100644
> --- a/block/sed-opal.c
> +++ b/block/sed-opal.c
> @@ -2344,9 +2344,11 @@ bool opal_unlock_from_suspend(struct opal_dev *dev)
>  }
>  EXPORT_SYMBOL(opal_unlock_from_suspend);
>  
> -int sed_ioctl(struct opal_dev *dev, unsigned int cmd, unsigned long ptr)
> +int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg)
>  {
> - void __user *arg = (void __user *)ptr;
> + void *ioctl_ptr;
> + int ret = -ENOTTY;
> + unsigned int cmd_size = _IOC_SIZE(cmd);

Ugh, I apparently messed up my rebase; these should be in patch 2, or maybe
I should squash p1 and p2 together.


Re: [PATCH v2 4/4] Maintainers: Add Information for SED Opal library

2017-02-13 Thread h...@infradead.org
On Fri, Feb 10, 2017 at 09:44:59AM -0700, Scott Bauer wrote:
> > > +F:	include/linux/sed*
> > > +F:	include/uapi/linux/sed*
> > 
> > Since this is in the block tree and is not nvme specific, could you use
> > linux-block as the main list?
> 
> Sure, that's fine. Is there a way to denote "main list" in maintainers,
> or just list it first?

I'd just use linux-block - traffic isn't much higher than the nvme list,
and it fits a lot better.  FYI, I hope to have ATA (and as byproduct
SCSI) support ready for the next merge window.


RE: [PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN

2017-02-13 Thread David Laight
From: Arnd Bergmann
> Sent: 13 February 2017 17:02
...
> >> > +   ioctl_ptr = memdup_user(arg,  _IOC_SIZE(cmd));
> >> > +   if (IS_ERR_OR_NULL(ioctl_ptr)) {
> >> > +   ret = PTR_ERR(ioctl_ptr);
> >> > +   goto out;
> >> ...
> >> > + out:
> >> > +   kfree(ioctl_ptr);
> >> > +   return ret;
...
> >> That error path doesn't look quite right to me.
...
> > good god, this is a mess this morning. Thanks for the catch, I'll review these
> > more aggressively before sending out.
> 
> memdup_user() never returns NULL, and generally speaking any use of
> IS_ERR_OR_NULL() indicates that there is something wrong with the
> interface, so aside from passing the right pointer (or NULL) into kfree()
> I think using IS_ERR() is the correct solution.

You missed the problem entirely.
Code needs to be:
if (IS_ERR(ioctl_ptr))
return PTR_ERR(ioctl_ptr);

David



[PATCH] nbd: use bdget_disk to get bdev

2017-02-13 Thread Josef Bacik
In preparation for the upcoming netlink interface we need to not rely on
already having the bdev for the NBD device we are doing operations on.
Instead of passing the bdev from the bdev ioctl around, use bdget_disk()
wherever we need the bdev and pass that down to the helpers as
necessary.

Signed-off-by: Josef Bacik 
---
 drivers/block/nbd.c | 88 ++---
 1 file changed, 50 insertions(+), 38 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 25891a1..0623f8f 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -168,13 +168,18 @@ static void nbd_size_update(struct nbd_device *nbd, struct block_device *bdev)
	kobject_uevent(&disk_to_dev(nbd)->kobj, KOBJ_CHANGE);
 }
 
-static void nbd_size_set(struct nbd_device *nbd, struct block_device *bdev,
-   loff_t blocksize, loff_t nr_blocks)
+static int nbd_size_set(struct nbd_device *nbd, loff_t blocksize,
+   loff_t nr_blocks)
 {
+   struct block_device *bdev = bdget_disk(nbd->disk, 0);
+   if (!bdev)
+   return -EINVAL;
nbd->blksize = blocksize;
nbd->bytesize = blocksize * nr_blocks;
if (nbd_is_connected(nbd))
nbd_size_update(nbd, bdev);
+   bdput(bdev);
+   return 0;
 }
 
 static void nbd_end_request(struct nbd_cmd *cmd)
@@ -704,8 +709,7 @@ static int nbd_wait_for_socks(struct nbd_device *nbd)
return ret;
 }
 
-static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
- unsigned long arg)
+static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg)
 {
struct socket *sock;
struct nbd_sock **socks;
@@ -749,8 +753,6 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
nsock->sock = sock;
socks[nbd->num_connections++] = nsock;
 
-   if (max_part)
-   bdev->bd_invalidated = 1;
err = 0;
 out:
mutex_unlock(&nbd->socks_lock);
@@ -771,18 +773,19 @@ static void nbd_reset(struct nbd_device *nbd)
 
 static void nbd_bdev_reset(struct block_device *bdev)
 {
-   set_device_ro(bdev, false);
-   bdev->bd_inode->i_size = 0;
+   bd_set_size(bdev, 0);
if (max_part > 0) {
blkdev_reread_part(bdev);
bdev->bd_invalidated = 1;
}
 }
 
-static void nbd_parse_flags(struct nbd_device *nbd, struct block_device *bdev)
+static void nbd_parse_flags(struct nbd_device *nbd)
 {
if (nbd->flags & NBD_FLAG_READ_ONLY)
-   set_device_ro(bdev, true);
+   set_disk_ro(nbd->disk, true);
+   else
+   set_disk_ro(nbd->disk, false);
if (nbd->flags & NBD_FLAG_SEND_TRIM)
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue);
if (nbd->flags & NBD_FLAG_SEND_FLUSH)
@@ -807,16 +810,11 @@ static void send_disconnects(struct nbd_device *nbd)
}
 }
 
-static int nbd_disconnect(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_disconnect(struct nbd_device *nbd)
 {
dev_info(disk_to_dev(nbd->disk), "NBD_DISCONNECT\n");
if (!nbd_socks_get_unless_zero(nbd))
return -EINVAL;
-
-   mutex_unlock(&nbd->config_lock);
-   fsync_bdev(bdev);
-   mutex_lock(&nbd->config_lock);
-
if (!test_and_set_bit(NBD_DISCONNECT_REQUESTED,
  &nbd->runtime_flags))
send_disconnects(nbd);
@@ -824,28 +822,40 @@ static int nbd_disconnect(struct nbd_device *nbd, struct block_device *bdev)
return 0;
 }
 
-static int nbd_clear_sock(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_clear_sock(struct nbd_device *nbd)
 {
+   struct block_device *bdev = bdget_disk(nbd->disk, 0);
+
sock_shutdown(nbd);
nbd_clear_que(nbd);
-   kill_bdev(bdev);
-   nbd_bdev_reset(bdev);
+   if (bdev) {
+   kill_bdev(bdev);
+   nbd_bdev_reset(bdev);
+   bdput(bdev);
+   }
nbd->task_setup = NULL;
if (test_and_clear_bit(NBD_HAS_SOCKS_REF, &nbd->runtime_flags))
nbd_socks_put(nbd);
return 0;
 }
 
-static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_start_device(struct nbd_device *nbd)
 {
struct recv_thread_args *args;
+   struct block_device *bdev = bdget_disk(nbd->disk, 0);
int num_connections = nbd->num_connections;
int error = 0, i;
 
-   if (nbd->task_recv)
+   if (!bdev)
+   return -EINVAL;
+   if (nbd->task_recv) {
+   bdput(bdev);
return -EBUSY;
-   if (!nbd->socks)
+   }
+   if (!nbd->socks) {
+   bdput(bdev);
return -EINVAL;
+   }
if (num_connections > 1 &&
!(nbd->flags & NBD_FLAG_CAN_MULTI_CONN)) {
dev_err(disk_to_dev(nbd->disk), "server does not 

Re: [PATCHv6 01/37] mm, shmem: swich huge tmpfs to multi-order radix-tree entries

2017-02-13 Thread Kirill A. Shutemov
On Thu, Feb 09, 2017 at 07:58:20PM +0300, Kirill A. Shutemov wrote:
> I'll look into it.

I ended up with this (I'll test it more later):

void filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
struct radix_tree_iter iter;
void **slot;
struct file *file = vmf->vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
loff_t size;
struct page *page;
bool mapped;

rcu_read_lock();
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
start_pgoff) {
unsigned long index = iter.index;
if (index < start_pgoff)
index = start_pgoff;
if (index > end_pgoff)
break;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
continue;
if (radix_tree_exception(page)) {
if (radix_tree_deref_retry(page))
slot = radix_tree_iter_retry(&iter);
continue;
}

if (!page_cache_get_speculative(page))
goto repeat;

/* Has the page moved? */
if (unlikely(page != *slot)) {
put_page(page);
goto repeat;
}

/* For multi-order entries, find relevant subpage */
page = find_subpage(page, index);

if (!PageUptodate(page) || PageReadahead(page))
goto skip;
if (!trylock_page(page))
goto skip;

if (page_mapping(page) != mapping || !PageUptodate(page))
goto skip_unlock;

size = round_up(i_size_read(mapping->host), PAGE_SIZE);
if (compound_head(page)->index >= size >> PAGE_SHIFT)
goto skip_unlock;

if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
map_next_subpage:
if (PageHWPoison(page))
goto next;

vmf->address += (index - last_pgoff) << PAGE_SHIFT;
if (vmf->pte)
vmf->pte += index - last_pgoff;
last_pgoff = index;
mapped = !alloc_set_pte(vmf, NULL, page);

/* Huge page is mapped or last index? No need to proceed. */
if (pmd_trans_huge(*vmf->pmd) ||
index == end_pgoff) {
unlock_page(page);
break;
}
next:
if (page && PageCompound(page)) {
/* Last subpage handled? */
if ((index & (compound_nr_pages(page) - 1)) ==
compound_nr_pages(page) - 1)
goto skip_unlock;
index++;
page++;

/*
 * One page reference goes to page table mapping.
 * Need additional reference, if last alloc_set_pte()
 * succeed.
 */
if (mapped)
get_page(page);
goto map_next_subpage;
}
skip_unlock:
unlock_page(page);
skip:
iter.index = compound_head(page)->index +
compound_nr_pages(page) - 1;
/* Only give up reference if alloc_set_pte() failed. */
if (!mapped)
put_page(page);
}
rcu_read_unlock();
}

-- 
 Kirill A. Shutemov


Re: [PATCH v1 3/5] md: fail if mddev->bio_set can't be created

2017-02-13 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 

but this really needs to be patch 2 in this series.


Re: [PATCH v1 1/5] block: introduce bio_clone_bioset_partial()

2017-02-13 Thread Christoph Hellwig
On Fri, Feb 10, 2017 at 06:56:13PM +0800, Ming Lei wrote:
> md still need bio clone(not the fast version) for behind write,
> and it is more efficient to use bio_clone_bioset_partial().
> 
> The idea is simple and just copy the bvecs range specified from
> parameters.

Given how few users bio_clone_bioset has I wonder if we shouldn't
simply add the two new arguments to it instead of adding another
indirection.

Otherwise looks fine:

Reviewed-by: Christoph Hellwig 


Re: [PATCH v1 2/5] md/raid1: use bio_clone_bioset_partial() in case of write behind

2017-02-13 Thread Christoph Hellwig
> + struct bio *mbio = NULL;
> + int offset;
>   if (!r1_bio->bios[i])
>   continue;
>  
> - mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> - bio_trim(mbio, r1_bio->sector - bio->bi_iter.bi_sector,
> -  max_sectors);
> + offset = r1_bio->sector - bio->bi_iter.bi_sector;

I think offset should be a sector_t.

Otherwise this looks fine:

Reviewed-by: Christoph Hellwig 


Re: [PATCH v1 5/5] md: fast clone bio in bio_clone_mddev()

2017-02-13 Thread Christoph Hellwig
Please use bio_clone_fast directly and kill the wrapper.


Re: [PATCH v1 4/5] md: remove unnecessary check on mddev

2017-02-13 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig