Re: [PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
On Mon, Feb 13, 2017 at 6:07 PM, David Laight wrote: > From: Arnd Bergmann >> Sent: 13 February 2017 17:02 > ... >> >> > + ioctl_ptr = memdup_user(arg, _IOC_SIZE(cmd)); >> >> > + if (IS_ERR_OR_NULL(ioctl_ptr)) { >> >> > + ret = PTR_ERR(ioctl_ptr); >> >> > + goto out; >> >> ... >> >> > + out: >> >> > + kfree(ioctl_ptr); >> >> > + return ret; > ... >> >> That error path doesn't look quite right to me. > ... >> > good god, this is a mess this morning. Thanks for the catch, I'll review >> > these >> > more aggressively before sending out. >> >> memdup_user() never returns NULL, and generally speaking any use of >> IS_ERR_OR_NULL() indicates that there is something wrong with the >> interface, so aside from passing the right pointer (or NULL) into kfree() >> I think using IS_ERR() is the correct solution. > > You missed the problem entirely. > Code needs to be: > if (IS_ERR(ioctl_ptr)) > return PTR_ERR(ioctl_ptr); Sorry if that wasn't clear, I expected the part about the kfree(errptr) to be obvious but was trying to avoid having Scott go through another revision for removing the IS_ERR_OR_NULL() after fixing the first problem. Arnd
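The bug under discussion can be sketched in userspace. The following is a Python model (not kernel code) of the ERR_PTR/IS_ERR convention; the MAX_ERRNO value matches the kernel's, but the "pointer" encoding is simplified. It shows why the original error path was wrong: memdup_user() returns either a valid buffer or an ERR_PTR-encoded errno, never NULL, so the failure branch must return immediately rather than fall through to a kfree() on the error value.

```python
# Userspace sketch of the kernel's ERR_PTR/IS_ERR convention, used to
# illustrate the fix discussed above.  In the kernel an ERR_PTR is a tagged
# pointer; here a small negative int stands in for it.

MAX_ERRNO = 4095
EFAULT = 14

def ERR_PTR(errno):
    # encode a negative errno as a "pointer" value
    return errno

def IS_ERR(ptr):
    # true only for values in the last-page errno range; never for NULL
    return isinstance(ptr, int) and -MAX_ERRNO <= ptr < 0

def memdup_user(data, fault=False):
    """Model of memdup_user(): returns a copy or an ERR_PTR, never None."""
    if fault:
        return ERR_PTR(-EFAULT)
    return bytes(data)

def sed_ioctl_fixed(data, fault=False):
    ioctl_ptr = memdup_user(data, fault)
    if IS_ERR(ioctl_ptr):
        return ioctl_ptr      # return the errno directly: no goto, no kfree
    # ... use ioctl_ptr, then free it on the success path only ...
    return 0
```

Calling `sed_ioctl_fixed(b"x", fault=True)` returns `-14` (-EFAULT) without ever "freeing" the error value, which is exactly the shape David and Arnd converge on.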
Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice
On 02/13/2017 03:28 PM, Jens Axboe wrote: > On 02/13/2017 03:09 PM, Omar Sandoval wrote: >> On Mon, Feb 13, 2017 at 10:01:07PM +0100, Paolo Valente wrote: >>> If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq, >>> or, viceversa, a blk-mq scheduler is chosen for a device using blk, then >>> that scheduler is set and initialized without any check, driving the >>> system into an inconsistent state. This commit addresses this issue by >>> letting elevator_get fail for these wrong cross choices. >>> >>> Signed-off-by: Paolo Valente>>> --- >>> block/elevator.c | 26 ++ >>> 1 file changed, 18 insertions(+), 8 deletions(-) >> >> Hey, Paolo, >> >> How exactly are you triggering this? In __elevator_change(), we do check >> for mq or not mq: >> >> if (!e->uses_mq && q->mq_ops) { >> elevator_put(e); >> return -EINVAL; >> } >> if (e->uses_mq && !q->mq_ops) { >> elevator_put(e); >> return -EINVAL; >> } >> >> We don't ever appear to call elevator_init() with a specific scheduler >> name, and for the default we switch off of q->mq_ops and use the >> defaults from Kconfig: >> >> if (q->mq_ops && q->nr_hw_queues == 1) >> e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false); >> else if (q->mq_ops) >> e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false); >> else >> e = elevator_get(CONFIG_DEFAULT_IOSCHED, false); >> >> if (!e) { >> printk(KERN_ERR >> "Default I/O scheduler not found. " \ >> "Using noop/none.\n"); >> e = elevator_get("noop", false); >> } >> >> So I guess this could happen if someone manually changed those Kconfig >> options, but I don't see what other case would make this happen, could >> you please explain? > > Was wondering the same - is it using the 'elevator=' boot parameter? > Didn't look at that path just now, but that's the only one I could > think of. If it is, I'd much prefer only using 'chosen_elevator' for > the non-mq stuff, and the fix should be just that instead. 
> > So instead of: > > if (!e && *chosen_elevator) { > > do > > if (!e && !q->mq_ops && *chosen_elevator) { Confirmed, that's what it seems to be, and here's a real diff of the above example that works for me: diff --git a/block/elevator.c b/block/elevator.c index 27ff1ed5a6fa..699d10f71a2c 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -207,11 +207,12 @@ int elevator_init(struct request_queue *q, char *name) } /* -* Use the default elevator specified by config boot param or -* config option. Don't try to load modules as we could be running -* off async and request_module() isn't allowed from async. +* Use the default elevator specified by config boot param for +* non-mq devices, or by config option. Don't try to load modules +* as we could be running off async and request_module() isn't +* allowed from async. */ - if (!e && *chosen_elevator) { + if (!e && !q->mq_ops && *chosen_elevator) { e = elevator_get(chosen_elevator, false); if (!e) printk(KERN_ERR "I/O scheduler %s not found\n", -- Jens Axboe
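The selection order Jens's diff produces can be sketched as follows. This is a Python model with illustrative default names (the real values come from the Kconfig options named in the thread); the point is that the 'elevator=' boot parameter (chosen_elevator) is only consulted for legacy (non-mq) queues, so a legacy scheduler name can no longer be forced onto a blk-mq queue at init time.

```python
# Sketch of elevator_init()'s default selection after Jens's fix.
# Default names are illustrative stand-ins for CONFIG_DEFAULT_*_IOSCHED.

DEFAULT_SQ = "mq-deadline"   # single-hw-queue blk-mq default
DEFAULT_MQ = "none"          # multi-hw-queue blk-mq default
DEFAULT_LEGACY = "cfq"       # legacy request_queue default

def pick_default(mq_ops, nr_hw_queues, chosen_elevator=""):
    # the 'elevator=' boot parameter applies to legacy queues only
    if not mq_ops and chosen_elevator:
        return chosen_elevator
    if mq_ops and nr_hw_queues == 1:
        return DEFAULT_SQ
    if mq_ops:
        return DEFAULT_MQ
    return DEFAULT_LEGACY
```

With this shape, `elevator=deadline` on the command line affects only legacy queues; a blk-mq queue falls through to its own Kconfig default.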
Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq
> On 10 Feb 2017, at 20:49, Paolo Valente wrote: > >> >> On 10 Feb 2017, at 19:13, Bart Van Assche >> wrote: >> >> On 02/10/2017 08:49 AM, Paolo Valente wrote: $ grep '^C.*_MQ_' .config CONFIG_BLK_MQ_PCI=y CONFIG_MQ_IOSCHED_BFQ=y CONFIG_MQ_IOSCHED_DEADLINE=y CONFIG_MQ_IOSCHED_NONE=y CONFIG_DEFAULT_MQ_BFQ_MQ=y CONFIG_DEFAULT_MQ_IOSCHED="bfq-mq" CONFIG_SCSI_MQ_DEFAULT=y CONFIG_DM_MQ_DEFAULT=y >>> >>> Could you reconfigure with none or mq-deadline as default, check >>> whether the system boots, and, if it does, switch manually to bfq-mq, >>> check what happens, and, in the likely case of a failure, try to get >>> the oops? >> >> Hello Paolo, >> >> I just finished performing that test with the following kernel config: >> $ grep '^C.*_MQ_' .config >> CONFIG_BLK_MQ_PCI=y >> CONFIG_MQ_IOSCHED_BFQ=y >> CONFIG_MQ_IOSCHED_DEADLINE=y >> CONFIG_MQ_IOSCHED_NONE=y >> CONFIG_DEFAULT_MQ_DEADLINE=y >> CONFIG_DEFAULT_MQ_IOSCHED="mq-deadline" >> CONFIG_SCSI_MQ_DEFAULT=y >> CONFIG_DM_MQ_DEFAULT=y >> >> After the system came up I logged in, switched to the bfq-mq scheduler >> and ran several I/O tests against the boot disk. > > Without any failure, right? > > Unfortunately, as you can imagine, no boot failure occurred on > any of my test systems so far :( > > This version of bfq-mq can be configured to print all its activity in > the kernel log, just by defining a macro. This will of course slow > down the system so much that it will probably be unusable, if bfq-mq is > active from boot. Yet, the failure may still occur early enough for > this approach to be useful in discovering where bfq-mq gets stuck. As of now I > have no better ideas. Any suggestion is welcome. > Hi Bart, I have found a machine crashing at boot, yet not only when bfq-mq is chosen, but also when mq-deadline is chosen as the default scheduler. I have found and just reported the cause of the failure, together with a fix. 
Probably this is not the cause of your failure, but what do you think about trying this fix? BTW, I have rebased the branch [1] against the new commits in Jens' for-4.11/next. Otherwise, if you have no news or suggestions, would you be willing to try my micro-logging proposal? Thanks, Paolo [1] https://github.com/Algodev-github/bfq-mq > Thanks, > Paolo > >> Sorry but nothing >> interesting appeared in the kernel log. >> >> Bart.
Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice
On Mon, 2017-02-13 at 22:01 +0100, Paolo Valente wrote: > -static struct elevator_type *elevator_get(const char *name, bool try_loading) > +static struct elevator_type *elevator_get(const char *name, bool try_loading, > + bool mq_ops) Please choose a better name for that argument, e.g. "mq". To me the name "mq_ops" means "a pointer to a data structure with operation function pointers". > + if (e && (e->uses_mq != mq_ops)) { > + pr_err("ERROR: attempted to choose %s %s I/O scheduler in > blk%s", > +name, e->uses_mq ? "blk-mq" : "legacy", mq_ops ? "-mq" : > ""); > + e = NULL; > + } How about changing the above into: + if (e && e->uses_mq != mq) { + pr_err("ERROR: attempt to configure %s as I/O scheduler for a %s queue\n", + name, mq ? "blk-mq" : "legacy"); + e = NULL; + } Thanks, Bart.
Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq
> On 7 Feb 2017, at 18:24, Paolo Valente wrote: > > Hi, > > I have finally pushed here [1] the current WIP branch of bfq for > blk-mq, which I have tentatively named bfq-mq. > > This branch *IS NOT* meant for merging into mainline and contains code > that may easily violate code style, and not only that, in many > places. Commits implement the following main steps: > 1) Add the last version of bfq for blk > 2) Clone bfq source files into identical bfq-mq source files > 3) Modify bfq-mq files to get a working version of bfq for blk-mq > (cgroups support not yet functional) > > In my intentions, the main goals of this branch are: > > 1) Show, as soon as I could, the changes I made to let bfq-mq comply > with the blk-mq-sched framework. I thought this could be particularly > useful for Jens, BFQ being identical to CFQ in terms of hook > interfaces and io-context handling, and almost identical in terms of > request merging. > > 2) Enable people to test this first version of bfq-mq. Code is purposely > overfull of log messages and invariant checks that halt the system on > failure (lock assertions, BUG_ONs, ...). > > To make it easier to revise commits, I'm sending the patches that > transform bfq into bfq-mq (last four patches in the branch [1]). They > work on two files, bfq-mq-iosched.c and bfq-mq.h, which, at the > beginning, are just copies of bfq-iosched.c and bfq.h. > Hi, this is just to inform you that, as I just wrote to Bart, I have rebased the branch [1] against the current content of for-4.11/next. Jens, Omar, did you find the time to have a look at the main commits or to run some tests? 
Thanks, Paolo [1] https://github.com/Algodev-github/bfq-mq > Thanks, > Paolo > > [1] https://github.com/Algodev-github/bfq-mq > > Paolo Valente (4): > blk-mq: pass bio to blk_mq_sched_get_rq_priv > Move thinktime from bic to bfqq > Embed bfq-ioc.c and add locking on request queue > Modify interface and operation to comply with blk-mq-sched > > block/bfq-cgroup.c | 4 - > block/bfq-mq-iosched.c | 852 +-- > block/bfq-mq.h | 65 ++-- > block/blk-mq-sched.c | 8 +- > block/blk-mq-sched.h | 5 +- > include/linux/elevator.h | 2 +- > 6 files changed, 567 insertions(+), 369 deletions(-) > > -- > 2.10.0
Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq
On Mon, 2017-02-13 at 22:07 +, Bart Van Assche wrote: > On Mon, 2017-02-13 at 22:07 +0100, Paolo Valente wrote: > > but what do you think about trying this fix? > > Sorry but with ... the same server I used for the previous test still > didn't boot up properly. A screenshot is available at > https://goo.gl/photos/Za9QVGCNe2BJBwxVA. > > > Otherwise, if you have no news or suggestions, would you be willing to > > try my micro-logging proposal https://github.com/Algodev-github/bfq-mq? > > Sorry but it's not clear to me what logging mechanism you are referring > to and how to enable it? Are you perhaps referring to > CONFIG_BFQ_REDIRECT_TO_CONSOLE? Anyway, a second screenshot has been added to the same album after I had applied the following patch: diff --git a/block/Makefile b/block/Makefile index 1c04fe19e825..bf472ac82c08 100644 --- a/block/Makefile +++ b/block/Makefile @@ -2,6 +2,8 @@ # Makefile for the kernel block layer # +KBUILD_CFLAGS += -DCONFIG_BFQ_REDIRECT_TO_CONSOLE + obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \ blk-flush.o blk-settings.o blk-ioc.o blk-map.o \ blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \ Bart.
[PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice
If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq, or, vice versa, a blk-mq scheduler is chosen for a device using blk, then that scheduler is set and initialized without any check, driving the system into an inconsistent state. This commit addresses this issue by letting elevator_get fail for these wrong cross choices. Signed-off-by: Paolo Valente --- block/elevator.c | 26 ++ 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/block/elevator.c b/block/elevator.c index 27ff1ed..a25bdd9 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -99,7 +99,8 @@ static void elevator_put(struct elevator_type *e) module_put(e->elevator_owner); } -static struct elevator_type *elevator_get(const char *name, bool try_loading) +static struct elevator_type *elevator_get(const char *name, bool try_loading, + bool mq_ops) { struct elevator_type *e; @@ -113,6 +114,12 @@ static struct elevator_type *elevator_get(const char *name, bool try_loading) e = elevator_find(name); } + if (e && (e->uses_mq != mq_ops)) { + pr_err("ERROR: attempted to choose %s %s I/O scheduler in blk%s", + name, e->uses_mq ? "blk-mq" : "legacy", mq_ops ? "-mq" : ""); + e = NULL; + } + if (e && !try_module_get(e->elevator_owner)) e = NULL; @@ -201,7 +208,7 @@ int elevator_init(struct request_queue *q, char *name) q->boundary_rq = NULL; if (name) { - e = elevator_get(name, true); + e = elevator_get(name, true, q->mq_ops); if (!e) return -EINVAL; } @@ -212,7 +219,7 @@ int elevator_init(struct request_queue *q, char *name) * off async and request_module() isn't allowed from async. 
*/ if (!e && *chosen_elevator) { - e = elevator_get(chosen_elevator, false); + e = elevator_get(chosen_elevator, false, q->mq_ops); if (!e) printk(KERN_ERR "I/O scheduler %s not found\n", chosen_elevator); @@ -220,17 +227,20 @@ int elevator_init(struct request_queue *q, char *name) if (!e) { if (q->mq_ops && q->nr_hw_queues == 1) - e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false); + e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false, +q->mq_ops); else if (q->mq_ops) - e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false); + e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false, +q->mq_ops); else - e = elevator_get(CONFIG_DEFAULT_IOSCHED, false); + e = elevator_get(CONFIG_DEFAULT_IOSCHED, false, +q->mq_ops); if (!e) { printk(KERN_ERR "Default I/O scheduler not found. " \ "Using noop/none.\n"); - e = elevator_get("noop", false); + e = elevator_get("noop", false, q->mq_ops); } } @@ -1051,7 +1061,7 @@ static int __elevator_change(struct request_queue *q, const char *name) return elevator_switch(q, NULL); strlcpy(elevator_name, name, sizeof(elevator_name)); - e = elevator_get(strstrip(elevator_name), true); + e = elevator_get(strstrip(elevator_name), true, q->mq_ops); if (!e) { printk(KERN_ERR "elevator: type %s not found\n", elevator_name); return -EINVAL; -- 2.10.0
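The behavior the patch adds to elevator_get() can be sketched in a few lines. This is a Python model, not the kernel's registry; the scheduler table is illustrative. The point is the cross-check: if the named scheduler's uses_mq flag disagrees with the queue type, the lookup fails instead of silently initializing the wrong kind of elevator.

```python
# Sketch of the cross-check the patch adds to elevator_get().
# The table below stands in for the kernel's registered elevator_types.

ELEVATORS = {
    "noop":        {"uses_mq": False},
    "deadline":    {"uses_mq": False},
    "cfq":         {"uses_mq": False},
    "none":        {"uses_mq": True},
    "mq-deadline": {"uses_mq": True},
}

def elevator_get(name, mq):
    e = ELEVATORS.get(name)
    if e and e["uses_mq"] != mq:
        # mismatch: reject instead of initializing the wrong elevator
        print("ERROR: attempt to configure %s as I/O scheduler "
              "for a %s queue" % (name, "blk-mq" if mq else "legacy"))
        return None
    return e
```

With this check, requesting 'cfq' for a blk-mq queue (or 'mq-deadline' for a legacy queue) fails cleanly, which is the inconsistent state the commit message describes.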
[PATCH BUGFIX] attempt to fix wrong scheduler selection
Hi, If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq, or, vice versa, a blk-mq scheduler is chosen for a device using blk, then that scheduler is set and initialized without any check, driving the system into an inconsistent state. The purpose of this message is, first, to report this issue, and, second, to propose a possible fix in case you consider this a bug. Thanks, Paolo Paolo Valente (1): block: make elevator_get robust against cross blk/blk-mq choice block/elevator.c | 26 ++ 1 file changed, 18 insertions(+), 8 deletions(-) -- 2.10.0
Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq
On Mon, 2017-02-13 at 22:07 +0100, Paolo Valente wrote: > but what do you think about trying this fix? Sorry but with ... the same server I used for the previous test still didn't boot up properly. A screenshot is available at https://goo.gl/photos/Za9QVGCNe2BJBwxVA. > Otherwise, if you have no news or suggestions, would you be willing to > try my micro-logging proposal https://github.com/Algodev-github/bfq-mq? Sorry but it's not clear to me what logging mechanism you are referring to and how to enable it? Are you perhaps referring to CONFIG_BFQ_REDIRECT_TO_CONSOLE? Thanks, Bart.
Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice
On 02/13/2017 03:09 PM, Omar Sandoval wrote: > On Mon, Feb 13, 2017 at 10:01:07PM +0100, Paolo Valente wrote: >> If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq, >> or, vice versa, a blk-mq scheduler is chosen for a device using blk, then >> that scheduler is set and initialized without any check, driving the >> system into an inconsistent state. This commit addresses this issue by >> letting elevator_get fail for these wrong cross choices. >> >> Signed-off-by: Paolo Valente >> --- >> block/elevator.c | 26 ++ >> 1 file changed, 18 insertions(+), 8 deletions(-) > > Hey, Paolo, > > How exactly are you triggering this? In __elevator_change(), we do check > for mq or not mq: > > if (!e->uses_mq && q->mq_ops) { > elevator_put(e); > return -EINVAL; > } > if (e->uses_mq && !q->mq_ops) { > elevator_put(e); > return -EINVAL; > } > > We don't ever appear to call elevator_init() with a specific scheduler > name, and for the default we switch off of q->mq_ops and use the > defaults from Kconfig: > > if (q->mq_ops && q->nr_hw_queues == 1) > e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false); > else if (q->mq_ops) > e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false); > else > e = elevator_get(CONFIG_DEFAULT_IOSCHED, false); > > if (!e) { > printk(KERN_ERR > "Default I/O scheduler not found. " \ > "Using noop/none.\n"); > e = elevator_get("noop", false); > } > > So I guess this could happen if someone manually changed those Kconfig > options, but I don't see what other case would make this happen, could > you please explain? Was wondering the same - is it using the 'elevator=' boot parameter? Didn't look at that path just now, but that's the only one I could think of. If it is, I'd much prefer only using 'chosen_elevator' for the non-mq stuff, and the fix should be just that instead. So instead of: if (!e && *chosen_elevator) { do if (!e && !q->mq_ops && *chosen_elevator) { -- Jens Axboe
Re: [WIP PATCHSET 0/4] WIP branch for bfq-mq
On 02/13/2017 02:09 PM, Paolo Valente wrote: > >> On 7 Feb 2017, at 18:24, Paolo Valente >> wrote: >> >> Hi, >> >> I have finally pushed here [1] the current WIP branch of bfq for >> blk-mq, which I have tentatively named bfq-mq. >> >> This branch *IS NOT* meant for merging into mainline and contains code >> that may easily violate code style, and not only that, in many >> places. Commits implement the following main steps: >> 1) Add the last version of bfq for blk >> 2) Clone bfq source files into identical bfq-mq source files >> 3) Modify bfq-mq files to get a working version of bfq for blk-mq >> (cgroups support not yet functional) >> >> In my intentions, the main goals of this branch are: >> >> 1) Show, as soon as I could, the changes I made to let bfq-mq comply >> with the blk-mq-sched framework. I thought this could be particularly >> useful for Jens, BFQ being identical to CFQ in terms of hook >> interfaces and io-context handling, and almost identical in terms of >> request merging. >> >> 2) Enable people to test this first version of bfq-mq. Code is purposely >> overfull of log messages and invariant checks that halt the system on >> failure (lock assertions, BUG_ONs, ...). >> >> To make it easier to revise commits, I'm sending the patches that >> transform bfq into bfq-mq (last four patches in the branch [1]). They >> work on two files, bfq-mq-iosched.c and bfq-mq.h, which, at the >> beginning, are just copies of bfq-iosched.c and bfq.h. >> > > Hi, > this is just to inform you that, as I just wrote to Bart, I have rebased > the branch [1] against the current content of for-4.11/next. > > Jens, Omar, did you find the time to have a look at the main commits > or to run some tests? I only looked at the core change you proposed for passing in the bio as well, and Omar fixed up the icq exit part and I also applied that patch. I haven't looked at any of the bfq-mq patches at all yet. 
Not sure what I can do with those, I don't think those are particularly useful to anyone but you. Might make more sense to post the conversion for review as a whole. -- Jens Axboe
Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice
On Mon, Feb 13, 2017 at 10:01:07PM +0100, Paolo Valente wrote: > If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq, > or, viceversa, a blk-mq scheduler is chosen for a device using blk, then > that scheduler is set and initialized without any check, driving the > system into an inconsistent state. This commit addresses this issue by > letting elevator_get fail for these wrong cross choices. > > Signed-off-by: Paolo Valente> --- > block/elevator.c | 26 ++ > 1 file changed, 18 insertions(+), 8 deletions(-) Hey, Paolo, How exactly are you triggering this? In __elevator_change(), we do check for mq or not mq: if (!e->uses_mq && q->mq_ops) { elevator_put(e); return -EINVAL; } if (e->uses_mq && !q->mq_ops) { elevator_put(e); return -EINVAL; } We don't ever appear to call elevator_init() with a specific scheduler name, and for the default we switch off of q->mq_ops and use the defaults from Kconfig: if (q->mq_ops && q->nr_hw_queues == 1) e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false); else if (q->mq_ops) e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false); else e = elevator_get(CONFIG_DEFAULT_IOSCHED, false); if (!e) { printk(KERN_ERR "Default I/O scheduler not found. " \ "Using noop/none.\n"); e = elevator_get("noop", false); } So I guess this could happen if someone manually changed those Kconfig options, but I don't see what other case would make this happen, could you please explain?
[PATCH 4/7] nonblocking aio: return on congested block device
From: Goldwyn Rodrigues A new flag BIO_NONBLOCKING is introduced to identify bios originating from an iocb with IOCB_NONBLOCKING. struct requests are allocated with BLK_MQ_REQ_NOWAIT if BIO_NONBLOCKING is set. Signed-off-by: Goldwyn Rodrigues --- block/blk-core.c | 13 +++-- block/blk-mq.c| 18 -- fs/direct-io.c| 11 +-- include/linux/blk_types.h | 1 + 4 files changed, 37 insertions(+), 6 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 14d7c07..9767573 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1257,6 +1257,11 @@ static struct request *get_request(struct request_queue *q, int op, if (!IS_ERR(rq)) return rq; + if (bio_flagged(bio, BIO_NONBLOCKING)) { + blk_put_rl(rl); + return ERR_PTR(-EAGAIN); + } + if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) { blk_put_rl(rl); return rq; @@ -2035,7 +2040,7 @@ blk_qc_t generic_make_request(struct bio *bio) do { struct request_queue *q = bdev_get_queue(bio->bi_bdev); - if (likely(blk_queue_enter(q, false) == 0)) { + if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NONBLOCKING)) == 0)) { ret = q->make_request_fn(q, bio); blk_queue_exit(q); @@ -2044,7 +2049,11 @@ blk_qc_t generic_make_request(struct bio *bio) } else { struct bio *bio_next = bio_list_pop(current->bio_list); - bio_io_error(bio); + if (unlikely(bio_flagged(bio, BIO_NONBLOCKING))) { + bio->bi_error = -EAGAIN; + bio_endio(bio); + } else + bio_io_error(bio); bio = bio_next; } } while (bio); diff --git a/block/blk-mq.c b/block/blk-mq.c index 81caceb..7a7c674 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1213,6 +1213,8 @@ static struct request *blk_mq_map_request(struct request_queue *q, trace_block_getrq(q, bio, op); blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx); + if (bio_flagged(bio, BIO_NONBLOCKING)) + alloc_data.flags |= BLK_MQ_REQ_NOWAIT; rq = __blk_mq_alloc_request(&alloc_data, op, op_flags); data->hctx = alloc_data.hctx; @@ -1286,8 +1288,14 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct 
bio *bio) return BLK_QC_T_NONE; rq = blk_mq_map_request(q, bio, &data); - if (unlikely(!rq)) + if (unlikely(!rq)) { + if (bio_flagged(bio, BIO_NONBLOCKING)) + bio->bi_error = -EAGAIN; + else + bio->bi_error = -EIO; + bio_endio(bio); return BLK_QC_T_NONE; + } cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); @@ -1381,8 +1389,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) request_count = blk_plug_queued_count(q); rq = blk_mq_map_request(q, bio, &data); - if (unlikely(!rq)) + if (unlikely(!rq)) { + if (bio_flagged(bio, BIO_NONBLOCKING)) + bio->bi_error = -EAGAIN; + else + bio->bi_error = -EIO; + bio_endio(bio); return BLK_QC_T_NONE; + } cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); diff --git a/fs/direct-io.c b/fs/direct-io.c index fb9aa16..9997fed 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio, else bio->bi_end_io = dio_bio_end_io; + if (dio->iocb->ki_flags & IOCB_NONBLOCKING) + bio_set_flag(bio, BIO_NONBLOCKING); + sdio->bio = bio; sdio->logical_offset_in_bio = sdio->cur_page_fs_offset; } @@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio) unsigned i; int err; - if (bio->bi_error) - dio->io_error = -EIO; + if (bio->bi_error) { + if (bio_flagged(bio, BIO_NONBLOCKING)) + dio->io_error = bio->bi_error; + else + dio->io_error = -EIO; + } if (dio->is_async && dio->op == REQ_OP_READ && dio->should_dirty) { err = bio->bi_error; diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index cd395ec..94855cf 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -119,6 +119,7 @@ struct bio { #define BIO_QUIET 6 /* Make BIO Quiet */ #define BIO_CHAIN 7 /* chained bio, ->bi_remaining in effect */ #define BIO_REFFED
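The semantics the patch gives BIO_NONBLOCKING can be modeled in userspace. The following Python sketch is a toy, not the block layer: when no struct request is available, a nonblocking bio is completed immediately with -EAGAIN instead of sleeping on the request pool, while a normal bio still waits.

```python
# Toy model of the BIO_NONBLOCKING request-allocation path from the patch.

EAGAIN = 11

class Bio:
    def __init__(self, nonblocking=False):
        self.nonblocking = nonblocking
        self.bi_error = 0

def get_request(free_requests, bio):
    """Model of get_request(): None-blocking callers fail fast when the
    pool is congested; blocking callers would sleep until one frees up."""
    if free_requests > 0:
        return "rq"
    if bio.nonblocking:
        return -EAGAIN               # don't sleep: report congestion
    return "rq-after-sleep"          # blocking caller eventually gets one

def submit(free_requests, bio):
    rq = get_request(free_requests, bio)
    if rq == -EAGAIN:
        bio.bi_error = -EAGAIN       # completed with -EAGAIN, not -EIO
        return None                  # caller (e.g. aio) can retry later
    return rq
```

A nonblocking submitter seeing None with bi_error == -EAGAIN can retry, which is the contract nonblocking AIO needs from the block layer.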
Re: [PATCH v1 1/5] block: introduce bio_clone_bioset_partial()
On Mon, Feb 13, 2017 at 9:46 PM, Christoph Hellwig wrote: > On Fri, Feb 10, 2017 at 06:56:13PM +0800, Ming Lei wrote: >> md still needs bio clone (not the fast version) for behind write, >> and it is more efficient to use bio_clone_bioset_partial(). >> >> The idea is simple and just copy the bvecs range specified from >> parameters. > > Given how few users bio_clone_bioset has I wonder if we shouldn't > simply add the two new arguments to it instead of adding another > indirection. For md write-behind, it looks like we have to provide the two arguments; could you explain a bit what you mean by adding another indirection? > > Otherwise looks fine: > > Reviewed-by: Christoph Hellwig Thanks! -- Ming Lei
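The idea behind bio_clone_bioset_partial() is simple enough to sketch. The following Python model (plain tuples stand in for bio_vecs; the function name and bounds check are illustrative, not the kernel API): instead of duplicating a bio's whole bio_vec array, copy only the segment range the caller asks for, which is what md write-behind wants.

```python
# Sketch of the partial-clone idea from the patch under review: copy only
# the requested bvec range instead of the whole vector.

def bio_clone_partial(bvecs, start, count):
    """Clone only bvecs[start:start+count] of the source bio's vector."""
    if start < 0 or start + count > len(bvecs):
        raise ValueError("requested range outside source bio")
    return list(bvecs[start:start + count])
```

Christoph's counter-suggestion amounts to giving the existing full-clone entry point these same (start, count) parameters rather than introducing a second function.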
Re: [PATCH 4/7] nonblocking aio: return on congested block device
On Tue, Feb 14, 2017 at 10:46 AM, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues > > A new flag BIO_NONBLOCKING is introduced to identify bios > originating from an iocb with IOCB_NONBLOCKING. struct requests > are allocated with BLK_MQ_REQ_NOWAIT if BIO_NONBLOCKING is set. > > Signed-off-by: Goldwyn Rodrigues > --- > block/blk-core.c | 13 +++-- > block/blk-mq.c| 18 -- > fs/direct-io.c| 11 +-- > include/linux/blk_types.h | 1 + > 4 files changed, 37 insertions(+), 6 deletions(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index 14d7c07..9767573 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -1257,6 +1257,11 @@ static struct request *get_request(struct > request_queue *q, int op, > if (!IS_ERR(rq)) > return rq; > > + if (bio_flagged(bio, BIO_NONBLOCKING)) { > + blk_put_rl(rl); > + return ERR_PTR(-EAGAIN); > + } > + > if (!gfpflags_allow_blocking(gfp_mask) || > unlikely(blk_queue_dying(q))) { > blk_put_rl(rl); > return rq; > @@ -2035,7 +2040,7 @@ blk_qc_t generic_make_request(struct bio *bio) > do { > struct request_queue *q = bdev_get_queue(bio->bi_bdev); > > - if (likely(blk_queue_enter(q, false) == 0)) { > + if (likely(blk_queue_enter(q, bio_flagged(bio, > BIO_NONBLOCKING)) == 0)) { > ret = q->make_request_fn(q, bio); > > blk_queue_exit(q); > @@ -2044,7 +2049,11 @@ blk_qc_t generic_make_request(struct bio *bio) > } else { > struct bio *bio_next = > bio_list_pop(current->bio_list); > > - bio_io_error(bio); > + if (unlikely(bio_flagged(bio, BIO_NONBLOCKING))) { > + bio->bi_error = -EAGAIN; > + bio_endio(bio); > + } else > + bio_io_error(bio); > bio = bio_next; > } > } while (bio); > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 81caceb..7a7c674 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1213,6 +1213,8 @@ static struct request *blk_mq_map_request(struct > request_queue *q, > > trace_block_getrq(q, bio, op); > blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx); > + if (bio_flagged(bio, BIO_NONBLOCKING)) > + alloc_data.flags 
|= BLK_MQ_REQ_NOWAIT; > rq = __blk_mq_alloc_request(&alloc_data, op, op_flags); > > data->hctx = alloc_data.hctx; > @@ -1286,8 +1288,14 @@ static blk_qc_t blk_mq_make_request(struct > request_queue *q, struct bio *bio) > return BLK_QC_T_NONE; > > rq = blk_mq_map_request(q, bio, &data); > - if (unlikely(!rq)) > + if (unlikely(!rq)) { > + if (bio_flagged(bio, BIO_NONBLOCKING)) > + bio->bi_error = -EAGAIN; > + else > + bio->bi_error = -EIO; > + bio_endio(bio); > return BLK_QC_T_NONE; > + } There are other places in which blocking may be triggered, such as allocating for a bio clone, wbt_wait(), and sleeping in .make_request(), as in md, dm and bcache. IMO it would be hard to deal with them all, so what is the expectation for the BIO_NONBLOCKING flag? > > cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index fb9aa16..9997fed 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -386,6 +386,9 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio, > else > bio->bi_end_io = dio_bio_end_io; > > + if (dio->iocb->ki_flags & IOCB_NONBLOCKING) > + bio_set_flag(bio, BIO_NONBLOCKING); > + > sdio->bio = bio; > sdio->logical_offset_in_bio = sdio->cur_page_fs_offset; > } > @@ -480,8 +483,12 @@ static int dio_bio_complete(struct dio *dio, struct bio > *bio) > unsigned i; > int err; > > - if (bio->bi_error) > - dio->io_error = -EIO; > + if (bio->bi_error) { > + if (bio_flagged(bio, BIO_NONBLOCKING)) > +
Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice
On Tue, Feb 14, 2017 at 07:58:22AM +0100, Hannes Reinecke wrote: > While we're at the topic: > > Can't we use the same names for legacy and mq scheduler? > It's quite an unnecessary complication to have > 'noop', 'deadline', and 'cfq' for legacy, but 'none' and 'mq-deadline' > for mq. If we could use 'noop' and 'deadline' for mq, too, the existing > settings or udev rules will continue to work and we wouldn't get any > annoying and pointless warnings here... I mentioned this to Jens a little while ago but I didn't feel strongly enough to push the issue. I also like this idea -- it makes the transition to blk-mq a little more transparent.
Re: [PATCH BUGFIX] block: make elevator_get robust against cross blk/blk-mq choice
On 02/13/2017 11:28 PM, Jens Axboe wrote: > On 02/13/2017 03:09 PM, Omar Sandoval wrote: >> On Mon, Feb 13, 2017 at 10:01:07PM +0100, Paolo Valente wrote: >>> If, at boot, a legacy I/O scheduler is chosen for a device using blk-mq, >>> or, viceversa, a blk-mq scheduler is chosen for a device using blk, then >>> that scheduler is set and initialized without any check, driving the >>> system into an inconsistent state. This commit addresses this issue by >>> letting elevator_get fail for these wrong cross choices. >>> >>> Signed-off-by: Paolo Valente>>> --- >>> block/elevator.c | 26 ++ >>> 1 file changed, 18 insertions(+), 8 deletions(-) >> >> Hey, Paolo, >> >> How exactly are you triggering this? In __elevator_change(), we do check >> for mq or not mq: >> >> if (!e->uses_mq && q->mq_ops) { >> elevator_put(e); >> return -EINVAL; >> } >> if (e->uses_mq && !q->mq_ops) { >> elevator_put(e); >> return -EINVAL; >> } >> >> We don't ever appear to call elevator_init() with a specific scheduler >> name, and for the default we switch off of q->mq_ops and use the >> defaults from Kconfig: >> >> if (q->mq_ops && q->nr_hw_queues == 1) >> e = elevator_get(CONFIG_DEFAULT_SQ_IOSCHED, false); >> else if (q->mq_ops) >> e = elevator_get(CONFIG_DEFAULT_MQ_IOSCHED, false); >> else >> e = elevator_get(CONFIG_DEFAULT_IOSCHED, false); >> >> if (!e) { >> printk(KERN_ERR >> "Default I/O scheduler not found. " \ >> "Using noop/none.\n"); >> e = elevator_get("noop", false); >> } >> >> So I guess this could happen if someone manually changed those Kconfig >> options, but I don't see what other case would make this happen, could >> you please explain? > > Was wondering the same - is it using the 'elevator=' boot parameter? > Didn't look at that path just now, but that's the only one I could > think of. If it is, I'd much prefer only using 'chosen_elevator' for > the non-mq stuff, and the fix should be just that instead. > [ .. 
] While we're at the topic: Can't we use the same names for legacy and mq schedulers? It's quite an unnecessary complication to have 'noop', 'deadline', and 'cfq' for legacy, but 'none' and 'mq-deadline' for mq. If we could use 'noop' and 'deadline' for mq, too, the existing settings or udev rules would continue to work and we wouldn't get any annoying and pointless warnings here... Cheers, Hannes -- Dr. Hannes Reinecke, Teamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [BUG] Potential deadlock in the block layer
On 02/13/2017 07:14 AM, Thomas Gleixner wrote: > Gabriel reported the lockdep splat below while investigating something > different. > > Explanation for the splat is in the function comment above > del_timer_sync(). > > I can reproduce it as well and it's clearly broken. Agree, the disable logic is broken. The below isn't super pretty, but it should do the trick for 4.10. diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index c73a6fcaeb9d..838f07e2b64a 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -3758,7 +3758,7 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq, } #ifdef CONFIG_CFQ_GROUP_IOSCHED -static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) +static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) { struct cfq_data *cfqd = cic_to_cfqd(cic); struct cfq_queue *cfqq; @@ -3775,15 +3775,7 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) * spuriously on a newly created cic but there's no harm. */ if (unlikely(!cfqd) || likely(cic->blkcg_serial_nr == serial_nr)) - return; - - /* -* If we have a non-root cgroup, we can depend on that to -* do proper throttling of writes. Turn off wbt for that -* case, if it was enabled by default. -*/ - if (nonroot_cg) - wbt_disable_default(cfqd->queue); + return nonroot_cg; /* * Drop reference to queues. 
New queues will be assigned in new @@ -3804,9 +3796,13 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) } cic->blkcg_serial_nr = serial_nr; + return nonroot_cg; } #else -static inline void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) { } +static inline bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) +{ + return false; +} #endif /* CONFIG_CFQ_GROUP_IOSCHED */ static struct cfq_queue ** @@ -4448,11 +,12 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio, const int rw = rq_data_dir(rq); const bool is_sync = rq_is_sync(rq); struct cfq_queue *cfqq; + bool disable_wbt; spin_lock_irq(q->queue_lock); check_ioprio_changed(cic, bio); - check_blkcg_changed(cic, bio); + disable_wbt = check_blkcg_changed(cic, bio); new_queue: cfqq = cic_to_cfqq(cic, is_sync); if (!cfqq || cfqq == >oom_cfqq) { @@ -4488,6 +4485,10 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio, rq->elv.priv[0] = cfqq; rq->elv.priv[1] = cfqq->cfqg; spin_unlock_irq(q->queue_lock); + + if (disable_wbt) + wbt_disable_default(q); + return 0; } -- Jens Axboe
Re: [PATCHv6 03/37] page-flags: relax page flag policy for few flags
On Wed, Feb 08, 2017 at 08:01:13PM -0800, Matthew Wilcox wrote: > On Thu, Jan 26, 2017 at 02:57:45PM +0300, Kirill A. Shutemov wrote: > > These flags are in use for filesystems with backing storage: PG_error, > > PG_writeback and PG_readahead. > > Oh ;-) Then I amend my comment on patch 1 to be "patch 3 needs to go > ahead of patch 1" ;-) It doesn't really matter as long as both are before patch 37 :P -- Kirill A. Shutemov
[BUG] Potential deadlock in the block layer
Gabriel reported the lockdep splat below while investigating something different. Explanation for the splat is in the function comment above del_timer_sync(). I can reproduce it as well and it's clearly broken. Thanks, tglx --- [ 81.518032] == [ 81.520151] [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ] [ 81.522276] 4.10.0-rc8-debug-dirty #1 Tainted: G I [ 81.524445] -- [ 81.526560] (systemd)/1725 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 81.528672] (((>window_timer))){+.-...}, at: [] del_timer_sync+0x0/0xd0 [ 81.530973] and this task is already holding: [ 81.535399] (&(>__queue_lock)->rlock){-.-...}, at: [] cfq_set_request+0x80/0x2c0 [ 81.537637] which would create a new lock dependency: [ 81.539870] (&(>__queue_lock)->rlock){-.-...} -> (((>window_timer))){+.-...} [ 81.542145] but this new dependency connects a HARDIRQ-irq-safe lock: [ 81.546717] (&(>__queue_lock)->rlock){-.-...} [ 81.546720] ... which became HARDIRQ-irq-safe at: [ 81.553643] __lock_acquire+0x24f/0x19e0 [ 81.555970] lock_acquire+0xa5/0xd0 [ 81.558302] _raw_spin_lock_irqsave+0x54/0x90 [ 81.560667] ata_qc_schedule_eh+0x56/0x80 [libata] [ 81.563035] ata_qc_complete+0x99/0x1c0 [libata] [ 81.565410] ata_do_link_abort+0x93/0xd0 [libata] [ 81.567786] ata_port_abort+0xb/0x10 [libata] [ 81.570150] ahci_handle_port_interrupt+0x41e/0x570 [libahci] [ 81.572533] ahci_handle_port_intr+0x74/0xb0 [libahci] [ 81.574918] ahci_single_level_irq_intr+0x3b/0x60 [libahci] [ 81.577311] __handle_irq_event_percpu+0x35/0x100 [ 81.579696] handle_irq_event_percpu+0x1e/0x50 [ 81.582065] handle_irq_event+0x34/0x60 [ 81.584427] handle_edge_irq+0x112/0x140 [ 81.586776] handle_irq+0x18/0x20 [ 81.589109] do_IRQ+0x87/0x110 [ 81.591403] ret_from_intr+0x0/0x20 [ 81.593666] cpuidle_enter_state+0x102/0x230 [ 81.595939] cpuidle_enter+0x12/0x20 [ 81.598202] call_cpuidle+0x38/0x40 [ 81.600443] do_idle+0x16e/0x1e0 [ 81.602670] cpu_startup_entry+0x5d/0x60 [ 81.604890] rest_init+0x12c/0x140 [ 81.607080] 
start_kernel+0x45f/0x46c [ 81.609252] x86_64_start_reservations+0x2a/0x2c [ 81.611399] x86_64_start_kernel+0xeb/0xf8 [ 81.613505] verify_cpu+0x0/0xfc [ 81.615574] to a HARDIRQ-irq-unsafe lock: [ 81.619611] (((>window_timer))){+.-...} [ 81.619614] ... which became HARDIRQ-irq-unsafe at: [ 81.625562] ... [ 81.625565] __lock_acquire+0x2ba/0x19e0 [ 81.629504] lock_acquire+0xa5/0xd0 [ 81.631467] call_timer_fn+0x74/0x110 [ 81.633397] expire_timers+0xaa/0xd0 [ 81.635287] run_timer_softirq+0x72/0x140 [ 81.637145] __do_softirq+0x143/0x290 [ 81.638974] irq_exit+0x6a/0xd0 [ 81.640818] smp_trace_apic_timer_interrupt+0x79/0x90 [ 81.642698] smp_apic_timer_interrupt+0x9/0x10 [ 81.644572] apic_timer_interrupt+0x93/0xa0 [ 81.646445] cpuidle_enter_state+0x102/0x230 [ 81.648332] cpuidle_enter+0x12/0x20 [ 81.650211] call_cpuidle+0x38/0x40 [ 81.652096] do_idle+0x16e/0x1e0 [ 81.653984] cpu_startup_entry+0x5d/0x60 [ 81.655855] start_secondary+0x150/0x180 [ 81.657696] verify_cpu+0x0/0xfc [ 81.659497] other info that might help us debug this: [ 81.664802] Possible interrupt unsafe locking scenario: [ 81.668296]CPU0CPU1 [ 81.670010] [ 81.671685] lock(((>window_timer))); [ 81.673358]local_irq_disable(); [ 81.675018]lock(&(>__queue_lock)->rlock); [ 81.676659]lock(((>window_timer))); [ 81.678275] [ 81.679845] lock(&(>__queue_lock)->rlock); [ 81.681419] *** DEADLOCK *** [ 81.685989] 1 lock held by (systemd)/1725: [ 81.687518] #0: (&(>__queue_lock)->rlock){-.-...}, at: [] cfq_set_request+0x80/0x2c0 [ 81.689125] the dependencies between HARDIRQ-irq-safe lock and the holding lock: [ 81.692327] -> (&(>__queue_lock)->rlock){-.-...} ops: 70140 { [ 81.693971]IN-HARDIRQ-W at: [ 81.695606] __lock_acquire+0x24f/0x19e0 [ 81.697262] lock_acquire+0xa5/0xd0 [ 81.698906] _raw_spin_lock_irqsave+0x54/0x90 [ 81.700568] ata_qc_schedule_eh+0x56/0x80 [libata] [ 81.702229] ata_qc_complete+0x99/0x1c0 [libata] [ 81.703884] ata_do_link_abort+0x93/0xd0 [libata] [ 81.705540] ata_port_abort+0xb/0x10 [libata] [ 81.707199] 
ahci_handle_port_interrupt+0x41e/0x570 [libahci] [ 81.708881] ahci_handle_port_intr+0x74/0xb0 [libahci] [ 81.710563]
Re: [PATCHv6 07/37] filemap: allocate huge page in page_cache_read(), if allowed
On Thu, Feb 09, 2017 at 01:18:35PM -0800, Matthew Wilcox wrote: > On Thu, Jan 26, 2017 at 02:57:49PM +0300, Kirill A. Shutemov wrote: > > Later we can add logic to accumulate information from shadow entires to > > return to caller (average eviction time?). > > I would say minimum rather than average. That will become the refault > time of the entire page, so minimum would probably have us making better > decisions? Yes, makes sense. > > + /* Wipe shadow entires */ > > + radix_tree_for_each_slot(slot, >page_tree, , > > + page->index) { > > + if (iter.index >= page->index + hpage_nr_pages(page)) > > + break; > > > > p = radix_tree_deref_slot_protected(slot, >tree_lock); > > - if (!radix_tree_exceptional_entry(p)) > > + if (!p) > > + continue; > > Just FYI, this can't happen. You're holding the tree lock so nobody > else gets to remove things from the tree. radix_tree_for_each_slot() > only gives you the full slots; it skips the empty ones for you. I'm > OK if you want to leave it in out of an abundance of caution. I'll drop it. > > + __radix_tree_replace(>page_tree, iter.node, slot, NULL, > > + workingset_update_node, mapping); > > I may add an update_node argument to radix_tree_join at some point, > so you can use it here. Or maybe we don't need to do that, and what > you have here works just fine. > > > mapping->nrexceptional--; > > ... because adjusting the exceptional count is going to be a pain. Yeah.. > > + error = __radix_tree_insert(>page_tree, > > + page->index, compound_order(page), page); > > + /* This shouldn't happen */ > > + if (WARN_ON_ONCE(error)) > > + return error; > > A lesser man would have just ignored the return value from > __radix_tree_insert. I salute you. 
> > > @@ -2078,18 +2155,34 @@ static int page_cache_read(struct file *file, > > pgoff_t offset, gfp_t gfp_mask) > > { > > struct address_space *mapping = file->f_mapping; > > struct page *page; > > + pgoff_t hoffset; > > int ret; > > > > do { > > - page = __page_cache_alloc(gfp_mask|__GFP_COLD); > > + page = page_cache_alloc_huge(mapping, offset, gfp_mask); > > +no_huge: > > + if (!page) > > + page = __page_cache_alloc(gfp_mask|__GFP_COLD); > > if (!page) > > return -ENOMEM; > > > > - ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask & > > GFP_KERNEL); > > - if (ret == 0) > > + if (PageTransHuge(page)) > > + hoffset = round_down(offset, HPAGE_PMD_NR); > > + else > > + hoffset = offset; > > + > > + ret = add_to_page_cache_lru(page, mapping, hoffset, > > + gfp_mask & GFP_KERNEL); > > + > > + if (ret == -EEXIST && PageTransHuge(page)) { > > + put_page(page); > > + page = NULL; > > + goto no_huge; > > + } else if (ret == 0) { > > ret = mapping->a_ops->readpage(file, page); > > - else if (ret == -EEXIST) > > + } else if (ret == -EEXIST) { > > ret = 0; /* losing race to add is OK */ > > + } > > > > put_page(page); > > If the filesystem returns AOP_TRUNCATED_PAGE, you'll go round this loop > again trying the huge page again, even if the huge page didn't work > the first time. I would tend to think that if the huge page failed the > first time, we shouldn't try it again, so I propose this: AOP_TRUNCATED_PAGE is positive, so I don't see how you avoid try_huge on second iteration. Hm? 
> > struct address_space *mapping = file->f_mapping; > struct page *page; > pgoff_t index; > int ret; > bool try_huge = true; > > do { > if (try_huge) { > page = page_cache_alloc_huge(gfp_mask|__GFP_COLD); > if (page) > index = round_down(offset, HPAGE_PMD_NR); > else > try_huge = false; > } > > if (!try_huge) { > page = __page_cache_alloc(gfp_mask|__GFP_COLD); > index = offset; > } > > if (!page) > return -ENOMEM; > > ret = add_to_page_cache_lru(page, mapping, index, > gfp_mask & > GFP_KERNEL); > if (ret < 0) { > if (try_huge) { > try_huge = false; > ret = AOP_TRUNCATED_PAGE; > } else if (ret == -EEXIST) > ret = 0;
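Matthew's proposal is cut off above. Based on the structure of the existing loop quoted earlier (readpage on success, put_page, retry while AOP_TRUNCATED_PAGE), its tail would plausibly continue along these lines — a reconstruction sketch, not the actual posted code:

```c
			else
				return ret;	/* e.g. -ENOMEM from the LRU add */
		} else {
			ret = mapping->a_ops->readpage(file, page);
		}

		put_page(page);
	} while (ret == AOP_TRUNCATED_PAGE);

	return ret;
```

Note Kirill's objection still applies to this shape: AOP_TRUNCATED_PAGE is positive, so reusing it as the "retry without huge page" sentinel is only safe because try_huge has already been cleared before the loop comes around again.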
Re: [PATCHv6 08/37] filemap: handle huge pages in do_generic_file_read()
On Thu, Feb 09, 2017 at 01:55:05PM -0800, Matthew Wilcox wrote: > On Thu, Jan 26, 2017 at 02:57:50PM +0300, Kirill A. Shutemov wrote: > > +++ b/mm/filemap.c > > @@ -1886,6 +1886,7 @@ static ssize_t do_generic_file_read(struct file > > *filp, loff_t *ppos, > > if (unlikely(page == NULL)) > > goto no_cached_page; > > } > > + page = compound_head(page); > > We got this page from find_get_page(), which gets it from > pagecache_get_page(), which gets it from find_get_entry() ... which > (unless I'm lost in your patch series) returns the head page. So this > line is redundant, right? No. pagecache_get_page() returns subpage. See description of the first patch. > But then down in filemap_fault, we have: > > VM_BUG_ON_PAGE(page->index != offset, page); > > ... again, maybe I'm lost somewhere in your patch series, but I don't see > anywhere you remove that line (or modify it). This should be fine as find_get_page() returns subpage. > So are you not testing > with VM debugging enabled, or are you not doing a test which includes > mapping a file with huge pages, reading from it (to get the page in cache), > then faulting on an address that is not in the first 4kB of that 2MB? -- Kirill A. Shutemov
[PATCH] blk-mq: Call bio_io_error if request returned is NULL
From: Goldwyn RodriguesIf blk_mq_map_request returns NULL, bio_endio() function is not called. Call bio_io_error() in case request returned is NULL. Signed-off-by: Goldwyn Rodrigues --- block/blk-mq.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 81caceb..7cd8912 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1286,8 +1286,10 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) return BLK_QC_T_NONE; rq = blk_mq_map_request(q, bio, ); - if (unlikely(!rq)) + if (unlikely(!rq)) { + bio_io_error(bio); return BLK_QC_T_NONE; + } cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); @@ -1381,8 +1383,10 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) request_count = blk_plug_queued_count(q); rq = blk_mq_map_request(q, bio, ); - if (unlikely(!rq)) + if (unlikely(!rq)) { + bio_io_error(bio); return BLK_QC_T_NONE; + } cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); -- 2.10.2
Re: [PATCHv6 08/37] filemap: handle huge pages in do_generic_file_read()
On Thu, Jan 26, 2017 at 02:57:50PM +0300, Kirill A. Shutemov wrote: > Most of work happans on head page. Only when we need to do copy data to > userspace we find relevant subpage. > > We are still limited by PAGE_SIZE per iteration. Lifting this limitation > would require some more work. Now that I debugged that bit of my brain, here's a more sensible suggestion. > @@ -1886,6 +1886,7 @@ static ssize_t do_generic_file_read(struct file *filp, > loff_t *ppos, > if (unlikely(page == NULL)) > goto no_cached_page; > } > + page = compound_head(page); > if (PageReadahead(page)) { > page_cache_async_readahead(mapping, > ra, filp, page, We're going backwards and forwards a lot between subpages and page heads. I'd like to see us do this: static inline struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, int fgp_flags, gfp_t cache_gfp_mask) { struct page *page = pagecache_get_head(mapping, offset, fgp_flags, cache_gfp_mask); return page ? find_subpage(page, offset) : NULL; } static inline struct page *find_get_head(struct address_space *mapping, pgoff_t offset) { return pagecache_get_head(mapping, offset, 0, 0); } and then we can switch do_generic_file_read() to call find_get_head(), eliminating the conversion back and forth between subpages and head pages.
RE: [PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
From: Scott Bauer Sent: 13 February 2017 16:11 > When CONFIG_KASAN is enabled, compilation fails: > > block/sed-opal.c: In function 'sed_ioctl': > block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 2048 bytes [-Werror=frame- > larger-than=] > > Moved all the ioctl structures off the stack and dynamically activate > using _IOC_SIZE() Think I'd note that this simplifies the code considerably. AFAICT CONFIG_KASAN is just a brainf*ck. It at least needs annotation that copy_from_user() has properties similar to memset(). So if the size matches that of the type then no guard space (etc) is required. ... > + ioctl_ptr = memdup_user(arg, _IOC_SIZE(cmd)); > + if (IS_ERR_OR_NULL(ioctl_ptr)) { > + ret = PTR_ERR(ioctl_ptr); > + goto out; ... > + out: > + kfree(ioctl_ptr); > + return ret; > } That error path doesn't look quite right to me. David
[PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
When CONFIG_KASAN is enabled, compilation fails: block/sed-opal.c: In function 'sed_ioctl': block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 2048 bytes [-Werror=frame-larger-than=] Moved all the ioctl structures off the stack and dynamically activate using _IOC_SIZE() Fixes: 455a7b238cd6 ("block: Add Sed-opal library") Reported-by: Arnd BergmannSigned-off-by: Scott Bauer --- block/sed-opal.c | 130 +++ 1 file changed, 45 insertions(+), 85 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index 2448d4a..5733248 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -2348,7 +2348,6 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg) { void *ioctl_ptr; int ret = -ENOTTY; - unsigned int cmd_size = _IOC_SIZE(cmd); if (!capable(CAP_SYS_ADMIN)) return -EACCES; @@ -2357,94 +2356,55 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg) return -ENOTSUPP; } - switch (cmd) { - case IOC_OPAL_SAVE: { - struct opal_lock_unlock lk_unlk; - - if (copy_from_user(_unlk, arg, sizeof(lk_unlk))) - return -EFAULT; - return opal_save(dev, _unlk); - } - case IOC_OPAL_LOCK_UNLOCK: { - struct opal_lock_unlock lk_unlk; - - if (copy_from_user(_unlk, arg, sizeof(lk_unlk))) - return -EFAULT; - return opal_lock_unlock(dev, _unlk); - } - case IOC_OPAL_TAKE_OWNERSHIP: { - struct opal_key opal_key; - - if (copy_from_user(_key, arg, sizeof(opal_key))) - return -EFAULT; - return opal_take_ownership(dev, _key); - } - case IOC_OPAL_ACTIVATE_LSP: { - struct opal_lr_act opal_lr_act; - - if (copy_from_user(_lr_act, arg, sizeof(opal_lr_act))) - return -EFAULT; - return opal_activate_lsp(dev, _lr_act); - } - case IOC_OPAL_SET_PW: { - struct opal_new_pw opal_pw; - - if (copy_from_user(_pw, arg, sizeof(opal_pw))) - return -EFAULT; - return opal_set_new_pw(dev, _pw); - } - case IOC_OPAL_ACTIVATE_USR: { - struct opal_session_info session; - - if (copy_from_user(, arg, sizeof(session))) - return -EFAULT; - return 
opal_activate_user(dev, ); - } - case IOC_OPAL_REVERT_TPR: { - struct opal_key opal_key; - - if (copy_from_user(_key, arg, sizeof(opal_key))) - return -EFAULT; - return opal_reverttper(dev, _key); - } - case IOC_OPAL_LR_SETUP: { - struct opal_user_lr_setup lrs; - - if (copy_from_user(, arg, sizeof(lrs))) - return -EFAULT; - return opal_setup_locking_range(dev, ); - } - case IOC_OPAL_ADD_USR_TO_LR: { - struct opal_lock_unlock lk_unlk; - - if (copy_from_user(_unlk, arg, sizeof(lk_unlk))) - return -EFAULT; - return opal_add_user_to_lr(dev, _unlk); - } - case IOC_OPAL_ENABLE_DISABLE_MBR: { - struct opal_mbr_data mbr; - - if (copy_from_user(, arg, sizeof(mbr))) - return -EFAULT; - return opal_enable_disable_shadow_mbr(dev, ); - } - case IOC_OPAL_ERASE_LR: { - struct opal_session_info session; - - if (copy_from_user(, arg, sizeof(session))) - return -EFAULT; - return opal_erase_locking_range(dev, ); + ioctl_ptr = memdup_user(arg, _IOC_SIZE(cmd)); + if (IS_ERR_OR_NULL(ioctl_ptr)) { + ret = PTR_ERR(ioctl_ptr); + goto out; } - case IOC_OPAL_SECURE_ERASE_LR: { - struct opal_session_info session; - if (copy_from_user(, arg, sizeof(session))) - return -EFAULT; - return opal_secure_erase_locking_range(dev, ); - } + switch (cmd) { + case IOC_OPAL_SAVE: + ret = opal_save(dev, ioctl_ptr); + break; + case IOC_OPAL_LOCK_UNLOCK: + ret = opal_lock_unlock(dev, ioctl_ptr); + break; + case IOC_OPAL_TAKE_OWNERSHIP: + ret = opal_take_ownership(dev, ioctl_ptr); + break; + case IOC_OPAL_ACTIVATE_LSP: + ret = opal_activate_lsp(dev, ioctl_ptr); + break; + case IOC_OPAL_SET_PW: + ret = opal_set_new_pw(dev, ioctl_ptr); + break; + case IOC_OPAL_ACTIVATE_USR: + ret =
[PATCH V5 2/4] uapi: sed-opal fix IOW for activate lsp to use correct struct
the IOW for the IOC_OPAL_ACTIVATE_LSP took the wrong strcure which would give us the wrong size when using _IOC_SIZE, switch it to the right structure. Fixes: 058f8a2 ("Include: Uapi: Add user ABI for Sed/Opal") Signed-off-by: Scott Bauer--- include/uapi/linux/sed-opal.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/uapi/linux/sed-opal.h b/include/uapi/linux/sed-opal.h index fc06e3a..c72e073 100644 --- a/include/uapi/linux/sed-opal.h +++ b/include/uapi/linux/sed-opal.h @@ -106,7 +106,7 @@ struct opal_mbr_data { #define IOC_OPAL_SAVE _IOW('p', 220, struct opal_lock_unlock) #define IOC_OPAL_LOCK_UNLOCK _IOW('p', 221, struct opal_lock_unlock) #define IOC_OPAL_TAKE_OWNERSHIP_IOW('p', 222, struct opal_key) -#define IOC_OPAL_ACTIVATE_LSP _IOW('p', 223, struct opal_key) +#define IOC_OPAL_ACTIVATE_LSP _IOW('p', 223, struct opal_lr_act) #define IOC_OPAL_SET_PW _IOW('p', 224, struct opal_new_pw) #define IOC_OPAL_ACTIVATE_USR _IOW('p', 225, struct opal_session_info) #define IOC_OPAL_REVERT_TPR _IOW('p', 226, struct opal_key) -- 2.7.4
[PATCH V5 4/4] Maintainers: Modify SED list from nvme to block
Signed-off-by: Scott Bauer--- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index e325373..b983b25 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -11094,7 +11094,7 @@ SECURE ENCRYPTING DEVICE (SED) OPAL DRIVER M: Scott Bauer M: Jonathan Derrick M: Rafael Antognolli -L: linux-n...@lists.infradead.org +L: linux-block@vger.kernel.org S: Supported F: block/sed* F: block/opal_proto.h -- 2.7.4
Re: [PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
On Mon, Feb 13, 2017 at 04:30:36PM +, David Laight wrote: > From: Scott Bauer Sent: 13 February 2017 16:11 > > When CONFIG_KASAN is enabled, compilation fails: > > > > block/sed-opal.c: In function 'sed_ioctl': > > block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than > > 2048 bytes [-Werror=frame- > > larger-than=] > > > > Moved all the ioctl structures off the stack and dynamically activate > > using _IOC_SIZE() > > Think I'd not that this simplifies the code considerably. > AFAICT CONFIG_KASAN is a just brainf*ck. > It at least needs annotation that copy_from_user() has properties > similar to memset(). > So if the size matches that of the type then no guard space (etc) > is required. > I don't really follow what you're saying. Do you want me to just include that the patch cleans up the ioctl path a bit. I need to include the KASAN part since there was build breakage and the series does fix it even if it simplifies the path as well. As for the memset part, we never copy back to userland so there's no chance of data leakage which is what it seems you're hinting at. > ... > > + ioctl_ptr = memdup_user(arg, _IOC_SIZE(cmd)); > > + if (IS_ERR_OR_NULL(ioctl_ptr)) { > > + ret = PTR_ERR(ioctl_ptr); > > + goto out; > ... > > + out: > > + kfree(ioctl_ptr); > > + return ret; > > } > > That error path doesn't look quite right to me. > > David > good god, this is a mess this morning. Thanks for the catch, I'll review these more aggressively before sending out.
Re: [PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
On Mon, Feb 13, 2017 at 5:29 PM, Scott Bauerwrote: > On Mon, Feb 13, 2017 at 04:30:36PM +, David Laight wrote: >> From: Scott Bauer Sent: 13 February 2017 16:11 >> > When CONFIG_KASAN is enabled, compilation fails: >> > >> > block/sed-opal.c: In function 'sed_ioctl': >> > block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger >> > than 2048 bytes [-Werror=frame- >> > larger-than=] >> > >> > Moved all the ioctl structures off the stack and dynamically activate >> > using _IOC_SIZE() >> >> Think I'd not that this simplifies the code considerably. >> AFAICT CONFIG_KASAN is a just brainf*ck. >> It at least needs annotation that copy_from_user() has properties >> similar to memset(). >> So if the size matches that of the type then no guard space (etc) >> is required. I think it still would, as the pointer to the local variable gets passed through dev->func_data[]. >> ... >> > + ioctl_ptr = memdup_user(arg, _IOC_SIZE(cmd)); >> > + if (IS_ERR_OR_NULL(ioctl_ptr)) { >> > + ret = PTR_ERR(ioctl_ptr); >> > + goto out; >> ... >> > + out: >> > + kfree(ioctl_ptr); >> > + return ret; >> > } > > >> >> That error path doesn't look quite right to me. >> >> David >> > > good god, this is a mess this morning. Thanks for the catch, I'll review these > more aggressively before sending out. memdup_user() never returns NULL, and generally speaking any use of IS_ERR_OR_NULL() indicates that there is something wrong with the interface, so aside from passing the right pointer (or NULL) into kfree() I think using IS_ERR() is the correct solution. Arnd
Re: [PATCHv6 08/37] filemap: handle huge pages in do_generic_file_read()
On Mon, Feb 13, 2017 at 06:33:42PM +0300, Kirill A. Shutemov wrote: > No. pagecache_get_page() returns subpage. See description of the first > patch. Your description says: > We also change interface for page-cache lookup function: > > - functions that lookup for pages[1] would return subpages of THP > relevant for requested indexes; > > - functions that lookup for entries[2] would return one entry per-THP > and index will point to index of head page (basically, round down to > HPAGE_PMD_NR); > > This would provide balanced exposure of multi-order entires to the rest > of the kernel. > > [1] find_get_pages(), pagecache_get_page(), pagevec_lookup(), etc. > [2] find_get_entry(), find_get_entries(), pagevec_lookup_entries(), etc. I'm saying: > > We got this page from find_get_page(), which gets it from > > pagecache_get_page(), which gets it from find_get_entry() ... which > > (unless I'm lost in your patch series) returns the head page. Am I guilty of debugging documentation rather than code?
Re: [PATCHv6 08/37] filemap: handle huge pages in do_generic_file_read()
On Mon, Feb 13, 2017 at 08:01:17AM -0800, Matthew Wilcox wrote: > On Mon, Feb 13, 2017 at 06:33:42PM +0300, Kirill A. Shutemov wrote: > > No. pagecache_get_page() returns subpage. See description of the first > > patch. Oh, I re-read patch 1 and it made sense now. I missed the bit where pagecache_get_page() called page_subpage().
[PATCH V5 1/4] block: sed-opal: change ioctl to take user pointer instead of unsinged long
Signed-off-by: Scott Bauer--- block/sed-opal.c | 6 -- drivers/nvme/host/core.c | 3 ++- include/linux/sed-opal.h | 4 ++-- 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/block/sed-opal.c b/block/sed-opal.c index bf1406e..2448d4a 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -2344,9 +2344,11 @@ bool opal_unlock_from_suspend(struct opal_dev *dev) } EXPORT_SYMBOL(opal_unlock_from_suspend); -int sed_ioctl(struct opal_dev *dev, unsigned int cmd, unsigned long ptr) +int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg) { - void __user *arg = (void __user *)ptr; + void *ioctl_ptr; + int ret = -ENOTTY; + unsigned int cmd_size = _IOC_SIZE(cmd); if (!capable(CAP_SYS_ADMIN)) return -EACCES; diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index d9f4903..04c48e7 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -811,7 +811,8 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t mode, return nvme_nvm_ioctl(ns, cmd, arg); #endif if (is_sed_ioctl(cmd)) - return sed_ioctl(>ctrl->opal_dev, cmd, arg); + return sed_ioctl(>ctrl->opal_dev, cmd, +(void __user *) arg); return -ENOTTY; } } diff --git a/include/linux/sed-opal.h b/include/linux/sed-opal.h index af1a85e..205d520 100644 --- a/include/linux/sed-opal.h +++ b/include/linux/sed-opal.h @@ -132,7 +132,7 @@ struct opal_dev { #ifdef CONFIG_BLK_SED_OPAL bool opal_unlock_from_suspend(struct opal_dev *dev); void init_opal_dev(struct opal_dev *opal_dev, sec_send_recv *send_recv); -int sed_ioctl(struct opal_dev *dev, unsigned int cmd, unsigned long ptr); +int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *ioctl_ptr); static inline bool is_sed_ioctl(unsigned int cmd) { @@ -160,7 +160,7 @@ static inline bool is_sed_ioctl(unsigned int cmd) } static inline int sed_ioctl(struct opal_dev *dev, unsigned int cmd, - unsigned long ptr) + void __user *ioctl_ptr) { return 0; } -- 2.7.4
Re: [PATCH] blk-mq: Call bio_io_error if request returned is NULL
On 02/13/2017 09:04 AM, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues> > If blk_mq_map_request returns NULL, bio_endio() function is not > called. Call bio_io_error() in case request returned is NULL. That's currently not a possible condition, since the main request mapper functions block on rq allocation. So we can never return NULL there. -- Jens Axboe
SED Opal Fixups
So we have a few patches here, they're pretty small. The first patch changes the sed-opal ioctl function parameters to take a void __user * instead of an unsigned long; this required a small cast in the nvme driver. Patch 2 is a UAPI fixup for the IOW to make an ioctl the right size. Patch 3 fixes a compilation error when building with KASAN due to the stack frame being too large. And lastly we move the Maintainers list from nvme to linux-block.
Re: [PATCH V5 1/4] block: sed-opal: change ioctl to take user pointer instead of unsinged long
On Mon, Feb 13, 2017 at 09:11:09AM -0700, Scott Bauer wrote: > Signed-off-by: Scott Bauer> --- > block/sed-opal.c | 6 -- > drivers/nvme/host/core.c | 3 ++- > include/linux/sed-opal.h | 4 ++-- > 3 files changed, 8 insertions(+), 5 deletions(-) > > diff --git a/block/sed-opal.c b/block/sed-opal.c > index bf1406e..2448d4a 100644 > --- a/block/sed-opal.c > +++ b/block/sed-opal.c > @@ -2344,9 +2344,11 @@ bool opal_unlock_from_suspend(struct opal_dev *dev) > } > EXPORT_SYMBOL(opal_unlock_from_suspend); > > -int sed_ioctl(struct opal_dev *dev, unsigned int cmd, unsigned long ptr) > +int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg) > { > - void __user *arg = (void __user *)ptr; > + void *ioctl_ptr; > + int ret = -ENOTTY; > + unsigned int cmd_size = _IOC_SIZE(cmd); ugh, I apparently messed up my rebase; these should be in patch 2, or maybe I should squash p1 and p2 together.
Re: [PATCH v2 4/4] Maintainers: Add Information for SED Opal library
On Fri, Feb 10, 2017 at 09:44:59AM -0700, Scott Bauer wrote: > > > +F:include/linux/sed* > > > +F:include/uapi/linux/sed* > > > > Since this is in the block tree and is not nvme specific, could you use > > linux-block as the main list? > > Sure, that's fine. Is there a way to denote "main list" in maintainers, > or just list it first? I'd just use linux-block - traffic isn't much higher than the nvme list, and it fits a lot better. FYI, I hope to have ATA (and as a byproduct SCSI) support ready for the next merge window.
RE: [PATCH V5 3/4] Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
From: Arnd Bergmann > Sent: 13 February 2017 17:02 ... > >> > + ioctl_ptr = memdup_user(arg, _IOC_SIZE(cmd)); > >> > + if (IS_ERR_OR_NULL(ioctl_ptr)) { > >> > + ret = PTR_ERR(ioctl_ptr); > >> > + goto out; > >> ... > >> > + out: > >> > + kfree(ioctl_ptr); > >> > + return ret; ... > >> That error path doesn't look quite right to me. ... > > good god, this is a mess this morning. Thanks for the catch, I'll review > > these > > more aggressively before sending out. > > memdup_user() never returns NULL, and generally speaking any use of > IS_ERR_OR_NULL() indicates that there is something wrong with the > interface, so aside from passing the right pointer (or NULL) into kfree() > I think using IS_ERR() is the correct solution. You missed the problem entirely. Code needs to be: if (IS_ERR(ioctl_ptr)) return PTR_ERR(ioctl_ptr); David
[PATCH] nbd: use bdget_disk to get bdev
In preparation for the upcoming netlink interface we need to not rely on
already having the bdev for the NBD device we are doing operations on.
Instead of passing the bdev from the bdev ioctl around, use bdget_disk()
wherever we need the bdev and pass that down to the helpers as necessary.

Signed-off-by: Josef Bacik
---
 drivers/block/nbd.c | 88 ++---
 1 file changed, 50 insertions(+), 38 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 25891a1..0623f8f 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -168,13 +168,18 @@ static void nbd_size_update(struct nbd_device *nbd, struct block_device *bdev)
 	kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
 }

-static void nbd_size_set(struct nbd_device *nbd, struct block_device *bdev,
-			 loff_t blocksize, loff_t nr_blocks)
+static int nbd_size_set(struct nbd_device *nbd, loff_t blocksize,
+			loff_t nr_blocks)
 {
+	struct block_device *bdev = bdget_disk(nbd->disk, 0);
+	if (!bdev)
+		return -EINVAL;
 	nbd->blksize = blocksize;
 	nbd->bytesize = blocksize * nr_blocks;
 	if (nbd_is_connected(nbd))
 		nbd_size_update(nbd, bdev);
+	bdput(bdev);
+	return 0;
 }

 static void nbd_end_request(struct nbd_cmd *cmd)
@@ -704,8 +709,7 @@ static int nbd_wait_for_socks(struct nbd_device *nbd)
 	return ret;
 }

-static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
-			  unsigned long arg)
+static int nbd_add_socket(struct nbd_device *nbd, unsigned long arg)
 {
 	struct socket *sock;
 	struct nbd_sock **socks;
@@ -749,8 +753,6 @@ static int nbd_add_socket(struct nbd_device *nbd, struct block_device *bdev,
 	nsock->sock = sock;
 	socks[nbd->num_connections++] = nsock;

-	if (max_part)
-		bdev->bd_invalidated = 1;
 	err = 0;
 out:
 	mutex_unlock(&nbd->socks_lock);
@@ -771,18 +773,19 @@ static void nbd_reset(struct nbd_device *nbd)

 static void nbd_bdev_reset(struct block_device *bdev)
 {
-	set_device_ro(bdev, false);
-	bdev->bd_inode->i_size = 0;
+	bd_set_size(bdev, 0);
 	if (max_part > 0) {
 		blkdev_reread_part(bdev);
 		bdev->bd_invalidated = 1;
 	}
 }

-static void nbd_parse_flags(struct nbd_device *nbd, struct block_device *bdev)
+static void nbd_parse_flags(struct nbd_device *nbd)
 {
 	if (nbd->flags & NBD_FLAG_READ_ONLY)
-		set_device_ro(bdev, true);
+		set_disk_ro(nbd->disk, true);
+	else
+		set_disk_ro(nbd->disk, false);
 	if (nbd->flags & NBD_FLAG_SEND_TRIM)
 		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, nbd->disk->queue);
 	if (nbd->flags & NBD_FLAG_SEND_FLUSH)
@@ -807,16 +810,11 @@ static void send_disconnects(struct nbd_device *nbd)
 	}
 }

-static int nbd_disconnect(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_disconnect(struct nbd_device *nbd)
 {
 	dev_info(disk_to_dev(nbd->disk), "NBD_DISCONNECT\n");
 	if (!nbd_socks_get_unless_zero(nbd))
 		return -EINVAL;
-
-	mutex_unlock(&nbd->config_lock);
-	fsync_bdev(bdev);
-	mutex_lock(&nbd->config_lock);
-
 	if (!test_and_set_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
 		send_disconnects(nbd);
@@ -824,28 +822,40 @@ static int nbd_disconnect(struct nbd_device *nbd, struct block_device *bdev)
 	return 0;
 }

-static int nbd_clear_sock(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_clear_sock(struct nbd_device *nbd)
 {
+	struct block_device *bdev = bdget_disk(nbd->disk, 0);
+
 	sock_shutdown(nbd);
 	nbd_clear_que(nbd);
-	kill_bdev(bdev);
-	nbd_bdev_reset(bdev);
+	if (bdev) {
+		kill_bdev(bdev);
+		nbd_bdev_reset(bdev);
+		bdput(bdev);
+	}
 	nbd->task_setup = NULL;
 	if (test_and_clear_bit(NBD_HAS_SOCKS_REF, &nbd->runtime_flags))
 		nbd_socks_put(nbd);
 	return 0;
 }

-static int nbd_start_device(struct nbd_device *nbd, struct block_device *bdev)
+static int nbd_start_device(struct nbd_device *nbd)
 {
 	struct recv_thread_args *args;
+	struct block_device *bdev = bdget_disk(nbd->disk, 0);
 	int num_connections = nbd->num_connections;
 	int error = 0, i;

-	if (nbd->task_recv)
+	if (!bdev)
+		return -EINVAL;
+	if (nbd->task_recv) {
+		bdput(bdev);
 		return -EBUSY;
-	if (!nbd->socks)
+	}
+	if (!nbd->socks) {
+		bdput(bdev);
 		return -EINVAL;
+	}
 	if (num_connections > 1 &&
 	    !(nbd->flags & NBD_FLAG_CAN_MULTI_CONN)) {
 		dev_err(disk_to_dev(nbd->disk), "server does not
Re: [PATCHv6 01/37] mm, shmem: swich huge tmpfs to multi-order radix-tree entries
On Thu, Feb 09, 2017 at 07:58:20PM +0300, Kirill A. Shutemov wrote:
> I'll look into it.

I ended up with this (I'll test it more later):

void filemap_map_pages(struct vm_fault *vmf,
		pgoff_t start_pgoff, pgoff_t end_pgoff)
{
	struct radix_tree_iter iter;
	void **slot;
	struct file *file = vmf->vma->vm_file;
	struct address_space *mapping = file->f_mapping;
	pgoff_t last_pgoff = start_pgoff;
	loff_t size;
	struct page *page;
	bool mapped;

	rcu_read_lock();
	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
			start_pgoff) {
		unsigned long index = iter.index;

		if (index < start_pgoff)
			index = start_pgoff;
		if (index > end_pgoff)
			break;
repeat:
		page = radix_tree_deref_slot(slot);
		if (unlikely(!page))
			continue;
		if (radix_tree_exception(page)) {
			if (radix_tree_deref_retry(page))
				slot = radix_tree_iter_retry(&iter);
			continue;
		}

		if (!page_cache_get_speculative(page))
			goto repeat;

		/* Has the page moved? */
		if (unlikely(page != *slot)) {
			put_page(page);
			goto repeat;
		}

		/* For multi-order entries, find relevant subpage */
		page = find_subpage(page, index);

		if (!PageUptodate(page) || PageReadahead(page))
			goto skip;
		if (!trylock_page(page))
			goto skip;

		if (page_mapping(page) != mapping || !PageUptodate(page))
			goto skip_unlock;

		size = round_up(i_size_read(mapping->host), PAGE_SIZE);
		if (compound_head(page)->index >= size >> PAGE_SHIFT)
			goto skip_unlock;

		if (file->f_ra.mmap_miss > 0)
			file->f_ra.mmap_miss--;
map_next_subpage:
		if (PageHWPoison(page))
			goto next;

		vmf->address += (index - last_pgoff) << PAGE_SHIFT;
		if (vmf->pte)
			vmf->pte += index - last_pgoff;
		last_pgoff = index;
		mapped = !alloc_set_pte(vmf, NULL, page);

		/* Huge page is mapped or last index? No need to proceed. */
		if (pmd_trans_huge(*vmf->pmd) || index == end_pgoff) {
			unlock_page(page);
			break;
		}
next:
		if (page && PageCompound(page)) {
			/* Last subpage handled? */
			if ((index & (compound_nr_pages(page) - 1)) ==
					compound_nr_pages(page) - 1)
				goto skip_unlock;
			index++;
			page++;

			/*
			 * One page reference goes to page table mapping.
			 * Need additional reference, if last alloc_set_pte()
			 * succeed.
			 */
			if (mapped)
				get_page(page);
			goto map_next_subpage;
		}
skip_unlock:
		unlock_page(page);
skip:
		iter.index = compound_head(page)->index +
			compound_nr_pages(page) - 1;
		/* Only give up reference if alloc_set_pte() failed. */
		if (!mapped)
			put_page(page);
	}
	rcu_read_unlock();
}

--
Kirill A. Shutemov
Re: [PATCH v1 3/5] md: fail if mddev->bio_set can't be created
Looks fine,

Reviewed-by: Christoph Hellwig

but this really needs to be patch 2 in this series.
Re: [PATCH v1 1/5] block: introduce bio_clone_bioset_partial()
On Fri, Feb 10, 2017 at 06:56:13PM +0800, Ming Lei wrote: > md still need bio clone(not the fast version) for behind write, > and it is more efficient to use bio_clone_bioset_partial(). > > The idea is simple and just copy the bvecs range specified from > parameters. Given how few users bio_clone_bioset has I wonder if we shouldn't simply add the two new arguments to it instead of adding another indirection. Otherwise looks fine: Reviewed-by: Christoph Hellwig
Re: [PATCH v1 2/5] md/raid1: use bio_clone_bioset_partial() in case of write behind
> +		struct bio *mbio = NULL;
> +		int offset;
>  		if (!r1_bio->bios[i])
>  			continue;
>
> -		mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> -		bio_trim(mbio, r1_bio->sector - bio->bi_iter.bi_sector,
> -			 max_sectors);
> +		offset = r1_bio->sector - bio->bi_iter.bi_sector;

I think offset should be a sector_t.

Otherwise this looks fine:

Reviewed-by: Christoph Hellwig
Re: [PATCH v1 5/5] md: fast clone bio in bio_clone_mddev()
Please use bio_clone_fast directly and kill the wrapper.
Re: [PATCH v1 4/5] md: remove unnecessary check on mddev
Looks fine, Reviewed-by: Christoph Hellwig