Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-06-08 Thread Jens Axboe
On 06/08/2017 09:30 AM, Paolo Valente wrote:
> 
>> Il giorno 05 giu 2017, alle ore 10:11, Paolo Valente 
>>  ha scritto:
>>
>> In blk-cgroup, operations on blkg objects are protected with the
>> request_queue lock. This is no more the lock that protects
>> I/O-scheduler operations in blk-mq. In fact, the latter are now
>> protected with a finer-grained per-scheduler-instance lock. As a
>> consequence, although blkg lookups are also rcu-protected, blk-mq I/O
>> schedulers may see inconsistent data when they access blkg and
>> blkg-related objects. BFQ does access these objects, and does incur
>> this problem, in the following case.
>>
>> The blkg_lookup performed in bfq_get_queue, being protected (only)
>> through rcu, may happen to return the address of a copy of the
>> original blkg. If this is the case, then the blkg_get performed in
>> bfq_get_queue, to pin down the blkg, is useless: it does not prevent
>> blk-cgroup code from destroying both the original blkg and all objects
>> directly or indirectly referred by the copy of the blkg. BFQ accesses
>> these objects, which typically causes a crash for NULL-pointer
>> dereference of memory-protection violation.
>>
>> Some additional protection mechanism should be added to blk-cgroup to
>> address this issue. In the meantime, this commit provides a quick
>> temporary fix for BFQ: cache (when safe) blkg data that might
>> disappear right after a blkg_lookup.
>>
>> In particular, this commit exploits the following facts to achieve its
>> goal without introducing further locks.  Destroy operations on a blkg
>> invoke, as a first step, hooks of the scheduler associated with the
>> blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
>> consequence, for any blkg associated with the request queue an
>> instance of BFQ is attached to, we are guaranteed that such a blkg is
>> not destroyed, and that all the pointers it contains are consistent,
>> while that instance is holding its bfqd->lock. A blkg_lookup performed
>> with bfqd->lock held then returns a fully consistent blkg, which
>> remains consistent until this lock is held. In more detail, this holds
>> even if the returned blkg is a copy of the original one.
>>
>> Finally, also the object describing a group inside BFQ needs to be
>> protected from destruction on the blkg_free of the original blkg
>> (which invokes bfq_pd_free). This commit adds private refcounting for
>> this object, to let it disappear only after no bfq_queue refers to it
>> any longer.
>>
>> This commit also removes or updates some stale comments on locking
>> issues related to blk-cgroup operations.
>>
>> Reported-by: Tomas Konir 
>> Reported-by: Lee Tibbert 
>> Reported-by: Marco Piazza 
>> Signed-off-by: Paolo Valente 
>> Tested-by: Tomas Konir 
>> Tested-by: Lee Tibbert 
>> Tested-by: Marco Piazza 
> 
> Hi Jens,
> are you waiting for some further review/ack on this, or is it just in
> your queue of patches to check?  Sorry for bothering you, but this bug
> is causing problems to users.

I'll pull it in, it'll make the next -rc. I'll often let patches sit
for a few days even if I agree with them, to give others a chance to
either further review, comment, or disagree with them.

-- 
Jens Axboe



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-06-08 Thread Paolo Valente

> Il giorno 05 giu 2017, alle ore 10:11, Paolo Valente 
>  ha scritto:
> 
> In blk-cgroup, operations on blkg objects are protected with the
> request_queue lock. This is no more the lock that protects
> I/O-scheduler operations in blk-mq. In fact, the latter are now
> protected with a finer-grained per-scheduler-instance lock. As a
> consequence, although blkg lookups are also rcu-protected, blk-mq I/O
> schedulers may see inconsistent data when they access blkg and
> blkg-related objects. BFQ does access these objects, and does incur
> this problem, in the following case.
> 
> The blkg_lookup performed in bfq_get_queue, being protected (only)
> through rcu, may happen to return the address of a copy of the
> original blkg. If this is the case, then the blkg_get performed in
> bfq_get_queue, to pin down the blkg, is useless: it does not prevent
> blk-cgroup code from destroying both the original blkg and all objects
> directly or indirectly referred by the copy of the blkg. BFQ accesses
> these objects, which typically causes a crash for NULL-pointer
> dereference of memory-protection violation.
> 
> Some additional protection mechanism should be added to blk-cgroup to
> address this issue. In the meantime, this commit provides a quick
> temporary fix for BFQ: cache (when safe) blkg data that might
> disappear right after a blkg_lookup.
> 
> In particular, this commit exploits the following facts to achieve its
> goal without introducing further locks.  Destroy operations on a blkg
> invoke, as a first step, hooks of the scheduler associated with the
> blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
> consequence, for any blkg associated with the request queue an
> instance of BFQ is attached to, we are guaranteed that such a blkg is
> not destroyed, and that all the pointers it contains are consistent,
> while that instance is holding its bfqd->lock. A blkg_lookup performed
> with bfqd->lock held then returns a fully consistent blkg, which
> remains consistent until this lock is held. In more detail, this holds
> even if the returned blkg is a copy of the original one.
> 
> Finally, also the object describing a group inside BFQ needs to be
> protected from destruction on the blkg_free of the original blkg
> (which invokes bfq_pd_free). This commit adds private refcounting for
> this object, to let it disappear only after no bfq_queue refers to it
> any longer.
> 
> This commit also removes or updates some stale comments on locking
> issues related to blk-cgroup operations.
> 
> Reported-by: Tomas Konir 
> Reported-by: Lee Tibbert 
> Reported-by: Marco Piazza 
> Signed-off-by: Paolo Valente 
> Tested-by: Tomas Konir 
> Tested-by: Lee Tibbert 
> Tested-by: Marco Piazza 

Hi Jens,
are you waiting for some further review/ack on this, or is it just in
your queue of patches to check?  Sorry for bothering you, but this bug
is causing problems to users.

Thanks,
Paolo

> ---
> block/bfq-cgroup.c  | 116 +---
> block/bfq-iosched.c |   2 +-
> block/bfq-iosched.h |  23 +--
> 3 files changed, 105 insertions(+), 36 deletions(-)
> 
> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
> index c8a32fb..78b2e0d 100644
> --- a/block/bfq-cgroup.c
> +++ b/block/bfq-cgroup.c
> @@ -52,7 +52,7 @@ BFQG_FLAG_FNS(idling)
> BFQG_FLAG_FNS(empty)
> #undef BFQG_FLAG_FNS
> 
> -/* This should be called with the queue_lock held. */
> +/* This should be called with the scheduler lock held. */
> static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
> {
>   unsigned long long now;
> @@ -67,7 +67,7 @@ static void bfqg_stats_update_group_wait_time(struct 
> bfqg_stats *stats)
>   bfqg_stats_clear_waiting(stats);
> }
> 
> -/* This should be called with the queue_lock held. */
> +/* This should be called with the scheduler lock held. */
> static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
>struct bfq_group *curr_bfqg)
> {
> @@ -81,7 +81,7 @@ static void bfqg_stats_set_start_group_wait_time(struct 
> bfq_group *bfqg,
>   bfqg_stats_mark_waiting(stats);
> }
> 
> -/* This should be called with the queue_lock held. */
> +/* This should be called with the scheduler lock held. */
> static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
> {
>   unsigned long long now;
> @@ -203,12 +203,30 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
> 
> static void bfqg_get(struct bfq_group *bfqg)
> {
> - return blkg_get(bfqg_to_blkg(bfqg));
> + bfqg->ref++;
> }
> 
> void bfqg_put(struct bfq_group *bfqg)
> {
> - return blkg_put(bfqg_to_blkg(bfqg));
> + bfqg->ref--;
> +
> + if (bfqg->ref == 0)
> + kfree(bfqg);
> +}
> +
> +static void 

[PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-06-05 Thread Paolo Valente
In blk-cgroup, operations on blkg objects are protected with the
request_queue lock. This is no more the lock that protects
I/O-scheduler operations in blk-mq. In fact, the latter are now
protected with a finer-grained per-scheduler-instance lock. As a
consequence, although blkg lookups are also rcu-protected, blk-mq I/O
schedulers may see inconsistent data when they access blkg and
blkg-related objects. BFQ does access these objects, and does incur
this problem, in the following case.

The blkg_lookup performed in bfq_get_queue, being protected (only)
through rcu, may happen to return the address of a copy of the
original blkg. If this is the case, then the blkg_get performed in
bfq_get_queue, to pin down the blkg, is useless: it does not prevent
blk-cgroup code from destroying both the original blkg and all objects
directly or indirectly referred by the copy of the blkg. BFQ accesses
these objects, which typically causes a crash for NULL-pointer
dereference of memory-protection violation.

Some additional protection mechanism should be added to blk-cgroup to
address this issue. In the meantime, this commit provides a quick
temporary fix for BFQ: cache (when safe) blkg data that might
disappear right after a blkg_lookup.

In particular, this commit exploits the following facts to achieve its
goal without introducing further locks.  Destroy operations on a blkg
invoke, as a first step, hooks of the scheduler associated with the
blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
consequence, for any blkg associated with the request queue an
instance of BFQ is attached to, we are guaranteed that such a blkg is
not destroyed, and that all the pointers it contains are consistent,
while that instance is holding its bfqd->lock. A blkg_lookup performed
with bfqd->lock held then returns a fully consistent blkg, which
remains consistent until this lock is held. In more detail, this holds
even if the returned blkg is a copy of the original one.

Finally, also the object describing a group inside BFQ needs to be
protected from destruction on the blkg_free of the original blkg
(which invokes bfq_pd_free). This commit adds private refcounting for
this object, to let it disappear only after no bfq_queue refers to it
any longer.

This commit also removes or updates some stale comments on locking
issues related to blk-cgroup operations.

Reported-by: Tomas Konir 
Reported-by: Lee Tibbert 
Reported-by: Marco Piazza 
Signed-off-by: Paolo Valente 
Tested-by: Tomas Konir 
Tested-by: Lee Tibbert 
Tested-by: Marco Piazza 
---
 block/bfq-cgroup.c  | 116 +---
 block/bfq-iosched.c |   2 +-
 block/bfq-iosched.h |  23 +--
 3 files changed, 105 insertions(+), 36 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index c8a32fb..78b2e0d 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -52,7 +52,7 @@ BFQG_FLAG_FNS(idling)
 BFQG_FLAG_FNS(empty)
 #undef BFQG_FLAG_FNS
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
 {
unsigned long long now;
@@ -67,7 +67,7 @@ static void bfqg_stats_update_group_wait_time(struct 
bfqg_stats *stats)
bfqg_stats_clear_waiting(stats);
 }
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
 struct bfq_group *curr_bfqg)
 {
@@ -81,7 +81,7 @@ static void bfqg_stats_set_start_group_wait_time(struct 
bfq_group *bfqg,
bfqg_stats_mark_waiting(stats);
 }
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
 {
unsigned long long now;
@@ -203,12 +203,30 @@ struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
 
 static void bfqg_get(struct bfq_group *bfqg)
 {
-   return blkg_get(bfqg_to_blkg(bfqg));
+   bfqg->ref++;
 }
 
 void bfqg_put(struct bfq_group *bfqg)
 {
-   return blkg_put(bfqg_to_blkg(bfqg));
+   bfqg->ref--;
+
+   if (bfqg->ref == 0)
+   kfree(bfqg);
+}
+
+static void bfqg_and_blkg_get(struct bfq_group *bfqg)
+{
+   /* see comments in bfq_bic_update_cgroup for why refcounting bfqg */
+   bfqg_get(bfqg);
+
+   blkg_get(bfqg_to_blkg(bfqg));
+}
+
+void bfqg_and_blkg_put(struct bfq_group *bfqg)
+{
+   bfqg_put(bfqg);
+
+   blkg_put(bfqg_to_blkg(bfqg));
 }
 
 void bfqg_stats_update_io_add(struct bfq_group *bfqg, struct bfq_queue *bfqq,
@@ -312,7 +330,11 @@ void bfq_init_entity(struct bfq_entity *entity, struct 
bfq_group *bfqg)
   

Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-25 Thread Tejun Heo
Hello, Paolo.

On Thu, May 25, 2017 at 09:10:59AM +0200, Paolo Valente wrote:
> Ok.  So, just to better understand: as of now, i.e., before you make
> the changes you are proposing, the address returned by blkg_lookup can
> be used safely only if one both invokes blkg_lookup and gets a
> reference, while holding the rq lock all the time.  But then, before
> the changes you propose, what is the remaining role of rcu protection
> here?  Are there places where the value returned by blkg_lookup is
> actually used safely without getting a reference the returned blkg?

RCU protects the indexing structure and one doesn't have to hold rq
lock to just look up blkg.  The rule is a bit weird because we can
assume that the blkg's ref can be incremented in all places but that's
only because we don't destroy blkgs unless the associated blkcg or
device is destroyed, so we're cheating a little there.

Look for blkg_for_each_descendant_*() users for lockless lookup
examples.  There may be other uses but I can't remebmer off the top of
my head.

> Anyway, I'm willing to help with your proposal, if you think I can be
> of any help at some point.  In this respect, consider that I'm not an
> expert of percpu-refs either.

percpu-refs are not that different from regular atomic refcnts except
that the normal get / put operations are a lot cheaper (well, at least
scalable to a lot of concurrent operations), so better suited for
blk-mq.

> Finally, I guess that the general fix you have in mind may not be
> ready shortly.  So, I'll proceed with my temporary fix for the moment.
> In particular, I will
> 1) fix the typo reported by Jens;
> 2) add a note stating that this is a temporary fix;
> 3) if needed, modify commit log and comments in the diffs, to better
> describe the general problem, and the actual critical race.

Sure, no objection there.

Thanks.

-- 
tejun


Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-25 Thread Paolo Valente

> Il giorno 24 mag 2017, alle ore 17:47, Tejun Heo  ha scritto:
> 
> Hello,
> 
> On Wed, May 24, 2017 at 05:43:18PM +0100, Paolo Valente wrote:
>>> so none of the above objects can be destroyed before the request is
>>> done.
>> 
>> ... the issue seems just to move to a more subtle position: cfq is ok,
>> because it protects itself with rq lock, but blk-mq schedulers don't.
>> So, the race that leads to the (real) crashes reported by people may
>> actually be:
> 
> Oh, I was just thinking about !mq paths the whole time.
> 
>> 1 blkg_lookup executed on a blkg being destroyed: the scheduler gets a
>> copy of the content of the blkg, but the rcu mechanism doesn't prevent
>> destruction from going on
>> 2 blkg_get gets executed on the copy of the original blkg
> 
> So, we can't do that.  We should look up and bump the ref and use the
> original copy.  We probably should switch blkgs to use percpu-refs.
> 

Ok.  So, just to better understand: as of now, i.e., before you make
the changes you are proposing, the address returned by blkg_lookup can
be used safely only if one both invokes blkg_lookup and gets a
reference, while holding the rq lock all the time.  But then, before
the changes you propose, what is the remaining role of rcu protection
here?  Are there places where the value returned by blkg_lookup is
actually used safely without getting a reference the returned blkg?

Anyway, I'm willing to help with your proposal, if you think I can be
of any help at some point.  In this respect, consider that I'm not an
expert of percpu-refs either.

Finally, I guess that the general fix you have in mind may not be
ready shortly.  So, I'll proceed with my temporary fix for the moment.
In particular, I will
1) fix the typo reported by Jens;
2) add a note stating that this is a temporary fix;
3) if needed, modify commit log and comments in the diffs, to better
describe the general problem, and the actual critical race.

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-24 Thread Tejun Heo
Hello,

On Wed, May 24, 2017 at 05:43:18PM +0100, Paolo Valente wrote:
> > so none of the above objects can be destroyed before the request is
> > done.
> 
> ... the issue seems just to move to a more subtle position: cfq is ok,
> because it protects itself with rq lock, but blk-mq schedulers don't.
> So, the race that leads to the (real) crashes reported by people may
> actually be:

Oh, I was just thinking about !mq paths the whole time.

> 1 blkg_lookup executed on a blkg being destroyed: the scheduler gets a
> copy of the content of the blkg, but the rcu mechanism doesn't prevent
> destruction from going on
> 2 blkg_get gets executed on the copy of the original blkg

So, we can't do that.  We should look up and bump the ref and use the
original copy.  We probably should switch blkgs to use percpu-refs.

Thanks.

-- 
tejun


Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-24 Thread Paolo Valente

> Il giorno 24 mag 2017, alle ore 15:50, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Wed, May 24, 2017 at 12:53:26PM +0100, Paolo Valente wrote:
>> Exact, but even after all blkgs, as well as the cfq_group and pd, are
>> gone, the children cfq_queues of the gone cfq_group continue to point
>> to unexisting objects, until new cfq_set_requests are executed for
>> those cfq_queues.  To try to make this statement clearer, here is the
>> critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,
>> say cfqg:
>> 
>> 1 cfq_set_request for a request rq of cfqq
>> 2 removal of (the process associated with cfqq) from bfqg
>> 3 destruction of the blkg that bfqg is associated with
>> 4 destruction of the blkcg the above blkg belongs to
>> 5 destruction of the pd pointed to by cfqg, and of cfqg itself
>> !!!-> from now on cfqq->cfqg is a dangling reference <-!!!
>> 6 execution of cfq functions, different from cfq_set_request, on cfqq
>>  . cfq_insert, cfq_dispatch, cfq_completed_rq, ...
>> 7 execution of a new cfq_set_request for cfqq
>> -> now cfqq->cfqg is again a sane pointer <-
>> 
>> Every function executed at step 6 sees a dangling reference for
>> cfqq->cfqg.
>> 
>> My fix for caching data doesn't solve this more serious problem.
>> 
>> Where have I been mistaken?
> 
> Hmmm... cfq_set_request() invokes cfqg_get() which increases refcnt on
> the blkg, which should pin everything down till the request is done,

Yes, I missed that step, sorry. Still ...

> so none of the above objects can be destroyed before the request is
> done.
> 

... the issue seems just to move to a more subtle position: cfq is ok,
because it protects itself with rq lock, but blk-mq schedulers don't.
So, the race that leads to the (real) crashes reported by people may
actually be:
1 blkg_lookup executed on a blkg being destroyed: the scheduler gets a
copy of the content of the blkg, but the rcu mechanism doesn't prevent
destruction from going on
2 blkg_get gets executed on the copy of the original blkg
3 subsequent scheduler operations involving that stale blkg lead to
the dangling-pointer accesses we have already discussed

Could you patiently tell me whether I'm still wrong?

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-24 Thread Tejun Heo
Hello, Paolo.

On Wed, May 24, 2017 at 12:53:26PM +0100, Paolo Valente wrote:
> Exact, but even after all blkgs, as well as the cfq_group and pd, are
> gone, the children cfq_queues of the gone cfq_group continue to point
> to unexisting objects, until new cfq_set_requests are executed for
> those cfq_queues.  To try to make this statement clearer, here is the
> critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,
> say cfqg:
> 
> 1 cfq_set_request for a request rq of cfqq
> 2 removal of (the process associated with cfqq) from bfqg
> 3 destruction of the blkg that bfqg is associated with
> 4 destruction of the blkcg the above blkg belongs to
> 5 destruction of the pd pointed to by cfqg, and of cfqg itself
> !!!-> from now on cfqq->cfqg is a dangling reference <-!!!
> 6 execution of cfq functions, different from cfq_set_request, on cfqq
>   . cfq_insert, cfq_dispatch, cfq_completed_rq, ...
> 7 execution of a new cfq_set_request for cfqq
> -> now cfqq->cfqg is again a sane pointer <-
> 
> Every function executed at step 6 sees a dangling reference for
> cfqq->cfqg.
> 
> My fix for caching data doesn't solve this more serious problem.
> 
> Where have I been mistaken?

Hmmm... cfq_set_request() invokes cfqg_get() which increases refcnt on
the blkg, which should pin everything down till the request is done,
so none of the above objects can be destroyed before the request is
done.

Thanks.

-- 
tejun


Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-24 Thread Paolo Valente

> Il giorno 24 mag 2017, alle ore 12:53, Paolo Valente 
>  ha scritto:
> 
>> 
>> Il giorno 23 mag 2017, alle ore 21:42, Tejun Heo  ha 
>> scritto:
>> 
>> Hello, Paolo.
>> 
>> On Sat, May 20, 2017 at 09:27:33AM +0200, Paolo Valente wrote:
>>> Consider a process or a group that is moved from a given source group
>>> to a different group, or simply removed from a group (although I
>>> didn't yet succeed in just removing a process from a group :) ).  The
>>> pointer to the [b|c]fq_group contained in the schedulable entity
>>> belonging to the source group *is not* updated, in BFQ, if the entity
>>> is idle, and *is not* updated *unconditionally* in CFQ.  The update
>>> will happen in bfq_get_rq_private or cfq_set_request, on the arrival
>>> of a new request.  But, if the move happens right after the arrival of
>>> a request, then all the scheduler functions executed until a new
>>> request arrives for that entity will see a stale [b|c]fq_group.  Much
>> 
>> Limited staleness is fine.  Especially in this case, it isn't too
>> weird to claim that the order between the two operations isn't clearly
>> defined.
>> 
> 
> ok
> 
>>> worse, if also a blkcg_deactivate_policy or a blkg_destroy are
>>> executed right after the move, then both the policy data pointed by
>>> the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.
>>> So, all the functions of the scheduler invoked before next request
>>> arrival may use dangling references!
>> 
>> Hmm... but cfq_group is allocated along with blkcg and blkcg always
>> ensures that there are no blkg left before freeing the pd area in
>> blkcg_css_offline().
>> 
> 
> Exact, but even after all blkgs, as well as the cfq_group and pd, are
> gone, the children cfq_queues of the gone cfq_group continue to point
> to unexisting objects, until new cfq_set_requests are executed for
> those cfq_queues.  To try to make this statement clearer, here is the
> critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,
> say cfqg:
> 
> 1 cfq_set_request for a request rq of cfqq

Sorry, this first event is irrelevant for the problem to occur.  What
matters is just that some scheduler hooks are invoked *after* the
deallocation of a cfq_group, and *before* a new cfq_set_request.

Paolo

> 2 removal of (the process associated with cfqq) from bfqg
> 3 destruction of the blkg that bfqg is associated with
> 4 destruction of the blkcg the above blkg belongs to
> 5 destruction of the pd pointed to by cfqg, and of cfqg itself
> !!!-> from now on cfqq->cfqg is a dangling reference <-!!!
> 6 execution of cfq functions, different from cfq_set_request, on cfqq
>   . cfq_insert, cfq_dispatch, cfq_completed_rq, ...
> 7 execution of a new cfq_set_request for cfqq
> -> now cfqq->cfqg is again a sane pointer <-
> 
> Every function executed at step 6 sees a dangling reference for
> cfqq->cfqg.
> 
> My fix for caching data doesn't solve this more serious problem.
> 
> Where have I been mistaken?
> 
> Thanks,
> Paolo
> 
>> Thanks.
>> 
>> -- 
>> tejun



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-24 Thread Paolo Valente

> Il giorno 23 mag 2017, alle ore 21:42, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Sat, May 20, 2017 at 09:27:33AM +0200, Paolo Valente wrote:
>> Consider a process or a group that is moved from a given source group
>> to a different group, or simply removed from a group (although I
>> didn't yet succeed in just removing a process from a group :) ).  The
>> pointer to the [b|c]fq_group contained in the schedulable entity
>> belonging to the source group *is not* updated, in BFQ, if the entity
>> is idle, and *is not* updated *unconditionally* in CFQ.  The update
>> will happen in bfq_get_rq_private or cfq_set_request, on the arrival
>> of a new request.  But, if the move happens right after the arrival of
>> a request, then all the scheduler functions executed until a new
>> request arrives for that entity will see a stale [b|c]fq_group.  Much
> 
> Limited staleness is fine.  Especially in this case, it isn't too
> weird to claim that the order between the two operations isn't clearly
> defined.
> 

ok

>> worse, if also a blkcg_deactivate_policy or a blkg_destroy are
>> executed right after the move, then both the policy data pointed by
>> the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.
>> So, all the functions of the scheduler invoked before next request
>> arrival may use dangling references!
> 
> Hmm... but cfq_group is allocated along with blkcg and blkcg always
> ensures that there are no blkg left before freeing the pd area in
> blkcg_css_offline().
> 

Exact, but even after all blkgs, as well as the cfq_group and pd, are
gone, the children cfq_queues of the gone cfq_group continue to point
to unexisting objects, until new cfq_set_requests are executed for
those cfq_queues.  To try to make this statement clearer, here is the
critical sequence for a cfq_queue, say cfqq, belonging to a cfq_group,
say cfqg:

1 cfq_set_request for a request rq of cfqq
2 removal of (the process associated with cfqq) from bfqg
3 destruction of the blkg that bfqg is associated with
4 destruction of the blkcg the above blkg belongs to
5 destruction of the pd pointed to by cfqg, and of cfqg itself
!!!-> from now on cfqq->cfqg is a dangling reference <-!!!
6 execution of cfq functions, different from cfq_set_request, on cfqq
. cfq_insert, cfq_dispatch, cfq_completed_rq, ...
7 execution of a new cfq_set_request for cfqq
-> now cfqq->cfqg is again a sane pointer <-

Every function executed at step 6 sees a dangling reference for
cfqq->cfqg.

My fix for caching data doesn't solve this more serious problem.

Where have I been mistaken?

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-23 Thread Tejun Heo
Hello, Paolo.

On Sat, May 20, 2017 at 09:27:33AM +0200, Paolo Valente wrote:
> Consider a process or a group that is moved from a given source group
> to a different group, or simply removed from a group (although I
> didn't yet succeed in just removing a process from a group :) ).  The
> pointer to the [b|c]fq_group contained in the schedulable entity
> belonging to the source group *is not* updated, in BFQ, if the entity
> is idle, and *is not* updated *unconditionally* in CFQ.  The update
> will happen in bfq_get_rq_private or cfq_set_request, on the arrival
> of a new request.  But, if the move happens right after the arrival of
> a request, then all the scheduler functions executed until a new
> request arrives for that entity will see a stale [b|c]fq_group.  Much

Limited staleness is fine.  Especially in this case, it isn't too
weird to claim that the order between the two operations isn't clearly
defined.

> worse, if also a blkcg_deactivate_policy or a blkg_destroy are
> executed right after the move, then both the policy data pointed by
> the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.
> So, all the functions of the scheduler invoked before next request
> arrival may use dangling references!

Hmm... but cfq_group is allocated along with blkcg and blkcg always
ensures that there are no blkg left before freeing the pd area in
blkcg_css_offline().

Thanks.

-- 
tejun


Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-22 Thread Paolo Valente

> Il giorno 19 mag 2017, alle ore 16:37, Jens Axboe  ha 
> scritto:
> 
> On 05/19/2017 02:39 AM, Paolo Valente wrote:
>> @@ -692,8 +725,7 @@ void bfq_pd_offline(struct blkg_policy_data *pd)
>>  /*
>>   * The idle tree may still contain bfq_queues belonging
>>   * to exited task because they never migrated to a different
>> - * cgroup from the one being destroyed now.  No one else
>> - * can access them so it's safe to act without any lock.
>> ++* cgroup from the one being destroyed now.
>>   */
>>  bfq_flush_idle_tree(st);
>> 
> 
> Looks like an extra '+' snuck into that hunk.
> 

Yes, sorry.  Before possibly submitting a fixed version, I'll wait for
a reply on my previous email in this thread, as the issue now seems
more serious to me, and affecting CFQ too.

Thanks,
Paolo

> -- 
> Jens Axboe
> 



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-20 Thread Paolo Valente

> Il giorno 19 mag 2017, alle ore 16:54, Tejun Heo  ha scritto:
> 
> Hello, Paolo.
> 
> On Fri, May 19, 2017 at 10:39:08AM +0200, Paolo Valente wrote:
>> Operations on blkg objects in blk-cgroup are protected with the
>> request_queue lock, which is no more the lock that protects
>> I/O-scheduler operations in blk-mq. The latter are now protected with
>> finer-grained per-scheduler-instance locks. As a consequence, if blkg
>> and blkg-related objects are accessed in a blk-mq I/O scheduler, it is
>> possible to have races, unless proper care is taken for these
>> accesses. BFQ does access these objects, and does incur these races.
>> 
>> This commit addresses this issue without introducing further locks, by
>> exploiting the following facts.  Destroy operations on a blkg invoke,
>> as a first step, hooks of the scheduler associated with the blkg. And
>> these hooks are executed with bfqd->lock held for BFQ. As a
>> consequence, for any blkg associated with the request queue an
>> instance of BFQ is attached to, we are guaranteed that such a blkg is
>> not destroyed and that all the pointers it contains are consistent,
>> (only) while that instance is holding its bfqd->lock. A blkg_lookup
>> performed with bfqd->lock held then returns a fully consistent blkg,
>> which remains consistent until this lock is held.
>> 
>> In view of these facts, this commit caches any needed blkg data (only)
>> when it (safely) detects a parent-blkg change for an internal entity,
>> and, to cache these data safely, it gets the new blkg, through a
>> blkg_lookup, and copies data while keeping the bfqd->lock held. As of
>> now, BFQ needs to cache only the path of the blkg, which is used in
>> the bfq_log_* functions.
>> 
>> This commit also removes or updates some stale comments on locking
>> issues related to blk-cgroup operations.
> 
> For a quick fix, this is fine but I think it'd be much better to
> update blkcg core so that we protect lookups with rcu and refcnt the
> blkg with percpu refs so that we can use blkcg correctly for all
> purposes with blk-mq.  There's no reason to hold up the immediate fix
> for that but it'd be nice to at least note what we should be doing in
> the longer term in a comment.
> 

Ok.

I have started thinking of a blk-cgroup-wide solution, but, Tejun and
Jens, the more I think about it, the more I see a more structural bug
:( The bug seems to affect CFQ too, even if CFQ still uses the
request_queue lock.  Hoping that the bug is only in my mind, here is
first my understanding of how the data structures related to the bug
are performed, and, second, why handling them this way apparently
leads to the bug.

For a given instance of [B|C]FQ (i.e., of BFQ or CFQ), a [b|c]fq_group
(descriptor of a group in [B|C]FQ) is created on the creation of each
blkg associated with the same request queue that instance of [B|C]FQ
is attached to.  The schedulable entities that belong to this blkg
(only queues in CFQ, or both queues and generic entities in BFQ), are
then associated with this [b|c]fq_group on the arrival on new I/O
requests for them: these entities contain a pointer to a
[b|c]fq_group, and the pointer is assigned the address of this new
[b|c]fq_group.  The functions where the association occurs are
bfq_get_rq_private for BFQ and cfq_set_request for CFQ.  Both hooks
are executed before the hook for actually enqueueing the request.  Any
access to group information is performed through this [b|c]fq_group
field.  The associated blkg is accessed through the policy_data
pointer in the bfq_group (the policy data in its turn contains a
pointer to the blkg)

Consider a process or a group that is moved from a given source group
to a different group, or simply removed from a group (although I
didn't yet succeed in just removing a process from a group :) ).  The
pointer to the [b|c]fq_group contained in the schedulable entity
belonging to the source group *is not* updated, in BFQ, if the entity
is idle, and *is not* updated *unconditionally* in CFQ.  The update
will happen in bfq_get_rq_private or cfq_set_request, on the arrival
of a new request.  But, if the move happens right after the arrival of
a request, then all the scheduler functions executed until a new
request arrives for that entity will see a stale [b|c]fq_group.  Much
worse, if also a blkcg_deactivate_policy or a blkg_destroy are
executed right after the move, then both the policy data pointed by
the [b|c]fq_group and the [b|c]fq_group itself may be deallocated.
So, all the functions of the scheduler invoked before next request
arrival may use dangling references!

The symptom reported by BFQ users has been actually the dereference of
dangling bfq_group or policy data pointers in a request_insert

What do you think, have I been mistaken in some step?

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun



Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-19 Thread Tejun Heo
Hello, Paolo.

On Fri, May 19, 2017 at 10:39:08AM +0200, Paolo Valente wrote:
> Operations on blkg objects in blk-cgroup are protected with the
> request_queue lock, which is no more the lock that protects
> I/O-scheduler operations in blk-mq. The latter are now protected with
> finer-grained per-scheduler-instance locks. As a consequence, if blkg
> and blkg-related objects are accessed in a blk-mq I/O scheduler, it is
> possible to have races, unless proper care is taken for these
> accesses. BFQ does access these objects, and does incur these races.
> 
> This commit addresses this issue without introducing further locks, by
> exploiting the following facts.  Destroy operations on a blkg invoke,
> as a first step, hooks of the scheduler associated with the blkg. And
> these hooks are executed with bfqd->lock held for BFQ. As a
> consequence, for any blkg associated with the request queue an
> instance of BFQ is attached to, we are guaranteed that such a blkg is
> not destroyed and that all the pointers it contains are consistent,
> (only) while that instance is holding its bfqd->lock. A blkg_lookup
> performed with bfqd->lock held then returns a fully consistent blkg,
> which remains consistent until this lock is held.
> 
> In view of these facts, this commit caches any needed blkg data (only)
> when it (safely) detects a parent-blkg change for an internal entity,
> and, to cache these data safely, it gets the new blkg, through a
> blkg_lookup, and copies data while keeping the bfqd->lock held. As of
> now, BFQ needs to cache only the path of the blkg, which is used in
> the bfq_log_* functions.
> 
> This commit also removes or updates some stale comments on locking
> issues related to blk-cgroup operations.

For a quick fix, this is fine but I think it'd be much better to
update blkcg core so that we protect lookups with rcu and refcnt the
blkg with percpu refs so that we can use blkcg correctly for all
purposes with blk-mq.  There's no reason to hold up the immediate fix
for that but it'd be nice to at least note what we should be doing in
the longer term in a comment.

Thanks.

-- 
tejun


Re: [PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-19 Thread Jens Axboe
On 05/19/2017 02:39 AM, Paolo Valente wrote:
> @@ -692,8 +725,7 @@ void bfq_pd_offline(struct blkg_policy_data *pd)
>   /*
>* The idle tree may still contain bfq_queues belonging
>* to exited task because they never migrated to a different
> -  * cgroup from the one being destroyed now.  No one else
> -  * can access them so it's safe to act without any lock.
> ++* cgroup from the one being destroyed now.
>*/
>   bfq_flush_idle_tree(st);
>  

Looks like an extra '+' snuck into that hunk.

-- 
Jens Axboe



[PATCH BUGFIX] block, bfq: access and cache blkg data only when safe

2017-05-19 Thread Paolo Valente
Operations on blkg objects in blk-cgroup are protected with the
request_queue lock, which is no more the lock that protects
I/O-scheduler operations in blk-mq. The latter are now protected with
finer-grained per-scheduler-instance locks. As a consequence, if blkg
and blkg-related objects are accessed in a blk-mq I/O scheduler, it is
possible to have races, unless proper care is taken for these
accesses. BFQ does access these objects, and does incur these races.

This commit addresses this issue without introducing further locks, by
exploiting the following facts.  Destroy operations on a blkg invoke,
as a first step, hooks of the scheduler associated with the blkg. And
these hooks are executed with bfqd->lock held for BFQ. As a
consequence, for any blkg associated with the request queue an
instance of BFQ is attached to, we are guaranteed that such a blkg is
not destroyed and that all the pointers it contains are consistent,
(only) while that instance is holding its bfqd->lock. A blkg_lookup
performed with bfqd->lock held then returns a fully consistent blkg,
which remains consistent until this lock is held.

In view of these facts, this commit caches any needed blkg data (only)
when it (safely) detects a parent-blkg change for an internal entity,
and, to cache these data safely, it gets the new blkg, through a
blkg_lookup, and copies data while keeping the bfqd->lock held. As of
now, BFQ needs to cache only the path of the blkg, which is used in
the bfq_log_* functions.

This commit also removes or updates some stale comments on locking
issues related to blk-cgroup operations.

Reported-by: Tomas Konir 
Reported-by: Lee Tibbert 
Reported-by: Marco Piazza 
Signed-off-by: Paolo Valente 
Tested-by: Tomas Konir 
Tested-by: Lee Tibbert 
Tested-by: Marco Piazza 
---
 block/bfq-cgroup.c  | 56 +
 block/bfq-iosched.h | 18 +++--
 2 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index c8a32fb..06195ff 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -52,7 +52,7 @@ BFQG_FLAG_FNS(idling)
 BFQG_FLAG_FNS(empty)
 #undef BFQG_FLAG_FNS
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
 {
unsigned long long now;
@@ -67,7 +67,7 @@ static void bfqg_stats_update_group_wait_time(struct 
bfqg_stats *stats)
bfqg_stats_clear_waiting(stats);
 }
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
 struct bfq_group *curr_bfqg)
 {
@@ -81,7 +81,7 @@ static void bfqg_stats_set_start_group_wait_time(struct 
bfq_group *bfqg,
bfqg_stats_mark_waiting(stats);
 }
 
-/* This should be called with the queue_lock held. */
+/* This should be called with the scheduler lock held. */
 static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
 {
unsigned long long now;
@@ -496,9 +496,10 @@ struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
  * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
  * it on the new one.  Avoid putting the entity on the old group idle tree.
  *
- * Must be called under the queue lock; the cgroup owning @bfqg must
- * not disappear (by now this just means that we are called under
- * rcu_read_lock()).
+ * Must be called under the scheduler lock, to make sure that the blkg
+ * owning @bfqg does not disappear (see comments in
+ * bfq_bic_update_cgroup on guaranteeing the consistency of blkg
+ * objects).
  */
 void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
   struct bfq_group *bfqg)
@@ -545,8 +546,9 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue 
*bfqq,
  * @bic: the bic to move.
  * @blkcg: the blk-cgroup to move to.
  *
- * Move bic to blkcg, assuming that bfqd->queue is locked; the caller
- * has to make sure that the reference to cgroup is valid across the call.
+ * Move bic to blkcg, assuming that bfqd->lock is held; which makes
+ * sure that the reference to cgroup is valid across the call (see
+ * comments in bfq_bic_update_cgroup on this issue)
  *
  * NOTE: an alternative approach might have been to store the current
  * cgroup in bfqq and getting a reference to it, reducing the lookup
@@ -604,6 +606,39 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct 
bio *bio)
goto out;
 
bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio));
+   /*
+* Update blkg_path for bfq_log_* functions. We cache this
+* path, and update it here, for the following
+