Re: [PATCH 3/3] sched,time: atomically increment stime & utime

2014-08-16 Thread Oleg Nesterov
Aaah, Rik, I am sorry!

You documented this in 0/3 which I didn't bother to read.

Sorry for the noise.

On 08/16, Oleg Nesterov wrote:
>
> On 08/15, r...@redhat.com wrote:
> >
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
> >  	 * If the tick based count grows faster than the scheduler one,
> >  	 * the result of the scaling may go backward.
> >  	 * Let's enforce monotonicity.
> > +	 * Atomic exchange protects against concurrent cputime_adjust.
> >  	 */
> > -	prev->stime = max(prev->stime, stime);
> > -	prev->utime = max(prev->utime, utime);
> > +	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
> > +		cmpxchg(&prev->stime, rtime, stime);
> > +	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
> > +		cmpxchg(&prev->utime, rtime, utime);
> >
> >  out:
> > 	*ut = prev->utime;
> 
> I am still not sure about this change. At least I think it needs some
> discussion.
> 
> Let me repeat, afaics this can lead to inconsistent results. Just
> suppose that the caller of thread_group_cputime_adjusted() gets a long
> preemption between thread_group_cputime() and cputime_adjust(), and
> the numbers in signal->prev_cputime grow significantly when this task
> resumes. If cputime_adjust() sees both prev->stime and prev->utime
> updated everything is fine. But we can race with cputime_adjust() on
> another CPU and miss, say, the change in ->utime.
> 
> IOW. To simplify, suppose that thread_group_cputime(T) fills task_cputime
> with zeros. Then the caller X is preempted.
> 
> Another task does thread_group_cputime(T) and this time task_cputime is
> { .utime = A_LOT_U, .stime = A_LOT_S }. This task calls cputime_adjust()
> and sets prev->stime = A_LOT_S.
> 
> X resumes, calls cputime_adjust(), and returns { 0, A_LOT_S }.
> 
> If you think that we do not care, I probably won't argue. But at least
> this should be documented/discussed. And if we can tolerate this, then we
> can probably simply remove the scale_stime recalculation and change it to
> just do:
> 
> 	static void cputime_adjust(struct task_cputime *curr,
> 				   struct cputime *prev,
> 				   cputime_t *ut, cputime_t *st)
> 	{
> 		cputime_t rtime, stime, utime;
>
> 		stime = curr->stime;
> 		utime = curr->utime;
> 		/*
> 		 * Let's enforce monotonicity.
> 		 * Atomic exchange protects against concurrent cputime_adjust.
> 		 */
> 		while (stime > (rtime = ACCESS_ONCE(prev->stime)))
> 			cmpxchg(&prev->stime, rtime, stime);
> 		while (utime > (rtime = ACCESS_ONCE(prev->utime)))
> 			cmpxchg(&prev->utime, rtime, utime);
>
> 		*ut = prev->utime;
> 		*st = prev->stime;
> 	}
> 
> Oleg.

Re: [PATCH 3/3] sched,time: atomically increment stime & utime

2014-08-16 Thread Oleg Nesterov
On 08/15, r...@redhat.com wrote:
>
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
>  	 * If the tick based count grows faster than the scheduler one,
>  	 * the result of the scaling may go backward.
>  	 * Let's enforce monotonicity.
> +	 * Atomic exchange protects against concurrent cputime_adjust.
>  	 */
> -	prev->stime = max(prev->stime, stime);
> -	prev->utime = max(prev->utime, utime);
> +	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
> +		cmpxchg(&prev->stime, rtime, stime);
> +	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
> +		cmpxchg(&prev->utime, rtime, utime);
>
>  out:
> 	*ut = prev->utime;

I am still not sure about this change. At least I think it needs some
discussion.

Let me repeat, afaics this can lead to inconsistent results. Just
suppose that the caller of thread_group_cputime_adjusted() gets a long
preemption between thread_group_cputime() and cputime_adjust(), and
the numbers in signal->prev_cputime grow significantly when this task
resumes. If cputime_adjust() sees both prev->stime and prev->utime
updated everything is fine. But we can race with cputime_adjust() on
another CPU and miss, say, the change in ->utime.

IOW. To simplify, suppose that thread_group_cputime(T) fills task_cputime
with zeros. Then the caller X is preempted.

Another task does thread_group_cputime(T) and this time task_cputime is
{ .utime = A_LOT_U, .stime = A_LOT_S }. This task calls cputime_adjust()
and sets prev->stime = A_LOT_S.

X resumes, calls cputime_adjust(), and returns { 0, A_LOT_S }.

If you think that we do not care, I probably won't argue. But at least
this should be documented/discussed. And if we can tolerate this, then we
can probably simply remove the scale_stime recalculation and change it to
just do:

	static void cputime_adjust(struct task_cputime *curr,
				   struct cputime *prev,
				   cputime_t *ut, cputime_t *st)
	{
		cputime_t rtime, stime, utime;

		stime = curr->stime;
		utime = curr->utime;
		/*
		 * Let's enforce monotonicity.
		 * Atomic exchange protects against concurrent cputime_adjust.
		 */
		while (stime > (rtime = ACCESS_ONCE(prev->stime)))
			cmpxchg(&prev->stime, rtime, stime);
		while (utime > (rtime = ACCESS_ONCE(prev->utime)))
			cmpxchg(&prev->utime, rtime, utime);

		*ut = prev->utime;
		*st = prev->stime;
	}

Oleg.
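
To make the interleaving concrete, here is a minimal user-space sketch of
the same per-field monotonic maximum, with GCC's __atomic builtins standing
in for the kernel's ACCESS_ONCE() and cmpxchg(). struct prev_cputime and
advance() are hypothetical names for illustration, not kernel code:

	#include <stdint.h>

	struct prev_cputime {		/* stands in for *prev above */
		uint64_t stime;
		uint64_t utime;
	};

	/* Lock-free monotonic maximum: the idiom the patch introduces. */
	static void advance(uint64_t *field, uint64_t val)
	{
		uint64_t old = __atomic_load_n(field, __ATOMIC_RELAXED);

		/* A failed CAS reloads 'old', so the loop retries until
		   either val is published or a larger value is seen. */
		while (val > old &&
		       !__atomic_compare_exchange_n(field, &old, val, 0,
						    __ATOMIC_RELAXED,
						    __ATOMIC_RELAXED))
			;
	}

	/*
	 * The inconsistency described above, step by step:
	 *
	 *   reader X snapshot: { .stime = 0,       .utime = 0 }
	 *   reader Y snapshot: { .stime = A_LOT_S, .utime = A_LOT_U }
	 *
	 *   Y: advance(&prev.stime, A_LOT_S);   then Y is preempted
	 *   X: advance(&prev.stime, 0);         no-op, 0 < A_LOT_S
	 *   X: advance(&prev.utime, 0);         no-op
	 *   X: reads back { .utime = 0, .stime = A_LOT_S }
	 *
	 * Each field is monotonic on its own, but X reports Y's stime next
	 * to its own stale utime.
	 */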

[PATCH 3/3] sched,time: atomically increment stime & utime

2014-08-15 Thread riel
From: Rik van Riel 

The functions task_cputime_adjusted and thread_group_cputime_adjusted
can be called locklessly, as well as concurrently on many different CPUs.

This can occasionally lead to the utime and stime reported by times(), and
other syscalls like it, going backward. The cause for this appears to be
multiple threads racing in cputime_adjust, each with a value for utime or
stime that is larger than the original, but each with a different value.

Sometimes the larger value gets saved first, only to be immediately
overwritten with a smaller value by another thread.

Using atomic exchange prevents that problem, and ensures time
progresses monotonically.

Signed-off-by: Rik van Riel 
---
 kernel/sched/cputime.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b5f1c58..ab84270 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
 	 * If the tick based count grows faster than the scheduler one,
 	 * the result of the scaling may go backward.
 	 * Let's enforce monotonicity.
+	 * Atomic exchange protects against concurrent cputime_adjust.
 	 */
-	prev->stime = max(prev->stime, stime);
-	prev->utime = max(prev->utime, utime);
+	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
+		cmpxchg(&prev->stime, rtime, stime);
+	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
+		cmpxchg(&prev->utime, rtime, utime);
 
 out:
 	*ut = prev->utime;
-- 
1.8.3.1
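
The lost update described in the changelog is easy to reproduce outside the
kernel. Below is a minimal sketch, a hypothetical user-space reduction in
which prev_stime and update() are illustrative names and plain loads and
stores stand in for the old max() code (build with gcc -pthread):

	#include <pthread.h>
	#include <stdint.h>
	#include <stdio.h>

	static uint64_t prev_stime;	/* plays the role of prev->stime */

	/* The old code in miniature: a non-atomic read-modify-write max(). */
	static void *update(void *arg)
	{
		uint64_t mine = (uint64_t)(uintptr_t)arg;
		uint64_t seen = prev_stime;	/* racy read */

		if (mine > seen)
			prev_stime = mine;	/* racy write: may overwrite a
						   larger concurrent value */
		return NULL;
	}

	int main(void)
	{
		pthread_t a, b;

		pthread_create(&a, NULL, update, (void *)(uintptr_t)100);
		pthread_create(&b, NULL, update, (void *)(uintptr_t)200);
		pthread_join(a, NULL);
		pthread_join(b, NULL);

		/* Usually 200, but with an unlucky interleaving 100: both
		   threads read 0, b stores 200, then a stores 100 on top,
		   and the reported time has gone backward. */
		printf("prev_stime = %llu\n", (unsigned long long)prev_stime);
		return 0;
	}

With the patch's cmpxchg loop, the store of 100 succeeds only while
prev_stime still holds the value that thread read, so a published 200 can
never be overwritten by a smaller, stale maximum.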
