Re: [PATCH 3/3] sched,time: atomically increment stime & utime
Aaah, Rik, I am sorry! You documented this in 0/3 which I didn't bother to read. Sorry for noise. On 08/16, Oleg Nesterov wrote: > > On 08/15, r...@redhat.com wrote: > > > > --- a/kernel/sched/cputime.c > > +++ b/kernel/sched/cputime.c > > @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr, > > * If the tick based count grows faster than the scheduler one, > > * the result of the scaling may go backward. > > * Let's enforce monotonicity. > > +* Atomic exchange protects against concurrent cputime_adjust. > > */ > > - prev->stime = max(prev->stime, stime); > > - prev->utime = max(prev->utime, utime); > > + while (stime > (rtime = ACCESS_ONCE(prev->stime))) > > + cmpxchg(>stime, rtime, stime); > > + while (utime > (rtime = ACCESS_ONCE(prev->utime))) > > + cmpxchg(>utime, rtime, utime); > > > > out: > > *ut = prev->utime; > > I am still not sure about this change. At least I think it needs some > discussion. > > Let me repeat, afaics this can lead to inconsistent results. Just > suppose that the caller of thread_group_cputime_adjusted() gets a long > preemption between thread_group_cputime() and cputime_adjust(), and > the numbers in signal->prev_cputime grow significantly when this task > resumes. If cputime_adjust() sees both prev->stime and prev->utime > updated everything is fine. But we can race with cputime_adjust() on > another CPU and miss, say, the change in ->utime. > > IOW. To simplify, suppose that thread_group_cputime(T) fills task_cputime > with zeros. Then the caller X is preempted. > > Another task does thread_group_cputime(T) and this time task_cputime is > { .utime = A_LOT_U, .stime = A_LOT_S }. This task calls cputime_adjust() > and sets prev->stime = A_LOT_S. > > X resumes, calls cputime_adjust(), and returns { 0, A_LOT_S }. > > If you think that we do not care, probably I won't argue. But at least > this should be documented/discussed. And if we can tolerate this, then we > can probably simply remove the scale_stime recalculation and change it to > just do > > static void cputime_adjust(struct task_cputime *curr, > struct cputime *prev, > cputime_t *ut, cputime_t *st) > { > cputime_t rtime, stime, utime; > /* >* Let's enforce monotonicity. >* Atomic exchange protects against concurrent cputime_adjust. >*/ > while (stime > (rtime = ACCESS_ONCE(prev->stime))) > cmpxchg(>stime, rtime, stime); > while (utime > (rtime = ACCESS_ONCE(prev->utime))) > cmpxchg(>utime, rtime, utime); > > *ut = prev->utime; > *st = prev->stime; > } > > Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] sched,time: atomically increment stime & utime
On 08/15, r...@redhat.com wrote: > > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr, >* If the tick based count grows faster than the scheduler one, >* the result of the scaling may go backward. >* Let's enforce monotonicity. > + * Atomic exchange protects against concurrent cputime_adjust. >*/ > - prev->stime = max(prev->stime, stime); > - prev->utime = max(prev->utime, utime); > + while (stime > (rtime = ACCESS_ONCE(prev->stime))) > + cmpxchg(>stime, rtime, stime); > + while (utime > (rtime = ACCESS_ONCE(prev->utime))) > + cmpxchg(>utime, rtime, utime); > > out: > *ut = prev->utime; I am still not sure about this change. At least I think it needs some discussion. Let me repeat, afaics this can lead to inconsistent results. Just suppose that the caller of thread_group_cputime_adjusted() gets a long preemption between thread_group_cputime() and cputime_adjust(), and the numbers in signal->prev_cputime grow significantly when this task resumes. If cputime_adjust() sees both prev->stime and prev->utime updated everything is fine. But we can race with cputime_adjust() on another CPU and miss, say, the change in ->utime. IOW. To simplify, suppose that thread_group_cputime(T) fills task_cputime with zeros. Then the caller X is preempted. Another task does thread_group_cputime(T) and this time task_cputime is { .utime = A_LOT_U, .stime = A_LOT_S }. This task calls cputime_adjust() and sets prev->stime = A_LOT_S. X resumes, calls cputime_adjust(), and returns { 0, A_LOT_S }. If you think that we do not care, probably I won't argue. But at least this should be documented/discussed. And if we can tolerate this, then we can probably simply remove the scale_stime recalculation and change it to just do static void cputime_adjust(struct task_cputime *curr, struct cputime *prev, cputime_t *ut, cputime_t *st) { cputime_t rtime, stime, utime; /* * Let's enforce monotonicity. * Atomic exchange protects against concurrent cputime_adjust. */ while (stime > (rtime = ACCESS_ONCE(prev->stime))) cmpxchg(>stime, rtime, stime); while (utime > (rtime = ACCESS_ONCE(prev->utime))) cmpxchg(>utime, rtime, utime); *ut = prev->utime; *st = prev->stime; } Oleg. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] sched,time: atomically increment stime utime
On 08/15, r...@redhat.com wrote: --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr, * If the tick based count grows faster than the scheduler one, * the result of the scaling may go backward. * Let's enforce monotonicity. + * Atomic exchange protects against concurrent cputime_adjust. */ - prev-stime = max(prev-stime, stime); - prev-utime = max(prev-utime, utime); + while (stime (rtime = ACCESS_ONCE(prev-stime))) + cmpxchg(prev-stime, rtime, stime); + while (utime (rtime = ACCESS_ONCE(prev-utime))) + cmpxchg(prev-utime, rtime, utime); out: *ut = prev-utime; I am still not sure about this change. At least I think it needs some discussion. Let me repeat, afaics this can lead to inconsistent results. Just suppose that the caller of thread_group_cputime_adjusted() gets a long preemption between thread_group_cputime() and cputime_adjust(), and the numbers in signal-prev_cputime grow significantly when this task resumes. If cputime_adjust() sees both prev-stime and prev-utime updated everything is fine. But we can race with cputime_adjust() on another CPU and miss, say, the change in -utime. IOW. To simplify, suppose that thread_group_cputime(T) fills task_cputime with zeros. Then the caller X is preempted. Another task does thread_group_cputime(T) and this time task_cputime is { .utime = A_LOT_U, .stime = A_LOT_S }. This task calls cputime_adjust() and sets prev-stime = A_LOT_S. X resumes, calls cputime_adjust(), and returns { 0, A_LOT_S }. If you think that we do not care, probably I won't argue. But at least this should be documented/discussed. And if we can tolerate this, then we can probably simply remove the scale_stime recalculation and change it to just do static void cputime_adjust(struct task_cputime *curr, struct cputime *prev, cputime_t *ut, cputime_t *st) { cputime_t rtime, stime, utime; /* * Let's enforce monotonicity. * Atomic exchange protects against concurrent cputime_adjust. */ while (stime (rtime = ACCESS_ONCE(prev-stime))) cmpxchg(prev-stime, rtime, stime); while (utime (rtime = ACCESS_ONCE(prev-utime))) cmpxchg(prev-utime, rtime, utime); *ut = prev-utime; *st = prev-stime; } Oleg. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] sched,time: atomically increment stime utime
Aaah, Rik, I am sorry! You documented this in 0/3 which I didn't bother to read. Sorry for noise. On 08/16, Oleg Nesterov wrote: On 08/15, r...@redhat.com wrote: --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr, * If the tick based count grows faster than the scheduler one, * the result of the scaling may go backward. * Let's enforce monotonicity. +* Atomic exchange protects against concurrent cputime_adjust. */ - prev-stime = max(prev-stime, stime); - prev-utime = max(prev-utime, utime); + while (stime (rtime = ACCESS_ONCE(prev-stime))) + cmpxchg(prev-stime, rtime, stime); + while (utime (rtime = ACCESS_ONCE(prev-utime))) + cmpxchg(prev-utime, rtime, utime); out: *ut = prev-utime; I am still not sure about this change. At least I think it needs some discussion. Let me repeat, afaics this can lead to inconsistent results. Just suppose that the caller of thread_group_cputime_adjusted() gets a long preemption between thread_group_cputime() and cputime_adjust(), and the numbers in signal-prev_cputime grow significantly when this task resumes. If cputime_adjust() sees both prev-stime and prev-utime updated everything is fine. But we can race with cputime_adjust() on another CPU and miss, say, the change in -utime. IOW. To simplify, suppose that thread_group_cputime(T) fills task_cputime with zeros. Then the caller X is preempted. Another task does thread_group_cputime(T) and this time task_cputime is { .utime = A_LOT_U, .stime = A_LOT_S }. This task calls cputime_adjust() and sets prev-stime = A_LOT_S. X resumes, calls cputime_adjust(), and returns { 0, A_LOT_S }. If you think that we do not care, probably I won't argue. But at least this should be documented/discussed. And if we can tolerate this, then we can probably simply remove the scale_stime recalculation and change it to just do static void cputime_adjust(struct task_cputime *curr, struct cputime *prev, cputime_t *ut, cputime_t *st) { cputime_t rtime, stime, utime; /* * Let's enforce monotonicity. * Atomic exchange protects against concurrent cputime_adjust. */ while (stime (rtime = ACCESS_ONCE(prev-stime))) cmpxchg(prev-stime, rtime, stime); while (utime (rtime = ACCESS_ONCE(prev-utime))) cmpxchg(prev-utime, rtime, utime); *ut = prev-utime; *st = prev-stime; } Oleg. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] sched,time: atomically increment stime & utime
From: Rik van Riel The functions task_cputime_adjusted and thread_group_cputime_adjusted can be called locklessly, as well as concurrently on many different CPUs. This can occasionally lead to the utime and stime reported by times(), and other syscalls like it, going backward. The cause for this appears to be multiple threads racing in cputime_adjust, both with values for utime or stime that is larger than the original, but each with a different value. Sometimes the larger value gets saved first, only to be immediately overwritten with a smaller value by another thread. Using atomic exchange prevents that problem, and ensures time progresses monotonically. Signed-off-by: Rik van Riel --- kernel/sched/cputime.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index b5f1c58..ab84270 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr, * If the tick based count grows faster than the scheduler one, * the result of the scaling may go backward. * Let's enforce monotonicity. +* Atomic exchange protects against concurrent cputime_adjust. */ - prev->stime = max(prev->stime, stime); - prev->utime = max(prev->utime, utime); + while (stime > (rtime = ACCESS_ONCE(prev->stime))) + cmpxchg(>stime, rtime, stime); + while (utime > (rtime = ACCESS_ONCE(prev->utime))) + cmpxchg(>utime, rtime, utime); out: *ut = prev->utime; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] sched,time: atomically increment stime utime
From: Rik van Riel r...@redhat.com The functions task_cputime_adjusted and thread_group_cputime_adjusted can be called locklessly, as well as concurrently on many different CPUs. This can occasionally lead to the utime and stime reported by times(), and other syscalls like it, going backward. The cause for this appears to be multiple threads racing in cputime_adjust, both with values for utime or stime that is larger than the original, but each with a different value. Sometimes the larger value gets saved first, only to be immediately overwritten with a smaller value by another thread. Using atomic exchange prevents that problem, and ensures time progresses monotonically. Signed-off-by: Rik van Riel r...@redhat.com --- kernel/sched/cputime.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index b5f1c58..ab84270 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr, * If the tick based count grows faster than the scheduler one, * the result of the scaling may go backward. * Let's enforce monotonicity. +* Atomic exchange protects against concurrent cputime_adjust. */ - prev-stime = max(prev-stime, stime); - prev-utime = max(prev-utime, utime); + while (stime (rtime = ACCESS_ONCE(prev-stime))) + cmpxchg(prev-stime, rtime, stime); + while (utime (rtime = ACCESS_ONCE(prev-utime))) + cmpxchg(prev-utime, rtime, utime); out: *ut = prev-utime; -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/