Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-25 Thread Tim Chen
On 08/24/2017 01:44 PM, Mel Gorman wrote: > On Thu, Aug 24, 2017 at 11:16:15AM -0700, Linus Torvalds wrote: >> On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen >> wrote: >>> >>> These changes look fine. We are testing them now. >>> Does the second patch in the series look

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-25 Thread Tim Chen
On 08/24/2017 01:44 PM, Mel Gorman wrote: > On Thu, Aug 24, 2017 at 11:16:15AM -0700, Linus Torvalds wrote: >> On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen >> wrote: >>> >>> These changes look fine. We are testing them now. >>> Does the second patch in the series look okay to you? >> >> I didn't

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-24 Thread Mel Gorman
On Thu, Aug 24, 2017 at 11:16:15AM -0700, Linus Torvalds wrote: > On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen wrote: > > > > These changes look fine. We are testing them now. > > Does the second patch in the series look okay to you? > > I didn't really have any

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-24 Thread Mel Gorman
On Thu, Aug 24, 2017 at 11:16:15AM -0700, Linus Torvalds wrote: > On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen wrote: > > > > These changes look fine. We are testing them now. > > Does the second patch in the series look okay to you? > > I didn't really have any reaction to that one, as long as

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-24 Thread Linus Torvalds
On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen wrote: > > These changes look fine. We are testing them now. > Does the second patch in the series look okay to you? I didn't really have any reaction to that one, as long as Mel are ok with it, I'm fine with it.

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-24 Thread Linus Torvalds
On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen wrote: > > These changes look fine. We are testing them now. > Does the second patch in the series look okay to you? I didn't really have any reaction to that one, as long as Mel are ok with it, I'm fine with it. Linus

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-24 Thread Tim Chen
On 08/23/2017 04:30 PM, Linus Torvalds wrote: > On Wed, Aug 23, 2017 at 11:17 AM, Linus Torvalds > wrote: >> On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen wrote: >>> >>> Will you still consider the original patch as a fail safe mechanism? >>

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-24 Thread Tim Chen
On 08/23/2017 04:30 PM, Linus Torvalds wrote: > On Wed, Aug 23, 2017 at 11:17 AM, Linus Torvalds > wrote: >> On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen wrote: >>> >>> Will you still consider the original patch as a fail safe mechanism? >> >> I don't think we have much choice, although I would

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Linus Torvalds
On Wed, Aug 23, 2017 at 11:17 AM, Linus Torvalds wrote: > On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen wrote: >> >> Will you still consider the original patch as a fail safe mechanism? > > I don't think we have much choice, although I would

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Linus Torvalds
On Wed, Aug 23, 2017 at 11:17 AM, Linus Torvalds wrote: > On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen wrote: >> >> Will you still consider the original patch as a fail safe mechanism? > > I don't think we have much choice, although I would *really* want to > get this root-caused rather than just

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Liang, Kan
> > On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen > wrote: > > > > Will you still consider the original patch as a fail safe mechanism? > > I don't think we have much choice, although I would *really* want to get this > root-caused rather than just papering over the

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Liang, Kan
> > On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen > wrote: > > > > Will you still consider the original patch as a fail safe mechanism? > > I don't think we have much choice, although I would *really* want to get this > root-caused rather than just papering over the symptoms. > > Maybe still worth

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Linus Torvalds
On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen wrote: > > Will you still consider the original patch as a fail safe mechanism? I don't think we have much choice, although I would *really* want to get this root-caused rather than just papering over the symptoms. Maybe

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Linus Torvalds
On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen wrote: > > Will you still consider the original patch as a fail safe mechanism? I don't think we have much choice, although I would *really* want to get this root-caused rather than just papering over the symptoms. Maybe still worth testing that

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Mel Gorman
On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote: > On Tue, Aug 22, 2017 at 10:23 AM, Liang, Kan wrote: > > > > Although the patch doesn't trigger watchdog, the spin lock wait time > > is not small (0.45s). > > It may get worse again on larger systems. > >

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Mel Gorman
On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote: > On Tue, Aug 22, 2017 at 10:23 AM, Liang, Kan wrote: > > > > Although the patch doesn't trigger watchdog, the spin lock wait time > > is not small (0.45s). > > It may get worse again on larger systems. > > Yeah, I don't think Mel's

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Tim Chen
On 08/23/2017 07:49 AM, Liang, Kan wrote: >> On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan wrote: >>> So I propose testing the attached trivial patch. >>> >>> It doesn’t work. >>> The call stack is the same. >> >> So I would have expected the stack trace to be the same,

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Tim Chen
On 08/23/2017 07:49 AM, Liang, Kan wrote: >> On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan wrote: >>> So I propose testing the attached trivial patch. >>> >>> It doesn’t work. >>> The call stack is the same. >> >> So I would have expected the stack trace to be the same, and I would even >>

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Liang, Kan
> Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk > > On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen <a...@linux.intel.com> wrote: > > > > I believe in this case it's used by threads, so a reference count > > limit wouldn't help. > > For th

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Liang, Kan
> Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk > > On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen wrote: > > > > I believe in this case it's used by threads, so a reference count > > limit wouldn't help. > > For the first migration try, y

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Liang, Kan
> On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan wrote: > > > >> So I propose testing the attached trivial patch. > > > > It doesn’t work. > > The call stack is the same. > > So I would have expected the stack trace to be the same, and I would even > expect the CPU usage to be

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-23 Thread Liang, Kan
> On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan wrote: > > > >> So I propose testing the attached trivial patch. > > > > It doesn’t work. > > The call stack is the same. > > So I would have expected the stack trace to be the same, and I would even > expect the CPU usage to be fairly similar,

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 3:52 PM, Linus Torvalds wrote: > > The *other* memory policies look fairly sane. They basically have a > fairly well-defined preferred node for the policy (although the > "MPOL_INTERLEAVE" looks wrong for a hugepage). But >

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 3:52 PM, Linus Torvalds wrote: > > The *other* memory policies look fairly sane. They basically have a > fairly well-defined preferred node for the policy (although the > "MPOL_INTERLEAVE" looks wrong for a hugepage). But > MPOL_PREFERRED/MPOL_F_LOCAL really looks

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen wrote: > > I believe in this case it's used by threads, so a reference count limit > wouldn't help. For the first migration try, yes. But if it's some kind of "try and try again" pattern, the second time you try and there are

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen wrote: > > I believe in this case it's used by threads, so a reference count limit > wouldn't help. For the first migration try, yes. But if it's some kind of "try and try again" pattern, the second time you try and there are people waiting for the

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Andi Kleen
On Tue, Aug 22, 2017 at 04:08:52PM -0500, Christopher Lameter wrote: > On Tue, 22 Aug 2017, Andi Kleen wrote: > > > We only see it on 4S+ today. But systems are always getting larger, > > so what's a large system today, will be a normal medium scale system > > tomorrow. > > > > BTW we also

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Andi Kleen
On Tue, Aug 22, 2017 at 04:08:52PM -0500, Christopher Lameter wrote: > On Tue, 22 Aug 2017, Andi Kleen wrote: > > > We only see it on 4S+ today. But systems are always getting larger, > > so what's a large system today, will be a normal medium scale system > > tomorrow. > > > > BTW we also

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Christopher Lameter
On Tue, 22 Aug 2017, Andi Kleen wrote: > We only see it on 4S+ today. But systems are always getting larger, > so what's a large system today, will be a normal medium scale system > tomorrow. > > BTW we also collected PT traces for the long hang cases, but it was > hard to find a consistent

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Christopher Lameter
On Tue, 22 Aug 2017, Andi Kleen wrote: > We only see it on 4S+ today. But systems are always getting larger, > so what's a large system today, will be a normal medium scale system > tomorrow. > > BTW we also collected PT traces for the long hang cases, but it was > hard to find a consistent

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 1:53 PM, Peter Zijlstra wrote: > > So _the_ problem with yield() is when you hit this with a RT task it > will busy spin and possibly not allow the task that actually has the > lock to make progress at all. I thought we had explicitly defined yield()

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 1:53 PM, Peter Zijlstra wrote: > > So _the_ problem with yield() is when you hit this with a RT task it > will busy spin and possibly not allow the task that actually has the > lock to make progress at all. I thought we had explicitly defined yield() to not do that. But

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Peter Zijlstra
On Tue, Aug 22, 2017 at 01:42:13PM -0700, Linus Torvalds wrote: > +void wait_on_page_bit_or_yield(struct page *page, int bit_nr) > +{ > + if (PageWaiters(page)) { > + yield(); > + return; > + } > + wait_on_page_bit(page, bit_nr); > +} So _the_ problem with

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Peter Zijlstra
On Tue, Aug 22, 2017 at 01:42:13PM -0700, Linus Torvalds wrote: > +void wait_on_page_bit_or_yield(struct page *page, int bit_nr) > +{ > + if (PageWaiters(page)) { > + yield(); > + return; > + } > + wait_on_page_bit(page, bit_nr); > +} So _the_ problem with

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan wrote: > >> So I propose testing the attached trivial patch. > > It doesn’t work. > The call stack is the same. So I would have expected the stack trace to be the same, and I would even expect the CPU usage to be fairly similar,

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan wrote: > >> So I propose testing the attached trivial patch. > > It doesn’t work. > The call stack is the same. So I would have expected the stack trace to be the same, and I would even expect the CPU usage to be fairly similar, because you'd see

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Liang, Kan
> So I propose testing the attached trivial patch. It doesn’t work. The call stack is the same. 100.00% (821af140) | ---wait_on_page_bit __migration_entry_wait migration_entry_wait do_swap_page

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Liang, Kan
> So I propose testing the attached trivial patch. It doesn’t work. The call stack is the same. 100.00% (821af140) | ---wait_on_page_bit __migration_entry_wait migration_entry_wait do_swap_page

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Andi Kleen
> > Still, generating such a migration storm would be fairly tricky I think. > > Well, Mel seems to have been unable to generate a load that reproduces > the long page waitqueues. And I don't think we've had any other > reports of this either. It could be that it requires a fairly large system.

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Andi Kleen
> > Still, generating such a migration storm would be fairly tricky I think. > > Well, Mel seems to have been unable to generate a load that reproduces > the long page waitqueues. And I don't think we've had any other > reports of this either. It could be that it requires a fairly large system.

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 12:08 PM, Peter Zijlstra wrote: > > So that migration stuff has a filter on, we need two consecutive numa > faults from the same page_cpupid 'hash', see > should_numa_migrate_memory(). Hmm. That is only called for MPOL_F_MORON. We don't actually

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 12:08 PM, Peter Zijlstra wrote: > > So that migration stuff has a filter on, we need two consecutive numa > faults from the same page_cpupid 'hash', see > should_numa_migrate_memory(). Hmm. That is only called for MPOL_F_MORON. We don't actually know what policy the

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 11:56 AM, Peter Zijlstra wrote: > > Won't we now prematurely terminate the wait when we get a spurious > wakeup? I think there's two answers to that: (a) do we even care? (b) what spurious wakeup? The "do we even care" quesiton is because

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 11:56 AM, Peter Zijlstra wrote: > > Won't we now prematurely terminate the wait when we get a spurious > wakeup? I think there's two answers to that: (a) do we even care? (b) what spurious wakeup? The "do we even care" quesiton is because wait_on_page_bit by

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Peter Zijlstra
On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote: > And since everybody touches it, as a result everybody eventually > thinks that page should be migrated to their NUMA node. So that migration stuff has a filter on, we need two consecutive numa faults from the same page_cpupid

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Peter Zijlstra
On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote: > And since everybody touches it, as a result everybody eventually > thinks that page should be migrated to their NUMA node. So that migration stuff has a filter on, we need two consecutive numa faults from the same page_cpupid

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Peter Zijlstra
On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote: > diff --git a/mm/filemap.c b/mm/filemap.c > index a49702445ce0..75c29a1f90fb 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -991,13 +991,11 @@ static inline int > wait_on_page_bit_common(wait_queue_head_t *q, >

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Peter Zijlstra
On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote: > diff --git a/mm/filemap.c b/mm/filemap.c > index a49702445ce0..75c29a1f90fb 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -991,13 +991,11 @@ static inline int > wait_on_page_bit_common(wait_queue_head_t *q, >

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 11:19 AM, Linus Torvalds wrote: > > So I propose testing the attached trivial patch. It may not do > anything at all. But the existing code is actually doing extra work > just to be fragile, in case the scenario above can happen. Side note:

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 11:19 AM, Linus Torvalds wrote: > > So I propose testing the attached trivial patch. It may not do > anything at all. But the existing code is actually doing extra work > just to be fragile, in case the scenario above can happen. Side note: the patch compiles for me. But

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 10:23 AM, Liang, Kan wrote: > > Although the patch doesn't trigger watchdog, the spin lock wait time > is not small (0.45s). > It may get worse again on larger systems. Yeah, I don't think Mel's patch is great - because I think we could do so much

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Linus Torvalds
On Tue, Aug 22, 2017 at 10:23 AM, Liang, Kan wrote: > > Although the patch doesn't trigger watchdog, the spin lock wait time > is not small (0.45s). > It may get worse again on larger systems. Yeah, I don't think Mel's patch is great - because I think we could do so much better. What I like

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Liang, Kan
> > Covering both paths would be something like the patch below which > > spins until the page is unlocked or it should reschedule. It's not > > even boot tested as I spent what time I had on the test case that I > > hoped would be able to prove it really works. > > I will give it a try.

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-22 Thread Liang, Kan
> > Covering both paths would be something like the patch below which > > spins until the page is unlocked or it should reschedule. It's not > > even boot tested as I spent what time I had on the test case that I > > hoped would be able to prove it really works. > > I will give it a try.

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-21 Thread Liang, Kan
> > Because that code sequence doesn't actually depend on > > "wait_on_page_lock()" for _correctness_ anyway, afaik. Anybody who > > does "migration_entry_wait()" _has_ to retry anyway, since the page > > table contents may have changed by waiting. > > > > So I'm not proud of the attached patch,

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-21 Thread Liang, Kan
> > Because that code sequence doesn't actually depend on > > "wait_on_page_lock()" for _correctness_ anyway, afaik. Anybody who > > does "migration_entry_wait()" _has_ to retry anyway, since the page > > table contents may have changed by waiting. > > > > So I'm not proud of the attached patch,

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-21 Thread Mel Gorman
On Fri, Aug 18, 2017 at 12:14:12PM -0700, Linus Torvalds wrote: > On Fri, Aug 18, 2017 at 11:54 AM, Mel Gorman > wrote: > > > > One option to mitigate (but not eliminate) the problem is to record when > > the page lock is contended and pass in TNF_PAGE_CONTENDED (new

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-21 Thread Mel Gorman
On Fri, Aug 18, 2017 at 12:14:12PM -0700, Linus Torvalds wrote: > On Fri, Aug 18, 2017 at 11:54 AM, Mel Gorman > wrote: > > > > One option to mitigate (but not eliminate) the problem is to record when > > the page lock is contended and pass in TNF_PAGE_CONTENDED (new flag) to > >

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 1:29 PM, Liang, Kan wrote: > Here is the profiling with THP disabled for wait_on_page_bit_common and > wake_up_page_bit. > > > The call stack of wait_on_page_bit_common > # Overhead Trace output > # .. > # >100.00%

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 1:29 PM, Liang, Kan wrote: > Here is the profiling with THP disabled for wait_on_page_bit_common and > wake_up_page_bit. > > > The call stack of wait_on_page_bit_common > # Overhead Trace output > # .. > # >100.00% (821aefca) >

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 1:05 PM, Andi Kleen wrote: > > I think what's happening is that it allows more parallelism during wakeup: > > Normally it's like > > CPU 1 CPU 2 CPU 3 > . > > LOCK > wake up tasks on

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 1:05 PM, Andi Kleen wrote: > > I think what's happening is that it allows more parallelism during wakeup: > > Normally it's like > > CPU 1 CPU 2 CPU 3 > . > > LOCK > wake up tasks on other CPUswoken up

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Liang, Kan
> >> > >> That indicates that it may be a hot page and it's possible that the > >> page is locked for a short time but waiters accumulate. What happens > >> if you leave NUMA balancing enabled but disable THP? > > > > No, disabling THP doesn't help the case. > > Interesting. That particular

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Liang, Kan
> >> > >> That indicates that it may be a hot page and it's possible that the > >> page is locked for a short time but waiters accumulate. What happens > >> if you leave NUMA balancing enabled but disable THP? > > > > No, disabling THP doesn't help the case. > > Interesting. That particular

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 12:58 PM, Andi Kleen wrote: >> which is hacky, but there's a rationale for it: >> >> (a) avoid the crazy long wait queues ;) >> >> (b) we know that migration is *supposed* to be CPU-bound (not IO >> bound), so yielding the CPU and retrying may just

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 12:58 PM, Andi Kleen wrote: >> which is hacky, but there's a rationale for it: >> >> (a) avoid the crazy long wait queues ;) >> >> (b) we know that migration is *supposed* to be CPU-bound (not IO >> bound), so yielding the CPU and retrying may just be the right thing >>

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Andi Kleen
> I was really hoping that we'd root-cause this and have a solution (and > then apply Tim's patch as a "belt and suspenders" kind of thing), but One thing I wanted to point out is that Tim's patch seems to make several schedule intensive micro benchmarks faster. I think what's happening is that

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Andi Kleen
> I was really hoping that we'd root-cause this and have a solution (and > then apply Tim's patch as a "belt and suspenders" kind of thing), but One thing I wanted to point out is that Tim's patch seems to make several schedule intensive micro benchmarks faster. I think what's happening is that

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Andi Kleen
> which is hacky, but there's a rationale for it: > > (a) avoid the crazy long wait queues ;) > > (b) we know that migration is *supposed* to be CPU-bound (not IO > bound), so yielding the CPU and retrying may just be the right thing > to do. So this would degenerate into a spin when the

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Andi Kleen
> which is hacky, but there's a rationale for it: > > (a) avoid the crazy long wait queues ;) > > (b) we know that migration is *supposed* to be CPU-bound (not IO > bound), so yielding the CPU and retrying may just be the right thing > to do. So this would degenerate into a spin when the

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 11:54 AM, Mel Gorman wrote: > > One option to mitigate (but not eliminate) the problem is to record when > the page lock is contended and pass in TNF_PAGE_CONTENDED (new flag) to > task_numa_fault(). Well, finding it contended is fairly easy -

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 11:54 AM, Mel Gorman wrote: > > One option to mitigate (but not eliminate) the problem is to record when > the page lock is contended and pass in TNF_PAGE_CONTENDED (new flag) to > task_numa_fault(). Well, finding it contended is fairly easy - just look at the page wait

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Mel Gorman
On Fri, Aug 18, 2017 at 10:48:23AM -0700, Linus Torvalds wrote: > On Fri, Aug 18, 2017 at 9:53 AM, Liang, Kan wrote: > > > >> On Fri, Aug 18, 2017 Mel Gorman wrote: > >> > >> That indicates that it may be a hot page and it's possible that the page is > >> locked for a short

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Mel Gorman
On Fri, Aug 18, 2017 at 10:48:23AM -0700, Linus Torvalds wrote: > On Fri, Aug 18, 2017 at 9:53 AM, Liang, Kan wrote: > > > >> On Fri, Aug 18, 2017 Mel Gorman wrote: > >> > >> That indicates that it may be a hot page and it's possible that the page is > >> locked for a short time but waiters

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 9:53 AM, Liang, Kan wrote: > >> On Fri, Aug 18, 2017 Mel Gorman wrote: >> >> That indicates that it may be a hot page and it's possible that the page is >> locked for a short time but waiters accumulate. What happens if you leave >> NUMA balancing

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 9:53 AM, Liang, Kan wrote: > >> On Fri, Aug 18, 2017 Mel Gorman wrote: >> >> That indicates that it may be a hot page and it's possible that the page is >> locked for a short time but waiters accumulate. What happens if you leave >> NUMA balancing enabled but disable THP?

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 5:23 AM, Mel Gorman wrote: > > new_page = alloc_pages_node(node, > - (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), > + (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) & > ~__GFP_DIRECT_RECLAIM, >

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Linus Torvalds
On Fri, Aug 18, 2017 at 5:23 AM, Mel Gorman wrote: > > new_page = alloc_pages_node(node, > - (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), > + (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) & > ~__GFP_DIRECT_RECLAIM, > HPAGE_PMD_ORDER); That can't make any

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Liang, Kan
> On Fri, Aug 18, 2017 at 02:20:38PM +, Liang, Kan wrote: > > > Nothing fancy other than needing a comment if it works. > > > > > > > No, the patch doesn't work. > > > > That indicates that it may be a hot page and it's possible that the page is > locked for a short time but waiters

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Liang, Kan
> On Fri, Aug 18, 2017 at 02:20:38PM +, Liang, Kan wrote: > > > Nothing fancy other than needing a comment if it works. > > > > > > > No, the patch doesn't work. > > > > That indicates that it may be a hot page and it's possible that the page is > locked for a short time but waiters

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Andi Kleen
> Still, I don't think this problem is THP specific. If there is a hot > regular page getting migrated, we'll also see many threads get > queued up quickly. THP may have made the problem worse as migrating > it takes a longer time, meaning more threads could get queued up. Also THP probably

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Andi Kleen
> Still, I don't think this problem is THP specific. If there is a hot > regular page getting migrated, we'll also see many threads get > queued up quickly. THP may have made the problem worse as migrating > it takes a longer time, meaning more threads could get queued up. Also THP probably

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Tim Chen
On 08/18/2017 07:46 AM, Mel Gorman wrote: > On Fri, Aug 18, 2017 at 02:20:38PM +, Liang, Kan wrote: >>> Nothing fancy other than needing a comment if it works. >>> >> >> No, the patch doesn't work. >> > > That indicates that it may be a hot page and it's possible that the page is > locked for

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Tim Chen
On 08/18/2017 07:46 AM, Mel Gorman wrote: > On Fri, Aug 18, 2017 at 02:20:38PM +, Liang, Kan wrote: >>> Nothing fancy other than needing a comment if it works. >>> >> >> No, the patch doesn't work. >> > > That indicates that it may be a hot page and it's possible that the page is > locked for

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Mel Gorman
On Fri, Aug 18, 2017 at 02:20:38PM +, Liang, Kan wrote: > > Nothing fancy other than needing a comment if it works. > > > > No, the patch doesn't work. > That indicates that it may be a hot page and it's possible that the page is locked for a short time but waiters accumulate. What

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Mel Gorman
On Fri, Aug 18, 2017 at 02:20:38PM +, Liang, Kan wrote: > > Nothing fancy other than needing a comment if it works. > > > > No, the patch doesn't work. > That indicates that it may be a hot page and it's possible that the page is locked for a short time but waiters accumulate. What

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Liang, Kan
> On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote: > > On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote: > > > > > > Here is the call stack of wait_on_page_bit_common when the queue is > > > long (entries >1000). > > > > > > # Overhead Trace output > > > #

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Liang, Kan
> On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote: > > On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote: > > > > > > Here is the call stack of wait_on_page_bit_common when the queue is > > > long (entries >1000). > > > > > > # Overhead Trace output > > > #

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Liang, Kan
> On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote: > > > > Here is the call stack of wait_on_page_bit_common when the queue is > > long (entries >1000). > > > > # Overhead Trace output > > # .. > > # > >100.00% (931aefca) > >

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Liang, Kan
> On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote: > > > > Here is the call stack of wait_on_page_bit_common when the queue is > > long (entries >1000). > > > > # Overhead Trace output > > # .. > > # > >100.00% (931aefca) > > | > >

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Mel Gorman
On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote: > On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote: > > > > Here is the call stack of wait_on_page_bit_common > > when the queue is long (entries >1000). > > > > # Overhead Trace output > > #

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-18 Thread Mel Gorman
On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote: > On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote: > > > > Here is the call stack of wait_on_page_bit_common > > when the queue is long (entries >1000). > > > > # Overhead Trace output > > # .. > > # > >

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-17 Thread Linus Torvalds
On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote: > > Here is the call stack of wait_on_page_bit_common > when the queue is long (entries >1000). > > # Overhead Trace output > # .. > # >100.00% (931aefca) > | >

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-17 Thread Linus Torvalds
On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote: > > Here is the call stack of wait_on_page_bit_common > when the queue is long (entries >1000). > > # Overhead Trace output > # .. > # >100.00% (931aefca) > | > ---wait_on_page_bit >

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-17 Thread Liang, Kan
> > Here is the wake_up_page_bit call stack when the workaround is running, > which > > is collected by perf record -g -a -e probe:wake_up_page_bit -- sleep 10 > > It's actually not really wake_up_page_bit() that is all that > interesting, it would be more interesting to see which path it is

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-17 Thread Liang, Kan
> > Here is the wake_up_page_bit call stack when the workaround is running, > which > > is collected by perf record -g -a -e probe:wake_up_page_bit -- sleep 10 > > It's actually not really wake_up_page_bit() that is all that > interesting, it would be more interesting to see which path it is

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-17 Thread Linus Torvalds
On Thu, Aug 17, 2017 at 9:17 AM, Liang, Kan wrote: > > We tried both 12 and 16 bit table and that didn't make a difference. > The long wake ups are mostly on the same page when we do instrumentation Ok. > Here is the wake_up_page_bit call stack when the workaround is

Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-17 Thread Linus Torvalds
On Thu, Aug 17, 2017 at 9:17 AM, Liang, Kan wrote: > > We tried both 12 and 16 bit table and that didn't make a difference. > The long wake ups are mostly on the same page when we do instrumentation Ok. > Here is the wake_up_page_bit call stack when the workaround is running, which > is

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-17 Thread Liang, Kan
> On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen > wrote: > > We encountered workloads that have very long wake up list on large > > systems. A waker takes a long time to traverse the entire wake list > > and execute all the wake functions. > > > > We saw page wait list

RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-17 Thread Liang, Kan
> On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen > wrote: > > We encountered workloads that have very long wake up list on large > > systems. A waker takes a long time to traverse the entire wake list > > and execute all the wake functions. > > > > We saw page wait list that are up to 3700+ entries

  1   2   >