Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

2007-08-13 Thread Nick Piggin
On Mon, Aug 13, 2007 at 10:05:01AM -0400, Lee Schermerhorn wrote:
> On Mon, 2007-08-13 at 09:43 +0200, Nick Piggin wrote:
> > 
> > Replication may be putting more stress on some locks. It will cause more
> > tlb flushing that can not be batched well, which could cause the call_lock
> > to get hotter. Then i_mmap_lock is held over tlb flushing, so it will
> > inherit the latency from call_lock. (If this is the case, we could
> > potentially extend the tlb flushing API slightly to cope better with
> > unmapping of pages from multiple mm's, but that comes way down the track
> > when/if replication proves itself!).
> > 
> > tlb flushing AFAIKS should not do the IPI unless it is dealing with a
> > multithreaded mm... does usex use threads?
> 
> Yes.  Apparently, there are some tests, perhaps some of the /usr/bin
> apps that get run repeatedly, that are multi-threaded.  This job mix
> caught a number of races in my auto-migration patches when
> multi-threaded tasks race in the page fault paths.
> 
> More below...

Hmm, come to think of it: I'm a bit mistaken. The replica zaps will often
be coming from _other_ CPUs, so they will require an IPI regardless of
whether they are threaded or not.

The generic ia64 tlb flushing code also does a really bad job at flushing one
'mm' from another: it uses the single-threaded smp_call_function and broadcasts
IPIs (and TLB invalidates) to ALL CPUs, regardless of the cpu_vm_mask of the
target process. So you have a multiplicative problem with call_lock.

I think this path could be significantly optimised... but it's a bit nasty
to be playing around with the TLB flushing code while trying to test
something else :P

Can we make a simple change to smp_flush_tlb_all to do
smp_flush_tlb_cpumask(cpu_online_map), rather than on_each_cpu()? At least
then it will use the direct IPI vector and avoid call_lock.
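
Roughly this is what I have in mind -- an untested sketch against
arch/ia64/kernel/smp.c, assuming smp_flush_tlb_cpumask() is available in
your tree as it is in mainline:

void
smp_flush_tlb_all (void)
{
	/*
	 * Send the flush via the dedicated TLB-flush IPI to every online
	 * CPU, instead of going through on_each_cpu()/smp_call_function(),
	 * which serialises all callers on call_lock.
	 */
	smp_flush_tlb_cpumask(cpu_online_map);
}

The intent is unchanged (every online CPU still purges its TLB); only the
IPI delivery path should differ.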


> > Ah, so it does eventually die? Any hints of why?
> 
> No, doesn't die--as in panic.  I was just commenting that I'd leave it
> running ...  However [:-(], it DID hang again.  The test window said
> that the tests ran for 62h:28m before the screen stopped updating.  In
> another window, I was running a script to snap the replication and #
> file pages vmstats, along with a timestamp, every 10 minutes.  That
> stopped reporting stats at about 7:30am on Saturday--about 14h:30m into
> the test.  It still wrote the timestamps [date command] until around 7am
> this morning [Monday]--or ~62 hours into test.
> 
> So, I do have ~14 hours of replication stats that I can send you or plot
> up...

If you think it could be useful, sure.

 
> Re: the hang:  again, console was scrolling soft lockups continuously.
> Checking the messages file, I see hangs in copy_process(),
> smp_call_function [as in prev test], vma_link [from mmap], ...

I don't suppose it should hang even if it is encountering 10s delays on
call_lock but I wonder how it would go with the tlb flush change.
With luck, it would add more concurrency and make it hang _faster_ ;)


> > Yeah I guess it can be a little misleading: as time approaches infinity,
> > zaps will probably approach replications. But that doesn't tell you how
> > long a replica stayed around and usefully fed CPUs with local memory...
> 
> May be able to capture that info with a more invasive patch -- e.g., add
> a timestamp to the page struct.  I'll think about it.

Yeah that actually could be a good approach. You could make a histogram
of lifetimes which would be a decent metric to start tuning with. Ideally
you'd also want to record some context of what caused the zap and the status
of the file, but it may be difficult to get a good S/N on those metrics.
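
For instance, something like this (purely illustrative -- repl_stamp is a
hypothetical field set to jiffies when the replica is created, not
something in the current patch):

#include <linux/kernel.h>
#include <linux/bitops.h>
#include <linux/jiffies.h>
#include <linux/mm.h>

static unsigned long repl_lifetime_hist[BITS_PER_LONG];

/* Called when a replica is zapped: bucket its age into a log2 histogram. */
static void note_replica_lifetime(struct page *page)
{
	unsigned long age = jiffies - page->repl_stamp;

	repl_lifetime_hist[min_t(int, fls_long(age), BITS_PER_LONG - 1)]++;
}

The histogram could then be dumped through debugfs or tacked onto
/proc/vmstat for a first look at how long replicas survive.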

 
> And, I'll keep you posted.  Not sure how much time I'll be able to
> dedicate to this patch stream.  Got a few others I need to get back
> to...

Thanks, I appreciate it. I'm pretty much in the same boat, just spending a
bit of time on it here and there.

Thanks,
Nick


Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

2007-08-13 Thread Lee Schermerhorn
On Mon, 2007-08-13 at 09:43 +0200, Nick Piggin wrote:
> On Fri, Aug 10, 2007 at 05:08:18PM -0400, Lee Schermerhorn wrote:
> > On Wed, 2007-08-08 at 16:25 -0400, Lee Schermerhorn wrote:
> > > On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > > > Hi,
> > > > 
> > > > Just got a bit of time to take another look at the replicated pagecache
> > > > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > > > gives me more confidence in the locking now; the new ->fault API makes
> > > > MAP_SHARED write faults much more efficient; and a few bugs were found
> > > > and fixed.
> > > > 
> > > > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > > > tests...
> > > > 
<snip>

> 
> Hi Lee,
> 
> Am sick with the flu for the past few days, so I haven't done much more
> work here, but I'll just add some (not very useful) comments
> 
> The get_page_from_freelist hang is quite strange. It would be zone->lock,
> which shouldn't have too much contention...
> 
> Replication may be putting more stress on some locks. It will cause more
> tlb flushing that can not be batched well, which could cause the call_lock
> to get hotter. Then i_mmap_lock is held over tlb flushing, so it will
> inherit the latency from call_lock. (If this is the case, we could
> potentially extend the tlb flushing API slightly to cope better with
> unmapping of pages from multiple mm's, but that comes way down the track
> when/if replication proves itself!).
> 
> tlb flushing AFAIKS should not do the IPI unless it is dealing with a
> multithreaded mm... does usex use threads?

Yes.  Apparently, there are some tests, perhaps some of the /usr/bin
apps that get run repeatedly, that are multi-threaded.  This job mix
caught a number of races in my auto-migration patches when
multi-threaded tasks race in the page fault paths.

More below...

> 
> 
> > I should note that I was trying to unmap all mappings to the file backed
> > pages on internode task migration, instead of just the current task's
> > pte's.  However, I was only attempting this on pages with mapcount <= 4.
> > So, I don't think I was looping trying to unmap pages with mapcounts of
> > several 10s--such as I see on some page cache pages in my traces.
> 
> Replication teardown would still have to unmap all... but that shouldn't
> particularly be any worse than, say, page reclaim (except I guess that it
> could occur more often).
> 
>  
> > Today, after rebasing to 23-rc2-mm2, I added a patch to unmap only the
> > current task's ptes for ALL !anon pages, regardless of mapcount.  I've
> > started the test again and will let it run over the weekend--or as long as
> > it stays up, whichever is shorter :-).
> 
> Ah, so it does eventually die? Any hints of why?

No, doesn't die--as in panic.  I was just commenting that I'd leave it
running ...  However [:-(], it DID hang again.  The test window said
that the tests ran for 62h:28m before the screen stopped updating.  In
another window, I was running a script to snap the replication and #
file pages vmstats, along with a timestamp, every 10 minutes.  That
stopped reporting stats at about 7:30am on Saturday--about 14h:30m into
the test.  It still wrote the timestamps [date command] until around 7am
this morning [Monday]--or ~62 hours into test.

So, I do have ~14 hours of replication stats that I can send you or plot
up...

Re: the hang:  again, console was scrolling soft lockups continuously.
Checking the messages file, I see hangs in copy_process(),
smp_call_function [as in prev test], vma_link [from mmap], ...

I also see a number of NaT ["not a thing"] consumptions--ia64 specific
error, probably invalid pointer deref--in swapin_readahead, which my
patches hack.  These might be the cause of the fork/mmap hangs.

Didn't see that in the 8-9Aug runs, so it might be a result of continued
operation after other hangs/problems; or a botch in the rebase to
rc2-mm2.  In any case, I have some work to do there...

> 
> > 
> > I put a tarball with the rebased series in the Replication directory linked
> > above, in case you're interested.  I haven't added the patch description for
> > the new patch yet, but it's pretty simple.  Maybe even correct.
> > 
> > 
> > 
> > Unrelated to the lockups  [I think]:
> > 
> > I forgot to look before I rebooted, but earlier the previous evening, I
> > checked the vmstats and at that point [~1.5 hours into the test] we had
> > done ~4.88 million replications and ~4.8 million "zaps" [collapse of
> > replicated page].  That's around 98% zaps.  Do we need some filter in the
> > fault path to reduce the "thrashing"--if that's what I'm seeing?
> 
> Yep. The current replication patch is very much only infrastructure at
> this stage (and is good for stress testing). I feel sure that heuristics
> and perhaps tunables would be needed to make the most of it.

Yeah.  I have some ideas to try...

At the end of the 14.5 hours when it stopped dumping vmstats, we were at
~95% zaps.

Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

2007-08-13 Thread Nick Piggin
On Fri, Aug 10, 2007 at 05:08:18PM -0400, Lee Schermerhorn wrote:
> On Wed, 2007-08-08 at 16:25 -0400, Lee Schermerhorn wrote:
> > On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > > Hi,
> > > 
> > > Just got a bit of time to take another look at the replicated pagecache
> > > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > > gives me more confidence in the locking now; the new ->fault API makes
> > > MAP_SHARED write faults much more efficient; and a few bugs were found
> > > and fixed.
> > > 
> > > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > > tests...
> > > 
> > 
> > Sending this out to give Nick an update and to give the list a
> > heads up on what I've found so far with the replication patch.
> > 
> > I have rebased Nick's recent pagecache replication patch against
> > 2.6.23-rc1-mm2, atop my memory policy and auto/lazy migration
> > patch sets.  These include:
> > 
> > + shared policy
> > + migrate-on-fault a.k.a. lazy migration
> > + auto-migration - trigger lazy migration on inter-node task
> >task migration
> > + migration cache - pseudo-swap cache for parking unmapped
> > anon pages awaiting migrate-on-fault
> > 
> > I added a couple of patches to fix up the interaction of replication
> > with migration [discussed more below] and a per cpuset control to
> > enable/disable replication.  The latter allowed me to boot successfully
> > and to survive any bugs encountered by restricting the effects to 
> > tasks in the test cpuset with replication enabled.  That was the
> > theory, anyway :-).  Mostly worked...
> 
> After I sent out the last update, I ran a usex job mix overnight ~19.5 hours.
> When I came in the next morning, the console window was full of soft lockups
> on various cpus with various stack traces.  /var/log/messages showed 142
> in all.
> 
> I've placed the soft lockup reports from /var/log/messages in the Replication
> directory on free.linux:
> 
>   http://free.linux.hp.com/~lts/Patches/Replication.
> 
> The lockups appeared in several places in the traces I looked at.  Here's a
> couple of examples:
> 
> + unlink_file_vma() from free_pgtables() during task exit:
>   mapping->i_mmap_lock ???
> 
> + smp_call_function() from ia64_global_tlb_purge().
> Maybe the 'call_lock' in arch/ia64/kernel/smp.c ?
>   Traces show us getting to here in one of 2 ways:
> 
>   1) try_to_unmap* during auto task migration [migrate_pages_unmap_only()...]
> 
>   2) from zap_page_range() when __unreplicate_pcache() calls 
> unmap_mapping_range.
> 
> + get_page_from_freelist -> zone_lru_lock?
> 
> An interesting point:  all of the soft lockup messages said that the cpu was
> locked for 11s.  Ring any bells?

Hi Lee,

Am sick with the flu for the past few days, so I haven't done much more
work here, but I'll just add some (not very useful) comments

The get_page_from_freelist hang is quite strange. It would be zone->lock,
which shouldn't have too much contention...

Replication may be putting more stress on some locks. It will cause more
tlb flushing that can not be batched well, which could cause the call_lock
to get hotter. Then i_mmap_lock is held over tlb flushing, so it will
inherit the latency from call_lock. (If this is the case, we could
potentially extend the tlb flushing API slightly to cope better with
unmapping of pages from multiple mm's, but that comes way down the track
when/if replication proves itself!).

tlb flushing AFAIKS should not do the IPI unless it is dealing with a
multithreaded mm... does usex use threads?


> I should note that I was trying to unmap all mappings to the file backed pages
> on internode task migration, instead of just the current task's pte's.
> However, I was only attempting this on pages with mapcount <= 4.  So, I don't
> think I was looping trying to unmap pages with mapcounts of several 10s--such
> as I see on some page cache pages in my traces.

Replication teardown would still have to unmap all... but that shouldn't
particularly be any worse than, say, page reclaim (except I guess that it
could occur more often).

 
> Today, after rebasing to 23-rc2-mm2, I added a patch to unmap only the current
> task's ptes for ALL !anon pages, regardless of mapcount.  I've started the
> test again and will let it run over the weekend--or as long as it stays up,
> whichever is shorter :-).

Ah, so it does eventually die? Any hints of why?

> 
> I put a tarball with the rebased series in the Replication directory linked
> above, in case you're interested.  I haven't added the patch description for
> the new patch yet, but it's pretty simple.  Maybe even correct.
> 
> 
> 
> Unrelated to the lockups  [I think]:
> 
> I forgot to look before I rebooted, but earlier the previous evening, I
> checked the vmstats and at that point [~1.5 hours into the test] we had done
> ~4.88 million replications and ~4.8 million "zaps" [collapse of replicated
> page].  That's around 98% zaps.  Do we need some filter in the fault path to
> reduce the "thrashing"--if that's what I'm seeing?

Yep. The current replication patch is very much only infrastructure at
this stage (and is good for stress testing). I feel sure that heuristics
and perhaps tunables would be needed to make the most of it.


Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

2007-08-10 Thread Lee Schermerhorn
On Wed, 2007-08-08 at 16:25 -0400, Lee Schermerhorn wrote:
> On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > Hi,
> > 
> > Just got a bit of time to take another look at the replicated pagecache
> > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > gives me more confidence in the locking now; the new ->fault API makes
> > MAP_SHARED write faults much more efficient; and a few bugs were found
> > and fixed.
> > 
> > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > tests...
> > 
> 
> Sending this out to give Nick an update and to give the list a
> heads up on what I've found so far with the replication patch.
> 
> I have rebased Nick's recent pagecache replication patch against
> 2.6.23-rc1-mm2, atop my memory policy and auto/lazy migration
> patch sets.  These include:
> 
> + shared policy
> + migrate-on-fault a.k.a. lazy migration
> + auto-migration - trigger lazy migration on inter-node task
>task migration
> + migration cache - pseudo-swap cache for parking unmapped
> anon pages awaiting migrate-on-fault
> 
> I added a couple of patches to fix up the interaction of replication
> with migration [discussed more below] and a per cpuset control to
> enable/disable replication.  The latter allowed me to boot successfully
> and to survive any bugs encountered by restricting the effects to 
> tasks in the test cpuset with replication enabled.  That was the
> theory, anyway :-).  Mostly worked...

After I sent out the last update, I ran a usex job mix overnight for ~19.5
hours.  When I came in the next morning, the console window was full of soft
lockups on various cpus with various stack traces.  /var/log/messages showed
142 in all.

I've placed the soft lockup reports from /var/log/messages in the Replication
directory on free.linux:

http://free.linux.hp.com/~lts/Patches/Replication.

The lockups appeared in several places in the traces I looked at.  Here's a
couple of examples:

+ unlink_file_vma() from free_pgtables() during task exit:
mapping->i_mmap_lock ???

+ smp_call_function() from ia64_global_tlb_purge().
  Maybe the 'call_lock' in arch/ia64/kernel/smp.c ?
  Traces show us getting to here in one of 2 ways:

  1) try_to_unmap* during auto task migration [migrate_pages_unmap_only()...]

  2) from zap_page_range() when __unreplicate_pcache() calls 
unmap_mapping_range.

+ get_page_from_freelist -> zone_lru_lock?

An interesting point:  all of the soft lockup messages said that the cpu was
locked for 11s.  Ring any bells?


I should note that I was trying to unmap all mappings to the file backed pages
on internode task migration, instead of just the current task's pte's.  However,
I was only attempting this on pages with  mapcount <= 4.  So, I don't think I 
was looping trying to unmap pages with mapcounts of several 10s--such as I see
on some page cache pages in my traces.

Today, after rebasing to 23-rc2-mm2, I added a patch to unmap only the current
task's ptes for ALL !anon pages, regardless of mapcount.  I've started the test
again and will let it run over the weekend--or as long as it stays up,
whichever is shorter :-).

I put a tarball with the rebased series in the Replication directory linked
above, in case you're interested.  I haven't added the patch description for
the new patch yet, but it's pretty simple.  Maybe even correct.



Unrelated to the lockups  [I think]:

I forgot to look before I rebooted, but earlier the previous evening, I checked
the vmstats and at that point [~1.5 hours into the test] we had done ~4.88
million replications and ~4.8 million "zaps" [collapse of replicated page].
That's around 98% zaps.  Do we need some filter in the fault path to reduce the
"thrashing"--if that's what I'm seeing?

A while back I took a look at the Virtual Iron page replication patch.  They
had set VM_DENY_WRITE when mapping shared executable segments, and only
replicated pages in those VMAs.  Maybe DENY_WRITE isn't exactly what we want.
Possibly we could set another flag for shared executables, if we can detect
them, and for any shared mapping that has no writable mappings?
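
Just to illustrate the idea (hand-wavy sketch, not from any existing patch;
the helper name is made up), the fault-path filter could look something like:

#include <linux/mm.h>
#include <linux/fs.h>

/*
 * Only consider replicating pages faulted through a mapping that nobody
 * can write through: an executable mapping, or a shared file mapping
 * whose file currently has no writable mappings at all.
 */
static int vma_wants_replication(struct vm_area_struct *vma)
{
	struct address_space *mapping =
		vma->vm_file ? vma->vm_file->f_mapping : NULL;

	if (!mapping)
		return 0;
	if (vma->vm_flags & VM_EXEC)
		return 1;
	return (vma->vm_flags & VM_SHARED) &&
		!mapping_writably_mapped(mapping);
}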

I'll try to remember to check the replication statistics after the currently
running test.  If the system stays up, that is.  A quick look < 10 minutes into
the test shows that zaps are now ~84% of replications.  Also, ~47k replicated
pages out of ~287K file pages.

Lee


Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

2007-08-08 Thread Lee Schermerhorn
On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> Hi,
> 
> Just got a bit of time to take another look at the replicated pagecache
> patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> gives me more confidence in the locking now; the new ->fault API makes
> MAP_SHARED write faults much more efficient; and a few bugs were found
> and fixed.
> 
> More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> tests...
> 

Sending this out to give Nick an update and to give the list a
heads up on what I've found so far with the replication patch.

I have rebased Nick's recent pagecache replication patch against
2.6.23-rc1-mm2, atop my memory policy and auto/lazy migration
patch sets.  These include:

+ shared policy
+ migrate-on-fault a.k.a. lazy migration
+ auto-migration - trigger lazy migration on inter-node task migration
+ migration cache - pseudo-swap cache for parking unmapped
anon pages awaiting migrate-on-fault

I added a couple of patches to fix up the interaction of replication
with migration [discussed more below] and a per cpuset control to
enable/disable replication.  The latter allowed me to boot successfully
and to survive any bugs encountered by restricting the effects to 
tasks in the test cpuset with replication enabled.  That was the
theory, anyway :-).  Mostly worked...

Rather than spam the list, I've placed the entire quilt series that
I'm testing, less the 23-rc1 and 23-rc1-mm2 patches, at:

http://free.linux.hp.com/~lts/Patches/Replication/

It's the 070808 tarball.

I plan to measure the effects on performance with various combinations
of these features enabled.  First, however, I ran into one problem that
required me to investigate further.  In the migrate-on-fault set, I've
introduced a function named "migrate_page_unmap_only()".  It parallels
Christoph's "migrate_pages()" but for lazy migration, it just removes
the pte mappings from the selected pages so that they will incur a fault
on next touch and be migrated to the node specified by policy, if
necessary and "easy" to do.  [don't want to try too hard, as this is 
just a performance optimization.  supposed to be, anyway.]

In migrate_page_unmap_only(), I had a BUG_ON to catch [non-anon] pages
with a NULL page_mapping().  I never hit this in my testing until I
added in the page replication.  To investigate, I took the opportunity
to update my mmtrace instrumentation.   I added a few trace points for
Nick's replication functions and replaced the BUG_ON with a trace
point and skipped pages w/ a NULL mapping.  The kernel patches are in
the patch tarball at the link above.  The user space tools are available
at:

http://free.linux.hp.com/~lts/Tools/mmtrace-latest.tar.gz

A rather large tarball containing formatted traces from a usex run
that hit the NULL mapping trace point is also available from the
replication patches directory linked above.  I've extracted traces
related to the "bug check" and annotated them--also in the tarball.
See the README.

So what's happening?

I think I'm hitting a race between the page replication code when it
"unreplicates" a page and a task that references one of the replicas
attempting to unmap that replica for lazy migration.  When "unreplicating"
a page, the replication patch nulls out all of the mappings for the 
"slave pages", without locking the pages or otherwise coordinating with
other possible accesses to the page, and then calls unmap_mapping_range()
to unmap them.  Meanwhile, these pages are still referenced by various tasks'
page tables.  

One interesting thing I see in the traces is that, in the couple of
instances I looked at, the attempt to unmap [migrate_pages_unmap_only()]
came approximately a second after the __unreplicate_pcache() call that
apparently nulled out the mapping.  I.e., the slave page remained
referenced by the task's page table for almost a second after unreplication.
Nick does have a comment about unmap_mapping_range() sleeping, but a
second seems like a long time.

I don't know whether this is a real problem or not.  I removed the 
BUG_ON and now just skip pages with NULL mapping.  They're being removed
anyway.  I'm running a stress test now, and haven't seen any obvious
problems yet.  I do have concerns, tho'.  Page migration assumes that
if it can successfully isolate a page from the LRU and lock it, it has
pretty much exclusive access.
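
The check I ended up with amounts to something like the following
(illustrative helper only, not the literal code in the patch tarball):

#include <linux/mm.h>

/*
 * A !anon pagecache page can lose its ->mapping while still referenced
 * from page tables if __unreplicate_pcache() tears a replica down under
 * us, so treat a NULL mapping as "skip this page" rather than BUG().
 */
static inline int unmap_only_skip_page(struct page *page)
{
	return !PageAnon(page) && page_mapping(page) == NULL;
}

migrate_page_unmap_only() just tests this and moves on to the next page
instead of hitting the old BUG_ON.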

Direct migration [Christoph's implementation] is a bit stricter regarding
reference and map counts, unless "MOVE_ALL" is specified.  In my lazy
migration patches, I want to be able to unmap pages with multiple pte
references [currently a per cpuset tunable threshold] to test the
performance impact of trying harder to unmap vs being able to migrate
fewer pages.  

I'm also seeing a lot of "thrashing"--pages being repeatedly replicated
and unreplicated on every other fault to the page.  I haven't investigated
how long the intervals are between the faults, so maybe 


Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

2007-07-30 Thread Lee Schermerhorn
On Mon, 2007-07-30 at 05:16 +0200, Nick Piggin wrote:
> On Fri, Jul 27, 2007 at 10:30:47AM -0400, Lee Schermerhorn wrote:
> > On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > > Hi,
> > > 
> > > Just got a bit of time to take another look at the replicated pagecache
> > > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > > gives me more confidence in the locking now; the new ->fault API makes
> > > MAP_SHARED write faults much more efficient; and a few bugs were found
> > > and fixed.
> > > 
> > > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > > tests...
> > > 
> > > --
> > > 
> > > Page-based NUMA pagecache replication.
> > <snip really big patch!>
> > 
> > Hi, Nick.
> > 
> > Glad to see you're back on this.  It's been on my list, but delayed by
> > other patch streams...
> 
> Yeah, thought I should keep it alive :) Patch is against 2.6.23-rc1.

D'Oh!  :-(  You could have just said "Read the subject line, Lee!"
> 
>  
> > As I mentioned to you in prior mail, I want to try to integrate this
> > atop my "auto/lazy migration" patches, such that when a task moves to a
> > new node, we remove just that task's pte ref's to page cache pages
> > [along with all refs to anon pages, as I do now] so that the task will
> > take a fault on next touch and either use an existing local copy or
> > replicate the page at that time.  Unfortunately, that's in the queue
> > behind the memoryless node patches and my stalled shared policy patches,
> > among other things :-(.
> 
> That's OK. It will likely be a long process to get any of this in...
> As you know, replicated currently needs some of your automigration
> infrastructure in order to get ptes pointing to the right places
> after a task migration. I'd like to try some experiments with them on
> a larger system, once you get time to update your patchset...

I'll try to make a pass this week, maybe next...

Lee


Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

2007-07-29 Thread Nick Piggin
On Fri, Jul 27, 2007 at 10:30:47AM -0400, Lee Schermerhorn wrote:
> On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > Hi,
> > 
> > Just got a bit of time to take another look at the replicated pagecache
> > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > gives me more confidence in the locking now; the new ->fault API makes
> > MAP_SHARED write faults much more efficient; and a few bugs were found
> > and fixed.
> > 
> > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > tests...
> > 
> > --
> > 
> > Page-based NUMA pagecache replication.
> <snip really big patch!>
> 
> Hi, Nick.
> 
> Glad to see you're back on this.  It's been on my list, but delayed by
> other patch streams...

Yeah, thought I should keep it alive :) Patch is against 2.6.23-rc1.

 
> As I mentioned to you in prior mail, I want to try to integrate this
> atop my "auto/lazy migration" patches, such that when a task moves to a
> new node, we remove just that task's pte ref's to page cache pages
> [along with all refs to anon pages, as I do now] so that the task will
> take a fault on next touch and either use an existing local copy or
> replicate the page at that time.  Unfortunately, that's in the queue
> behind the memoryless node patches and my stalled shared policy patches,
> among other things :-(.

That's OK. It will likely be a long process to get any of this in...
As you know, replicated currently needs some of your automigration
infrastructure in order to get ptes pointing to the right places
after a task migration. I'd like to try some experiments with them on
a larger system, once you get time to update your patchset...

Thanks,
Nick


Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

2007-07-27 Thread Lee Schermerhorn
On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> Hi,
> 
> Just got a bit of time to take another look at the replicated pagecache
> patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> gives me more confidence in the locking now; the new ->fault API makes
> MAP_SHARED write faults much more efficient; and a few bugs were found
> and fixed.
> 
> More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> tests...
> 
> --
> 
> Page-based NUMA pagecache replication.


Hi, Nick.

Glad to see you're back on this.  It's been on my list, but delayed by
other patch streams...

As I mentioned to you in prior mail, I want to try to integrate this
atop my "auto/lazy migration" patches, such that when a task moves to a
new node, we remove just that task's pte refs to page cache pages
[along with all refs to anon pages, as I do now] so that the task will
take a fault on next touch and either use an existing local copy or
replicate the page at that time.  Unfortunately, that's in the queue
behind the memoryless node patches and my stalled shared policy patches,
among other things :-(.
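
Roughly the sort of thing I have in mind for the page cache side -- a
hand-waving sketch only, not code from my patches; the helper name is made
up, and special-case vmas, locking details, etc. are glossed over:

/*
 * Hand-waving sketch: on migration of a task to a new node, drop just
 * that task's pagetable references to file-backed pages, so the next
 * touch refaults and can find (or create) a node-local copy.  Anon
 * pages are handled separately, as today.
 */
static void zap_task_file_ptes(struct task_struct *tsk)
{
	struct mm_struct *mm = get_task_mm(tsk);
	struct vm_area_struct *vma;

	if (!mm)
		return;

	down_read(&mm->mmap_sem);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (!vma->vm_file)
			continue;	/* anon: not our problem here */
		zap_page_range(vma, vma->vm_start,
			       vma->vm_end - vma->vm_start, NULL);
	}
	up_read(&mm->mmap_sem);
	mmput(mm);
}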

Also, what kernel is this patch against?  Diffs just say "linux-2.6".
Somewhat ambiguous...

Lee

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

2007-07-27 Thread Nick Piggin
Hi,

Just got a bit of time to take another look at the replicated pagecache
patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
give me more confidence in the locking now; the new ->fault API makes
MAP_SHARED write faults much more efficient; and a few bugs were found
and fixed.

More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
tests...

--

Page-based NUMA pagecache replication.

This is a scheme for page replication that replicates read-only pagecache pages
opportunistically, at pagecache lookup time (at points where we know the
page is being looked up for read only).

The page will be replicated if it resides on a different node from the one the
requesting CPU is on. The original page must also meet some conditions: it must
be clean, uptodate, not under writeback, and must not have an elevated refcount
or filesystem private data. It is, however, allowed to be mapped into
pagetables.
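
To make those conditions concrete, the eligibility test amounts to something
like the sketch below. Illustrative only -- the helper name is mine and the
refcount arithmetic is elided; it is not the code from the patch:

/*
 * Sketch: can this pagecache page be copied to another node?
 */
static int may_replicate_page(struct page *page)
{
	if (!PageUptodate(page))
		return 0;		/* must be fully read in */
	if (PageDirty(page) || PageWriteback(page))
		return 0;		/* must be clean and stable */
	if (PagePrivate(page))
		return 0;		/* no filesystem private data */
	/*
	 * The refcount must not be elevated beyond the pagecache ref,
	 * the caller's ref and any pagetable mappings (mappings are OK
	 * and are accounted via page_mapcount()).  The exact arithmetic
	 * depends on the call site, so it is left out here.
	 */
	return 1;
}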

Replication is done at the pagecache level, where a replicated pagecache
(inode,offset) key will have a special bit set in its radix-tree entry,
which tells us the entry points to a descriptor rather than a page.

This descriptor (struct pcache_desc) has another radix-tree which is keyed by
node. The pagecache gains an (optional) 3rd dimension!
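
Conceptually, decoding a radix-tree slot on the read side then looks like the
sketch below. Again illustrative only: PCACHE_DESC_BIT and the helpers are
made-up names; only struct pcache_desc corresponds to the patch below:

#define PCACHE_DESC_BIT	1UL	/* stand-in for the "special bit" */

static inline int is_pcache_desc(void *slot)
{
	return (unsigned long)slot & PCACHE_DESC_BIT;
}

static inline struct pcache_desc *slot_to_pcache_desc(void *slot)
{
	return (void *)((unsigned long)slot & ~PCACHE_DESC_BIT);
}

/* Prefer a replica on the local node; fall back to the master copy. */
static struct page *local_copy_or_master(void *slot)
{
	struct pcache_desc *pcd;
	struct page *page;

	if (!is_pcache_desc(slot))
		return slot;		/* unreplicated: slot is the page */

	pcd = slot_to_pcache_desc(slot);
	page = radix_tree_lookup(&pcd->page_tree, numa_node_id());
	if (!page)
		page = pcd->master;	/* no replica on this node (yet) */
	return page;
}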

Pagecache lookups which are not explicitly denoted as read-only are treated
as writes, and they collapse the replication before proceeding. Writes through
pagetables are caught via the same mechanism that dirty page throttling uses,
and also collapse the replication.

After collapsing a replication, all process page tables are unmapped, so that
any processes mapping discarded pages will refault in the correct one.
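
So collapsing is conceptually along these lines -- a sketch only; helper
names are mine, locking is omitted, and the caller is assumed to put the bare
master pointer back into the mapping's radix-tree slot in place of the tagged
descriptor:

/* Sketch: tear down a replication, keeping only the master copy. */
static struct page *collapse_replication(struct address_space *mapping,
					 pgoff_t offset,
					 struct pcache_desc *pcd)
{
	struct page *master = pcd->master;
	int nid;

	/* Discard every per-node replica; the master itself stays. */
	for_each_node_mask(nid, pcd->nodes_present) {
		struct page *page;

		page = radix_tree_delete(&pcd->page_tree, nid);
		if (page && page != master)
			page_cache_release(page);
	}

	/*
	 * Unmap all pagetable mappings of this (mapping, offset), so any
	 * process that had a replica mapped refaults and finds the master.
	 */
	unmap_mapping_range(mapping, (loff_t)offset << PAGE_CACHE_SHIFT,
			    PAGE_CACHE_SIZE, 0);

	kfree(pcd);	/* assumes the descriptor was kmalloc()ed */
	return master;
}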

/proc/vmstat has nr_repl_pages, which is the number of _additional_ pages
replicated, beyond the first.
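
Watching those counters from userspace is just a matter of scraping
/proc/vmstat; e.g. a trivial (purely illustrative) snippet:

#include <stdio.h>
#include <string.h>

/* Print the *repl* counters (nr_repl_pages etc.) from /proc/vmstat. */
int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "repl"))
			fputs(line, stdout);
	fclose(f);
	return 0;
}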

Status:
- Lee showed that ~10s (1%) of user time was cut from a kernel compile benchmark
  on his 4-node, 16-way box.

Todo:
- find_get_page locking semantics are slightly changed. This doesn't appear
  to be a problem but I need to have a more thorough look.
- Would like to be able to control replication via userspace, and maybe
  even internally to the kernel.
- Ideally, reclaim would preferentially reclaim replicated pages; however, I
  aim to be _minimally_ intrusive, and this conflicts with that.
- More correctness testing.
- Eventually, have to look at playing nicely with migration.
- radix-tree nodes start using up a large amount of memory. Try to improve.
  (e.g. a different data structure, a smaller tree, or don't load the master
  immediately).

Index: linux-2.6/include/linux/mm_types.h
===
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -5,6 +5,8 @@
 #include <linux/threads.h>
 #include <linux/list.h>
 #include <linux/spinlock.h>
+#include <linux/radix-tree.h>
+#include <linux/nodemask.h>
 
 struct address_space;
 
@@ -80,4 +82,10 @@ struct page {
 #endif /* WANT_PAGE_VIRTUAL */
 };
 
+struct pcache_desc {
+   struct page *master;
+   nodemask_t nodes_present;
+   struct radix_tree_root page_tree;
+};
+
 #endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -593,16 +593,13 @@ void fastcall __lock_page_nosync(struct 
  * Is there a pagecache struct page at the given (mapping, offset) tuple?
  * If yes, increment its refcount and return it; if no, return NULL.
  */
-struct page * find_get_page(struct address_space *mapping, unsigned long offset)
+struct page *find_get_page(struct address_space *mapping, unsigned long offset)
 {
struct page *page;
 
read_lock_irq(>tree_lock);
page = radix_tree_lookup(>page_tree, offset);
-   if (page)
-   page_cache_get(page);
-   read_unlock_irq(>tree_lock);
-   return page;
+   return get_unreplicated_page(mapping, offset, page);
 }
 EXPORT_SYMBOL(find_get_page);
 
@@ -621,26 +618,16 @@ struct page *find_lock_page(struct addre
 {
struct page *page;
 
-   read_lock_irq(>tree_lock);
 repeat:
-   page = radix_tree_lookup(>page_tree, offset);
+   page = find_get_page(mapping, offset);
if (page) {
-   page_cache_get(page);
-   if (TestSetPageLocked(page)) {
-   read_unlock_irq(>tree_lock);
-   __lock_page(page);
-   read_lock_irq(>tree_lock);
-
-   /* Has the page been truncated while we slept? */
-   if (unlikely(page->mapping != mapping ||
-   page->index != offset)) {
-   unlock_page(page);
-   page_cache_release(page);
-   goto repeat;
-   }
+   lock_page(page);
+   if (unlikely(page->mapping != mapping)) {
+   unlock_page(page);
+   
