Re: Mersenne: mprime, linux and 2 MB pages

2002-03-19 Thread Nick Craig-Wood

On Mon, Mar 18, 2002 at 02:12:48PM +, Brian J. Beesley wrote:
 On Monday 18 March 2002 10:21, Nick Craig-Wood wrote:
  There has been some discussion on the linux kernel mailing list about
  providing 2 MB pages (instead of 4kB ones) to user space for the use
  of database or scientific calculations.
 
  It seems to me that prime95/mprime would benefit from this enormously
  - it should reduce the TLB thrashing to practically zero and hence
  speed up mprime by some unknown amount.
 
  Is this true?  Should we put in our plea to the developers?
 
 Other people may and probably will disagree, but I think it will make very 
 little if any difference for most applications.

For most applications yes - however it would be configurable in a
per-process manner.  Possibly each process might have to take special
action to get 2 MB pages - say a new flag to mmap().
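
For illustration only: Linux did later gain per-mapping mechanisms of
this kind (a MAP_HUGETLB flag for mmap() and a MADV_HUGEPAGE hint for
madvise()).  A minimal Python sketch of the opt-in idea, assuming a Unix
system and Python 3.8 or later.  The function name
map_with_huge_page_hint is made up for this example, and the kernel is
free to ignore the hint.

```python
import mmap

def map_with_huge_page_hint(length):
    """Map anonymous memory and ask (not require) huge-page backing."""
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advised = False
    # MADV_HUGEPAGE is only defined where the platform offers it (Linux).
    if hasattr(buf, "madvise") and hasattr(mmap, "MADV_HUGEPAGE"):
        try:
            buf.madvise(mmap.MADV_HUGEPAGE)
            advised = True
        except OSError:
            # The kernel may have been built without huge page support.
            pass
    return buf, advised

buf, advised = map_with_huge_page_hint(10 * 1024 * 1024)
print(len(buf), advised)
buf.close()
```

Only the process that asks for the hint is affected, which is the
per-process behaviour described above.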

 The point is that mprime should normally be running on a system in a way 
 which means that all its active data pages are in memory. Having active data 
 paged out will cause a hideous performance hit.
 
 If the active data is already memory resident, TLB thrashing is not going to 
 be an issue.

The TLB (translation lookaside buffer) has very little to do with the
Virtual Memory system.  The TLB is used by the processor to cache the
address translations from logical memory to physical memory.  These
have to be read from the page table RAM which is expensive - hence the
cache.

When I was working on a DWT implementation for StrongARM I found that
thrashing the TLB caused a factor of two slow down.  The StrongARM
system I was using had no virtual memory.

If mprime is using 10 MB of memory, say, then each 4 kB page needs one
TLB entry before the processor can use it at all - i.e. 2560 TLB
entries, which is far more than the TLB in the processor holds (I don't
remember what it is in x86 but on StrongARM it has 32 entries).  Each
time a page without a TLB entry is touched, the entry has to be
reloaded from the page tables, which is an extra memory access or two.
If you use 2 MB pages then only 5 pages are needed, so the TLB never
needs to be refilled and you get some speed gain.
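
The arithmetic above can be checked with a short Python sketch.  The
helper name pages_needed is invented for this illustration; the 10 MB
working set, the 4 kB and 2 MB page sizes and the 32-entry StrongARM
TLB are the figures quoted in this message.

```python
KB = 1024
MB = 1024 * KB

def pages_needed(working_set_bytes, page_bytes):
    """Number of pages (and so TLB entries) needed to cover the working set."""
    return -(-working_set_bytes // page_bytes)  # ceiling division

TLB_ENTRIES = 32  # StrongARM figure from the message above

for page in (4 * KB, 2 * MB):
    n = pages_needed(10 * MB, page)
    print(page, n, "fits in TLB" if n <= TLB_ENTRIES else "exceeds TLB")
```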

 If the page size is going to be changed at all, there is a lot to be
 said for using the same size pages as AGP hardware - 4MB I think -
 there have already been some issues on some Athlon (K7) architecture
 linux systems caused by incorrect mapping between linux virtual
 pages and AGP address space; obviously using the same page size
 removes this source of confusion.

The choice of page size is constrained by the memory management
hardware.  I think 4k and 2 MB are the only sensible choices but I may
be wrong.

 One factor with shifting to a much larger page size is a
 corresponding decrease in the number of pages available to the
 system - a 32 MByte system will have only 8 4MB pages resident in
 real memory at any one time. Since page access rules are often used
 to protect data from accidental modification by rogue pointers etc.,
 a big reduction in system physical page count is a distinctly mixed
 blessing.

The proposal was that you would be able to turn on 2 MB pages for a
given process - obviously you wouldn't want 2 MB pages for every
process unless you had  100 GB of RAM.

I think this would make a real difference to mprime - what percentage
I don't know - at the cost of on average 1 MB of RAM extra.
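
The "1 MB on average" figure is the usual rounding loss: an allocation
is rounded up to a whole number of pages, so the last page is on
average about half empty.  A rough Python check, assuming allocation
sizes are spread evenly across a 2 MB page (the helper rounding_waste
is hypothetical):

```python
KB = 1024
MB = 1024 * KB
PAGE = 2 * MB

def rounding_waste(size_bytes, page_bytes=PAGE):
    """Bytes lost when size_bytes is rounded up to a whole number of pages."""
    return -size_bytes % page_bytes

# Average the loss over sizes from 4 kB to 2 MB in 4 kB steps.
sizes = range(4 * KB, PAGE + 1, 4 * KB)
average = sum(rounding_waste(s) for s in sizes) / len(sizes)
print(average / MB)  # close to 1 MB
```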

-- 
Nick Craig-Wood
[EMAIL PROTECTED]
_
Unsubscribe  list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ  -- http://www.tasam.com/~lrwiman/FAQ-mers



Re: Mersenne: mprime, linux and 2 MB pages

2002-03-19 Thread Brian J. Beesley

On Tuesday 19 March 2002 10:09, Nick Craig-Wood wrote:
 On Mon, Mar 18, 2002 at 02:12:48PM +, Brian J. Beesley wrote:
 
  If the active data is already memory resident, TLB thrashing is not going
  to be an issue.

 The TLB (translation lookaside buffer) has very little to do with the
 Virtual Memory system.  The TLB is used by the processor to cache the
 address translations from logical memory to physical memory.  These
 have to be read from the page table RAM which is expensive - hence the
 cache.

Ah, but ... frequently accessing pages (virtual _or_ physical) will keep the 
TLB pages from getting too far away from the processor; probably at worst 
they will stay in the L1 cache.

The overhead of accessing from L1 cache is small compared with the overhead 
of accessing data from main memory, and _tiny_ compared with the overhead of 
accessing data from the page/swap file.

 When I was working on a DWT implementation for StrongARM I found that
 thrashing the TLB caused a factor of two slow down.  The StrongARM
 system I was using had no virtual memory.

 If mprime is using 10 MB of memory say, then each page needs 1 TLB
 entry to be used at all by the processor - ie 2560 TLB entries which
 is way bigger than the size of the TLB in the processor (I don't
 remember what it is in x86 but on StrongARM it has 32 entries).  To
 access each physical page the TLB has to be reloaded from the page
 tables which is an extra memory access or two.  If you use 2 MB pages
 then there are only 5 pages needed and hence the TLB will never need
 to be refilled and hence some speed gain.

Don't you _need_ to have at least enough TLB entries to map the whole of the 
processor cache? (Since without it you can't map the cache table entries...) 
The K7 (Athlon) architecture is designed to support at least 8MBytes cache, 
even though AFAIK no Athlons with more than 512KB cache have been supplied. 
Intel have supplied Xeons with 2MBytes cache; I can't remember offhand what 
the design limit is...

Anyway, here's the point. I'm running mprime on an Athlon (XP1700) with a 
very large exponent (~67 million); the virtual memory used by the mprime 
process is 42912 Kbytes = 10,000+ pages. The speed it's running at suggests 
that any performance loss due to TLB thrashing is small, since the extra drop 
beyond linearity is only about what one would expect from the LL test 
algorithm being O(n log n).
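
As a quick check of the figures above, a Python sketch: 42912 Kbytes
counted in 4 kB pages, and how much slower than linear an O(n log n)
step gets as the transform grows.  The two FFT lengths are made-up
round numbers for illustration, not the sizes mprime actually uses.

```python
import math

def superlinear_factor(n_small, n_large):
    """Extra slowdown beyond linear scaling for an O(n log n) algorithm."""
    return math.log2(n_large) / math.log2(n_small)

pages = 42912 // 4  # number of 4 kB pages in 42912 Kbytes
print(pages)

# Hypothetical FFT lengths: 2**20 and 2**22 points.
print(superlinear_factor(2**20, 2**22))
```

A measured slowdown well above this factor would point to something
else, such as TLB misses, costing time.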

Whatever effect TLB thrashing may or may not be having, it doesn't look as 
though it's having a dominant effect on mprime.

 I think this would make a real difference to mprime - what percentage
 I don't know - at the cost of on average 1 MB of RAM extra.

I wouldn't mind _doubling_ the memory footprint, if we got a _significant_
performance boost as a consequence.

BTW why does this argument apply only to mprime? Surely Windows has the same 
underlying architecture - though obviously it's harder to get the Windows 
kernel changed than linux. 

Regards
Brian Beesley



Re: Mersenne: mprime, linux and 2 MB pages

2002-03-19 Thread Nick Craig-Wood

On Tue, Mar 19, 2002 at 08:11:54PM +, Brian J. Beesley wrote:
 Ah, but ... frequently accessing pages (virtual _or_ physical) will
 keep the TLB pages from getting too far away from the processor;
 probably at worst they will stay in the L1 cache.

Yes.

 The overhead of accessing from L1 cache is small compared with the
 overhead of accessing data from main memory, and _tiny_ compared
 with the overhead of accessing data from the page/swap file.

Yes... but a TLB miss costs one or maybe two extra memory accesses,
i.e. it halves the performance if you are missing on every fetch.
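
A back-of-envelope version of that estimate, as a Python sketch.  It
assumes every data fetch costs one memory access and each TLB miss adds
a fixed number of extra accesses; real hardware overlaps and caches
some of this, so treat the numbers as a rough guide.  The function name
is invented here.

```python
def tlb_slowdown(miss_rate, extra_accesses_per_miss):
    """Relative run time when a fraction miss_rate of fetches miss the TLB."""
    return 1.0 + miss_rate * extra_accesses_per_miss

print(tlb_slowdown(1.0, 1))  # every fetch misses, one extra access each
print(tlb_slowdown(1.0, 2))  # every fetch misses, two extra accesses each
print(tlb_slowdown(0.1, 2))  # one fetch in ten misses
```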

 Don't you _need_ to have at least enough TLB entries to map the
 whole of the processor cache? (Since without it you can't map the
 cache table entries...)

Hmm, not sure.  The processor cache isn't directly mapped into memory
so it doesn't need TLB entries.  Depending on the architecture the
cache will either cache physical addresses or logical addresses.

[snip]
 Whatever effect TLB thrashing may or may not be having, it doesn't look as 
 though it's having a dominant effect on mprime.

Very true and that is a testament to George's programming skills.  A
naively programmed FFT will demonstrate TLB thrashing admirably!
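
A toy model of that effect, as a Python sketch.  It is not mprime's
code and not a real FFT: it walks a 64 x 512 array of 8-byte values by
rows and then by columns, where the column walk stands in for the large
strides of a naive FFT pass.  The TLB is assumed to be fully
associative with LRU replacement and 32 entries (the StrongARM figure
mentioned earlier in the thread); all names are made up for the
example.

```python
from collections import OrderedDict

def tlb_misses(addresses, page_size=4096, tlb_entries=32):
    """Count misses for a stream of byte addresses in a small LRU TLB."""
    tlb = OrderedDict()
    misses = 0
    for addr in addresses:
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)        # mark as most recently used
        else:
            misses += 1
            tlb[page] = True
            if len(tlb) > tlb_entries:
                tlb.popitem(last=False)  # evict least recently used
    return misses

ROWS, COLS, ELEM = 64, 512, 8  # each row is exactly one 4 kB page

def row_order():
    return ((r * COLS + c) * ELEM for r in range(ROWS) for c in range(COLS))

def column_order():
    return ((r * COLS + c) * ELEM for c in range(COLS) for r in range(ROWS))

print(tlb_misses(row_order()))      # unit stride, 4 kB pages
print(tlb_misses(column_order()))   # page-sized stride, 4 kB pages
print(tlb_misses(column_order(), page_size=2 * 1024 * 1024))
```

With 4 kB pages the column walk cycles through 64 pages, twice what
the 32-entry TLB holds, so in this model every access misses; with
2 MB pages the whole array sits in one page.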

  I think this would make a real difference to mprime - what percentage
  I don't know - at the cost of on average 1 MB of RAM extra.
 
 I wouldn't mind _doubling_ the memory footprint, if we got a _significant_
 performance boost as a consequence.

Me neither.  I'd like to try it but I need to persuade some friendly
kernel hacker to implement it for me ;-) I would guess it might make
at most 10% difference to the run time.

 BTW why does this argument apply only to mprime? Surely Windows has the same 
 underlying architecture - though obviously it's harder to get the Windows 
 kernel changed than linux. 

It applies exactly the same to Windows of course.  It's just that you
can chat with the Linux kernel developers on the mailing list ;-)

-- 
Nick Craig-Wood
[EMAIL PROTECTED]



Re: Mersenne: mprime, linux and 2 MB pages

2002-03-18 Thread Brian J. Beesley

On Monday 18 March 2002 10:21, Nick Craig-Wood wrote:
 There has been some discussion on the linux kernel mailing list about
 providing 2 MB pages (instead of 4kB ones) to user space for the use
 of database or scientific calculations.

 It seems to me that prime95/mprime would benefit from this enormously
 - it should reduce the TLB thrashing to practically zero and hence
 speed up mprime by some unknown amount.

 Is this true?  Should we put in our plea to the developers?

Other people may and probably will disagree, but I think it will make very 
little if any difference for most applications.

The point is that mprime should normally be running on a system in a way 
which means that all its active data pages are in memory. Having active data 
paged out will cause a hideous performance hit.

If the active data is already memory resident, TLB thrashing is not going to 
be an issue.

Applications written in such a way that rarely-accessed data is stored in 
virtual memory with the intention that the OS allows it to be paged out are a 
different matter - larger page sizes would undoubtedly help those, at least 
to some extent.

If the page size is going to be changed at all, there is a lot to be said for 
using the same size pages as AGP hardware - 4MB I think - there have already 
been some issues on some Athlon (K7) architecture linux systems caused by 
incorrect mapping between linux virtual pages and AGP address space; 
obviously using the same page size removes this source of confusion.

One factor with shifting to a much larger page size is a corresponding 
decrease in the number of pages available to the system - a 32 MByte system 
will have only 8 4MB pages resident in real memory at any one time. Since 
page access rules are often used to protect data from accidental modification 
by rogue pointers etc., a big reduction in system physical page count is a 
distinctly mixed blessing.

As a project I don't think we need to make recommendations one way or the
other. As an individual I would say either go with AGP or stick with the 
status quo; and I think the status quo is better suited to systems with small 
to moderate amounts of physical memory (certainly those with less than 256 
MBytes).

Regards
Brian Beesley
