Re: Mersenne: mprime, linux and 2 MB pages
On Mon, Mar 18, 2002 at 02:12:48PM +, Brian J. Beesley wrote:
> On Monday 18 March 2002 10:21, Nick Craig-Wood wrote:
> > There has been some discussion on the linux kernel mailing list about providing 2 MB pages (instead of 4kB ones) to user space for the use of database or scientific calculations. It seems to me that prime95/mprime would benefit from this enormously - it should reduce the TLB thrashing to practically zero and hence speed up mprime by some unknown amount. Is this true? Should we put in our plea to the developers?
>
> Other people may and probably will disagree, but I think it will make very little if any difference for most applications.

For most applications yes - however it would be configurable in a per-process manner. Possibly each process might have to take special action to get 2 MB pages - say a new flag to mmap().

> The point is that mprime should normally be running on a system in a way which means that all its active data pages are in memory. Having active data paged out will cause a hideous performance hit. If the active data is already memory resident, TLB thrashing is not going to be an issue.

The TLB (translation lookaside buffer) has very little to do with the Virtual Memory system. The TLB is used by the processor to cache the address translations from logical memory to physical memory. These have to be read from the page table RAM which is expensive - hence the cache.

When I was working on a DWT implementation for StrongARM I found that thrashing the TLB caused a factor of two slow down. The StrongARM system I was using had no virtual memory.

If mprime is using 10 MB of memory say, then each page needs 1 TLB entry to be used at all by the processor - i.e. 2560 TLB entries, which is way bigger than the size of the TLB in the processor (I don't remember what it is in x86 but on StrongARM it has 32 entries). To access each physical page the TLB has to be reloaded from the page tables, which is an extra memory access or two.

If you use 2 MB pages then there are only 5 pages needed and hence the TLB will never need to be refilled and hence some speed gain.

> If the page size is going to be changed at all, there is a lot to be said for using the same size pages as AGP hardware - 4MB I think - there have already been some issues on some Athlon (K7) architecture linux systems caused by incorrect mapping between linux virtual pages and AGP address space; obviously using the same page size removes this source of confusion.

The choice of page size is constrained by the memory management hardware. I think 4k and 2 MB are the only sensible choices but I may be wrong.

> One factor with shifting to a much larger page size is a corresponding decrease in the number of pages available to the system - a 32 MByte system will have only 8 4MB pages resident in real memory at any one time. Since page access rules are often used to protect data from accidental modification by rogue pointers etc., a big reduction in system physical page count is a distinctly mixed blessing.

The proposal was that you would be able to turn on 2 MB pages for a given process - obviously you wouldn't want 2 MB pages for every process unless you had 100 GB of RAM.

I think this would make a real difference to mprime - what percentage I don't know - at the cost of on average 1 MB of RAM extra.

-- 
Nick Craig-Wood [EMAIL PROTECTED]

Unsubscribe list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
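The page-count arithmetic in the message above can be checked with a few lines of Python. This is only a sketch: the 10 MB working set and the 32-entry TLB are the figures quoted in the message, while the function names are made up for illustration, and the "average 1 MB extra" estimate assumes a mapping's size is uniformly distributed modulo the page size.

```python
KIB = 1024
MIB = 1024 * KIB

TLB_ENTRIES = 32  # StrongARM figure quoted in the message; x86 sizes differ


def pages_needed(working_set_bytes, page_size_bytes):
    """Pages (and so TLB entries) needed to map the whole working set."""
    return -(-working_set_bytes // page_size_bytes)  # ceiling division


def average_tail_waste(page_size_bytes):
    """Expected unused space in the last page of one mapping, assuming the
    mapping's size is uniformly distributed modulo the page size."""
    return page_size_bytes // 2


for page_size in (4 * KIB, 2 * MIB):
    n = pages_needed(10 * MIB, page_size)
    verdict = "fits in" if n <= TLB_ENTRIES else "exceeds"
    print(f"{page_size // KIB:>5} kB pages: {n:>4} entries, "
          f"{verdict} a {TLB_ENTRIES}-entry TLB")
```

With 4 kB pages the 10 MB working set needs 2560 entries; with 2 MB pages it needs 5, and the expected waste per mapping is half a page, i.e. 1 MB.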
Re: Mersenne: mprime, linux and 2 MB pages
On Tuesday 19 March 2002 10:09, Nick Craig-Wood wrote:
> On Mon, Mar 18, 2002 at 02:12:48PM +, Brian J. Beesley wrote:
> > If the active data is already memory resident, TLB thrashing is not going to be an issue.
>
> The TLB (translation lookaside buffer) has very little to do with the Virtual Memory system. The TLB is used by the processor to cache the address translations from logical memory to physical memory. These have to be read from the page table RAM which is expensive - hence the cache.

Ah, but ... frequently accessing pages (virtual _or_ physical) will keep the TLB pages from getting too far away from the processor; probably at worst they will stay in the L1 cache. The overhead of accessing from L1 cache is small compared with the overhead of accessing data from main memory, and _tiny_ compared with the overhead of accessing data from the page/swap file.

> When I was working on a DWT implementation for StrongARM I found that thrashing the TLB caused a factor of two slow down. The StrongARM system I was using had no virtual memory.
>
> If mprime is using 10 MB of memory say, then each page needs 1 TLB entry to be used at all by the processor - i.e. 2560 TLB entries, which is way bigger than the size of the TLB in the processor (I don't remember what it is in x86 but on StrongARM it has 32 entries). To access each physical page the TLB has to be reloaded from the page tables, which is an extra memory access or two.
>
> If you use 2 MB pages then there are only 5 pages needed and hence the TLB will never need to be refilled and hence some speed gain.

Don't you _need_ to have at least enough TLB entries to map the whole of the processor cache? (Since without it you can't map the cache table entries...)

The K7 (Athlon) architecture is designed to support at least 8MBytes cache, even though AFAIK no Athlons with more than 512KB cache have been supplied. Intel have supplied Xeons with 2MBytes cache; I can't remember offhand what the design limit is...

Anyway, here's the point. I'm running mprime on an Athlon (XP1700) with a very large exponent (~67 million); the virtual memory used by the mprime process is 42912 Kbytes = 10,000+ pages. The speed it's running at suggests that any performance loss due to TLB thrashing is small, since the extra drop beyond linearity is only about what one would expect from the LL test algorithm being O(n log n).

Whatever effect TLB thrashing may or may not be having, it doesn't look as though it's having a dominant effect on mprime.

> I think this would make a real difference to mprime - what percentage I don't know - at the cost of on average 1 MB of RAM extra.

I wouldn't mind _doubling_ the memory footprint, if we got a _significant_ performance boost as a consequence.

BTW why does this argument apply only to mprime? Surely Windows has the same underlying architecture - though obviously it's harder to get the Windows kernel changed than linux.

Regards
Brian Beesley
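The "drop beyond linearity" argument above can be illustrated with a simple cost model. The sketch below assumes the cost of one transform of length n is proportional to n log2 n; the two transform lengths are hypothetical, chosen only to show the size of the effect, and are not mprime's actual FFT lengths.

```python
import math


def nlogn_cost(n):
    """Cost of one length-n transform under an n*log2(n) model (arbitrary units)."""
    return n * math.log2(n)


def slowdown_beyond_linear(n_small, n_large):
    """Factor by which the larger transform is slower than pure linear
    scaling from the smaller one would predict."""
    linear_ratio = n_large / n_small
    model_ratio = nlogn_cost(n_large) / nlogn_cost(n_small)
    return model_ratio / linear_ratio


# Hypothetical transform lengths, for illustration only.
ratio = slowdown_beyond_linear(2**20, 2**22)
print(f"{(ratio - 1) * 100:.0f}% slower than linear scaling")
```

Under this model a fourfold increase in transform length costs about 10% more than linear scaling predicts. A measured slowdown well above that would point to some other effect, such as cache or TLB misses; a slowdown close to it is consistent with the observation in the message.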
Re: Mersenne: mprime, linux and 2 MB pages
On Tue, Mar 19, 2002 at 08:11:54PM +, Brian J. Beesley wrote:
> Ah, but ... frequently accessing pages (virtual _or_ physical) will keep the TLB pages from getting too far away from the processor; probably at worst they will stay in the L1 cache.

Yes.

> The overhead of accessing from L1 cache is small compared with the overhead of accessing data from main memory, and _tiny_ compared with the overhead of accessing data from the page/swap file.

Yes... but a TLB miss costs one or maybe two extra memory cycles. I.e. it halves the performance if you are missing every fetch.

> Don't you _need_ to have at least enough TLB entries to map the whole of the processor cache? (Since without it you can't map the cache table entries...)

Hmm, not sure. The processor cache isn't directly mapped into memory so it doesn't need TLB entries. Depending on the architecture the cache will either cache physical addresses or logical addresses.

[snip]

> Whatever effect TLB thrashing may or may not be having, it doesn't look as though it's having a dominant effect on mprime.

Very true and that is a testament to George's programming skills. A naively programmed FFT will demonstrate TLB thrashing admirably!

> > I think this would make a real difference to mprime - what percentage I don't know - at the cost of on average 1 MB of RAM extra.
>
> I wouldn't mind _doubling_ the memory footprint, if we got a _significant_ performance boost as a consequence.

Me neither. I'd like to try it but I need to persuade some friendly kernel hacker to implement it for me ;-)

I would guess it might make at most 10% difference to the run time.

> BTW why does this argument apply only to mprime? Surely Windows has the same underlying architecture - though obviously it's harder to get the Windows kernel changed than linux.

It applies exactly the same to Windows of course. It's just that you can chat with the Linux kernel developers on the mailing list ;-)

-- 
Nick Craig-Wood [EMAIL PROTECTED]
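The thrashing described in this exchange can be reproduced with a toy model. The sketch below is not a model of any real processor: it assumes a fully associative TLB with LRU replacement, takes the 32-entry size from the StrongARM figure quoted in the thread, and uses a strided sweep over a 10 MB buffer as a stand-in for a naive FFT's access pattern. All function names are invented for the example.

```python
from collections import OrderedDict

PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024


def tlb_misses(addresses, page_size, entries=32):
    """Count misses in a fully associative, LRU-replaced TLB model."""
    tlb = OrderedDict()  # page number -> None, ordered oldest to newest
    misses = 0
    for addr in addresses:
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)  # mark as most recently used
        else:
            misses += 1
            tlb[page] = None
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used page
    return misses


def strided_pass(buffer_bytes, stride_bytes, passes=2):
    """Yield the addresses touched by repeated strided sweeps over a buffer."""
    for _ in range(passes):
        for addr in range(0, buffer_bytes, stride_bytes):
            yield addr


def relative_time(miss_rate, extra_accesses_per_miss=1):
    """Memory accesses per fetch, relative to a fetch that always hits the TLB."""
    return 1 + miss_rate * extra_accesses_per_miss


# Touch one word in every 4 kB of a 10 MB buffer, twice over.
addrs = list(strided_pass(10 * 1024 * 1024, 4096))
print("accesses:       ", len(addrs))
print("misses, 4 kB:   ", tlb_misses(addrs, PAGE_4K))
print("misses, 2 MB:   ", tlb_misses(addrs, PAGE_2M))
print("time if all miss:", relative_time(1.0))
```

In this model every access misses with 4 kB pages, because the sweep cycles through 2560 pages with only 32 entries available. With 2 MB pages only the first touch of each of the 5 pages misses. With one extra memory access per miss and a miss on every fetch, the relative time is 2, which is the "halves the performance" case.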
Mersenne: mprime, linux and 2 MB pages
There has been some discussion on the linux kernel mailing list about providing 2 MB pages (instead of 4kB ones) to user space for the use of database or scientific calculations.

It seems to me that prime95/mprime would benefit from this enormously - it should reduce the TLB thrashing to practically zero and hence speed up mprime by some unknown amount.

Is this true? Should we put in our plea to the developers?

-- 
Nick Craig-Wood [EMAIL PROTECTED]
Re: Mersenne: mprime, linux and 2 MB pages
On Monday 18 March 2002 10:21, Nick Craig-Wood wrote:
> There has been some discussion on the linux kernel mailing list about providing 2 MB pages (instead of 4kB ones) to user space for the use of database or scientific calculations. It seems to me that prime95/mprime would benefit from this enormously - it should reduce the TLB thrashing to practically zero and hence speed up mprime by some unknown amount. Is this true? Should we put in our plea to the developers?

Other people may and probably will disagree, but I think it will make very little if any difference for most applications.

The point is that mprime should normally be running on a system in a way which means that all its active data pages are in memory. Having active data paged out will cause a hideous performance hit. If the active data is already memory resident, TLB thrashing is not going to be an issue.

Applications written in such a way that rarely-accessed data is stored in virtual memory with the intention that the OS allows it to be paged out are a different matter - larger page sizes would undoubtedly help those, at least to some extent.

If the page size is going to be changed at all, there is a lot to be said for using the same size pages as AGP hardware - 4MB I think - there have already been some issues on some Athlon (K7) architecture linux systems caused by incorrect mapping between linux virtual pages and AGP address space; obviously using the same page size removes this source of confusion.

One factor with shifting to a much larger page size is a corresponding decrease in the number of pages available to the system - a 32 MByte system will have only 8 4MB pages resident in real memory at any one time. Since page access rules are often used to protect data from accidental modification by rogue pointers etc., a big reduction in system physical page count is a distinctly mixed blessing.

As a project I don't think we need to make recommendations one way or the other. As an individual I would say either go with AGP or stick with the status quo; and I think the status quo is better suited to systems with small to moderate amounts of physical memory (certainly those with less than 256 MBytes).

Regards
Brian Beesley