Re: Mersenne: More P4 timings
On 26 Apr 2001, at 21:23, George Woltman wrote: I seriously doubt GIMPS benchmarks will have an impact on P4 sales! I don't know ... as I understand the situation, Intel have been forced to drop their P4 prices way down because the units just aren't shifting ... magazine keep printing reviews where 1.4 GHz P4 systems have similar performance benchmark figures to 1 GHz Athlon systems, which are about half the price, and even the average PC buyer is starting to realize that raw clock speed is not neccessarily a good guide to system performance. The other factor is that most users simply do not (yet) need the raw power available in GHz+ PCs. Those that do are often at least aware of projects like GIMPS which can make good use of as much CPU power as they can get. You have increased the P4 performance of Prime95 by a factor of almost 3; at this performance level, a P4 system makes economic sense even before the latest round of price cuts, whereas without that performance improvement, quite frankly a power user would have been crazy to buy a P4 system even at the less than the price of an alternative system with a slower Athlon processor. It certainly makes a difference as to what I will consider next time I build myself a new system, and how I will evaluate competitive systems next time my employer is purchasing - and, being a largish university, that will be by the truckload. How many data passes per iteration? I think you may be getting very close to saturating PC600 memory throughput! Not even close. I use two memory passes. A 512K FFT is 4MB. Two reads, two writes, plus say 4MB of sin/cos data is 20MB. PC600 memory can deliver 2.0GB/sec. Thus, 20MB / (2.0GB/sec) is 0.01 seconds. Ah, I thought 512K FFT might take 5 or 6 passes. Some have asked about Athlon optimizations. I'm not an expert on the Athlon CPU. The only change I see to make is a different memory layout to take advantage of its different cache layout. I suspect a best case improvement of 10%. That's a lot of work (for me) for a modest gain. True enough. But what about implementing prefetch? The model here would be similar to implementing prefetch for PIII, though the details would differ. So there _might_ be a significant benefit for a large group of non-Athlon users. We're certainly not looking at performance times 3 here, but 10 or 20 percent is significant and valuable - _provided_ the development cost is not too great. I certainly accept that George has the ultimate right to target his development effort into whatever _he_ finds fun. There is another downside to this - while implementing prefetch on Athlon and PIII may be similar from the coding point of view, the opcodes are different - so we would need yet more specific versions of the code, which could cause a problem with distribution, and would also need _very_ careful detection of processor type to avoid unwanted execution of illegal opcodes, with the potential to hang or crash systems. And processsor detection is hard to implement properly because of the continual release of new types. AMD has committed to implementing SSE2 in a future chip. Then AMD users will also benefit from this new code. Assuming the details of the implementation are reasonably compatible. Intel might have something to say about this (like extortionate fees for licensing the idea, even if AMD home-grow their own silicon implementation of the SSE2 instruction set). Remember, MMX and 3DNow! were mutually incompatible implementations of very similar extensions to the basic Pentium instruction set. Regards Brian Beesley 1775*2^332181+1 is prime! (10 digits) Discovered 22-Apr-2001 _ Unsubscribe list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: More P4 timings
Hi: George Woltman wrote: I just completed my first 512K FFT using the new SSE2 instructions! The 512K FFT handles exponents up to 10.3 million. Timings are as follows: 1.4GHz P4, old code:0.126 sec. 1.4GHz P4, new code:0.048 sec. 1.2GHz Athlon, 133MHz DDR: 0.084 sec. I have a few more optimizations up my sleeve. I think my goal of 0.040 seconds is achievable. I really think Intel should give you a good prize!. I can't imagine better publicity for P4. The cut of prices announced recently will help too. Good job!. Regards. Guillermo _ Unsubscribe list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: More P4 timings
At 08:07 PM 4/26/01 +0200, you wrote: [sniped] 1.4GHz P4, old code:0.126 sec. 1.4GHz P4, new code:0.048 sec. 1.2GHz Athlon, 133MHz DDR: 0.084 sec. I have a few more optimizations up my sleeve. I think my goal of 0.040 seconds is achievable. Hey George: When will we begin to see this improved code in action? This would seem to cut a 10 million exponent LL test time almost 280 hours, which is quite a savings, and an increase to total throughput for the project. In the time I can clear one exponent, it could now clear about 4.5 exponents. - This message was sent using GSWeb Mail Services. http://www.gasou.edu/gsumail _ Unsubscribe list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Re: Mersenne: More P4 timings
Hi again, At 07:52 PM 4/26/2001 +, Brian J. Beesley wrote: As opposed to 10.32 million with the 8087 code? If so, that's pretty good - I thought you might have lost more precision than that. I haven't figured out the upper limits yet. I expect the limits to be roughly what Ernst Mayer's program uses. As I recall, his limits are about 2% less than prime95's. 1.4GHz P4, new code: 0.048 sec. Not bad at all, especially if you still have PC600 memory! Intel could use some good publicity; I hope they reward you handsomely for this work, which surely must have some impact on sales. I seriously doubt GIMPS benchmarks will have an impact on P4 sales! How many data passes per iteration? I think you may be getting very close to saturating PC600 memory throughput! Not even close. I use two memory passes. A 512K FFT is 4MB. Two reads, two writes, plus say 4MB of sin/cos data is 20MB. PC600 memory can deliver 2.0GB/sec. Thus, 20MB / (2.0GB/sec) is 0.01 seconds. When will this new code be ready? Patience please :) Above 8K, only the 64K and 512K FFT have been coded. To do: more optimizing, better prefetching, the rest of the FFT sizes, commonizing code to reduce executable size, auxiliary code for ECM and P-1, reduce allocated memory, find the FFT crossover points, and QA. Then I can take a little while to add new features. In light of the clear benefits for P4 users, I'll make a beta available as soon as possible. Now the bad news. I implemented my first optimization today. The one I thought would get me close to 0.40 seconds. Alas, I grossly overestimated the benefits. I'm now at 0.46 and worried about getting to 0.40. Some have asked about Athlon optimizations. I'm not an expert on the Athlon CPU. The only change I see to make is a different memory layout to take advantage of its different cache layout. I suspect a best case improvement of 10%. That's a lot of work (for me) for a modest gain. AMD has committed to implementing SSE2 in a future chip. Then AMD users will also benefit from this new code. Regards, George _ Unsubscribe list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers
Mersenne: More P4 timings
Hi all, I just completed my first 512K FFT using the new SSE2 instructions! The 512K FFT handles exponents up to 10.3 million. Timings are as follows: 1.4GHz P4, old code:0.126 sec. 1.4GHz P4, new code:0.048 sec. 1.2GHz Athlon, 133MHz DDR: 0.084 sec. I have a few more optimizations up my sleeve. I think my goal of 0.040 seconds is achievable. Having fun, George _ Unsubscribe list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers