Re: Mersenne: P4

2000-11-30 Thread Brian J. Beesley

On 29 Nov 00, at 20:31, Jason Stratos Papadopoulos wrote:

  Hmm, that's too much slowdown for single IA32 instructions; the Intel
  engineers will know the reasons.

Presumably something to do with not being a parallel architecture; 
there's only one set of flags, so you can't start an ADC until the 
previous instruction has completed and the flags register has been 
retired. In IA32, there's nothing to tell an instruction that uses 
carry as an input or output _which_ carry flag to use.
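
To make the dependency concrete, here's a sketch in portable C (purely
illustrative - nothing to do with how anyone would actually code this in
assembler): a multiword add in which each limb's carry-in depends on the
previous limb's carry-out. That serial chain is exactly what the single
IA32 flags register imposes on back-to-back ADCs.

#include <stdint.h>
#include <stddef.h>

/* r = a + b over n 64-bit limbs; returns the final carry-out. */
uint64_t mp_add(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = a[i] + carry;     /* may wrap */
        uint64_t c1 = (s < carry);      /* carry out of adding the carry-in */
        r[i] = s + b[i];
        uint64_t c2 = (r[i] < s);       /* carry out of adding b[i] */
        carry = c1 | c2;                /* at most one of these can be set */
    }
    return carry;
}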
 
 This and the 14-18 clock multiply are extremely depressing. For all the
 marketing hype about Intel catering to the internet and the next
 generation of "active media", they're making it awfully difficult to
 implement the cryptography that the internet needs.

Well, you're not _forced_ to buy Intel - even if you stick to 
Windoze! I get the impression that, with the Athlon architecture, AMD 
are making serious inroads into the market - especially at the lower 
end. Now it just so happens that the Athlon architecture runs Prime95
rather well.
 
 The alpha was already at least 5x faster than a PIII for multiprecision
 arithmetic at the same clock speed; with the P4 it will only get worse.

Are you sure about this? I think, with Alpha, you have to execute the 
instruction twice to get a double-precision multiplication - you can 
store either the low half or the high half of a product in one 
instruction, but not both.

Of course, the Alpha's single precision integer arithmetic is 64 
bits, not 32. This does help somewhat :)

It's also easy to get confused between the latency (the time from 
issuing an instruction to having its result available) and the 
throughput (how often a new instruction of that type can be started, 
which is set by how long it ties up the busiest execution unit and by 
how many copies of that unit the design provides).
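
A toy illustration (my own C sketch, not a benchmark of any real
processor): a chain of dependent multiplies runs at the multiplier's
latency, whereas independent multiplies run at its throughput.

#include <stdio.h>

int main(void)
{
    double dep = 1.0, a = 1.0, b = 1.0, c = 1.0, d = 1.0;
    const double k = 1.0000001;

    /* latency-bound: each multiply must wait for the previous result */
    for (int i = 0; i < 1000000; i++)
        dep *= k;

    /* throughput-bound: four independent chains keep the pipeline full */
    for (int i = 0; i < 250000; i++) {
        a *= k;  b *= k;  c *= k;  d *= k;
    }

    printf("%g %g\n", dep, a * b * c * d);  /* same product, computed two ways */
    return 0;
}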

The fact remains that the P4, like all other commercial processors, 
is a compromise. This is what makes it so darned complicated, and 
also indicates why the designers have to make performance tradeoffs, 
some of which don't suit our particular application. It would be 
possible to optimize the design so that it executed the instructions 
_we_ find useful much more quickly, but whether such a processor 
would be capable of running standard commercial benchmarking programs 
at a reasonable speed - or indeed at all - is open to question ... 
we'd probably want to reduce the inherent complexity by junking the 
16-bit x86 legacy, for a start...


Regards
Brian Beesley
_
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ  -- http://www.tasam.com/~lrwiman/FAQ-mers



Re: Mersenne: P4

2000-11-30 Thread EWMAYER

Jason Papadopoulos wrote:

 The alpha was already at least 5x faster than a PIII for multiprecision
 arithmetic at the same clock speed; with the P4 it will only get worse.

Brian Beesley replied:

 Are you sure about this? I think, with Alpha, you have to execute the 
 instruction twice to get a double-precision multiplication - you can 
 store either the low half or the high half of a product in one 
 instruction, but not both.

True, you need two instructions to get both halves of a 64x64 -> 128-bit
product on the Alpha. It always seemed wasteful to me to have an instruction
(e.g. Alpha umulh) to get the upper half of a product, which simply discards
the lower half - by way of contrast, MIPS dmult calculates the full 128-bit
product in one shot, returning the result in *two* 64-bit integer registers -
but apparently the Alpha designers wanted all instructions to be in two-input,
one-output-register format, no exceptions. You might say they felt it would
be less than RISCy to do otherwise.
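
In C one can sketch this as follows (assuming a compiler that provides a
128-bit integer type - modern gcc spells it unsigned __int128 on 64-bit
targets); the compiler lowers it to exactly the two-instruction sequence
just described, one multiply for the low word and one umulh-style multiply
for the high word:

#include <stdint.h>

void mul_64x64_to_128(uint64_t x, uint64_t y, uint64_t *lo, uint64_t *hi)
{
    unsigned __int128 p = (unsigned __int128)x * y;  /* full 128-bit product */
    *lo = (uint64_t)p;           /* low half  (Alpha: mulq)  */
    *hi = (uint64_t)(p >> 64);   /* high half (Alpha: umulh) */
}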

Now, on the Alpha 21064, integer muls are unpipelined and slow, so getting
a 128-bit product takes around 30 cycles. On the 21164 imuls are partially
pipelined - one can start at most every 8 cycles - so a double-width product
needs around 16 cycles, very close to what the MIPS needs using its unpipelined
dmultu instruction. Now on the Alpha 21264, imuls still have fairly long
latency (I believe around 6-8 cycles) but are *fully* pipelined, so even
with the need to use 2 instructions to do it, one can effectively get a
128-bit product every 2 cycles, which really screams.

The IA-64 will (from what I've read) actually be very much like the 21264
in this respect: 2 instructions for the separate halves of a double-wide
integer product (the IA-64 analog of Alpha umulh is called xma.hu), but
here the muls will use the x86-style floating multiplier, whose 64-bit
mantissa is of convenient length for 64-bit integer multiply. Thus, imuls
will benefit from the fact that the fmul unit is already designed to have
fairly low latency and, more importantly, is highly pipelined.

How is any of this relevant to Mersenne testing? Well, fast 64-bit integer
ops should make a well-designed all-integer convolution algorithm competitive
with a floating-point-FFT-based one. However, neither of these two is truly
satisfying since each basically uses just half the functionality (integer
or floating-point) of the processor. We really need an algorithm that does
floating and integer convolutions in parallel and uses some method (e.g.
Chinese remainder theorem or some other form of error correction) to
recorrelate the two result vectors at the end of each convolution step.
Such methods are feasible, but highly nontrivial to implement, and even
more challenging to implement efficiently. Stay tuned...
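
Just to fix ideas, here is a toy sketch of the recombination step alone,
with small made-up moduli - picking suitable moduli, and producing the two
residue vectors from parallel integer and floating convolutions, is the
genuinely hard part:

#include <stdint.h>
#include <stdio.h>

/* Given r1 = x mod m1 and r2 = x mod m2 with gcd(m1, m2) = 1,
   reconstruct x mod (m1*m2).  Brute-force inverse is fine for toy moduli. */
uint64_t crt2(uint64_t r1, uint64_t m1, uint64_t r2, uint64_t m2)
{
    uint64_t inv = 0;                         /* inv = m1^(-1) mod m2 */
    for (uint64_t k = 1; k < m2; k++)
        if ((m1 * k) % m2 == 1) { inv = k; break; }

    uint64_t t = ((r2 + m2 - r1 % m2) * inv) % m2;  /* (r2 - r1)*inv mod m2 */
    return r1 + m1 * t;                             /* unique x in [0, m1*m2) */
}

int main(void)
{
    uint64_t x = 10000000000ULL;              /* recoverable since x < m1*m2 */
    printf("%llu\n", (unsigned long long)
           crt2(x % 1000003, 1000003, x % 65537, 65537));
    return 0;
}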

-ernst




Re: Mersenne: Factoring

2000-11-30 Thread Chris Nash

Hi folks,

Off-topic I know, but...

We could let prime95 decide the next election <grin>.
Give everybody a different prime number. Multiply the primes for 
candidate A together, likewise for B.

And like in this election, unless we can completely factor the results, 
we wouldn't know who won, either.

:)



Re: Mersenne: Factoring

2000-11-30 Thread Peter-Lawrence . Montgomery


Chris Nash proposed:

 We could let prime95 decide the next election <grin>.
 Give everybody a different prime number. Multiply the primes for 
 candidate A together, likewise for B.
 
 And like in this election, unless we can completely factor the results, 
 we wouldn't know who won, either.
 
No need to factor during primery elections.







Mersenne Digest V1 #796

2000-11-30 Thread Mersenne Digest


Mersenne Digest  Thursday, November 30 2000  Volume 01 : Number 796




--

Date: Tue, 28 Nov 2000 15:48:11 -0500
From: "Willmore, David (VS Central)" [EMAIL PROTECTED]
Subject: RE: Mersenne: P4 - a correction

George wrote:
  One correction to my previous post.  I said that the latency to
 access the L1 data cache was 2 clocks.  This is correct for integer
 instructions only.  For floating point and SSE2 instructions the latency
 is 6 clocks!  Interestingly, the L2 cache latency is 7 clocks for both
 integer and floating point instructions.
 
For one reason, look at the coupling that the FPU has to the cache.  I would
expect that the FPU(s) have more ports on the L1 than the integer units do.
Also, if you look at the sensitivity of different types of code to load
latency, integer code is by far more sensitive than floating point.  Think
about the length of the floating-point pipeline: it's pretty long to start
with, so you're going to *have* to unroll your code to take advantage of the
pipeline, so you might as well cover the additional load-to-use latency the
same way.  With enough rename registers, it's all good. :)
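
Something like the following is what I mean by unrolling (a generic C
sketch, nothing P4-specific): several independent load/multiply/add chains
in flight at once, so the load-to-use latency is hidden behind useful work.

#include <stddef.h>

/* dot product with four independent accumulators; n assumed a multiple of 4 */
double dot(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];       /* while one chain waits on its loads,  */
        s1 += a[i + 1] * b[i + 1];   /* the other three keep the FP pipeline */
        s2 += a[i + 2] * b[i + 2];   /* and the load ports busy              */
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}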

Cheers,
David

--

Date: Tue, 28 Nov 2000 16:48:13 -0500
From: George Woltman [EMAIL PROTECTED]
Subject: Mersenne: Re: P4 - a correction

Hi,

At 07:45 AM 11/28/00 +, Brian J. Beesley wrote:
Surely the latency is much less relevant than the throughput,
provided that there are sufficient registers and pipeline entries.

True.  The manuals are unclear as to how many XMM registers are
available.  There are 8 user visible ones and an unknown number
of internal ones that you use via the P4's register renaming feature.

The idea is to keep the execution units busy ...

I just finished designing the new code for one of the easier problems -
the rounding and carry propagation step after the inverse FFT.
It looks like the P4 can do this in 6.5 clocks per double vs. about
19 clocks per double on the P3.  This doesn't count any benefits
from prefetching data.
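
For anyone wondering what that step looks like, here is a bare-bones scalar
sketch; the real code uses SSE2 and prime95's variable word sizes, so the
fixed BASE below is just a simplifying assumption, not how digits are
actually stored.

#include <math.h>

#define BASE 65536.0   /* stand-in for the variable word sizes really used */

void round_and_carry(double *x, int n)
{
    double carry = 0.0;
    for (int i = 0; i < n; i++) {
        double v = rint(x[i]) + carry;   /* round to nearest integer, add carry-in */
        carry = floor(v / BASE);         /* what spills into the next word         */
        x[i]  = v - carry * BASE;        /* this word, now in [0, BASE)            */
    }
    /* for a Mersenne-mod transform a nonzero final carry wraps back into x[0] */
}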

It also seems that the multiply units are idle enough in the new
code so that I can continue to use the compact two-to-fractional-power
arrays, rather than two full 5MB arrays.  Thus, yesterday's question of
33MB max is probably 23MB max for a 640K FFT.  That's good, because
I'd feel guilty using up 33MB.  Heck, I feel guilty using 23MB.

I don't expect the above savings (3 times fewer clocks) to hold for
the FFT code.  The old FFT code was much better pipelined than the old
carry propagation code.

Bear in mind that there may be a severe penalty for switching between
SSE2 and x86 FPU formats.

There is no penalty on the P4.  The XMM registers do not overlap the
FPU registers in the same way that MMX registers do.

Having fun,
George


--

Date: Tue, 28 Nov 2000 14:06:41 -0800
From: "Stephan T. Lavavej" [EMAIL PROTECTED]
Subject: Mersenne: More on Compression

 This 'Wonderful' compression technology may be "Awesome"; however, MY main
 objection or perhaps philosophy towards all of this is that Prime95 is not
 a large piece of code.  It takes a relatively small amount of time to
 download over a modem compared to other software items that we modem users
 may download in a week's period.  Maybe if Prime95 was .. say .. a 20MB
 download, then perhaps shaking things up to save some download time would
 be a good idea, but as things stand now we are only talking about saving
 20 or so seconds.  Perhaps that is the reason people may find it
 'objectionable'.  Maybe it's just not worth the hassle at the moment for
 this particular application.

Actually, the download time wouldn't change all that much.  (The package
would go from 400K to 300K.)  But the executable itself, once unzipped,
would only be some 200K, as compared to over 1MB before.  It's just more
efficient.  The question is, if compression involves a one-time, five-minute
cost on the part of the developer and saves everyone a few seconds of
download time and a few K of HD space, then why not?  Why have bloated code?
I sure like looking at 200K executables instead of megabyte and larger
things.  Makes me think of my old 486 where everything fit in 120MB.

 P.S. The website for the compression program never resolved for me; I'd
 like to take a look at it, if someone would send me the IP.

http://upx.tsx.org is a redirection URL for
http://wildsau.idv.uni-linz.ac.at/mfx/upx.html.

 The P4 at 1.5 GHz is just the beginning of a sequence of