Mersenne: RE: any large groups that run GIMPS software on corporate computers?

1998-11-24 Thread Lars Sözüer

This is probably not the input you're looking for:

I joined GIMPS in the beginning of 1997
and installed mprime on most (i.e. 10) of the
P133-200 Linux boxes in our research group.

After some months, our group boss told me to remove it
from all except the one I myself am working on.
I first tried to ignore his command,
as the other users didn't see any problems.

But sometimes mprime went haywire because it had to write its
intermediate files to an NFS-mounted remote disk, and the writes
sometimes failed because the network wasn't set up properly.
It then tended to steal more cycles than one would expect from
a 'nice' process (and stopped producing anything sensible).

I don't know whether he was aware of this problem, but in the end
I had to comply: he insisted that his machine runs slower when
mprime is on it, and that it doesn't matter whether the other users
agree to let me run mprime on their machines if he, as the
administrator, forbids it.
-- 
Lars S"oz"uer|Tel.: +49-9131-33351  private
Schornbaumstr. 4 |  NEW:  -85-27066 office 
D-91052 Erlangen |FAX:  -15249
Germany  |e-mail: [EMAIL PROTECTED]
Web Site: http://try.physik.uni-erlangen.de/~soezueer



Mersenne: Status Reports

1998-11-24 Thread Steve Gardner

The status reports are a great way of keeping track of each machine's work.
Currently the reports are arranged in exponent order.  I think people with
more than 10 PCs, at least, would find it preferable to have the list
arranged by machine 'name'.

My second choice would be 'days to go' with the shortest time period at the
top.

Can the CGI for this be changed easily?  Otherwise I guess I've found
another use for Excel.
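
Not the CGI itself, but for anyone sorting a saved report by hand, a
minimal sketch (the filename, the whitespace-separated layout, and the
column positions are all assumptions about the report format):

    # sort_report.py - reorder GIMPS status-report rows
    rows = [line.split() for line in open("report.txt") if line.strip()]
    rows.sort(key=lambda r: r[0])                # by machine 'name'
    # or, second choice, shortest time period first (assuming the
    # 'days to go' figure is the last column):
    # rows.sort(key=lambda r: float(r[-1]))
    for r in rows:
        print("\t".join(r))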

Thanks

Steve Gardner
Test Point Inc
[EMAIL PROTECTED]
Browse 83000 Computer Products At www.pcavenue.com




Mersenne: Catching up...

1998-11-24 Thread Ernst W. Mayer


Dear All: nigel was down for several days at the beginning of
November for a major OS upgrade, so I'm just catching up with
the last dozen-or-so Mersenne digests. The mailing list server
also bounced me after trying and failing to deliver several
M-digests, so several postings I sent out in the last few weeks
also got bounced, without my receiving a notice to that effect.

So I'll repost and address some things I came across in the
last three weeks' digests, all in one smorgasbord message;
please pardon me if it seems a tad schizophrenic:

On 11/2, Peter Montgomery wrote:

Let's assume that the FFT computation loses 12 bits
of significance for FFT lengths in the range of interest.
Now consider an LL test on Mp where p ~= 5 million:

   Mantissa   Output   Radix      FFT
     bits      bits              length

      53        41      2^11     524288  (Alpha)
      64        52      2^16     327680  (Pentium)
     110        98      2^40     131072  (Proposed)

While Peter's numbers are strictly speaking correct, the conclusions
are misleading. First off, he allows the Pentium to use a non-power-of-2
FFT length, but not the Alpha. In fact, with 53 mantissa bits, even using
a power-of-two length of 2^18 = 262144 (more familiarly referred to as
"256K" by GIMPSers), one can test exponents to around 5.2 million.

Second, Prime95 gains the benefit of x86 80-bit floating operands only
while data are in registers; all floating loads and stores from/to
memory use 64-bit IEEE floating data. As George explained to me
a long time ago, this is because 80-bit loads and stores, while available
on the x86, take twice as long as 64-bit loads and stores, and since
using 80-bit operands throughout would cut the FFT length by less than
a factor of 2, it's not worth doing it this way.

Compared to strict 64-bit reals, the extra accuracy gained by 80-bits-
only-while-in-registers arithmetic is marginal: for N=256K, Prime95 can
test exponents up to 5.26 million, only around 1% beyond what a true
64-bit program can do.
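
As a back-of-the-envelope check, here is my reading of Peter's
accounting (a sketch, not anyone's production code): with b-bit input
words, each convolution output is a sum of N products and so needs
about 2b + log2(N) bits, which must fit in the mantissa minus the
assumed 12 bits of significance loss:

    from math import log2, floor

    def worst_case_radix_bits(mantissa_bits, fft_length, loss=12):
        # require 2b + log2(N) <= mantissa - loss
        return floor((mantissa_bits - loss - log2(fft_length)) / 2)

    for mantissa, n, label in [(53, 524288, "Alpha"),
                               (64, 327680, "Pentium"),
                               (110, 131072, "Proposed"),
                               (53, 262144, "Alpha at 256K")]:
        b = worst_case_radix_bits(mantissa, n)
        print("%-14s radix 2^%-2d  exponent limit ~%.1f million"
              % (label, b, b * n / 1e6))

This reproduces Peter's three radix figures, but note that it allows
only about 2.9 million for the Alpha at 256K, well below the 5.2
million quoted above: the worst-case bound is far more pessimistic
than the roundoff behavior real programs actually see, which is part
of why I call the conclusions misleading.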

*   *   *

{originally attempted to send on 11/17:}

Subject: Re: the Halloween memoranda

A tad off-topic, but after all, GIMPS is also in many ways
an open-source software (OSS) project:

http://www.opensource.org/halloween.html

And, of course, author Eric Raymond is a GIMPSer.

Keep up the good work, Eric!

*   *   *

On 11/2, George Woltman wrote:

I'd like to challenge the readers to produce the best algorithm
and/or code for finding factors of Mersenne numbers greater than 64-bits.

On reading this, I apparently had the same thought as Conrad Curry:
"Such an algorithm already exists - it's called the Pollard p-1 method."

In reply, I propose a

COUNTERCHALLENGE: using reasonable computational cost estimates and known
statistics regarding the size distribution of factors, convince us that
sieving is a more efficient way of finding factors > 64 bits than p-1,
for exponents of current interest (say, > 1 million).

With regard to the p-1 cost estimate: assume that stage 1 uses a
precomputed product of small prime powers (i.e. a left-to-right binary
exponentiation, costing the same as one LL test iteration per product
bit) and that stage 2 is based on fast polynomial evaluation, with
a circular convolution length yielding a reasonable memory requirement,
say no more than 256MB of RAM. Assume the cost of doing the GCDs is
trivial.
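
To make the stage 1 cost concrete, here is a minimal sketch of p-1
stage 1 on M_p (my illustration only; a real implementation would use
the precomputed product of prime powers and the left-to-right binary
exponentiation described above, rather than one modular powering per
prime):

    from math import gcd

    def primes_up_to(n):
        # simple sieve of Eratosthenes
        sieve = bytearray([1]) * (n + 1)
        sieve[0:2] = b"\x00\x00"
        for i in range(2, int(n ** 0.5) + 1):
            if sieve[i]:
                sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
        return [i for i in range(2, n + 1) if sieve[i]]

    def pminus1_stage1(p, b1, base=3):
        # stage 1 of p-1 on M_p = 2^p - 1, smoothness bound b1
        m = (1 << p) - 1
        x = base
        for q in primes_up_to(b1):
            e = q
            while e * q <= b1:    # largest power of q not exceeding b1
                e *= q
            x = pow(x, e, m)
        # every factor of M_p has the form 2*k*p + 1, so fold in 2*p for free
        x = pow(x, 2 * p, m)
        g = gcd(x - 1, m)
        return g if 1 < g < m else None

    # M_67 has the factor 193707721, and 193707720 = 2^3 * 3^3 * 5 * 67 * 2677
    # is 3000-smooth once the 67 comes from the 2*k*p + 1 form:
    print(pminus1_stage1(67, 3000))    # -> 193707721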

I think you'll see that the only advantage of sieving is certainty:
if we sieve to B bits without finding a factor, we can confidently say
that there are no factors < 2^B. What we can say with p-1 (or ECM, for
that matter) is probabilistic. But if I can find 90% of the factors < 2^B
using p-1, doing half the work of sieving to do so, that's a trade
I'll make any day.

Cheers,
Ernst



Re: Mersenne: RE: any large groups that run GIMPS software on corporate computers?

1998-11-24 Thread George Woltman

Hi Lars and list members,

At 06:13 PM 11/24/98 +0100, Lars wrote:
> But sometimes mprime went haywire because it had to write its
> intermediate files to an NFS-mounted remote disk, and the writes
> sometimes failed because the network wasn't set up properly.
> It then tended to steal more cycles than one would expect from
> a 'nice' process (and stopped producing anything sensible).

Are there any Linux experts that can clue me in on what
might have gone wrong and what code I would need to prevent
this in the future?

> I don't know whether he was aware of this problem, but in the end
> I had to comply: he insisted that his machine runs slower when
> mprime is on it,

I'm sure Lars, a long-time GIMPS participant, knows this, but others
might benefit:

The time between disk writes is adjustable; the default is
30 minutes.  Lars was running mprime long before this was adjustable.

Version 17 now supports a Time= option in the prime.ini file.
This can be used to make mprime sleep during the work day.  This might
be enough to convince an Administrator that mprime is safe.
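
For illustration only, an entry along these lines (the option name is
from the note above; the exact hour-range syntax is my guess, so check
the documentation that ships with version 17):

    Time=18:00-08:00

i.e. crunch only between 6 p.m. and 8 a.m., sleeping during the work day.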

If there are other ideas on how to make mprime or Prime95 less
intrusive please forward them to me by private email.

Regards,
George



RE: Mersenne: A challenge

1998-11-24 Thread B.J. Beesley

I have a reasonable algorithm for computing p^2 mod f. I have a sketch of the
code, sufficient to estimate timings, but it hasn't yet been fully coded or
tested. It's *much* more complex than the simple code I posted yesterday.

The approximate timing is 3,250 clocks. It will do 2 * p^2 mod f for only
about 15 clocks extra. It will go to 96 bits for an extra 600 clocks. There is
scope for a version limited to 72 bits, which would be about 250 clocks faster
than the 80-bit version.

The algorithm consists of two parts: effectively, a long multiplication in
base 2^32 and a long division in base 2. The long multiplication gains little
or nothing from any restriction on the value of f, so a full 96-bit long
multiplication is done.

Let p0 be the low 32 bits of p, p1 the middle 32 bits and p2 the high 32 bits.
When I say "word" I mean 32 bits.

1. Using the FPU, calculate p0*p0, p0*p1, p0*p2, p1*p1, p1*p2 and p2*p2,
storing each result in a separate 64-bit work area. This step appears to be
best done on the processor's floating-point unit; everything else uses only
the ALU.

2. Form the result in a 192-bit "accumulator" as follows:
1st word = low word of p0*p0
2nd word = high word of p0*p0 + 2*low word of p0*p1
3rd word = 2*high word of p0*p1 + low word of p1*p1 + 2*low word of p0*p2
4th word = high word of p1*p1 + 2*high word of p0*p2 + 2*low word of p1*p2
5th word = 2*high word of p1*p2 + low word of p2*p2
6th word = high word of p2*p2
... propagating all carries from one word to the next.
This accumulator has to be stored in memory; there aren't enough registers :-(
From now on, when I say "accumulator", I mean this memory buffer, not a CPU
register.

(There's scope for optimization here: we can start forming the result while
the FPU is still working on the later partial products; this should allow some
of the operations to be "free", since independent ALU and FPU operations are
done in parallel.)

3. If f has 80 or fewer significant bits (high 16 bits of f2 = 0), left shift
the accumulator by 16 bits and set the division loop count to 80; else set
the division loop count to 96.

(Scope for further optimization: if f has 72 or fewer significant bits, or
more than 80 but no more than 88, left shift the accumulator by a further
8 bits and reduce the division loop count by 8.)

Note: f is frequently referenced in the division loop, so store it in
registers. The loop count is infrequently referenced and can be stored in
memory.

4. Repeat until division loop count = 0:
4a. Subtract f from the high 96 bits of the accumulator, if this is
possible without generating a borrow from the MSB.
4b. Left shift the whole accumulator by 1 bit.
4c. Decrement the division loop count.

5. If, but only if, the operation required is 2*p^2 mod f, subtract f from the 
high 96 bits of the accumulator, if this is possible without generating a borrow 
from the MSB.

The required result is now in the high 96 bits of the accumulator.
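
Modelling steps 3-5 in the same way (again my sketch, with a Python
integer standing in for the 192-bit accumulator; real code must also
catch the bit shifted out of the top word, which unbounded integers
sidestep here):

    def sqr_mod_96(p, f, double=False):
        # p^2 mod f (or 2*p^2 mod f) for 0 < p < f < 2^96, per steps 3-5;
        # the doubling is modelled by loading 2*p^2 into the accumulator
        acc = (2 if double else 1) * p * p    # the square from steps 1-2
        if f < 1 << 80:                       # step 3: short f -> pre-shift
            acc <<= 16
            count = 80
        else:
            count = 96
        for _ in range(count):                # step 4: restoring division,
            if (acc >> 96) >= f:              # 4a: subtract from the high
                acc -= f << 96                #     96 bits if no borrow
            acc <<= 1                         # 4b: shift whole accumulator
        # step 5: the post applies this final conditional subtract only in
        # the 2*p^2 case; doing it unconditionally leaves a fully reduced
        # result (the loop alone can leave a value that still exceeds f once)
        if (acc >> 96) >= f:
            acc -= f << 96
        return acc >> 96                      # remainder in the high 96 bits

    # sanity check against Python's own arithmetic
    import random
    f = random.getrandbits(96) | (1 << 95) | 1     # a full-width odd modulus
    p = random.randrange(1, f)
    assert sqr_mod_96(p, f) == p * p % f
    assert sqr_mod_96(p, f, double=True) == 2 * p * p % f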

The bulk of the time is spent in the loop in step 4. Although the exact timing
of each iteration depends on whether or not the subtraction "goes", the effect
is small. There are two problems: speculative precalculation is not helpful,
since each instruction depends heavily on the previous one; and rotating
192 bits through memory requires 6 loads, 6 rotates and 6 stores. I estimate
the loop timing as 1 clock per instruction, not counting the rotates, plus
18 clocks for the rotates. It comes out to approx. 39 clocks per iteration.

Keeping f in memory and using CPU registers for the 3 most significant words of 
the accumulator may be better.