Mersenne: RE: any large groups that run GIMPS software on corporate computers?
This is probably not the input you're looking for: I joined GIMPS at the beginning of 1997 and installed mprime on most (i.e. 10) of the P133-200 Linux boxes in our research group. After some months, our group boss told me to remove it from all machines except the one I work on myself. At first I tried to ignore his order, since the other users saw no problems. But sometimes mprime went mad because it had to write intermediate files onto an NFS-mounted remote disk, which occasionally failed because the network wasn't properly set up. It then tended to steal more cycles than one would expect from a 'nice' process (and stopped producing anything sensible). I don't know if he was aware of this problem, but in the end I had to obey, because he insisted that his machine ran slower while mprime was on it, and that it didn't matter whether other users agreed to let me run mprime on their machines, since he, as the administrator, forbade it. --

Lars Sözüer      | Tel.:   +49-9131-33351 private
Schornbaumstr. 4 |         NEW: -85-27066 office
D-91052 Erlangen | FAX:    -15249
Germany          | e-mail: [EMAIL PROTECTED]
Web Site: http://try.physik.uni-erlangen.de/~soezueer
Mersenne: Status Reports
The status reports are a great way of keeping track of each machine's work. Currently the reports are sorted in exponent order. I think it would be preferable, at least for people with more than 10 PCs, to sort the list by machine name. My second choice would be 'days to go', with the shortest time period at the top. Can the CGI for this be changed easily? Otherwise I guess I've found another use for Excel. Thanks, Steve Gardner Test Point Inc [EMAIL PROTECTED] Browse 83000 Computer Products At www.pcavenue.com
Mersenne: Catching up...
Dear All: nigel was down for several days at the beginning of November for a major OS upgrade, so I'm just catching up with the last dozen-or-so Mersenne digests. The mailing list server also bounced me after trying and failing to deliver several M-digests, so several postings I sent out in the last few weeks also got bounced, without my receiving a notice to that effect. So I'll repost and address some things I came across in the last three weeks' digests, all in one smorgasbord message; please pardon me if it seems a tad schizophrenic:

On 11/2, Peter Montgomery wrote:

Let's assume that the FFT computation loses 12 bits of significance for FFT lengths in the range of interest. Now consider an LL test on Mp where p ~= 5 million:

    Mantissa  Output              FFT
    bits      bits      Radix     length
     53        41       2^11      524288   (Alpha)
     64        52       2^16      327680   (Pentium)
    110        98       2^40      131072   (Proposed)

While Peter's numbers are, strictly speaking, correct, the conclusions are misleading. First off, he allows the Pentium to use a non-power-of-2 FFT length, but not the Alpha. In fact, with 53 mantissa bits, even using a power-of-two length of 2^18 = 262144 (more familiarly referred to as "256K" by GIMPSers), one can test exponents to around 5.2 million.

Second, Prime95 gains the benefit of x86 80-bit floating operands only while data are in registers; all floating loads and stores from/to memory are on 64-bit IEEE floating data. As George explained to me a long time ago, this is because 80-bit loads and stores, while available on the x86, take twice as long as 64-bit loads and stores, and since using 80-bit operands throughout would cut the FFT length by less than a factor of 2, it's not worth doing it that way. Compared to strict 64-bit reals, the extra accuracy gained by 80-bits-only-while-in-registers arithmetic is marginal: for N=256K, Prime95 can test exponents up to 5.26 million, only around 1% larger than a true 64-bit program can do.
* * *

{originally attempted to send on 11/17:} Subject: Re: the Halloween memoranda

A tad off-topic, but after all, GIMPS is also in many ways an open-source software (OSS) project: http://www.opensource.org/halloween.html And, of course, author Eric Raymond is a GIMPSer. Keep up the good work, Eric!

* * *

On 11/2, George Woltman wrote:

I'd like to challenge the readers to produce the best algorithm and/or code for finding factors of Mersenne numbers greater than 64 bits.

On reading this, I apparently had the same thought as Conrad Curry: "Such an algorithm already exists - it's called the Pollard p-1 method." In reply, I propose a COUNTERCHALLENGE: using reasonable computational cost estimates and known statistics regarding the size distribution of factors, convince us that sieving is a more efficient way of finding factors > 64 bits than p-1, for exponents of current interest (say, > 1 million).

With regard to the p-1 cost estimate: assume that stage 1 uses a precomputed product of small prime powers (i.e. a left-to-right binary exponentiation, costing the same as one LL test iteration per product bit) and that stage 2 is based on a fast polynomial evaluation, with a circular convolution length yielding a reasonable memory requirement, say no more than 256MB of RAM. Assume the cost of doing the GCDs is trivial.

I think you'll see that the only advantage of sieving is certainty: if we sieve to B bits without finding a factor, we can confidently say that there are no factors < 2^B. What we can say with p-1 (or ECM, for that matter) is probabilistic. But if I can find 90% of factors < 2^B using p-1 while doing half the work of sieving, that's a trade I'll make any day. Cheers, Ernst
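[Editorial aside: the stage-1 powering Ernst describes can be sketched in a few lines of Python. The helper names and the tiny bounds below are purely illustrative, not taken from any GIMPS code; a real implementation would use the precomputed prime-power product and a fast multiply.]

```python
from math import gcd

def primes_up_to(n):
    """Simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [i for i, is_p in enumerate(sieve) if is_p]

def pminus1_stage1(p, B1, base=3):
    """Stage 1 of the p-1 method applied to M_p = 2^p - 1.  Every
    factor q of M_p has the form 2*k*p + 1, so q - 1 is always
    divisible by 2p; we fold that in, then raise to each prime power
    <= B1 (conceptually the 'precomputed product of small prime
    powers').  A factor q is exposed by the GCD whenever q - 1
    divides the accumulated exponent."""
    Mp = (1 << p) - 1
    a = pow(base, 2 * p, Mp)
    for q in primes_up_to(B1):
        qe = q
        while qe * q <= B1:
            qe *= q
        a = pow(a, qe, Mp)
    g = gcd(a - 1, Mp)
    return g if 1 < g < Mp else None
```

As a toy example, M_11 = 2047 = 23 * 89, and 23 - 1 = 2 * 11, so even B1 = 3 suffices for pminus1_stage1(11, 3) to recover the factor 23.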
Re: Mersenne: RE: any large groups that run GIMPS software on corporate computers?
Hi Lars and list members, At 06:13 PM 11/24/98 +0100, Lars wrote:

But sometimes it happened that mprime went mad due to the fact that it had to write intermediate files onto an nfs mounted remote disk, which sometimes failed because the network wasn't properly set up. It then tended to steal more cycles than would be expected from a 'nice' process (and didn't produce anything sensible anymore).

Are there any Linux experts who can clue me in on what might have gone wrong and what code I would need to prevent this in the future?

I don't know if he was aware of this problem, but finally I had to obey because he insisted that his machine runs slower if mprime runs on it,

I'm sure Lars, a long-time GIMPS participant, knows this, but others might benefit: the time between disk writes is adjustable; the default is 30 minutes. Lars was running mprime long before this was adjustable. Version 17 now supports a Time= option in the prime.ini file. This can be used to make mprime sleep during the work day. This might be enough to convince an administrator that mprime is safe. If there are other ideas on how to make mprime or Prime95 less intrusive, please forward them to me by private email. Regards, George
RE: Mersenne: A challenge
I have a reasonable algorithm for computing p^2 mod f. I have a sketch of the code sufficient to estimate timings, but it hasn't yet been fully coded or tested. It's *much* more complex than the simple code I posted yesterday. The approximate timing is 3,250 clocks. It will do 2 * p^2 mod f for only about 15 clocks extra. It will go to 96 bits for an extra 600 clocks. There is scope for a version limited to 72 bits which would be about 250 clocks faster than the 80-bit version.

The algorithm consists of two parts: effectively, a long multiplication in base 2^32 and a long division in base 2. The long multiplication gains very little or nothing from any restriction on the value of f, so a full 96-bit long multiplication is done. Let p0 be the low 32 bits of p, p1 the middle 32 bits and p2 the high 32 bits. When I say "word" I mean 32 bits.

1. Using the FPU, calculate p0*p0, p0*p1, p0*p2, p1*p1, p1*p2 and p2*p2, storing each result in a separate 64-bit work area. This appears to be best done using the floating point unit on the processor. Everything else uses only the ALU.

2. Form the result in a 192-bit "accumulator" as follows:

   1st word = low word of p0*p0
   2nd word = high word of p0*p0 + 2*low word of p0*p1
   3rd word = 2*high word of p0*p1 + low word of p1*p1 + 2*low word of p0*p2
   4th word = high word of p1*p1 + 2*high word of p0*p2 + 2*low word of p1*p2
   5th word = 2*high word of p1*p2 + low word of p2*p2
   6th word = high word of p2*p2

propagating all carries from one word to the next. This accumulator has to be stored in memory; there aren't enough registers :-( From now on, when I say "accumulator", I mean this memory buffer, not a CPU register. (There's scope for optimization here: we can start forming the result while the FPU is still working on the later partials; this should allow some of the operations to be "free", since independent ALU and FPU operations are done in parallel.)

3.
If f has 80 or fewer significant bits (high 16 bits of f2 = 0), left shift the accumulator by 16 bits and set the division loop count to 80; else set the division loop count to 96. (Scope for further optimization: if f has 72 or fewer significant bits, or more than 80 but no more than 88 significant bits, left shift the accumulator by an additional 8 bits and reduce the division loop count by 8.) Note: f is frequently referenced in the division loop, so store it in registers. The loop count is infrequently referenced and can be stored in memory.

4. Repeat until the division loop count = 0:
   4a. Subtract f from the high 96 bits of the accumulator, if this is possible without generating a borrow from the MSB.
   4b. Left shift the whole accumulator by 1 bit.
   4c. Decrement the division loop count.

5. If, but only if, the operation required is 2*p^2 mod f, subtract f from the high 96 bits of the accumulator, if this is possible without generating a borrow from the MSB.

The required result is now in the high 96 bits of the accumulator.

The bulk of the time is spent in the loop in step 4. Although the exact timing of each iteration depends on whether or not the subtraction "goes", the effect is small. There are two problems: speculative precalculation is not helpful, since each instruction depends heavily on the previous one, and rotating 192 bits through memory requires 6 loads, 6 rotates and 6 stores. I estimate the loop timing at 1 clock per instruction, not including the rotates, plus 18 clocks for the rotates. It comes out to approx. 39 clocks per iteration. Keeping f in memory and using CPU registers for the 3 most significant words of the accumulator may be better.
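[Editorial aside: since the arithmetic above is described at the register level, a high-level model may help readers check the bookkeeping. The Python sketch below is illustrative code, not the poster's x86 implementation: it mirrors steps 1-2 (the six 32-bit partial products summed into a 192-bit accumulator with carry propagation) and steps 3-4 (the conditional-subtract-then-shift division loop). Step 5's extra conditional subtract, used for the 2*p^2 mod f variant, is omitted here; note that the loop as described leaves the high 96 bits congruent to p^2 mod f but possibly not fully reduced below f.]

```python
MASK32 = (1 << 32) - 1

def square_192(p):
    """Steps 1-2: square a 96-bit p via its 32-bit words p0 (low),
    p1 (middle), p2 (high), forming the six partial products and
    summing them into a 192-bit accumulator, propagating carries
    from each word to the next exactly as in the post."""
    p0, p1, p2 = p & MASK32, (p >> 32) & MASK32, p >> 64
    p00, p01, p02 = p0 * p0, p0 * p1, p0 * p2
    p11, p12, p22 = p1 * p1, p1 * p2, p2 * p2
    lo = lambda x: x & MASK32
    hi = lambda x: x >> 32
    words = [lo(p00),                                      # 1st word
             hi(p00) + 2 * lo(p01),                        # 2nd word
             2 * hi(p01) + lo(p11) + 2 * lo(p02),          # 3rd word
             hi(p11) + 2 * hi(p02) + 2 * lo(p12),          # 4th word
             2 * hi(p12) + lo(p22),                        # 5th word
             hi(p22)]                                      # 6th word
    acc, carry = 0, 0
    for i, w in enumerate(words):    # propagate carries word to word
        w += carry
        acc |= (w & MASK32) << (32 * i)
        carry = w >> 32
    return acc                       # equals p * p

def square_mod_96(p, f):
    """Steps 3-4: reduce the 192-bit accumulator against a <= 96-bit f
    by conditional subtract-then-shift ('long division in base 2').
    After 96 total bit positions (pre-shift plus loop passes), the
    high 96 bits hold a value congruent to p^2 mod f in [0, 2f)."""
    assert 0 <= p < f < (1 << 96)
    acc = square_192(p)
    if f < (1 << 80):                # step 3: f has 80 or fewer bits
        acc <<= 16
        count = 80
    else:
        count = 96
    for _ in range(count):           # step 4
        if (acc >> 96) >= f:         # 4a: subtract only if no borrow
            acc -= f << 96
        acc <<= 1                    # 4b: shift whole accumulator
    return acc >> 96                 # result in the high 96 bits
```

A quick way to sanity-check the word-by-word bookkeeping is to compare square_192(p) against p*p for assorted 96-bit values, and to verify that square_mod_96(p, f) is congruent to p^2 mod f for both the pre-shifted (f below 2^80) and full-length loop paths.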