Re: Best way to carry on 2-input architecture?

2014-08-22 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: > For concreteness, say we attempt to use a size 16 FFT, over the > cofficient ring 2^{12}+1 (it's some years since I last looked carefully > at FFT-based multiply, but I hope those parameters make sense. The FFT > inputs are 4 bits, but grow to 12 in th

Re: Best way to carry on 2-input architecture?

2014-08-22 Thread Niels Möller
"Wesley W. Terpstra" writes: > I was not proposing it for 64x64!!! > I have no idea when NTT would be more efficient. > Maybe like 2048x2048 or something. At least in software, the threshold where FFT based multiply becomes useful is a lot lower for wraparound mul than for regular mul. > The ga

Re: Best way to carry on 2-input architecture?

2014-08-22 Thread Wesley W. Terpstra
On Fri, Aug 22, 2014 at 2:06 PM, Niels Möller wrote: > But a 64-bit mul is usually not single cycle? Of course not. I'm not sure what you're driving at here. If mullo and mulhi both have the same clock and are both 3 cycles deep, it does not matter if one has slightly lower latency between two r

Re: Best way to carry on 2-input architecture?

2014-08-22 Thread Niels Möller
"Wesley W. Terpstra" writes: > Nah. You use the same clock for the whole ALU, so your clock rate is > limited by the same critical path for both circuits. Thus, mullo is > actually no faster, despite being potentially two gates shorter. But a 64-bit mul is usually not single cycle? On typical x8

Re: Best way to carry on 2-input architecture?

2014-08-22 Thread Wesley W. Terpstra
On Fri, Aug 22, 2014 at 11:46 AM, Wesley W. Terpstra wrote: >> H = (x - L) mod (B-1) > > So let me get this straight. Your plan is to compute x=a*b mod (B-1). > In other words, you sum 64 columns of 64-rows in a square shape > instead of the usual parallelogram. Then you throw the twos complemen

Re: Best way to carry on 2-input architecture?

2014-08-22 Thread Wesley W. Terpstra
On Thu, Aug 21, 2014 at 11:12 PM, Niels Möller wrote: > One *can* reduce a piece at the right end to a single row ... yeah, but Wallace trees are essentially ripple adders at the right-end = slow. So you don't use them for that. Maybe after you get it down to 3 rows for all columns (ie: the middl

Re: Best way to carry on 2-input architecture?

2014-08-21 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: > "Wesley W. Terpstra" writes: > >>> But maybe you can do a circuit for mulhi which is significantly simpler >>> than for a widening multiplication, because you only need to collect the >>> carry out from the low partial products? Something like a Walla

Re: Best way to carry on 2-input architecture?

2014-08-21 Thread Wesley W. Terpstra
Back to the original question: best way to implement carry. Torbjorn has a 3-operand adder: d = a + b + (c&1) I agree that works to implement carry, but I can't help but notice that it costs me virtually nothing to implement that as a straight-up ternary add (if I have 3-operand support already):

Re: Best way to carry on 2-input architecture?

2014-08-20 Thread Niels Möller
"Wesley W. Terpstra" writes: >> But maybe you can do a circuit for mulhi which is significantly simpler >> than for a widening multiplication, because you only need to collect the >> carry out from the low partial products? Something like a Wallace tree >> where you drop the low output bit from a

Re: Best way to carry on 2-input architecture?

2014-08-20 Thread Niels Möller
"Wesley W. Terpstra" writes: > I'm going to retract my out-of-hand rejection of this idea. Depends on the size, I guess. For 64x64, my example scheme would reduce the number of and gates from 4096 to about half. Maybe that's not worth the effort, since there's also going to be quite a lot of add

Re: Best way to carry on 2-input architecture?

2014-08-20 Thread Wesley W. Terpstra
On Wed, Aug 20, 2014 at 10:59 AM, Niels Möller wrote: > One could build it out of smaller blocks, say we have combinatorial 8x8 > multipliers. We then need 64 of those, in sequence or parallel, and a > bunch of adders. Or we could use Karatsuba, to do a 64x64 using 27 8x8 > multiplies and a bunch

Re: Best way to carry on 2-input architecture?

2014-08-20 Thread Wesley W. Terpstra
On Wed, Aug 20, 2014 at 10:59 AM, Niels Möller wrote: > I've been thinking that if you want the full two-word result of a > multiply, and compute each part separately using mulhi and mullo, the > mulhi operation will have to compute the full product anyway, discarding > the low half, and then the

Re: Best way to carry on 2-input architecture?

2014-08-20 Thread Niels Möller
"Wesley W. Terpstra" writes: > On Sun, Aug 17, 2014 at 5:05 PM, Torbjörn Granlund wrote: >> For multiply, d = (a * b + c) mod B (B being the word base) and d = [(a >> * b + c) / B] are very useful. > > I see the benefit to the mulhadd variant (=> no hardware division). > I'm not convinced by the

Re: Best way to carry on 2-input architecture?

2014-08-18 Thread Wesley W. Terpstra
So, yeah, we're drifting pretty far off-topic now. If you have follow-up questions that you think aren't interesting for gmp-devel, just send them to me privately. On Mon, Aug 18, 2014 at 2:35 PM, Niels Möller wrote: > I think the decoder could implement the prefix instruction, as I've > defined

Re: Best way to carry on 2-input architecture?

2014-08-18 Thread Wesley W. Terpstra
On Mon, Aug 18, 2014 at 9:01 AM, Niels Möller wrote: > I don't think it has to be that bad. First, the prefix flag and register > should be saved and restored on irq, so there should be no problem with > irq:s or page faults and the like in the "middle" of an instruction. Yes, that's the advantag

Re: Best way to carry on 2-input architecture?

2014-08-18 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: I also don't know the ARM internals. But short latency between carry in and carry out is important to make the umaal and umlal instructions useful for bignum multiplication. One usually cannot have a critical-path addend for a*b+c operations. But

Re: Best way to carry on 2-input architecture?

2014-08-18 Thread Niels Möller
"Wesley W. Terpstra" writes: > On Mon, Aug 18, 2014 at 9:01 AM, Niels Möller wrote: >> I don't think it has to be that bad. First, the prefix flag and register >> should be saved and restored on irq, so there should be no problem with >> irq:s or page faults and the like in the "middle" of an in

Re: Best way to carry on 2-input architecture?

2014-08-18 Thread Wesley W. Terpstra
On Sun, Aug 17, 2014 at 10:00 PM, Niels Möller wrote: > t...@gmplib.org (Torbjörn Granlund) writes: >> My work is available here: https://gmplib.org/~tege/fisa.pdf Thank you for the link. > And mine at http://www.lysator.liu.se/~nisse/misc/instr16.pdf (a toy > project, trying to squeeze a decent

Re: Best way to carry on 2-input architecture?

2014-08-18 Thread Niels Möller
"Wesley W. Terpstra" writes: > The restriction of having B as both input and output is > not particularly onerous, IMO. I agree. If it allows instructions to be squeezed into 16-bit opcodes rather than 32-bit, then I'd expact that to more than make up for the occasional extra mov instructions ne

Re: Best way to carry on 2-input architecture?

2014-08-17 Thread Niels Möller
"Wesley W. Terpstra" writes: >> For multiply, d = (a * b + c) mod B (B being the word base) and d = [(a >> * b + c) / B] are very useful. > > I see the benefit to the mulhadd variant (=> no hardware division). > I'm not convinced by the mulladd variant. Interesting, I hadn't thought about this l

Re: Best way to carry on 2-input architecture?

2014-08-17 Thread Niels Möller
t...@gmplib.org (Torbjörn Granlund) writes: > My work is available here: https://gmplib.org/~tege/fisa.pdf And mine at http://www.lysator.liu.se/~nisse/misc/instr16.pdf (a toy project, trying to squeeze a decent instruction set with 16 general purpose registers into 16-bit opcodes). > I think a

Re: Best way to carry on 2-input architecture?

2014-08-17 Thread Wesley W. Terpstra
On Sun, Aug 17, 2014 at 5:05 PM, Torbjörn Granlund wrote: > My work is available here: https://gmplib.org/~tege/fisa.pdf Thanks for the link. Currently, I am working on my processor's architecture and its performance. Thus, I have not yet committed to a specific ISA. In fact, I have been building

Re: Best way to carry on 2-input architecture?

2014-08-17 Thread Torbjörn Granlund
Both I and Niels have looked into ISAs which support GMP operations well. My work is available here: https://gmplib.org/~tege/fisa.pdf You're right that "umulhi" and addition as well as subtraction with carry/borrow are critical operations. And for multiply throughput is more important than low

Best way to carry on 2-input architecture?

2014-08-17 Thread Wesley W. Terpstra
Good evening, As a hobby project I've been building an out-of-order superscalar CPU for use in FPGAs. Everything has been going pretty smoothly, but I'd like to be able to support gmplib efficiently on it. To keep the issue stage small+fast, instructions on this machine can only accept two registe