ni...@lysator.liu.se (Niels Möller) writes:
> For concreteness, say we attempt to use a size 16 FFT, over the
> coefficient ring 2^{12}+1 (it's some years since I last looked carefully
> at FFT-based multiply, but I hope those parameters make sense. The FFT
> inputs are 4 bits, but grow to 12 in th
"Wesley W. Terpstra" writes:
> I was not proposing it for 64x64!!!
> I have no idea when NTT would be more efficient.
> Maybe like 2048x2048 or something.
At least in software, the threshold where FFT based multiply becomes
useful is a lot lower for wraparound mul than for regular mul.
> The ga
On Fri, Aug 22, 2014 at 2:06 PM, Niels Möller wrote:
> But a 64-bit mul is usually not single cycle?
Of course not. I'm not sure what you're driving at here.
If mullo and mulhi both have the same clock and are both 3 cycles
deep, it does not matter if one has slightly lower latency between two
r
"Wesley W. Terpstra" writes:
> Nah. You use the same clock for the whole ALU, so your clock rate is
> limited by the same critical path for both circuits. Thus, mullo is
> actually no faster, despite being potentially two gates shorter.
But a 64-bit mul is usually not single cycle? On typical x8
On Fri, Aug 22, 2014 at 11:46 AM, Wesley W. Terpstra wrote:
>> H = (x - L) mod (B-1)
>
> So let me get this straight. Your plan is to compute x=a*b mod (B-1).
> In other words, you sum 64 columns of 64-rows in a square shape
> instead of the usual parallelogram. Then you throw the twos complemen
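My reading of the scheme above, as a plain-C illustration (not code from this thread): with B = 2^64 we have B ≡ 1 (mod B-1), so the double-word product hi*B + lo collapses to hi + lo, with one wrap for the carry.

```c
#include <stdint.h>

/* Sketch of the wraparound reduction: a*b mod (2^64 - 1).
   Since 2^64 ≡ 1 (mod 2^64 - 1), the product hi*2^64 + lo
   reduces to hi + lo, folding any carry back in. */
static uint64_t mulmod_Bm1(uint64_t a, uint64_t b)
{
    unsigned __int128 x = (unsigned __int128)a * b;
    uint64_t lo = (uint64_t)x;
    uint64_t hi = (uint64_t)(x >> 64);
    uint64_t s  = lo + hi;
    if (s < lo)      /* carry out of the sum: 2^64 ≡ 1, so add it back */
        s += 1;
    /* caveat: 0 and 2^64-1 both represent the zero residue class */
    return s;
}
```

Note the second increment cannot overflow: when a carry occurs, s ≤ 2^64 - 2.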
On Thu, Aug 21, 2014 at 11:12 PM, Niels Möller wrote:
> One *can* reduce a piece at the right end to a single row
... yeah, but Wallace trees are essentially ripple adders at the
right end, i.e. slow. So you don't use them for that. Maybe after you
get it down to 3 rows for all columns (ie: the middl
ni...@lysator.liu.se (Niels Möller) writes:
> "Wesley W. Terpstra" writes:
>
>>> But maybe you can do a circuit for mulhi which is significantly simpler
>>> than for a widening multiplication, because you only need to collect the
>>> carry out from the low partial products? Something like a Walla
Back to the original question: best way to implement carry.
Torbjörn has a 3-operand adder:
d = a + b + (c&1)
I agree that works to implement carry, but I can't help but notice
that it costs me virtually nothing to implement that as a straight-up
ternary add (if I have 3-operand support already):
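For illustration, a software model of chaining word additions through such a 3-operand add, with the carry kept in the low bit of an ordinary register (function names here are mine, not from the thread):

```c
#include <stdint.h>

/* Model of the proposed d = a + b + (c & 1) instruction. */
static uint64_t add3(uint64_t a, uint64_t b, uint64_t c, uint64_t *carry_out)
{
    uint64_t s  = a + b;
    uint64_t c1 = s < a;        /* carry from a + b */
    uint64_t d  = s + (c & 1);
    uint64_t c2 = d < s;        /* carry from adding the carry-in */
    *carry_out  = c1 | c2;      /* at most one of the two can be set */
    return d;
}

/* Multi-word addition r = x + y, chaining the carry through add3. */
static uint64_t mpn_add_model(uint64_t *r, const uint64_t *x,
                              const uint64_t *y, int n)
{
    uint64_t cy = 0;
    for (int i = 0; i < n; i++)
        r[i] = add3(x[i], y[i], cy, &cy);
    return cy;
}
```

The point of the model: the carry never needs a dedicated flag register, just any register whose low bit is significant.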
"Wesley W. Terpstra" writes:
>> But maybe you can do a circuit for mulhi which is significantly simpler
>> than for a widening multiplication, because you only need to collect the
>> carry out from the low partial products? Something like a Wallace tree
>> where you drop the low output bit from a
"Wesley W. Terpstra" writes:
> I'm going to retract my out-of-hand rejection of this idea.
Depends on the size, I guess. For 64x64, my example scheme would reduce
the number of AND gates from 4096 to about half. Maybe that's not worth
the effort, since there's also going to be quite a lot of add
On Wed, Aug 20, 2014 at 10:59 AM, Niels Möller wrote:
> One could build it out of smaller blocks, say we have combinatorial 8x8
> multipliers. We then need 64 of those, in sequence or parallel, and a
> bunch of adders. Or we could use Karatsuba, to do a 64x64 using 27 8x8
> multiplies and a bunch
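The 27 comes from three splitting levels (64 → 32 → 16 → 8 bits), Karatsuba turning each level's four multiplies into three, so 3^3 = 27 base multiplies. A small C model of mine that both multiplies and counts; half-sum carries are folded back with shifts and adds, costing no extra multiplies:

```c
#include <stdint.h>

/* Recursive Karatsuba down to 8x8 base multiplies, counting them. */
static unsigned __int128 kara(uint64_t a, uint64_t b, int bits,
                              unsigned *nmul)
{
    if (bits == 8) {                      /* one combinatorial 8x8 multiply */
        ++*nmul;
        return (unsigned __int128)a * b;
    }
    int h = bits / 2;
    uint64_t mask = (1ULL << h) - 1;
    uint64_t a0 = a & mask, a1 = a >> h;
    uint64_t b0 = b & mask, b1 = b >> h;
    unsigned __int128 z0 = kara(a0, b0, h, nmul);
    unsigned __int128 z2 = kara(a1, b1, h, nmul);
    /* half-sums may carry one bit; keep it outside the recursion */
    uint64_t sa = (a0 + a1) & mask, ca = (a0 + a1) >> h;
    uint64_t sb = (b0 + b1) & mask, cb = (b0 + b1) >> h;
    unsigned __int128 m = kara(sa, sb, h, nmul);
    m += ((unsigned __int128)(ca ? sb : 0) + (cb ? sa : 0)) << h;
    m += (unsigned __int128)(ca & cb) << (2 * h);
    unsigned __int128 z1 = m - z0 - z2;   /* the cross term a0*b1 + a1*b0 */
    return z0 + (z1 << h) + (z2 << (2 * h));
}
```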
On Wed, Aug 20, 2014 at 10:59 AM, Niels Möller wrote:
> I've been thinking that if you want the full two-word result of a
> multiply, and compute each part separately using mulhi and mullo, the
> mulhi operation will have to compute the full product anyway, discarding
> the low half, and then the
"Wesley W. Terpstra" writes:
> On Sun, Aug 17, 2014 at 5:05 PM, Torbjörn Granlund wrote:
>> For multiply, d = (a * b + c) mod B (B being the word base) and d = [(a
>> * b + c) / B] are very useful.
>
> I see the benefit to the mulhadd variant (=> no hardware division).
> I'm not convinced by the
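In plain C the two proposed operations (with B = 2^64) can be modeled as below; note that a*b + c can never overflow two words, since (B-1)^2 + (B-1) = B(B-1) < B^2. The names mulladd/mulhadd are just labels for this sketch:

```c
#include <stdint.h>

/* d = (a*b + c) mod B, the low word. */
static uint64_t mulladd(uint64_t a, uint64_t b, uint64_t c)
{
    return (uint64_t)((unsigned __int128)a * b + c);
}

/* d = (a*b + c) / B, the high word. */
static uint64_t mulhadd(uint64_t a, uint64_t b, uint64_t c)
{
    return (uint64_t)(((unsigned __int128)a * b + c) >> 64);
}
```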
So, yeah, we're drifting pretty far off-topic now. If you have
follow-up questions that you think aren't interesting for gmp-devel,
just send them to me privately.
On Mon, Aug 18, 2014 at 2:35 PM, Niels Möller wrote:
> I think the decoder could implement the prefix instruction, as I've
> defined
On Mon, Aug 18, 2014 at 9:01 AM, Niels Möller wrote:
> I don't think it has to be that bad. First, the prefix flag and register
> should be saved and restored on irq, so there should be no problem with
> irq:s or page faults and the like in the "middle" of an instruction.
Yes, that's the advantag
ni...@lysator.liu.se (Niels Möller) writes:
I also don't know the ARM internals. But short latency between carry in
and carry out is important to make the umaal and umlal instructions
useful for bignum multiplication.
One usually cannot have a critical-path addend for a*b+c operations.
But
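For reference, ARM's umaal computes {hi,lo} = a*b + lo + hi on 32-bit words, and is exact because (2^32-1)^2 + 2(2^32-1) = 2^64 - 1. A plain-C model (my own, not ARM's definition verbatim):

```c
#include <stdint.h>

/* Model of ARM umaal: the double-word result of a*b plus both
   accumulands always fits, with no carry out possible. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}
```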
"Wesley W. Terpstra" writes:
> On Mon, Aug 18, 2014 at 9:01 AM, Niels Möller wrote:
>> I don't think it has to be that bad. First, the prefix flag and register
>> should be saved and restored on irq, so there should be no problem with
>> irq:s or page faults and the like in the "middle" of an in
On Sun, Aug 17, 2014 at 10:00 PM, Niels Möller wrote:
> t...@gmplib.org (Torbjörn Granlund) writes:
>> My work is available here: https://gmplib.org/~tege/fisa.pdf
Thank you for the link.
> And mine at http://www.lysator.liu.se/~nisse/misc/instr16.pdf (a toy
> project, trying to squeeze a decent
"Wesley W. Terpstra" writes:
> The restriction of having B as both input and output is
> not particularly onerous, IMO.
I agree. If it allows instructions to be squeezed into 16-bit opcodes
rather than 32-bit, then I'd expect that to more than make up for the
occasional extra mov instructions ne
"Wesley W. Terpstra" writes:
>> For multiply, d = (a * b + c) mod B (B being the word base) and d = [(a
>> * b + c) / B] are very useful.
>
> I see the benefit to the mulhadd variant (=> no hardware division).
> I'm not convinced by the mulladd variant.
Interesting, I hadn't thought about this l
t...@gmplib.org (Torbjörn Granlund) writes:
> My work is available here: https://gmplib.org/~tege/fisa.pdf
And mine at http://www.lysator.liu.se/~nisse/misc/instr16.pdf (a toy
project, trying to squeeze a decent instruction set with 16 general
purpose registers into 16-bit opcodes).
> I think a
On Sun, Aug 17, 2014 at 5:05 PM, Torbjörn Granlund wrote:
> My work is available here: https://gmplib.org/~tege/fisa.pdf
Thanks for the link. Currently, I am working on my processor's
architecture and its performance. Thus, I have not yet committed to a
specific ISA. In fact, I have been building
Both Niels and I have looked into ISAs which support GMP operations
well.
My work is available here: https://gmplib.org/~tege/fisa.pdf
You're right that "umulhi" and addition as well as subtraction with
carry/borrow are critical operations. And for multiply, throughput is
more important than low
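Those operations combine exactly as in GMP's inner loops; for instance an addmul_1-style loop built from a widening multiply plus carry propagation (a C model of mine, using __int128 to stand in for the mullo/mulhi pair):

```c
#include <stdint.h>

/* r[0..n-1] += x[0..n-1] * y; returns the carry-out word.
   This is the shape of GMP's mpn_addmul_1 inner loop, where
   umulhi throughput and carry handling dominate. */
static uint64_t addmul_1_model(uint64_t *r, const uint64_t *x,
                               int n, uint64_t y)
{
    uint64_t cy = 0;
    for (int i = 0; i < n; i++) {
        unsigned __int128 t = (unsigned __int128)x[i] * y + r[i] + cy;
        r[i] = (uint64_t)t;        /* mullo part */
        cy   = (uint64_t)(t >> 64);/* mulhi part, carried to next limb */
    }
    return cy;
}
```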
Good evening,
As a hobby project I've been building an out-of-order superscalar CPU
for use in FPGAs. Everything has been going pretty smoothly, but I'd
like to be able to support gmplib efficiently on it. To keep the issue
stage small+fast, instructions on this machine can only accept two
registe