ni...@lysator.liu.se (Niels Möller) writes:
> For concreteness, say we attempt to use a size 16 FFT, over the
> cofficient ring 2^{12}+1 (it's some years since I last looked carefully
> at FFT-based multiply, but I hope those parameters make sense. The FFT
> inputs are 4 bits, but grow to 12 in th
"Wesley W. Terpstra" writes:
> I was not proposing it for 64x64!!!
> I have no idea when NTT would be more efficient.
> Maybe like 2048x2048 or something.
At least in software, the threshold where FFT based multiply becomes
useful is a lot lower for wraparound mul than for regular mul.
> The ga
On Fri, Aug 22, 2014 at 2:06 PM, Niels Möller wrote:
> But a 64-bit mul is usually not single cycle?
Of course not. I'm not sure what you're driving at here.
If mullo and mulhi both have the same clock and are both 3 cycles
deep, it does not matter if one has slightly lower latency between two
r
"Wesley W. Terpstra" writes:
> Nah. You use the same clock for the whole ALU, so your clock rate is
> limited by the same critical path for both circuits. Thus, mullo is
> actually no faster, despite being potentially two gates shorter.
But a 64-bit mul is usually not single cycle? On typical x8
On Fri, Aug 22, 2014 at 11:46 AM, Wesley W. Terpstra wrote:
>> H = (x - L) mod (B-1)
>
> So let me get this straight. Your plan is to compute x=a*b mod (B-1).
> In other words, you sum 64 columns of 64-rows in a square shape
> instead of the usual parallelogram. Then you throw the twos complemen
On Thu, Aug 21, 2014 at 11:12 PM, Niels Möller wrote:
> One *can* reduce a piece at the right end to a single row
... yeah, but Wallace trees are essentially ripple adders at the
right-end = slow. So you don't use them for that. Maybe after you get
it down to 3 rows for all columns (ie: the middl