Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread Niels Möller
David Miller writes:

>> So compared to add_n, you just get an additional xor with -1 in the loop
>> (and not on the loop's critical path). I can't guess whether or not that
>> will be visible in the execution time.
>
> Thanks I'll give this a try!

And on second thought, there's no need to handle
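For readers following along, here is a minimal C sketch of the idea as I read it from this exchange: a subtraction built from an add-with-carry loop plus one "xor with -1" per limb, using a - b = a + ~b + 1. The message is cut off, so the exact scheme proposed may differ. The names `limb_t` and `sub_n_via_add` are hypothetical and are not GMP identifiers.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for a 64-bit limb type */

/* Hypothetical helper: rp[] = ap[] - bp[] over n limbs, computed as
   ap[] + ~bp[] + 1. Returns the borrow out (0 or 1). */
limb_t sub_n_via_add(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t carry = 1;                     /* the "+ 1" of the two's complement */
    for (size_t i = 0; i < n; i++) {
        limb_t nb = bp[i] ^ (limb_t)-1;   /* xor with -1, i.e. bitwise complement */
        limb_t s = ap[i] + nb;
        limb_t c1 = s < nb;               /* carry out of ap[i] + ~bp[i] */
        limb_t r = s + carry;
        limb_t c2 = r < carry;            /* carry out of adding the carry-in */
        rp[i] = r;
        carry = c1 | c2;                  /* at most one of c1, c2 can be set */
    }
    return carry ^ 1;                     /* borrow is the inverted final carry */
}
```

The complement does not feed the carry chain directly, which matches the remark that the extra xor is not on the loop's critical path.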

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread David Miller
From: ni...@lysator.liu.se (Niels Möller)
Date: Fri, 04 Jan 2013 22:29:58 +0100

> David Miller writes:
>
>> If it's needed for sub_n, then yes that's a bit difficult. I was
>> trying to figure out ways to fabricate the needed calculations
>> using just subcc and addxc/addxcc but haven't come up

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread Niels Möller
David Miller writes:

> If it's needed for sub_n, then yes that's a bit difficult. I was
> trying to figure out ways to fabricate the needed calculations
> using just subcc and addxc/addxcc but haven't come up with anything
> just yet.

You could always do the two's complement of one of the opera

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread David Miller
From: Torbjorn Granlund
Date: Fri, 04 Jan 2013 15:17:11 +0100

> I expect them to add 3n/2 to 3n cycles, depending on the pipeline
> characteristics.

Each load can issue in 1 cycle with a 4-cycle latency, and the loads will fully pipeline. Therefore the overhead is around 3n.

> The Oracle manu

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread David Miller
From: Torbjorn Granlund
Date: Fri, 04 Jan 2013 14:54:15 +0100

> (For modexp, I assume one can stay in registers, making this
> overhead small when using a large exponent, such as RSA
> signing/decryption.)

The montmul and montsqr instructions are meant to be used in a sort of byte-code'ish way.

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread David Miller
From: Torbjorn Granlund
Date: Fri, 04 Jan 2013 14:54:15 +0100

> Did you add umulxhi use in your patch from a few days ago?

Yes, I did use mulx/umulxhi (both T3 and T4 have umulxhi), and yes, the multiplies do pipeline on T4 (they don't on T3), and it gets about 4 cycles per limb in a two-way unroll

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread David Miller
From: bodr...@mail.dm.unipi.it
Date: Fri, 4 Jan 2013 14:12:23 +0100 (CET)

> On Fri, 4 January 2013 10:07 am, David Miller wrote:
>> mpmul 3 ! The immediate field is "N - 1"
>
> Does the immediate mean that, to write e.g. sqr_basecase (it should be
> far simpler

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread Torbjorn Granlund
bodr...@mail.dm.unipi.it writes:

  Does the immediate mean that, to write e.g. sqr_basecase (it should be
  far simpler than writing mul_basecase), you need a branch for each
  different N?

Since you have to preload a (weird) set of hardwired registers, one will really need special code for ev

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread Torbjorn Granlund
I took a brief look at the definition of these instructions. It is clear that they did not consult an expert in the area. They also added DES instructions now (in 2012). They added a few useful instructions, addxc/addxcc and umulxhi. The former is a 64-bit addition with useful carry in and out
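As a rough illustration of the two useful primitives named above, here is a portable C model of what they compute: a 64-bit add with carry in and carry out, and the high 64 bits of an unsigned 64x64-bit product. The function names `add_with_carry` and `mul_hi` are hypothetical. This sketch only models the arithmetic, not the SPARC condition-code semantics or instruction encodings.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit add with carry in and out:
   *sum = a + b + cin (mod 2^64), return value is the carry out (0 or 1).
   cin is expected to be 0 or 1. */
uint64_t add_with_carry(uint64_t a, uint64_t b, uint64_t cin, uint64_t *sum)
{
    uint64_t s = a + b;
    uint64_t c = s < a;        /* carry out of a + b */
    uint64_t r = s + cin;
    c |= r < s;                /* carry out of adding the carry-in */
    *sum = r;
    return c;
}

/* Hypothetical model of a "high half" multiply: the upper 64 bits of the
   128-bit unsigned product a * b, built from four 32x32 -> 64 products. */
uint64_t mul_hi(uint64_t a, uint64_t b)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    /* Sum of everything that lands in bits 32..63, plus the carry from below;
       three values below 2^32 cannot overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

Together with an ordinary low-half multiply (`a * b` in C), these are the building blocks of a limb-by-limb multiply loop: each step produces a low and a high product word and folds them into the running result through the carry chain.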

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread bodrato
Hi,

On Fri, 4 January 2013 10:07 am, David Miller wrote:
> mpmul 3 ! The immediate field is "N - 1"

Does the immediate mean that, to write e.g. sqr_basecase (it should be far simpler than writing mul_basecase), you need a branch for each different N?

> The c

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread David Miller
From: ni...@lysator.liu.se (Niels Möller)
Date: Fri, 04 Jan 2013 09:10:30 +0100

> David Miller writes:
>
>> That's why realistically I'll probably only use mpmul for 3x3 and
>> larger.
>
> So, e.g., an mpn_addmul_4 would make sense (and up to mpn_addmul_32, if
> you want to make maximal use of

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread Niels Möller
David Miller writes:

> That's why realistically I'll probably only use mpmul for 3x3 and
> larger.

So, e.g., an mpn_addmul_4 would make sense (and up to mpn_addmul_32, if you want to make maximal use of mpmul...)? I don't know anything about these sparc instructions beyond what you're explaining

Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

2013-01-04 Thread David Miller
From: ni...@lysator.liu.se (Niels Möller)
Date: Fri, 04 Jan 2013 08:48:21 +0100

> David Miller writes:
>
>> Just FYI, I'm also working on an mpn_mul_basecase that makes use of
>> the T4 'mpmul' instruction which can do NxN 64-bit limb multiplies
>> for values of N from 1 to 32.
>
> It might mak