Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-25 Thread kabutz
On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes  wrote:

>> kabutz has updated the pull request incrementally with one additional commit 
>> since the last revision:
>> 
>>   Removed JVM flags from benchmark
>
> To add my 2c IMO a parallel version of this type absolutely **must** be 
> opt-in. There are simply far too many side-effects of using the FJP and 
> multiple threads to perform the calculation in parallel as if it is just a 
> minor implementation detail. A clear API is 1000x better than a "kill switch".
> 
> And yes you may still need to expose some kind of tuning knob.
> 
> David

> > Hi @dholmes-ora thanks - is it possible for me to create a CSR without 
> > having access to JIRA? Or do I need to find a sponsor who can help do this 
> > on my behalf?
> 
> You will need a sponsor to do this for you. I'm a little surprised after all 
> these years that you've never even gotten Author role. :)

Most of my contributions have been finding bugs, rather than fixing them. It's 
only recently that I've started adding code to the OpenJDK. So, I'm not 
surprised at all.

-

PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-25 Thread David Holmes

On 26/11/2021 4:15 am, kabutz wrote:

On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes  wrote:


kabutz has updated the pull request incrementally with one additional commit 
since the last revision:

   Removed JVM flags from benchmark


To add my 2c IMO a parallel version of this type absolutely **must** be opt-in. There are 
simply far too many side-effects of using the FJP and multiple threads to perform the 
calculation in parallel as if it is just a minor implementation detail. A clear API is 
1000x better than a "kill switch".

And yes you may still need to expose some kind of tuning knob.

David



@dholmes-ora has indicated that a [compatibility and 
specification](https://wiki.openjdk.java.net/display/csr/Main) (CSR) request is 
needed for this pull request. @kabutz please create a 
[CSR](https://wiki.openjdk.java.net/display/csr/Main) request for issue 
[JDK-8277175](https://bugs.openjdk.java.net/browse/JDK-8277175). This pull 
request cannot be integrated until the CSR request is approved.


Hi @dholmes-ora thanks - is it possible for me to create a CSR without having 
access to JIRA? Or do I need to find a sponsor who can help do this on my 
behalf?


You will need a sponsor to do this for you. I'm a little surprised after 
all these years that you've never even gotten Author role. :)


Cheers,
David


(Sorry for this newbie question)

-

PR: https://git.openjdk.java.net/jdk/pull/6409



Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-25 Thread kabutz
On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes  wrote:

>> kabutz has updated the pull request incrementally with one additional commit 
>> since the last revision:
>> 
>>   Removed JVM flags from benchmark
>
> To add my 2c IMO a parallel version of this type absolutely **must** be 
> opt-in. There are simply far too many side-effects of using the FJP and 
> multiple threads to perform the calculation in parallel as if it is just a 
> minor implementation detail. A clear API is 1000x better than a "kill switch".
> 
> And yes you may still need to expose some kind of tuning knob.
> 
> David

> @dholmes-ora has indicated that a [compatibility and 
> specification](https://wiki.openjdk.java.net/display/csr/Main) (CSR) request 
> is needed for this pull request. @kabutz please create a 
> [CSR](https://wiki.openjdk.java.net/display/csr/Main) request for issue 
> [JDK-8277175](https://bugs.openjdk.java.net/browse/JDK-8277175). This pull 
> request cannot be integrated until the CSR request is approved.

Hi @dholmes-ora thanks - is it possible for me to create a CSR without having 
access to JIRA? Or do I need to find a sponsor who can help do this on my 
behalf?

(Sorry for this newbie question)

-

PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-22 Thread Paul Sandoz
On Fri, 19 Nov 2021 10:42:23 GMT, kabutz  wrote:

> > I think the functionality in this PR is worth pursuing, but with the JDK 18 
> > rampdown 1 date fast approaching, as a non-urgent issue, I think we 
> > shouldn't try to rush it into JDK 18.
> 
> > I looked more closely and now understand a bit more about the threshold. It 
> > would be useful to have some implementation notes detailing approximately 
> > when the parallel execution kicks in. For `a*a` it's `>= 
> > TOOM_COOK_SQUARE_THRESHOLD(216)` and for `a*b` it's `>= 
> > TOOM_COOK_THRESHOLD(240)`, so we could refer to certain bit lengths.
> > The branching factor is 4 (well, 3 I guess, but it's easier to think in 
> > powers of 2). It might be reasonable to assume the problem gets split 
> > equally in 4 parts. I don't know in practice what the depth of recursion 
> > might be; it's hard to see this getting completely out of control, but we could 
> > always keep track of the depth and cut off in proportion to the # runtime 
> > processors if need be.
> > Given the existing degree of specialization, the minimal changes to code, 
> > and the fact that the creation of recursive tasks is in the noise this PR 
> > looks quite reasonable.
> 
> I have run some more tests. For my fibonacci algorithm, here are the worst 
> cases for the various calculations.
> 
> ```
>             n    most_bits      tasks  time_ms
>         1_000          694          0        1
>        10_000        6_942          0        1
>       100_000       69_424         18       13
>     1_000_000      694_241        468      143
>    10_000_000    6_942_418     11_718     1049
>   100_000_000   69_424_191    292_968    13695
> 1_000_000_000  694_241_913  7_324_218   237282
> ```
> 
> Each data point has 10x the number of bits in the final result, and the number 
> of tasks in the final calculation is 25x more. Perhaps I can make the 
> threshold the recursive depth up to which we would run in parallel. And that 
> recursive depth could be availableProcessors() or some multiple thereof.
> 

Defense in depth :-) Perhaps a depth that is between some multiple of log2 and 
log4 of `availableProcessors()`, given the branching factor.
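One way to read that suggestion, as a rough sketch (the helper name `maxDepth` and the factor of 2 are illustrative choices, not part of the PR):

```java
public class DepthCutoff {
    // Hypothetical cutoff for parallel recursion in Toom-Cook 3, which splits
    // each operand roughly 4 ways: stop forking once the recursion depth
    // exceeds some multiple of log4(parallelism). The names and the factor
    // of 2 are illustrative, not taken from the PR.
    static int maxDepth(int parallelism) {
        if (parallelism <= 1) return 0;                              // no point forking at all
        int log2 = 31 - Integer.numberOfLeadingZeros(parallelism);   // floor(log2)
        int log4 = (log2 + 1) / 2;                                   // ~ceil(log4), branching factor 4
        return 2 * log4;                                             // "some multiple" of the log
    }

    public static void main(String[] args) {
        int p = Runtime.getRuntime().availableProcessors();
        System.out.println("parallelism=" + p + " maxDepth=" + maxDepth(p));
    }
}
```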

-

PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-19 Thread kabutz
On Wed, 17 Nov 2021 19:48:25 GMT, kabutz  wrote:

>> BigInteger currently uses three different algorithms for multiply. The 
>> simple quadratic algorithm, then the slightly better Karatsuba if we exceed 
>> a bit count and then Toom Cook 3 once we go into the several thousands of 
>> bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to 
>> parallelize it. I have demonstrated this several times in conference talks. 
>> In order to be consistent with other classes such as Arrays and Collection, 
>> I have added a parallelMultiply() method. Internally we have added a 
>> parameter to the private multiply method to indicate whether the calculation 
>> should be done in parallel.
>> 
>> The performance improvements are as should be expected. Fibonacci of 100 
>> million (using a single-threaded Dijkstra's sum of squares version) 
>> completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with 
>> the sequential multiply() method. This is on my 1-8-2 laptop. The final 
>> multiplications are with very large numbers, which then benefit from the 
>> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>> 
>> We have also parallelized the private square() method. Internally, the 
>> square() method defaults to be sequential.
>> 
>> Some benchmark results, run on my 1-6-2 server:
>> 
>> 
>> Benchmark                                        (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply          1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply         10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply        100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply 10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply 100000000   ss    4   9076.551 ± 1899.444  ms/op
>> 
>> 
>> We can see that for larger calculations (fib 100m), the execution is 2.7x 
>> faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for 
>> small (fib 1m) it is roughly the same. Considering that the fibonacci 
>> algorithm that we used was in itself sequential, and that the last 3 
>> calculations would dominate, 2.7x faster should probably be considered quite 
>> good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit 
> since the last revision:
> 
>   Removed JVM flags from benchmark

> I think the functionality in this PR is worth pursuing, but with the JDK 18 
> rampdown 1 date fast approaching, as a non-urgent issue, I think we shouldn't 
> try to rush it into JDK 18.





> I looked more closely and now understand a bit more about the threshold. It 
> would be useful to have some implementation notes detailing approximately 
> when the parallel execution kicks in. For `a*a` it's `>= 
> TOOM_COOK_SQUARE_THRESHOLD(216)` and for `a*b` it's `>= 
> TOOM_COOK_THRESHOLD(240)`, so we could refer to certain bit lengths.
> 
> The branching factor is 4 (well, 3 I guess, but it's easier to think in powers 
> of 2). It might be reasonable to assume the problem gets split equally in 4 
> parts. I don't know in practice what the depth of recursion might be; it's hard 
> to see this getting completely out of control, but we could always keep track 
> of the depth and cut off in proportion to the # runtime processors if need be.
> 
> Given the existing degree of specialization, the minimal changes to code, and 
> the fact that the creation of recursive tasks is in the noise this PR looks 
> quite reasonable.

I have run some more tests. For my fibonacci algorithm, here are the worst 
cases for the various calculations.


            n    most_bits      tasks  time_ms
        1_000          694          0        1
       10_000        6_942          0        1
      100_000       69_424         18       13
    1_000_000      694_241        468      143
   10_000_000    6_942_418     11_718     1049
  100_000_000   69_424_191    292_968    13695
1_000_000_000  694_241_913  7_324_218   237282


Each data point has 10x the number of bits in the final result, and the number 
of tasks in the final calculation is 25x more. Perhaps I can make the threshold 
the recursive depth up to which we would run in parallel. And that recursive 
depth could be availableProcessors() or some multiple thereof.


> 
> I think we can simplify the creation of the recursive tasks (assuming we 
> don't track the depth):
> 
> ```java
> private static final class RecursiveOp {
>     private static <T> RecursiveTask<T> exec(RecursiveTask<T> op, boolean parallel) {
>         if (parallel) {
>             op.fork();
>         } else {
>             op.invoke();
>         }
>         return op;
>     }
> 
>     private static RecursiveTask<BigInteger>

Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-18 Thread Paul Sandoz
On Wed, 17 Nov 2021 19:48:25 GMT, kabutz  wrote:

>> BigInteger currently uses three different algorithms for multiply. The 
>> simple quadratic algorithm, then the slightly better Karatsuba if we exceed 
>> a bit count and then Toom Cook 3 once we go into the several thousands of 
>> bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to 
>> parallelize it. I have demonstrated this several times in conference talks. 
>> In order to be consistent with other classes such as Arrays and Collection, 
>> I have added a parallelMultiply() method. Internally we have added a 
>> parameter to the private multiply method to indicate whether the calculation 
>> should be done in parallel.
>> 
>> The performance improvements are as should be expected. Fibonacci of 100 
>> million (using a single-threaded Dijkstra's sum of squares version) 
>> completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with 
>> the sequential multiply() method. This is on my 1-8-2 laptop. The final 
>> multiplications are with very large numbers, which then benefit from the 
>> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>> 
>> We have also parallelized the private square() method. Internally, the 
>> square() method defaults to be sequential.
>> 
>> Some benchmark results, run on my 1-6-2 server:
>> 
>> 
>> Benchmark                                        (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply          1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply         10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply        100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply 10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply 100000000   ss    4   9076.551 ± 1899.444  ms/op
>> 
>> 
>> We can see that for larger calculations (fib 100m), the execution is 2.7x 
>> faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for 
>> small (fib 1m) it is roughly the same. Considering that the fibonacci 
>> algorithm that we used was in itself sequential, and that the last 3 
>> calculations would dominate, 2.7x faster should probably be considered quite 
>> good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit 
> since the last revision:
> 
>   Removed JVM flags from benchmark

I looked more closely and now understand a bit more about the threshold. It 
would be useful to have some implementation notes detailing approximately when 
the parallel execution kicks in. For `a*a` it's `>= 
TOOM_COOK_SQUARE_THRESHOLD(216)` and for `a*b` it's `>= 
TOOM_COOK_THRESHOLD(240)`, so we could refer to certain bit lengths.

The branching factor is 4 (well, 3 I guess, but it's easier to think in powers of 
2). It might be reasonable to assume the problem gets split equally in 4 parts. 
I don't know in practice what the depth of recursion might be; it's hard to see 
this getting completely out of control, but we could always keep track of the 
depth and cut off in proportion to the # runtime processors if need be.
 
Given the existing degree of specialization, the minimal changes to code, and 
the fact that the creation of recursive tasks is in the noise this PR looks 
quite reasonable.

I think we can simplify the creation of the recursive tasks (assuming we don't 
track the depth):


```java
private static final class RecursiveOp {
    private static <T> RecursiveTask<T> exec(RecursiveTask<T> op, boolean parallel) {
        if (parallel) {
            op.fork();
        } else {
            op.invoke();
        }
        return op;
    }

    private static RecursiveTask<BigInteger> multiply(BigInteger a, BigInteger b, boolean parallel) {
        var op = new RecursiveTask<BigInteger>() {
            @Override
            protected BigInteger compute() {
                return a.multiply(b, true, parallel);
            }
        };

        return exec(op, parallel);
    }

    private static RecursiveTask<BigInteger> square(BigInteger a, boolean parallel) {
        var op = new RecursiveTask<BigInteger>() {
            @Override
            protected BigInteger compute() {
                return a.square(true, parallel);
            }
        };

        return exec(op, parallel);
    }
}
```

-

PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-18 Thread Brian Burkhalter


On Nov 18, 2021, at 2:41 AM, kabutz <d...@openjdk.java.net> wrote:

Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be 
necessary. BigInteger has thresholds for using different multiply algorithms 
and these are also not configurable.

Those thresholds are the result of *many* JMH runs on multiple platforms. That 
said, there is an issue on file to reevaluate them someday [1].

[1] https://bugs.openjdk.java.net/browse/JDK-8029425


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-18 Thread Bernd Eckenfels
Yes, but that does not help with the decision whether parallel should be used or 
not. However, if making the pool explicit is generally not wanted, a simple 
parallel signature without an argument would also work to make the decision 
explicit (i.e. a new API). That won't automatically tune the performance, but it 
does allow users to opt in - the majority of uses would be crypto anyway, where 
it could be used by the JCE and JSSE (maybe?).

Regards,
Bernd


--
http://bernd.eckenfels.net

From: Remi Forax 
Sent: Thursday, November 18, 2021 12:16:31 PM
To: Bernd Eckenfels 
Cc: core-libs-dev 
Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger 
[v3]

- Original Message -
> From: "Bernd Eckenfels" 
> To: "core-libs-dev" 
> Sent: Thursday, 18 November 2021 12:07:19
> Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger 
> [v3]

> What about a new API multiply method which takes a ForkJoinPool, and only if
> that is used/specified will it use the parallel mode (and only if the
> threshold applies)? Most of the pool tuning can then be done with this argument.
> It also avoids surprise threads.

You don't need it, here is the usual trick if you want to specify a specific 
fork join pool
https://stackoverflow.com/questions/21163108/custom-thread-pool-in-java-8-parallel-stream
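For reference, that trick looks roughly like this: fork/join work started from a task submitted to a dedicated ForkJoinPool stays on that pool rather than the common pool (the class name and pool size here are illustrative):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class CustomPoolTrick {
    public static void main(String[] args) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(4); // dedicated pool, not the common pool
        try {
            // Tasks forked inside a computation submitted to this pool are
            // scheduled on this same pool, so the parallel stream below does
            // not touch ForkJoinPool.commonPool().
            int sum = pool.submit(
                () -> IntStream.rangeClosed(1, 100).parallel().sum()
            ).get();
            System.out.println(sum);
        } finally {
            pool.shutdown();
        }
    }
}
```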

>
> Regards,
> Bernd

regards,
Rémi

> --
> http://bernd.eckenfels.net
> 
> From: core-libs-dev  on behalf of kabutz
> 
> Sent: Thursday, November 18, 2021 11:41:40 AM
> To: core-libs-dev@openjdk.java.net 
> Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger
> [v3]
>
> On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes  wrote:
>
>> To add my 2c IMO a parallel version of this type absolutely **must** be 
>> opt-in.
>> There are simply far too many side-effects of using the FJP and multiple
>> threads to perform the calculation in parallel as if it is just a minor
>> implementation detail. A clear API is 1000x better than a "kill switch".
>>
>> And yes you may still need to expose some kind of tuning knob.
>>
>> David
>
> Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be
> necessary. BigInteger has thresholds for using different multiply algorithms
> and these are also not configurable.
>
> -
>
> PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-18 Thread Remi Forax
- Original Message -
> From: "Bernd Eckenfels" 
> To: "core-libs-dev" 
> Sent: Thursday, 18 November 2021 12:07:19
> Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger 
> [v3]

> What about a new API multiply method which takes a ForkJoinPool, and only if
> that is used/specified will it use the parallel mode (and only if the
> threshold applies)? Most of the pool tuning can then be done with this argument.
> It also avoids surprise threads.

You don't need it, here is the usual trick if you want to specify a specific 
fork join pool
https://stackoverflow.com/questions/21163108/custom-thread-pool-in-java-8-parallel-stream

> 
> Regards,
> Bernd

regards,
Rémi

> --
> http://bernd.eckenfels.net
> 
> From: core-libs-dev  on behalf of kabutz
> 
> Sent: Thursday, November 18, 2021 11:41:40 AM
> To: core-libs-dev@openjdk.java.net 
> Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger
> [v3]
> 
> On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes  wrote:
> 
>> To add my 2c IMO a parallel version of this type absolutely **must** be 
>> opt-in.
>> There are simply far too many side-effects of using the FJP and multiple
>> threads to perform the calculation in parallel as if it is just a minor
>> implementation detail. A clear API is 1000x better than a "kill switch".
>>
>> And yes you may still need to expose some kind of tuning knob.
>>
>> David
> 
> Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be
> necessary. BigInteger has thresholds for using different multiply algorithms
> and these are also not configurable.
> 
> -
> 
> PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-18 Thread Bernd Eckenfels
What about a new API multiply method which takes a ForkJoinPool, and only if 
that is used/specified will it use the parallel mode (and only if the 
threshold applies)? Most of the pool tuning can then be done with this argument. 
It also avoids surprise threads.

Regards,
Bernd
--
http://bernd.eckenfels.net

From: core-libs-dev  on behalf of kabutz 
 
Sent: Thursday, November 18, 2021 11:41:40 AM
To: core-libs-dev@openjdk.java.net 
Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger 
[v3]

On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes  wrote:

> To add my 2c IMO a parallel version of this type absolutely **must** be 
> opt-in. There are simply far too many side-effects of using the FJP and 
> multiple threads to perform the calculation in parallel as if it is just a 
> minor implementation detail. A clear API is 1000x better than a "kill switch".
>
> And yes you may still need to expose some kind of tuning knob.
>
> David

Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be 
necessary. BigInteger has thresholds for using different multiply algorithms 
and these are also not configurable.

-

PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-18 Thread kabutz
On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes  wrote:

> To add my 2c IMO a parallel version of this type absolutely **must** be 
> opt-in. There are simply far too many side-effects of using the FJP and 
> multiple threads to perform the calculation in parallel as if it is just a 
> minor implementation detail. A clear API is 1000x better than a "kill switch".
> 
> And yes you may still need to expose some kind of tuning knob.
> 
> David

Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be 
necessary. BigInteger has thresholds for using different multiply algorithms 
and these are also not configurable.
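With the opt-in design, a caller chooses parallelism explicitly at the call site. A minimal sketch, assuming the parallelMultiply() API proposed in this PR is available (it requires a JDK build containing the change; the bit lengths and seeds are arbitrary):

```java
import java.math.BigInteger;
import java.util.Random;

public class OptInDemo {
    public static void main(String[] args) {
        // Two random ~100_000-bit numbers, big enough to cross the
        // Toom-Cook 3 threshold where parallelism can kick in.
        BigInteger a = new BigInteger(100_000, new Random(42));
        BigInteger b = new BigInteger(100_000, new Random(43));

        BigInteger seq = a.multiply(b);         // existing sequential API
        BigInteger par = a.parallelMultiply(b); // explicit opt-in to the FJP

        System.out.println(par.equals(seq));    // results must be identical
    }
}
```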

-

PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-17 Thread David Holmes
On Wed, 17 Nov 2021 19:48:25 GMT, kabutz  wrote:

>> BigInteger currently uses three different algorithms for multiply. The 
>> simple quadratic algorithm, then the slightly better Karatsuba if we exceed 
>> a bit count and then Toom Cook 3 once we go into the several thousands of 
>> bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to 
>> parallelize it. I have demonstrated this several times in conference talks. 
>> In order to be consistent with other classes such as Arrays and Collection, 
>> I have added a parallelMultiply() method. Internally we have added a 
>> parameter to the private multiply method to indicate whether the calculation 
>> should be done in parallel.
>> 
>> The performance improvements are as should be expected. Fibonacci of 100 
>> million (using a single-threaded Dijkstra's sum of squares version) 
>> completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with 
>> the sequential multiply() method. This is on my 1-8-2 laptop. The final 
>> multiplications are with very large numbers, which then benefit from the 
>> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>> 
>> We have also parallelized the private square() method. Internally, the 
>> square() method defaults to be sequential.
>> 
>> Some benchmark results, run on my 1-6-2 server:
>> 
>> 
>> Benchmark                                        (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply          1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply         10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply        100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply 10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply 100000000   ss    4   9076.551 ± 1899.444  ms/op
>> 
>> 
>> We can see that for larger calculations (fib 100m), the execution is 2.7x 
>> faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for 
>> small (fib 1m) it is roughly the same. Considering that the fibonacci 
>> algorithm that we used was in itself sequential, and that the last 3 
>> calculations would dominate, 2.7x faster should probably be considered quite 
>> good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit 
> since the last revision:
> 
>   Removed JVM flags from benchmark

To add my 2c IMO a parallel version of this type absolutely **must** be opt-in. 
There are simply far too many side-effects of using the FJP and multiple 
threads to perform the calculation in parallel as if it is just a minor 
implementation detail. A clear API is 1000x better than a "kill switch".

And yes you may still need to expose some kind of tuning knob.

David

-

PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-17 Thread kabutz
On Wed, 17 Nov 2021 19:48:25 GMT, kabutz  wrote:

>> BigInteger currently uses three different algorithms for multiply. The 
>> simple quadratic algorithm, then the slightly better Karatsuba if we exceed 
>> a bit count and then Toom Cook 3 once we go into the several thousands of 
>> bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to 
>> parallelize it. I have demonstrated this several times in conference talks. 
>> In order to be consistent with other classes such as Arrays and Collection, 
>> I have added a parallelMultiply() method. Internally we have added a 
>> parameter to the private multiply method to indicate whether the calculation 
>> should be done in parallel.
>> 
>> The performance improvements are as should be expected. Fibonacci of 100 
>> million (using a single-threaded Dijkstra's sum of squares version) 
>> completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with 
>> the sequential multiply() method. This is on my 1-8-2 laptop. The final 
>> multiplications are with very large numbers, which then benefit from the 
>> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>> 
>> We have also parallelized the private square() method. Internally, the 
>> square() method defaults to be sequential.
>> 
>> Some benchmark results, run on my 1-6-2 server:
>> 
>> 
>> Benchmark                                        (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply          1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply         10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply        100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply 10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply 100000000   ss    4   9076.551 ± 1899.444  ms/op
>> 
>> 
>> We can see that for larger calculations (fib 100m), the execution is 2.7x 
>> faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for 
>> small (fib 1m) it is roughly the same. Considering that the fibonacci 
>> algorithm that we used was in itself sequential, and that the last 3 
>> calculations would dominate, 2.7x faster should probably be considered quite 
>> good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit 
> since the last revision:
> 
>   Removed JVM flags from benchmark

One concern I had was what would happen when the common pool parallelism was 
set to 0. In the parallelSort mechanism, they only sort in parallel if the 
parallelism is at least 2. Thus a parallel sort on a dual-core machine (without 
hyperthreading) will run sequentially. However, the parallel BigInteger multiply 
seems to work OK with different common pool sizes. We always have one thread 
more than the common pool parallelism working, since the thread that calls 
parallelMultiply() also does calculations:

Showing only the cases for Fibonacci(100m):


Method            (n)          common_parallelism      ms/op
multiply          100_000_000                   0  24619.004
parallelMultiply  100_000_000                   0  24977.280
multiply          100_000_000                   1  24526.498
parallelMultiply  100_000_000                   1  18169.969
multiply          100_000_000                   2  24528.111
parallelMultiply  100_000_000                   2  11073.637
multiply          100_000_000                   3  24861.324
parallelMultiply  100_000_000                   3   9472.986
multiply          100_000_000                   4  25036.295
parallelMultiply  100_000_000                   4   9151.940
multiply          100_000_000                   5  25129.139
parallelMultiply  100_000_000                   5   8964.403
multiply          100_000_000                  11  24717.125
parallelMultiply  100_000_000                  11   9024.319
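For anyone repeating these measurements: the common-pool parallelism can be pinned via a standard system property, read once when the common pool is first initialized (this is a plain ForkJoinPool feature, not part of this PR):

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolSize {
    public static void main(String[] args) {
        // Must be set before anything touches the common pool; equivalent to
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=2 on the
        // command line.
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "2");
        // Typically prints 2, provided the common pool was not already
        // initialized earlier in the JVM's lifetime.
        System.out.println(ForkJoinPool.commonPool().getParallelism());
    }
}
```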

-

PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-17 Thread kabutz
> BigInteger currently uses three different algorithms for multiply. The simple 
> quadratic algorithm, then the slightly better Karatsuba if we exceed a bit 
> count and then Toom Cook 3 once we go into the several thousands of bits. 
> Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize it. 
> I have demonstrated this several times in conference talks. In order to be 
> consistent with other classes such as Arrays and Collection, I have added a 
> parallelMultiply() method. Internally we have added a parameter to the 
> private multiply method to indicate whether the calculation should be done in 
> parallel.
> 
> The performance improvements are as should be expected. Fibonacci of 100 
> million (using a single-threaded Dijkstra's sum of squares version) completes 
> in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the 
> sequential multiply() method. This is on my 1-8-2 laptop. The final 
> multiplications are with very large numbers, which then benefit from the 
> parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
> 
> We have also parallelized the private square() method. Internally, the 
> square() method defaults to be sequential.
> 
> Some benchmark results, run on my 1-6-2 server:
> 
> 
> Benchmark                                        (n)  Mode  Cnt      Score      Error  Units
> BigIntegerParallelMultiply.multiply          1000000    ss    4     51.707 ±   11.194  ms/op
> BigIntegerParallelMultiply.multiply         10000000    ss    4    988.302 ±  235.977  ms/op
> BigIntegerParallelMultiply.multiply        100000000    ss    4  24662.063 ± 1123.329  ms/op
> BigIntegerParallelMultiply.parallelMultiply  1000000    ss    4     49.337 ±   26.611  ms/op
> BigIntegerParallelMultiply.parallelMultiply 10000000    ss    4    527.560 ±  268.903  ms/op
> BigIntegerParallelMultiply.parallelMultiply 100000000   ss    4   9076.551 ± 1899.444  ms/op
> 
> 
> We can see that for larger calculations (fib 100m), the execution is 2.7x 
> faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for 
> small (fib 1m) it is roughly the same. Considering that the fibonacci 
> algorithm that we used was in itself sequential, and that the last 3 
> calculations would dominate, 2.7x faster should probably be considered quite 
> good on a 1-6-2 machine.

kabutz has updated the pull request incrementally with one additional commit 
since the last revision:

  Removed JVM flags from benchmark
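The Fibonacci driver described above can be sketched with the fast-doubling identities F(2k) = F(k)·(2·F(k+1) - F(k)) and F(2k+1) = F(k)² + F(k+1)², equivalent to the Dijkstra sum-of-squares scheme mentioned in the description, which reduce each step to a handful of large multiplies and squares. This is an illustrative reconstruction, not the benchmark code from the PR:

```java
import java.math.BigInteger;

public class FastFib {
    // Fast doubling: returns {F(n), F(n+1)}. The multiply() calls on large
    // operands are exactly where parallelMultiply() would be substituted.
    static BigInteger[] fib(long n) {
        if (n == 0) {
            return new BigInteger[] { BigInteger.ZERO, BigInteger.ONE };
        }
        BigInteger[] half = fib(n >> 1);
        BigInteger a = half[0];                                // F(k)
        BigInteger b = half[1];                                // F(k+1)
        BigInteger c = a.multiply(b.shiftLeft(1).subtract(a)); // F(2k)
        BigInteger d = a.multiply(a).add(b.multiply(b));       // F(2k+1)
        return ((n & 1) == 0)
                ? new BigInteger[] { c, d }
                : new BigInteger[] { d, c.add(d) };
    }

    public static void main(String[] args) {
        // Bit length of F(1000); large n values grow into the Toom-Cook range.
        System.out.println(fib(1_000)[0].bitLength());
    }
}
```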

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6409/files
  - new: https://git.openjdk.java.net/jdk/pull/6409/files/66cec9bd..81a8b599

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6409&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6409&range=01-02

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6409.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6409/head:pull/6409

PR: https://git.openjdk.java.net/jdk/pull/6409


Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

2021-11-16 Thread kabutz
> BigInteger currently uses three different algorithms for multiply. The simple 
> quadratic algorithm, then the slightly better Karatsuba if we exceed a bit 
> count and then Toom Cook 3 once we go into the several thousands of bits. 
> Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize it. 
> I have demonstrated this several times in conference talks. In order to be 
> consistent with other classes such as Arrays and Collection, I have added a 
> parallelMultiply() method. Internally we have added a parameter to the 
> private multiply method to indicate whether the calculation should be done in 
> parallel.
> 
> The performance improvements are as should be expected. Fibonacci of 100 
> million (using a single-threaded Dijkstra's sum of squares version) completes 
> in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the 
> sequential multiply() method. This is on my 1-8-2 laptop. The final 
> multiplications are with very large numbers, which then benefit from the 
> parallelization of Toom-Cook 3.  Fibonacci 100 million is a 347084 bit number.
> 
> We have also parallelized the private square() method. Internally, the 
> square() method defaults to be sequential.
> 
> 
> Benchmark                                        (n)  Mode  Cnt      Score      Error  Units
> BigIntegerParallelMultiply.multiply          1000000    ss    4     68,043 ±   25,317  ms/op
> BigIntegerParallelMultiply.multiply         10000000    ss    4   1073,095 ±  125,296  ms/op
> BigIntegerParallelMultiply.multiply        100000000    ss    4  25317,535 ± 5806,205  ms/op
> BigIntegerParallelMultiply.parallelMultiply  1000000    ss    4     56,552 ±   22,368  ms/op
> BigIntegerParallelMultiply.parallelMultiply 10000000    ss    4    536,193 ±   37,393  ms/op
> BigIntegerParallelMultiply.parallelMultiply 100000000   ss    4   9274,657 ±  826,197  ms/op

kabutz has updated the pull request with a new target base due to a merge or a 
rebase. The incremental webrev excludes the unrelated changes brought in by the 
merge/rebase. The pull request contains 15 additional commits since the last 
revision:

 - 8277119: Add asserts in GenericTaskQueueSet methods
   
   Reviewed-by: tschatzl
 - 8274757: Cleanup unnecessary calls to Throwable.initCause() in 
java.management module
   
   Reviewed-by: dfuchs, sspitsyn
 - 8274662: Replace 'while' cycles with iterator with enhanced-for in 
jdk.hotspot.agent
   
   Reviewed-by: amenkov, cjplummer, sspitsyn
 - 8274163: Use String.equals instead of String.compareTo in jdk.jcmd
   
   Reviewed-by: cjplummer, amenkov, sspitsyn
 - 8274190: Use String.equals instead of String.compareTo in 
jdk.internal.jvmstat
   
   Reviewed-by: cjplummer, sspitsyn
 - 8274168: Avoid String.compareTo == 0 to check String equality in 
java.management
   
   Reviewed-by: sspitsyn, dfuchs, cjplummer, dholmes
 - 8274261: Use enhanced-for instead of plain 'for' in jdk.jcmd
   
   Reviewed-by: sspitsyn, cjplummer
 - Update comments
 - Added parallelMultiply() method to BigInteger to allow large multiplications 
to run in parallel
 - Merge branch 'openjdk:master' into master
 - ... and 5 more: https://git.openjdk.java.net/jdk/compare/744e9e8e...0075f04d

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/6391/files
  - new: https://git.openjdk.java.net/jdk/pull/6391/files/03be5518..0075f04d

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6391&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6391&range=01-02

  Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6391.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6391/head:pull/6391

PR: https://git.openjdk.java.net/jdk/pull/6391