Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes wrote: >> kabutz has updated the pull request incrementally with one additional commit >> since the last revision: >> >> Removed JVM flags from benchmark > > To add my 2c IMO a parallel version of this type absolutely **must** be > opt-in. There are simply far too many side-effects of using the FJP and > multiple threads to perform the calculation in parallel as if it is just a > minor implementation detail. A clear API is 1000x better than a "kill switch". > > And yes you may still need to expose some kind of tuning knob. > > David > > Hi @dholmes-ora thanks - is it possible for me to create a CSR without > > having access to JIRA? Or do I need to find a sponsor who can help do this > > on my behalf? > > You will need a sponsor to do this for you. I'm a little surprised after all > these years that you've never even gotten Author role. :) Most of my contributions have been finding bugs, rather than fixing them. It's only recently that I've started adding code to the OpenJDK. So, I'm not surprised at all. - PR: https://git.openjdk.java.net/jdk/pull/6409
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
On 26/11/2021 4:15 am, kabutz wrote:
> On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes wrote:
>> kabutz has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Removed JVM flags from benchmark
>
>> To add my 2c IMO a parallel version of this type absolutely **must** be opt-in. There are simply far too many side-effects of using the FJP and multiple threads to perform the calculation in parallel as if it is just a minor implementation detail. A clear API is 1000x better than a "kill switch".
>>
>> And yes you may still need to expose some kind of tuning knob.
>>
>> David
>
> @dholmes-ora has indicated that a [compatibility and specification](https://wiki.openjdk.java.net/display/csr/Main) (CSR) request is needed for this pull request. @kabutz please create a [CSR](https://wiki.openjdk.java.net/display/csr/Main) request for issue [JDK-8277175](https://bugs.openjdk.java.net/browse/JDK-8277175). This pull request cannot be integrated until the CSR request is approved.
>
> Hi @dholmes-ora thanks - is it possible for me to create a CSR without having access to JIRA? Or do I need to find a sponsor who can help do this on my behalf?
>
> (Sorry for this newbie question)

You will need a sponsor to do this for you. I'm a little surprised after all these years that you've never even gotten Author role. :)

Cheers,
David

- PR: https://git.openjdk.java.net/jdk/pull/6409
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes wrote: >> kabutz has updated the pull request incrementally with one additional commit >> since the last revision: >> >> Removed JVM flags from benchmark > > To add my 2c IMO a parallel version of this type absolutely **must** be > opt-in. There are simply far too many side-effects of using the FJP and > multiple threads to perform the calculation in parallel as if it is just a > minor implementation detail. A clear API is 1000x better than a "kill switch". > > And yes you may still need to expose some kind of tuning knob. > > David > @dholmes-ora has indicated that a [compatibility and > specification](https://wiki.openjdk.java.net/display/csr/Main) (CSR) request > is needed for this pull request. @kabutz please create a > [CSR](https://wiki.openjdk.java.net/display/csr/Main) request for issue > [JDK-8277175](https://bugs.openjdk.java.net/browse/JDK-8277175). This pull > request cannot be integrated until the CSR request is approved. Hi @dholmes-ora thanks - is it possible for me to create a CSR without having access to JIRA? Or do I need to find a sponsor who can help do this on my behalf? (Sorry for this newbie question) - PR: https://git.openjdk.java.net/jdk/pull/6409
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
On Fri, 19 Nov 2021 10:42:23 GMT, kabutz wrote:

>> I think the functionality in this PR is worth pursuing, but with the JDK 18 rampdown 1 date fast approaching, as a non-urgent issue, I think we shouldn't try to rush it into JDK 18.
>>
>> I looked more closely and now understand a bit more about the threshold. It would be useful to have some implementation notes detailing approximately when the parallel execution kicks in. For `a*a` it's `>= TOOM_COOK_SQUARE_THRESHOLD (216)` and for `a*b` it's `>= TOOM_COOK_THRESHOLD (240)`, so we could refer to certain bit lengths.
>>
>> The branching factor is 4 (well 3, I guess, but it's easier to think in powers of 2). It might be reasonable to assume the problem gets split equally in 4 parts. I don't know in practice what the depth of recursion might be; it's hard to see this getting completely out of control, but we could always keep track of the depth and cut off in proportion to the number of runtime processors if need be.
>>
>> Given the existing degree of specialization, the minimal changes to code, and the fact that the creation of recursive tasks is in the noise, this PR looks quite reasonable.
>
> I have run some more tests. For my fibonacci algorithm, here are the worst cases for the various calculations.
>
> ```
>             n      most_bits      tasks  time_ms
>         1_000            694          0        1
>        10_000          6_942          0        1
>       100_000         69_424         18       13
>     1_000_000        694_241        468      143
>    10_000_000      6_942_418     11_718     1049
>   100_000_000     69_424_191    292_968    13695
> 1_000_000_000    694_241_913  7_324_218   237282
> ```
>
> Each data point has 10x the number of bits in the final result and the number of tasks in the final calculation is 25x more. Perhaps I can make the threshold the recursive depth up to which we would run in parallel. And that recursive depth could be the availableProcessors() or some multiple thereof. Defense in depth :-)

Perhaps a depth that is between some multiple of log2 and log4 of `availableProcessors()`, given the branching factor.
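[Editor's sketch] The log4-of-parallelism depth cut-off suggested above can be sketched as follows. The class and method names are hypothetical and not taken from the patch; the only assumption is a branching factor of 4, so a fork depth d yields up to 4^d subtasks.

```java
// Hypothetical illustration of the depth cut-off idea: stop forking once the
// recursion is deep enough that roughly `parallelism` subtasks exist. With a
// branching factor of ~4, depth ceil(log4(parallelism)) is already enough.
public class ParallelDepth {
    static int maxForkDepth(int parallelism) {
        // ceil(log2(parallelism)) via numberOfLeadingZeros
        int log2 = 32 - Integer.numberOfLeadingZeros(Math.max(1, parallelism) - 1);
        // ceil(log4(p)) == ceil(ceil(log2(p)) / 2), but never less than 1
        return Math.max(1, (log2 + 1) / 2);
    }

    public static void main(String[] args) {
        int p = Runtime.getRuntime().availableProcessors();
        System.out.println("parallelism " + p + " -> fork up to depth " + maxForkDepth(p));
    }
}
```

With 16 workers this forks only two levels deep (4^2 = 16 subtasks), which bounds the number of tasks regardless of operand size.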
- PR: https://git.openjdk.java.net/jdk/pull/6409
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
On Wed, 17 Nov 2021 19:48:25 GMT, kabutz wrote:

>> BigInteger currently uses three different algorithms for multiply. The simple quadratic algorithm, then the slightly better Karatsuba if we exceed a bit count, and then Toom Cook 3 once we go into the several thousands of bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize it. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>>
>> The performance improvements are as should be expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum of squares version) completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>>
>> We have also parallelized the private square() method. Internally, the square() method defaults to be sequential.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>> Benchmark                                           (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply             1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply            10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply           100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply     1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   100000000    ss    4   9076.551 ± 1899.444  ms/op
>>
>> We can see that for larger calculations (fib 100m), the execution is 2.7x faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for small (fib 1m) it is roughly the same. Considering that the fibonacci algorithm that we used was in itself sequential, and that the last 3 calculations would dominate, 2.7x faster should probably be considered quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit since the last revision:
>
>   Removed JVM flags from benchmark

> I think the functionality in this PR is worth pursuing, but with the JDK 18 rampdown 1 date fast approaching, as a non-urgent issue, I think we shouldn't try to rush it into JDK 18.
>
> I looked more closely and now understand a bit more about the threshold. It would be useful to have some implementation notes detailing approximately when the parallel execution kicks in. For `a*a` it's `>= TOOM_COOK_SQUARE_THRESHOLD (216)` and for `a*b` it's `>= TOOM_COOK_THRESHOLD (240)`, so we could refer to certain bit lengths.
>
> The branching factor is 4 (well 3, I guess, but it's easier to think in powers of 2). It might be reasonable to assume the problem gets split equally in 4 parts. I don't know in practice what the depth of recursion might be; it's hard to see this getting completely out of control, but we could always keep track of the depth and cut off in proportion to the number of runtime processors if need be.
>
> Given the existing degree of specialization, the minimal changes to code, and the fact that the creation of recursive tasks is in the noise, this PR looks quite reasonable.

I have run some more tests. For my fibonacci algorithm, here are the worst cases for the various calculations.

```
            n      most_bits      tasks  time_ms
        1_000            694          0        1
       10_000          6_942          0        1
      100_000         69_424         18       13
    1_000_000        694_241        468      143
   10_000_000      6_942_418     11_718     1049
  100_000_000     69_424_191    292_968    13695
1_000_000_000    694_241_913  7_324_218   237282
```

Each data point has 10x the number of bits in the final result and the number of tasks in the final calculation is 25x more. Perhaps I can make the threshold the recursive depth up to which we would run in parallel. And that recursive depth could be the availableProcessors() or some multiple thereof.

> I think we can simplify the creation of the recursive tasks (assuming we don't track the depth):
>
> ```java
> private static final class RecursiveOp {
>     private static RecursiveTask<BigInteger> exec(RecursiveTask<BigInteger> op, boolean parallel) {
>         if (parallel) {
>             op.fork();
>         } else {
>             op.invoke();
>         }
>         return op;
>     }
>
>     private static RecursiveTask
> ```
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
On Wed, 17 Nov 2021 19:48:25 GMT, kabutz wrote:

>> BigInteger currently uses three different algorithms for multiply. The simple quadratic algorithm, then the slightly better Karatsuba if we exceed a bit count, and then Toom Cook 3 once we go into the several thousands of bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize it. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>>
>> The performance improvements are as should be expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum of squares version) completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>>
>> We have also parallelized the private square() method. Internally, the square() method defaults to be sequential.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>> Benchmark                                           (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply             1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply            10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply           100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply     1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   100000000    ss    4   9076.551 ± 1899.444  ms/op
>>
>> We can see that for larger calculations (fib 100m), the execution is 2.7x faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for small (fib 1m) it is roughly the same. Considering that the fibonacci algorithm that we used was in itself sequential, and that the last 3 calculations would dominate, 2.7x faster should probably be considered quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit since the last revision:
>
>   Removed JVM flags from benchmark

I looked more closely and now understand a bit more about the threshold. It would be useful to have some implementation notes detailing approximately when the parallel execution kicks in. For `a*a` it's `>= TOOM_COOK_SQUARE_THRESHOLD (216)` and for `a*b` it's `>= TOOM_COOK_THRESHOLD (240)`, so we could refer to certain bit lengths.

The branching factor is 4 (well 3, I guess, but it's easier to think in powers of 2). It might be reasonable to assume the problem gets split equally in 4 parts. I don't know in practice what the depth of recursion might be; it's hard to see this getting completely out of control, but we could always keep track of the depth and cut off in proportion to the number of runtime processors if need be.

Given the existing degree of specialization, the minimal changes to code, and the fact that the creation of recursive tasks is in the noise, this PR looks quite reasonable.

I think we can simplify the creation of the recursive tasks (assuming we don't track the depth):

```java
private static final class RecursiveOp {
    private static RecursiveTask<BigInteger> exec(RecursiveTask<BigInteger> op, boolean parallel) {
        if (parallel) {
            op.fork();
        } else {
            op.invoke();
        }
        return op;
    }

    private static RecursiveTask<BigInteger> multiply(BigInteger a, BigInteger b, boolean parallel) {
        var op = new RecursiveTask<BigInteger>() {
            @Override
            protected BigInteger compute() {
                return a.multiply(b, true, parallel);
            }
        };
        return exec(op, parallel);
    }

    private static RecursiveTask<BigInteger> square(BigInteger a, boolean parallel) {
        var op = new RecursiveTask<BigInteger>() {
            @Override
            protected BigInteger compute() {
                return a.square(true, parallel);
            }
        };
        return exec(op, parallel);
    }
}
```

- PR: https://git.openjdk.java.net/jdk/pull/6409
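[Editor's sketch] The fork-or-invoke helper above calls private BigInteger overloads (`multiply(b, true, parallel)`) that only exist inside the patch. Here is a self-contained adaptation of the same pattern against the public API; `RecursiveOpDemo` and its methods are illustrative names, not part of the patch.

```java
import java.math.BigInteger;
import java.util.concurrent.RecursiveTask;

// Demonstrates the fork-or-invoke pattern: fork() schedules the task on the
// common ForkJoinPool, invoke() computes it in the calling thread. In both
// cases join() later returns the result.
public class RecursiveOpDemo {
    static RecursiveTask<BigInteger> exec(RecursiveTask<BigInteger> op, boolean parallel) {
        if (parallel) {
            op.fork();    // run asynchronously on the common pool
        } else {
            op.invoke();  // run now, in the caller
        }
        return op;
    }

    static RecursiveTask<BigInteger> multiply(BigInteger a, BigInteger b, boolean parallel) {
        var op = new RecursiveTask<BigInteger>() {
            @Override
            protected BigInteger compute() {
                return a.multiply(b); // public API leaf; the patch recurses here
            }
        };
        return exec(op, parallel);
    }

    public static void main(String[] args) {
        BigInteger a = BigInteger.TEN.pow(1000);
        BigInteger b = BigInteger.TWO.pow(1000);
        BigInteger parallel = multiply(a, b, true).join();
        BigInteger sequential = multiply(a, b, false).join();
        System.out.println(parallel.equals(sequential));
    }
}
```

Note that `join()` is safe on both paths: it blocks until a forked task completes, and returns immediately for a task that was already `invoke()`d.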
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
On Nov 18, 2021, at 2:41 AM, kabutz <d...@openjdk.java.net> wrote:

> Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be necessary. BigInteger has thresholds for using different multiply algorithms and these are also not configurable.

Those thresholds are the result of *many* JMH runs on multiple platforms. That said, there is an issue on file to reevaluate them someday [1].

[1] https://bugs.openjdk.java.net/browse/JDK-8029425
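[Editor's sketch] For context on the thresholds discussed in this thread: BigInteger measures them in 32-bit words (ints), not bits, so `TOOM_COOK_THRESHOLD = 240` corresponds to operands of roughly 240 × 32 = 7680 bits. The helper below is illustrative only; it approximates the dispatch in `BigInteger.multiply` using the threshold constants quoted in this thread, and `algorithmFor` is a hypothetical name.

```java
import java.math.BigInteger;

// Rough sketch of BigInteger's algorithm selection. The thresholds mirror
// the OpenJDK constants mentioned in this thread; they count ints (32-bit
// words) of magnitude, not bits.
public class ThresholdCheck {
    static final int KARATSUBA_THRESHOLD = 80;   // ints
    static final int TOOM_COOK_THRESHOLD = 240;  // ints

    static String algorithmFor(BigInteger a, BigInteger b) {
        int aLen = (a.bitLength() + 31) / 32;    // magnitude length in ints
        int bLen = (b.bitLength() + 31) / 32;
        if (aLen < KARATSUBA_THRESHOLD || bLen < KARATSUBA_THRESHOLD) {
            return "quadratic";
        }
        if (aLen < TOOM_COOK_THRESHOLD && bLen < TOOM_COOK_THRESHOLD) {
            return "karatsuba";
        }
        return "toom-cook-3";   // the only path the PR parallelizes
    }
}
```

By this estimate, parallelism can only ever kick in for products of numbers around 7680 bits and larger, i.e. well beyond typical crypto key sizes.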
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
Yes, but that does not help with the decision of whether parallel should be used or not. But yes, if it is generally not wanted to make the pool explicit, a simple parallel signature without an argument would also work to make the decision explicit (i.e. a new API). That won't automatically tune the performance, but it does allow users to use it - the majority of use would be crypto anyway, where it can be used by the JCE and JSSE (maybe?).

Regards,
Bernd
--
http://bernd.eckenfels.net

From: Remi Forax
Sent: Thursday, November 18, 2021 12:16:31 PM
To: Bernd Eckenfels
Cc: core-libs-dev
Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

----- Original Message -----
> From: "Bernd Eckenfels"
> To: "core-libs-dev"
> Sent: Thursday, 18 November 2021 12:07:19
> Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
>
> What about a new API multiply method which takes a ForkJoinPool, and only if that is used/specified it will use the parallel mode (and only if the necessary threshold applies?). Most of the pool tuning can then be done with this argument. It also avoids surprise threads.

You don't need it, here is the usual trick if you want to specify a specific fork join pool:
https://stackoverflow.com/questions/21163108/custom-thread-pool-in-java-8-parallel-stream

regards,
Rémi

> Gruss
> Bernd
> --
> http://bernd.eckenfels.net
>
> From: core-libs-dev on behalf of kabutz
> Sent: Thursday, November 18, 2021 11:41:40 AM
> To: core-libs-dev@openjdk.java.net
> Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
>
> On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes wrote:
>
>> To add my 2c IMO a parallel version of this type absolutely **must** be opt-in. There are simply far too many side-effects of using the FJP and multiple threads to perform the calculation in parallel as if it is just a minor implementation detail. A clear API is 1000x better than a "kill switch".
>>
>> And yes you may still need to expose some kind of tuning knob.
>>
>> David
>
> Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be necessary. BigInteger has thresholds for using different multiply algorithms and these are also not configurable.
>
> PR: https://git.openjdk.java.net/jdk/pull/6409
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
----- Original Message -----
> From: "Bernd Eckenfels"
> To: "core-libs-dev"
> Sent: Thursday, 18 November 2021 12:07:19
> Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
>
> What about a new API multiply method which takes a ForkJoinPool, and only if that is used/specified it will use the parallel mode (and only if the necessary threshold applies?). Most of the pool tuning can then be done with this argument. It also avoids surprise threads.

You don't need it, here is the usual trick if you want to specify a specific fork join pool:
https://stackoverflow.com/questions/21163108/custom-thread-pool-in-java-8-parallel-stream

regards,
Rémi

> Gruss
> Bernd
> --
> http://bernd.eckenfels.net
>
> From: core-libs-dev on behalf of kabutz
> Sent: Thursday, November 18, 2021 11:41:40 AM
> To: core-libs-dev@openjdk.java.net
> Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
>
> On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes wrote:
>
>> To add my 2c IMO a parallel version of this type absolutely **must** be opt-in. There are simply far too many side-effects of using the FJP and multiple threads to perform the calculation in parallel as if it is just a minor implementation detail. A clear API is 1000x better than a "kill switch".
>>
>> And yes you may still need to expose some kind of tuning knob.
>>
>> David
>
> Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be necessary. BigInteger has thresholds for using different multiply algorithms and these are also not configurable.
>
> PR: https://git.openjdk.java.net/jdk/pull/6409
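[Editor's sketch] The "usual trick" Rémi links to can be shown directly with BigInteger: work started from inside a task submitted to a custom ForkJoinPool forks its subtasks into that pool rather than the common pool. Since `parallelMultiply()` is only proposed in this PR and not yet in the JDK, the sketch uses the existing `multiply()`; the class and method names are illustrative.

```java
import java.math.BigInteger;
import java.util.concurrent.ForkJoinPool;

// Runs a multiplication inside a caller-supplied ForkJoinPool. Any fork()
// performed by code inside the submitted callable lands in that pool,
// which is how parallel streams (and a future parallelMultiply) could be
// confined to a dedicated pool instead of the common pool.
public class CustomPoolDemo {
    static BigInteger multiplyIn(ForkJoinPool pool, BigInteger a, BigInteger b) throws Exception {
        return pool.submit(() -> a.multiply(b)).get();
    }

    public static void main(String[] args) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(4); // dedicated 4-thread pool
        try {
            BigInteger a = BigInteger.TEN.pow(100_000);
            BigInteger b = BigInteger.TWO.pow(400_000);
            BigInteger result = multiplyIn(pool, a, b);
            System.out.println(result.equals(a.multiply(b)));
        } finally {
            pool.shutdown();
        }
    }
}
```

This avoids the "surprise threads" concern: the caller owns the pool, sizes it, and shuts it down.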
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
What about a new API multiply method which takes a ForkJoinPool, and only if that is used/specified it will use the parallel mode (and only if the necessary threshold applies?). Most of the pool tuning can then be done with this argument. It also avoids surprise threads.

Regards,
Bernd
--
http://bernd.eckenfels.net

From: core-libs-dev on behalf of kabutz
Sent: Thursday, November 18, 2021 11:41:40 AM
To: core-libs-dev@openjdk.java.net
Subject: Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes wrote:

> To add my 2c IMO a parallel version of this type absolutely **must** be opt-in. There are simply far too many side-effects of using the FJP and multiple threads to perform the calculation in parallel as if it is just a minor implementation detail. A clear API is 1000x better than a "kill switch".
>
> And yes you may still need to expose some kind of tuning knob.
>
> David

Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be necessary. BigInteger has thresholds for using different multiply algorithms and these are also not configurable.

- PR: https://git.openjdk.java.net/jdk/pull/6409
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
On Thu, 18 Nov 2021 07:26:45 GMT, David Holmes wrote: > To add my 2c IMO a parallel version of this type absolutely **must** be > opt-in. There are simply far too many side-effects of using the FJP and > multiple threads to perform the calculation in parallel as if it is just a > minor implementation detail. A clear API is 1000x better than a "kill switch". > > And yes you may still need to expose some kind of tuning knob. > > David Yes, it **must** be opt-in. However I'm not sure that a tuning knob will be necessary. BigInteger has thresholds for using different multiply algorithms and these are also not configurable. - PR: https://git.openjdk.java.net/jdk/pull/6409
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
On Wed, 17 Nov 2021 19:48:25 GMT, kabutz wrote:

>> BigInteger currently uses three different algorithms for multiply. The simple quadratic algorithm, then the slightly better Karatsuba if we exceed a bit count, and then Toom Cook 3 once we go into the several thousands of bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize it. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>>
>> The performance improvements are as should be expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum of squares version) completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>>
>> We have also parallelized the private square() method. Internally, the square() method defaults to be sequential.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>> Benchmark                                           (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply             1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply            10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply           100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply     1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   100000000    ss    4   9076.551 ± 1899.444  ms/op
>>
>> We can see that for larger calculations (fib 100m), the execution is 2.7x faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for small (fib 1m) it is roughly the same. Considering that the fibonacci algorithm that we used was in itself sequential, and that the last 3 calculations would dominate, 2.7x faster should probably be considered quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit since the last revision:
>
>   Removed JVM flags from benchmark

To add my 2c IMO a parallel version of this type absolutely **must** be opt-in. There are simply far too many side-effects of using the FJP and multiple threads to perform the calculation in parallel as if it is just a minor implementation detail. A clear API is 1000x better than a "kill switch".

And yes you may still need to expose some kind of tuning knob.

David

- PR: https://git.openjdk.java.net/jdk/pull/6409
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
On Wed, 17 Nov 2021 19:48:25 GMT, kabutz wrote:

>> BigInteger currently uses three different algorithms for multiply. The simple quadratic algorithm, then the slightly better Karatsuba if we exceed a bit count, and then Toom Cook 3 once we go into the several thousands of bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize it. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>>
>> The performance improvements are as should be expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum of squares version) completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>>
>> We have also parallelized the private square() method. Internally, the square() method defaults to be sequential.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>> Benchmark                                           (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply             1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply            10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply           100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply     1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   100000000    ss    4   9076.551 ± 1899.444  ms/op
>>
>> We can see that for larger calculations (fib 100m), the execution is 2.7x faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for small (fib 1m) it is roughly the same. Considering that the fibonacci algorithm that we used was in itself sequential, and that the last 3 calculations would dominate, 2.7x faster should probably be considered quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit since the last revision:
>
>   Removed JVM flags from benchmark

One concern I had was what would happen when the common pool parallelism was set to 0. In the parallelSort mechanism, they only sort in parallel if the parallelism is at least 2. Thus a parallel sort on a dual-core machine (without hyperthreading) will run sequentially. However, the parallel BigInteger multiply seems to work OK with different common pool sizes. We always have one thread more than the common pool parallelism working, since the thread that calls parallelMultiply() also does calculations. Showing only the cases for Fibonacci(100m):

```
Method            (n)          common_parallelism    ms/op
multiply          100_000_000  0                     24619.004
parallelMultiply  100_000_000  0                     24977.280
multiply          100_000_000  1                     24526.498
parallelMultiply  100_000_000  1                     18169.969
multiply          100_000_000  2                     24528.111
parallelMultiply  100_000_000  2                     11073.637
multiply          100_000_000  3                     24861.324
parallelMultiply  100_000_000  3                      9472.986
multiply          100_000_000  4                     25036.295
parallelMultiply  100_000_000  4                      9151.940
multiply          100_000_000  5                     25129.139
parallelMultiply  100_000_000  5                      8964.403
multiply          100_000_000  11                    24717.125
parallelMultiply  100_000_000  11                     9024.319
```

- PR: https://git.openjdk.java.net/jdk/pull/6409
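[Editor's sketch] The `common_parallelism` column in the table above corresponds to the common-pool system property, which must be set before the common pool is first touched (in practice, on the command line). A minimal way to inspect the effective value; the class name is illustrative:

```java
import java.util.concurrent.ForkJoinPool;

// Prints the common pool's targeted parallelism. Launch with, e.g.:
//   java -Djava.util.concurrent.ForkJoinPool.common.parallelism=2 CommonPoolInfo
// Note that getParallelism() reports at least 1 even when the property is
// set to 0; with 0, submitted tasks run in caller threads, which matches
// parallelMultiply degrading to roughly sequential speed in the table.
public class CommonPoolInfo {
    public static void main(String[] args) {
        System.out.println("common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
    }
}
```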
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
> BigInteger currently uses three different algorithms for multiply. The simple quadratic algorithm, then the slightly better Karatsuba if we exceed a bit count, and then Toom Cook 3 once we go into the several thousands of bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize it. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>
> The performance improvements are as should be expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum of squares version) completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>
> We have also parallelized the private square() method. Internally, the square() method defaults to be sequential.
>
> Some benchmark results, run on my 1-6-2 server:
>
> Benchmark                                           (n)  Mode  Cnt      Score      Error  Units
> BigIntegerParallelMultiply.multiply             1000000    ss    4     51.707 ±   11.194  ms/op
> BigIntegerParallelMultiply.multiply            10000000    ss    4    988.302 ±  235.977  ms/op
> BigIntegerParallelMultiply.multiply           100000000    ss    4  24662.063 ± 1123.329  ms/op
> BigIntegerParallelMultiply.parallelMultiply     1000000    ss    4     49.337 ±   26.611  ms/op
> BigIntegerParallelMultiply.parallelMultiply    10000000    ss    4    527.560 ±  268.903  ms/op
> BigIntegerParallelMultiply.parallelMultiply   100000000    ss    4   9076.551 ± 1899.444  ms/op
>
> We can see that for larger calculations (fib 100m), the execution is 2.7x faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for small (fib 1m) it is roughly the same. Considering that the fibonacci algorithm that we used was in itself sequential, and that the last 3 calculations would dominate, 2.7x faster should probably be considered quite good on a 1-6-2 machine.

kabutz has updated the pull request incrementally with one additional commit since the last revision:

  Removed JVM flags from benchmark

-

Changes:
 - all: https://git.openjdk.java.net/jdk/pull/6409/files
 - new: https://git.openjdk.java.net/jdk/pull/6409/files/66cec9bd..81a8b599

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6409&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6409&range=01-02

Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/6409.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/6409/head:pull/6409

PR: https://git.openjdk.java.net/jdk/pull/6409
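[Editor's sketch] The description above benchmarks a single-threaded "Dijkstra sum of squares" Fibonacci. A fast-doubling variant of that idea uses the identities F(2k) = F(k)·(2·F(k+1) − F(k)) and F(2k+1) = F(k)² + F(k+1)², so fib(100_000_000) costs only a few dozen huge multiplications and squarings, exactly the operations this PR parallelizes. This is an illustration, not the benchmark code from the PR:

```java
import java.math.BigInteger;

// Fast-doubling Fibonacci: walks the bits of n from most significant to
// least, maintaining the window (a, b) = (F(k), F(k+1)).
public class FastFibonacci {
    static BigInteger fib(long n) {
        BigInteger a = BigInteger.ZERO, b = BigInteger.ONE; // F(0), F(1)
        for (int bit = 63 - Long.numberOfLeadingZeros(Math.max(1, n)); bit >= 0; bit--) {
            BigInteger d = a.multiply(b.shiftLeft(1).subtract(a)); // F(2k)
            BigInteger e = a.multiply(a).add(b.multiply(b));       // F(2k+1)
            a = d;
            b = e;
            if ((n >>> bit & 1) == 1) {    // odd bit: advance the window by one
                BigInteger t = a.add(b);
                a = b;
                b = t;
            }
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(fib(100).bitLength() + " bits"); // F(100) is a 70-bit number
    }
}
```

Swapping the two `multiply` calls for `parallelMultiply` (once available) is all it would take to reproduce the kind of speedup reported above, since the running time is dominated by the last few doublings.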
Re: RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]
> BigInteger currently uses three different algorithms for multiply. The simple quadratic algorithm, then the slightly better Karatsuba if we exceed a bit count, and then Toom Cook 3 once we go into the several thousands of bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize it. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>
> The performance improvements are as should be expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum of squares version) completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>
> We have also parallelized the private square() method. Internally, the square() method defaults to be sequential.
>
> Benchmark                                           (n)  Mode  Cnt      Score      Error  Units
> BigIntegerParallelMultiply.multiply             1000000    ss    4     68,043 ±   25,317  ms/op
> BigIntegerParallelMultiply.multiply            10000000    ss    4   1073,095 ±  125,296  ms/op
> BigIntegerParallelMultiply.multiply           100000000    ss    4  25317,535 ± 5806,205  ms/op
> BigIntegerParallelMultiply.parallelMultiply     1000000    ss    4     56,552 ±   22,368  ms/op
> BigIntegerParallelMultiply.parallelMultiply    10000000    ss    4    536,193 ±   37,393  ms/op
> BigIntegerParallelMultiply.parallelMultiply   100000000    ss    4   9274,657 ±  826,197  ms/op

kabutz has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 15 additional commits since the last revision:

 - 8277119: Add asserts in GenericTaskQueueSet methods

   Reviewed-by: tschatzl
 - 8274757: Cleanup unnecessary calls to Throwable.initCause() in java.management module

   Reviewed-by: dfuchs, sspitsyn
 - 8274662: Replace 'while' cycles with iterator with enhanced-for in jdk.hotspot.agent

   Reviewed-by: amenkov, cjplummer, sspitsyn
 - 8274163: Use String.equals instead of String.compareTo in jdk.jcmd

   Reviewed-by: cjplummer, amenkov, sspitsyn
 - 8274190: Use String.equals instead of String.compareTo in jdk.internal.jvmstat

   Reviewed-by: cjplummer, sspitsyn
 - 8274168: Avoid String.compareTo == 0 to check String equality in java.management

   Reviewed-by: sspitsyn, dfuchs, cjplummer, dholmes
 - 8274261: Use enhanced-for instead of plain 'for' in jdk.jcmd

   Reviewed-by: sspitsyn, cjplummer
 - Update comments
 - Added parallelMultiply() method to BigInteger to allow large multiplications to run in parallel
 - Merge branch 'openjdk:master' into master
 - ... and 5 more: https://git.openjdk.java.net/jdk/compare/744e9e8e...0075f04d

-

Changes:
 - all: https://git.openjdk.java.net/jdk/pull/6391/files
 - new: https://git.openjdk.java.net/jdk/pull/6391/files/03be5518..0075f04d

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=6391&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=6391&range=01-02

Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
Patch: https://git.openjdk.java.net/jdk/pull/6391.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/6391/head:pull/6391

PR: https://git.openjdk.java.net/jdk/pull/6391