[jira] [Commented] (RNG-88) Update the GenerationPerformance benchmark
[ https://issues.apache.org/jira/browse/RNG-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809761#comment-16809761 ] Alex D Herbert commented on RNG-88:
---
{quote}Just a BTW: the JMH harness goes to great lengths to remove artifacts from the actual measurements. You can add a baseline method here, however I would avoid trying to massage the returned results in any way. For relative comparison it is not needed anyway (and overhead should also be observable by varying the request length if possible).{quote}

Hi [~b.eckenfels],

I have studied the JMH tutorials and tried to implement best practice, specifically the [DeadCode|https://hg.openjdk.java.net/code-tools/jmh/file/f25ae8584db1/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_08_DeadCode.java] elimination example, which demonstrates that there is a measurable time for doing nothing.

The issue here is that we want to measure the time taken to generate a random number. So ideally we want to measure:

{noformat}
[time to generate nothing] + [time to consume the value]
[time to generate random number] + [time to consume the value]
{noformat}

Subtracting one from the other gives the [time to generate random number].

The main issue with not considering a baseline is that the random generation time is very fast. It is so fast that the time taken by JMH to consume a generated value is much longer than the time to generate it. Leaving the results unmassaged is fine for ranking, but it is not representative for relative speed comparisons.

The baseline subtraction is not part of the new code anyway. The new code in the PR just eliminates the loop from within the test method, which is another JMH best practice (see the [Loops example|https://hg.openjdk.java.net/code-tools/jmh/file/f25ae8584db1/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_11_Loops.java]).

I hope I have explained what I am trying to achieve.
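To make the effect of the subtraction concrete, here is a small sketch using the {{nextDouble()}} scores posted elsewhere in this thread (rounded to three decimal places). The ratio computation is mine and purely illustrative; it is not part of the PR:

```java
public class BaselineSubtraction {
    public static void main(String[] args) {
        // Scores in ns/op, rounded from the nextDouble results in this thread.
        double baseline = 2.576;  // BASELINE: generate nothing + consume the value
        double splitMix = 4.529;  // SPLIT_MIX_64 nextDouble()
        double jdk = 19.077;      // JDK nextDouble()

        // The ranking is unchanged, but the relative speed differs markedly
        // once the fixed consumption overhead is removed.
        System.out.printf("raw ratio: %.4f%n", splitMix / jdk);                         // ~0.2374
        System.out.printf("net ratio: %.4f%n", (splitMix - baseline) / (jdk - baseline)); // ~0.1184
    }
}
```

The net ratio (~0.12) is the kind of figure reported in the relative median score table elsewhere in this thread, while the raw ratio (~0.24) is what ranking without subtraction would suggest.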
Your comments on the results I have shown above and the code in the PR are welcome.

> Update the GenerationPerformance benchmark
> ------------------------------------------
>
>                 Key: RNG-88
>                 URL: https://issues.apache.org/jira/browse/RNG-88
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: examples
>    Affects Versions: 1.3
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>         Attachments: baseline.jpg, next.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The current GenerationPerformance benchmark runs all the generators to
> collect timing data. However the act of running the test within JMH takes
> some time (test overhead). This overhead should be measured and subtracted
> from the timing data to create a time attributed only to the method exercised
> in the RNG.
> This can be done by creating a dummy RNG that returns a fixed value. The
> implementation must be done in a manner where the JIT compiler is not able to
> remove the dummy method from the execution path. This is achieved by
> returning state variables from the dummy RNG (not final variables).
> Testing can be done using a variable number of iterations and the run-times
> assessed for linearity. If the run time scales with the number of iterations
> then the JIT compiler has not removed it from execution. The dummy RNG then
> serves as a baseline for comparison of true implementations.
> This idea, with examples, is shown in
> [RNG-87|https://issues.apache.org/jira/browse/RNG-87] which tested a variant
> of the MWC_256 algorithm.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RNG-88) Update the GenerationPerformance benchmark
[ https://issues.apache.org/jira/browse/RNG-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809742#comment-16809742 ] Alex D Herbert commented on RNG-88:
---
PS. Note that the example benchmark class has two {{@Benchmark}} methods to test the JMH timing overhead. These should sum to approximately the same metric as the BASELINE implementation. In the above case for {{nextDouble()}} here are the results:

||randomSourceName||Method||Score||Error||Median||Error/Score||
|BASELINE|nextDouble|2.57593197585943|0.003460918835402205|2.57636702779831|0.00134355987185862|
|ISAAC|nextDouble|11.492061181491472|0.03105190366683462|11.4930378375858|0.00270203083471616|
|JDK|nextDouble|19.07710097015598|0.13765564275214587|19.0508121852192|0.00721575269573155|
|KISS|nextDouble|9.205196709268291|0.006192317077079371|9.20614509815848|0.000672697963189055|
|MT|nextDouble|10.653521635100258|0.010420185228341606|10.6524402774538|0.00097809772066451|
|MT_64|nextDouble|6.8707343968410886|0.04118236194818618|6.8627989246393|0.00599388064936993|
|MWC_256|nextDouble|6.582187644902875|0.05365540468787355|6.57988293788998|0.0081516073234|
|SPLIT_MIX_64|nextDouble|4.5286626926183935|0.008298205686561705|4.52667047056436|0.0018323744226055|
|TWO_CMRES|nextDouble|5.359469359437292|0.050123182243253586|5.34417676563963|0.00935226584605685|
|WELL_1024_A|nextDouble|13.632475941306263|2.3697998386644143|12.8807508310923|0.173834881416071|
|WELL_19937_A|nextDouble|18.45089096484812|0.0381182263045|18.4441767343849|0.00206592876344951|
|WELL_19937_C|nextDouble|19.972261377408604|0.02239912024952394|19.9724925126603|0.0011215114716484|
|WELL_44497_A|nextDouble|19.858159613693175|0.0871231306453897|19.8435844247295|0.00438727114396412|
|WELL_44497_B|nextDouble|20.725319834240956|0.14352086213185533|20.7110885030745|0.00692490457468068|
|WELL_512_A|nextDouble|12.967886115136594|0.023823029376018285|12.9610798024776|0.00183707885498865|
|XOR_SHIFT_1024_S|nextDouble|5.3340350544353825|0.005344749168553254|5.33227602956032|0.00100200863211594|
|XOR_SHIFT_1024_S_PHI|nextDouble|5.5063780535656885|0.7355642816548434|5.34975699418824|0.133584050077805|
| |baselineDouble|2.3360076674923396|0.3117121027015|2.26776285101707|0.133437962143361|
| |baselineVoid|0.2657420971315234|3.32183670742148E-4|0.265707936583524|0.00125002276390459|

So:

{noformat}
baselineDouble + baselineVoid ~ BASELINE
2.33 + 0.27 ~ 2.57
{noformat}

This shows the baseline implementation is being run.

Note that the Error output by JMH is the 0.999 percentile of the T-distribution with n degrees of freedom multiplied by the standard error of the mean. This is a conservative +/- limit on the true value of the mean. Dividing the Error by the Score gives a quick check for large variation in the mean. In this example WELL_1024_A has a single measurement of 16.5, which is above the usual 12.9 time. Hence the median is used for analysis to avoid skew from single outliers.
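As an aside, the pull of a single outlier on the mean, and why the median is the safer summary statistic here, can be shown with a short sketch. The scores below are invented for illustration, loosely modelled on the WELL_1024_A runs described above (nine runs near 12.9 ns/op plus one 16.5 outlier):

```java
import java.util.Arrays;

public class MedianVsMean {
    public static void main(String[] args) {
        // Hypothetical run scores (ns/op) with one outlier.
        double[] scores = {12.8, 12.9, 12.9, 13.0, 12.9, 12.8, 13.0, 12.9, 12.9, 16.5};

        double mean = Arrays.stream(scores).average().getAsDouble();

        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        // Median of an even-length sample: average of the two middle values.
        double median = (sorted[4] + sorted[5]) / 2;

        // The single 16.5 outlier drags the mean up; the median is unaffected.
        System.out.printf("mean=%.2f median=%.2f%n", mean, median); // mean=13.26 median=12.90
    }
}
```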
[jira] [Commented] (RNG-88) Update the GenerationPerformance benchmark
[ https://issues.apache.org/jira/browse/RNG-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809735#comment-16809735 ] Alex D Herbert commented on RNG-88:
---
I have created new JMH benchmarks for all the methods defined in {{UniformRandomProvider}}. A baseline implementation has been created for each method and scales linearly with workload:

!baseline.jpg!

(Note: The {{nextBytes()}} method is plotted on the second Y-axis as the measurement time is very different.)

Each method from {{UniformRandomProvider}} for each random source is then benchmarked against a baseline implementation. This separates the benchmarks into classes that share a baseline implementation, e.g. {{NextFloatGenerationPerformance}} for {{nextFloat()}}, but {{NextIntGenerationPerformance}} for both {{nextInt()}} and {{nextInt(int)}}. An example benchmark looks like this:

{code:java}
/**
 * Executes benchmark to compare the speed of generation of random numbers from the
 * various source providers for {@link UniformRandomProvider#nextDouble()}.
 */
public class NextDoubleGenerationPerformance extends AbstractBenchmark {
    /**
     * The benchmark state (retrieve the various "RandomSource"s).
     */
    @State(Scope.Benchmark)
    public static class Sources extends BaselineSources {
        @Override
        protected UniformRandomProvider createBaseline() {
            return BaselineUtils.getNextDouble();
        }
    }

    /** The value. */
    private double value;

    /**
     * Baseline for a JMH method call with no return value.
     */
    @Benchmark
    public void baselineVoid() {
        // Do nothing, this is a baseline
    }

    /**
     * Baseline for a JMH method call returning a {@code double}.
     *
     * @return the value
     */
    @Benchmark
    public double baselineDouble() {
        return value;
    }

    /**
     * Exercise the {@link UniformRandomProvider#nextDouble()} method.
     *
     * @param sources Source of randomness.
     * @return the double
     */
    @Benchmark
    public double nextDouble(Sources sources) {
        return sources.getGenerator().nextDouble();
    }
}
{code}

{{BaselineSources}} declares all RNG providers plus the keyword {{BASELINE}}. When it identifies this keyword it calls the method to create the baseline; otherwise it creates the RandomSource using the standard factory method.

I found stability on my test machine using 10 iterations of 1 second each for both warm-up and measurement:

{noformat}
Intel(R) Xeon(R) CPU E5-1680 v3 @ 3.20GHz
openjdk version "1.8.0_191"
Linux gc04016493.gdsc.susx.ac.uk 4.4.0-145-generic #171-Ubuntu SMP Tue Mar 26 12:43:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
{noformat}

I ran all the benchmarks, subtracted the BASELINE from the runtime metric and then expressed the results relative to the JDK. I found the median of 10 runs to be more stable than the average, as a few outliers can skew the score. Here are the relative median scores:

!next.png!

This should be compared to the [RNG User Guide|https://commons.apache.org/proper/commons-rng/userguide/rng.html]. Those results show that the fastest generators are SPLIT_MIX, XOR_SHIFT_1024_S, TWO_CMRES and MWC_256, and the slowest are JDK and WELL_44497x. This chart shows the same results but the differences are larger. For reference, here are the results for SplitMix in the User Guide (v1.2) versus the new results:

||Method||New||Old||
|nextLong|0.0806000708|0.21751|
|nextBytes|0.2179846927| |
|nextDouble|0.1183835585|0.27611|
|nextLongN|0.1803571971| |
|nextInt|0.1618788839|0.41479|
|nextFloat|0.2567586134| |
|nextIntN|0.3662874911| |
|nextBoolean|1.176949893|1.1214|

In the case of the SplitMix the JMH overhead for the time measurement of {{nextLong()}} is 63% of the total time. So by subtracting the baseline the differences between a fast generator and a slow one are more pronounced.
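The dispatch on the {{BASELINE}} keyword can be sketched roughly as follows. This is a minimal stand-alone sketch, not the actual {{BaselineSources}} code: the {{Rng}} interface, the class names and the use of {{java.util.Random}} as a stand-in source are all invented for illustration:

```java
// Minimal sketch of dispatching on a "BASELINE" keyword alongside real
// provider names. Not the actual Commons RNG BaselineSources implementation.
public class BaselineDispatchSketch {
    /** Stand-in for UniformRandomProvider. */
    interface Rng {
        double nextDouble();
    }

    /** Dummy generator returning non-final state so the JIT cannot fold it away. */
    static class BaselineRng implements Rng {
        private double value; // mutable state, never a compile-time constant
        public double nextDouble() {
            return value;
        }
    }

    static Rng create(String name) {
        if ("BASELINE".equals(name)) {
            return new BaselineRng(); // baseline: "generate" a fixed state value
        }
        // The real code would create the named RandomSource via its factory here;
        // java.util.Random is only a stand-in to keep this sketch self-contained.
        return new java.util.Random()::nextDouble;
    }

    public static void main(String[] args) {
        System.out.println(create("BASELINE").nextDouble());       // 0.0
        System.out.println(create("SPLIT_MIX_64").nextDouble() < 1.0); // true
    }
}
```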
I also note the strangely fast speed of {{nextBoolean()}} for the JDK implementation, where it is the fastest despite the JDK not being a fast generator of {{int}} values. The implementation of this method is standard for an {{IntProvider}}: store a single {{int}} value and then use each consecutive bit of the 32 bits for a {{boolean}}. So the speed-up must be related to the relative speed of testing the bits from the JDK source versus the others, and we know from the BigCrush results that the JDK bit source is not good.

I've opened [PR 35|https://github.com/apache/commons-rng/pull/35] with the new code.

This idea can be applied to other benchmarks currently using a loop to test performance that do not require input data/state change on each loop iteration. These are:
* GeometricSamplersPerformance
* SamplersPerformance
* NextGaussianPerformance

In the case of SamplersPerformance it may be better to separate the DiscreteSampler and ContinuousSampler into two benchmarks that can have a baseline implementation of the
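The {{nextBoolean()}} bit-caching scheme mentioned above can be sketched as follows. This captures the general idea (one {{int}} supplies 32 booleans), not the exact Commons RNG {{IntProvider}} code; the class name and the use of {{java.util.Random}} as the underlying {{int}} source are stand-ins:

```java
// Sketch of the bit-caching scheme used for nextBoolean(): draw one int,
// then serve its 32 bits one at a time. Not the exact IntProvider code.
public class BooleanBitCache {
    private final java.util.Random source = new java.util.Random(42); // stand-in int source
    private int bits;     // cached 32-bit value
    private int bitsLeft; // unserved bits remaining in the cache

    public boolean nextBoolean() {
        if (bitsLeft == 0) {
            bits = source.nextInt(); // refill: one int supplies 32 booleans
            bitsLeft = 32;
        }
        boolean result = (bits & 1) == 1; // take the lowest unserved bit
        bits >>>= 1;
        bitsLeft--;
        return result;
    }

    public static void main(String[] args) {
        BooleanBitCache cache = new BooleanBitCache();
        int trues = 0;
        for (int i = 0; i < 3200; i++) {
            if (cache.nextBoolean()) {
                trues++;
            }
        }
        // Roughly half the bits should be set for a reasonable source.
        System.out.println(trues > 1200 && trues < 2000); // true
    }
}
```

The cost per call is therefore dominated by the bit test and cache bookkeeping, with the underlying generator only invoked once every 32 calls, which is consistent with the observation that the JDK's slow {{int}} generation barely matters here.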
[jira] [Commented] (RNG-88) Update the GenerationPerformance benchmark
[ https://issues.apache.org/jira/browse/RNG-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809710#comment-16809710 ] Bernd Eckenfels commented on RNG-88:
---
Just a BTW: the JMH harness goes to great lengths to remove artifacts from the actual measurements. You can add a baseline method here, however I would avoid trying to massage the returned results in any way. For relative comparison it is not needed anyway (and overhead should also be observable by varying the request length if possible).