[jira] [Commented] (RNG-88) Update the GenerationPerformance benchmark

2019-04-04 Thread Alex D Herbert (JIRA)


[ 
https://issues.apache.org/jira/browse/RNG-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809761#comment-16809761
 ] 

Alex D Herbert commented on RNG-88:
---

{quote}Just a BTW: The JMH harness goes to great lengths to remove artifacts 
from the actual measurements. You can add a baseline method here; however, I 
would avoid trying to massage the returned results in any way. For relative 
comparison it's not needed anyway (and overhead should also be observable by 
varying request length, if possible).
{quote}
Hi [~b.eckenfels],

I have studied the JMH tutorials and tried to implement best practice, 
specifically this example on 
[dead-code elimination|https://hg.openjdk.java.net/code-tools/jmh/file/f25ae8584db1/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_08_DeadCode.java],
 which demonstrates that there is a measured time for doing nothing.

The issue we have here is that we want to measure the time for generation of a 
random number. So we ideally want to measure:
{noformat}
[time to generate nothing]       + [time to consume the value]
[time to generate random number] + [time to consume the value]
{noformat}
Subtracting one from the other gives the [time to generate random number].
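As a concrete sketch of this subtraction (the timings below are illustrative numbers, not measured values):

```java
/**
 * Illustrative sketch of the baseline subtraction above. The timings are
 * made-up numbers, not measured values from this thread.
 */
public class BaselineSubtraction {

    /** Net time attributable to the generator method itself. */
    static double netTime(double measured, double baseline) {
        return measured - baseline;
    }

    public static void main(String[] args) {
        double baseline = 2.6; // [time to generate nothing] + [time to consume the value]
        double measured = 4.5; // [time to generate number]  + [time to consume the value]
        // Subtracting one from the other leaves only the generation time.
        System.out.println(netTime(measured, baseline));
    }
}
```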

The main issue with not considering a baseline is that the random generation 
time is very fast. It is so fast that the time taken by JMH to test a generated 
value is much longer than the time to generate it.

Leaving the results unmassaged is fine for ranking, but it is not 
representative for relative speed comparisons.

The baseline subtraction is not part of the new code anyway. The new code in 
the PR just eliminates the loop from within the test method, which is another 
JMH best practice (see the [Loops 
example|https://hg.openjdk.java.net/code-tools/jmh/file/f25ae8584db1/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_11_Loops.java]).

I hope I have explained what I am trying to achieve. Your comments on the 
results I have shown above and code in the PR are welcome.

> Update the GenerationPerformance benchmark
> --
>
> Key: RNG-88
> URL: https://issues.apache.org/jira/browse/RNG-88
> Project: Commons RNG
>  Issue Type: Improvement
>  Components: examples
>Affects Versions: 1.3
>Reporter: Alex D Herbert
>Assignee: Alex D Herbert
>Priority: Minor
> Attachments: baseline.jpg, next.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The current GenerationPerformance benchmark runs all the generators to 
> collect timing data. However the act of running the test within JMH takes 
> some time (test overhead). This overhead should be measured and subtracted 
> from the timing data to create a time attributed only to the method exercised 
> in the RNG.
> This can be done by creating a dummy RNG that returns a fixed value. The 
> implementation must be done in a manner where the JIT compiler is not able to 
> remove the dummy method from the execution path. This is achieved by 
> returning state variables from the dummy RNG (not final variables).
> Testing can be done using a variable number of iterations and the run-times 
> assessed for linearity. If the run times scale with the number of iterations 
> then the JIT compiler has not removed the method from execution. The dummy RNG then 
> serves as a baseline for comparison of true implementations.
> This idea with examples is shown in 
> [RNG-87|https://issues.apache.org/jira/browse/RNG-87] which tested a variant 
> of the MWC_256 algorithm.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RNG-88) Update the GenerationPerformance benchmark

2019-04-04 Thread Alex D Herbert (JIRA)


[ 
https://issues.apache.org/jira/browse/RNG-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809742#comment-16809742
 ] 

Alex D Herbert commented on RNG-88:
---

PS.

Note that the example for a benchmark class has two {{@Benchmark}} methods to 
test JMH timing overhead. These should sum to approximately the same metric as 
the BASELINE implementation. In the above case for {{nextDouble()}} here are 
the results:
||randomSourceName||Method||Score||Error||Median||Error/Score||
|BASELINE|nextDouble|2.57593197585943|0.003460918835402205|2.57636702779831|0.00134355987185862|
|ISAAC|nextDouble|11.492061181491472|0.03105190366683462|11.4930378375858|0.00270203083471616|
|JDK|nextDouble|19.07710097015598|0.13765564275214587|19.0508121852192|0.00721575269573155|
|KISS|nextDouble|9.205196709268291|0.006192317077079371|9.20614509815848|0.000672697963189055|
|MT|nextDouble|10.653521635100258|0.010420185228341606|10.6524402774538|0.00097809772066451|
|MT_64|nextDouble|6.8707343968410886|0.04118236194818618|6.8627989246393|0.00599388064936993|
|MWC_256|nextDouble|6.582187644902875|0.05365540468787355|6.57988293788998|0.0081516073234|
|SPLIT_MIX_64|nextDouble|4.5286626926183935|0.008298205686561705|4.52667047056436|0.0018323744226055|
|TWO_CMRES|nextDouble|5.359469359437292|0.050123182243253586|5.34417676563963|0.00935226584605685|
|WELL_1024_A|nextDouble|13.632475941306263|2.3697998386644143|12.8807508310923|0.173834881416071|
|WELL_19937_A|nextDouble|18.45089096484812|0.0381182263045|18.4441767343849|0.00206592876344951|
|WELL_19937_C|nextDouble|19.972261377408604|0.02239912024952394|19.9724925126603|0.0011215114716484|
|WELL_44497_A|nextDouble|19.858159613693175|0.0871231306453897|19.8435844247295|0.00438727114396412|
|WELL_44497_B|nextDouble|20.725319834240956|0.14352086213185533|20.7110885030745|0.00692490457468068|
|WELL_512_A|nextDouble|12.967886115136594|0.023823029376018285|12.9610798024776|0.00183707885498865|
|XOR_SHIFT_1024_S|nextDouble|5.3340350544353825|0.005344749168553254|5.33227602956032|0.00100200863211594|
|XOR_SHIFT_1024_S_PHI|nextDouble|5.5063780535656885|0.7355642816548434|5.34975699418824|0.133584050077805|
| |baselineDouble|2.3360076674923396|0.3117121027015|2.26776285101707|0.133437962143361|
| |baselineVoid|0.2657420971315234|3.32183670742148E-4|0.265707936583524|0.00125002276390459|

So:
{noformat}
baselineDouble + baselineVoid ~ BASELINE
2.33 + 0.27 ~ 2.57
{noformat}
This shows the baseline implementation is being run.

Note that the Error output by JMH is the 0.999 percentile of the 
T-distribution with n-1 degrees of freedom multiplied by the standard error of 
the mean. This is a conservative +/- limit on the true value of the mean. 
Dividing the Error by the Score gives a quick check for large variation in the 
mean. 
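A sketch of this Error/Score check, using the WELL_1024_A and SPLIT_MIX_64 values from the table above (the noise threshold is an arbitrary illustrative choice):

```java
/**
 * Sketch of the Error/Score stability check described above. The input
 * values are taken from the nextDouble table; the threshold is illustrative.
 */
public class StabilityCheck {

    /** Ratio of the JMH Error to the Score. */
    static double relativeError(double score, double error) {
        return error / score;
    }

    public static void main(String[] args) {
        // WELL_1024_A: a large error relative to its score flags noisy runs.
        System.out.println(relativeError(13.632475941306263, 2.3697998386644143));
        // SPLIT_MIX_64: a tight error indicates a stable mean.
        System.out.println(relativeError(4.5286626926183935, 0.008298205686561705));
    }
}
```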

In this example WELL_1024_A has a single measurement of 16.5, well above its 
usual time of around 12.9. Hence the median is used for the analysis to avoid 
skew from single outliers.

 



[jira] [Commented] (RNG-88) Update the GenerationPerformance benchmark

2019-04-04 Thread Alex D Herbert (JIRA)


[ 
https://issues.apache.org/jira/browse/RNG-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809735#comment-16809735
 ] 

Alex D Herbert commented on RNG-88:
---

I have created new JMH benchmarks for all the methods defined in 
{{UniformRandomProvider}}.

A baseline implementation has been created for each method and scales linearly 
with workload:

!baseline.jpg!

(Note: The {{nextBytes()}} method is plotted on the second Y-axis as the 
measurement time is very different.)

Each method from {{UniformRandomProvider}} for each random source is then 
benchmarked against a baseline implementation. The benchmarks are split into 
separate classes that share the baseline implementation, e.g. 
{{NextFloatGenerationPerformance}} for {{nextFloat()}}, and 
{{NextIntGenerationPerformance}} for {{nextInt()}} and {{nextInt(int)}}.

An example benchmark looks like this:
{code:java}
/**
 * Executes a benchmark to compare the speed of generation of random numbers
 * from the various source providers for {@link UniformRandomProvider#nextDouble()}.
 */
public class NextDoubleGenerationPerformance extends AbstractBenchmark {
    /**
     * The benchmark state (retrieve the various "RandomSource"s).
     */
    @State(Scope.Benchmark)
    public static class Sources extends BaselineSources {
        @Override
        protected UniformRandomProvider createBaseline() {
            return BaselineUtils.getNextDouble();
        }
    }

    /** The value. */
    private double value;

    /**
     * Baseline for a JMH method call with no return value.
     */
    @Benchmark
    public void baselineVoid() {
        // Do nothing, this is a baseline
    }

    /**
     * Baseline for a JMH method call returning a {@code double}.
     *
     * @return the value
     */
    @Benchmark
    public double baselineDouble() {
        return value;
    }

    /**
     * Exercise the {@link UniformRandomProvider#nextDouble()} method.
     *
     * @param sources Source of randomness.
     * @return the double
     */
    @Benchmark
    public double nextDouble(Sources sources) {
        return sources.getGenerator().nextDouble();
    }
}
{code}
The {{BaselineSources}} class declares all RNG providers plus the keyword 
{{BASELINE}}. When it identifies this keyword it calls the method to create the 
baseline; otherwise it creates the RandomSource using the standard factory 
method.
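A minimal sketch of that selection logic, assuming a stand-in provider interface (the {{BASELINE}} keyword and {{createBaseline()}} follow the description above; everything else is illustrative, not the Commons RNG API):

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical sketch of the BaselineSources selection logic. The BASELINE
 * keyword and createBaseline() follow the text; the nested provider interface
 * and the fallback generator are stand-ins, not the Commons RNG API.
 */
public class BaselineSourcesSketch {

    /** Minimal stand-in for the UniformRandomProvider interface. */
    interface Provider {
        double nextDouble();
    }

    /** State base class: maps a random source name to a provider. */
    abstract static class BaselineSources {
        Provider create(String randomSourceName) {
            if ("BASELINE".equals(randomSourceName)) {
                // The keyword selects the dummy baseline implementation.
                return createBaseline();
            }
            // Stand-in for the standard RandomSource factory method.
            return () -> ThreadLocalRandom.current().nextDouble();
        }

        /** Subclasses supply the method-specific baseline. */
        protected abstract Provider createBaseline();
    }

    public static void main(String[] args) {
        BaselineSources sources = new BaselineSources() {
            @Override
            protected Provider createBaseline() {
                // The real baseline returns a state variable so the JIT
                // cannot remove it; a fixed value suffices for this sketch.
                return () -> 0.5;
            }
        };
        System.out.println(sources.create("BASELINE").nextDouble());
    }
}
```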

I found stable results on my test machine using 10 warm-up iterations and 10 
measurement iterations of 1 second each.
{noformat}
Intel(R) Xeon(R) CPU E5-1680 v3 @ 3.20GHz
openjdk version "1.8.0_191"
Linux gc04016493.gdsc.susx.ac.uk 4.4.0-145-generic #171-Ubuntu SMP Tue Mar 26 
12:43:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
{noformat}
I ran all the benchmarks, subtracted the BASELINE from the runtime metric then 
expressed the results relative to the JDK. I found the median of 10 runs to be 
more stable than the average as there can be a few outliers that skew the 
score. Here are the relative median scores:

!next.png!
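The post-processing described above (median of the runs, baseline subtraction, then scores relative to the JDK) can be sketched as follows; the method names are illustrative, and the sample values are medians from the tables in this thread:

```java
import java.util.Arrays;

/**
 * Sketch of the result post-processing: take the median of the per-run
 * scores, subtract the BASELINE and express the result relative to the
 * (baseline-subtracted) JDK. Method names are illustrative.
 */
public class RelativeScore {

    /** Median of the per-run scores (assumes runs is non-empty). */
    static double median(double[] runs) {
        final double[] s = runs.clone();
        Arrays.sort(s);
        final int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2;
    }

    /** Baseline-subtracted score relative to the baseline-subtracted JDK. */
    static double relativeToJdk(double score, double baseline, double jdk) {
        return (score - baseline) / (jdk - baseline);
    }

    public static void main(String[] args) {
        // SPLIT_MIX_64 nextDouble medians: score, BASELINE, JDK.
        System.out.println(relativeToJdk(4.52667047056436, 2.57636702779831,
                                         19.0508121852192));
    }
}
```

Feeding in the SPLIT_MIX_64, BASELINE and JDK {{nextDouble}} medians from the tables reproduces a relative score of about 0.118, consistent with the comparison table below.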

This should be compared to the [RNG User 
Guide|https://commons.apache.org/proper/commons-rng/userguide/rng.html]. Those 
results show that the fastest generators are SPLIT_MIX, XOR_SHIFT_1024_S, 
TWO_CMRES and MWC_256, and the slowest are the JDK and WELL_44497x. This chart 
shows the same ranking but the differences are larger.

For reference, the results of the SplitMix in the User Guide (v1.2) versus the 
new results:
||Method||New||Old||
|nextLong|0.0806000708|0.21751|
|nextBytes|0.2179846927| |
|nextDouble|0.1183835585|0.27611|
|nextLongN|0.1803571971| |
|nextInt|0.1618788839|0.41479|
|nextFloat|0.2567586134| |
|nextIntN|0.3662874911| |
|nextBoolean|1.176949893|1.1214|

In the case of the SplitMix the JMH time-measurement overhead for 
{{nextLong()}} is 63% of the total time, so subtracting the baseline makes the 
differences between a fast generator and a slow one more pronounced.

I also note the strangely fast speed of {{nextBoolean()}} for the JDK 
implementation, which is fastest despite the JDK not being a fast generator of 
{{int}} values. The implementation of this method is standard for an 
{{IntProvider}}: store a single {{int}} value and use each consecutive bit of 
the 32 bits for a {{boolean}}. So the speed-up must be related to the relative 
speed of testing the bits from the JDK source versus the others, and we know 
from the BigCrush results that the JDK bit source is not good.
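The bit-caching technique described can be sketched as follows (an illustration of the idea, not the actual Commons RNG {{IntProvider}} source):

```java
import java.util.function.IntSupplier;

/**
 * Sketch of the standard IntProvider nextBoolean() technique described
 * above: cache one int and hand out its 32 bits one at a time, refilling
 * from the underlying source when exhausted. Illustrative only.
 */
public class BooleanBitCache {

    /** Source of 32-bit values (stands in for the underlying RNG). */
    private final IntSupplier source;
    /** Cached source bits. */
    private int bits;
    /** Number of booleans remaining in the cache. */
    private int count;

    BooleanBitCache(IntSupplier source) {
        this.source = source;
    }

    boolean nextBoolean() {
        if (count == 0) {
            bits = source.getAsInt(); // refill: one int = 32 booleans
            count = 32;
        }
        count--;
        final boolean result = (bits & 1) == 1;
        bits >>>= 1; // consume the lowest bit
        return result;
    }

    public static void main(String[] args) {
        // A constant source makes the bit order easy to see: 0b101 yields
        // true, false, true for the first three calls.
        BooleanBitCache cache = new BooleanBitCache(() -> 0b101);
        System.out.println(cache.nextBoolean());
        System.out.println(cache.nextBoolean());
        System.out.println(cache.nextBoolean());
    }
}
```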

I've opened [PR 35|https://github.com/apache/commons-rng/pull/35] with the new 
code.

This idea can be applied to other benchmarks currently using a loop to test 
performance that do not require input data/state change on each loop iteration. 
These are:
 * GeometricSamplersPerformance
 * SamplersPerformance
 * NextGaussianPerformance

In the case of SamplersPerformance it may be better to separate the 
DiscreteSampler and ContinuousSampler into two benchmarks that can have a 
baseline implementation of the 

[jira] [Commented] (RNG-88) Update the GenerationPerformance benchmark

2019-04-04 Thread Bernd Eckenfels (JIRA)


[ 
https://issues.apache.org/jira/browse/RNG-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809710#comment-16809710
 ] 

Bernd Eckenfels commented on RNG-88:


Just a BTW: The JMH harness goes to great lengths to remove artifacts from the 
actual measurements. You can add a baseline method here; however, I would avoid 
trying to massage the returned results in any way. For relative comparison it's 
not needed anyway (and overhead should also be observable by varying request 
length, if possible).

 
