On Thu, 7 Apr 2016 18:09:10 +0100, Chris Lucas wrote:
Sorry, you're right that I should have summarized the key implications (at
least as I see them).

The short version:
(1) If you're multiplying small matrices with AMC, BlockRealMatrix is
your friend. For large matrices, stick with Array2DRowRealMatrix

This is surprising, as the exact opposite of the intended purpose of
"BlockRealMatrix".
It's certainly worth ascertaining with more data points (to make a nice plot).

(or
something else?).

Patch welcome. :-)

(2) Scaling isn't great -- O(N^3) where N is the number of rows/cols in a square matrix) -- but neither is OJAlgo's. The inter-library differences
can mostly be attributed to constant factors.

The longer version:

Observation 1: For Array2DRowRealMatrix, matrix multiplication appears to scale with ~ O(N^3) (i.e., the naive algorithm). In retrospect, that was a
bit silly to spend time benchmarking because I've just looked at the
source[^1] and it *is* the naive algorithm.

Not totally, there is a copy of the "current row" which otherwise makes
the computation scale really badly (and perhaps the initial reason for the
"BlockRealMatrix" implementation).

Naturally, better asymptotic
performance would be nice,

What do you mean?

but I recognize that the AMC devs likely have
better things to do.

Observation 2: When multiplying new matrices, BlockMatrix offers some
improvement when multiplying the same matrices repeatedly

Why would someone do that?

and for smaller
matrices. However on one person's setup (mine) it doesn't offer better
performance on large random matrices. Follow-up tests with N \in
{1000,1500,2000,2500} indicate it scales worse than the naive algorithm (didn't look at growth, but factor of ~5 worse on a 2500x2500 matrix).

In effect, that had been my conclusion some years ago. [But I hadn't check "small" matrices, as I was so convinced that the overhead of "BlockRealMatrix"
was unsuitable for the smaller sizes!]

IIUC, CM (both "Array2DRowRealMatrix" and "BlockRealMatrix") only makes data closer in RAM with the "hope" that the CPU cache can help (at least that's
what the doc for "BlockRealMatrix" indicates IIRC).

Observation 3: OJAlgo doesn't scale better asymptotically from what I can
tell. That's no great surprise based on
https://github.com/optimatika/ojAlgo/wiki/Multiply.
<https://github.com/optimatika/ojAlgo/wiki/Multiply> Scala Breeze w/native
libs scales closer to O(N^2.4).

Details on how you arrive to those numbers would really be a useful addition
to the CM documentation.

Observation 4: Breeze with native libraries does better than one might guess based on some previous benchmarks. People still might want to use math commons, not least because those native libs might be encumbered with
undesirable licenses.

Do they use multi-threading?

Regards,
Gilles


[^1]:

https://git-wip-us.apache.org/repos/asf?p=commons-math.git;a=blob;f=src/main/java/org/apache/commons/math4/linear/Array2DRowRealMatrix.java;h=3778db56ba406beb973e1355234593bc006adb59;hb=HEAD


On 7 April 2016 at 16:50, Gilles <gil...@harfang.homelinux.org> wrote:

On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:

Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
comparing the two, along with some other libraries.


The email is not really suited for tables (see your message below).
What are the points which you want to make?

As said before, "reports" on CM features (a.o. benchmarks) are welcome but should be formatted for inclusion in the userguide (for now, until we might create a separate document for data that is subject to change more or less
rapidly).

If you suspect a problem with the code, then I suggest that you file a JIRA report, where files can be attached (or tables can be viewed without being
deformed thanks to the "{noformat}" tag).

Regards,
Gilles


I'm using jmh 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmups and 10 test
iterations.

The suffixes denote what's being tested:
  MC: AMC 3.6.1 using Array2DRowRealMatrix
  SB: Scala Breeze 0.12 with native libraries
OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially
speedy pure-java library.
  BMC: MC using a BlockRealMatrix.

I'm using the same matrix creation/multiplication/inverse code as I
mentioned in my previous note. When testing BlockReadMatrices, I'm using
fixed random matrices and annotating my class with
@State(Scope.Benchmark).
For the curious, my rationale for building matrices on the spot is that
I'm
already caching results pretty heavily and rarely perform the same
operation on the same matrix twice -- I'm most curious about performance
in
the absence of caching benefits.

Test 1a: Creating and multiplying two random/uniform (i.e., all entries
drawn from math.random()) 100x100 matrices:
[info] MatrixBenchmarks.buildMultTestMC100 thrpt 10 836.579
±       7.120  ops/s
[info] MatrixBenchmarks.buildMultTestSB100 thrpt 10 1649.089
±     170.649  ops/s
[info] MatrixBenchmarks.buildMultTestOJ100 thrpt 10 1485.014 ± 44.158
ops/s
[info] MatrixBenchmarks.multBMC100 thrpt 10 1051.055 ± 2.290 ops/s

Test 1b: Creating and multiplying two random/uniform 500x500 matrices: [info] MatrixBenchmarks.buildMultTestMC500 thrpt 10 8.997
±       0.064  ops/s
[info] MatrixBenchmarks.buildMultTestSB500 thrpt 10 80.558
±       0.627  ops/s
[info] MatrixBenchmarks.buildMultTestOJ500 thrpt 10 34.626
±       2.505  ops/s
[info] MatrixBenchmarks.multBMC500 thrpt 10 9.132 ± 0.059 ops/s [info] MatrixBenchmarks.multCacheBMC500 thrpt 10 13.630 ± 0.107 ops/s
[no matrix creation]

---

Test 2a: Creating a single 500x500 matrix (to get a sense of whether the
mult itself is driving differences:
[info] MatrixBenchmarks.buildMC500 thrpt 10 155.026
±       2.456  ops/s
[info] MatrixBenchmarks.buildSB500 thrpt 10 197.257
±       4.619  ops/s
[info] MatrixBenchmarks.buildOJ500 thrpt 10 176.036
±       2.121  ops/s

Test 2b: Creating a single 1000x1000 matrix (to show it scales w/N): [info] MatrixBenchmarks.buildMC1000 thrpt 10 36.112 ± 2.775 ops/s [info] MatrixBenchmarks.buildSB1000 thrpt 10 45.557 ± 2.893 ops/s [info] MatrixBenchmarks.buildOJ1000 thrpt 10 47.438 ± 1.370 ops/s [info] MatrixBenchmarks.buildBMC1000 thrpt 10 37.779 ± 0.871 ops/s

---

Test 3a: Inverting a random 100x100 matrix:
[info] MatrixBenchmarks.invMC100 thrpt 10 1033.011 ± 9.928 ops/s [info] MatrixBenchmarks.invSB100 thrpt 10 1664.833 ± 15.170 ops/s [info] MatrixBenchmarks.invOJ100 thrpt 10 1174.044 ± 52.083 ops/s [info] MatrixBenchmarks.invBMC100 thrpt 10 1043.858 ± 22.144 ops/s

Test 3b: Inverting a random 500x500 matrix:
[info] MatrixBenchmarks.invMC500 thrpt 10 9.089 ±
0.284  ops/s
[info] MatrixBenchmarks.invSB500 thrpt 10 43.857 ±
1.083  ops/s
[info] MatrixBenchmarks.invOJ500 thrpt 10 23.444 ±
1.484  ops/s
[info] MatrixBenchmarks.invBMC500 thrpt 10 9.156 ± 0.052 ops/s [info] MatrixBenchmarks.invCacheBMC500 thrpt 10 9.627 ± 0.084 ops/s
[no matrix creation]

Test 3c:
[info] MatrixBenchmarks.invMC1000 thrpt 10 0.704 ± 0.040 ops/s [info] MatrixBenchmarks.invSB1000 thrpt 10 8.611 ± 0.557
ops/s
[info] MatrixBenchmarks.invOJ1000 thrpt 10 3.539 ± 0.229 ops/s [info] MatrixBenchmarks.invBMC1000 thrpt 10 0.691 ± 0.095 ops/s

Also, isn't matrix multiplication at least O(N^2.37), rather than O(N^2)?

On 6 April 2016 at 12:55, Chris Lucas <cluc...@inf.ed.ac.uk> wrote:

Thanks for the quick reply!

I've pasted a small self-contained example at the bottom. It creates the matrices in advance, but nothing meaningful changes if they're created
on a
per-operation basis.

Results for 50 multiplications of [size]x[size] matrices:
   Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.062 seconds, time per: 0.001 seconds
   Size: 300, total time: 3.050 seconds, time per: 0.061 seconds
   Size: 500, total time: 15.186 seconds, time per: 0.304 seconds
   Size: 600, total time: 17.532 seconds, time per: 0.351 seconds

For comparison:

Results for 50 additions of [size]x[size] matrices (which should be faster, be the extent of the difference is nonetheless striking to me):
   Size: 10, total time: 0.011 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 300, total time: 0.020 seconds, time per: 0.000 seconds
   Size: 500, total time: 0.032 seconds, time per: 0.001 seconds
   Size: 600, total time: 0.050 seconds, time per: 0.001 seconds

Results for 50 inversions of a [size]x[size] matrix, which I'd expect to
be slower than multiplication for larger matrices:
   Size: 10, total time: 0.014 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.074 seconds, time per: 0.001 seconds
   Size: 300, total time: 1.005 seconds, time per: 0.020 seconds
   Size: 500, total time: 5.024 seconds, time per: 0.100 seconds
   Size: 600, total time: 9.297 seconds, time per: 0.186 seconds

I hope this is useful, and if I'm doing something wrong that's leading to
this performance gap, I'd love to know.

----
import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;


public class AMCMatrices {

  public static void main(String[] args) {
    miniTest(0);
  }

  public static void miniTest(int tType) {
    int samples = 50;

    int sizes[] = { 10, 100, 300, 500, 600 };

    for (int sI = 0; sI < sizes.length; sI++) {
      int mSize = sizes[sI];

org.apache.commons.math3.linear.RealMatrix m0 = buildM(mSize,
mSize);
      RealMatrix m1 = buildM(mSize, mSize);

      long start = System.nanoTime();
      for (int n = 0; n < samples; n++) {
        switch (tType) {
        case 0:
          m0.multiply(m1);
          break;
        case 1:
          m0.add(m1);
          break;
        case 2:
          new LUDecomposition(m0).getSolver().getInverse();
          break;
        }

      }
      long end = System.nanoTime();

      double dt = ((double) (end - start)) / 1E9;
      System.out.println(String.format(
"Size: %d, total time: %3.3f seconds, time per: %3.3f seconds",
          mSize, dt, dt / samples));
    }
  }

  public static Array2DRowRealMatrix buildM(int r, int c) {
    double[][] matVals = new double[r][c];
    for (int i = 0; i < r; i++) {
      for (int j = 0; j < c; j++) {
        matVals[i][j] = Math.random();
      }
    }
    return new Array2DRowRealMatrix(matVals);
  }
}

----

On 5 April 2016 at 19:36, Gilles <gil...@harfang.homelinux.org> wrote:

Hi.

On Tue, 5 Apr 2016 15:43:04 +0100, Chris Lucas wrote:

I recently ran a benchmark comparing the performance math commons
3.6.1's
linear algebra library to the that of scala Breeze (
https://github.com/scalanlp/breeze).

I looked at det, inverse, Cholesky factorization, addition, and
multiplication, including matrices with 10, 100, 500, and 1000
elements,
with symmetric, non-symmetric, and non-square cases where applicable.


It would be interesting to add this to the CM documentation:
  https://issues.apache.org/jira/browse/MATH-1194

In general, I was pleasantly surprised: math commons performed about as

well as Breeze, despite the latter relying on native libraries. There
was
one exception, however:

    m0.multiply(m1)

where m0 and m1 are both Array2DRowRealMatrix instances. It scaled very poorly in math commons, being much slower than nominally more expensive operations like inv and the Breeze implementation. Does anyone have a
thought as to what's going on?


Could your provide more information such as a plot of performance vs
size?
A self-contained and minimal code to run would be nice too.
See the CM micro-benchmarking tool here:




https://github.com/apache/commons-math/blob/master/src/test/java/org/apache/commons/math4/PerfTestUtils.java
And an example of how to use it:




https://github.com/apache/commons-math/blob/master/src/userguide/java/org/apache/commons/math4/userguide/FastMathTestPerformance.java

In case it's useful, one representative test

involves multiplying two instances of

    new Array2DRowRealMatrix(matVals)

where matVals is 1000x1000 entries of math.random and the second
instance
is created as part of the loop. This part of the benchmark is not
specific
to the expensive multiplication step, and takes very little time
relative
to the multiplication itself. I'm using System.nanotime for the timing,
and
take the average time over several consecutive iterations, on a 3.5 GHz
Intel Core i7, Oracle JRE (build 1.8.0_05-b13).


You might want to confirm the behaviour using JMH (becoming the Java
standard
for benchmarking):
  http://openjdk.java.net/projects/code-tools/jmh/


Best regards,
Gilles


Thanks,

Chris



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@commons.apache.org
For additional commands, e-mail: user-h...@commons.apache.org

Reply via email to