Excellent news!  I don't have much time at the moment, but I definitely
want to try out the new code and experiment with other solvers, in
particular KLU from Tim Davis' SuiteSparse.

Just a few quick comments to summarize and clarify my previous post:

   - It's not obvious at all that MATLAB uses different sparse solvers
   depending on the way you make the lu() call.  I re-read my post and
   realized that I probably didn't explain this in detail.  The problem is
   that MathWorks does not document this at all. Luckily, Tim Davis (the main
   guy behind SuiteSparse, and several of the sparse solvers in MATLAB)
   published how this works in his FACTORIZE paper
   <http://dl.acm.org/citation.cfm?id=2491498>.
   - The summary is: backslash and the four- and five-output syntaxes
   invoke UMFPACK, while the two- and three-output syntaxes invoke the GP
   solver (a small sketch follows this list).  Still, since none of this is
   officially documented, MathWorks may change it in future versions.
   - About the performance of sparse solvers in general: to my knowledge,
   there are two very important factors:
      - The use of BLAS kernels. Methods with names such as "supernodal"
      and "multifrontal" try to take advantage of dense BLAS calls as much as
      possible. In application areas such as finite-element methods, this is
      crucial for performance, because the matrices do exhibit many dense
      sub-blocks.  Not so for power system matrices, which is why
      "simplicial" methods are faster (using the BLAS would incur too much
      overhead -- too many BLAS calls on sub-blocks of very small size).
      - Fill-reducing reorderings. These affect all sparse methods,
      regardless of the use of the BLAS.  The problem is that reordering
      algorithms perform differently on different types of matrices.
      Experimentally, people have found that AMD is the best for electrical
      circuits and power system matrices.
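
A minimal sketch of those dispatch rules (remember they're undocumented, so
this may change between releases; A here is just a random sparse unsymmetric
test matrix, not a power system Jacobian):

    n = 500;
    A = sprandn(n, n, 0.01) + speye(n);   % sparse, unsymmetric test matrix
    b = ones(n, 1);
    [L, U]          = lu(A);     % two-output   -> Gilbert-Peierls (GP)
    [L, U, P]       = lu(A);     % three-output -> GP
    [L, U, P, Q]    = lu(A);     % four-output  -> UMFPACK
    [L, U, P, Q, R] = lu(A);     % five-output  -> UMFPACK (R adds row scaling)
    x = A \ b;                   % backslash    -> UMFPACK for square unsymmetric A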


As you said, it is very likely that the choice of fill-reducing algorithm
has a greater influence than the choice of UMFPACK vs. GP.  I can imagine
that UMFPACK disables calls to the BLAS for very small blocks of, say,
3x3.  If that were the case, both methods would be similar for the kind of
matrices we typically have here.

On the other hand, Davis makes this comment in his paper:

*GP is faster than UMFPACK for small matrices and for some circuit
matrices, but in general the results in Figure 3 show that UMFPACK
([L,U,P,Q,R]=lu(A)) is the best choice for most sparse unsymmetric
matrices. UMFPACK does poorly for some circuit matrices because its
automatic ordering method selection makes the wrong choice (it selects
COLAMD, but AMD works much better for those matrices).*

So the problem seems to be that you can't tell MATLAB's UMFPACK which
reordering algorithm to use.
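
For instance (a sketch, with a random stand-in for a power flow Jacobian J):
even if you hand the matrix to UMFPACK pre-permuted, the four-output form
computes and returns its own column ordering Q, so an external AMD ordering
can't be forced through that interface; the three-output (G-P) form factors
the matrix exactly as given:

    J = sprandn(1000, 1000, 5e-3) + speye(1000);  % stand-in for a Jacobian
    p = amd(J);                       % our preferred fill-reducing ordering
    [L, U, P, Q] = lu(J(p, p));       % UMFPACK still applies its own ordering Q
    [L, U, P]    = lu(J(p, p), 1.0);  % G-P keeps the ordering we chose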

It's great to have the code you prepared; it'll be fun to experiment with.

-- 
Jose L. Marin
Grupo AIA




2017-02-22 21:12 GMT+01:00 Ray Zimmerman <r...@cornell.edu>:

> First, a big thanks to Richard Lincoln and Mirko Todorovski for their
> valuable contributions since we put MATPOWER on GitHub. Richard contributed
> the Travis CI integration <https://travis-ci.org/MATPOWER/matpower/> that
> now automatically runs the full test suite on every commit and pull request
> (supplying a little green check mark when everything passes). And Mirko
> contributed some distribution power flow algorithms and cases.
>
> Second, in response to the very informative post to MATPOWER-L back in Dec
> by Jose Luis Marin (see below), I made some modifications to the way the
> update step is computed in the Newton-Raphson power flow, resulting in
> significant speed improvements (up to 2x) for large models under Matlab
> (marginally better under Octave).
>
> You can see the details and comment in this pull request ...
>
>     https://github.com/MATPOWER/matpower/pull/7
>
> … or by checking out the lu branch
> <https://github.com/MATPOWER/matpower/tree/lu> of the Git repository.
> Jose, I’m particularly interested in any further comments you might have.
> In particular, it appears that the vast majority of the speedup comes from
> the AMD reordering and not from the Gilbert-Peierls algorithm.
>
>     Ray
>
>
>
>
>
> Begin forwarded message:
>
> *From:* Jose Luis Marín <mari...@aia.es>
> *Subject:* Re: RunPF vs GPU calculations
> *Date:* December 1, 2016 at 12:52:05 PM EST
> *To:* MATPOWER discussion forum <matpowe...@list.cornell.edu>
> *Reply-To:* MATPOWER discussion forum <matpowe...@list.cornell.edu>
>
> Hello all,
>
> I've been meaning to report the following findings to Ray and the MATPOWER
> team for some time, but never got around to doing it.  Well, here it is in
> summarized form:
>
>    - There's a very cheap way to speed up the J \ F step (actually, the
>    sparse LU decomposition) by factors of about 1.3x to 1.8x, and maybe higher
>    for bigger cases.  It just involves invoking the MATLAB solver that is best
>    suited for power system matrices (more below).
>    - GPU computing, at least in its current state, is not a panacea for
>    the power flow problem, because the kind of sparse linear algebra
>    needed there just doesn't offer enough opportunities to exploit GPU
>    architectures.
>
>
> Try this simple change in *newtonpf.m*:
>
> marinjl@xwing:~/Library/matpower6.0b2$ diff newtonpf.m.ORIG newtonpf.m
> 89c89,93
> <     dx = -(J \ F);
> ---
> >     %dx = -(J \ F);
> >     % JLM: For performance, factorize using MATLAB's Gilbert-Peierls + AMD (see T. Davis)
> >     Q = sparse(amd(J), 1:size(J,1), 1);  % AMD reordering
> >     [L, U, P] = lu(Q'*J*Q, 1.0);  % this form ensures that G-P is called. Last arg: 0==no pivoting, 1.0==partial pivoting.
> >     dx = - Q * ( U \ ( L \ (P * Q' * F)));
>
>
> My benchmarks (3GHz Intel Core 2 Duo) yield the following speedups:
>
> %                    UMFPACK        G-P        speedup
> % ======================================================
> % case300:          1.689e-02    1.230e-02      1.37
> % case1354pegase:   5.848e-02    3.652e-02      1.60
> % case1888rte:      2.792e-02    2.130e-02      1.31
> % case1951rte:      3.037e-02    2.169e-02      1.40
> % case2383wp:       1.429e-01    8.711e-02      1.64
> % case2736sp:       1.327e-01    7.512e-02      1.76
> % case2737sop:      1.632e-01    9.074e-02      1.79
> % case2746wop:      1.589e-01    8.725e-02      1.82
> % case2746wp:       1.448e-01    8.636e-02      1.67
> % case2848rte:      4.652e-02    3.301e-02      1.40
> % case2868rte:      1.681e-01    9.661e-02      1.74
> % case2869pegase:   2.402e-01    1.335e-01      1.79
> % case3012wp:       1.125e-01    6.733e-02      1.67
> % case3120sp:       2.155e-01    1.228e-01      1.75
> % case3375wp:       1.598e-01    9.565e-02      1.67
> % case6468rte:      2.155e-01    1.443e-01      1.49
> % case6470rte:      2.316e-01    1.434e-01      1.61
> % case6495rte:      2.118e-01    1.433e-01      1.47
> % case6515rte:      2.244e-01    1.443e-01      1.55
> % case9241pegase:   1.185e+00    6.392e-01      1.85
> % case13659pegase:  1.484e+00    7.991e-01      1.85
> %
> %
> % max speedup: 1.85
> % min speedup: 1.31
>
> (Note: I used pf.tol = 1.0e-10 to trigger a few more iterations. I ran
> each case 10 times and kept the best timing.)
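>
> A sketch of the kind of timing loop I mean (not the exact harness used for
> the table above; in practice J and F would be captured from one Newton
> iteration inside newtonpf.m, not generated randomly):
>
>     J = sprandn(2000, 2000, 3e-3) + speye(2000);  % stand-in Jacobian
>     F = randn(2000, 1);                           % stand-in mismatch vector
>     best_umf = inf;  best_gp = inf;
>     for k = 1:10                                  % keep the best of 10 runs
>         t = tic;
>         dx1 = -(J \ F);                           % backslash -> UMFPACK
>         best_umf = min(best_umf, toc(t));
>         t = tic;
>         Q = sparse(amd(J), 1:size(J,1), 1);       % AMD reordering
>         [L, U, P] = lu(Q'*J*Q, 1.0);              % three-output -> G-P
>         dx2 = -Q * (U \ (L \ (P * Q' * F)));
>         best_gp = min(best_gp, toc(t));
>     end
>     fprintf('UMFPACK %.3e s   G-P %.3e s   speedup %.2f\n', ...
>         best_umf, best_gp, best_umf / best_gp);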
>
>
> *Longer explanation:*
> The Jacobian matrices that one typically obtains in power networks are *very,
> very* sparse.  They are very different from the sparse matrices obtained in
> other areas (such as finite-element modeling), where the matrix still has
> fairly dense blocks here and there (or at least it can be re-arranged to
> have them).  Power system matrices, as well as many electronic circuit
> matrices, have dense blocks of sizes 3x3 or 4x4 at most, which is rather
> small.  Now, the vast majority of research going into sparse LU solvers
> focuses on the former kind of matrices, i.e. those with large dense blocks.
> The methods go by names like "supernodal" and "multifrontal", and basically
> consist of designing the algorithm so that it can exploit the BLAS for
> processing those dense blocks.  The point is that power system matrices are
> ill-suited to exploiting the power of the BLAS, and therefore those methods
> are actually slower than the more straightforward sparse solvers that don't
> try to use the BLAS at all.
>
> MATLAB contains sparse solvers of both types, but the problem is that it is
> not clearly documented which algorithm the backslash operator (or the lu
> call) uses internally.  It turns out that the way you call lu() determines
> the algorithm being used!  The default is UMFPACK, but if you use the
> "three-output, two-input args" call as above, then you'll be invoking a
> Gilbert-Peierls method.  As you may have guessed by now, UMFPACK is one of
> those methods that use the BLAS, while G-P is not.  As the benchmarks above
> show, UMFPACK clearly has an overhead for these matrices.
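>
> One way to peek at what MATLAB decides internally is its sparse monitor
> flag (a sketch; whether the lu() call itself reports anything this way is,
> as far as I know, also undocumented):
>
>     spparms('spumoni', 2);  % verbose diagnostics on sparse algorithm choices
>     x = J \ F;              % should report UMFPACK's analysis/factorization
>     spparms('spumoni', 0);  % back to quiet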
>
> Coming back to the issue of GPUs, one can see why the potential
> performance gains for this type of sparse linear algebra are not as great
> as they seem.
>
> If you are interested in more details, I can provide references.  See, for
> instance, SuiteSparse by Tim Davis, which in terms of sparse solvers is
> probably the best in class right now (some of it is already integrated into
> MATLAB, btw).
>
> --
> Jose L. Marin
> Grupo AIA
>
>
