Re: RunPF vs GPU calculations

2016-12-01 Thread Jose Luis Marín
Ooops, sorry, they're *seconds* (for one runpf() run).

Here's the benchmarking script; it's pretty simple:

case_list = {'case300', 'case1354pegase', 'case1888rte', 'case1951rte', ...
    'case2383wp', 'case2736sp', 'case2737sop', 'case2746wop', ...
    'case2746wp', 'case2848rte', 'case2868rte', 'case2869pegase', ...
    'case3012wp', 'case3120sp', 'case3375wp', ...
    'case6468rte', 'case6470rte', 'case6495rte', 'case6515rte', ...
    'case9241pegase', 'case13659pegase'};

opts = mpoption('pf.alg', 'NR', 'pf.tol', 1.0e-10, 'pf.nr.max_it', 100, ...
    'out.all', 0, 'verbose', 0);
NREPEAT = 10;
timings = zeros(NREPEAT, 1);

for k = 1:length(case_list)
    cfile = char(case_list(k));
    fprintf('*** BENCHMARKING %s:\t', cfile);
    mpc = loadcase(cfile);
    for i = 1:NREPEAT
        tic;
        mpsol = runpf(mpc, opts);
        timings(i) = toc;
    end
    fprintf('%11.3e\n', min(timings));   % report the best of NREPEAT runs
end
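
In case anyone wants to reproduce the UMFPACK-vs-G-P comparison with this
script: a minimal sketch of one way to drive it, assuming you keep the stock
and the patched newtonpf.m in two separate directories (the directory names
below are just placeholders, not part of MATPOWER):

variants = {'newtonpf_umfpack', 'newtonpf_gp'};   % placeholder directories
best = zeros(length(case_list), length(variants));
for v = 1:length(variants)
    addpath(variants{v});                  % shadow the stock newtonpf.m
    for k = 1:length(case_list)
        mpc = loadcase(char(case_list(k)));
        for i = 1:NREPEAT
            tic;
            runpf(mpc, opts);
            timings(i) = toc;
        end
        best(k, v) = min(timings);
    end
    rmpath(variants{v});
end
speedup = best(:, 1) ./ best(:, 2);        % UMFPACK time over G-P time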


-- 
Jose L. Marin
Grupo AIA



2016-12-01 19:01 GMT+01:00 davor sutic :

> Hi Jose,
> thanks for sharing. I have one quick question: what are the units of
> your benchmark (seconds, ticks, etc.)?
>
> Thanks a lot

Re: RunPF vs GPU calculations

2016-12-01 Thread Jose Luis Marín
Hello all,

I've been meaning to report the following findings to Ray and the MATPOWER
team for some time, but never got around to doing it.  Well, here they are
in summarized form:

   - There's a very cheap way to speed up the J \ F step (actually, the
   sparse LU decomposition) by factors of about 1.3x to 1.8x, and maybe
   higher for bigger cases.  It just involves invoking the MATLAB solver
   that is best suited for power system matrices (more below).
   - GPU computing, at least in its current state, is not a panacea for the
   powerflow problem, because the kind of sparse linear algebra needed in
   powerflow problems just doesn't offer enough opportunities to exploit
   GPU architectures.


Try this simple change in *newtonpf.m*:

marinjl@xwing:~/Library/matpower6.0b2$ diff newtonpf.m.ORIG newtonpf.m
89c89,93
< dx = -(J \ F);
---
> %dx = -(J \ F);
> % JLM: For performance, factorize using MATLAB's Gilbert-Peierls + AMD (see T. Davis)
> Q = sparse(amd(J), 1:size(J,1), 1);  % AMD reordering
> [L, U, P] = lu(Q'*J*Q, 1.0);  % this form ensures that G-P is called. Last arg: 0==no pivoting, 1.0==partial pivoting.
> dx = - Q * ( U \ ( L \ (P * Q' * F)));
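
If you'd rather not patch newtonpf.m in place, the same factorization can be
wrapped in a small standalone helper. A sketch (gp_solve is just a made-up
name; it assumes J is square, sparse, and non-singular):

function x = gp_solve(J, F)
%GP_SOLVE  Solve J*x = F via AMD reordering + Gilbert-Peierls sparse LU.
%   Same factorization as the newtonpf.m patch above.
n = size(J, 1);
p = amd(J);                        % fill-reducing (approx. minimum degree) ordering
Q = sparse(p, 1:n, 1);             % permutation matrix built from the ordering
[L, U, P] = lu(Q' * J * Q, 1.0);   % 3-output form invokes G-P; 1.0 = partial pivoting
x = Q * (U \ (L \ (P * (Q' * F))));
end

The Newton step in newtonpf.m then becomes dx = -gp_solve(J, F);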


My benchmarks (3 GHz Intel Core 2 Duo) yield the following speedups:

%                     UMFPACK      G-P        speedup
% ====================================================
% case300:            1.689e-02    1.230e-02  1.37
% case1354pegase:     5.848e-02    3.652e-02  1.60
% case1888rte:        2.792e-02    2.130e-02  1.31
% case1951rte:        3.037e-02    2.169e-02  1.40
% case2383wp:         1.429e-01    8.711e-02  1.64
% case2736sp:         1.327e-01    7.512e-02  1.76
% case2737sop:        1.632e-01    9.074e-02  1.79
% case2746wop:        1.589e-01    8.725e-02  1.82
% case2746wp:         1.448e-01    8.636e-02  1.67
% case2848rte:        4.652e-02    3.301e-02  1.40
% case2868rte:        1.681e-01    9.661e-02  1.74
% case2869pegase:     2.402e-01    1.335e-01  1.79
% case3012wp:         1.125e-01    6.733e-02  1.67
% case3120sp:         2.155e-01    1.228e-01  1.75
% case3375wp:         1.598e-01    9.565e-02  1.67
% case6468rte:        2.155e-01    1.443e-01  1.49
% case6470rte:        2.316e-01    1.434e-01  1.61
% case6495rte:        2.118e-01    1.433e-01  1.47
% case6515rte:        2.244e-01    1.443e-01  1.55
% case9241pegase:     1.185e+00    6.392e-01  1.85
% case13659pegase:    1.484e+00    7.991e-01  1.85
%
% max speedup: 1.85
% min speedup: 1.31

(note: I used pf.tol=1.0e-10 in order to trigger a few more iterations. I
ran each case 10 times, and kept the best timing)


*Longer explanation:*
The Jacobian matrices that one typically obtains in power networks are
*very, very* sparse.  They are also very different from the sparse matrices
obtained in other areas (e.g., finite-element modeling), where the matrix
still has fairly dense blocks here and there (or can at least be rearranged
to have them).  Power system matrices, like many electronic circuit
matrices, have dense blocks of sizes 3x3 or 4x4 at most, which is rather
small.  Now, the vast majority of research going into sparse LU solvers
focuses on the former kind of matrices, i.e. those with large dense blocks.
The methods go by names like "supernodal" and "multifrontal", and basically
consist of designing the algorithm so that it can exploit the BLAS for
processing those dense blocks.  The point is that power system matrices are
ill-suited to exploiting the power of the BLAS, and therefore those methods
are actually slower than the more straightforward sparse solvers that don't
try to use the BLAS at all.
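
To see this concretely, here's a quick check anyone can run (a sketch; it
inspects the bus admittance matrix Ybus rather than the Jacobian itself, but
the nonzero structure is essentially the same):

mpc = ext2int(loadcase('case9241pegase'));   % internal (consecutive) bus numbering
Ybus = makeYbus(mpc.baseMVA, mpc.bus, mpc.branch);
n = size(Ybus, 1);
fprintf('n = %d, nnz = %d, density = %.4f%%, nonzeros per row = %.1f\n', ...
    n, nnz(Ybus), 100 * nnz(Ybus) / n^2, nnz(Ybus) / n);
spy(Ybus);                                   % note: no large dense blocks anywhere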

MATLAB contains sparse solvers of both types, but the problem is that it's
not clearly documented how the backslash operator (or the lu call)
internally chooses one algorithm or the other.  It turns out that the way
you call lu() determines which algorithm gets used!  The default is
UMFPACK, but if you use the "three-output, two-input args" call as above,
you'll be invoking a Gilbert-Peierls method.  As you may have guessed by
now, UMFPACK is one of the methods that use the BLAS, while G-P is not.  As
shown in the benchmarks above, UMFPACK clearly carries an overhead.
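
If you want to check this on your own machine, a minimal timing sketch (it
assumes a sparse power-flow Jacobian J is already in the workspace, e.g.
captured with a breakpoint inside newtonpf.m):

n = size(J, 1);
tic; [L1, U1, P1, Q1] = lu(J);       t_umf = toc;   % 4-output sparse lu: UMFPACK
p = amd(J);                                         % fill-reducing ordering
Q = sparse(p, 1:n, 1);                              % permutation matrix
tic; [L2, U2, P2] = lu(Q'*J*Q, 1.0); t_gp = toc;    % 3-output + threshold: G-P
fprintf('UMFPACK: %.3e s   G-P: %.3e s   speedup: %.2f\n', t_umf, t_gp, t_umf/t_gp);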

Coming back to the issue of GPUs, one can see why the potential performance
gains for this type of sparse linear algebra are not as great as they seem.
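
For anyone who wants to see this for themselves, a crude experiment (a
sketch; it needs the Parallel Computing Toolbox and a CUDA-capable GPU, and
it densifies the Jacobian, so try it only on small-to-medium cases):

% J, F: sparse Jacobian and mismatch vector, captured inside newtonpf.m
tic; x_cpu = J \ F; t_cpu = toc;                     % sparse solve on the CPU
Jg = gpuArray(full(J));                              % densify: all sparsity is discarded
Fg = gpuArray(full(F));
tic; x_gpu = Jg \ Fg; wait(gpuDevice); t_gpu = toc;  % dense solve on the GPU
fprintf('CPU sparse: %.3e s   GPU dense: %.3e s\n', t_cpu, t_gpu);
fprintf('max abs diff: %.3e\n', max(abs(gather(x_gpu) - x_cpu)));

The exact numbers will vary, but the structural point stands: to feed the
GPU you either densify (throwing away the very sparsity that makes the CPU
solve cheap) or use a sparse GPU solver, which finds little parallelism in
matrices this sparse.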

If you are interested in more details, I can provide references.  See, for
instance, SuiteSparse by Tim Davis, which in terms of sparse solvers is
probably the best in class right now (some of it is already integrated into
MATLAB, btw).

-- 
Jose L. Marin
Grupo AIA





Re: RunPF vs GPU calculations

2016-11-30 Thread davor sutic
Hello Jaroslaw,
I find your analysis very interesting. Would you mind sharing your
performance findings for a broader range of network sizes (e.g. 118, 300,
perhaps even a few thousand buses)?

Also, are the times you stated the mean time per iteration or the total
running times of the 1000 iterations?

Thanks a lot
Davor



RE: RunPF vs GPU calculations

2016-11-30 Thread MAEGHT Jean
Dear Jaroslaw
Please consider that small systems are only toys for early testing of
methods.
If you wish to perform a realistic calculation-time analysis, use the
larger systems available in MATPOWER (>1000 buses).
Best regards

From: bounce-121041364-75398...@list.cornell.edu
[mailto:bounce-121041364-75398...@list.cornell.edu] On behalf of Jaroslaw Krata
Sent: Wednesday, 30 November 2016 06:26
To: MATPOWER-L@cornell.edu
Subject: RunPF vs GPU calculations




RunPF vs GPU calculations

2016-11-29 Thread Jaroslaw Krata
Hello Matpower Team,


I have a quick question related to runpf calculation time.

Some time ago I started using MATPOWER to calculate AC power flows.

Since I need thousands of power flows, I want to run them efficiently.


I ran the MATLAB Profiler and found that a lot of the computation time is
taken by simple linear algebra.

For example, the calculation times for 1000 iterations of runpf on a 49-bus
system are:

Function    Calls   Total time   Self time
runpf       1000    1.692 s      0.183 s
newtonpf    1000    0.684 s      0.552 s


If you take a closer look at newtonpf, 50% of its time is taken by the
"dx = -(J \ F);" operation!

Similarly, dSbus_dV spends more than 60% of its time in just two lines:
"dSbus_dVm = diagV * conj(Ybus * diagVnorm) + conj(diagIbus) * diagVnorm;
dSbus_dVa = 1j * diagV * conj(diagIbus - Ybus * diagV);"

These three lines take 0.420 s out of 1.692 s in total.
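
In case it helps, this is roughly how such numbers can be collected (a
sketch; 'case49' is a placeholder for my own 49-bus case file, which is not
one of the stock MATPOWER cases):

mpc = loadcase('case49');               % placeholder for the 49-bus system
opts = mpoption('out.all', 0, 'verbose', 0);
profile on
for k = 1:1000
    runpf(mpc, opts);
end
profile off
profile viewer                          % drill into runpf -> newtonpf -> dSbus_dV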


Have you considered moving this algebra to a CUDA GPU?

Are there any known constraints that would prevent using CUDA?


https://developer.nvidia.com/matlab-cuda

https://www.youtube.com/watch?v=PwCAWxPcDkw



BR,

Jaroslaw