Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-15 Thread Niclas Götting
Based on your suggestion, I tried using vode for the real-valued problem 
with scipy and got roughly the same speed as before, which could have 
three reasons:


1. (Z)VODE is slower than plain RK (although I must admit that I'm not
   quite sure what (Z)VODE does precisely).
2. Sparse matrix operations in scipy are slow; some of them are even
   written in pure Python.
3. The RHS function in scipy must *return* a vector and therefore
   allocates new memory on every call (see the sketch after this list).
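
For illustration, a minimal scipy-side sketch addressing points 2 and 3
(same names l and pump as in the original script; it avoids rebuilding the
sparse matrix l + scale*pump on every call, although the returned vector is
still freshly allocated, since the scipy ode interface expects a returned
array):

def f(t, y, *args):
    out = l @ y                     # one sparse mat-vec, no sparse matrix addition
    scale = 0.5 * (5 < t < 10)
    if scale != 0:
        out += scale * (pump @ y)   # second mat-vec only while the pulse is on
    return out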

Parallelizing the code is of course a goal of mine, but I believe this 
will only become relevant for larger systems, which I want to 
investigate in the near future.


Regarding the RHS Jacobian, I see why providing an RHSFunction vs. an 
RHSJacobian should be computationally equivalent, but I found the 
RHSFunction much easier to optimize in this case, and I'm not quite sure 
why the documentation so strictly recommends providing only a Jacobian 
and no RHS function when the two should be equivalent.


Lastly, I'm aware that another performance boost awaits once the debugging 
functionality is turned off, but for this simple test I just wanted to see 
whether there is *any* improvement in performance, and I was very surprised 
by the factor of 7 with debugging still turned on.


Thank you all for the interesting input and have a nice day!
Niclas

On 15.08.23 00:37, Zhang, Hong wrote:
PETSc is not necessarily faster than scipy for your problem when executed 
in serial, but you get benefits when running in parallel. The PETSc code 
you wrote uses float64 while your scipy code uses complex128, so the 
comparison may not be fair.


In addition, using the RHS Jacobian does not necessarily make your PETSc 
code slower. In your case, the bottleneck is the matrix operations. For 
best performance, you should avoid adding two sparse matrices (especially 
with different sparsity patterns), which is very costly. So one MatMult 
plus one MatMultAdd is the best option. MatAXPY with the same nonzero 
pattern would be a bit slower, but still faster than MatAXPY with a subset 
nonzero pattern, which you used in the Jacobian function.
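
For illustration, a minimal petsc4py sketch of the MatMult + MatMultAdd
pattern described above (pump_scaled is an assumed work matrix holding
scale*pump; it would have to be rescaled in place, e.g. with
pump_scaled.scale(new/old), whenever the scale actually changes, which is
rare for a square pulse):

pump_scaled = pump.copy()
pump_scaled.scale(0.5)                 # current pulse amplitude

def rhsfunc_multadd(ts, t, u, F):
    l.mult(u, F)                       # F = l*u              (one MatMult)
    if 5 < t < 10:
        pump_scaled.multAdd(u, F, F)   # F = F + 0.5*pump*u   (one MatMultAdd)

The two-MatMult variant with a work vector discussed later in this thread
(rhsfunc5) is a simpler alternative when the scale varies continuously.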


I echo Barry’s suggestion that debugging should be turned off before 
you do any performance study.


Hong (Mr.)


Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-14 Thread Zhang, Hong via petsc-users
PETSc is not necessarily faster than scipy for your problem when executed 
in serial, but you get benefits when running in parallel. The PETSc code 
you wrote uses float64 while your scipy code uses complex128, so the 
comparison may not be fair.

In addition, using the RHS Jacobian does not necessarily make your PETSc 
code slower. In your case, the bottleneck is the matrix operations. For 
best performance, you should avoid adding two sparse matrices (especially 
with different sparsity patterns), which is very costly. So one MatMult 
plus one MatMultAdd is the best option. MatAXPY with the same nonzero 
pattern would be a bit slower, but still faster than MatAXPY with a subset 
nonzero pattern, which you used in the Jacobian function.

I echo Barry’s suggestion that debugging should be turned off before you do any 
performance study.

Hong (Mr.)


Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Niclas Götting
Alright. Again, thank you very much for taking the time to answer my 
beginner questions! Still a lot to learn.


Have a good day!

On 10.08.23 12:27, Stefano Zampini wrote:
Then just do the multiplications you need. My proposal was for the 
example function you were showing.



Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Stefano Zampini
Then just do the multiplications you need. My proposal was for the example
function you were showing.


Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Niclas Götting
You are absolutely right for this specific case (I get about 2400it/s 
instead of 2100it/s). However, the single square function will be replaced 
by a series of Gaussian pulses in the future, which will never be zero. 
Maybe one could do an approximation and skip the second mult if the 
Gaussians are close to zero.
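
For illustration, a sketch of that idea on top of the work-vector RHS
discussed in this thread (rhsfunc5); gaussian_envelope and the cutoff EPS
are assumed placeholders:

EPS = 1e-12   # assumed cutoff below which the pulse is treated as zero

def rhsfunc_gauss(ts, t, u, F):
    l.mult(u, F)
    scale = gaussian_envelope(t)   # placeholder for the future Gaussian pulses
    if abs(scale) > EPS:           # skip the second mult while the pulse is negligible
        pump.mult(u, tmp_vec)
        F.axpy(scale, tmp_vec)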


On 10.08.23 12:16, Stefano Zampini wrote:

If you do the mult of "pump" inside an if, it should be faster.


Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Stefano Zampini
If you do the mult of "pump" inside an if, it should be faster.


Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Niclas Götting

If I understood you right, this should be the resulting RHS:

def rhsfunc5(ts, t, u, F):
    l.mult(u, F)
    pump.mult(u, tmp_vec)
    scale = 0.5 * (5 < t < 10)
    F.axpy(scale, tmp_vec)

It is a little bit slower than option 3, but at about 2100it/s it is 
consistently ~10% faster than option 4.


Thank you very much for the suggestion!

On 10.08.23 11:47, Stefano Zampini wrote:
I would use option 3. Keep a work vector and do a vector summation 
instead of the repeated multiplication by scale and 1/scale.


I agree with you the docs are a little misleading here.


Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Stefano Zampini
I would use option 3. Keep a work vector and do a vector summation 
instead of the repeated multiplication by scale and 1/scale.

I agree with you the docs are a little misleading here.


Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Niclas Götting

Thank you both for the very quick answer!

So far, I compiled PETSc with debugging turned on, but I think it should 
still be faster than standard scipy in both cases. Actually, Stefano's 
answer has got me very far already; now I only define the RHS of the ODE 
and no Jacobian (I wonder why the documentation suggests otherwise, 
though). I had the following four tries at implementing the RHS:


1. def rhsfunc1(ts, t, u, F):
       scale = 0.5 * (5 < t < 10)
       (l + scale * pump).mult(u, F)     # builds a new sparse matrix on every call
2. def rhsfunc2(ts, t, u, F):
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       (scale * pump).multAdd(u, F, F)   # still creates a scaled copy of pump each call
3. def rhsfunc3(ts, t, u, F):
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       if scale != 0:
           pump.scale(scale)
           pump.multAdd(u, F, F)
           pump.scale(1/scale)           # undo the in-place scaling
4. def rhsfunc4(ts, t, u, F):
       tmp_pump.zeroEntries()            # tmp_pump is pump.duplicate()
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       tmp_pump.axpy(scale, pump,
                     structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
       tmp_pump.multAdd(u, F, F)

They all yield the same results, but at 50it/s, 800it/s, 2300it/s and 
1900it/s, respectively, which is a huge performance boost (almost 7 
times as fast as scipy, with PETSc debugging still turned on). As the 
scale function will most likely be a Gaussian in the future, I think 
that option 3 will become numerically unstable and I'll have to go 
with option 4, which is already faster than I expected. If you think it 
is possible to speed up the RHS calculation even more, I'd be happy to 
hear your suggestions; the -log_view output is attached to this message.


One last point: If I didn't misunderstand the documentation at 
https://petsc.org/release/manual/ts/#special-cases, should this maybe be 
changed?


Best regards
Niclas

On 09.08.23 17:51, Stefano Zampini wrote:
TSRK is an explicit solver. Unless you are changing the ts type from the 
command line, the explicit Jacobian should not be needed. On top of 
Barry's suggestion, I would suggest writing the explicit RHS instead of 
assembling a throw-away matrix every time that function needs to be 
sampled.



Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-09 Thread Stefano Zampini
TSRK is an explicit solver. Unless you are changing the ts type from the 
command line, the explicit Jacobian should not be needed. On top of 
Barry's suggestion, I would suggest writing the explicit RHS instead of 
assembling a throw-away matrix every time that function needs to be 
sampled.
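
For petsc4py, a minimal sketch of this suggestion (explicit RHS callback,
no Jacobian; l and pump are the matrices from the original post, and
tmp_vec is an assumed work vector):

tmp_vec = l.createVecRight()   # work vector (assumed)

def rhsfunc(ts, t, u, F):
    # F = l*u + f(t)*pump*u using mat-vec products only, no temporary matrix
    l.mult(u, F)
    pump.mult(u, tmp_vec)
    F.axpy(0.5 * (5 < t < 10), tmp_vec)

ts.setRHSFunction(rhsfunc)     # no RHS Jacobian needed for an explicit TSRK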



Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-09 Thread Barry Smith


  Was PETSc built with debugging turned off, i.e. with ./configure --with-debugging=0?

  Can you run with the equivalent of -log_view to get information about the 
time spent in the various operations and send that information? The data 
generated is the best starting point for determining where the code spends 
its time.

   Thanks

   Barry
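
One minimal way to pass the option from the existing script (a sketch; 
-log_view can equally be given on the command line, and --with-debugging=0 
is a configure-time choice):

import sys
import petsc4py
petsc4py.init(args=sys.argv + ["-log_view"])   # must run before "from petsc4py import PETSc"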



[petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-09 Thread Niclas Götting

Hi all,

I'm currently trying to convert a quantum simulation from scipy to 
PETSc. The problem itself is extremely simple and of the form \dot{u}(t) 
= (A_const + f(t)*B_const)*u(t), where f(t) in this simple test case is 
a square function. The matrices A_const and B_const are extremely sparse, 
and therefore I thought the problem would be well suited for PETSc. 
Currently, I solve the ODE with the following procedure in scipy (I can 
provide the necessary data files if needed, but they are just some 
trace-preserving, very sparse matrices):


import numpy as np
import scipy.sparse
import scipy.integrate

from tqdm import tqdm


l = np.load("../liouvillian.npy")
pump = np.load("../pump_operator.npy")
state = np.load("../initial_state.npy")

l = scipy.sparse.csr_array(l)
pump = scipy.sparse.csr_array(pump)

def f(t, y, *args):
    return (l + 0.5 * (5 < t < 10) * pump) @ y
    #return l @ y # Uncomment for f(t) = 0

dt = 0.1
NUM_STEPS = 200
res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
solver = scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)

times = []
for i in tqdm(range(NUM_STEPS)):
    res[i, :] = solver.integrate(solver.t + dt)
    times.append(solver.t)

Here, A_const = l, B_const = pump and f(t) = 0.5 * (5 < t < 10). tqdm 
reports about 330it/s on my machine. When converting the code to PETSc, I 
came to the following result (following the chapter 
https://petsc.org/main/manual/ts/#special-cases):


import sys
import petsc4py
petsc4py.init(args=sys.argv)
import numpy as np
import scipy.sparse

from tqdm import tqdm
from petsc4py import PETSc

comm = PETSc.COMM_WORLD


def mat_to_real(arr):
    return np.block([[arr.real, -arr.imag],
                     [arr.imag, arr.real]]).astype(np.float64)


def mat_to_petsc_aij(arr):
    arr_sc_sp = scipy.sparse.csr_array(arr)
    mat = PETSc.Mat().createAIJ(arr.shape[0], comm=comm)
    rstart, rend = mat.getOwnershipRange()
    print(rstart, rend)
    print(arr.shape[0])
    print(mat.sizes)
    I = arr_sc_sp.indptr[rstart : rend + 1] - arr_sc_sp.indptr[rstart]
    J = arr_sc_sp.indices[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
    V = arr_sc_sp.data[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]

    print(I.shape, J.shape, V.shape)
    mat.setValuesCSR(I, J, V)
    mat.assemble()
    return mat


l = np.load("../liouvillian.npy")
l = mat_to_real(l)
pump = np.load("../pump_operator.npy")
pump = mat_to_real(pump)
state = np.load("../initial_state.npy")
state = np.hstack([state.real, state.imag]).astype(np.float64)

l = mat_to_petsc_aij(l)
pump = mat_to_petsc_aij(pump)


jac = l.duplicate()
for i in range(8192):
    jac.setValue(i, i, 0)
jac.assemble()
jac += l

vec = l.createVecRight()
vec.setValues(np.arange(state.shape[0], dtype=np.int32), state)
vec.assemble()


dt = 0.1

ts = PETSc.TS().create(comm=comm)
ts.setFromOptions()
ts.setProblemType(ts.ProblemType.LINEAR)
ts.setEquationType(ts.EquationType.ODE_EXPLICIT)
ts.setType(ts.Type.RK)
ts.setRKType(ts.RKType.RK3BS)
ts.setTime(0)
print("KSP:", ts.getKSP().getType())
print("KSP PC:",ts.getKSP().getPC().getType())
print("SNES :", ts.getSNES().getType())

def jacobian(ts, t, u, Amat, Pmat):
    # Rebuilds Amat = l + 0.5*(5 < t < 10)*pump with two MatAXPY-type calls
    # (SUBSET_NONZERO_PATTERN) on every Jacobian evaluation
    Amat.zeroEntries()
    Amat.aypx(1, l, structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)
    Amat.axpy(0.5 * (5 < t < 10), pump,
              structure=PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN)


ts.setRHSFunction(PETSc.TS.computeRHSFunctionLinear)
#ts.setRHSJacobian(PETSc.TS.computeRHSJacobianConstant, l, l)  # Uncomment for f(t) = 0

ts.setRHSJacobian(jacobian, jac)

NUM_STEPS = 200
res = np.empty((NUM_STEPS, 8192), dtype=np.float64)
times = []
rstart, rend = vec.getOwnershipRange()
for i in tqdm(range(NUM_STEPS)):
    time = ts.getTime()
    ts.setMaxTime(time + dt)
    ts.solve(vec)
    res[i, rstart:rend] = vec.getArray()[:]
    times.append(time)

I decomposed the complex ODE into a larger real ODE so that I can easily 
switch to GPU computation later on. Now, the solutions of both scripts are 
essentially identical, but PETSc runs about 3 times slower, at 120it/s, on 
my machine. I don't use MPI for PETSc yet.
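
For reference, the real decomposition used by mat_to_real above: writing 
u = x + i*y and A_const + f(t)*B_const = R + i*S with real R and S, the 
complex ODE is equivalent to the real system

\frac{d}{dt}\begin{pmatrix} x \\ y \end{pmatrix}
  = \begin{pmatrix} R & -S \\ S & R \end{pmatrix}
    \begin{pmatrix} x \\ y \end{pmatrix},

which matches np.block([[R, -S], [S, R]]) and np.hstack([x, y]) in the 
script above.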


I strongly suspect that the problem lies in the Jacobian definition, as 
PETSc is about 3 times *faster* than scipy with f(t) = 0 and therefore a 
constant Jacobian.


Thank you in advance.

All the best,
Niclas