Re: [petsc-users] performance regression with GAMG

2023-08-10 Thread Pierre Jolivet


> On 11 Aug 2023, at 1:14 AM, Mark Adams  wrote:
> 
> BTW, nice bug report ...
>> 
>> So in the first step it coarsens from 150e6 to 5.4e6 DOFs instead of to 
>> 2.6e6 DOFs.
> 
> Yes, this is the critical place to see what is different and going wrong.
> 
> My 3D tests were not that different, and I see you lowered the threshold.
> Note that you can set the threshold to zero, but your test is behaving so
> differently from mine that something else must be going on.
> Note that the new, "bad" coarsening rate of 30:1 is what we tend to shoot for
> in 3D.
> 
> So it is not clear what the problem is.  Some questions:
> 
> * do you have a picture of this mesh to show me?
> * what do you mean by Q1-Q2 elements?
> 
> It would be nice to see if the new and old codes are similar without 
> aggressive coarsening.
> This was the intended effect of the major change in this time frame, as you 
> noticed.
> If these jobs are easy to run, could you check that the old and new versions 
> are similar with "-pc_gamg_square_graph 0" (you only need one time step)?
> All you need to do is check that the first coarse grid has about the same 
> (large) number of equations.
> 
> BTW, I am starting to think I should add the old method back as an option. I 
> did not think this change would cause large differences.

Not the OP, but that would be extremely valuable, IMHO.
This is impacting codes left, right, and center (see, e.g., another research 
group left wondering https://github.com/feelpp/feelpp/issues/2138).

Mini-rant: as developers, we are being asked to maintain backward compatibility 
of the API/headers, but there is no such enforcement for the numerics.
A breakage in the API is “easy” to fix: you get a compilation error, and you 
either fix your code or stick with an older version of PETSc.
Changes in the numerics trigger silent errors that are much more delicate to 
fix, because users do not know whether something needs to be addressed in their 
code or whether something changed in PETSc.
I don’t see the point of enforcing one kind of backward compatibility but not 
the other.

Thanks,
Pierre

> Thanks,
> Mark
> 
> 
>  
>> Note that we are providing the rigid body near nullspace, 
>> hence the bs=3 to bs=6.
>> We have tried different values for the gamg_threshold but it doesn't 
>> really seem to significantly alter the coarsening amount in that first step.
>> 
>> Do you have any suggestions for further things we should try/look at? 
>> Any feedback would be much appreciated
>> 
>> Best wishes
>> Stephan Kramer
>> 
>> Full logs including log_view timings available from 
>> https://github.com/stephankramer/petsc-scaling/
>> 
>> In particular:
>> 
>> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_5/output_2.dat
>> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_5/output_2.dat
>> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_6/output_2.dat
>> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_6/output_2.dat
>> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_7/output_2.dat
>> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_7/output_2.dat
>>  
>> 



Re: [petsc-users] performance regression with GAMG

2023-08-10 Thread Mark Adams
BTW, nice bug report ...

>
> So in the first step it coarsens from 150e6 to 5.4e6 DOFs instead of to
> 2.6e6 DOFs.


Yes, this is the critical place to see what is different and going wrong.

My 3D tests were not that different, and I see you lowered the threshold.
Note that you can set the threshold to zero, but your test is behaving so
differently from mine that something else must be going on.
Note that the new, "bad" coarsening rate of 30:1 is what we tend to shoot for
in 3D.
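
For reference, a quick check of the first-level coarsening ratios implied by
the row counts quoted elsewhere in this thread:

fine_rows = 152_175_366      # finest level (bs=3)
old_coarse = 2_655_378       # first coarse level, petsc 3.16.4 (bs=6)
new_coarse = 5_419_362       # first coarse level, recent master (bs=6)
print(fine_rows / old_coarse)   # ~57:1
print(fine_rows / new_coarse)   # ~28:1, i.e. roughly the 30:1 mentioned above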

So it is not clear what the problem is.  Some questions:

* do you have a picture of this mesh to show me?
* what do you mean by Q1-Q2 elements?

It would be nice to see if the new and old codes are similar without
aggressive coarsening.
This was the intended effect of the major change in this time frame, as you
noticed.
If these jobs are easy to run, could you check that the old and new versions
are similar with "-pc_gamg_square_graph 0" (you only need one time step)?
All you need to do is check that the first coarse grid has about the same
(large) number of equations.
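
In case it is useful, here is a minimal petsc4py sketch of that check (an
assembled matrix A and your usual solver setup are assumed; the option name is
the one above):

from petsc4py import PETSc

opts = PETSc.Options()
opts["pc_gamg_square_graph"] = 0        # same as -pc_gamg_square_graph 0

ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
ksp.setOperators(A)                     # A: assumed assembled matrix
ksp.getPC().setType(PETSc.PC.Type.GAMG)
ksp.setFromOptions()
ksp.setUp()
ksp.view()   # the PC view lists rows per level; compare the first coarse grid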

BTW, I am starting to think I should add the old method back as an option.
I did not think this change would cause large differences.

Thanks,
Mark




> Note that we are providing the rigid body near nullspace,
> hence the bs=3 to bs=6.
> We have tried different values for the gamg_threshold but it doesn't
> really seem to significantly alter the coarsening amount in that first
> step.
>
> Do you have any suggestions for further things we should try/look at?
> Any feedback would be much appreciated
>
> Best wishes
> Stephan Kramer
>
> Full logs including log_view timings available from
> https://github.com/stephankramer/petsc-scaling/
>
> In particular:
>
>
> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_5/output_2.dat
>
> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_5/output_2.dat
>
> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_6/output_2.dat
>
> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_6/output_2.dat
>
> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_7/output_2.dat
>
> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_7/output_2.dat
>
>


[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-10 Thread Vanella, Marcos (Fed) via petsc-users
Hi, I'm trying to run a parallel matrix-vector build and linear solve with 
PETSc on 2 MPI processes + one V100 GPU. I have verified that the matrix build 
and solve succeed when running on CPUs only. I'm using CUDA 11.5, CUDA-enabled 
OpenMPI, and GCC 9.3. When I run the job with the GPU enabled I get the 
following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an 
illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an 
illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

I'm new to submitting jobs in slurm that also use GPU resources, so I might be 
doing something wrong in my submission script. This is it:

#!/bin/bash
#SBATCH -J test
#SBATCH -e /home/Issues/PETSc/test.err
#SBATCH -o /home/Issues/PETSc/test.log
#SBATCH --partition=batch
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1

export OMP_NUM_THREADS=1
module load cuda/11.5
module load openmpi/4.1.1

cd /home/Issues/PETSc
mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds 
-vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg

If anyone has any suggestions on how to troubleshoot this, please let me know.
Thanks!
Marcos





Re: [petsc-users] performance regression with GAMG

2023-08-10 Thread Mark Adams
Hi Stephan,

Yes, MIS(A^T A) -> MIS(MIS(A)) change?

Yep, that is it.

This change was required because forming A^T A is super expensive. The change
did not do much to my tests, but this is complex.
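
For a rough sense of why, a small scipy sketch (a 1D Laplacian as a stand-in;
real 3D FE graphs fill in much faster):

import scipy.sparse as sp

n = 100_000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
AtA = (A.T @ A).tocsr()
print(A.nnz, AtA.nnz)   # the squared graph has a wider stencil, so more nonzeros per row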

I am traveling now, but I can get to this in a few days. You provided me
with a lot of data and I can take a look, but I think we need to look at the
parameters.

Thanks,
Mark

On Wed, Aug 9, 2023 at 10:08 AM Stephan Kramer 
wrote:

> Dear petsc devs
>
> We have noticed a performance regression using GAMG as the
> preconditioner to solve the velocity block in a Stokes equations saddle
> point system with variable viscosity solved on a 3D hexahedral mesh of a
> spherical shell using Q2-Q1 elements. This is comparing performance from
> the beginning of last year (petsc 3.16.4) and a more recent petsc master
> (from around May this year). This is the weak scaling analysis we
> published in https://doi.org/10.5194/gmd-15-5127-2022. Previously, the
> number of iterations for the velocity block (inner solve of the Schur
> complement) started at 40 iterations
> (https://gmd.copernicus.org/articles/15/5127/2022/gmd-15-5127-2022-f10-web.png)
> and only went up slowly for larger problems (and more cores). Now the number
> of iterations starts at 60
> (https://github.com/stephankramer/petsc-scaling/blob/main/after/SPD_Combined_Iterations.png),
> at the same tolerances, again going up slowly with increasing size, and the
> cost per iteration has also gone up (slightly), resulting in an increase in
> runtime of > 50%.
>
> The main change we can see is that the coarsening seems to have gotten a
> lot less aggressive at the first coarsening stage (finest ->
> one-but-finest) - presumably after the MIS(A^T A) -> MIS(MIS(A)) change?
> The performance issues might be similar to
> https://lists.mcs.anl.gov/mailman/htdig/petsc-users/2023-April/048366.html
> ?
>
> As an example at "Level 7" (6,389,890 vertices, run on 1536 cpus) on the
> older petsc version we had:
>
>rows=126, cols=126, bs=6
>total: nonzeros=15876, allocated nonzeros=15876
> --
>rows=3072, cols=3072, bs=6
>total: nonzeros=3344688, allocated nonzeros=3344688
> --
>rows=91152, cols=91152, bs=6
>total: nonzeros=109729584, allocated nonzeros=109729584
> --
>rows=2655378, cols=2655378, bs=6
>total: nonzeros=1468980252, allocated nonzeros=1468980252
> --
>rows=152175366, cols=152175366, bs=3
>total: nonzeros=29047661586, allocated nonzeros=29047661586
>
> Whereas with the newer version we get:
>
>rows=420, cols=420, bs=6
>total: nonzeros=176400, allocated nonzeros=176400
> --
>rows=6462, cols=6462, bs=6
>total: nonzeros=10891908, allocated nonzeros=10891908
> --
>rows=91716, cols=91716, bs=6
>total: nonzeros=81687384, allocated nonzeros=81687384
> --
>rows=5419362, cols=5419362, bs=6
>total: nonzeros=3668190588, allocated nonzeros=3668190588
> --
>rows=152175366, cols=152175366, bs=3
>total: nonzeros=29047661586, allocated nonzeros=29047661586
>
> So in the first step it coarsens from 150e6 to 5.4e6 DOFs instead of to
> 2.6e6 DOFs. Note that we are providing the rigid body near nullspace,
> hence the bs=3 to bs=6.
> We have tried different values for the gamg_threshold but it doesn't
> really seem to significantly alter the coarsening amount in that first
> step.
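
For context, the setup described above corresponds roughly to this petsc4py
sketch (the assembled velocity-block matrix A and a coordinate vector coords
are assumptions, and option names can differ between PETSc versions):

from petsc4py import PETSc

# rigid-body near nullspace from nodal coordinates (6 modes in 3D, hence bs=6)
nullsp = PETSc.NullSpace().createRigidBody(coords)
A.setNearNullSpace(nullsp)

# GAMG coarsening threshold, as tried above (the value here is just an example)
opts = PETSc.Options()
opts["pc_gamg_threshold"] = 0.01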
>
> Do you have any suggestions for further things we should try/look at?
> Any feedback would be much appreciated
>
> Best wishes
> Stephan Kramer
>
> Full logs including log_view timings available from
> https://github.com/stephankramer/petsc-scaling/
>
> In particular:
>
>
> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_5/output_2.dat
>
> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_5/output_2.dat
>
> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_6/output_2.dat
>
> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_6/output_2.dat
>
> https://github.com/stephankramer/petsc-scaling/blob/main/before/Level_7/output_2.dat
>
> https://github.com/stephankramer/petsc-scaling/blob/main/after/Level_7/output_2.dat
>
>


Re: [petsc-users] error related to 'valgrind' when using MatView

2023-08-10 Thread Matthew Knepley
On Thu, Aug 10, 2023 at 2:30 AM maitri ksh  wrote:

> I am unable to understand what went wrong with my code. I could load a
> matrix (a large sparse matrix) into PETSc, write it out, and read it back
> into MATLAB, but when I tried to use MatView to see the matrix info, it
> produced a 'corrupt argument, #valgrind' error. Can anyone please help?
>

You use

  viewer = PETSC_VIEWER_STDOUT_WORLD

but then you Destroy() that viewer. You should not since you did not create
it.
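
A minimal petsc4py version of the same pattern, for reference (an assembled
Mat A is assumed): push/pop the format on the shared stdout viewer, but never
destroy it.

from petsc4py import PETSc

viewer = PETSc.Viewer.STDOUT(PETSc.COMM_WORLD)
viewer.pushFormat(PETSc.Viewer.Format.ASCII_INFO)
A.view(viewer)                  # A: assumed assembled Mat
viewer.popFormat()
# no viewer.destroy() here: PETSc owns the stdout viewer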

  Thanks,

 Matt


> Maitri
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Niclas Götting
Alright. Again, thank you very much for taking the time to answer my 
beginner questions! Still a lot to learn.


Have a good day!

On 10.08.23 12:27, Stefano Zampini wrote:
Then just do the multiplications you need. My proposal was for the 
example function you were showing.


On Thu, Aug 10, 2023, 12:25 Niclas Götting 
 wrote:


You are absolutely right for this specific case (I get about
2400it/s instead of 2100it/s). However, the single square function
will be replaced by a series of gaussian pulses in the future,
which will never be zero. Maybe one could do an approximation and
skip the second mult, if the gaussians are close to zero.

On 10.08.23 12:16, Stefano Zampini wrote:

If you do the mult of "pump" inside an if it should be faster

On Thu, Aug 10, 2023, 12:12 Niclas Götting
 wrote:

If I understood you right, this should be the resulting RHS:

def rhsfunc5(ts, t, u, F):
    l.mult(u, F)
    pump.mult(u, tmp_vec)
    scale = 0.5 * (5 < t < 10)
    F.axpy(scale, tmp_vec)

It is a little bit slower than option 3, but with about
2100it/s consistently ~10% faster than option 4.

Thank you very much for the suggestion!

On 10.08.23 11:47, Stefano Zampini wrote:

I would use option 3. Keep a work vector and do a vector
summation instead of the multiple multiplication by scale
and 1/scale.

I agree with you the docs are a little misleading here.

On Thu, Aug 10, 2023, 11:40 Niclas Götting
 wrote:

Thank you both for the very quick answer!

So far, I compiled PETSc with debugging turned on, but I
think it should still be faster than standard scipy in
both cases. Actually, Stefano's answer has got me very
far already; now I only define the RHS of the ODE and no
Jacobian (I wonder, why the documentation suggests
otherwise, though). I had the following four tries at
implementing the RHS:

 1. def rhsfunc1(ts, t, u, F):
    scale = 0.5 * (5 < t < 10)
    (l + scale * pump).mult(u, F)
 2. def rhsfunc2(ts, t, u, F):
    l.mult(u, F)
    scale = 0.5 * (5 < t < 10)
    (scale * pump).multAdd(u, F, F)
 3. def rhsfunc3(ts, t, u, F):
    l.mult(u, F)
    scale = 0.5 * (5 < t < 10)
    if scale != 0:
    pump.scale(scale)
    pump.multAdd(u, F, F)
    pump.scale(1/scale)
 4. def rhsfunc4(ts, t, u, F):
    tmp_pump.zeroEntries() # tmp_pump is
pump.duplicate()
    l.mult(u, F)
    scale = 0.5 * (5 < t < 10)
    tmp_pump.axpy(scale, pump,
structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
    tmp_pump.multAdd(u, F, F)

They all yield the same results, but with 50it/s,
800it/, 2300it/s and 1900it/s, respectively, which is a
huge performance boost (almost 7 times as fast as scipy,
with PETSc debugging still turned on). As the scale
function will most likely be a gaussian in the future, I
think that option 3 will be become numerically unstable
and I'll have to go with option 4, which is already
faster than I expected. If you think it is possible to
speed up the RHS calculation even more, I'd be happy to
hear your suggestions; the -log_view is attached to this
message.

One last point: If I didn't misunderstand the
documentation at
https://petsc.org/release/manual/ts/#special-cases,
should this maybe be changed?

Best regards
Niclas

On 09.08.23 17:51, Stefano Zampini wrote:

TSRK is an explicit solver. Unless you are changing the
ts type from command line,  the explicit  jacobian
should not be needed. On top of Barry's suggestion, I
would suggest you to write the explicit RHS instead of
assembly a throw away matrix every time that function
needs to be sampled.

On Wed, Aug 9, 2023, 17:09 Niclas Götting
 wrote:

Hi all,

I'm currently trying to convert a quantum
simulation from scipy to
PETSc. The problem itself is extremely simple and
of the form \dot{u}(t)
= (A_const + f(t)*B_const)*u(t), where f(t) in this
simple test case is
a square function. The matrices A_const and B_const
are extremely sparse
and therefore I thought, the problem will be well
  

Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Stefano Zampini
Then just do the multiplications you need. My proposal was for the example
function you were showing.

On Thu, Aug 10, 2023, 12:25 Niclas Götting 
wrote:

> You are absolutely right for this specific case (I get about 2400it/s
> instead of 2100it/s). However, the single square function will be replaced
> by a series of gaussian pulses in the future, which will never be zero.
> Maybe one could do an approximation and skip the second mult, if the
> gaussians are close to zero.
> On 10.08.23 12:16, Stefano Zampini wrote:
>
> If you do the mult of "pump" inside an if it should be faster
>
> On Thu, Aug 10, 2023, 12:12 Niclas Götting 
> wrote:
>
>> If I understood you right, this should be the resulting RHS:
>>
>> def rhsfunc5(ts, t, u, F):
>> l.mult(u, F)
>> pump.mult(u, tmp_vec)
>> scale = 0.5 * (5 < t < 10)
>> F.axpy(scale, tmp_vec)
>>
>> It is a little bit slower than option 3, but with about 2100it/s
>> consistently ~10% faster than option 4.
>>
>> Thank you very much for the suggestion!
>> On 10.08.23 11:47, Stefano Zampini wrote:
>>
>> I would use option 3. Keep a work vector and do a vector summation
>> instead of the multiple multiplication by scale and 1/scale.
>>
>> I agree with you the docs are a little misleading here.
>>
>> On Thu, Aug 10, 2023, 11:40 Niclas Götting 
>> wrote:
>>
>>> Thank you both for the very quick answer!
>>>
>>> So far, I compiled PETSc with debugging turned on, but I think it should
>>> still be faster than standard scipy in both cases. Actually, Stefano's
>>> answer has got me very far already; now I only define the RHS of the ODE
>>> and no Jacobian (I wonder, why the documentation suggests otherwise,
>>> though). I had the following four tries at implementing the RHS:
>>>
>>>1. def rhsfunc1(ts, t, u, F):
>>>scale = 0.5 * (5 < t < 10)
>>>(l + scale * pump).mult(u, F)
>>>2. def rhsfunc2(ts, t, u, F):
>>>l.mult(u, F)
>>>scale = 0.5 * (5 < t < 10)
>>>(scale * pump).multAdd(u, F, F)
>>>3. def rhsfunc3(ts, t, u, F):
>>>l.mult(u, F)
>>>scale = 0.5 * (5 < t < 10)
>>>if scale != 0:
>>>pump.scale(scale)
>>>pump.multAdd(u, F, F)
>>>pump.scale(1/scale)
>>>4. def rhsfunc4(ts, t, u, F):
>>>tmp_pump.zeroEntries() # tmp_pump is pump.duplicate()
>>>l.mult(u, F)
>>>scale = 0.5 * (5 < t < 10)
>>>tmp_pump.axpy(scale, pump,
>>>structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
>>>tmp_pump.multAdd(u, F, F)
>>>
>>> They all yield the same results, but with 50it/s, 800it/, 2300it/s and
>>> 1900it/s, respectively, which is a huge performance boost (almost 7 times
>>> as fast as scipy, with PETSc debugging still turned on). As the scale
>>> function will most likely be a gaussian in the future, I think that option
>>> 3 will be become numerically unstable and I'll have to go with option 4,
>>> which is already faster than I expected. If you think it is possible to
>>> speed up the RHS calculation even more, I'd be happy to hear your
>>> suggestions; the -log_view is attached to this message.
>>>
>>> One last point: If I didn't misunderstand the documentation at
>>> https://petsc.org/release/manual/ts/#special-cases, should this maybe
>>> be changed?
>>>
>>> Best regards
>>> Niclas
>>> On 09.08.23 17:51, Stefano Zampini wrote:
>>>
>>> TSRK is an explicit solver. Unless you are changing the ts type from
>>> command line,  the explicit  jacobian should not be needed. On top of
>>> Barry's suggestion, I would suggest you to write the explicit RHS instead
>>> of assembly a throw away matrix every time that function needs to be
>>> sampled.
>>>
>>> On Wed, Aug 9, 2023, 17:09 Niclas Götting 
>>> wrote:
>>>
 Hi all,

 I'm currently trying to convert a quantum simulation from scipy to
 PETSc. The problem itself is extremely simple and of the form
 \dot{u}(t)
 = (A_const + f(t)*B_const)*u(t), where f(t) in this simple test case is
 a square function. The matrices A_const and B_const are extremely
 sparse
 and therefore I thought, the problem will be well suited for PETSc.
 Currently, I solve the ODE with the following procedure in scipy (I can
 provide the necessary data files, if needed, but they are just some
 trace-preserving, very sparse matrices):

 import numpy as np
 import scipy.sparse
 import scipy.integrate

 from tqdm import tqdm


 l = np.load("../liouvillian.npy")
 pump = np.load("../pump_operator.npy")
 state = np.load("../initial_state.npy")

 l = scipy.sparse.csr_array(l)
 pump = scipy.sparse.csr_array(pump)

 def f(t, y, *args):
  return (l + 0.5 * (5 < t < 10) * pump) @ y
  #return l @ y # Uncomment for f(t) = 0

 dt = 0.1
 NUM_STEPS = 200
 res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
 solver =
 

Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Niclas Götting
You are absolutely right for this specific case (I get about 2400it/s 
instead of 2100it/s). However, the single square function will be 
replaced by a series of gaussian pulses in the future, which will never 
be zero. Maybe one could make an approximation and skip the second mult when 
the gaussians are close to zero.
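
For what it is worth, a small sketch of that combination (work vector plus
skipping the pump term below a cutoff; the 1e-12 cutoff and tmp_vec =
F.duplicate() are assumptions):

def rhsfunc6(ts, t, u, F):
    l.mult(u, F)
    scale = 0.5 * (5 < t < 10)      # later: a sum of gaussian pulses
    if abs(scale) > 1e-12:          # skip the pump term while the pulse is off
        pump.mult(u, tmp_vec)
        F.axpy(scale, tmp_vec)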


On 10.08.23 12:16, Stefano Zampini wrote:

If you do the mult of "pump" inside an if it should be faster

On Thu, Aug 10, 2023, 12:12 Niclas Götting 
 wrote:


If I understood you right, this should be the resulting RHS:

def rhsfunc5(ts, t, u, F):
    l.mult(u, F)
    pump.mult(u, tmp_vec)
    scale = 0.5 * (5 < t < 10)
    F.axpy(scale, tmp_vec)

It is a little bit slower than option 3, but with about 2100it/s
consistently ~10% faster than option 4.

Thank you very much for the suggestion!

On 10.08.23 11:47, Stefano Zampini wrote:

I would use option 3. Keep a work vector and do a vector
summation instead of the multiple multiplication by scale and
1/scale.

I agree with you the docs are a little misleading here.

On Thu, Aug 10, 2023, 11:40 Niclas Götting
 wrote:

Thank you both for the very quick answer!

So far, I compiled PETSc with debugging turned on, but I
think it should still be faster than standard scipy in both
cases. Actually, Stefano's answer has got me very far
already; now I only define the RHS of the ODE and no Jacobian
(I wonder, why the documentation suggests otherwise, though).
I had the following four tries at implementing the RHS:

 1. def rhsfunc1(ts, t, u, F):
    scale = 0.5 * (5 < t < 10)
    (l + scale * pump).mult(u, F)
 2. def rhsfunc2(ts, t, u, F):
    l.mult(u, F)
    scale = 0.5 * (5 < t < 10)
    (scale * pump).multAdd(u, F, F)
 3. def rhsfunc3(ts, t, u, F):
    l.mult(u, F)
    scale = 0.5 * (5 < t < 10)
    if scale != 0:
    pump.scale(scale)
    pump.multAdd(u, F, F)
    pump.scale(1/scale)
 4. def rhsfunc4(ts, t, u, F):
    tmp_pump.zeroEntries() # tmp_pump is pump.duplicate()
    l.mult(u, F)
    scale = 0.5 * (5 < t < 10)
    tmp_pump.axpy(scale, pump,
structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
    tmp_pump.multAdd(u, F, F)

They all yield the same results, but with 50it/s, 800it/,
2300it/s and 1900it/s, respectively, which is a huge
performance boost (almost 7 times as fast as scipy, with
PETSc debugging still turned on). As the scale function will
most likely be a gaussian in the future, I think that option
3 will be become numerically unstable and I'll have to go
with option 4, which is already faster than I expected. If
you think it is possible to speed up the RHS calculation even
more, I'd be happy to hear your suggestions; the -log_view is
attached to this message.

One last point: If I didn't misunderstand the documentation
at https://petsc.org/release/manual/ts/#special-cases, should
this maybe be changed?

Best regards
Niclas

On 09.08.23 17:51, Stefano Zampini wrote:

TSRK is an explicit solver. Unless you are changing the ts
type from command line,  the explicit  jacobian should not
be needed. On top of Barry's suggestion, I would suggest you
to write the explicit RHS instead of assembly a throw away
matrix every time that function needs to be sampled.

On Wed, Aug 9, 2023, 17:09 Niclas Götting
 wrote:

Hi all,

I'm currently trying to convert a quantum simulation
from scipy to
PETSc. The problem itself is extremely simple and of the
form \dot{u}(t)
= (A_const + f(t)*B_const)*u(t), where f(t) in this
simple test case is
a square function. The matrices A_const and B_const are
extremely sparse
and therefore I thought, the problem will be well suited
for PETSc.
Currently, I solve the ODE with the following procedure
in scipy (I can
provide the necessary data files, if needed, but they
are just some
trace-preserving, very sparse matrices):

import numpy as np
import scipy.sparse
import scipy.integrate

from tqdm import tqdm


l = np.load("../liouvillian.npy")
pump = np.load("../pump_operator.npy")
state = np.load("../initial_state.npy")

l = scipy.sparse.csr_array(l)
pump = scipy.sparse.csr_array(pump)

def f(t, y, *args):
 return (l + 0.5 * (5 < t < 10) * pump) @ y
 

Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Stefano Zampini
If you do the mult of "pump" inside an if it should be faster

On Thu, Aug 10, 2023, 12:12 Niclas Götting 
wrote:

> If I understood you right, this should be the resulting RHS:
>
> def rhsfunc5(ts, t, u, F):
> l.mult(u, F)
> pump.mult(u, tmp_vec)
> scale = 0.5 * (5 < t < 10)
> F.axpy(scale, tmp_vec)
>
> It is a little bit slower than option 3, but with about 2100it/s
> consistently ~10% faster than option 4.
>
> Thank you very much for the suggestion!
> On 10.08.23 11:47, Stefano Zampini wrote:
>
> I would use option 3. Keep a work vector and do a vector summation instead
> of the multiple multiplication by scale and 1/scale.
>
> I agree with you the docs are a little misleading here.
>
> On Thu, Aug 10, 2023, 11:40 Niclas Götting 
> wrote:
>
>> Thank you both for the very quick answer!
>>
>> So far, I compiled PETSc with debugging turned on, but I think it should
>> still be faster than standard scipy in both cases. Actually, Stefano's
>> answer has got me very far already; now I only define the RHS of the ODE
>> and no Jacobian (I wonder, why the documentation suggests otherwise,
>> though). I had the following four tries at implementing the RHS:
>>
>>1. def rhsfunc1(ts, t, u, F):
>>scale = 0.5 * (5 < t < 10)
>>(l + scale * pump).mult(u, F)
>>2. def rhsfunc2(ts, t, u, F):
>>l.mult(u, F)
>>scale = 0.5 * (5 < t < 10)
>>(scale * pump).multAdd(u, F, F)
>>3. def rhsfunc3(ts, t, u, F):
>>l.mult(u, F)
>>scale = 0.5 * (5 < t < 10)
>>if scale != 0:
>>pump.scale(scale)
>>pump.multAdd(u, F, F)
>>pump.scale(1/scale)
>>4. def rhsfunc4(ts, t, u, F):
>>tmp_pump.zeroEntries() # tmp_pump is pump.duplicate()
>>l.mult(u, F)
>>scale = 0.5 * (5 < t < 10)
>>tmp_pump.axpy(scale, pump,
>>structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
>>tmp_pump.multAdd(u, F, F)
>>
>> They all yield the same results, but with 50it/s, 800it/, 2300it/s and
>> 1900it/s, respectively, which is a huge performance boost (almost 7 times
>> as fast as scipy, with PETSc debugging still turned on). As the scale
>> function will most likely be a gaussian in the future, I think that option
>> 3 will be become numerically unstable and I'll have to go with option 4,
>> which is already faster than I expected. If you think it is possible to
>> speed up the RHS calculation even more, I'd be happy to hear your
>> suggestions; the -log_view is attached to this message.
>>
>> One last point: If I didn't misunderstand the documentation at
>> https://petsc.org/release/manual/ts/#special-cases, should this maybe be
>> changed?
>>
>> Best regards
>> Niclas
>> On 09.08.23 17:51, Stefano Zampini wrote:
>>
>> TSRK is an explicit solver. Unless you are changing the ts type from
>> command line,  the explicit  jacobian should not be needed. On top of
>> Barry's suggestion, I would suggest you to write the explicit RHS instead
>> of assembly a throw away matrix every time that function needs to be
>> sampled.
>>
>> On Wed, Aug 9, 2023, 17:09 Niclas Götting 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm currently trying to convert a quantum simulation from scipy to
>>> PETSc. The problem itself is extremely simple and of the form \dot{u}(t)
>>> = (A_const + f(t)*B_const)*u(t), where f(t) in this simple test case is
>>> a square function. The matrices A_const and B_const are extremely sparse
>>> and therefore I thought, the problem will be well suited for PETSc.
>>> Currently, I solve the ODE with the following procedure in scipy (I can
>>> provide the necessary data files, if needed, but they are just some
>>> trace-preserving, very sparse matrices):
>>>
>>> import numpy as np
>>> import scipy.sparse
>>> import scipy.integrate
>>>
>>> from tqdm import tqdm
>>>
>>>
>>> l = np.load("../liouvillian.npy")
>>> pump = np.load("../pump_operator.npy")
>>> state = np.load("../initial_state.npy")
>>>
>>> l = scipy.sparse.csr_array(l)
>>> pump = scipy.sparse.csr_array(pump)
>>>
>>> def f(t, y, *args):
>>>  return (l + 0.5 * (5 < t < 10) * pump) @ y
>>>  #return l @ y # Uncomment for f(t) = 0
>>>
>>> dt = 0.1
>>> NUM_STEPS = 200
>>> res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
>>> solver =
>>> scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)
>>> times = []
>>> for i in tqdm(range(NUM_STEPS)):
>>>  res[i, :] = solver.integrate(solver.t + dt)
>>>  times.append(solver.t)
>>>
>>> Here, A_const = l, B_const = pump and f(t) = 5 < t < 10. tqdm reports
>>> about 330it/s on my machine. When converting the code to PETSc, I came
>>> to the following result (according to the chapter
>>> https://petsc.org/main/manual/ts/#special-cases)
>>>
>>> import sys
>>> import petsc4py
>>> petsc4py.init(args=sys.argv)
>>> import numpy as np
>>> import scipy.sparse
>>>
>>> from tqdm import tqdm
>>> from petsc4py import PETSc
>>>
>>> comm = PETSc.COMM_WORLD
>>>
>>>
>>> def mat_to_real(arr):
>>> 

Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Niclas Götting

If I understood you right, this should be the resulting RHS:

def rhsfunc5(ts, t, u, F):
    l.mult(u, F)
    pump.mult(u, tmp_vec)
    scale = 0.5 * (5 < t < 10)
    F.axpy(scale, tmp_vec)

It is a little bit slower than option 3, but at about 2100it/s it is
consistently ~10% faster than option 4.


Thank you very much for the suggestion!

On 10.08.23 11:47, Stefano Zampini wrote:
I would use option 3. Keep a work vector and do a vector summation 
instead of the multiple multiplication by scale and 1/scale.


I agree with you the docs are a little misleading here.

On Thu, Aug 10, 2023, 11:40 Niclas Götting 
 wrote:


Thank you both for the very quick answer!

So far, I compiled PETSc with debugging turned on, but I think it
should still be faster than standard scipy in both cases.
Actually, Stefano's answer has got me very far already; now I only
define the RHS of the ODE and no Jacobian (I wonder, why the
documentation suggests otherwise, though). I had the following
four tries at implementing the RHS:

 1. def rhsfunc1(ts, t, u, F):
    scale = 0.5 * (5 < t < 10)
    (l + scale * pump).mult(u, F)
 2. def rhsfunc2(ts, t, u, F):
    l.mult(u, F)
    scale = 0.5 * (5 < t < 10)
    (scale * pump).multAdd(u, F, F)
 3. def rhsfunc3(ts, t, u, F):
    l.mult(u, F)
    scale = 0.5 * (5 < t < 10)
    if scale != 0:
    pump.scale(scale)
    pump.multAdd(u, F, F)
    pump.scale(1/scale)
 4. def rhsfunc4(ts, t, u, F):
    tmp_pump.zeroEntries() # tmp_pump is pump.duplicate()
    l.mult(u, F)
    scale = 0.5 * (5 < t < 10)
    tmp_pump.axpy(scale, pump,
structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
    tmp_pump.multAdd(u, F, F)

They all yield the same results, but with 50it/s, 800it/, 2300it/s
and 1900it/s, respectively, which is a huge performance boost
(almost 7 times as fast as scipy, with PETSc debugging still
turned on). As the scale function will most likely be a gaussian
in the future, I think that option 3 will be become numerically
unstable and I'll have to go with option 4, which is already
faster than I expected. If you think it is possible to speed up
the RHS calculation even more, I'd be happy to hear your
suggestions; the -log_view is attached to this message.

One last point: If I didn't misunderstand the documentation at
https://petsc.org/release/manual/ts/#special-cases, should this
maybe be changed?

Best regards
Niclas

On 09.08.23 17:51, Stefano Zampini wrote:

TSRK is an explicit solver. Unless you are changing the ts type
from command line,  the explicit  jacobian should not be needed.
On top of Barry's suggestion, I would suggest you to write the
explicit RHS instead of assembly a throw away matrix every time
that function needs to be sampled.

On Wed, Aug 9, 2023, 17:09 Niclas Götting
 wrote:

Hi all,

I'm currently trying to convert a quantum simulation from
scipy to
PETSc. The problem itself is extremely simple and of the form
\dot{u}(t)
= (A_const + f(t)*B_const)*u(t), where f(t) in this simple
test case is
a square function. The matrices A_const and B_const are
extremely sparse
and therefore I thought, the problem will be well suited for
PETSc.
Currently, I solve the ODE with the following procedure in
scipy (I can
provide the necessary data files, if needed, but they are
just some
trace-preserving, very sparse matrices):

import numpy as np
import scipy.sparse
import scipy.integrate

from tqdm import tqdm


l = np.load("../liouvillian.npy")
pump = np.load("../pump_operator.npy")
state = np.load("../initial_state.npy")

l = scipy.sparse.csr_array(l)
pump = scipy.sparse.csr_array(pump)

def f(t, y, *args):
 return (l + 0.5 * (5 < t < 10) * pump) @ y
 #return l @ y # Uncomment for f(t) = 0

dt = 0.1
NUM_STEPS = 200
res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
solver =
scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)
times = []
for i in tqdm(range(NUM_STEPS)):
 res[i, :] = solver.integrate(solver.t + dt)
 times.append(solver.t)

Here, A_const = l, B_const = pump and f(t) = 5 < t < 10. tqdm
reports
about 330it/s on my machine. When converting the code to
PETSc, I came
to the following result (according to the chapter
https://petsc.org/main/manual/ts/#special-cases)

import sys
import petsc4py
petsc4py.init(args=sys.argv)
import numpy as np
import scipy.sparse

from tqdm import 

Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Stefano Zampini
I would use option 3. Keep a work vector and do a vector summation instead
of the repeated multiplications by scale and 1/scale.

I agree with you that the docs are a little misleading here.

On Thu, Aug 10, 2023, 11:40 Niclas Götting 
wrote:

> Thank you both for the very quick answer!
>
> So far, I compiled PETSc with debugging turned on, but I think it should
> still be faster than standard scipy in both cases. Actually, Stefano's
> answer has got me very far already; now I only define the RHS of the ODE
> and no Jacobian (I wonder, why the documentation suggests otherwise,
> though). I had the following four tries at implementing the RHS:
>
>1. def rhsfunc1(ts, t, u, F):
>scale = 0.5 * (5 < t < 10)
>(l + scale * pump).mult(u, F)
>2. def rhsfunc2(ts, t, u, F):
>l.mult(u, F)
>scale = 0.5 * (5 < t < 10)
>(scale * pump).multAdd(u, F, F)
>3. def rhsfunc3(ts, t, u, F):
>l.mult(u, F)
>scale = 0.5 * (5 < t < 10)
>if scale != 0:
>pump.scale(scale)
>pump.multAdd(u, F, F)
>pump.scale(1/scale)
>4. def rhsfunc4(ts, t, u, F):
>tmp_pump.zeroEntries() # tmp_pump is pump.duplicate()
>l.mult(u, F)
>scale = 0.5 * (5 < t < 10)
>tmp_pump.axpy(scale, pump,
>structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
>tmp_pump.multAdd(u, F, F)
>
> They all yield the same results, but with 50it/s, 800it/, 2300it/s and
> 1900it/s, respectively, which is a huge performance boost (almost 7 times
> as fast as scipy, with PETSc debugging still turned on). As the scale
> function will most likely be a gaussian in the future, I think that option
> 3 will be become numerically unstable and I'll have to go with option 4,
> which is already faster than I expected. If you think it is possible to
> speed up the RHS calculation even more, I'd be happy to hear your
> suggestions; the -log_view is attached to this message.
>
> One last point: If I didn't misunderstand the documentation at
> https://petsc.org/release/manual/ts/#special-cases, should this maybe be
> changed?
>
> Best regards
> Niclas
> On 09.08.23 17:51, Stefano Zampini wrote:
>
> TSRK is an explicit solver. Unless you are changing the ts type from
> command line,  the explicit  jacobian should not be needed. On top of
> Barry's suggestion, I would suggest you to write the explicit RHS instead
> of assembly a throw away matrix every time that function needs to be
> sampled.
>
> On Wed, Aug 9, 2023, 17:09 Niclas Götting 
> wrote:
>
>> Hi all,
>>
>> I'm currently trying to convert a quantum simulation from scipy to
>> PETSc. The problem itself is extremely simple and of the form \dot{u}(t)
>> = (A_const + f(t)*B_const)*u(t), where f(t) in this simple test case is
>> a square function. The matrices A_const and B_const are extremely sparse
>> and therefore I thought, the problem will be well suited for PETSc.
>> Currently, I solve the ODE with the following procedure in scipy (I can
>> provide the necessary data files, if needed, but they are just some
>> trace-preserving, very sparse matrices):
>>
>> import numpy as np
>> import scipy.sparse
>> import scipy.integrate
>>
>> from tqdm import tqdm
>>
>>
>> l = np.load("../liouvillian.npy")
>> pump = np.load("../pump_operator.npy")
>> state = np.load("../initial_state.npy")
>>
>> l = scipy.sparse.csr_array(l)
>> pump = scipy.sparse.csr_array(pump)
>>
>> def f(t, y, *args):
>>  return (l + 0.5 * (5 < t < 10) * pump) @ y
>>  #return l @ y # Uncomment for f(t) = 0
>>
>> dt = 0.1
>> NUM_STEPS = 200
>> res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
>> solver =
>> scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)
>> times = []
>> for i in tqdm(range(NUM_STEPS)):
>>  res[i, :] = solver.integrate(solver.t + dt)
>>  times.append(solver.t)
>>
>> Here, A_const = l, B_const = pump and f(t) = 5 < t < 10. tqdm reports
>> about 330it/s on my machine. When converting the code to PETSc, I came
>> to the following result (according to the chapter
>> https://petsc.org/main/manual/ts/#special-cases)
>>
>> import sys
>> import petsc4py
>> petsc4py.init(args=sys.argv)
>> import numpy as np
>> import scipy.sparse
>>
>> from tqdm import tqdm
>> from petsc4py import PETSc
>>
>> comm = PETSc.COMM_WORLD
>>
>>
>> def mat_to_real(arr):
>>  return np.block([[arr.real, -arr.imag], [arr.imag,
>> arr.real]]).astype(np.float64)
>>
>> def mat_to_petsc_aij(arr):
>>  arr_sc_sp = scipy.sparse.csr_array(arr)
>>  mat = PETSc.Mat().createAIJ(arr.shape[0], comm=comm)
>>  rstart, rend = mat.getOwnershipRange()
>>  print(rstart, rend)
>>  print(arr.shape[0])
>>  print(mat.sizes)
>>  I = arr_sc_sp.indptr[rstart : rend + 1] - arr_sc_sp.indptr[rstart]
>>  J = arr_sc_sp.indices[arr_sc_sp.indptr[rstart] :
>> arr_sc_sp.indptr[rend]]
>>  V = arr_sc_sp.data[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
>>
>>  print(I.shape, J.shape, V.shape)
>>   

Re: [petsc-users] Python PETSc performance vs scipy ZVODE

2023-08-10 Thread Niclas Götting

Thank you both for the very quick answer!

So far, I compiled PETSc with debugging turned on, but I think it should 
still be faster than standard scipy in both cases. Actually, Stefano's 
answer has got me very far already; now I only define the RHS of the ODE 
and no Jacobian (I wonder why the documentation suggests otherwise, 
though). I had the following four attempts at implementing the RHS:


1. def rhsfunc1(ts, t, u, F):
       scale = 0.5 * (5 < t < 10)
       (l + scale * pump).mult(u, F)
2. def rhsfunc2(ts, t, u, F):
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       (scale * pump).multAdd(u, F, F)
3. def rhsfunc3(ts, t, u, F):
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       if scale != 0:
           pump.scale(scale)
           pump.multAdd(u, F, F)
           pump.scale(1/scale)
4. def rhsfunc4(ts, t, u, F):
       tmp_pump.zeroEntries()  # tmp_pump is pump.duplicate()
       l.mult(u, F)
       scale = 0.5 * (5 < t < 10)
       tmp_pump.axpy(scale, pump,
                     structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
       tmp_pump.multAdd(u, F, F)

They all yield the same results, but with 50it/s, 800it/s, 2300it/s and 
1900it/s, respectively, which is a huge performance boost (almost 7 
times as fast as scipy, with PETSc debugging still turned on). As the 
scale function will most likely be a gaussian in the future, I think 
that option 3 will become numerically unstable and I'll have to go 
with option 4, which is already faster than I expected. If you think it 
is possible to speed up the RHS calculation even more, I'd be happy to 
hear your suggestions; the -log_view is attached to this message.
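
A tiny concrete example of the failure mode I have in mind for option 3 (the
tail value is made up): once the gaussian tail produces a subnormal scale, the
1/scale step overflows, so the scale/unscale round trip destroys the matrix
entries.

s = 1e-320                     # subnormal double, e.g. a far gaussian tail
print(1.0 / s)                 # inf: the pump.scale(1/scale) step overflows
print((0.25 * s) * (1.0 / s))  # inf instead of 0.25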


One last point: If I didn't misunderstand the documentation at 
https://petsc.org/release/manual/ts/#special-cases, should this maybe be 
changed?


Best regards
Niclas

On 09.08.23 17:51, Stefano Zampini wrote:
TSRK is an explicit solver. Unless you are changing the ts type from 
the command line, the explicit Jacobian should not be needed. On top of 
Barry's suggestion, I would suggest you write the explicit RHS 
instead of assembling a throw-away matrix every time that function needs 
to be sampled.


On Wed, Aug 9, 2023, 17:09 Niclas Götting 
 wrote:


Hi all,

I'm currently trying to convert a quantum simulation from scipy to
PETSc. The problem itself is extremely simple and of the form
\dot{u}(t)
= (A_const + f(t)*B_const)*u(t), where f(t) in this simple test
case is
a square function. The matrices A_const and B_const are extremely
sparse
and therefore I thought, the problem will be well suited for PETSc.
Currently, I solve the ODE with the following procedure in scipy
(I can
provide the necessary data files, if needed, but they are just some
trace-preserving, very sparse matrices):

import numpy as np
import scipy.sparse
import scipy.integrate

from tqdm import tqdm


l = np.load("../liouvillian.npy")
pump = np.load("../pump_operator.npy")
state = np.load("../initial_state.npy")

l = scipy.sparse.csr_array(l)
pump = scipy.sparse.csr_array(pump)

def f(t, y, *args):
 return (l + 0.5 * (5 < t < 10) * pump) @ y
 #return l @ y # Uncomment for f(t) = 0

dt = 0.1
NUM_STEPS = 200
res = np.empty((NUM_STEPS, 4096), dtype=np.complex128)
solver =
scipy.integrate.ode(f).set_integrator("zvode").set_initial_value(state)
times = []
for i in tqdm(range(NUM_STEPS)):
 res[i, :] = solver.integrate(solver.t + dt)
 times.append(solver.t)

Here, A_const = l, B_const = pump and f(t) = 5 < t < 10. tqdm reports
about 330it/s on my machine. When converting the code to PETSc, I
came
to the following result (according to the chapter
https://petsc.org/main/manual/ts/#special-cases)

import sys
import petsc4py
petsc4py.init(args=sys.argv)
import numpy as np
import scipy.sparse

from tqdm import tqdm
from petsc4py import PETSc

comm = PETSc.COMM_WORLD


def mat_to_real(arr):
    return np.block([[arr.real, -arr.imag],
                     [arr.imag, arr.real]]).astype(np.float64)

def mat_to_petsc_aij(arr):
    arr_sc_sp = scipy.sparse.csr_array(arr)
    mat = PETSc.Mat().createAIJ(arr.shape[0], comm=comm)
    rstart, rend = mat.getOwnershipRange()
    print(rstart, rend)
    print(arr.shape[0])
    print(mat.sizes)
    I = arr_sc_sp.indptr[rstart : rend + 1] - arr_sc_sp.indptr[rstart]
    J = arr_sc_sp.indices[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]
    V = arr_sc_sp.data[arr_sc_sp.indptr[rstart] : arr_sc_sp.indptr[rend]]

    print(I.shape, J.shape, V.shape)
    mat.setValuesCSR(I, J, V)
    mat.assemble()
    return mat


l = np.load("../liouvillian.npy")
l = mat_to_real(l)
pump = np.load("../pump_operator.npy")
pump = mat_to_real(pump)
state = np.load("../initial_state.npy")

[petsc-users] error related to 'valgrind' when using MatView

2023-08-10 Thread maitri ksh
I am unable to understand what went wrong with my code. I could load a
matrix (a large sparse matrix) into PETSc, write it out, and read it back
into MATLAB, but when I tried to use MatView to see the matrix info, it
produced a 'corrupt argument, #valgrind' error. Can anyone please
help?
Maitri
Mat Object: 1 MPI process
  type: mpiaij
  rows=48, cols=48
  total: nonzeros=17795897, allocated nonzeros=17795897
  total number of mallocs used during MatSetValues calls=0
not using I-node (on process 0) routines
after Loading A matrix... Memory: Total: 0.483803 Max: 0.483803 [0]
after MatDestroy... Memory: Total: 0.483803 Max: 0.483803 [0]
Done
[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Corrupt argument: https://petsc.org/release/faq/#valgrind
[0]PETSC ERROR: Object already free: Parameter # 1
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.19.4-889-g617fea4  GIT Date: 
2023-08-01 05:10:03 +
[0]PETSC ERROR: ./loadMat on a linux-gnu-c-debug named zeus.technion.ac.il by 
maitri.ksh Thu Aug 10 09:20:43 2023
[0]PETSC ERROR: Configure options --with-cc=/usr/local/gcc11/bin/gcc 
--with-cxx=/usr/local/gcc11/bin/g++ --with-fc=gfortran --download-mpich 
--download-fblaslapack --with-matlab --with-matlab-dir=/usr/local/matlab 
--download-superlu --with-superlu --download-superlu_dist --with-superlu_dist 
--download-hdf5 --with-hdf5=1 --download-mumps --with-mumps 
--download-scalapack --with-scalapack --download-parmetis --download-metis 
--download-ptscotch --download-bison --download-cmake
[0]PETSC ERROR: #1 PetscObjectDestroy() at 
/home/maitri.ksh/Maitri/petsc/src/sys/objects/destroy.c:50
[0]PETSC ERROR: #2 PetscObjectRegisterDestroyAll() at 
/home/maitri.ksh/Maitri/petsc/src/sys/objects/destroy.c:355
[0]PETSC ERROR: #3 PetscFinalize() at 
/home/maitri.ksh/Maitri/petsc/src/sys/objects/pinit.c:1452
/* Load unfactored matrix (A) */
PetscViewer viewerA;
ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "Jmat.dat", FILE_MODE_READ, &viewerA); CHKERRQ(ierr);
ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);
ierr = MatSetType(A, type);
ierr = MatSetFromOptions(A);
PetscPrintf(PETSC_COMM_WORLD, "Loading unfactored matrix\n");
ierr = MatLoad(A, viewerA); CHKERRQ(ierr);
PetscViewerDestroy(&viewerA);

PetscViewer viewer;
viewer = PETSC_VIEWER_STDOUT_WORLD;
ierr = PetscViewerPushFormat(viewer, PETSC_VIEWER_ASCII_INFO);
ierr = MatView(A, viewer);
PetscViewerPopFormat(viewer);
PetscViewerDestroy(&viewer);