[petsc-users] superlu_dist issue

2016-11-29 Thread Kong, Fande
Hi All,

I think we have been discussing this topic for a while in other threads.
But I still do not get it. PETSc uses 'SamePattern' as the default
FactPattern. Some test cases in MOOSE fail with this default option, but I
can make these tests pass if I set the FactPattern as
'SamePattern_SameRowPerm' by using -mat_superlu_dist_fact
SamePattern_SameRowPerm.

Does this make sense mathematically? I cannot understand it. 'SamePattern'
should be more general than 'SamePattern_SameRowPerm'. In other words, if
something works with 'SamePattern_SameRowPerm', it should definitely work
with 'SamePattern' too.
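One way to see why 'SamePattern' is not strictly more general: under SamePattern the row permutation Pr is recomputed from the new numerical values at every factorization, while SamePattern_SameRowPerm freezes the previously computed Pr. The two options therefore exercise different code paths, and a run that is stable with a frozen Pr may behave differently when Pr changes. A toy illustration that a "large diagonal" row permutation depends on the values, not just the pattern (plain Python, brute force; the maximize-the-diagonal criterion is only a stand-in for SuperLU_dist's actual LargeDiag heuristic):

```python
from itertools import permutations

def large_diag_rowperm(A):
    """Brute-force row permutation maximizing the product of |diagonal|
    entries -- a toy stand-in for SuperLU_dist's LargeDiag row ordering."""
    n = len(A)
    best, best_score = None, -1.0
    for perm in permutations(range(n)):
        score = 1.0
        for j, i in enumerate(perm):
            score *= abs(A[i][j])
        if score > best_score:
            best, best_score = perm, score
    return best

# Two matrices with the SAME sparsity pattern but different values.
A1 = [[1.0, 9.0], [8.0, 2.0]]
A2 = [[9.0, 1.0], [2.0, 8.0]]

print(large_diag_rowperm(A1))  # swaps the rows: puts 8 and 9 on the diagonal
print(large_diag_rowperm(A2))  # identity: 9 and 8 are already on the diagonal
```

So a code path that was only ever exercised with a frozen row permutation can legitimately behave differently once the permutation is recomputed per factorization.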

Thanks as always.

Fande,


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-11-26 Thread Anton Popov

Hong,

I checked out & compiled your new branch: 
hzhang/fix-superlu_dist-reuse-factornumeric. Unfortunately it did not 
solve the problem.


Sorry.

On 11/21/2016 04:43 AM, Hong wrote:

Anton,
I pushed a fix
https://bitbucket.org/petsc/petsc/commits/28865de08051eb99557d70672c208e14da23c8b1
in branch hzhang/fix-superlu_dist-reuse-factornumeric.
Can you give it a try to see if it works?
I do not have an example which produces your problem.

In your email, you asked "Setting Options.Fact = DOFACT for all 
factorizations is currently impossible via PETSc interface.

The user is expected to choose some kind of reuse model.
If you could add it, I (and other users probably too) would really 
appreciate that."


We do not allow the user to set SuperLU's Options.Fact = DOFACT. If the 
user changes the matrix structure, then the user must call
KSPSetOperators(), which triggers symbolic matrix factorization again, in 
which we set Options.Fact = DOFACT.


I have a conceptual question. How can the sparsity (column) permutation be 
reused if it is applied on top of the equilibration (row) permutation? 
Symbolic factorization should be repeated anyway. Does it run in some 
kind of faster update mode in this case? Please correct me if I 
misunderstand something.
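For what it is worth, the reason a column (sparsity) permutation can be reused at all is that fill-reducing orderings depend only on the nonzero pattern, not on the values. A small sketch using SciPy's reverse Cuthill-McKee ordering as a stand-in for the METIS/MMD-style orderings SuperLU_dist uses (this only illustrates the pattern-only property, not SuperLU_dist's actual code path):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Two matrices with an identical (symmetric) sparsity pattern, different values.
pattern = np.array([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]], dtype=float)
rng = np.random.default_rng(0)
A1 = csr_matrix(pattern * rng.uniform(1, 10, pattern.shape))
A2 = csr_matrix(pattern * rng.uniform(1, 10, pattern.shape))

# RCM looks only at the nonzero structure, so the ordering computed
# for A1 is equally valid for A2.
p1 = reverse_cuthill_mckee(A1, symmetric_mode=True)
p2 = reverse_cuthill_mckee(A2, symmetric_mode=True)
assert (p1 == p2).all()
print(p1)
```

Reusing the ordering only skips the ordering step itself; equilibration, row pivoting, and the symbolic/numeric factorization proper can still be redone on the new values.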


I would still appreciate the full factorization even for the same 
pattern without destroying the KSP/PC object (just as a custom option).




Hong


Thanks a lot,
Anton


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-11-21 Thread Anton Popov

Thanks, Hong.

I will try as soon as possible and let you know.

Anton

On 11/21/2016 04:43 AM, Hong wrote:

Anton,
I pushed a fix
https://bitbucket.org/petsc/petsc/commits/28865de08051eb99557d70672c208e14da23c8b1
in branch hzhang/fix-superlu_dist-reuse-factornumeric.
Can you give it a try to see if it works?
I do not have an example which produces your problem.

In your email, you asked "Setting Options.Fact = DOFACT for all 
factorizations is currently impossible via PETSc interface.

The user is expected to choose some kind of reuse model.
If you could add it, I (and other users probably too) would really 
appreciate that."


We do not allow the user to set SuperLU's Options.Fact = DOFACT. If the 
user changes the matrix structure, then the user must call
KSPSetOperators(), which triggers symbolic matrix factorization again, in 
which we set Options.Fact = DOFACT.


Hong




Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-11-20 Thread Hong
Anton,
I pushed a fix
https://bitbucket.org/petsc/petsc/commits/28865de08051eb99557d70672c208e14da23c8b1
in branch hzhang/fix-superlu_dist-reuse-factornumeric.
Can you give it a try to see if it works?
I do not have an example which produces your problem.

In your email, you asked "Setting Options.Fact = DOFACT for all
factorizations is currently impossible via PETSc interface.
The user is expected to choose some kind of reuse model.
If you could add it, I (and other users probably too) would really
appreciate that."

We do not allow the user to set SuperLU's Options.Fact = DOFACT. If the user
changes the matrix structure, then the user must call
KSPSetOperators(), which triggers symbolic matrix factorization again, in which
we set Options.Fact = DOFACT.

Hong
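The reuse policy Hong describes amounts to a small decision rule. The sketch below is a toy model of that behavior, not actual PETSc source; the function name and flag names are invented for illustration:

```python
def superlu_fact_mode(first_call, structure_changed, reuse_rowperm):
    """Toy model of how the PETSc/SuperLU_dist interface picks options.Fact.

    first_call:        no previous factorization exists yet
    structure_changed: KSPSetOperators() was called with a new nonzero pattern
    reuse_rowperm:     -mat_superlu_dist_fact SamePattern_SameRowPerm was set
    """
    if first_call or structure_changed:
        return "DOFACT"                   # full symbolic + numeric factorization
    if reuse_rowperm:
        return "SamePattern_SameRowPerm"  # reuse both Pc and Pr
    return "SamePattern"                  # reuse Pc only; recompute Pr

print(superlu_fact_mode(True, False, False))   # DOFACT
print(superlu_fact_mode(False, False, True))   # SamePattern_SameRowPerm
```

The point of the thread is the middle branch: once a factor is reused, PETSc must pick one of the two reuse modes on the user's behalf, and there is no option to force DOFACT every time.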


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-11-07 Thread Hong
Anton:
I am planning to work on this as soon as I get time. I assume that your
code is working with the option '-mat_superlu_dist_fact
SamePattern_SameRowPerm'. If not, let me know.

What I'm planning to do is to detect the existence of Pc and Pr in the PETSc
interface, then set the reuse option, so users will not be bothered by it.

>
> Setting Options.Fact = DOFACT for all factorizations is currently
> impossible via PETSc interface.
>
This might be a bug on our side. I'll check it.


> The user is expected to choose some kind of reuse model.
> If you could add it, I (and other users probably too) would really
> appreciate that.
>

I'll try to get it done soon, will let you know. Thanks for your patience.

Hong

>
>
>
> I'll check our interface to see if we can add flag-checking for Pr and Pc,
> then set default accordingly.
>
> Hong
>
> On Wed, Oct 26, 2016 at 3:23 PM, Xiaoye S. Li  wrote:
>
>> Some graph preprocessing steps can be skipped ONLY IF a previous
>> factorization was done, and the information can be reused (AS INPUT) to the
>> new factorization.
>>
>> In general, the driver routine SRC/pdgssvx.c() performs the LU
>> factorization of the following (preprocessed) matrix:
>>  Pc*Pr*diag(R)*A*diag(C)*Pc^T = L*U
>>
>> The default is to do LU from scratch, including all the steps to compute
>> equilibration (R, C), pivot ordering (Pr), and sparsity ordering (Pc).
>>
>> -- The default should be set as options.Fact = DOFACT.
>>
>> -- When you set options.Fact = SamePattern, the sparsity ordering step is
>> skipped, but you need to input Pc which was obtained from a previous
>> factorization.
>>
>> -- When you set options.Fact = SamePattern_SameRowPerm, both sparsity
>> reordering and pivoting ordering steps are skipped, but you need to input
>> both Pr and Pc.
>>
>> Please see Lines 258 - 307 comments in SRC/pdgssvx.c for details,
>> regarding which data structures should be inputs and which are outputs.
>> The Users Guide also explains this.
>>
>> In EXAMPLE/ directory, I have various examples of these usage situations,
>> see EXAMPLE/README.
>>
>> I am a little puzzled why in PETSc, the default is set to SamePattern ??
>>
>> Sherry
>>
>>
>> On Tue, Oct 25, 2016 at 9:18 AM, Hong  wrote:
>>
>>> Sherry,
>>>
>>> We set '-mat_superlu_dist_fact SamePattern'  as default in
>>> petsc/superlu_dist on 12/6/15 (see attached email below).
>>>
>>> However, Anton must set 'SamePattern_SameRowPerm' to avoid crash in his
>>> code. Checking
>>> http://crd-legacy.lbl.gov/~xiaoye/SuperLU/superlu_dist_code_html/pzgssvx___a_bglobal_8c.html
>>> I see a detailed description of using SamePattern_SameRowPerm, which
>>> requires more from the user than SamePattern. I guess these flags are used
>>> for efficiency. The library sets a default, then has users switch for
>>> their own applications. The default setting should not cause a crash. If a
>>> crash occurs, a meaningful error message would help.
>>>
>>> Do you have a suggestion for how we should set the default in PETSc for this flag?
>>>
>>> Hong
>>>
>>> ---
>>> Hong 
>>> 12/7/15
>>>
>>> to Danyang, petsc-maint, PETSc, Xiaoye
>>> Danyang :
>>>
>>> Adding '-mat_superlu_dist_fact SamePattern' fixed the problem. Below is
>>> how I figured it out.
>>>
>>> 1. Reading ex52f.F, I see '-superlu_default' =
>>> '-pc_factor_mat_solver_package superlu_dist', the latter enables runtime
>>> options for other packages. I use superlu_dist-4.2 and superlu-4.1 for the
>>> tests below.
>>> ...
>>> 5.
>>> Using a_flow_check_1.bin, I am able to reproduce the error you reported:
>>> all packages give correct results except superlu_dist:
>>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
>>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>>> superlu_dist
>>> Norm of error  2.5970E-12 iterations 1
>>>  -->Test for matrix  168
>>> Norm of error  1.3936E-01 iterations34
>>>  -->Test for matrix  169
>>>
>>> I guess the error might come from reuse of matrix factor. Replacing
>>> default
>>> -mat_superlu_dist_fact  with
>>> -mat_superlu_dist_fact SamePattern, I get
>>>
>>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
>>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>>> superlu_dist -mat_superlu_dist_fact SamePattern
>>>
>>> Norm of error  2.5970E-12 iterations 1
>>>  -->Test for matrix  168
>>> ...
>>> Sherry may tell you why SamePattern_SameRowPerm causes the difference
>>> here.
>>> Based on the above experiments, I would set the following as defaults:
>>> '-mat_superlu_diagpivotthresh 0.0' in petsc/superlu interface.
>>> '-mat_superlu_dist_fact SamePattern' in petsc/superlu_dist interface.
>>>
>>> Hong
>>>
>>> On Tue, Oct 25, 2016 at 10:38 AM, Hong  wrote:
>>>
 Anton,
 I 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-11-07 Thread Anton Popov



On 10/27/2016 04:51 PM, Hong wrote:

Sherry,
Thanks for detailed explanation.
We use options.Fact = DOFACT as the default for the first factorization. 
When the user reuses a matrix factor, we must provide a default,
either 'options.Fact = SamePattern' or 'SamePattern_SameRowPerm'.
We previously set 'SamePattern_SameRowPerm'. After a user reported an 
error, we switched to 'SamePattern', which causes a problem for a second user.

Hong,

Setting Options.Fact = DOFACT for all factorizations is currently 
impossible via PETSc interface.

The user is expected to choose some kind of reuse model.
If you could add it, I (and other users probably too) would really 
appreciate that.


Thanks a lot,
Anton



I'll check our interface to see if we can add flag-checking for Pr and 
Pc, then set default accordingly.


Hong

On Wed, Oct 26, 2016 at 3:23 PM, Xiaoye S. Li wrote:


Some graph preprocessing steps can be skipped ONLY IF a previous
factorization was done, and the information can be reused (AS
INPUT) to the new factorization.

In general, the driver routine SRC/pdgssvx.c() performs the LU
factorization of the following (preprocessed) matrix:
 Pc*Pr*diag(R)*A*diag(C)*Pc^T = L*U

The default is to do LU from scratch, including all the steps to
compute equilibration (R, C), pivot ordering (Pr), and sparsity
ordering (Pc).

-- The default should be set as options.Fact = DOFACT.

-- When you set options.Fact = SamePattern, the sparsity ordering
step is skipped, but you need to input Pc which was obtained from
a previous factorization.

-- When you set options.Fact = SamePattern_SameRowPerm, both
sparsity reordering and pivoting ordering steps are skipped, but
you need to input both Pr and Pc.

Please see Lines 258 - 307 comments in SRC/pdgssvx.c for details,
regarding which data structures should be inputs and which are
outputs.  The Users Guide also explains this.

In EXAMPLE/ directory, I have various examples of these usage
situations, see EXAMPLE/README.

I am a little puzzled why in PETSc, the default is set to
SamePattern ??

Sherry


On Tue, Oct 25, 2016 at 9:18 AM, Hong wrote:

Sherry,

We set '-mat_superlu_dist_fact SamePattern'  as default in
petsc/superlu_dist on 12/6/15 (see attached email below).

However, Anton must set 'SamePattern_SameRowPerm' to avoid
crash in his code. Checking

http://crd-legacy.lbl.gov/~xiaoye/SuperLU/superlu_dist_code_html/pzgssvx___a_bglobal_8c.html


I see a detailed description of using SamePattern_SameRowPerm,
which requires more from the user than SamePattern. I guess these
flags are used for efficiency. The library sets a default, then
has users switch for their own applications. The default setting
should not cause a crash. If a crash occurs, a meaningful error
message would help.

Do you have a suggestion for how we should set the default in
PETSc for this flag?

Hong

---


  Hong


12/7/15


to Danyang, petsc-maint, PETSc, Xiaoye

Danyang :

Adding '-mat_superlu_dist_fact SamePattern' fixed the problem.
Below is how I figured it out.

1. Reading ex52f.F, I see '-superlu_default' =
'-pc_factor_mat_solver_package superlu_dist', the latter
enables runtime options for other packages. I use
superlu_dist-4.2 and superlu-4.1 for the tests below.
...
5.
Using a_flow_check_1.bin, I am able to reproduce the error you
reported: all packages give correct results except superlu_dist:
./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
-pc_factor_mat_solver_package superlu_dist
Norm of error  2.5970E-12 iterations 1
 -->Test for matrix  168
Norm of error  1.3936E-01 iterations34
 -->Test for matrix  169

I guess the error might come from reuse of matrix factor.
Replacing default
-mat_superlu_dist_fact  with
-mat_superlu_dist_fact SamePattern, I get

./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices
flow_check -loop_folder matrix_and_rhs_bin -pc_type lu
-pc_factor_mat_solver_package superlu_dist
-mat_superlu_dist_fact SamePattern

Norm of error  2.5970E-12 iterations 1
 -->Test for matrix  168
...
 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-27 Thread Hong
Sherry,
Thanks for detailed explanation.
We use options.Fact = DOFACT as the default for the first factorization. When
the user reuses a matrix factor, we must provide a default,
either 'options.Fact = SamePattern' or 'SamePattern_SameRowPerm'.
We previously set 'SamePattern_SameRowPerm'. After a user reported an error,
we switched to 'SamePattern', which causes a problem for a second user.

I'll check our interface to see if we can add flag-checking for Pr and Pc,
then set default accordingly.

Hong

On Wed, Oct 26, 2016 at 3:23 PM, Xiaoye S. Li  wrote:

> Some graph preprocessing steps can be skipped ONLY IF a previous
> factorization was done, and the information can be reused (AS INPUT) to the
> new factorization.
>
> In general, the driver routine SRC/pdgssvx.c() performs the LU
> factorization of the following (preprocessed) matrix:
>  Pc*Pr*diag(R)*A*diag(C)*Pc^T = L*U
>
> The default is to do LU from scratch, including all the steps to compute
> equilibration (R, C), pivot ordering (Pr), and sparsity ordering (Pc).
>
> -- The default should be set as options.Fact = DOFACT.
>
> -- When you set options.Fact = SamePattern, the sparsity ordering step is
> skipped, but you need to input Pc which was obtained from a previous
> factorization.
>
> -- When you set options.Fact = SamePattern_SameRowPerm, both sparsity
> reordering and pivoting ordering steps are skipped, but you need to input
> both Pr and Pc.
>
> Please see Lines 258 - 307 comments in SRC/pdgssvx.c for details,
> regarding which data structures should be inputs and which are outputs.
> The Users Guide also explains this.
>
> In EXAMPLE/ directory, I have various examples of these usage situations,
> see EXAMPLE/README.
>
> I am a little puzzled why in PETSc, the default is set to SamePattern ??
>
> Sherry
>
>
> On Tue, Oct 25, 2016 at 9:18 AM, Hong  wrote:
>
>> Sherry,
>>
>> We set '-mat_superlu_dist_fact SamePattern'  as default in
>> petsc/superlu_dist on 12/6/15 (see attached email below).
>>
>> However, Anton must set 'SamePattern_SameRowPerm' to avoid crash in his
>> code. Checking
>> http://crd-legacy.lbl.gov/~xiaoye/SuperLU/superlu_dist_code_html/pzgssvx___a_bglobal_8c.html
>> I see a detailed description of using SamePattern_SameRowPerm, which
>> requires more from the user than SamePattern. I guess these flags are used
>> for efficiency. The library sets a default, then has users switch for
>> their own applications. The default setting should not cause a crash. If a
>> crash occurs, a meaningful error message would help.
>>
>> Do you have a suggestion for how we should set the default in PETSc for this flag?
>>
>> Hong
>>
>> ---
>> Hong 
>> 12/7/15
>> to Danyang, petsc-maint, PETSc, Xiaoye
>> Danyang :
>>
>> Adding '-mat_superlu_dist_fact SamePattern' fixed the problem. Below is
>> how I figured it out.
>>
>> 1. Reading ex52f.F, I see '-superlu_default' =
>> '-pc_factor_mat_solver_package superlu_dist', the latter enables runtime
>> options for other packages. I use superlu_dist-4.2 and superlu-4.1 for the
>> tests below.
>> ...
>> 5.
>> Using a_flow_check_1.bin, I am able to reproduce the error you reported:
>> all packages give correct results except superlu_dist:
>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>> superlu_dist
>> Norm of error  2.5970E-12 iterations 1
>>  -->Test for matrix  168
>> Norm of error  1.3936E-01 iterations34
>>  -->Test for matrix  169
>>
>> I guess the error might come from reuse of matrix factor. Replacing
>> default
>> -mat_superlu_dist_fact  with
>> -mat_superlu_dist_fact SamePattern, I get
>>
>> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
>> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
>> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
>> superlu_dist -mat_superlu_dist_fact SamePattern
>>
>> Norm of error  2.5970E-12 iterations 1
>>  -->Test for matrix  168
>> ...
>> Sherry may tell you why SamePattern_SameRowPerm causes the difference
>> here.
>> Based on the above experiments, I would set the following as defaults:
>> '-mat_superlu_diagpivotthresh 0.0' in petsc/superlu interface.
>> '-mat_superlu_dist_fact SamePattern' in petsc/superlu_dist interface.
>>
>> Hong
>>
>> On Tue, Oct 25, 2016 at 10:38 AM, Hong  wrote:
>>
>>> Anton,
>>> I guess that when you reuse a matrix and its symbolic factor with updated
>>> numerical values, superlu_dist requires this option. I'm cc'ing Sherry to
>>> confirm it.
>>>
>>> I'll check petsc/superlu-dist interface to set this flag for this case.
>>>
>>> Hong
>>>
>>>
>>> On Tue, Oct 25, 2016 at 8:20 AM, Anton Popov  wrote:
>>>
 Hong,

 I get all the problems gone and valgrind-clean output if I specify this:

 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-26 Thread Xiaoye S. Li
Some graph preprocessing steps can be skipped ONLY IF a previous
factorization was done, and the information can be reused (AS INPUT) to the
new factorization.

In general, the driver routine SRC/pdgssvx.c() performs the LU
factorization of the following (preprocessed) matrix:
 Pc*Pr*diag(R)*A*diag(C)*Pc^T = L*U

The default is to do LU from scratch, including all the steps to compute
equilibration (R, C), pivot ordering (Pr), and sparsity ordering (Pc).
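The equilibration scalings R and C can be sketched in a few lines of Python. This follows the standard LAPACK dgeequ-style recipe (row scales from row maxima, then column scales from the row-scaled matrix); SuperLU_dist's actual routine may differ in details such as overflow safeguards:

```python
import numpy as np

def equilibrate(A):
    """Compute row/column scalings R, C so that every column of
    diag(R) @ A @ diag(C) has infinity-norm 1 and every row at most 1."""
    R = 1.0 / np.abs(A).max(axis=1)   # scale each row by its largest entry
    B = R[:, None] * A                # row-equilibrated matrix diag(R) @ A
    C = 1.0 / np.abs(B).max(axis=0)   # then scale each column of diag(R) @ A
    return R, C

A = np.array([[100.0, 0.5], [2.0, 0.01]])
R, C = equilibrate(A)
S = R[:, None] * A * C[None, :]       # diag(R) @ A @ diag(C)
print(np.abs(S).max(axis=0))          # column maxima are all 1
```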

-- The default should be set as options.Fact = DOFACT.

-- When you set options.Fact = SamePattern, the sparsity ordering step is
skipped, but you need to input Pc which was obtained from a previous
factorization.

-- When you set options.Fact = SamePattern_SameRowPerm, both sparsity
reordering and pivoting ordering steps are skipped, but you need to input
both Pr and Pc.

Please see Lines 258 - 307 comments in SRC/pdgssvx.c for details, regarding
which data structures should be inputs and which are outputs.  The Users
Guide also explains this.

In EXAMPLE/ directory, I have various examples of these usage situations,
see EXAMPLE/README.

I am a little puzzled why in PETSc, the default is set to SamePattern ??

Sherry


On Tue, Oct 25, 2016 at 9:18 AM, Hong  wrote:

> Sherry,
>
> We set '-mat_superlu_dist_fact SamePattern'  as default in
> petsc/superlu_dist on 12/6/15 (see attached email below).
>
> However, Anton must set 'SamePattern_SameRowPerm' to avoid crash in his
> code. Checking
> http://crd-legacy.lbl.gov/~xiaoye/SuperLU/superlu_dist_code_html/pzgssvx___a_bglobal_8c.html
> I see a detailed description of using SamePattern_SameRowPerm, which
> requires more from the user than SamePattern. I guess these flags are used
> for efficiency. The library sets a default, then has users switch for
> their own applications. The default setting should not cause a crash. If a
> crash occurs, a meaningful error message would help.
>
> Do you have a suggestion for how we should set the default in PETSc for this flag?
>
> Hong
>
> ---
> Hong 
> 12/7/15
> to Danyang, petsc-maint, PETSc, Xiaoye
> Danyang :
>
> Adding '-mat_superlu_dist_fact SamePattern' fixed the problem. Below is
> how I figured it out.
>
> 1. Reading ex52f.F, I see '-superlu_default' =
> '-pc_factor_mat_solver_package superlu_dist', the latter enables runtime
> options for other packages. I use superlu_dist-4.2 and superlu-4.1 for the
> tests below.
> ...
> 5.
> Using a_flow_check_1.bin, I am able to reproduce the error you reported:
> all packages give correct results except superlu_dist:
> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
> superlu_dist
> Norm of error  2.5970E-12 iterations 1
>  -->Test for matrix  168
> Norm of error  1.3936E-01 iterations34
>  -->Test for matrix  169
>
> I guess the error might come from reuse of matrix factor. Replacing default
> -mat_superlu_dist_fact  with
> -mat_superlu_dist_fact SamePattern, I get
>
> ./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
> matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
> -loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
> superlu_dist -mat_superlu_dist_fact SamePattern
>
> Norm of error  2.5970E-12 iterations 1
>  -->Test for matrix  168
> ...
> Sherry may tell you why SamePattern_SameRowPerm causes the difference here.
> Based on the above experiments, I would set the following as defaults:
> '-mat_superlu_diagpivotthresh 0.0' in petsc/superlu interface.
> '-mat_superlu_dist_fact SamePattern' in petsc/superlu_dist interface.
>
> Hong
>
> On Tue, Oct 25, 2016 at 10:38 AM, Hong  wrote:
>
>> Anton,
>> I guess that when you reuse a matrix and its symbolic factor with updated
>> numerical values, superlu_dist requires this option. I'm cc'ing Sherry to
>> confirm it.
>>
>> I'll check petsc/superlu-dist interface to set this flag for this case.
>>
>> Hong
>>
>>
>> On Tue, Oct 25, 2016 at 8:20 AM, Anton Popov  wrote:
>>
>>> Hong,
>>>
>>> I get all the problems gone and valgrind-clean output if I specify this:
>>>
>>> -mat_superlu_dist_fact SamePattern_SameRowPerm
>>> What does SamePattern_SameRowPerm actually mean?
>>> Row permutations are for large diagonal, column permutations are for
>>> sparsity, right?
>>> Will it skip subsequent matrix permutations for large diagonal even if
>>> matrix values change significantly?
>>>
>>> Surprisingly everything works even with:
>>>
>>> -mat_superlu_dist_colperm PARMETIS
>>> -mat_superlu_dist_parsymbfact TRUE
>>>
>>> Thanks,
>>> Anton
>>>
>>> On 10/24/2016 09:06 PM, Hong wrote:
>>>
>>> Anton:

 If replacing superlu_dist with mumps, does your code work?

 yes

>>>
>>> You may use mumps in your code, or test different options for
>>> superlu_dist:
>>>
>>>   

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-25 Thread Hong
Sherry,

We set '-mat_superlu_dist_fact SamePattern'  as default in
petsc/superlu_dist on 12/6/15 (see attached email below).

However, Anton must set 'SamePattern_SameRowPerm' to avoid crash in his
code. Checking
http://crd-legacy.lbl.gov/~xiaoye/SuperLU/superlu_dist_code_html/pzgssvx___a_bglobal_8c.html
I see a detailed description of using SamePattern_SameRowPerm, which
requires more from the user than SamePattern. I guess these flags are used
for efficiency. The library sets a default, then has users switch for their
own applications. The default setting should not cause a crash. If a crash
occurs, a meaningful error message would help.

Do you have a suggestion for how we should set the default in PETSc for this flag?

Hong

---
Hong 
12/7/15
to Danyang, petsc-maint, PETSc, Xiaoye
Danyang :

Adding '-mat_superlu_dist_fact SamePattern' fixed the problem. Below is how
I figured it out.

1. Reading ex52f.F, I see '-superlu_default' =
'-pc_factor_mat_solver_package superlu_dist', the latter enables runtime
options for other packages. I use superlu_dist-4.2 and superlu-4.1 for the
tests below.
...
5.
Using a_flow_check_1.bin, I am able to reproduce the error you reported:
all packages give correct results except superlu_dist:
./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
-loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
superlu_dist
Norm of error  2.5970E-12 iterations 1
 -->Test for matrix  168
Norm of error  1.3936E-01 iterations34
 -->Test for matrix  169

I guess the error might come from reuse of matrix factor. Replacing default
-mat_superlu_dist_fact  with
-mat_superlu_dist_fact SamePattern, I get

./ex52f -f0 matrix_and_rhs_bin/a_flow_check_1.bin -rhs
matrix_and_rhs_bin/b_flow_check_168.bin -loop_matrices flow_check
-loop_folder matrix_and_rhs_bin -pc_type lu -pc_factor_mat_solver_package
superlu_dist -mat_superlu_dist_fact SamePattern

Norm of error  2.5970E-12 iterations 1
 -->Test for matrix  168
...
Sherry may tell you why SamePattern_SameRowPerm causes the difference here.
Based on the above experiments, I would set the following as defaults:
'-mat_superlu_diagpivotthresh 0.0' in petsc/superlu interface.
'-mat_superlu_dist_fact SamePattern' in petsc/superlu_dist interface.

Hong

On Tue, Oct 25, 2016 at 10:38 AM, Hong  wrote:

> Anton,
> I guess that when you reuse a matrix and its symbolic factor with updated
> numerical values, superlu_dist requires this option. I'm cc'ing Sherry to
> confirm it.
>
> I'll check petsc/superlu-dist interface to set this flag for this case.
>
> Hong
>
>
> On Tue, Oct 25, 2016 at 8:20 AM, Anton Popov  wrote:
>
>> Hong,
>>
>> I get all the problems gone and valgrind-clean output if I specify this:
>>
>> -mat_superlu_dist_fact SamePattern_SameRowPerm
>> What does SamePattern_SameRowPerm actually mean?
>> Row permutations are for large diagonal, column permutations are for
>> sparsity, right?
>> Will it skip subsequent matrix permutations for large diagonal even if
>> matrix values change significantly?
>>
>> Surprisingly everything works even with:
>>
>> -mat_superlu_dist_colperm PARMETIS
>> -mat_superlu_dist_parsymbfact TRUE
>>
>> Thanks,
>> Anton
>>
>> On 10/24/2016 09:06 PM, Hong wrote:
>>
>> Anton:
>>>
>>> If replacing superlu_dist with mumps, does your code work?
>>>
>>> yes
>>>
>>
>> You may use mumps in your code, or test different options for
>> superlu_dist:
>>
>>   -mat_superlu_dist_equil:  Equilibrate matrix (None)
>>   -mat_superlu_dist_rowperm  Row permutation (choose one of)
>> LargeDiag NATURAL (None)
>>   -mat_superlu_dist_colperm  Column permutation (choose
>> one of) NATURAL MMD_AT_PLUS_A MMD_ATA METIS_AT_PLUS_A PARMETIS (None)
>>   -mat_superlu_dist_replacetinypivot:  Replace tiny pivots (None)
>>   -mat_superlu_dist_parsymbfact:  Parallel symbolic factorization
>> (None)
>>   -mat_superlu_dist_fact  Sparsity pattern for repeated
>> matrix factorization (choose one of) SamePattern SamePattern_SameRowPerm
>> (None)
>>
>> The options inside <> are defaults. You may try others. This might help
>> narrow down the bug.
>>
>> Hong
>>
>>>
>>> Hong

 On 10/24/2016 05:47 PM, Hong wrote:

 Barry,
 Your change indeed fixed the error of his testing code.
 As Satish tested, on your branch, ex16 runs smooth.

 I do not understand why, on the maint or master branch, ex16 crashes inside
 superlu_dist, but not with mumps.


 I also confirm that ex16 runs fine with latest fix, but unfortunately
 not my code.

 This is something to be expected, since my code preallocates once in
 the beginning, so there is no way it can be affected by multiple
 preallocations. Subsequently I only do matrix assembly, which makes sure
 the structure doesn't change (I set the option to raise an error otherwise).

 Summary: we don't have a simple 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-25 Thread Hong
Anton,
I guess that when you reuse a matrix and its symbolic factor with updated
numerical values, superlu_dist requires this option. I'm cc'ing Sherry to
confirm it.

I'll check petsc/superlu-dist interface to set this flag for this case.

Hong

On Tue, Oct 25, 2016 at 8:20 AM, Anton Popov  wrote:

> Hong,
>
> I get all the problems gone and valgrind-clean output if I specify this:
>
> -mat_superlu_dist_fact SamePattern_SameRowPerm
> What does SamePattern_SameRowPerm actually mean?
> Row permutations are for large diagonal, column permutations are for
> sparsity, right?
> Will it skip subsequent matrix permutations for large diagonal even if
> matrix values change significantly?
>
> Surprisingly everything works even with:
>
> -mat_superlu_dist_colperm PARMETIS
> -mat_superlu_dist_parsymbfact TRUE
>
> Thanks,
> Anton
>
> On 10/24/2016 09:06 PM, Hong wrote:
>
> Anton:
>>
>> If replacing superlu_dist with mumps, does your code work?
>>
>> yes
>>
>
> You may use mumps in your code, or test different options for
> superlu_dist:
>
>   -mat_superlu_dist_equil:  Equilibrate matrix (None)
>   -mat_superlu_dist_rowperm  Row permutation (choose one of)
> LargeDiag NATURAL (None)
>   -mat_superlu_dist_colperm  Column permutation (choose
> one of) NATURAL MMD_AT_PLUS_A MMD_ATA METIS_AT_PLUS_A PARMETIS (None)
>   -mat_superlu_dist_replacetinypivot:  Replace tiny pivots (None)
>   -mat_superlu_dist_parsymbfact:  Parallel symbolic factorization
> (None)
>   -mat_superlu_dist_fact  Sparsity pattern for repeated
> matrix factorization (choose one of) SamePattern SamePattern_SameRowPerm
> (None)
>
> The options inside <> are defaults. You may try others. This might help
> narrow down the bug.
>
> Hong
>
>>
>> Hong
>>>
>>> On 10/24/2016 05:47 PM, Hong wrote:
>>>
>>> Barry,
>>> Your change indeed fixed the error of his testing code.
>>> As Satish tested, on your branch, ex16 runs smooth.
>>>
>>> I do not understand why, on the maint or master branch, ex16 crashes inside
>>> superlu_dist, but not with mumps.
>>>
>>>
>>> I also confirm that ex16 runs fine with latest fix, but unfortunately
>>> not my code.
>>>
>>> This is something to be expected, since my code preallocates once in the
>>> beginning, so there is no way it can be affected by multiple
>>> preallocations. Subsequently I only do matrix assembly, which makes sure
>>> the structure doesn't change (I set the option to raise an error otherwise).
>>>
>>> Summary: we don't have a simple test code to debug superlu issue anymore.
>>>
>>> Anton
>>>
>>> Hong
>>>
>>> On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay  wrote:
>>>
 On Mon, 24 Oct 2016, Barry Smith wrote:

 >
 > > [Or perhaps Hong is using a different test code and is observing
 bugs
 > > with superlu_dist interface..]
 >
 >She states that her test does a NEW MatCreate() for each matrix
 load (I cut and pasted it in the email I just sent). The bug I fixed was
 only related to using the SAME matrix from one MatLoad() in another
 MatLoad().

 Ah - ok.. Sorry - wasn't thinking clearly :(

 Satish

>>>
>>>
>>>
>>
>>
>
>


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-25 Thread Anton Popov

Hong,

I get all the problems gone and valgrind-clean output if I specify this:

-mat_superlu_dist_fact SamePattern_SameRowPerm

What does SamePattern_SameRowPerm actually mean?
Row permutations are for large diagonal, column permutations are for 
sparsity, right?
Will it skip subsequent matrix permutations for large diagonal even if 
matrix values change significantly?
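Anton's reading is right: the row permutation (rowperm LargeDiag) targets a large diagonal, while the column permutation targets sparsity. A rough sketch of the large-diagonal idea using SciPy's bipartite matching (note: this maximizes the sum of the matched |a_ij|, whereas MC64-style codes maximize the product, so it is only an approximation of what LargeDiag actually does):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import min_weight_full_bipartite_matching

A = np.array([[0.0, 9.0, 0.0],
              [5.0, 0.0, 1.0],
              [0.0, 2.0, 8.0]])

# Match rows to columns so the matched entries are large; the zeros of A
# are absent edges in the sparse matrix, so they can never be matched.
rows, cols = min_weight_full_bipartite_matching(csr_matrix(np.abs(A)),
                                                maximize=True)
perm = np.empty(3, dtype=int)
perm[cols] = rows              # row perm[j] supplies column j's diagonal entry
B = A[perm]                    # apply the row permutation Pr to A
print(np.diag(B))              # large, nonzero diagonal: [5. 9. 8.]
```

Under SamePattern_SameRowPerm this Pr is frozen, so if the values change significantly the reused permutation may no longer put large entries on the diagonal; that is presumably why this mode demands more care from the user.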


Surprisingly everything works even with:

-mat_superlu_dist_colperm PARMETIS
-mat_superlu_dist_parsymbfact TRUE

Thanks,
Anton

On 10/24/2016 09:06 PM, Hong wrote:

Anton:


If replacing superlu_dist with mumps, does your code work?

yes

You may use mumps in your code, or test different options for 
superlu_dist:


  -mat_superlu_dist_equil:  Equilibrate matrix (None)
  -mat_superlu_dist_rowperm  Row permutation (choose one 
of) LargeDiag NATURAL (None)
  -mat_superlu_dist_colperm  Column permutation 
(choose one of) NATURAL MMD_AT_PLUS_A MMD_ATA METIS_AT_PLUS_A PARMETIS 
(None)

  -mat_superlu_dist_replacetinypivot:  Replace tiny pivots (None)
  -mat_superlu_dist_parsymbfact:  Parallel symbolic 
factorization (None)
  -mat_superlu_dist_fact  Sparsity pattern for repeated 
matrix factorization (choose one of) SamePattern 
SamePattern_SameRowPerm (None)


The options inside <> are defaults. You may try others. This might 
help narrow down the bug.


Hong



Hong

On 10/24/2016 05:47 PM, Hong wrote:

Barry,
Your change indeed fixed the error of his testing code.
As Satish tested, on your branch, ex16 runs smoothly.

I do not understand why on maint or master branch, ex16
crashes inside superlu_dist, but not with mumps.



I also confirm that ex16 runs fine with latest fix, but
unfortunately not my code.

This is something to be expected, since my code preallocates
once in the beginning. So there is no way it can be affected
by multiple preallocations. Subsequently I only do matrix
assembly, that makes sure structure doesn't change (set to
get error otherwise).

Summary: we don't have a simple test code to debug superlu
issue anymore.

Anton


Hong

On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay
> wrote:

On Mon, 24 Oct 2016, Barry Smith wrote:

>
> > [Or perhaps Hong is using a different test code and is
observing bugs
> > with superlu_dist interface..]
>
>She states that her test does a NEW MatCreate() for
each matrix load (I cut and pasted it in the email I
just sent). The bug I fixed was only related to using
the SAME matrix from one MatLoad() in another MatLoad().

Ah - ok.. Sorry - wasn't thinking clearly :(

Satish












Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-25 Thread Anton Popov



On 10/25/2016 01:58 PM, Anton Popov wrote:



On 10/24/2016 10:32 PM, Barry Smith wrote:

Valgrind doesn't report any problems?



Valgrind hangs and never returns (waited hours for a 5 sec run) after 
entering factorization for the second time.


Before it happens it prints this (attached)

Anton
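
For reports like the "Use of uninitialised value" entries in the attached
log, valgrind can also name where the value originated; a sketch of the
extra flags (standard valgrind options; the executable name and arguments
are placeholders):

    mpiexec -n 2 valgrind --track-origins=yes --num-callers=20 ./app <args>

--track-origins=yes slows the run further, but each "uninitialised value"
report then includes the allocation or stack frame the data came from.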






On Oct 24, 2016, at 12:09 PM, Anton Popov  wrote:



On 10/24/2016 05:47 PM, Hong wrote:

Barry,
Your change indeed fixed the error of his testing code.
As Satish tested, on your branch, ex16 runs smoothly.

I do not understand why on maint or master branch, ex16 crashes 
inside superlu_dist, but not with mumps.


I also confirm that ex16 runs fine with latest fix, but 
unfortunately not my code.


This is something to be expected, since my code preallocates once in 
the beginning. So there is no way it can be affected by multiple 
preallocations. Subsequently I only do matrix assembly, that makes 
sure structure doesn't change (set to get error otherwise).


Summary: we don't have a simple test code to debug superlu issue 
anymore.


Anton


Hong

On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay  
wrote:

On Mon, 24 Oct 2016, Barry Smith wrote:

[Or perhaps Hong is using a different test code and is observing 
bugs

with superlu_dist interface..]
She states that her test does a NEW MatCreate() for each 
matrix load (I cut and pasted it in the email I just sent). The 
bug I fixed was only related to using the SAME matrix from one 
MatLoad() in another MatLoad().

Ah - ok.. Sorry - wasn't thinking clearly :(

Satish





USING PICARD JACOBIAN for iteration 0, ||F||/||F0||=1.00e+00
==10744== Use of uninitialised value of size 8
==10744==at 0x18087A8: static_schedule (static_schedule.c:960)
==10744==by 0x17D42AB: pdgstrf (pdgstrf.c:572)
==10744==by 0x17B94B1: pdgssvx (pdgssvx.c:1124)
==10744==by 0xA9E777: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==10744==by 0x6CAA90: MatLUFactorNumeric (matrix.c:3099)
==10744==by 0x137DFE9: PCSetUp_LU (lu.c:139)
==10744==by 0xECC779: PCSetUp (precon.c:968)
==10744==by 0x47AD01: PCStokesUserSetup (lsolve.c:602)
==10744==by 0x476EC3: PCStokesSetup (lsolve.c:173)
==10744==by 0x473BE4: FormJacobian (nlsolve.c:389)
==10744==by 0xF40C3D: SNESComputeJacobian (snes.c:2367)
==10745== Use of uninitialised value of size 8
==10745==at 0x18087A8: static_schedule (static_schedule.c:960)
==10745==by 0x17D42AB: pdgstrf (pdgstrf.c:572)
==10745==by 0x17B94B1: pdgssvx (pdgssvx.c:1124)
==10745==by 0xA9E777: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==10744==by 0xFA5F1F: SNESSolve_KSPONLY (ksponly.c:38)
==10744==
==10745==by 0x6CAA90: MatLUFactorNumeric (matrix.c:3099)
==10745==by 0x137DFE9: PCSetUp_LU (lu.c:139)
==10745==by 0xECC779: PCSetUp (precon.c:968)
==10745==by 0x47AD01: PCStokesUserSetup (lsolve.c:602)
==10745==by 0x476EC3: PCStokesSetup (lsolve.c:173)
==10745==by 0x473BE4: FormJacobian (nlsolve.c:389)
==10745==by 0xF40C3D: SNESComputeJacobian (snes.c:2367)
==10745==by 0xFA5F1F: SNESSolve_KSPONLY (ksponly.c:38)
==10745==
==10745== Invalid write of size 4
==10745==at 0x18087A8: static_schedule (static_schedule.c:960)
==10745==by 0x17D42AB: pdgstrf (pdgstrf.c:572)
==10745==by 0x17B94B1: pdgssvx (pdgssvx.c:1124)
==10745==by 0xA9E777: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==10745==by 0x6CAA90: MatLUFactorNumeric (matrix.c:3099)
==10745==by 0x137DFE9: PCSetUp_LU (lu.c:139)
==10745==by 0xECC779: PCSetUp (precon.c:968)
==10745==by 0x47AD01: PCStokesUserSetup (lsolve.c:602)
==10745==by 0x476EC3: PCStokesSetup (lsolve.c:173)
==10745==by 0x473BE4: FormJacobian (nlsolve.c:389)
==10745==by 0xF40C3D: SNESComputeJacobian (snes.c:2367)
==10745==by 0xFA5F1F: SNESSolve_KSPONLY (ksponly.c:38)
==10745==  Address 0xa077c48 is 200 bytes inside a block of size 13,936 free'd
==10745==at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10745==by 0x17A8A16: superlu_free_dist (memory.c:124)
==10745==by 0x18086A2: static_schedule (static_schedule.c:946)
==10745==by 0x17D42AB: pdgstrf (pdgstrf.c:572)
==10745==by 0x17B94B1: pdgssvx (pdgssvx.c:1124)
==10745==by 0xA9E777: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==10745==by 0x6CAA90: MatLUFactorNumeric (matrix.c:3099)
==10745==by 0x137DFE9: PCSetUp_LU (lu.c:139)
==10745==by 0xECC779: PCSetUp (precon.c:968)
==10745==by 0x47AD01: PCStokesUserSetup (lsolve.c:602)
==10745==by 0x476EC3: PCStokesSetup (lsolve.c:173)
==10745==by 0x473BE4: FormJacobian (nlsolve.c:389)
==10745==  Block was alloc'd at
==10745==at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10745==by 0x17A89F4: superlu_malloc_dist (memory.c:118)
==10745==by 0x18051F5: static_schedule (static_schedule.c:274)
==10745==by 0x17D42AB: pdgstrf (pdgstrf.c:572)

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-25 Thread Anton Popov



On 10/24/2016 10:32 PM, Barry Smith wrote:

Valgrind doesn't report any problems?



Valgrind hangs and never returns (waited hours for a 5 sec run) after 
entering factorization for the second time.



On Oct 24, 2016, at 12:09 PM, Anton Popov  wrote:



On 10/24/2016 05:47 PM, Hong wrote:

Barry,
Your change indeed fixed the error of his testing code.
As Satish tested, on your branch, ex16 runs smoothly.

I do not understand why on maint or master branch, ex16 crashes inside 
superlu_dist, but not with mumps.


I also confirm that ex16 runs fine with latest fix, but unfortunately not my 
code.

This is something to be expected, since my code preallocates once in the 
beginning. So there is no way it can be affected by multiple preallocations. 
Subsequently I only do matrix assembly, that makes sure structure doesn't 
change (set to get error otherwise).

Summary: we don't have a simple test code to debug superlu issue anymore.

Anton


Hong

On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay  wrote:
On Mon, 24 Oct 2016, Barry Smith wrote:


[Or perhaps Hong is using a different test code and is observing bugs
with superlu_dist interface..]

She states that her test does a NEW MatCreate() for each matrix load (I cut 
and pasted it in the email I just sent). The bug I fixed was only related to 
using the SAME matrix from one MatLoad() in another MatLoad().

Ah - ok.. Sorry - wasn't thinking clearly :(

Satish





Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Barry Smith

   Valgrind doesn't report any problems?


> On Oct 24, 2016, at 12:09 PM, Anton Popov  wrote:
> 
> 
> 
> On 10/24/2016 05:47 PM, Hong wrote:
>> Barry,
>> Your change indeed fixed the error of his testing code.
>> As Satish tested, on your branch, ex16 runs smoothly.
>> 
>> I do not understand why on maint or master branch, ex16 crashes inside 
>> superlu_dist, but not with mumps. 
>> 
> 
> I also confirm that ex16 runs fine with latest fix, but unfortunately not my 
> code.
> 
> This is something to be expected, since my code preallocates once in the 
> beginning. So there is no way it can be affected by multiple preallocations. 
> Subsequently I only do matrix assembly, that makes sure structure doesn't 
> change (set to get error otherwise).
> 
> Summary: we don't have a simple test code to debug superlu issue anymore.
> 
> Anton
> 
>> Hong
>> 
>> On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay  wrote:
>> On Mon, 24 Oct 2016, Barry Smith wrote:
>> 
>> >
>> > > [Or perhaps Hong is using a different test code and is observing bugs
>> > > with superlu_dist interface..]
>> >
>> >She states that her test does a NEW MatCreate() for each matrix load (I 
>> > cut and pasted it in the email I just sent). The bug I fixed was only 
>> > related to using the SAME matrix from one MatLoad() in another MatLoad().
>> 
>> Ah - ok.. Sorry - wasn't thinking clearly :(
>> 
>> Satish
>> 
> 



Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Hong
Anton:
>
> If replacing superlu_dist with mumps, does your code work?
>
> yes
>

You may use mumps in your code, or test different options for superlu_dist:

  -mat_superlu_dist_equil:  Equilibrate matrix (None)
  -mat_superlu_dist_rowperm  Row permutation (choose one of)
LargeDiag NATURAL (None)
  -mat_superlu_dist_colperm  Column permutation (choose
one of) NATURAL MMD_AT_PLUS_A MMD_ATA METIS_AT_PLUS_A PARMETIS (None)
  -mat_superlu_dist_replacetinypivot:  Replace tiny pivots (None)
  -mat_superlu_dist_parsymbfact:  Parallel symbolic factorization
(None)
  -mat_superlu_dist_fact  Sparsity pattern for repeated matrix
factorization (choose one of) SamePattern SamePattern_SameRowPerm (None)

The options inside <> are defaults. You may try others. This might help
narrow down the bug.

Hong
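
For reference, these options take effect with an LU preconditioner backed by
superlu_dist; a hypothetical invocation of the test binary (reusing the ex16
run and matrix file mentioned elsewhere in this thread):

    mpiexec -n 2 ./ex16 -f ~/datafiles/matrices/small -pc_type lu \
        -pc_factor_mat_solver_package superlu_dist \
        -mat_superlu_dist_fact SamePattern_SameRowPerm

-pc_factor_mat_solver_package is the PETSc 3.7 option that selects the
factorization package.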

>
> Hong
>>
>> On 10/24/2016 05:47 PM, Hong wrote:
>>
>> Barry,
>> Your change indeed fixed the error of his testing code.
>> As Satish tested, on your branch, ex16 runs smoothly.
>>
>> I do not understand why on maint or master branch, ex16 crashes inside
>> superlu_dist, but not with mumps.
>>
>>
>> I also confirm that ex16 runs fine with latest fix, but unfortunately not
>> my code.
>>
>> This is something to be expected, since my code preallocates once in the
>> beginning. So there is no way it can be affected by multiple
>> preallocations. Subsequently I only do matrix assembly, that makes sure
>> structure doesn't change (set to get error otherwise).
>>
>> Summary: we don't have a simple test code to debug superlu issue anymore.
>>
>> Anton
>>
>> Hong
>>
>> On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay  wrote:
>>
>>> On Mon, 24 Oct 2016, Barry Smith wrote:
>>>
>>> >
>>> > > [Or perhaps Hong is using a different test code and is observing bugs
>>> > > with superlu_dist interface..]
>>> >
>>> >She states that her test does a NEW MatCreate() for each matrix
>>> load (I cut and pasted it in the email I just sent). The bug I fixed was
>>> only related to using the SAME matrix from one MatLoad() in another
>>> MatLoad().
>>>
>>> Ah - ok.. Sorry - wasn't thinking clearly :(
>>>
>>> Satish
>>>
>>
>>
>>
>
>


Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Anton



On 10/24/16 8:21 PM, Hong wrote:

Anton:
If replacing superlu_dist with mumps, does your code work?

yes

Hong

On 10/24/2016 05:47 PM, Hong wrote:

Barry,
Your change indeed fixed the error of his testing code.
As Satish tested, on your branch, ex16 runs smoothly.

I do not understand why on maint or master branch, ex16 crashes
inside superlu_dist, but not with mumps.



I also confirm that ex16 runs fine with latest fix, but
unfortunately not my code.

This is something to be expected, since my code preallocates once
in the beginning. So there is no way it can be affected by
multiple preallocations. Subsequently I only do matrix assembly,
that makes sure structure doesn't change (set to get error otherwise).

Summary: we don't have a simple test code to debug superlu issue
anymore.

Anton


Hong

On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay > wrote:

On Mon, 24 Oct 2016, Barry Smith wrote:

>
> > [Or perhaps Hong is using a different test code and is observing 
bugs
> > with superlu_dist interface..]
>
>She states that her test does a NEW MatCreate() for each
matrix load (I cut and pasted it in the email I just sent).
The bug I fixed was only related to using the SAME matrix
from one MatLoad() in another MatLoad().

Ah - ok.. Sorry - wasn't thinking clearly :(

Satish









Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Hong
Anton:
If replacing superlu_dist with mumps, does your code work?
Hong
>
> On 10/24/2016 05:47 PM, Hong wrote:
>
> Barry,
> Your change indeed fixed the error of his testing code.
> As Satish tested, on your branch, ex16 runs smoothly.
>
> I do not understand why on maint or master branch, ex16 crashes inside
> superlu_dist, but not with mumps.
>
>
> I also confirm that ex16 runs fine with latest fix, but unfortunately not
> my code.
>
> This is something to be expected, since my code preallocates once in the
> beginning. So there is no way it can be affected by multiple
> preallocations. Subsequently I only do matrix assembly, that makes sure
> structure doesn't change (set to get error otherwise).
>
> Summary: we don't have a simple test code to debug superlu issue anymore.
>
> Anton
>
> Hong
>
> On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay  wrote:
>
>> On Mon, 24 Oct 2016, Barry Smith wrote:
>>
>> >
>> > > [Or perhaps Hong is using a different test code and is observing bugs
>> > > with superlu_dist interface..]
>> >
>> >She states that her test does a NEW MatCreate() for each matrix load
>> (I cut and pasted it in the email I just sent). The bug I fixed was only
>> related to using the SAME matrix from one MatLoad() in another MatLoad().
>>
>> Ah - ok.. Sorry - wasn't thinking clearly :(
>>
>> Satish
>>
>
>
>


Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Anton Popov



On 10/24/2016 05:47 PM, Hong wrote:

Barry,
Your change indeed fixed the error of his testing code.
As Satish tested, on your branch, ex16 runs smoothly.

I do not understand why on maint or master branch, ex16 crashes inside 
superlu_dist, but not with mumps.




I also confirm that ex16 runs fine with latest fix, but unfortunately 
not my code.


This is to be expected, since my code preallocates once at the 
beginning, so there is no way it can be affected by multiple 
preallocations. Subsequently I only do matrix assembly, which makes sure 
the structure doesn't change (it is set to raise an error otherwise).


Summary: we don't have a simple test code to debug superlu issue anymore.

Anton


Hong

On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay > wrote:


On Mon, 24 Oct 2016, Barry Smith wrote:

>
> > [Or perhaps Hong is using a different test code and is observing bugs
> > with superlu_dist interface..]
>
>She states that her test does a NEW MatCreate() for each
matrix load (I cut and pasted it in the email I just sent). The
bug I fixed was only related to using the SAME matrix from one
MatLoad() in another MatLoad().

Ah - ok.. Sorry - wasn't thinking clearly :(

Satish






Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Hong
Barry,
Your change indeed fixed the error of his testing code.
As Satish tested, on your branch, ex16 runs smoothly.

I do not understand why on maint or master branch, ex16 crashes inside
superlu_dist, but not with mumps.

Hong

On Mon, Oct 24, 2016 at 9:34 AM, Satish Balay  wrote:

> On Mon, 24 Oct 2016, Barry Smith wrote:
>
> >
> > > [Or perhaps Hong is using a different test code and is observing bugs
> > > with superlu_dist interface..]
> >
> >She states that her test does a NEW MatCreate() for each matrix load
> (I cut and pasted it in the email I just sent). The bug I fixed was only
> related to using the SAME matrix from one MatLoad() in another MatLoad().
>
> Ah - ok.. Sorry - wasn't thinking clearly :(
>
> Satish
>


Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Satish Balay
On Mon, 24 Oct 2016, Barry Smith wrote:

> 
> > [Or perhaps Hong is using a different test code and is observing bugs
> > with superlu_dist interface..]
> 
>She states that her test does a NEW MatCreate() for each matrix load (I 
> cut and pasted it in the email I just sent). The bug I fixed was only related 
> to using the SAME matrix from one MatLoad() in another MatLoad(). 

Ah - ok.. Sorry - wasn't thinking clearly :(

Satish


Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Barry Smith

> On Oct 24, 2016, at 9:24 AM, Kong, Fande  wrote:
> 
> 
> 
> On Mon, Oct 24, 2016 at 8:07 AM, Kong, Fande  wrote:
> 
> 
> On Sun, Oct 23, 2016 at 3:56 PM, Barry Smith  wrote:
> 
>Thanks Satish,
> 
>   I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant  (in 
> next for testing)
> 
> Fande,
> 
> This will also make MatMPIAIJSetPreallocation() work properly with 
> multiple calls (you will not need a MatReset()).
> 
> 
> Does this work for MPIAIJ only? There are also other functions:  
> MatSeqAIJSetPreallocation(), MatMPIAIJSetPreallocation(), 
> MatSeqBAIJSetPreallocation(), MatMPIBAIJSetPreallocation(), 
> MatSeqSBAIJSetPreallocation(), MatMPISBAIJSetPreallocation(), and 
> MatXAIJSetPreallocation. 

  It works for all of them.

> 
> We have to use a different function for each type. Could we have a 
> unified interface for all of them? 

  Supposedly you can call MatXAIJSetPreallocation() and it is the same as 
calling all of them, so I think it is a "unified" interface.

   Barry

> 
> Fande,
>  
> 
>Barry
> 
> Thanks, Barry.
> 
> Fande,
>  
> 
> 
> > On Oct 21, 2016, at 6:48 PM, Satish Balay  wrote:
> >
> > On Fri, 21 Oct 2016, Barry Smith wrote:
> >
> >>
> >>  valgrind first
> >
> > balay@asterix /home/balay/download-pine/x/superlu_dist_test
> > $ mpiexec -n 2 $VG ./ex16 -f ~/datafiles/matrices/small
> > First MatLoad!
> > Mat Object: 2 MPI processes
> >  type: mpiaij
> > row 0: (0, 4.)  (1, -1.)  (6, -1.)
> > row 1: (0, -1.)  (1, 4.)  (2, -1.)  (7, -1.)
> > row 2: (1, -1.)  (2, 4.)  (3, -1.)  (8, -1.)
> > row 3: (2, -1.)  (3, 4.)  (4, -1.)  (9, -1.)
> > row 4: (3, -1.)  (4, 4.)  (5, -1.)  (10, -1.)
> > row 5: (4, -1.)  (5, 4.)  (11, -1.)
> > row 6: (0, -1.)  (6, 4.)  (7, -1.)  (12, -1.)
> > row 7: (1, -1.)  (6, -1.)  (7, 4.)  (8, -1.)  (13, -1.)
> > row 8: (2, -1.)  (7, -1.)  (8, 4.)  (9, -1.)  (14, -1.)
> > row 9: (3, -1.)  (8, -1.)  (9, 4.)  (10, -1.)  (15, -1.)
> > row 10: (4, -1.)  (9, -1.)  (10, 4.)  (11, -1.)  (16, -1.)
> > row 11: (5, -1.)  (10, -1.)  (11, 4.)  (17, -1.)
> > row 12: (6, -1.)  (12, 4.)  (13, -1.)  (18, -1.)
> > row 13: (7, -1.)  (12, -1.)  (13, 4.)  (14, -1.)  (19, -1.)
> > row 14: (8, -1.)  (13, -1.)  (14, 4.)  (15, -1.)  (20, -1.)
> > row 15: (9, -1.)  (14, -1.)  (15, 4.)  (16, -1.)  (21, -1.)
> > row 16: (10, -1.)  (15, -1.)  (16, 4.)  (17, -1.)  (22, -1.)
> > row 17: (11, -1.)  (16, -1.)  (17, 4.)  (23, -1.)
> > row 18: (12, -1.)  (18, 4.)  (19, -1.)  (24, -1.)
> > row 19: (13, -1.)  (18, -1.)  (19, 4.)  (20, -1.)  (25, -1.)
> > row 20: (14, -1.)  (19, -1.)  (20, 4.)  (21, -1.)  (26, -1.)
> > row 21: (15, -1.)  (20, -1.)  (21, 4.)  (22, -1.)  (27, -1.)
> > row 22: (16, -1.)  (21, -1.)  (22, 4.)  (23, -1.)  (28, -1.)
> > row 23: (17, -1.)  (22, -1.)  (23, 4.)  (29, -1.)
> > row 24: (18, -1.)  (24, 4.)  (25, -1.)  (30, -1.)
> > row 25: (19, -1.)  (24, -1.)  (25, 4.)  (26, -1.)  (31, -1.)
> > row 26: (20, -1.)  (25, -1.)  (26, 4.)  (27, -1.)  (32, -1.)
> > row 27: (21, -1.)  (26, -1.)  (27, 4.)  (28, -1.)  (33, -1.)
> > row 28: (22, -1.)  (27, -1.)  (28, 4.)  (29, -1.)  (34, -1.)
> > row 29: (23, -1.)  (28, -1.)  (29, 4.)  (35, -1.)
> > row 30: (24, -1.)  (30, 4.)  (31, -1.)
> > row 31: (25, -1.)  (30, -1.)  (31, 4.)  (32, -1.)
> > row 32: (26, -1.)  (31, -1.)  (32, 4.)  (33, -1.)
> > row 33: (27, -1.)  (32, -1.)  (33, 4.)  (34, -1.)
> > row 34: (28, -1.)  (33, -1.)  (34, 4.)  (35, -1.)
> > row 35: (29, -1.)  (34, -1.)  (35, 4.)
> > Second MatLoad!
> > Mat Object: 2 MPI processes
> >  type: mpiaij
> > ==4592== Invalid read of size 4
> > ==4592==at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket (mpiaij.c:1402)
> > ==4592==by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
> > ==4592==by 0x53373D7: MatView (matrix.c:989)
> > ==4592==by 0x40107E: main (ex16.c:30)
> > ==4592==  Address 0xa47b460 is 20 bytes after a block of size 28 alloc'd
> > ==4592==at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
> > ==4592==by 0x4FD121A: PetscMallocAlign (mal.c:28)
> > ==4592==by 0x5842C70: MatSetUpMultiply_MPIAIJ (mmaij.c:41)
> > ==4592==by 0x5809943: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
> > ==4592==by 0x536B299: MatAssemblyEnd (matrix.c:5298)
> > ==4592==by 0x5829C05: MatLoad_MPIAIJ (mpiaij.c:3032)
> > ==4592==by 0x5337FEA: MatLoad (matrix.c:1101)
> > ==4592==by 0x400D9F: main (ex16.c:22)
> > ==4592==
> > ==4591== Invalid read of size 4
> > ==4591==at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket (mpiaij.c:1402)
> > ==4591==by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
> > ==4591==by 0x53373D7: MatView (matrix.c:989)
> > ==4591==by 0x40107E: main (ex16.c:30)
> > ==4591==  Address 0xa482958 is 24 bytes before a block of size 7 alloc'd
> > ==4591==at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
> > ==4591==by 0x4FD121A: PetscMallocAlign (mal.c:28)
> > ==4591==by 0x4F31FB5: PetscStrallocpy (str.c:197)
> 

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Barry Smith

> [Or perhaps Hong is using a different test code and is observing bugs
> with superlu_dist interface..]

   She states that her test does a NEW MatCreate() for each matrix load (I cut 
and pasted it in the email I just sent). The bug I fixed was only related to 
using the SAME matrix from one MatLoad() in another MatLoad(). 

  Barry



> On Oct 24, 2016, at 9:25 AM, Satish Balay <ba...@mcs.anl.gov> wrote:
> 
> Yes - but this test code [that Hong is also using] is buggy due to
> using MatLoad() twice - so the corrupted matrix does have weird
> behavior later in PC.
> 
> With your fix - the test code provided by Anton behaves fine for
> me. So Hong would have to restart the diagnosis - and I suspect all
> the weird behavior she observed will go away [well, I don't see the
> original weird behavior with this test code anymore].
> 
> Since you said "This will also make MatMPIAIJSetPreallocation() work
> properly with multiple calls" - perhaps Anton's issue is also somehow
> related? I think it's best if he can try this fix.
> 
> And if it doesn't work - then we'll need a better test case to
> reproduce.
> 
> [Or perhaps Hong is using a different test code and is observing bugs
> with superlu_dist interface..]
> 
> Satish
> 
> On Mon, 24 Oct 2016, Barry Smith wrote:
> 
>> 
>>   Hong wrote:  (Note that it creates a new Mat each time so shouldn't be 
>> affected by the bug I fixed; it also "works" with MUMPs but not 
>> superlu_dist.)
>> 
>> 
>> It is not a problem with MatLoad twice. The file has one matrix, but it is 
>> loaded twice.
>> 
>> Replacing pc with ksp, the code runs fine. 
>> The error occurs when PCSetUp_LU() is called with SAME_NONZERO_PATTERN.
>> I'll further look at it later.
>> 
>> Hong
>> 
>> From: Zhang, Hong
>> Sent: Friday, October 21, 2016 8:18 PM
>> To: Barry Smith; petsc-users
>> Subject: RE: [petsc-users] SuperLU_dist issue in 3.7.4
>> 
>> I am investigating it. The file has two matrices. The code takes the 
>> following steps:
>> 
>> PCCreate(PETSC_COMM_WORLD, &pc);
>> 
>> MatCreate(PETSC_COMM_WORLD, &A);
>> MatLoad(A,fd);
>> PCSetOperators(pc,A,A);
>> PCSetUp(pc);
>> 
>> MatCreate(PETSC_COMM_WORLD, &A);
>> MatLoad(A,fd);
>> PCSetOperators(pc,A,A);
>> PCSetUp(pc);  // crash here with np=2, superlu_dist, not with mumps/superlu 
>> or superlu_dist np=1
>> 
>> Hong
>> 
>>> On Oct 24, 2016, at 9:00 AM, Satish Balay <ba...@mcs.anl.gov> wrote:
>>> 
>>> Since the provided test code doesn't crash [and is valgrind clean] -
>>> with this fix - I'm not sure what bug Hong is chasing..
>>> 
>>> Satish
>>> 
>>> On Mon, 24 Oct 2016, Barry Smith wrote:
>>> 
>>>> 
>>>> Anton,
>>>> 
>>>>  Sorry for any confusion. This doesn't resolve the SuperLU_DIST issue 
>>>> which I think Hong is working on, this only resolves multiple loads of 
>>>> matrices into the same Mat.
>>>> 
>>>> Barry
>>>> 
>>>>> On Oct 24, 2016, at 5:07 AM, Anton Popov <po...@uni-mainz.de> wrote:
>>>>> 
>>>>> Thank you Barry, Satish, Fande!
>>>>> 
>>>>> Is there a chance to get this fix in the maintenance release 3.7.5 
>>>>> together with the latest SuperLU_DIST? Or next release is a more 
>>>>> realistic option?
>>>>> 
>>>>> Anton
>>>>> 
>>>>> On 10/24/2016 01:58 AM, Satish Balay wrote:
>>>>>> The original testcode from Anton also works [i.e is valgrind clean] with 
>>>>>> this change..
>>>>>> 
>>>>>> Satish
>>>>>> 
>>>>>> On Sun, 23 Oct 2016, Barry Smith wrote:
>>>>>> 
>>>>>>>  Thanks Satish,
>>>>>>> 
>>>>>>> I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant  
>>>>>>> (in next for testing)
>>>>>>> 
>>>>>>>   Fande,
>>>>>>> 
>>>>>>>   This will also make MatMPIAIJSetPreallocation() work properly 
>>>>>>> with multiple calls (you will not need a MatReset()).
>>>>>>> 
>>>>>>>  Barry
>>>>>>> 
>>>>>>> 
>>>>>>>> On Oct 21, 2016, at 6:48 PM, Satish Balay <ba...@mcs.anl.gov

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Satish Balay
Yes - but this test code [that Hong is also using] is buggy due to
using MatLoad() twice - so the corrupted matrix does have weird
behavior later in PC.

With your fix - the test code provided by Anton behaves fine for
me. So Hong would have to restart the diagnosis - and I suspect all
the weird behavior she observed will go away [well, I don't see the
original weird behavior with this test code anymore].

Since you said "This will also make MatMPIAIJSetPreallocation() work
properly with multiple calls" - perhaps Anton's issue is also somehow
related? I think it's best if he can try this fix.

And if it doesn't work - then we'll need a better test case to
reproduce.

[Or perhaps Hong is using a different test code and is observing bugs
with superlu_dist interface..]

Satish

On Mon, 24 Oct 2016, Barry Smith wrote:

> 
>Hong wrote:  (Note that it creates a new Mat each time so shouldn't be 
> affected by the bug I fixed; it also "works" with MUMPs but not superlu_dist.)
> 
> 
> It is not a problem with MatLoad twice. The file has one matrix, but it is 
> loaded twice.
> 
> Replacing pc with ksp, the code runs fine. 
> The error occurs when PCSetUp_LU() is called with SAME_NONZERO_PATTERN.
> I'll further look at it later.
> 
> Hong
> 
> From: Zhang, Hong
> Sent: Friday, October 21, 2016 8:18 PM
> To: Barry Smith; petsc-users
> Subject: RE: [petsc-users] SuperLU_dist issue in 3.7.4
> 
> I am investigating it. The file has two matrices. The code takes the 
> following steps:
> 
> PCCreate(PETSC_COMM_WORLD, &pc);
> 
> MatCreate(PETSC_COMM_WORLD, &A);
> MatLoad(A,fd);
> PCSetOperators(pc,A,A);
> PCSetUp(pc);
> 
> MatCreate(PETSC_COMM_WORLD, &A);
> MatLoad(A,fd);
> PCSetOperators(pc,A,A);
> PCSetUp(pc);  // crash here with np=2, superlu_dist, not with mumps/superlu or 
> superlu_dist np=1
> 
> Hong
> 
> > On Oct 24, 2016, at 9:00 AM, Satish Balay <ba...@mcs.anl.gov> wrote:
> > 
> > Since the provided test code doesn't crash [and is valgrind clean] -
> > with this fix - I'm not sure what bug Hong is chasing..
> > 
> > Satish
> > 
> > On Mon, 24 Oct 2016, Barry Smith wrote:
> > 
> >> 
> >>  Anton,
> >> 
> >>   Sorry for any confusion. This doesn't resolve the SuperLU_DIST issue 
> >> which I think Hong is working on, this only resolves multiple loads of 
> >> matrices into the same Mat.
> >> 
> >>  Barry
> >> 
> >>> On Oct 24, 2016, at 5:07 AM, Anton Popov <po...@uni-mainz.de> wrote:
> >>> 
> >>> Thank you Barry, Satish, Fande!
> >>> 
> >>> Is there a chance to get this fix in the maintenance release 3.7.5 
> >>> together with the latest SuperLU_DIST? Or next release is a more 
> >>> realistic option?
> >>> 
> >>> Anton
> >>> 
> >>> On 10/24/2016 01:58 AM, Satish Balay wrote:
> >>>> The original testcode from Anton also works [i.e is valgrind clean] with 
> >>>> this change..
> >>>> 
> >>>> Satish
> >>>> 
> >>>> On Sun, 23 Oct 2016, Barry Smith wrote:
> >>>> 
> >>>>>   Thanks Satish,
> >>>>> 
> >>>>>  I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant 
> >>>>>  (in next for testing)
> >>>>> 
> >>>>>Fande,
> >>>>> 
> >>>>>This will also make MatMPIAIJSetPreallocation() work properly 
> >>>>> with multiple calls (you will not need a MatReset()).
> >>>>> 
> >>>>>   Barry
> >>>>> 
> >>>>> 
> >>>>>> On Oct 21, 2016, at 6:48 PM, Satish Balay <ba...@mcs.anl.gov> wrote:
> >>>>>> 
> >>>>>> On Fri, 21 Oct 2016, Barry Smith wrote:
> >>>>>> 
> >>>>>>> valgrind first
> >>>>>> balay@asterix /home/balay/download-pine/x/superlu_dist_test
> >>>>>> $ mpiexec -n 2 $VG ./ex16 -f ~/datafiles/matrices/small
> >>>>>> First MatLoad!
> >>>>>> Mat Object: 2 MPI processes
> >>>>>> type: mpiaij
> >>>>>> row 0: (0, 4.)  (1, -1.)  (6, -1.)
> >>>>>> row 1: (0, -1.)  (1, 4.)  (2, -1.)  (7, -1.)
> >>>>>> row 2: (1, -1.)  (2, 4.)  (3, -1.)  (8, -1.)
> >>>>>> row 3: (2, -1.)  (3, 4.)  (4, -1.)  (9, -1.)
> >>>>>> row 4: (3, -1.)  (4, 4.

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Kong, Fande
On Mon, Oct 24, 2016 at 8:07 AM, Kong, Fande  wrote:

>
>
> On Sun, Oct 23, 2016 at 3:56 PM, Barry Smith  wrote:
>
>>
>>Thanks Satish,
>>
>>   I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant
>> (in next for testing)
>>
>> Fande,
>>
>> This will also make MatMPIAIJSetPreallocation() work properly
>> with multiple calls (you will not need a MatReset()).
>>
>

Does this work for MPIAIJ only? There are also other functions:
MatSeqAIJSetPreallocation(), MatMPIAIJSetPreallocation(),
MatSeqBAIJSetPreallocation(), MatMPIBAIJSetPreallocation(),
MatSeqSBAIJSetPreallocation(), MatMPISBAIJSetPreallocation(), and
MatXAIJSetPreallocation.

We have to use a different function for each type. Could we have a
unified interface for all of them?

Fande,


>
>>Barry
>>
>
> Thanks, Barry.
>
> Fande,
>
>
>>
>>
>> > On Oct 21, 2016, at 6:48 PM, Satish Balay  wrote:
>> >
>> > On Fri, 21 Oct 2016, Barry Smith wrote:
>> >
>> >>
>> >>  valgrind first
>> >
>> > balay@asterix /home/balay/download-pine/x/superlu_dist_test
>> > $ mpiexec -n 2 $VG ./ex16 -f ~/datafiles/matrices/small
>> > First MatLoad!
>> > Mat Object: 2 MPI processes
>> >  type: mpiaij
>> > row 0: (0, 4.)  (1, -1.)  (6, -1.)
>> > row 1: (0, -1.)  (1, 4.)  (2, -1.)  (7, -1.)
>> > row 2: (1, -1.)  (2, 4.)  (3, -1.)  (8, -1.)
>> > row 3: (2, -1.)  (3, 4.)  (4, -1.)  (9, -1.)
>> > row 4: (3, -1.)  (4, 4.)  (5, -1.)  (10, -1.)
>> > row 5: (4, -1.)  (5, 4.)  (11, -1.)
>> > row 6: (0, -1.)  (6, 4.)  (7, -1.)  (12, -1.)
>> > row 7: (1, -1.)  (6, -1.)  (7, 4.)  (8, -1.)  (13, -1.)
>> > row 8: (2, -1.)  (7, -1.)  (8, 4.)  (9, -1.)  (14, -1.)
>> > row 9: (3, -1.)  (8, -1.)  (9, 4.)  (10, -1.)  (15, -1.)
>> > row 10: (4, -1.)  (9, -1.)  (10, 4.)  (11, -1.)  (16, -1.)
>> > row 11: (5, -1.)  (10, -1.)  (11, 4.)  (17, -1.)
>> > row 12: (6, -1.)  (12, 4.)  (13, -1.)  (18, -1.)
>> > row 13: (7, -1.)  (12, -1.)  (13, 4.)  (14, -1.)  (19, -1.)
>> > row 14: (8, -1.)  (13, -1.)  (14, 4.)  (15, -1.)  (20, -1.)
>> > row 15: (9, -1.)  (14, -1.)  (15, 4.)  (16, -1.)  (21, -1.)
>> > row 16: (10, -1.)  (15, -1.)  (16, 4.)  (17, -1.)  (22, -1.)
>> > row 17: (11, -1.)  (16, -1.)  (17, 4.)  (23, -1.)
>> > row 18: (12, -1.)  (18, 4.)  (19, -1.)  (24, -1.)
>> > row 19: (13, -1.)  (18, -1.)  (19, 4.)  (20, -1.)  (25, -1.)
>> > row 20: (14, -1.)  (19, -1.)  (20, 4.)  (21, -1.)  (26, -1.)
>> > row 21: (15, -1.)  (20, -1.)  (21, 4.)  (22, -1.)  (27, -1.)
>> > row 22: (16, -1.)  (21, -1.)  (22, 4.)  (23, -1.)  (28, -1.)
>> > row 23: (17, -1.)  (22, -1.)  (23, 4.)  (29, -1.)
>> > row 24: (18, -1.)  (24, 4.)  (25, -1.)  (30, -1.)
>> > row 25: (19, -1.)  (24, -1.)  (25, 4.)  (26, -1.)  (31, -1.)
>> > row 26: (20, -1.)  (25, -1.)  (26, 4.)  (27, -1.)  (32, -1.)
>> > row 27: (21, -1.)  (26, -1.)  (27, 4.)  (28, -1.)  (33, -1.)
>> > row 28: (22, -1.)  (27, -1.)  (28, 4.)  (29, -1.)  (34, -1.)
>> > row 29: (23, -1.)  (28, -1.)  (29, 4.)  (35, -1.)
>> > row 30: (24, -1.)  (30, 4.)  (31, -1.)
>> > row 31: (25, -1.)  (30, -1.)  (31, 4.)  (32, -1.)
>> > row 32: (26, -1.)  (31, -1.)  (32, 4.)  (33, -1.)
>> > row 33: (27, -1.)  (32, -1.)  (33, 4.)  (34, -1.)
>> > row 34: (28, -1.)  (33, -1.)  (34, 4.)  (35, -1.)
>> > row 35: (29, -1.)  (34, -1.)  (35, 4.)
>> > Second MatLoad!
>> > Mat Object: 2 MPI processes
>> >  type: mpiaij
>> > ==4592== Invalid read of size 4
>> > ==4592==at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket
>> (mpiaij.c:1402)
>> > ==4592==by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
>> > ==4592==by 0x53373D7: MatView (matrix.c:989)
>> > ==4592==by 0x40107E: main (ex16.c:30)
>> > ==4592==  Address 0xa47b460 is 20 bytes after a block of size 28 alloc'd
>> > ==4592==at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
>> > ==4592==by 0x4FD121A: PetscMallocAlign (mal.c:28)
>> > ==4592==by 0x5842C70: MatSetUpMultiply_MPIAIJ (mmaij.c:41)
>> > ==4592==by 0x5809943: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
>> > ==4592==by 0x536B299: MatAssemblyEnd (matrix.c:5298)
>> > ==4592==by 0x5829C05: MatLoad_MPIAIJ (mpiaij.c:3032)
>> > ==4592==by 0x5337FEA: MatLoad (matrix.c:1101)
>> > ==4592==by 0x400D9F: main (ex16.c:22)
>> > ==4592==
>> > ==4591== Invalid read of size 4
>> > ==4591==at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket
>> (mpiaij.c:1402)
>> > ==4591==by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
>> > ==4591==by 0x53373D7: MatView (matrix.c:989)
>> > ==4591==by 0x40107E: main (ex16.c:30)
>> > ==4591==  Address 0xa482958 is 24 bytes before a block of size 7 alloc'd
>> > ==4591==at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
>> > ==4591==by 0x4FD121A: PetscMallocAlign (mal.c:28)
>> > ==4591==by 0x4F31FB5: PetscStrallocpy (str.c:197)
>> > ==4591==by 0x4F0D3F5: PetscClassRegLogRegister (classlog.c:253)
>> > ==4591==by 0x4EF96E2: PetscClassIdRegister (plog.c:2053)
>> > ==4591==by 0x51FA018: VecInitializePackage 

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Barry Smith

   Hong wrote (note that the test creates a new Mat each time, so it shouldn't be
affected by the bug I fixed; it also "works" with MUMPS but not SuperLU_DIST):


It is not a problem with calling MatLoad() twice. The file has one matrix, but it
is loaded twice.

Replacing the PC with a KSP, the code runs fine.
The error occurs when PCSetUp_LU() is called with SAME_NONZERO_PATTERN.
I'll look into it further later.

Hong

From: Zhang, Hong
Sent: Friday, October 21, 2016 8:18 PM
To: Barry Smith; petsc-users
Subject: RE: [petsc-users] SuperLU_dist issue in 3.7.4

I am investigating it. The file has two matrices. The code takes the following
steps:

PCCreate(PETSC_COMM_WORLD, &pc);

MatCreate(PETSC_COMM_WORLD, &A);
MatLoad(A, fd);
PCSetOperators(pc, A, A);
PCSetUp(pc);

MatCreate(PETSC_COMM_WORLD, &A);
MatLoad(A, fd);
PCSetOperators(pc, A, A);
PCSetUp(pc);  /* crashes here with np=2 and superlu_dist; not with mumps,
superlu, or superlu_dist at np=1 */

Hong
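For reference, the steps above can be put into a self-contained driver. This is
only an approximation of the failing test (not the actual ex16.c); the file name,
error-checking style, and the solver-package option are assumptions:

```c
#include <petscpc.h>

int main(int argc, char **argv)
{
  Mat            A;
  PC             pc;
  PetscViewer    fd;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  /* Binary file assumed to contain two matrices written back to back */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrices.dat", FILE_MODE_READ, &fd);CHKERRQ(ierr);

  ierr = PCCreate(PETSC_COMM_WORLD, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);
  ierr = PCSetFromOptions(pc);CHKERRQ(ierr);  /* e.g. -pc_factor_mat_solver_package superlu_dist */

  /* First matrix: symbolic + numeric factorization succeed */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatLoad(A, fd);CHKERRQ(ierr);
  ierr = PCSetOperators(pc, A, A);CHKERRQ(ierr);
  ierr = PCSetUp(pc);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);

  /* Second matrix: reported to crash with superlu_dist on two ranks */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatLoad(A, fd);CHKERRQ(ierr);
  ierr = PCSetOperators(pc, A, A);CHKERRQ(ierr);
  ierr = PCSetUp(pc);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PCDestroy(&pc);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&fd);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
```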

> On Oct 24, 2016, at 9:00 AM, Satish Balay <ba...@mcs.anl.gov> wrote:
> 
> Since the provided test code doesn't crash [and is valgrind clean]
> with this fix, I'm not sure what bug Hong is chasing.
> 
> Satish
> 
> On Mon, 24 Oct 2016, Barry Smith wrote:
> 
>> 
>>  Anton,
>> 
>>   Sorry for any confusion. This doesn't resolve the SuperLU_DIST issue which 
>> I think Hong is working on, this only resolves multiple loads of matrices 
>> into the same Mat.
>> 
>>  Barry
>> 
>>> On Oct 24, 2016, at 5:07 AM, Anton Popov <po...@uni-mainz.de> wrote:
>>> 
>>> Thank you Barry, Satish, Fande!
>>> 
>>> Is there a chance to get this fix in the maintenance release 3.7.5 together 
>>> with the latest SuperLU_DIST? Or next release is a more realistic option?
>>> 
>>> Anton
>>> 
>>> On 10/24/2016 01:58 AM, Satish Balay wrote:
>>>> The original testcode from Anton also works [i.e is valgrind clean] with 
>>>> this change..
>>>> 
>>>> Satish
>>>> 
>>>> On Sun, 23 Oct 2016, Barry Smith wrote:
>>>> 
>>>>>   Thanks Satish,
>>>>> 
>>>>>  I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant  
>>>>> (in next for testing)
>>>>> 
>>>>>Fande,
>>>>> 
>>>>>This will also make MatMPIAIJSetPreallocation() work properly with 
>>>>> multiple calls (you will not need a MatReset()).
>>>>> 
>>>>>   Barry
>>>>> 
>>>>> 
>>>>>> On Oct 21, 2016, at 6:48 PM, Satish Balay <ba...@mcs.anl.gov> wrote:
>>>>>> 
>>>>>> On Fri, 21 Oct 2016, Barry Smith wrote:
>>>>>> 
>>>>>>> valgrind first

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Kong, Fande
On Sun, Oct 23, 2016 at 3:56 PM, Barry Smith  wrote:

>
>Thanks Satish,
>
>   I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant
> (in next for testing)
>
> Fande,
>
> This will also make MatMPIAIJSetPreallocation() work properly with
> multiple calls (you will not need a MatReset()).
>
>Barry
>

Thanks, Barry.

Fande,


>
>
> > On Oct 21, 2016, at 6:48 PM, Satish Balay  wrote:
> >
> > On Fri, 21 Oct 2016, Barry Smith wrote:
> >
> >>
> >>  valgrind first
> >

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Satish Balay
Since the provided test code doesn't crash [and is valgrind clean]
with this fix, I'm not sure what bug Hong is chasing.

Satish

On Mon, 24 Oct 2016, Barry Smith wrote:

> 
>   Anton,
> 
>Sorry for any confusion. This doesn't resolve the SuperLU_DIST issue which 
> I think Hong is working on, this only resolves multiple loads of matrices 
> into the same Mat.
> 
>   Barry
> 
> > On Oct 24, 2016, at 5:07 AM, Anton Popov  wrote:
> > 
> > Thank you Barry, Satish, Fande!
> > 
> > Is there a chance to get this fix in the maintenance release 3.7.5 together 
> > with the latest SuperLU_DIST? Or next release is a more realistic option?
> > 
> > Anton
> > 
> > On 10/24/2016 01:58 AM, Satish Balay wrote:
> >> The original testcode from Anton also works [i.e is valgrind clean] with 
> >> this change..
> >> 
> >> Satish
> >> 
> >> On Sun, 23 Oct 2016, Barry Smith wrote:
> >> 
> >>>Thanks Satish,
> >>> 
> >>>   I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant  
> >>> (in next for testing)
> >>> 
> >>> Fande,
> >>> 
> >>> This will also make MatMPIAIJSetPreallocation() work properly 
> >>> with multiple calls (you will not need a MatReset()).
> >>> 
> >>>Barry
> >>> 
> >>> 
>  On Oct 21, 2016, at 6:48 PM, Satish Balay  wrote:
>  
>  On Fri, 21 Oct 2016, Barry Smith wrote:
>  
> >  valgrind first

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Barry Smith

  Anton,

   Sorry for any confusion. This doesn't resolve the SuperLU_DIST issue, which I
think Hong is working on; it only resolves multiple loads of matrices into
the same Mat.

  Barry
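To make the scope concrete, the pattern the fix covers is reusing one Mat
object for successive loads. Roughly (assuming fd is a binary viewer holding
two matrices back to back):

```c
/* Repeated MatLoad() into the same Mat -- the case the reentrancy fix covers. */
Mat A;
ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
ierr = MatLoad(A, fd);CHKERRQ(ierr);  /* first matrix */
ierr = MatLoad(A, fd);CHKERRQ(ierr);  /* second matrix into the same Mat;
                                         previously this required a MatReset() */
ierr = MatDestroy(&A);CHKERRQ(ierr);
```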

> On Oct 24, 2016, at 5:07 AM, Anton Popov  wrote:
> 
> Thank you Barry, Satish, Fande!
> 
> Is there a chance to get this fix in the maintenance release 3.7.5 together 
> with the latest SuperLU_DIST? Or next release is a more realistic option?
> 
> Anton
> 
> On 10/24/2016 01:58 AM, Satish Balay wrote:
>> The original testcode from Anton also works [i.e is valgrind clean] with 
>> this change..
>> 
>> Satish
>> 
>> On Sun, 23 Oct 2016, Barry Smith wrote:
>> 
>>>Thanks Satish,
>>> 
>>>   I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant  
>>> (in next for testing)
>>> 
>>> Fande,
>>> 
>>> This will also make MatMPIAIJSetPreallocation() work properly with 
>>> multiple calls (you will not need a MatReset()).
>>> 
>>>Barry
>>> 
>>> 
 On Oct 21, 2016, at 6:48 PM, Satish Balay  wrote:
 
 On Fri, 21 Oct 2016, Barry Smith wrote:
 
>  valgrind first

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-24 Thread Anton Popov

Thank you Barry, Satish, Fande!

Is there a chance to get this fix into the maintenance release 3.7.5
together with the latest SuperLU_DIST? Or is the next release a more
realistic option?


Anton

On 10/24/2016 01:58 AM, Satish Balay wrote:

The original test code from Anton also works [i.e. it is valgrind clean] with this
change.

Satish

On Sun, 23 Oct 2016, Barry Smith wrote:


Thanks Satish,

   I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant  (in 
next for testing)

 Fande,

 This will also make MatMPIAIJSetPreallocation() work properly with 
multiple calls (you will not need a MatReset()).

Barry



On Oct 21, 2016, at 6:48 PM, Satish Balay  wrote:

On Fri, 21 Oct 2016, Barry Smith wrote:


  valgrind first

[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Argument out of range
[0]PETSC ERROR: 

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-23 Thread Satish Balay
The original test code from Anton also works [i.e. it is valgrind clean] with this
change.

Satish

On Sun, 23 Oct 2016, Barry Smith wrote:

> 
>Thanks Satish,
> 
>   I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant  (in 
> next for testing)
> 
> Fande,
> 
> This will also make MatMPIAIJSetPreallocation() work properly with 
> multiple calls (you will not need a MatReset()).
> 
>Barry
> 
> 
> > On Oct 21, 2016, at 6:48 PM, Satish Balay  wrote:
> > 
> > On Fri, 21 Oct 2016, Barry Smith wrote:
> > 
> >> 
> >>  valgrind first
> > 
> > 

Re: [petsc-users] SuperLU_dist issue in 3.7.4 failure of repeated calls to MatLoad() or MatMPIAIJSetPreallocation() with the same matrix

2016-10-23 Thread Barry Smith

   Thanks Satish,

  I have fixed this in barry/fix-matmpixxxsetpreallocation-reentrant  (in 
next for testing)

Fande,

This will also make MatMPIAIJSetPreallocation() work properly with 
multiple calls (you will not need a MatReset()).

   Barry


> On Oct 21, 2016, at 6:48 PM, Satish Balay  wrote:
> 
> On Fri, 21 Oct 2016, Barry Smith wrote:
> 
>> 
>>  valgrind first
> 
> balay@asterix /home/balay/download-pine/x/superlu_dist_test
> $ mpiexec -n 2 $VG ./ex16 -f ~/datafiles/matrices/small
> First MatLoad! 
> Mat Object: 2 MPI processes
>  type: mpiaij
> row 0: (0, 4.)  (1, -1.)  (6, -1.) 
> row 1: (0, -1.)  (1, 4.)  (2, -1.)  (7, -1.) 
> row 2: (1, -1.)  (2, 4.)  (3, -1.)  (8, -1.) 
> row 3: (2, -1.)  (3, 4.)  (4, -1.)  (9, -1.) 
> row 4: (3, -1.)  (4, 4.)  (5, -1.)  (10, -1.) 
> row 5: (4, -1.)  (5, 4.)  (11, -1.) 
> row 6: (0, -1.)  (6, 4.)  (7, -1.)  (12, -1.) 
> row 7: (1, -1.)  (6, -1.)  (7, 4.)  (8, -1.)  (13, -1.) 
> row 8: (2, -1.)  (7, -1.)  (8, 4.)  (9, -1.)  (14, -1.) 
> row 9: (3, -1.)  (8, -1.)  (9, 4.)  (10, -1.)  (15, -1.) 
> row 10: (4, -1.)  (9, -1.)  (10, 4.)  (11, -1.)  (16, -1.) 
> row 11: (5, -1.)  (10, -1.)  (11, 4.)  (17, -1.) 
> row 12: (6, -1.)  (12, 4.)  (13, -1.)  (18, -1.) 
> row 13: (7, -1.)  (12, -1.)  (13, 4.)  (14, -1.)  (19, -1.) 
> row 14: (8, -1.)  (13, -1.)  (14, 4.)  (15, -1.)  (20, -1.) 
> row 15: (9, -1.)  (14, -1.)  (15, 4.)  (16, -1.)  (21, -1.) 
> row 16: (10, -1.)  (15, -1.)  (16, 4.)  (17, -1.)  (22, -1.) 
> row 17: (11, -1.)  (16, -1.)  (17, 4.)  (23, -1.) 
> row 18: (12, -1.)  (18, 4.)  (19, -1.)  (24, -1.) 
> row 19: (13, -1.)  (18, -1.)  (19, 4.)  (20, -1.)  (25, -1.) 
> row 20: (14, -1.)  (19, -1.)  (20, 4.)  (21, -1.)  (26, -1.) 
> row 21: (15, -1.)  (20, -1.)  (21, 4.)  (22, -1.)  (27, -1.) 
> row 22: (16, -1.)  (21, -1.)  (22, 4.)  (23, -1.)  (28, -1.) 
> row 23: (17, -1.)  (22, -1.)  (23, 4.)  (29, -1.) 
> row 24: (18, -1.)  (24, 4.)  (25, -1.)  (30, -1.) 
> row 25: (19, -1.)  (24, -1.)  (25, 4.)  (26, -1.)  (31, -1.) 
> row 26: (20, -1.)  (25, -1.)  (26, 4.)  (27, -1.)  (32, -1.) 
> row 27: (21, -1.)  (26, -1.)  (27, 4.)  (28, -1.)  (33, -1.) 
> row 28: (22, -1.)  (27, -1.)  (28, 4.)  (29, -1.)  (34, -1.) 
> row 29: (23, -1.)  (28, -1.)  (29, 4.)  (35, -1.) 
> row 30: (24, -1.)  (30, 4.)  (31, -1.) 
> row 31: (25, -1.)  (30, -1.)  (31, 4.)  (32, -1.) 
> row 32: (26, -1.)  (31, -1.)  (32, 4.)  (33, -1.) 
> row 33: (27, -1.)  (32, -1.)  (33, 4.)  (34, -1.) 
> row 34: (28, -1.)  (33, -1.)  (34, 4.)  (35, -1.) 
> row 35: (29, -1.)  (34, -1.)  (35, 4.) 
> Second MatLoad! 
> Mat Object: 2 MPI processes
>  type: mpiaij
> ==4592== Invalid read of size 4
> ==4592==at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket (mpiaij.c:1402)
> ==4592==by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
> ==4592==by 0x53373D7: MatView (matrix.c:989)
> ==4592==by 0x40107E: main (ex16.c:30)
> ==4592==  Address 0xa47b460 is 20 bytes after a block of size 28 alloc'd
> ==4592==at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
> ==4592==by 0x4FD121A: PetscMallocAlign (mal.c:28)
> ==4592==by 0x5842C70: MatSetUpMultiply_MPIAIJ (mmaij.c:41)
> ==4592==by 0x5809943: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
> ==4592==by 0x536B299: MatAssemblyEnd (matrix.c:5298)
> ==4592==by 0x5829C05: MatLoad_MPIAIJ (mpiaij.c:3032)
> ==4592==by 0x5337FEA: MatLoad (matrix.c:1101)
> ==4592==by 0x400D9F: main (ex16.c:22)
> ==4592== 
> ==4591== Invalid read of size 4
> ==4591==at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket (mpiaij.c:1402)
> ==4591==by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
> ==4591==by 0x53373D7: MatView (matrix.c:989)
> ==4591==by 0x40107E: main (ex16.c:30)
> ==4591==  Address 0xa482958 is 24 bytes before a block of size 7 alloc'd
> ==4591==at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
> ==4591==by 0x4FD121A: PetscMallocAlign (mal.c:28)
> ==4591==by 0x4F31FB5: PetscStrallocpy (str.c:197)
> ==4591==by 0x4F0D3F5: PetscClassRegLogRegister (classlog.c:253)
> ==4591==by 0x4EF96E2: PetscClassIdRegister (plog.c:2053)
> ==4591==by 0x51FA018: VecInitializePackage (dlregisvec.c:165)
> ==4591==by 0x51F6DE9: VecCreate (veccreate.c:35)
> ==4591==by 0x51C49F0: VecCreateSeq (vseqcr.c:37)
> ==4591==by 0x5843191: MatSetUpMultiply_MPIAIJ (mmaij.c:104)
> ==4591==by 0x5809943: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
> ==4591==by 0x536B299: MatAssemblyEnd (matrix.c:5298)
> ==4591==by 0x5829C05: MatLoad_MPIAIJ (mpiaij.c:3032)
> ==4591==by 0x5337FEA: MatLoad (matrix.c:1101)
> ==4591==by 0x400D9F: main (ex16.c:22)
> ==4591== 
> [0]PETSC ERROR: - Error Message 
> --
> [0]PETSC ERROR: Argument out of range
> [0]PETSC ERROR: Column too large: col 96 max 35
> [0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
> trouble shooting.
> [0]PETSC ERROR: Petsc Development 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-21 Thread Zhang, Hong
It is not a problem with calling MatLoad() twice. The file has one matrix, but it
is loaded twice.

Replacing pc with ksp, the code runs fine. 
The error occurs when PCSetUp_LU() is called with SAME_NONZERO_PATTERN.
I'll further look at it later.

Hong
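
For intuition, SAME_NONZERO_PATTERN tells the factorization that the matrix's
sparsity structure is unchanged between solves, so the symbolic factorization may
be reused. A minimal sketch of that structural check, using scipy CSR matrices as
a stand-in (the helper `same_nonzero_pattern` is hypothetical, not PETSc API):

```python
import numpy as np
from scipy.sparse import csr_matrix

def same_nonzero_pattern(A, B):
    # Two CSR matrices share a sparsity pattern iff their row-pointer
    # and column-index arrays are identical.
    return (A.shape == B.shape
            and np.array_equal(A.indptr, B.indptr)
            and np.array_equal(A.indices, B.indices))

A = csr_matrix(np.array([[4.0, -1.0], [-1.0, 4.0]]))
B = csr_matrix(np.array([[4.0,  0.0], [-1.0, 4.0]]))  # one entry dropped

print(same_nonzero_pattern(A, A.copy()))  # True: values may change, structure is the same
print(same_nonzero_pattern(A, B))         # False: reusing the symbolic factorization is invalid
```

If the second check fails, a solver honoring the contract must redo the symbolic
factorization rather than reuse the old one.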

From: Zhang, Hong
Sent: Friday, October 21, 2016 8:18 PM
To: Barry Smith; petsc-users
Subject: RE: [petsc-users] SuperLU_dist issue in 3.7.4

I am investigating it. The file has two matrices. The code takes the following 
steps:

PCCreate(PETSC_COMM_WORLD,&pc);

MatCreate(PETSC_COMM_WORLD,&A);
MatLoad(A,fd);
PCSetOperators(pc,A,A);
PCSetUp(pc);

MatCreate(PETSC_COMM_WORLD,&A);
MatLoad(A,fd);
PCSetOperators(pc,A,A);
PCSetUp(pc);  // crashes here with np=2, superlu_dist; not with mumps/superlu or
              // superlu_dist with np=1

Hong


From: Barry Smith [bsm...@mcs.anl.gov]
Sent: Friday, October 21, 2016 5:59 PM
To: petsc-users
Cc: Zhang, Hong
Subject: Re: [petsc-users] SuperLU_dist issue in 3.7.4

> On Oct 21, 2016, at 5:16 PM, Satish Balay <ba...@mcs.anl.gov> wrote:
>
> The issue with this test code is - using MatLoad() twice [with the
> same object - without destroying it]. Not sure if that's supposed to
> work...

   If the file has two matrices in it, then yes, a second call to MatLoad() with 
the same matrix should just load the second matrix from the file correctly. 
Perhaps we need a test in our test suite just to make sure that works.

  Barry



>
> Satish
>
> On Fri, 21 Oct 2016, Hong wrote:
>
>> I can reproduce the error on a linux machine with petsc-maint. It crashes
>> at 2nd solve, on both processors:
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x7f051dc835bd in pdgsequ (A=0x1563910, r=0x176dfe0, c=0x178f7f0,
>>rowcnd=0x7fffcb8dab30, colcnd=0x7fffcb8dab38, amax=0x7fffcb8dab40,
>>info=0x7fffcb8dab4c, grid=0x1563858)
>>at
>> /sandbox/hzhang/petsc/arch-linux-gcc-gfortran/externalpackages/git.superlu_dist/SRC/pdgsequ.c:182
>> 182 c[jcol] = SUPERLU_MAX( c[jcol], fabs(Aval[j]) * r[irow]
>> );
>>
>> The version of superlu_dist:
>> commit 0b5369f304507f1c7904a913f4c0c86777a60639
>> Author: Xiaoye Li <x...@lbl.gov>
>> Date:   Thu May 26 11:33:19 2016 -0700
>>
>>rename 'struct pair' to 'struct superlu_pair'.
>>
>> Hong
>>
>> On Fri, Oct 21, 2016 at 5:36 AM, Anton Popov <po...@uni-mainz.de> wrote:
>>
>>>
>>> On 10/19/2016 05:22 PM, Anton Popov wrote:
>>>
>>> I looked at each valgrind-complained item in your email dated Oct. 11.
>>> Those reports are really superficial; I don't see anything  wrong with
>>> those lines (mostly uninitialized variables) singled out.  I did a few
>>> tests with the latest version in github,  all went fine.
>>>
>>> Perhaps you can print your matrix that caused problem, I can run it using
>>> your matrix.
>>>
>>> Sherry
>>>
>>> Hi Sherry,
>>>
>>> I finally figured out a minimalistic setup (attached) that reproduces the
>>> problem.
>>>
>>> I use petsc-maint:
>>>
>>> git clone -b maint https://bitbucket.org/petsc/petsc.git
>>>
>>> and configure it in the debug mode without optimization using the options:
>>>
>>> --download-superlu_dist=1 \
>>> --download-superlu_dist-commit=origin/maint \
>>>
>>> Compile the test, assuming PETSC_DIR points to the described petsc
>>> installation:
>>>
>>> make ex16
>>>
>>> Run with:
>>>
>>> mpirun -n 2 ./ex16 -f binaryoutput -pc_type lu
>>> -pc_factor_mat_solver_package superlu_dist
>>>
>>> Matrix partitioning between the processors will be completely the same as
>>> in our code (hard-coded).
>>>
>>> I factorize the same matrix twice with the same PC object. Remarkably it
>>> runs fine for the first time, but fails for the second.
>>>
>>> Thank you very much for looking into this problem.
>>>
>>> Cheers,
>>> Anton
>>>
>>
>
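
For reference, the 36-row matrix printed by MatView in the runs above is the
standard 5-point finite-difference Laplacian on a 6 x 6 grid. A short sketch that
reassembles it (dense and illustrative only, not how PETSc stores it):

```python
import numpy as np

def laplacian_2d(n):
    # 5-point stencil on an n x n grid: 4 on the diagonal, -1 for each
    # in-grid neighbor; boundary rows simply have fewer -1 entries.
    N = n * n
    A = np.zeros((N, N))
    for i in range(n):
        for j in range(n):
            k = i * n + j
            A[k, k] = 4.0
            if j > 0:     A[k, k - 1] = -1.0
            if j < n - 1: A[k, k + 1] = -1.0
            if i > 0:     A[k, k - n] = -1.0
            if i < n - 1: A[k, k + n] = -1.0
    return A

A = laplacian_2d(6)
# Row 0 matches the log: (0, 4.)  (1, -1.)  (6, -1.)
print([(int(c), A[0, c]) for c in np.nonzero(A[0])[0]])
```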



Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-21 Thread Zhang, Hong
I am investigating it. The file has two matrices. The code takes the following 
steps:

PCCreate(PETSC_COMM_WORLD,&pc);

MatCreate(PETSC_COMM_WORLD,&A);
MatLoad(A,fd);
PCSetOperators(pc,A,A);
PCSetUp(pc);

MatCreate(PETSC_COMM_WORLD,&A);
MatLoad(A,fd);
PCSetOperators(pc,A,A);
PCSetUp(pc);  // crashes here with np=2, superlu_dist; not with mumps/superlu or
              // superlu_dist with np=1

Hong


From: Barry Smith [bsm...@mcs.anl.gov]
Sent: Friday, October 21, 2016 5:59 PM
To: petsc-users
Cc: Zhang, Hong
Subject: Re: [petsc-users] SuperLU_dist issue in 3.7.4

> On Oct 21, 2016, at 5:16 PM, Satish Balay <ba...@mcs.anl.gov> wrote:
>
> The issue with this test code is - using MatLoad() twice [with the
> same object - without destroying it]. Not sure if that's supposed to
> work...

   If the file has two matrices in it, then yes, a second call to MatLoad() with 
the same matrix should just load the second matrix from the file correctly. 
Perhaps we need a test in our test suite just to make sure that works.

  Barry



>
> Satish
>
> On Fri, 21 Oct 2016, Hong wrote:
>
>> I can reproduce the error on a linux machine with petsc-maint. It crashes
>> at 2nd solve, on both processors:
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x7f051dc835bd in pdgsequ (A=0x1563910, r=0x176dfe0, c=0x178f7f0,
>>rowcnd=0x7fffcb8dab30, colcnd=0x7fffcb8dab38, amax=0x7fffcb8dab40,
>>info=0x7fffcb8dab4c, grid=0x1563858)
>>at
>> /sandbox/hzhang/petsc/arch-linux-gcc-gfortran/externalpackages/git.superlu_dist/SRC/pdgsequ.c:182
>> 182 c[jcol] = SUPERLU_MAX( c[jcol], fabs(Aval[j]) * r[irow]
>> );
>>
>> The version of superlu_dist:
>> commit 0b5369f304507f1c7904a913f4c0c86777a60639
>> Author: Xiaoye Li <x...@lbl.gov>
>> Date:   Thu May 26 11:33:19 2016 -0700
>>
>>rename 'struct pair' to 'struct superlu_pair'.
>>
>> Hong
>>
>> On Fri, Oct 21, 2016 at 5:36 AM, Anton Popov <po...@uni-mainz.de> wrote:
>>
>>>
>>> On 10/19/2016 05:22 PM, Anton Popov wrote:
>>>
>>> I looked at each valgrind-complained item in your email dated Oct. 11.
>>> Those reports are really superficial; I don't see anything  wrong with
>>> those lines (mostly uninitialized variables) singled out.  I did a few
>>> tests with the latest version in github,  all went fine.
>>>
>>> Perhaps you can print your matrix that caused problem, I can run it using
>>> your matrix.
>>>
>>> Sherry
>>>
>>> Hi Sherry,
>>>
>>> I finally figured out a minimalistic setup (attached) that reproduces the
>>> problem.
>>>
>>> I use petsc-maint:
>>>
>>> git clone -b maint https://bitbucket.org/petsc/petsc.git
>>>
>>> and configure it in the debug mode without optimization using the options:
>>>
>>> --download-superlu_dist=1 \
>>> --download-superlu_dist-commit=origin/maint \
>>>
>>> Compile the test, assuming PETSC_DIR points to the described petsc
>>> installation:
>>>
>>> make ex16
>>>
>>> Run with:
>>>
>>> mpirun -n 2 ./ex16 -f binaryoutput -pc_type lu
>>> -pc_factor_mat_solver_package superlu_dist
>>>
>>> Matrix partitioning between the processors will be completely the same as
>>> in our code (hard-coded).
>>>
>>> I factorize the same matrix twice with the same PC object. Remarkably it
>>> runs fine for the first time, but fails for the second.
>>>
>>> Thank you very much for looking into this problem.
>>>
>>> Cheers,
>>> Anton
>>>
>>
>



Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-21 Thread Satish Balay
On Fri, 21 Oct 2016, Barry Smith wrote:

> 
>   valgrind first

balay@asterix /home/balay/download-pine/x/superlu_dist_test
$ mpiexec -n 2 $VG ./ex16 -f ~/datafiles/matrices/small
First MatLoad! 
Mat Object: 2 MPI processes
  type: mpiaij
row 0: (0, 4.)  (1, -1.)  (6, -1.) 
row 1: (0, -1.)  (1, 4.)  (2, -1.)  (7, -1.) 
row 2: (1, -1.)  (2, 4.)  (3, -1.)  (8, -1.) 
row 3: (2, -1.)  (3, 4.)  (4, -1.)  (9, -1.) 
row 4: (3, -1.)  (4, 4.)  (5, -1.)  (10, -1.) 
row 5: (4, -1.)  (5, 4.)  (11, -1.) 
row 6: (0, -1.)  (6, 4.)  (7, -1.)  (12, -1.) 
row 7: (1, -1.)  (6, -1.)  (7, 4.)  (8, -1.)  (13, -1.) 
row 8: (2, -1.)  (7, -1.)  (8, 4.)  (9, -1.)  (14, -1.) 
row 9: (3, -1.)  (8, -1.)  (9, 4.)  (10, -1.)  (15, -1.) 
row 10: (4, -1.)  (9, -1.)  (10, 4.)  (11, -1.)  (16, -1.) 
row 11: (5, -1.)  (10, -1.)  (11, 4.)  (17, -1.) 
row 12: (6, -1.)  (12, 4.)  (13, -1.)  (18, -1.) 
row 13: (7, -1.)  (12, -1.)  (13, 4.)  (14, -1.)  (19, -1.) 
row 14: (8, -1.)  (13, -1.)  (14, 4.)  (15, -1.)  (20, -1.) 
row 15: (9, -1.)  (14, -1.)  (15, 4.)  (16, -1.)  (21, -1.) 
row 16: (10, -1.)  (15, -1.)  (16, 4.)  (17, -1.)  (22, -1.) 
row 17: (11, -1.)  (16, -1.)  (17, 4.)  (23, -1.) 
row 18: (12, -1.)  (18, 4.)  (19, -1.)  (24, -1.) 
row 19: (13, -1.)  (18, -1.)  (19, 4.)  (20, -1.)  (25, -1.) 
row 20: (14, -1.)  (19, -1.)  (20, 4.)  (21, -1.)  (26, -1.) 
row 21: (15, -1.)  (20, -1.)  (21, 4.)  (22, -1.)  (27, -1.) 
row 22: (16, -1.)  (21, -1.)  (22, 4.)  (23, -1.)  (28, -1.) 
row 23: (17, -1.)  (22, -1.)  (23, 4.)  (29, -1.) 
row 24: (18, -1.)  (24, 4.)  (25, -1.)  (30, -1.) 
row 25: (19, -1.)  (24, -1.)  (25, 4.)  (26, -1.)  (31, -1.) 
row 26: (20, -1.)  (25, -1.)  (26, 4.)  (27, -1.)  (32, -1.) 
row 27: (21, -1.)  (26, -1.)  (27, 4.)  (28, -1.)  (33, -1.) 
row 28: (22, -1.)  (27, -1.)  (28, 4.)  (29, -1.)  (34, -1.) 
row 29: (23, -1.)  (28, -1.)  (29, 4.)  (35, -1.) 
row 30: (24, -1.)  (30, 4.)  (31, -1.) 
row 31: (25, -1.)  (30, -1.)  (31, 4.)  (32, -1.) 
row 32: (26, -1.)  (31, -1.)  (32, 4.)  (33, -1.) 
row 33: (27, -1.)  (32, -1.)  (33, 4.)  (34, -1.) 
row 34: (28, -1.)  (33, -1.)  (34, 4.)  (35, -1.) 
row 35: (29, -1.)  (34, -1.)  (35, 4.) 
Second MatLoad! 
Mat Object: 2 MPI processes
  type: mpiaij
==4592== Invalid read of size 4
==4592==at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket (mpiaij.c:1402)
==4592==by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
==4592==by 0x53373D7: MatView (matrix.c:989)
==4592==by 0x40107E: main (ex16.c:30)
==4592==  Address 0xa47b460 is 20 bytes after a block of size 28 alloc'd
==4592==at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
==4592==by 0x4FD121A: PetscMallocAlign (mal.c:28)
==4592==by 0x5842C70: MatSetUpMultiply_MPIAIJ (mmaij.c:41)
==4592==by 0x5809943: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
==4592==by 0x536B299: MatAssemblyEnd (matrix.c:5298)
==4592==by 0x5829C05: MatLoad_MPIAIJ (mpiaij.c:3032)
==4592==by 0x5337FEA: MatLoad (matrix.c:1101)
==4592==by 0x400D9F: main (ex16.c:22)
==4592== 
==4591== Invalid read of size 4
==4591==at 0x5814014: MatView_MPIAIJ_ASCIIorDraworSocket (mpiaij.c:1402)
==4591==by 0x5814A75: MatView_MPIAIJ (mpiaij.c:1440)
==4591==by 0x53373D7: MatView (matrix.c:989)
==4591==by 0x40107E: main (ex16.c:30)
==4591==  Address 0xa482958 is 24 bytes before a block of size 7 alloc'd
==4591==at 0x4C2FF83: memalign (vg_replace_malloc.c:858)
==4591==by 0x4FD121A: PetscMallocAlign (mal.c:28)
==4591==by 0x4F31FB5: PetscStrallocpy (str.c:197)
==4591==by 0x4F0D3F5: PetscClassRegLogRegister (classlog.c:253)
==4591==by 0x4EF96E2: PetscClassIdRegister (plog.c:2053)
==4591==by 0x51FA018: VecInitializePackage (dlregisvec.c:165)
==4591==by 0x51F6DE9: VecCreate (veccreate.c:35)
==4591==by 0x51C49F0: VecCreateSeq (vseqcr.c:37)
==4591==by 0x5843191: MatSetUpMultiply_MPIAIJ (mmaij.c:104)
==4591==by 0x5809943: MatAssemblyEnd_MPIAIJ (mpiaij.c:747)
==4591==by 0x536B299: MatAssemblyEnd (matrix.c:5298)
==4591==by 0x5829C05: MatLoad_MPIAIJ (mpiaij.c:3032)
==4591==by 0x5337FEA: MatLoad (matrix.c:1101)
==4591==by 0x400D9F: main (ex16.c:22)
==4591== 
[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Argument out of range
[0]PETSC ERROR: Column too large: col 96 max 35
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.7.4-1729-g4c4de23  GIT Date: 
2016-10-20 22:22:58 +
[0]PETSC ERROR: ./ex16 on a arch-idx64-slu named asterix by balay Fri Oct 21 
18:47:51 2016
[0]PETSC ERROR: Configure options --download-metis --download-parmetis 
--download-superlu_dist PETSC_ARCH=arch-idx64-slu
[0]PETSC ERROR: #1 MatSetValues_MPIAIJ() line 585 in 
/home/balay/petsc/src/mat/impls/aij/mpi/mpiaij.c
[0]PETSC ERROR: #2 MatAssemblyEnd_MPIAIJ() line 724 in 
/home/balay/petsc/src/mat/impls/aij/mpi/mpiaij.c
[0]PETSC ERROR: #3 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-21 Thread Barry Smith

  valgrind first

> On Oct 21, 2016, at 6:33 PM, Satish Balay  wrote:
> 
> On Fri, 21 Oct 2016, Barry Smith wrote:
> 
>> 
>>> On Oct 21, 2016, at 5:16 PM, Satish Balay  wrote:
>>> 
>>> The issue with this test code is - using MatLoad() twice [with the
>>> same object - without destroying it]. Not sure if that's supposed to
>>> work...
>> 
>>   If the file has two matrices in it, then yes, a second call to MatLoad() 
>> with the same matrix should just load the second matrix from the file 
>> correctly. Perhaps we need a test in our test suite just to make sure that 
>> works.
> 
> This test code crashes with:
> 
> MatLoad()
> MatView()
> MatLoad()
> MatView()
> 
> Satish
> 
> 
> 
> balay@asterix /home/balay/download-pine/x/superlu_dist_test
> $ cat ex16.c
> static char help[] = "Reads matrix and debug solver\n\n";
> #include <petscmat.h>
> #undef __FUNCT__
> #define __FUNCT__ "main"
> int main(int argc,char **args)
> {
>   Mat            A;
>   PetscViewer    fd;                       /* viewer */
>   char           file[PETSC_MAX_PATH_LEN]; /* input file name */
>   PetscErrorCode ierr;
>   PetscBool      flg;
> 
>   PetscInitialize(&argc,&args,(char*)0,help);
> 
>   ierr = PetscOptionsGetString(NULL,NULL,"-f",file,PETSC_MAX_PATH_LEN,&flg);CHKERRQ(ierr);
>   if (!flg) SETERRQ(PETSC_COMM_WORLD,1,"Must indicate binary file with the -f option");
> 
>   ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
> 
>   ierr = PetscPrintf(PETSC_COMM_WORLD, "First MatLoad! \n");CHKERRQ(ierr);
>   ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,file,FILE_MODE_READ,&fd);CHKERRQ(ierr);
>   ierr = MatLoad(A,fd);CHKERRQ(ierr);
>   ierr = PetscViewerDestroy(&fd);CHKERRQ(ierr);
>   ierr = MatView(A,0);CHKERRQ(ierr);
> 
>   ierr = PetscPrintf(PETSC_COMM_WORLD, "Second MatLoad! \n");CHKERRQ(ierr);
>   ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,file,FILE_MODE_READ,&fd);CHKERRQ(ierr);
>   ierr = MatLoad(A,fd);CHKERRQ(ierr);
>   ierr = PetscViewerDestroy(&fd);CHKERRQ(ierr);
>   ierr = MatView(A,0);CHKERRQ(ierr);
> 
>   ierr = MatDestroy(&A);CHKERRQ(ierr);
>   ierr = PetscFinalize();
>   return 0;
> }
> 
> balay@asterix /home/balay/download-pine/x/superlu_dist_test
> $ make ex16
> mpicc -o ex16.o -c -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing 
> -Wno-unknown-pragmas -fvisibility=hidden -g3   -I/home/balay/petsc/include 
> -I/home/balay/petsc/arch-idx64-slu/include `pwd`/ex16.c
> mpicc -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas 
> -fvisibility=hidden -g3   -o ex16 ex16.o 
> -Wl,-rpath,/home/balay/petsc/arch-idx64-slu/lib 
> -L/home/balay/petsc/arch-idx64-slu/lib  -lpetsc 
> -Wl,-rpath,/home/balay/petsc/arch-idx64-slu/lib -lsuperlu_dist -llapack 
> -lblas -lparmetis -lmetis -lX11 -lpthread -lm 
> -Wl,-rpath,/home/balay/soft/mpich-3.1.4/lib 
> -L/home/balay/soft/mpich-3.1.4/lib 
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/6.2.1 
> -L/usr/lib/gcc/x86_64-redhat-linux/6.2.1 -lmpifort -lgfortran -lm -lgfortran 
> -lm -lquadmath -lm -lmpicxx -lstdc++ 
> -Wl,-rpath,/home/balay/soft/mpich-3.1.4/lib 
> -L/home/balay/soft/mpich-3.1.4/lib 
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/6.2.1 
> -L/usr/lib/gcc/x86_64-redhat-linux/6.2.1 -ldl 
> -Wl,-rpath,/home/balay/soft/mpich-3.1.4/lib -lmpi -lgcc_s -ldl 
> /usr/bin/rm -f ex16.o
> balay@asterix /home/balay/download-pine/x/superlu_dist_test
> $ mpiexec -n 2 ./ex16 -f ~/datafiles/matrices/small
> First MatLoad! 
> Mat Object: 2 MPI processes
>  type: mpiaij
> row 0: (0, 4.)  (1, -1.)  (6, -1.) 
> row 1: (0, -1.)  (1, 4.)  (2, -1.)  (7, -1.) 
> row 2: (1, -1.)  (2, 4.)  (3, -1.)  (8, -1.) 
> row 3: (2, -1.)  (3, 4.)  (4, -1.)  (9, -1.) 
> row 4: (3, -1.)  (4, 4.)  (5, -1.)  (10, -1.) 
> row 5: (4, -1.)  (5, 4.)  (11, -1.) 
> row 6: (0, -1.)  (6, 4.)  (7, -1.)  (12, -1.) 
> row 7: (1, -1.)  (6, -1.)  (7, 4.)  (8, -1.)  (13, -1.) 
> row 8: (2, -1.)  (7, -1.)  (8, 4.)  (9, -1.)  (14, -1.) 
> row 9: (3, -1.)  (8, -1.)  (9, 4.)  (10, -1.)  (15, -1.) 
> row 10: (4, -1.)  (9, -1.)  (10, 4.)  (11, -1.)  (16, -1.) 
> row 11: (5, -1.)  (10, -1.)  (11, 4.)  (17, -1.) 
> row 12: (6, -1.)  (12, 4.)  (13, -1.)  (18, -1.) 
> row 13: (7, -1.)  (12, -1.)  (13, 4.)  (14, -1.)  (19, -1.) 
> row 14: (8, -1.)  (13, -1.)  (14, 4.)  (15, -1.)  (20, -1.) 
> row 15: (9, -1.)  (14, -1.)  (15, 4.)  (16, -1.)  (21, -1.) 
> row 16: (10, -1.)  (15, -1.)  (16, 4.)  (17, -1.)  (22, -1.) 
> row 17: (11, -1.)  (16, -1.)  (17, 4.)  (23, -1.) 
> row 18: (12, -1.)  (18, 4.)  (19, -1.)  (24, -1.) 
> row 19: (13, -1.)  (18, -1.)  (19, 4.)  (20, -1.)  (25, -1.) 
> row 20: (14, -1.)  (19, -1.)  (20, 4.)  (21, -1.)  (26, -1.) 
> row 21: (15, -1.)  (20, -1.)  (21, 4.)  (22, -1.)  (27, -1.) 
> row 22: (16, -1.)  (21, -1.)  (22, 4.)  (23, -1.)  (28, -1.) 
> row 23: (17, -1.)  (22, -1.)  (23, 4.)  (29, -1.) 
> row 24: (18, -1.)  (24, 4.)  (25, -1.)  (30, -1.) 
> row 25: (19, -1.)  (24, -1.)  (25, 4.)  (26, -1.)  (31, -1.) 
> row 26: (20, -1.)  (25, -1.)  (26, 4.)  (27, -1.)  (32, 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-21 Thread Satish Balay
On Fri, 21 Oct 2016, Barry Smith wrote:

> 
> > On Oct 21, 2016, at 5:16 PM, Satish Balay  wrote:
> > 
> > The issue with this test code is - using MatLoad() twice [with the
> > same object - without destroying it]. Not sure if that's supposed to
> > work...
> 
> If the file has two matrices in it, then yes, a second call to MatLoad() 
> with the same matrix should just load the second matrix from the file 
> correctly. Perhaps we need a test in our test suite just to make sure that 
> works.

This test code crashes with:

MatLoad()
MatView()
MatLoad()
MatView()

Satish



balay@asterix /home/balay/download-pine/x/superlu_dist_test
$ cat ex16.c
static char help[] = "Reads matrix and debug solver\n\n";
#include <petscmat.h>
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
  Mat            A;
  PetscViewer    fd;                       /* viewer */
  char           file[PETSC_MAX_PATH_LEN]; /* input file name */
  PetscErrorCode ierr;
  PetscBool      flg;

  PetscInitialize(&argc,&args,(char*)0,help);

  ierr = PetscOptionsGetString(NULL,NULL,"-f",file,PETSC_MAX_PATH_LEN,&flg);CHKERRQ(ierr);
  if (!flg) SETERRQ(PETSC_COMM_WORLD,1,"Must indicate binary file with the -f option");

  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);

  ierr = PetscPrintf(PETSC_COMM_WORLD, "First MatLoad! \n");CHKERRQ(ierr);
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,file,FILE_MODE_READ,&fd);CHKERRQ(ierr);
  ierr = MatLoad(A,fd);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&fd);CHKERRQ(ierr);
  ierr = MatView(A,0);CHKERRQ(ierr);

  ierr = PetscPrintf(PETSC_COMM_WORLD, "Second MatLoad! \n");CHKERRQ(ierr);
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,file,FILE_MODE_READ,&fd);CHKERRQ(ierr);
  ierr = MatLoad(A,fd);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&fd);CHKERRQ(ierr);
  ierr = MatView(A,0);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}

balay@asterix /home/balay/download-pine/x/superlu_dist_test
$ make ex16
mpicc -o ex16.o -c -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing 
-Wno-unknown-pragmas -fvisibility=hidden -g3   -I/home/balay/petsc/include 
-I/home/balay/petsc/arch-idx64-slu/include `pwd`/ex16.c
mpicc -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas 
-fvisibility=hidden -g3   -o ex16 ex16.o 
-Wl,-rpath,/home/balay/petsc/arch-idx64-slu/lib 
-L/home/balay/petsc/arch-idx64-slu/lib  -lpetsc 
-Wl,-rpath,/home/balay/petsc/arch-idx64-slu/lib -lsuperlu_dist -llapack -lblas 
-lparmetis -lmetis -lX11 -lpthread -lm 
-Wl,-rpath,/home/balay/soft/mpich-3.1.4/lib -L/home/balay/soft/mpich-3.1.4/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/6.2.1 
-L/usr/lib/gcc/x86_64-redhat-linux/6.2.1 -lmpifort -lgfortran -lm -lgfortran 
-lm -lquadmath -lm -lmpicxx -lstdc++ 
-Wl,-rpath,/home/balay/soft/mpich-3.1.4/lib -L/home/balay/soft/mpich-3.1.4/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/6.2.1 
-L/usr/lib/gcc/x86_64-redhat-linux/6.2.1 -ldl 
-Wl,-rpath,/home/balay/soft/mpich-3.1.4/lib -lmpi -lgcc_s -ldl 
/usr/bin/rm -f ex16.o
balay@asterix /home/balay/download-pine/x/superlu_dist_test
$ mpiexec -n 2 ./ex16 -f ~/datafiles/matrices/small
First MatLoad! 
Mat Object: 2 MPI processes
  type: mpiaij
row 0: (0, 4.)  (1, -1.)  (6, -1.) 
row 1: (0, -1.)  (1, 4.)  (2, -1.)  (7, -1.) 
row 2: (1, -1.)  (2, 4.)  (3, -1.)  (8, -1.) 
row 3: (2, -1.)  (3, 4.)  (4, -1.)  (9, -1.) 
row 4: (3, -1.)  (4, 4.)  (5, -1.)  (10, -1.) 
row 5: (4, -1.)  (5, 4.)  (11, -1.) 
row 6: (0, -1.)  (6, 4.)  (7, -1.)  (12, -1.) 
row 7: (1, -1.)  (6, -1.)  (7, 4.)  (8, -1.)  (13, -1.) 
row 8: (2, -1.)  (7, -1.)  (8, 4.)  (9, -1.)  (14, -1.) 
row 9: (3, -1.)  (8, -1.)  (9, 4.)  (10, -1.)  (15, -1.) 
row 10: (4, -1.)  (9, -1.)  (10, 4.)  (11, -1.)  (16, -1.) 
row 11: (5, -1.)  (10, -1.)  (11, 4.)  (17, -1.) 
row 12: (6, -1.)  (12, 4.)  (13, -1.)  (18, -1.) 
row 13: (7, -1.)  (12, -1.)  (13, 4.)  (14, -1.)  (19, -1.) 
row 14: (8, -1.)  (13, -1.)  (14, 4.)  (15, -1.)  (20, -1.) 
row 15: (9, -1.)  (14, -1.)  (15, 4.)  (16, -1.)  (21, -1.) 
row 16: (10, -1.)  (15, -1.)  (16, 4.)  (17, -1.)  (22, -1.) 
row 17: (11, -1.)  (16, -1.)  (17, 4.)  (23, -1.) 
row 18: (12, -1.)  (18, 4.)  (19, -1.)  (24, -1.) 
row 19: (13, -1.)  (18, -1.)  (19, 4.)  (20, -1.)  (25, -1.) 
row 20: (14, -1.)  (19, -1.)  (20, 4.)  (21, -1.)  (26, -1.) 
row 21: (15, -1.)  (20, -1.)  (21, 4.)  (22, -1.)  (27, -1.) 
row 22: (16, -1.)  (21, -1.)  (22, 4.)  (23, -1.)  (28, -1.) 
row 23: (17, -1.)  (22, -1.)  (23, 4.)  (29, -1.) 
row 24: (18, -1.)  (24, 4.)  (25, -1.)  (30, -1.) 
row 25: (19, -1.)  (24, -1.)  (25, 4.)  (26, -1.)  (31, -1.) 
row 26: (20, -1.)  (25, -1.)  (26, 4.)  (27, -1.)  (32, -1.) 
row 27: (21, -1.)  (26, -1.)  (27, 4.)  (28, -1.)  (33, -1.) 
row 28: (22, -1.)  (27, -1.)  (28, 4.)  (29, -1.)  (34, -1.) 
row 29: (23, -1.)  (28, -1.)  (29, 4.)  (35, -1.) 
row 30: (24, -1.)  (30, 4.)  (31, -1.) 
row 31: (25, -1.)  (30, -1.)  (31, 4.)  (32, -1.) 
row 32: (26, -1.)  (31, 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-21 Thread Barry Smith

> On Oct 21, 2016, at 5:16 PM, Satish Balay  wrote:
> 
> The issue with this test code is - using MatLoad() twice [with the
> same object - without destroying it]. Not sure if that's supposed to
> work...

   If the file has two matrices in it, then yes, a second call to MatLoad() with 
the same matrix should just load the second matrix from the file correctly. 
Perhaps we need a test in our test suite just to make sure that works.

  Barry



> 
> Satish
> 
> On Fri, 21 Oct 2016, Hong wrote:
> 
>> I can reproduce the error on a linux machine with petsc-maint. It crashes
>> at 2nd solve, on both processors:
>> 
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x7f051dc835bd in pdgsequ (A=0x1563910, r=0x176dfe0, c=0x178f7f0,
>>rowcnd=0x7fffcb8dab30, colcnd=0x7fffcb8dab38, amax=0x7fffcb8dab40,
>>info=0x7fffcb8dab4c, grid=0x1563858)
>>at
>> /sandbox/hzhang/petsc/arch-linux-gcc-gfortran/externalpackages/git.superlu_dist/SRC/pdgsequ.c:182
>> 182 c[jcol] = SUPERLU_MAX( c[jcol], fabs(Aval[j]) * r[irow]
>> );
>> 
>> The version of superlu_dist:
>> commit 0b5369f304507f1c7904a913f4c0c86777a60639
>> Author: Xiaoye Li 
>> Date:   Thu May 26 11:33:19 2016 -0700
>> 
>>rename 'struct pair' to 'struct superlu_pair'.
>> 
>> Hong
>> 
>> On Fri, Oct 21, 2016 at 5:36 AM, Anton Popov  wrote:
>> 
>>> 
>>> On 10/19/2016 05:22 PM, Anton Popov wrote:
>>> 
>>> I looked at each valgrind-complained item in your email dated Oct. 11.
>>> Those reports are really superficial; I don't see anything  wrong with
>>> those lines (mostly uninitialized variables) singled out.  I did a few
>>> tests with the latest version in github,  all went fine.
>>> 
>>> Perhaps you can print your matrix that caused problem, I can run it using
>>> your matrix.
>>> 
>>> Sherry
>>> 
>>> Hi Sherry,
>>> 
>>> I finally figured out a minimalistic setup (attached) that reproduces the
>>> problem.
>>> 
>>> I use petsc-maint:
>>> 
>>> git clone -b maint https://bitbucket.org/petsc/petsc.git
>>> 
>>> and configure it in the debug mode without optimization using the options:
>>> 
>>> --download-superlu_dist=1 \
>>> --download-superlu_dist-commit=origin/maint \
>>> 
>>> Compile the test, assuming PETSC_DIR points to the described petsc
>>> installation:
>>> 
>>> make ex16
>>> 
>>> Run with:
>>> 
>>> mpirun -n 2 ./ex16 -f binaryoutput -pc_type lu
>>> -pc_factor_mat_solver_package superlu_dist
>>> 
>>> Matrix partitioning between the processors will be completely the same as
>>> in our code (hard-coded).
>>> 
>>> I factorize the same matrix twice with the same PC object. Remarkably it
>>> runs fine for the first time, but fails for the second.
>>> 
>>> Thank you very much for looking into this problem.
>>> 
>>> Cheers,
>>> Anton
>>> 
>> 
> 



Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-21 Thread Satish Balay
The issue with this test code is - using MatLoad() twice [with the
same object - without destroying it]. Not sure if that's supposed to
work...

Satish

On Fri, 21 Oct 2016, Hong wrote:

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-21 Thread Hong
I can reproduce the error on a linux machine with petsc-maint. It crashes
at 2nd solve, on both processors:

Program received signal SIGSEGV, Segmentation fault.
0x7f051dc835bd in pdgsequ (A=0x1563910, r=0x176dfe0, c=0x178f7f0,
rowcnd=0x7fffcb8dab30, colcnd=0x7fffcb8dab38, amax=0x7fffcb8dab40,
info=0x7fffcb8dab4c, grid=0x1563858)
at
/sandbox/hzhang/petsc/arch-linux-gcc-gfortran/externalpackages/git.superlu_dist/SRC/pdgsequ.c:182
182 c[jcol] = SUPERLU_MAX( c[jcol], fabs(Aval[j]) * r[irow]
);

The version of superlu_dist:
commit 0b5369f304507f1c7904a913f4c0c86777a60639
Author: Xiaoye Li 
Date:   Thu May 26 11:33:19 2016 -0700

rename 'struct pair' to 'struct superlu_pair'.

Hong

On Fri, Oct 21, 2016 at 5:36 AM, Anton Popov  wrote:

>
> On 10/19/2016 05:22 PM, Anton Popov wrote:
>
> I looked at each valgrind-complained item in your email dated Oct. 11.
> Those reports are really superficial; I don't see anything  wrong with
> those lines (mostly uninitialized variables) singled out.  I did a few
> tests with the latest version in github,  all went fine.
>
> Perhaps you can print your matrix that caused problem, I can run it using
>  your matrix.
>
> Sherry
>
> Hi Sherry,
>
> I finally figured out a minimalistic setup (attached) that reproduces the
> problem.
>
> I use petsc-maint:
>
> git clone -b maint https://bitbucket.org/petsc/petsc.git
>
> and configure it in the debug mode without optimization using the options:
>
> --download-superlu_dist=1 \
> --download-superlu_dist-commit=origin/maint \
>
> Compile the test, assuming PETSC_DIR points to the described petsc
> installation:
>
> make ex16
>
> Run with:
>
> mpirun -n 2 ./ex16 -f binaryoutput -pc_type lu
> -pc_factor_mat_solver_package superlu_dist
>
> Matrix partitioning between the processors will be completely the same as
> in our code (hard-coded).
>
> I factorize the same matrix twice with the same PC object. Remarkably it
> runs fine for the first time, but fails for the second.
>
> Thank you very much for looking into this problem.
>
> Cheers,
> Anton
>
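The reuse pattern described above (two numeric factorizations with the same PC object, the second one crashing) can be sketched roughly as follows. This is an illustrative sketch, not the actual ex16 code; it assumes `A`, `b`, `x` are already created and assembled, and uses the PETSc 3.7-era option names:

```c
/* Sketch of the failing reuse pattern: two numeric factorizations with the
   same PC object. Assumes Mat A, Vec b, x are assembled; KSP ksp, PC pc,
   PetscErrorCode ierr are declared. */
ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
ierr = PCSetType(pc, PCLU);CHKERRQ(ierr);
ierr = PCFactorSetMatSolverPackage(pc, MATSOLVERSUPERLU_DIST);CHKERRQ(ierr);
ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);

ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);   /* first factorization: works  */

/* re-setting the operators marks the matrix as changed, so the next solve
   repeats the numeric factorization, reusing the cached symbolic data */
ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);   /* second factorization: crash */
```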


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-19 Thread Anton Popov

Thank you, Sherry, for your efforts.

But before I can set up an example that reproduces the problem, I have to
ask a PETSc-related question.


When I dump a matrix via MatView and load it back via MatLoad, its original partitioning is ignored.

Say originally I have 100 and 110 equations on two processors, after 
MatLoad I will have 105 and 105 also on two processors.


What do I do to pass the partitioning info through MatView/MatLoad?

I guess it's important for reproducing my setup exactly.

Thanks
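
For reference, MatLoad respects local sizes that are set on the matrix before the load, so a custom layout can be imposed explicitly. A minimal sketch, assuming the two-rank 100/110 split mentioned above (the file name and the square local column split are placeholders, and error handling is abbreviated):

```c
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscViewer    viewer;
  PetscMPIInt    rank;
  PetscInt       nloc;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);

  /* desired local sizes: 100 rows on rank 0, 110 rows on rank 1 */
  nloc = (rank == 0) ? 100 : 110;

  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
  /* setting local sizes BEFORE MatLoad overrides the default equal split */
  ierr = MatSetSizes(A, nloc, nloc, PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);

  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "binaryoutput",
                               FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
```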


On 10/19/2016 08:06 AM, Xiaoye S. Li wrote:
I looked at each valgrind-complained item in your email dated Oct. 
11.  Those reports are really superficial; I don't see anything  wrong 
with those lines (mostly uninitialized variables) singled out.  I did 
a few tests with the latest version in github,  all went fine.


Perhaps you can print your matrix that caused problem, I can run it 
using  your matrix.


Sherry


On Tue, Oct 11, 2016 at 2:18 PM, Anton wrote:




On 10/11/16 7:19 PM, Satish Balay wrote:

This log looks truncated. Are there any valgrind messages
before this?
[like from your application code - or from MPI]

Yes it is indeed truncated. I only included relevant messages.


Perhaps you can send the complete log - with:
valgrind -q --tool=memcheck --leak-check=yes --num-callers=20
--track-origins=yes

[and if there were more valgrind messages from MPI - rebuild petsc

There are no messages originating from our code, just a few MPI
related ones (probably false positives) and from SuperLU_DIST
(most of them).

Thanks,
Anton

with --download-mpich - for a valgrind clean mpi]

Sherry,
Perhaps this log points to some issue in superlu_dist?

thanks,
Satish

On Tue, 11 Oct 2016, Anton Popov wrote:

Valgrind immediately detects interesting stuff:

==25673== Use of uninitialised value of size 8
==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674== Use of uninitialised value of size 8
==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25673== Conditional jump or move depends on uninitialised value(s)
==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25673== Conditional jump or move depends on uninitialised value(s)
==25673==    at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25674== Use of uninitialised value of size 8
==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
==25674==    by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25674== Use of uninitialised value of size 8
==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

And it crashes after this:

==25674== Invalid write of size 4
==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
==25674==
[1]PETSC ERROR:
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range

On 10/11/2016 03:26 PM, Anton Popov wrote:

On 10/10/2016 07:11 PM, Satish Balay wrote:


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-19 Thread Xiaoye S. Li
I looked at each valgrind-complained item in your email dated Oct. 11.
Those reports are really superficial; I don't see anything wrong with
those lines (mostly uninitialized variables) singled out. I did a few
tests with the latest version on GitHub; all went fine.

Perhaps you can print the matrix that caused the problem, and I can run it using
your matrix.

Sherry


On Tue, Oct 11, 2016 at 2:18 PM, Anton  wrote:

>
>
> On 10/11/16 7:19 PM, Satish Balay wrote:
>
>> This log looks truncated. Are there any valgrind messages before this?
>> [like from your application code - or from MPI]
>>
> Yes it is indeed truncated. I only included relevant messages.
>
>>
>> Perhaps you can send the complete log - with:
>> valgrind -q --tool=memcheck --leak-check=yes --num-callers=20
>> --track-origins=yes
>>
>> [and if there were more valgrind messages from MPI - rebuild petsc
>>
> There are no messages originating from our code, just a few MPI related
> ones (probably false positives) and from SuperLU_DIST (most of them).
>
> Thanks,
> Anton
>
> with --download-mpich - for a valgrind clean mpi]
>>
>> Sherry,
>> Perhaps this log points to some issue in superlu_dist?
>>
>> thanks,
>> Satish
>>
>> On Tue, 11 Oct 2016, Anton Popov wrote:
>>
>> Valgrind immediately detects interesting stuff:
>>>
>>> ==25673== Use of uninitialised value of size 8
>>> ==25673==at 0x178272C: static_schedule (static_schedule.c:960)
>>> ==25674== Use of uninitialised value of size 8
>>> ==25674==at 0x178272C: static_schedule (static_schedule.c:960)
>>> ==25674==by 0x174E74E: pdgstrf (pdgstrf.c:572)
>>> ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>>
>>> ==25673== Conditional jump or move depends on uninitialised value(s)
>>> ==25673==at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
>>> ==25673==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>>
>>> ==25673== Conditional jump or move depends on uninitialised value(s)
>>> ==25673==at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1
>>> .0)
>>> ==25673==by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>>> ==25673==by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>>> ==25673==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> ==25674== Use of uninitialised value of size 8
>>> ==25674==at 0x62BF72B: _itoa_word (_itoa.c:179)
>>> ==25674==by 0x62C1289: printf_positional (vfprintf.c:2022)
>>> ==25674==by 0x62C2465: vfprintf (vfprintf.c:1677)
>>> ==25674==by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
>>> ==25674==by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
>>> ==25674==by 0x5CC6C08: MPIR_Err_create_code_valist (in
>>> /opt/mpich3/lib/libmpi.so.12.1.0)
>>> ==25674==by 0x5CC7A9A: MPIR_Err_create_code (in
>>> /opt/mpich3/lib/libmpi.so.12.1.0)
>>> ==25674==by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1
>>> .0)
>>> ==25674==by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>>> ==25674==by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>>> ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> ==25674== Use of uninitialised value of size 8
>>> ==25674==at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
>>> ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>>
>>> And it crashes after this:
>>>
>>> ==25674== Invalid write of size 4
>>> ==25674==at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
>>> ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>>> ==25674==by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST
>>> (superlu_dist.c:421)
>>> ==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
>>> ==25674==
>>> [1]PETSC ERROR:
>>> 
>>> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>>> probably
>>> memory access out of range
>>>
>>>
>>> On 10/11/2016 03:26 PM, Anton Popov wrote:
>>>
 On 10/10/2016 07:11 PM, Satish Balay wrote:

> Thats from petsc-3.5
>
> Anton - please post the stack trace you get with
> --download-superlu_dist-commit=origin/maint
>
 I guess this is it:

 [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421
 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
 [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282
 /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
 [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985
 /home/anton/LIB/petsc/src/mat/interface/matrix.c
 [0]PETSC ERROR: [0] PCSetUp_LU line 101
 /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
 [0]PETSC ERROR: [0] PCSetUp line 930
 /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c

 According to the line numbers it crashes within
 MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.

 Surprisingly this only happens on the second SNES iteration, but not on
 the
 first.

 I'm trying to reproduce this behavior with PETSc KSP and SNES examples.

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-11 Thread Anton



On 10/11/16 7:19 PM, Satish Balay wrote:

This log looks truncated. Are there any valgrind messages before this?
[like from your application code - or from MPI]

Yes it is indeed truncated. I only included relevant messages.


Perhaps you can send the complete log - with:
valgrind -q --tool=memcheck --leak-check=yes --num-callers=20 
--track-origins=yes

[and if there were more valgrind messages from MPI - rebuild petsc
There are no messages originating from our code, just a few MPI-related
ones (probably false positives) and from SuperLU_DIST (most of them).


Thanks,
Anton

with --download-mpich - for a valgrind clean mpi]

Sherry,
Perhaps this log points to some issue in superlu_dist?

thanks,
Satish

On Tue, 11 Oct 2016, Anton Popov wrote:


Valgrind immediately detects interesting stuff:

==25673== Use of uninitialised value of size 8
==25673==at 0x178272C: static_schedule (static_schedule.c:960)
==25674== Use of uninitialised value of size 8
==25674==at 0x178272C: static_schedule (static_schedule.c:960)
==25674==by 0x174E74E: pdgstrf (pdgstrf.c:572)
==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)


==25673== Conditional jump or move depends on uninitialised value(s)
==25673==at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
==25673==by 0x1733954: pdgssvx (pdgssvx.c:1124)


==25673== Conditional jump or move depends on uninitialised value(s)
==25673==at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25673==by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25673==by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25673==by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25674== Use of uninitialised value of size 8
==25674==at 0x62BF72B: _itoa_word (_itoa.c:179)
==25674==by 0x62C1289: printf_positional (vfprintf.c:2022)
==25674==by 0x62C2465: vfprintf (vfprintf.c:1677)
==25674==by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
==25674==by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
==25674==by 0x5CC6C08: MPIR_Err_create_code_valist (in
/opt/mpich3/lib/libmpi.so.12.1.0)
==25674==by 0x5CC7A9A: MPIR_Err_create_code (in
/opt/mpich3/lib/libmpi.so.12.1.0)
==25674==by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25674==by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25674== Use of uninitialised value of size 8
==25674==at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)

And it crashes after this:

==25674== Invalid write of size 4
==25674==at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25674==by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
==25674==
[1]PETSC ERROR:

[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably
memory access out of range


On 10/11/2016 03:26 PM, Anton Popov wrote:

On 10/10/2016 07:11 PM, Satish Balay wrote:

Thats from petsc-3.5

Anton - please post the stack trace you get with
--download-superlu_dist-commit=origin/maint

I guess this is it:

[0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421
/home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282
/home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[0]PETSC ERROR: [0] MatLUFactorNumeric line 2985
/home/anton/LIB/petsc/src/mat/interface/matrix.c
[0]PETSC ERROR: [0] PCSetUp_LU line 101
/home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
[0]PETSC ERROR: [0] PCSetUp line 930
/home/anton/LIB/petsc/src/ksp/pc/interface/precon.c

According to the line numbers it crashes within
MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.

Surprisingly this only happens on the second SNES iteration, but not on the
first.

I'm trying to reproduce this behavior with PETSc KSP and SNES examples.
However, everything I've tried up to now with SuperLU_DIST does just fine.

I'm also checking our code in Valgrind to make sure it's clean.

Anton

Satish


On Mon, 10 Oct 2016, Xiaoye S. Li wrote:


Which version of superlu_dist does this capture?   I looked at the
original
error  log, it pointed to pdgssvx: line 161.  But that line is in
comment
block, not the program.

Sherry


On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov  wrote:


On 10/07/2016 05:23 PM, Satish Balay wrote:


On Fri, 7 Oct 2016, Kong, Fande wrote:

On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay 
wrote:

On Fri, 7 Oct 2016, Anton Popov wrote:

Hi guys,

are there any news about fixing buggy behavior of
SuperLU_DIST, exactly


what


is described here:

https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.



Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-11 Thread Satish Balay
On Tue, 11 Oct 2016, Anton wrote:

> 
> 
> On 10/11/16 7:44 PM, Barry Smith wrote:
> > You can run your code with -ksp_view_mat binary -ksp_view_rhs binary
> > this will cause it to save the matrices and right hand sides to the
> > linear systems in a file called binaryoutput, then email the file to
> > petsc-ma...@mcs.anl.gov (don't worry this email address accepts large
> > attachments). And tell us how many processes you ran on that produced
> > the problems.
> >
> > Barry
> >
> 
> I'll do that, but I just wonder which version of SuperLU_DIST is used in
> 3.7.4?
> 
> The latest version available on http://crd-legacy.lbl.gov/~xiaoye/SuperLU/ is
> 5.1.1 which is a week old and includes bug fixes.

This is the version you essentially got - when you configured with 
--download-superlu_dist-commit=origin/maint

Satish

> 
> Maybe we're facing a problem that is already solved.
> 
> Thanks,
> Anton
> >
> > > On Oct 11, 2016, at 12:19 PM, Satish Balay  wrote:
> > >
> > > This log looks truncated. Are there any valgrind messages before this?
> > > [like from your application code - or from MPI]
> > >
> > > Perhaps you can send the complete log - with:
> > > valgrind -q --tool=memcheck --leak-check=yes --num-callers=20
> > > --track-origins=yes
> > >
> > > [and if there were more valgrind messages from MPI - rebuild petsc
> > > with --download-mpich - for a valgrind clean mpi]
> > >
> > > Sherry,
> > > Perhaps this log points to some issue in superlu_dist?
> > >
> > > thanks,
> > > Satish
> > >
> > > On Tue, 11 Oct 2016, Anton Popov wrote:
> > >
> > > > Valgrind immediately detects interesting stuff:
> > > >
> > > > ==25673== Use of uninitialised value of size 8
> > > > ==25673==at 0x178272C: static_schedule (static_schedule.c:960)
> > > > ==25674== Use of uninitialised value of size 8
> > > > ==25674==at 0x178272C: static_schedule (static_schedule.c:960)
> > > > ==25674==by 0x174E74E: pdgstrf (pdgstrf.c:572)
> > > > ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > >
> > > >
> > > > ==25673== Conditional jump or move depends on uninitialised value(s)
> > > > ==25673==at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
> > > > ==25673==by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > >
> > > >
> > > > ==25673== Conditional jump or move depends on uninitialised value(s)
> > > > ==25673==at 0x5C83F43: PMPI_Recv (in
> > > > /opt/mpich3/lib/libmpi.so.12.1.0)
> > > > ==25673==by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
> > > > ==25673==by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
> > > > ==25673==by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > >
> > > > ==25674== Use of uninitialised value of size 8
> > > > ==25674==at 0x62BF72B: _itoa_word (_itoa.c:179)
> > > > ==25674==by 0x62C1289: printf_positional (vfprintf.c:2022)
> > > > ==25674==by 0x62C2465: vfprintf (vfprintf.c:1677)
> > > > ==25674==by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
> > > > ==25674==by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
> > > > ==25674==by 0x5CC6C08: MPIR_Err_create_code_valist (in
> > > > /opt/mpich3/lib/libmpi.so.12.1.0)
> > > > ==25674==by 0x5CC7A9A: MPIR_Err_create_code (in
> > > > /opt/mpich3/lib/libmpi.so.12.1.0)
> > > > ==25674==by 0x5C83FB1: PMPI_Recv (in
> > > > /opt/mpich3/lib/libmpi.so.12.1.0)
> > > > ==25674==by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
> > > > ==25674==by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
> > > > ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > >
> > > > ==25674== Use of uninitialised value of size 8
> > > > ==25674==at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
> > > > ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > >
> > > > And it crashes after this:
> > > >
> > > > ==25674== Invalid write of size 4
> > > > ==25674==at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
> > > > ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
> > > > ==25674==by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST
> > > > (superlu_dist.c:421)
> > > > ==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
> > > > ==25674==
> > > > [1]PETSC ERROR:
> > > > 
> > > > [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> > > > probably
> > > > memory access out of range
> > > >
> > > >
> > > > On 10/11/2016 03:26 PM, Anton Popov wrote:
> > > > > On 10/10/2016 07:11 PM, Satish Balay wrote:
> > > > > > Thats from petsc-3.5
> > > > > >
> > > > > > Anton - please post the stack trace you get with
> > > > > > --download-superlu_dist-commit=origin/maint
> > > > > I guess this is it:
> > > > >
> > > > > [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421
> > > > > /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> > > > > [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282
> > > > > 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-11 Thread Anton



On 10/11/16 7:44 PM, Barry Smith wrote:

You can run your code with -ksp_view_mat binary -ksp_view_rhs binary this 
will cause it to save the matrices and right hand sides to the linear systems 
in a file called binaryoutput, then email the file to petsc-ma...@mcs.anl.gov 
(don't worry this email address accepts large attachments). And tell us how 
many processes you ran on that produced the problems.

Barry



I'll do that, but I just wonder which version of SuperLU_DIST is used in 
3.7.4?


The latest version available on 
http://crd-legacy.lbl.gov/~xiaoye/SuperLU/ is 5.1.1 which is a week old 
and includes bug fixes.


Maybe we're facing a problem that is already solved.

Thanks,
Anton



On Oct 11, 2016, at 12:19 PM, Satish Balay  wrote:

This log looks truncated. Are there any valgrind messages before this?
[like from your application code - or from MPI]

Perhaps you can send the complete log - with:
valgrind -q --tool=memcheck --leak-check=yes --num-callers=20 
--track-origins=yes

[and if there were more valgrind messages from MPI - rebuild petsc
with --download-mpich - for a valgrind clean mpi]

Sherry,
Perhaps this log points to some issue in superlu_dist?

thanks,
Satish

On Tue, 11 Oct 2016, Anton Popov wrote:


Valgrind immediately detects interesting stuff:

==25673== Use of uninitialised value of size 8
==25673==at 0x178272C: static_schedule (static_schedule.c:960)
==25674== Use of uninitialised value of size 8
==25674==at 0x178272C: static_schedule (static_schedule.c:960)
==25674==by 0x174E74E: pdgstrf (pdgstrf.c:572)
==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)


==25673== Conditional jump or move depends on uninitialised value(s)
==25673==at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
==25673==by 0x1733954: pdgssvx (pdgssvx.c:1124)


==25673== Conditional jump or move depends on uninitialised value(s)
==25673==at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25673==by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25673==by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25673==by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25674== Use of uninitialised value of size 8
==25674==at 0x62BF72B: _itoa_word (_itoa.c:179)
==25674==by 0x62C1289: printf_positional (vfprintf.c:2022)
==25674==by 0x62C2465: vfprintf (vfprintf.c:1677)
==25674==by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
==25674==by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
==25674==by 0x5CC6C08: MPIR_Err_create_code_valist (in
/opt/mpich3/lib/libmpi.so.12.1.0)
==25674==by 0x5CC7A9A: MPIR_Err_create_code (in
/opt/mpich3/lib/libmpi.so.12.1.0)
==25674==by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25674==by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25674== Use of uninitialised value of size 8
==25674==at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)

And it crashes after this:

==25674== Invalid write of size 4
==25674==at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25674==by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
==25674==
[1]PETSC ERROR:

[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably
memory access out of range


On 10/11/2016 03:26 PM, Anton Popov wrote:

On 10/10/2016 07:11 PM, Satish Balay wrote:

Thats from petsc-3.5

Anton - please post the stack trace you get with
--download-superlu_dist-commit=origin/maint

I guess this is it:

[0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421
/home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282
/home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[0]PETSC ERROR: [0] MatLUFactorNumeric line 2985
/home/anton/LIB/petsc/src/mat/interface/matrix.c
[0]PETSC ERROR: [0] PCSetUp_LU line 101
/home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
[0]PETSC ERROR: [0] PCSetUp line 930
/home/anton/LIB/petsc/src/ksp/pc/interface/precon.c

According to the line numbers it crashes within
MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.

Surprisingly this only happens on the second SNES iteration, but not on the
first.

I'm trying to reproduce this behavior with PETSc KSP and SNES examples.
However, everything I've tried up to now with SuperLU_DIST does just fine.

I'm also checking our code in Valgrind to make sure it's clean.

Anton

Satish


On Mon, 10 Oct 2016, Xiaoye S. Li wrote:


Which version of superlu_dist does this capture?   I looked at the
original
error  log, it pointed to pdgssvx: line 161.  But that line is in
comment
block, 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-11 Thread Barry Smith

   You can run your code with -ksp_view_mat binary -ksp_view_rhs binary; this 
will save the matrices and right-hand sides of the linear systems to a file 
called binaryoutput. Then email the file to petsc-ma...@mcs.anl.gov 
(don't worry, this email address accepts large attachments), and tell us how 
many processes you ran on that produced the problems.
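
A hypothetical launch line for this (the executable name, solver options, and process count are placeholders to adapt):

```shell
# Save the linear systems while running the usual solve; the matrices and
# right-hand sides are appended to ./binaryoutput
mpirun -n 2 ./your_app -pc_type lu -pc_factor_mat_solver_package superlu_dist \
       -ksp_view_mat binary -ksp_view_rhs binary
```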

   Barry



> On Oct 11, 2016, at 12:19 PM, Satish Balay  wrote:
> 
> This log looks truncated. Are there any valgrind messages before this?
> [like from your application code - or from MPI]
> 
> Perhaps you can send the complete log - with:
> valgrind -q --tool=memcheck --leak-check=yes --num-callers=20 
> --track-origins=yes
> 
> [and if there were more valgrind messages from MPI - rebuild petsc
> with --download-mpich - for a valgrind clean mpi]
> 
> Sherry,
> Perhaps this log points to some issue in superlu_dist?
> 
> thanks,
> Satish
> 
> On Tue, 11 Oct 2016, Anton Popov wrote:
> 
>> Valgrind immediately detects interesting stuff:
>> 
>> ==25673== Use of uninitialised value of size 8
>> ==25673==at 0x178272C: static_schedule (static_schedule.c:960)
>> ==25674== Use of uninitialised value of size 8
>> ==25674==at 0x178272C: static_schedule (static_schedule.c:960)
>> ==25674==by 0x174E74E: pdgstrf (pdgstrf.c:572)
>> ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>> 
>> 
>> ==25673== Conditional jump or move depends on uninitialised value(s)
>> ==25673==at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
>> ==25673==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>> 
>> 
>> ==25673== Conditional jump or move depends on uninitialised value(s)
>> ==25673==at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
>> ==25673==by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>> ==25673==by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>> ==25673==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>> 
>> ==25674== Use of uninitialised value of size 8
>> ==25674==at 0x62BF72B: _itoa_word (_itoa.c:179)
>> ==25674==by 0x62C1289: printf_positional (vfprintf.c:2022)
>> ==25674==by 0x62C2465: vfprintf (vfprintf.c:1677)
>> ==25674==by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
>> ==25674==by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
>> ==25674==by 0x5CC6C08: MPIR_Err_create_code_valist (in
>> /opt/mpich3/lib/libmpi.so.12.1.0)
>> ==25674==by 0x5CC7A9A: MPIR_Err_create_code (in
>> /opt/mpich3/lib/libmpi.so.12.1.0)
>> ==25674==by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
>> ==25674==by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
>> ==25674==by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
>> ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>> 
>> ==25674== Use of uninitialised value of size 8
>> ==25674==at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
>> ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>> 
>> And it crashes after this:
>> 
>> ==25674== Invalid write of size 4
>> ==25674==at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
>> ==25674==by 0x1733954: pdgssvx (pdgssvx.c:1124)
>> ==25674==by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST 
>> (superlu_dist.c:421)
>> ==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
>> ==25674==
>> [1]PETSC ERROR:
>> 
>> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, 
>> probably
>> memory access out of range
>> 
>> 
>> On 10/11/2016 03:26 PM, Anton Popov wrote:
>>> 
>>> On 10/10/2016 07:11 PM, Satish Balay wrote:
 Thats from petsc-3.5
 
 Anton - please post the stack trace you get with
 --download-superlu_dist-commit=origin/maint
>>> 
>>> I guess this is it:
>>> 
>>> [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421
>>> /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>> [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282
>>> /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
>>> [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985
>>> /home/anton/LIB/petsc/src/mat/interface/matrix.c
>>> [0]PETSC ERROR: [0] PCSetUp_LU line 101
>>> /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
>>> [0]PETSC ERROR: [0] PCSetUp line 930
>>> /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
>>> 
>>> According to the line numbers it crashes within
>>> MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.
>>> 
>>> Surprisingly this only happens on the second SNES iteration, but not on the
>>> first.
>>> 
>>> I'm trying to reproduce this behavior with PETSc KSP and SNES examples.
>>> However, everything I've tried up to now with SuperLU_DIST does just fine.
>>> 
>>> I'm also checking our code in Valgrind to make sure it's clean.
>>> 
>>> Anton
 
 Satish
 
 
 On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
 
> Which version of superlu_dist does this capture?   I looked at the
> original
> error  log, it pointed 

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-11 Thread Satish Balay
This log looks truncated. Are there any valgrind messages before this?
[like from your application code - or from MPI]

Perhaps you can send the complete log - with:
valgrind -q --tool=memcheck --leak-check=yes --num-callers=20 
--track-origins=yes

[and if there were more valgrind messages from MPI - rebuild petsc
with --download-mpich - for a valgrind clean mpi]
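
For an MPI run, valgrind is typically placed inside the mpirun command so that it wraps each rank; a sketch with the options above (the executable name and solver options are placeholders):

```shell
# Run each MPI rank under valgrind; %p expands to the rank's PID, so each
# rank writes its own log file.
mpirun -n 2 valgrind -q --tool=memcheck --leak-check=yes --num-callers=20 \
       --track-origins=yes --log-file=valgrind.%p.log \
       ./your_app -pc_type lu -pc_factor_mat_solver_package superlu_dist
```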

Sherry,
Perhaps this log points to some issue in superlu_dist?

thanks,
Satish

On Tue, 11 Oct 2016, Anton Popov wrote:

> Valgrind immediately detects interesting stuff:
> 
> ==25673== Use of uninitialised value of size 8
> ==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
> ==25674== Use of uninitialised value of size 8
> ==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
> ==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> ==25673== Conditional jump or move depends on uninitialised value(s)
> ==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
> ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> ==25673== Conditional jump or move depends on uninitialised value(s)
> ==25673==    at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
> ==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
> ==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> ==25674== Use of uninitialised value of size 8
> ==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
> ==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
> ==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
> ==25674==    by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
> ==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
> ==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25674==    by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
> ==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
> ==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> ==25674== Use of uninitialised value of size 8
> ==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
>
> And it crashes after this:
>
> ==25674== Invalid write of size 4
> ==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
> ==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
> ==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
> ==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
> ==25674==
> [1]PETSC ERROR:
> [1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> 
> 
> On 10/11/2016 03:26 PM, Anton Popov wrote:
> >
> > On 10/10/2016 07:11 PM, Satish Balay wrote:
> > > Thats from petsc-3.5
> > >
> > > Anton - please post the stack trace you get with
> > > --download-superlu_dist-commit=origin/maint
> >
> > I guess this is it:
> >
> > [0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421
> > /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> > [0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282
> > /home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
> > [0]PETSC ERROR: [0] MatLUFactorNumeric line 2985
> > /home/anton/LIB/petsc/src/mat/interface/matrix.c
> > [0]PETSC ERROR: [0] PCSetUp_LU line 101
> > /home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
> > [0]PETSC ERROR: [0] PCSetUp line 930
> > /home/anton/LIB/petsc/src/ksp/pc/interface/precon.c
> >
> > According to the line numbers it crashes within
> > MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.
> >
> > Surprisingly this only happens on the second SNES iteration, but not on the
> > first.
> >
> > I'm trying to reproduce this behavior with PETSc KSP and SNES examples.
> > However, everything I've tried up to now with SuperLU_DIST does just fine.
> >
> > I'm also checking our code in Valgrind to make sure it's clean.
> >
> > Anton
> > >
> > > Satish
> > >
> > >
> > > On Mon, 10 Oct 2016, Xiaoye S. Li wrote:
> > >
> > > > Which version of superlu_dist does this capture?   I looked at the
> > > > original
> > > > error  log, it pointed to pdgssvx: line 161.  But that line is in
> > > > comment
> > > > block, not the program.
> > > >
> > > > Sherry
> > > >
> > > >
> > > > On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov  wrote:
> > > >
> > > > >
> > > > > On 10/07/2016 05:23 PM, Satish Balay wrote:
> > > > >
> > > > > > On Fri, 7 Oct 2016, Kong, Fande wrote:
> > > > > >
> > > > > > On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay 
> > > > > > wrote:
> > > > > > > On Fri, 7 Oct 2016, Anton Popov wrote:
> > > > > > > > Hi guys,
> > > > > > > > > are there any news about fixing buggy behavior of SuperLU_DIST, exactly what is described here: http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-11 Thread Anton Popov

Valgrind immediately detects interesting stuff:

==25673== Use of uninitialised value of size 8
==25673==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674== Use of uninitialised value of size 8
==25674==    at 0x178272C: static_schedule (static_schedule.c:960)
==25674==    by 0x174E74E: pdgstrf (pdgstrf.c:572)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25673== Conditional jump or move depends on uninitialised value(s)
==25673==    at 0x1752143: pdgstrf (dlook_ahead_update.c:24)
==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25673== Conditional jump or move depends on uninitialised value(s)
==25673==    at 0x5C83F43: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25673==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25673==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25673==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25674== Use of uninitialised value of size 8
==25674==    at 0x62BF72B: _itoa_word (_itoa.c:179)
==25674==    by 0x62C1289: printf_positional (vfprintf.c:2022)
==25674==    by 0x62C2465: vfprintf (vfprintf.c:1677)
==25674==    by 0x638AFD5: __vsnprintf_chk (vsnprintf_chk.c:63)
==25674==    by 0x638AF37: __snprintf_chk (snprintf_chk.c:34)
==25674==    by 0x5CC6C08: MPIR_Err_create_code_valist (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x5CC7A9A: MPIR_Err_create_code (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x5C83FB1: PMPI_Recv (in /opt/mpich3/lib/libmpi.so.12.1.0)
==25674==    by 0x1755385: pdgstrf2_trsm (pdgstrf2.c:253)
==25674==    by 0x1751E4F: pdgstrf (dlook_ahead_update.c:195)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

==25674== Use of uninitialised value of size 8
==25674==    at 0x1751E92: pdgstrf (dlook_ahead_update.c:205)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)

And it crashes after this:

==25674== Invalid write of size 4
==25674==    at 0x1751F2F: pdgstrf (dlook_ahead_update.c:211)
==25674==    by 0x1733954: pdgssvx (pdgssvx.c:1124)
==25674==    by 0xAAEFAE: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:421)
==25674==  Address 0xa0 is not stack'd, malloc'd or (recently) free'd
==25674==
[1]PETSC ERROR:
[1]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range



On 10/11/2016 03:26 PM, Anton Popov wrote:


On 10/10/2016 07:11 PM, Satish Balay wrote:

Thats from petsc-3.5

Anton - please post the stack trace you get with 
--download-superlu_dist-commit=origin/maint


I guess this is it:

[0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421 
/home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282 
/home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[0]PETSC ERROR: [0] MatLUFactorNumeric line 2985 
/home/anton/LIB/petsc/src/mat/interface/matrix.c
[0]PETSC ERROR: [0] PCSetUp_LU line 101 
/home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
[0]PETSC ERROR: [0] PCSetUp line 930 
/home/anton/LIB/petsc/src/ksp/pc/interface/precon.c


According to the line numbers it crashes within 
MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.


Surprisingly this only happens on the second SNES iteration, but not 
on the first.


I'm trying to reproduce this behavior with PETSc KSP and SNES 
examples. However, everything I've tried up to now with SuperLU_DIST 
does just fine.


I'm also checking our code in Valgrind to make sure it's clean.

Anton


Satish


On Mon, 10 Oct 2016, Xiaoye S. Li wrote:

Which version of superlu_dist does this capture?   I looked at the 
original
error  log, it pointed to pdgssvx: line 161.  But that line is in 
comment

block, not the program.

Sherry


On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov  
wrote:




On 10/07/2016 05:23 PM, Satish Balay wrote:


On Fri, 7 Oct 2016, Kong, Fande wrote:

On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  
wrote:

On Fri, 7 Oct 2016, Anton Popov wrote:

Hi guys,
are there any news about fixing buggy behavior of SuperLU_DIST, 
exactly



what


is described here:

http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?

I'm using 3.7.4 and still get SEGV in pdgssvx routine. 
Everything works



fine


with 3.5.4.

Do I still have to stick to maint branch, and what are the 
chances for



these


fixes to be included in 3.7.5?


3.7.4. is off maint branch [as of a week ago]. So if you are seeing
issues with it - its best to debug and figure out the cause.

This bug is indeed inside of superlu_dist, and we started having this
issue from PETSc-3.6.x. I think superlu_dist developers should have
fixed this bug. We forgot to update superlu_dist??  This is not a thing
users could debug and fix.

Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-11 Thread Anton Popov


On 10/10/2016 07:11 PM, Satish Balay wrote:

Thats from petsc-3.5

Anton - please post the stack trace you get with  
--download-superlu_dist-commit=origin/maint


I guess this is it:

[0]PETSC ERROR: [0] SuperLU_DIST:pdgssvx line 421 
/home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[0]PETSC ERROR: [0] MatLUFactorNumeric_SuperLU_DIST line 282 
/home/anton/LIB/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[0]PETSC ERROR: [0] MatLUFactorNumeric line 2985 
/home/anton/LIB/petsc/src/mat/interface/matrix.c
[0]PETSC ERROR: [0] PCSetUp_LU line 101 
/home/anton/LIB/petsc/src/ksp/pc/impls/factor/lu/lu.c
[0]PETSC ERROR: [0] PCSetUp line 930 
/home/anton/LIB/petsc/src/ksp/pc/interface/precon.c


According to the line numbers it crashes within 
MatLUFactorNumeric_SuperLU_DIST while calling pdgssvx.


Surprisingly this only happens on the second SNES iteration, but not on 
the first.


I'm trying to reproduce this behavior with PETSc KSP and SNES examples. 
However, everything I've tried up to now with SuperLU_DIST does just fine.


I'm also checking our code in Valgrind to make sure it's clean.

Anton


Satish


On Mon, 10 Oct 2016, Xiaoye S. Li wrote:


Which version of superlu_dist does this capture?   I looked at the original
error  log, it pointed to pdgssvx: line 161.  But that line is in comment
block, not the program.

Sherry


On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov  wrote:



On 10/07/2016 05:23 PM, Satish Balay wrote:


On Fri, 7 Oct 2016, Kong, Fande wrote:

On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:

On Fri, 7 Oct 2016, Anton Popov wrote:

Hi guys,

are there any news about fixing buggy behavior of SuperLU_DIST, exactly


what


is described here:

http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?


I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works


fine


with 3.5.4.

Do I still have to stick to maint branch, and what are the chances for


these


fixes to be included in 3.7.5?


3.7.4. is off maint branch [as of a week ago]. So if you are seeing
issues with it - its best to debug and figure out the cause.

This bug is indeed inside of superlu_dist, and we started having this

issue
from PETSc-3.6.x. I think superlu_dist developers should have fixed this
bug. We forgot to update superlu_dist??  This is not a thing users could
debug and fix.

I have many people in INL suffering from this issue, and they have to
stay
with PETSc-3.5.4 to use superlu_dist.


To verify if the bug is fixed in latest superlu_dist - you can try
[assuming you have git - either from petsc-3.7/maint/master]:

--download-superlu_dist --download-superlu_dist-commit=origin/maint


Satish

Hi Satish,

I did this:

git clone -b maint https://bitbucket.org/petsc/petsc.git petsc

--download-superlu_dist
--download-superlu_dist-commit=origin/maint (not sure this is needed,
since I'm already in maint)

The problem is still there.

Cheers,
Anton





Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-10 Thread Satish Balay
That's from petsc-3.5

Anton - please post the stack trace you get with  
--download-superlu_dist-commit=origin/maint

Satish


On Mon, 10 Oct 2016, Xiaoye S. Li wrote:

> Which version of superlu_dist does this capture?   I looked at the original
> error  log, it pointed to pdgssvx: line 161.  But that line is in comment
> block, not the program.
> 
> Sherry
> 
> 
> On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov  wrote:
> 
> >
> >
> > On 10/07/2016 05:23 PM, Satish Balay wrote:
> >
> >> On Fri, 7 Oct 2016, Kong, Fande wrote:
> >>
> >> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:
> >>>
> >>> On Fri, 7 Oct 2016, Anton Popov wrote:
> 
>  Hi guys,
> >
> > are there any news about fixing buggy behavior of SuperLU_DIST, exactly
> >
>  what
> 
> > is described here:
> >
> > http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
> 
> > I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
> >
>  fine
> 
> > with 3.5.4.
> >
> > Do I still have to stick to maint branch, and what are the chances for
> >
>  these
> 
> > fixes to be included in 3.7.5?
> >
>  3.7.4. is off maint branch [as of a week ago]. So if you are seeing
>  issues with it - its best to debug and figure out the cause.
> 
>  This bug is indeed inside of superlu_dist, and we started having this
> >>> issue
> >>> from PETSc-3.6.x. I think superlu_dist developers should have fixed this
> >>> bug. We forgot to update superlu_dist??  This is not a thing users could
> >>> debug and fix.
> >>>
> >>> I have many people in INL suffering from this issue, and they have to
> >>> stay
> >>> with PETSc-3.5.4 to use superlu_dist.
> >>>
> >> To verify if the bug is fixed in latest superlu_dist - you can try
> >> [assuming you have git - either from petsc-3.7/maint/master]:
> >>
> >> --download-superlu_dist --download-superlu_dist-commit=origin/maint
> >>
> >>
> >> Satish
> >>
> >> Hi Satish,
> > I did this:
> >
> > git clone -b maint https://bitbucket.org/petsc/petsc.git petsc
> >
> > --download-superlu_dist
> > --download-superlu_dist-commit=origin/maint (not sure this is needed,
> > since I'm already in maint)
> >
> > The problem is still there.
> >
> > Cheers,
> > Anton
> >
> 



Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-10 Thread Xiaoye S. Li
Which version of superlu_dist does this capture? I looked at the original
error log; it pointed to pdgssvx: line 161. But that line is in a comment
block, not in the program.

Sherry


On Mon, Oct 10, 2016 at 7:27 AM, Anton Popov  wrote:

>
>
> On 10/07/2016 05:23 PM, Satish Balay wrote:
>
>> On Fri, 7 Oct 2016, Kong, Fande wrote:
>>
>> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:
>>>
>>> On Fri, 7 Oct 2016, Anton Popov wrote:

 Hi guys,
>
> are there any news about fixing buggy behavior of SuperLU_DIST, exactly
>
 what

> is described here:
>
> http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?

> I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
>
 fine

> with 3.5.4.
>
> Do I still have to stick to maint branch, and what are the chances for
>
 these

> fixes to be included in 3.7.5?
>
 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
 issues with it - its best to debug and figure out the cause.

 This bug is indeed inside of superlu_dist, and we started having this
>>> issue
>>> from PETSc-3.6.x. I think superlu_dist developers should have fixed this
>>> bug. We forgot to update superlu_dist??  This is not a thing users could
>>> debug and fix.
>>>
>>> I have many people in INL suffering from this issue, and they have to
>>> stay
>>> with PETSc-3.5.4 to use superlu_dist.
>>>
>> To verify if the bug is fixed in latest superlu_dist - you can try
>> [assuming you have git - either from petsc-3.7/maint/master]:
>>
>> --download-superlu_dist --download-superlu_dist-commit=origin/maint
>>
>>
>> Satish
>>
>> Hi Satish,
> I did this:
>
> git clone -b maint https://bitbucket.org/petsc/petsc.git petsc
>
> --download-superlu_dist
> --download-superlu_dist-commit=origin/maint (not sure this is needed,
> since I'm already in maint)
>
> The problem is still there.
>
> Cheers,
> Anton
>


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-10 Thread Anton Popov



On 10/07/2016 05:23 PM, Satish Balay wrote:

On Fri, 7 Oct 2016, Kong, Fande wrote:


On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:


On Fri, 7 Oct 2016, Anton Popov wrote:


Hi guys,

are there any news about fixing buggy behavior of SuperLU_DIST, exactly

what

is described here:

http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?

I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works

fine

with 3.5.4.

Do I still have to stick to maint branch, and what are the chances for

these

fixes to be included in 3.7.5?

3.7.4. is off maint branch [as of a week ago]. So if you are seeing
issues with it - its best to debug and figure out the cause.


This bug is indeed inside of superlu_dist, and we started having this issue
from PETSc-3.6.x. I think superlu_dist developers should have fixed this
bug. We forgot to update superlu_dist??  This is not a thing users could
debug and fix.

I have many people in INL suffering from this issue, and they have to stay
with PETSc-3.5.4 to use superlu_dist.

To verify if the bug is fixed in latest superlu_dist - you can try
[assuming you have git - either from petsc-3.7/maint/master]:

--download-superlu_dist --download-superlu_dist-commit=origin/maint


Satish


Hi Satish,
I did this:

git clone -b maint https://bitbucket.org/petsc/petsc.git petsc

--download-superlu_dist
--download-superlu_dist-commit=origin/maint (not sure this is needed, 
since I'm already in maint)


The problem is still there.

Cheers,
Anton


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Barry Smith

   Fande,

If you can reproduce the problem with PETSc 3.7.4 please send us sample 
code that produces it so we can work with Sherry to get it fixed ASAP.

   Barry

> On Oct 7, 2016, at 10:23 AM, Satish Balay  wrote:
> 
> On Fri, 7 Oct 2016, Kong, Fande wrote:
> 
>> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:
>> 
>>> On Fri, 7 Oct 2016, Anton Popov wrote:
>>> 
 Hi guys,
 
 are there any news about fixing buggy behavior of SuperLU_DIST, exactly
>>> what
 is described here:
 
http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
 
 I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
>>> fine
 with 3.5.4.
 
 Do I still have to stick to maint branch, and what are the chances for
>>> these
 fixes to be included in 3.7.5?
>>> 
>>> 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
>>> issues with it - its best to debug and figure out the cause.
>>> 
>> 
>> This bug is indeed inside of superlu_dist, and we started having this issue
>> from PETSc-3.6.x. I think superlu_dist developers should have fixed this
>> bug. We forgot to update superlu_dist??  This is not a thing users could
>> debug and fix.
>> 
>> I have many people in INL suffering from this issue, and they have to stay
>> with PETSc-3.5.4 to use superlu_dist.
> 
> To verify if the bug is fixed in latest superlu_dist - you can try
> [assuming you have git - either from petsc-3.7/maint/master]:
> 
> --download-superlu_dist --download-superlu_dist-commit=origin/maint
> 
> 
> Satish
> 



Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Satish Balay
On Fri, 7 Oct 2016, Kong, Fande wrote:

> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:
> 
> > On Fri, 7 Oct 2016, Anton Popov wrote:
> >
> > > Hi guys,
> > >
> > > are there any news about fixing buggy behavior of SuperLU_DIST, exactly
> > what
> > > is described here:
> > >
> > http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
> > >
> > > I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
> > fine
> > > with 3.5.4.
> > >
> > > Do I still have to stick to maint branch, and what are the chances for
> > these
> > > fixes to be included in 3.7.5?
> >
> > 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
> > issues with it - its best to debug and figure out the cause.
> >
> 
> This bug is indeed inside of superlu_dist, and we started having this issue
> from PETSc-3.6.x. I think superlu_dist developers should have fixed this
> bug. We forgot to update superlu_dist??  This is not a thing users could
> debug and fix.
> 
> I have many people in INL suffering from this issue, and they have to stay
> with PETSc-3.5.4 to use superlu_dist.

To verify if the bug is fixed in latest superlu_dist - you can try
[assuming you have git - either from petsc-3.7/maint/master]:

--download-superlu_dist --download-superlu_dist-commit=origin/maint


Satish
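For completeness, the flags above go on the PETSc configure line; a sketch of the full sequence (the `--download-mpich` option is just one possibility, adjust for your system):

```shell
# Sketch: reconfigure PETSc so it downloads superlu_dist from its maint
# branch rather than using the snapshot bundled with this PETSc release.
cd petsc
./configure --download-mpich \
            --download-superlu_dist \
            --download-superlu_dist-commit=origin/maint
make all
```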



Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Matthew Knepley
On Fri, Oct 7, 2016 at 10:16 AM, Kong, Fande  wrote:

> On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:
>
>> On Fri, 7 Oct 2016, Anton Popov wrote:
>>
>> > Hi guys,
>> >
>> > are there any news about fixing buggy behavior of SuperLU_DIST, exactly
>> what
>> > is described here:
>> >
>> > http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
>> >
>> > I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
>> fine
>> > with 3.5.4.
>> >
>> > Do I still have to stick to maint branch, and what are the chances for
>> these
>> > fixes to be included in 3.7.5?
>>
>> 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
>> issues with it - its best to debug and figure out the cause.
>>
>
> This bug is indeed inside of superlu_dist, and we started having this
> issue from PETSc-3.6.x. I think superlu_dist developers should have fixed
> this bug. We forgot to update superlu_dist??  This is not a thing users
> could debug and fix.
>
> I have many people in INL suffering from this issue, and they have to stay
> with PETSc-3.5.4 to use superlu_dist.
>

Do you have this bug with the latest maint?

  Matt


> Fande
>
>
>
>>
>> Satish
>>
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Kong, Fande
On Fri, Oct 7, 2016 at 9:04 AM, Satish Balay  wrote:

> On Fri, 7 Oct 2016, Anton Popov wrote:
>
> > Hi guys,
> >
> > are there any news about fixing buggy behavior of SuperLU_DIST, exactly
> what
> > is described here:
> >
> > http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
> >
> > I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works
> fine
> > with 3.5.4.
> >
> > Do I still have to stick to maint branch, and what are the chances for
> these
> > fixes to be included in 3.7.5?
>
> 3.7.4. is off maint branch [as of a week ago]. So if you are seeing
> issues with it - its best to debug and figure out the cause.
>

This bug is indeed inside of superlu_dist, and we started having this issue
from PETSc-3.6.x. I think superlu_dist developers should have fixed this
bug. We forgot to update superlu_dist??  This is not a thing users could
debug and fix.

I have many people in INL suffering from this issue, and they have to stay
with PETSc-3.5.4 to use superlu_dist.

Fande



>
> Satish
>


Re: [petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Satish Balay
On Fri, 7 Oct 2016, Anton Popov wrote:

> Hi guys,
> 
> are there any news about fixing buggy behavior of SuperLU_DIST, exactly what
> is described here:
> 
> http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?
> 
> I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works fine
> with 3.5.4.
> 
> Do I still have to stick to maint branch, and what are the chances for these
> fixes to be included in 3.7.5?

3.7.4 is off the maint branch [as of a week ago]. So if you are seeing
issues with it - it's best to debug and figure out the cause.

Satish


[petsc-users] SuperLU_dist issue in 3.7.4

2016-10-07 Thread Anton Popov

Hi guys,

are there any news about fixing buggy behavior of SuperLU_DIST, exactly 
what is described here:


http://lists.mcs.anl.gov/pipermail/petsc-users/2015-August/026802.html ?

I'm using 3.7.4 and still get SEGV in pdgssvx routine. Everything works 
fine with 3.5.4.


Do I still have to stick to maint branch, and what are the chances for 
these fixes to be included in 3.7.5?


Thanks,

Anton