Re: [petsc-users] DMPlex memory problem in scaling test

2019-10-10 Thread Danyang Su via petsc-users

Hi Matt,

My previous test was terminated right after calling subroutine A, as shown below.

>> In Subroutine A

  call DMPlexDistribute(dmda_flow%da,stencil_width,    &
    PETSC_NULL_SF,distributedMesh,ierr)
  CHKERRQ(ierr)

  if (distributedMesh /= PETSC_NULL_DM) then

    call DMDestroy(dmda_flow%da,ierr)
    CHKERRQ(ierr)
    !c set the global mesh as distributed mesh
    dmda_flow%da = distributedMesh

    call DMDestroy(distributedMesh,ierr)
    CHKERRQ(ierr)

    ! If DMDestroy(distributedMesh,ierr) is called here, then everything is
    ! destroyed and there is nothing reported by -malloc_test. However, I then
    ! get an error in the next subroutine:
    !   [0]PETSC ERROR: DMGetCoordinatesLocal() line 5545 in
    !   /home/dsu/Soft/PETSc/petsc-dev/src/dm/interface/dm.c
    !   Object already free: Parameter # 1

 end if
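(For reference, a minimal sketch of the ownership pattern that avoids the double destroy, assuming dmda_flow%da is the only handle the code keeps; names follow the code above:)

  call DMPlexDistribute(dmda_flow%da,stencil_width,    &
    PETSC_NULL_SF,distributedMesh,ierr)
  CHKERRQ(ierr)

  if (distributedMesh /= PETSC_NULL_DM) then
    !c release the serial mesh and keep only the distributed one
    call DMDestroy(dmda_flow%da,ierr)
    CHKERRQ(ierr)
    dmda_flow%da = distributedMesh
    !c do not destroy distributedMesh here: dmda_flow%da and distributedMesh
    !c now refer to the same object, so destroying it makes the later
    !c DMGetCoordinatesLocal() fail with "Object already free"
  end if

  !c ... use dmda_flow%da (DMGetCoordinatesLocal, DMGetCoordinateDM, ...) ...

  !c final cleanup, once, when the DM is no longer needed
  call DMDestroy(dmda_flow%da,ierr)
  CHKERRQ(ierr)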

>> In Subroutine B

  !c get local mesh DM and set coordinates

  call DMGetCoordinatesLocal(dmda_flow%da,gc,ierr)
  CHKERRQ(ierr)

  call DMGetCoordinateDM(dmda_flow%da,cda,ierr)
  CHKERRQ(ierr)

Thanks,

Danyang


On 2019-10-10 6:15 p.m., Matthew Knepley wrote:


On Thu, Oct 10, 2019 at 9:00 PM Danyang Su wrote:



Labels should be destroyed with the DM. Just make a small code
that does nothing but distribute the mesh and end. If you
run with -malloc_test you should see if everything is destroyed
properly.

  Thanks,

    Matt


Attached is the output run with -malloc_test using 2 processors.
It's a big file. How can I quickly check if something is not
properly destroyed?

Everything in the output has not been destroyed. It looks like you did not
destroy the distributed DM.


  Thanks,

    Matt

Thanks,

Danyang

--
What most experimenters take for granted before they begin their 
experiments is infinitely more interesting than any results to which 
their experiments lead.

-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 



Re: [petsc-users] DMPlex memory problem in scaling test

2019-10-10 Thread Matthew Knepley via petsc-users
On Thu, Oct 10, 2019 at 9:00 PM Danyang Su  wrote:

> Labels should be destroyed with the DM. Just make a small code that does
> nothing but distribute the mesh and end. If you
> run with -malloc_test you should see if everything is destroyed properly.
>
>   Thanks,
>
> Matt
>
> Attached is the output run with -malloc_test using 2 processors. It's a big
> file. How can I quickly check if something is not properly destroyed?
>
Everything in the output has not been destroyed. It looks like you did not destroy
the distributed DM.

  Thanks,

Matt

> Thanks,
>
> Danyang
>
-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-users] DMPlex memory problem in scaling test

2019-10-10 Thread Matthew Knepley via petsc-users
On Thu, Oct 10, 2019 at 7:53 PM Danyang Su  wrote:

> On 2019-10-10 4:28 p.m., Matthew Knepley wrote:
>
> On Thu, Oct 10, 2019 at 4:26 PM Danyang Su  wrote:
>
>> Hi All,
>>
>> Your guess is right. The memory problem occurs after
>> DMPlexCreateFromCellList and DMPlexDistribute. The mesh related memory in
>> the master processor is not released after that.
>>
>> The pseudo code I use is
>>
>> if (rank == 0) then !only the master processor reads the mesh file
>>                     !and creates the cell list
>>
>>   !use Petsc_True to create intermediate mesh entities (faces, edges);
>>   !this does not work for prisms in the current 3.8 version.
>>   call DMPlexCreateFromCellList(Petsc_Comm_World,ndim,num_cells, &
>>                                 num_nodes,num_nodes_per_cell,    &
>>                                 Petsc_False,dmplex_cells,ndim,   &
>>                                 dmplex_verts,dmda_flow%da,ierr)
>>   CHKERRQ(ierr)
>>
>> else !slave processors pass zero cells
>>
>>   call DMPlexCreateFromCellList(Petsc_Comm_World,ndim,0,0,       &
>>                                 num_nodes_per_cell,              &
>>                                 Petsc_False,dmplex_cells,ndim,   &
>>                                 dmplex_verts,dmda_flow%da,ierr)
>>   CHKERRQ(ierr)
>>
>> end if
>>
>> call DMPlexDistribute
>>
>> call DMDestroy(dmda_flow%da,ierr)
>> CHKERRQ(ierr)
>>
>> !c set the global mesh as distributed mesh
>> dmda_flow%da = distributedMesh
>>
>>
>> After calling the above functions, the memory usage for the test case
>> (no. points 953,433, nprocs 160) is shown below:
>> rank   0: PETSc memory current 1610.39 MB, maximum 1690.42 MB
>> rank 151: PETSc memory current  105.00 MB, maximum  104.94 MB
>> rank  98: PETSc memory current  106.02 MB, maximum  105.95 MB
>> rank  18: PETSc memory current  106.17 MB, maximum  106.17 MB
>>
>> Is there any function available in the master version that can release
>> this memory?
>>
> DMDestroy() releases this memory, UNLESS you are holding other objects
> that refer to it, like a vector from that DM.
>
> Well, I have some labels set before distribution. After distribution, the
> label values are collected but not destroyed. I will try this to see if it
> makes a big difference.
>
Labels should be destroyed with the DM. Just make a small code that does
nothing but distribute the mesh and end. If you
run with -malloc_test you should see if everything is destroyed properly.

  Thanks,

Matt
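(A minimal check along those lines might look roughly like the sketch below; assumptions: a tiny hard-coded two-triangle mesh stands in for the application's cell list, the umbrella "use petsc" Fortran module is available in this PETSc version, and CHKERRA is used because this is a main program:)

  program test_distribute
#include <petsc/finclude/petsc.h>
  use petsc
  implicit none
  DM             :: dm, dmDist
  PetscInt       :: dim, ncells, nverts, ncorners, overlap
  PetscInt       :: cells(6)
  PetscReal      :: coords(8)
  PetscMPIInt    :: rank
  PetscErrorCode :: ierr

  call PetscInitialize(PETSC_NULL_CHARACTER,ierr)
  if (ierr /= 0) stop 'PetscInitialize failed'
  call MPI_Comm_rank(PETSC_COMM_WORLD,rank,ierr)

  ! two triangles covering the unit square, 0-based connectivity;
  ! only rank 0 passes cells, the other ranks pass zero cells
  dim      = 2
  ncorners = 3
  cells    = (/ 0, 1, 2,   0, 2, 3 /)
  coords   = (/ 0.0, 0.0,  1.0, 0.0,  1.0, 1.0,  0.0, 1.0 /)
  ncells   = 0
  nverts   = 0
  if (rank == 0) then
    ncells = 2
    nverts = 4
  end if

  call DMPlexCreateFromCellList(PETSC_COMM_WORLD,dim,ncells,nverts,      &
                                ncorners,PETSC_FALSE,cells,dim,          &
                                coords,dm,ierr)
  CHKERRA(ierr)

  ! distribute, keep only the distributed DM, destroy everything, and end
  overlap = 0
  call DMPlexDistribute(dm,overlap,PETSC_NULL_SF,dmDist,ierr)
  CHKERRA(ierr)
  if (dmDist /= PETSC_NULL_DM) then
    call DMDestroy(dm,ierr)
    CHKERRA(ierr)
    dm = dmDist
  end if

  call DMDestroy(dm,ierr)
  CHKERRA(ierr)
  call PetscFinalize(ierr)
  end program test_distribute

Running it with, e.g., mpiexec -n 2 ./test_distribute -malloc_test (or -malloc_dump) should then report anything that is left undestroyed.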

> Thanks,
>
> danyang
>
>
>   Thanks,
>
>  Matt
>
>> Thanks,
>>
>> Danyang
>> On 2019-10-10 11:09 a.m., Mark Adams via petsc-users wrote:
>>
>> Now that I think about it, the partitioning and distribution can be done
>> with existing API, I would assume, like is done with matrices.
>>
>> I'm still wondering what the H5 format is. I assume that it is not built
>> for a hardwired number of processes to read in parallel and that the
>> parallel read is somewhat scalable.
>>
>> On Thu, Oct 10, 2019 at 12:13 PM Mark Adams  wrote:
>>
>>> A related question, what is the state of having something like a
>>> distributed  DMPlexCreateFromCellList method, but maybe your H5 efforts
>>> would work. My bone modeling code is old and a pain, but the app's
>>> specialized serial mesh generator could write an H5 file instead of the
>>> current FEAP file. Then your reader, SNES, and a large-deformation
>>> plasticity element in PetscFE could replace my code in the future.
>>>
>>> How does your H5 thing work? Is it basically a flat file (not
>>> partitioned) that is read in parallel by slicing the cell lists, etc.,
>>> using file seek or something equivalent, then reconstructing a local
>>> graph on each processor to give to, say, Parmetis, and then completing
>>> the distribution with this reasonable partitioning? (this is what our
>>> current code does)
>>>
>>> Thanks,
>>> Mark
>>>
>>> On Thu, Oct 10, 2019 at 9:30 AM Dave May via petsc-users <
>>> petsc-users@mcs.anl.gov> wrote:
>>>


 On Thu 10. Oct 2019 at 15:15, Matthew Knepley 
 wrote:

> On Thu, Oct 10, 2019 at 9:10 AM Dave May 
> wrote:
>
>> On Thu 10. Oct 2019 at 15:04, Matthew Knepley 
>> wrote:
>>
>>> On Thu, Oct 10, 2019 at 8:41 AM Dave May 
>>> wrote:
>>>
 On Thu 10. Oct 2019 at 14:34, Matthew Knepley 
 wrote:

> On Thu, Oct 10, 2019 at 8:31 AM Dave May 
> wrote:
>
>> On Thu, 10 Oct 2019 at 13:21, Matthew Knepley via petsc-users <
>> petsc-users@mcs.anl.gov> wrote:
>>
>>> On Wed, Oct 9, 2019 at 5:10 PM Danyang Su via petsc-users <
>>> petsc-users@mcs.anl.gov> wrote:
>>>
 Dear All,

 I have a question regarding 

Re: [petsc-users] DMPlex memory problem in scaling test

2019-10-10 Thread Mark Adams via petsc-users
Now that I think about it, the partitioning and distribution can be done
with existing API, I would assume, like is done with matrices.

I'm still wondering what the H5 format is. I assume that it is not built
for a hardwired number of processes to read in parallel and that the
parallel read is somewhat scalable.

On Thu, Oct 10, 2019 at 12:13 PM Mark Adams  wrote:

> A related question, what is the state of having something like a
> distributed  DMPlexCreateFromCellList method, but maybe your H5 efforts
> would work. My bone modeling code is old and a pain, but the app's
> specialized serial mesh generator could write an H5 file instead of the
> current FEAP file. Then your reader, SNES, and a large-deformation
> plasticity element in PetscFE could replace my code in the future.
>
> How does your H5 thing work? Is it basically a flat file (not partitioned)
> that is read in parallel by slicing the cell lists, etc., using file seek
> or something equivalent, then reconstructing a local graph on each
> processor to give to, say, Parmetis, and then completing the distribution with
> this reasonable partitioning? (this is what our current code does)
>
> Thanks,
> Mark
>
> On Thu, Oct 10, 2019 at 9:30 AM Dave May via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
>
>>
>>
>> On Thu 10. Oct 2019 at 15:15, Matthew Knepley  wrote:
>>
>>> On Thu, Oct 10, 2019 at 9:10 AM Dave May 
>>> wrote:
>>>
 On Thu 10. Oct 2019 at 15:04, Matthew Knepley 
 wrote:

> On Thu, Oct 10, 2019 at 8:41 AM Dave May 
> wrote:
>
>> On Thu 10. Oct 2019 at 14:34, Matthew Knepley 
>> wrote:
>>
>>> On Thu, Oct 10, 2019 at 8:31 AM Dave May 
>>> wrote:
>>>
 On Thu, 10 Oct 2019 at 13:21, Matthew Knepley via petsc-users <
 petsc-users@mcs.anl.gov> wrote:

> On Wed, Oct 9, 2019 at 5:10 PM Danyang Su via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
>
>> Dear All,
>>
>> I have a question regarding the maximum memory usage for the
>> scaling test. My code is written in Fortran with support for both
>> structured grid (DM) and unstructured grid (DMPlex). It looks like memory
>> consumption is much larger when DMPlex is used and finally causes an
>> out_of_memory problem.
>>
>> Below are some tests using both structured grid and unstructured
>> grid. The memory consumption by the code is estimated based on all
>> allocated arrays, and the PETSc memory consumption is estimated based on
>> PetscMemoryGetMaximumUsage.
>>
>> I just wonder why the PETSc memory consumption does not decrease when
>> the number of processors increases. For the structured grid (scenarios 7-9),
>> the memory consumption decreases as the number of processors increases.
>> However, for the unstructured grid case (scenarios 14-16), the memory for
>> the PETSc part remains unchanged. When I run a larger case, the code
>> crashes because it runs out of memory. The same case works on another
>> cluster with 480GB memory per node. Does this make sense?
>>
> We would need a finer breakdown of where memory is being used. I
> did this for a paper:
>
>
> https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/jgrb.50217
>
> If the subdomains are small, the halo sizes can overwhelm the basic storage.
> It looks like the subdomains are big here,
> but things are not totally clear to me. It would be helpful to
> send the output of -log_view for each case since
> PETSc tries to keep track of allocated memory.
>

 Matt - I'd guess that there is a sequential (non-partitioned) mesh
 hanging around in memory.
 Is it possible that he's created the PLEX object which is loaded
 sequentially (stored and retained in memory and never released), and 
 then
 afterwards distributed?
 This can never happen with the DMDA and the table verifies this.
 If his code using the DMDA and DMPLEX is as identical as possible
 (apart from the DM used), then a sequential mesh held in memory seems the
 likely cause.

>>>
>>> Dang it, Dave is always right.
>>>
>>> How to prevent this?
>>>
>>
>> I thought you/Lawrence/Vaclav/others... had developed and provided
>> support  for a parallel DMPLEX load via a suitably defined plex specific 
>> H5
>> mesh file.
>>
>
> We have, but these tests looked like generated meshes.
>

 Great.

 So would a solution to the problem be to have the user modify their
 code in the following way:
 * they move the mesh gen stage into a separate executable which they call
 offline (on a fat node with lots of memory), and dump the appropriate file
 * they change 

Re: [petsc-users] DMPlex memory problem in scaling test

2019-10-10 Thread Mark Adams via petsc-users
A related question, what is the state of having something like a
distributed  DMPlexCreateFromCellList method, but maybe your H5 efforts
would work. My bone modeling code is old and a pain, but the app's
specialized serial mesh generator could write an H5 file instead of the
current FEAP file. Then your reader, SNES, and a large-deformation
plasticity element in PetscFE could replace my code in the future.

How does your H5 thing work? Is it basically a flat file (not partitioned)
that is read in parallel by slicing the cell lists, etc., using file seek
or something equivalent, then reconstructing a local graph on each
processor to give to, say, Parmetis, and then completing the distribution with
this reasonable partitioning? (this is what our current code does)

Thanks,
Mark
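(For reference, a rough sketch of the dump-offline / load-in-the-application workflow being discussed, assuming a PETSc build with HDF5 and that DMView/DMLoad with an HDF5 viewer is the mechanism; 'mesh.h5' and the variable names are illustrative, and whether the load is truly parallel and scalable is exactly the open question in this thread:)

  ! --- offline mesh-generation executable: build the DMPlex, then dump it ---
  PetscViewer    :: viewer
  PetscErrorCode :: ierr
  ! ... dm generated here as usual ...
  call PetscViewerHDF5Open(PETSC_COMM_WORLD,'mesh.h5',FILE_MODE_WRITE,   &
                           viewer,ierr)
  CHKERRQ(ierr)
  call DMView(dm,viewer,ierr)
  CHKERRQ(ierr)
  call PetscViewerDestroy(viewer,ierr)
  CHKERRQ(ierr)

  ! --- application: load the mesh from the file instead of generating it ---
  call DMCreate(PETSC_COMM_WORLD,dm,ierr)
  CHKERRQ(ierr)
  call DMSetType(dm,DMPLEX,ierr)
  CHKERRQ(ierr)
  call PetscViewerHDF5Open(PETSC_COMM_WORLD,'mesh.h5',FILE_MODE_READ,    &
                           viewer,ierr)
  CHKERRQ(ierr)
  call DMLoad(dm,viewer,ierr)
  CHKERRQ(ierr)
  call PetscViewerDestroy(viewer,ierr)
  CHKERRQ(ierr)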

On Thu, Oct 10, 2019 at 9:30 AM Dave May via petsc-users <
petsc-users@mcs.anl.gov> wrote:

>
>
> On Thu 10. Oct 2019 at 15:15, Matthew Knepley  wrote:
>
>> On Thu, Oct 10, 2019 at 9:10 AM Dave May  wrote:
>>
>>> On Thu 10. Oct 2019 at 15:04, Matthew Knepley  wrote:
>>>
 On Thu, Oct 10, 2019 at 8:41 AM Dave May 
 wrote:

> On Thu 10. Oct 2019 at 14:34, Matthew Knepley 
> wrote:
>
>> On Thu, Oct 10, 2019 at 8:31 AM Dave May 
>> wrote:
>>
>>> On Thu, 10 Oct 2019 at 13:21, Matthew Knepley via petsc-users <
>>> petsc-users@mcs.anl.gov> wrote:
>>>
 On Wed, Oct 9, 2019 at 5:10 PM Danyang Su via petsc-users <
 petsc-users@mcs.anl.gov> wrote:

> Dear All,
>
> I have a question regarding the maximum memory usage for the
> scaling test. My code is written in Fortran with support for both
> structured grid (DM) and unstructured grid (DMPlex). It looks like memory
> consumption is much larger when DMPlex is used and finally causes an
> out_of_memory problem.
>
> Below are some tests using both structured grid and unstructured
> grid. The memory consumption by the code is estimated based on all
> allocated arrays, and the PETSc memory consumption is estimated based on
> PetscMemoryGetMaximumUsage.
>
> I just wonder why the PETSc memory consumption does not decrease when
> the number of processors increases. For the structured grid (scenarios 7-9),
> the memory consumption decreases as the number of processors increases.
> However, for the unstructured grid case (scenarios 14-16), the memory for
> the PETSc part remains unchanged. When I run a larger case, the code
> crashes because it runs out of memory. The same case works on another
> cluster with 480GB memory per node. Does this make sense?
>
 We would need a finer breakdown of where memory is being used. I
 did this for a paper:


 https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/jgrb.50217

 If the subdomains are small, the halo sizes can overwhelm the basic storage.
 It looks like the subdomains are big here,
 but things are not totally clear to me. It would be helpful to send
 the output of -log_view for each case since
 PETSc tries to keep track of allocated memory.

>>>
>>> Matt - I'd guess that there is a sequential (non-partitioned) mesh
>>> hanging around in memory.
>>> Is it possible that he's created the PLEX object which is loaded
>>> sequentially (stored and retained in memory and never released), and 
>>> then
>>> afterwards distributed?
>>> This can never happen with the DMDA and the table verifies this.
>>> If his code using the DMDA and DMPLEX is as identical as possible
>>> (apart from the DM used), then a sequential mesh held in memory seems the
>>> likely cause.
>>>
>>
>> Dang it, Dave is always right.
>>
>> How to prevent this?
>>
>
> I thought you/Lawrence/Vaclav/others... had developed and provided
> support  for a parallel DMPLEX load via a suitably defined plex specific 
> H5
> mesh file.
>

 We have, but these tests looked like generated meshes.

>>>
>>> Great.
>>>
>>> So would a solution to the problem be to have the user modify their code
>>> in the following way:
>>> * they move the mesh gen stage into a separate executable which they call
>>> offline (on a fat node with lots of memory), and dump the appropriate file
>>> * they change their existing application to simply load that file in
>>> parallel.
>>>
>>
>> Yes.
>>
>>
>>> If there were examples illustrating how to create the file which can be
>>> loaded in parallel I think it would be very helpful for the user (and many
>>> others)
>>>
>>
>> I think Vaclav is going to add his examples as soon as we fix this
>> parallel interpolation bug. I am praying for time in the latter
>> part of October to do this.
>>
>
>
> Excellent news - thanks for the update and info.
>
> Cheers
> Dave

Re: [petsc-users] DMPlex memory problem in scaling test

2019-10-10 Thread Matthew Knepley via petsc-users
On Thu, Oct 10, 2019 at 8:31 AM Dave May  wrote:

> On Thu, 10 Oct 2019 at 13:21, Matthew Knepley via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
>
>> On Wed, Oct 9, 2019 at 5:10 PM Danyang Su via petsc-users <
>> petsc-users@mcs.anl.gov> wrote:
>>
>>> Dear All,
>>>
>>> I have a question regarding the maximum memory usage for the scaling
>>> test. My code is written in Fortran with support for both structured grid
>>> (DM) and unstructured grid (DMPlex). It looks like memory consumption is
>>> much larger when DMPlex is used and finally causes an out_of_memory problem.
>>>
>>> Below are some tests using both structured grid and unstructured grid.
>>> The memory consumption by the code is estimated based on all allocated
>>> arrays and PETSc memory consumption is estimated based on
>>> PetscMemoryGetMaximumUsage.
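(For reference, a rough sketch of how per-rank numbers like these might be gathered, assuming PETSc's built-in memory logging; the names mem/mmax and the MB conversion are illustrative, and PetscMemoryGetMaximumUsage() may need PetscMemorySetGetMaximumUsage() or -memory_view early in the run, depending on the PETSc version:)

  PetscLogDouble :: mem, mmax
  PetscMPIInt    :: rank
  PetscErrorCode :: ierr

  call MPI_Comm_rank(PETSC_COMM_WORLD,rank,ierr)
  ! resident memory on this rank, returned in bytes
  call PetscMemoryGetCurrentUsage(mem,ierr)
  CHKERRQ(ierr)
  call PetscMemoryGetMaximumUsage(mmax,ierr)
  CHKERRQ(ierr)
  write(*,'(A,I6,A,F10.2,A,F10.2)') 'rank ',rank,                        &
        ' PETSc memory current MB ',mem/1.0d6,' maximum MB ',mmax/1.0d6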
>>>
>>> I just wonder why the PETSc memory consumption does not decrease when the
>>> number of processors increases. For the structured grid (scenarios 7-9), the
>>> memory consumption decreases as the number of processors increases. However,
>>> for the unstructured grid case (scenarios 14-16), the memory for the PETSc
>>> part remains unchanged. When I run a larger case, the code crashes because
>>> it runs out of memory. The same case works on another cluster with 480GB
>>> memory per node. Does this make sense?
>>>
>> We would need a finer breakdown of where memory is being used. I did this
>> for a paper:
>>
>>   https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/jgrb.50217
>>
>> If the subdomains are small, the halo sizes can overwhelm the basic storage. It
>> looks like the subdomains are big here,
>> but things are not totally clear to me. It would be helpful to send the
>> output of -log_view for each case since
>> PETSc tries to keep track of allocated memory.
>>
>
> Matt - I'd guess that there is a sequential (non-partitioned) mesh hanging
> around in memory.
> Is it possible that he's created the PLEX object which is loaded
> sequentially (stored and retained in memory and never released), and then
> afterwards distributed?
> This can never happen with the DMDA and the table verifies this.
> If his code using the DMDA and DMPLEX is as identical as possible (apart from
> the DM used), then a sequential mesh held in memory seems the likely cause.
>

Dang it, Dave is always right.

How to prevent this? Since it looks like you are okay with fairly regular
meshes, I would construct the coarsest mesh you can, and then use

  -dm_refine <k>

which is activated by DMSetFromOptions(). Make sure to call it after
DMPlexDistribute(). It will regularly refine in parallel and should show
good memory scaling, as Dave says.
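(A rough sketch of that pattern is below; the variable names are illustrative and the refinement level is taken from the command line, e.g. -dm_refine 3:)

  DM             :: dm, dmDist
  PetscInt       :: overlap
  PetscErrorCode :: ierr

  ! ... dm created as the small coarse mesh ...

  ! distribute the coarse mesh first
  overlap = 0
  call DMPlexDistribute(dm,overlap,PETSC_NULL_SF,dmDist,ierr)
  CHKERRQ(ierr)
  if (dmDist /= PETSC_NULL_DM) then
    call DMDestroy(dm,ierr)
    CHKERRQ(ierr)
    dm = dmDist
  end if

  ! DMSetFromOptions() picks up -dm_refine <k> (among other options) and
  ! refines the already-distributed mesh in parallel, so the fine mesh
  ! never has to exist on a single rank
  call DMSetFromOptions(dm,ierr)
  CHKERRQ(ierr)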

  Thanks,

 Matt


>
>>   Thanks,
>>
>>  Matt
>>
>>> (All runs use GMRES with a Hypre preconditioner.)
>>>
>>> scenario | no. points | cell type | DMPlex | nprocs    | no. nodes | mem per node GB | Rank 0 memory MB | Rank 0 PETSc memory MB | Runtime (sec)
>>>        1 |       2121 | rectangle | no     |        40 |         1 |             200 |             0.21 |                   41.6 |
>>>        2 |       8241 | rectangle | no     |        40 |         1 |             200 |             0.59 |                  51.84 |
>>>        3 |      32481 | rectangle | no     |        40 |         1 |             200 |             1.95 |                   59.1 |
>>>        4 |     128961 | rectangle | no     |        40 |         1 |             200 |             7.05 |                  89.71 |
>>>        5 |     513921 | rectangle | no     |        40 |         1 |             200 |            26.76 |                 110.58 |
>>>        6 |    2051841 | rectangle | no     |        40 |         1 |             200 |           104.21 |                 232.05 |
>>>        7 |    8199681 | rectangle | no     |        40 |         1 |             200 |           411.26 |                 703.27 |        140.29
>>>        8 |    8199681 | rectangle | no     |        80 |         2 |             200 |            206.6 |                 387.25 |         62.04
>>>        9 |    8199681 | rectangle | no     |       160 |         4 |             200 |           104.28 |                  245.3 |         32.76
>>>       10 |       2121 | triangle  | yes    |        40 |         1 |             200 |             0.49 |                  61.78 |
>>>       11 |      15090 | triangle  | yes    |        40 |         1 |             200 |             2.32 |                  96.61 |
>>>       12 |      59847 | triangle  | yes    |        40 |         1 |             200 |             8.28 |                 176.14 |
>>>       13 |     238568 | triangle  | yes    |        40 |         1 |             200 |            31.89 |                 573.73 |
>>>       14 |     953433 | triangle  | yes    |        40 |         1 |             200 |           119.23 |                2102.54 |         44.11
>>>       15 |     953433 | triangle  | yes    |        80 |         2 |             200 |            72.99 |                 2123.8 |         24.36
>>>       16 |     953433 | triangle  | yes    |       160 |         4 |             200 |            48.65 |                2076.25 |         14.87
>>>       17 |      55770 | prism     | yes    |        40 |         1 |             200 |            18.46 |                 219.39 |
>>>       18 |     749814 | prism     | yes    |        40 |         1 |             200 |           149.86 |                2412.39 |
>>>       19 |        750 | prism     | yes    | 40 to 640 |   1 to 16 |             200 |    out_of_memory |                        |
>>>       20 |        750 | prism     | yes    |        64 |         2 |             480 |           890.92 |               17214.41 |
>>>
>>> The error information of scenario 19 is shown below:
>>>
>>> kernel messages produced during job executions:
>>> [Oct 9 10:41] mpiexec.hydra invoked oom-killer: gfp_mask=0x200da,
>>> order=0, oom_score_adj=0
>>> [  +0.010274] mpiexec.hydra cpuset=/ mems_allowed=0-1
>>> [  +0.006680] CPU: 2