Re: [petsc-users] Performance problem using COO interface

2023-01-17 Thread Zhang, Junchao via petsc-users
Hi, Philip,
  Could you add -log_view and see what functions are used in the solve? Since 
it is CPU-only, perhaps with -log_view of different runs, we can easily see 
which functions slowed down.

--Junchao Zhang

From: Fackler, Philip 
Sent: Tuesday, January 17, 2023 4:13 PM
To: xolotl-psi-developm...@lists.sourceforge.net; petsc-users@mcs.anl.gov
Cc: Mills, Richard Tran; Zhang, Junchao; Blondel, Sophie; Roth, Philip
Subject: Performance problem using COO interface

In Xolotl's feature-petsc-kokkos branch I have ported the code to use petsc's 
COO interface for creating the Jacobian matrix (and the Kokkos interface for 
interacting with Vec entries). As the attached plots show for one case, while 
the code for computing the RHSFunction and RHSJacobian performs similarly (or 
slightly better) after the port, the performance for the solve as a whole is 
significantly worse.

Note:
This is all CPU-only (so kokkos and kokkos-kernels are built with only the 
serial backend).
The dev version is using MatSetValuesStencil with the default implementations 
for Mat and Vec.
The port version is using MatSetValuesCOO and is run with -dm_mat_type 
aijkokkos -dm_vec_type kokkos.
The port/def version is using MatSetValuesCOO and is run with -dm_vec_type 
kokkos (using the default Mat implementation).

So, this seems to be due to a performance difference in the petsc 
implementations. Please advise. Is this a known issue? Or am I missing 
something?
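For reference, the COO path being compared boils down to roughly the sketch below. This is only a sketch: the function and array names are illustrative, and A is assumed to be the DM-created matrix (aij or aijkokkos depending on -dm_mat_type).

#include <petscmat.h>

/* Illustrative sketch of COO assembly: ncoo, coo_i, coo_j, coo_v stand in for
   whatever the application computes; A must already have the right size and type. */
static PetscErrorCode AssembleJacobianCOO(Mat A, PetscCount ncoo, PetscInt coo_i[], PetscInt coo_j[], const PetscScalar coo_v[])
{
  PetscFunctionBeginUser;
  PetscCall(MatSetPreallocationCOO(A, ncoo, coo_i, coo_j)); /* one-time: register the (i,j) pattern */
  PetscCall(MatSetValuesCOO(A, coo_v, INSERT_VALUES));      /* each assembly: values in the same (i,j) order */
  PetscFunctionReturn(0);
}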

Thank you for the help,

Philip Fackler
Research Software Engineer, Application Engineering Group
Advanced Computing Systems Research Section
Computer Science and Mathematics Division
Oak Ridge National Laboratory


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Zhang, Junchao via petsc-users

MPI rank distribution (e.g., 8 ranks per node or 16 ranks per node) is usually 
managed by workload managers like Slurm or PBS through your job scripts, which is 
out of PETSc's control.

From: Amin Sadeghi 
Date: Wednesday, March 25, 2020 at 4:40 PM
To: Junchao Zhang 
Cc: Mark Adams , PETSc users list 
Subject: Re: [petsc-users] Poor speed up for KSP example 45

Junchao, thank you for doing the experiment. I guess TACC Frontera nodes have 
higher memory bandwidth (maybe a more modern CPU architecture, although I'm not 
familiar with which hardware factors affect memory bandwidth) than Compute Canada's 
Graham.

Mark, I did as you suggested. As you suspected, running make streams yielded 
the same results, indicating that the memory bandwidth saturated at around 8 
MPI processes. I ran the experiment on multiple nodes but only requested 8 
cores per node, and here is the result:

1 node (8 cores total): 17.5s, 6X speedup
2 nodes (16 cores total): 13.5s, 7X speedup
3 nodes (24 cores total): 9.4s, 10X speedup
4 nodes (32 cores total): 8.3s, 12X speedup
5 nodes (40 cores total): 7.0s, 14X speedup
6 nodes (48 cores total): 61.4s, 2X speedup [!!!]
7 nodes (56 cores total): 4.3s, 23X speedup
8 nodes (64 cores total): 3.7s, 27X speedup

Note: as you can see, the experiment with 6 nodes showed extremely poor 
scaling, which I guess was an outlier, maybe due to some connection problem?

I also ran another experiment, requesting 2 full nodes, i.e. 64 cores, and 
here's the result:

2 nodes (64 cores total): 6.0s, 16X speedup [32 cores each node]

So, it turns out that given a fixed number of cores, i.e. 64 in our case, much 
better speedups (27X vs. 16X in our case) can be achieved if they are 
distributed among separate nodes.

Anyways, I really appreciate all your inputs.

One final question: from what I understand from Mark's comment, PETSc at the 
moment is blind to the memory hierarchy. Is it feasible to make PETSc aware of the 
inter- and intra-node communication so that partitioning is done to maximize 
performance? Or, to put it differently, is this something that PETSc devs have 
their eyes on for the future?


Sincerely,
Amin


On Wed, Mar 25, 2020 at 3:51 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
I repeated your experiment on one node of TACC Frontera,
1 rank: 85.0s
16 ranks: 8.2s, 10x speedup
32 ranks: 5.7s, 15x speedup

--Junchao Zhang


On Wed, Mar 25, 2020 at 1:18 PM Mark Adams <mfad...@lbl.gov> wrote:
Also, a better test is to see where streams pretty much saturates, then run that 
many processes per node and do the same test while increasing the number of nodes. 
This will tell you how well your network communication is doing.

But this result has a lot of stuff in "network communication" that can be 
further evaluated. The worst thing about this, I would think, is that the 
partitioning is blind to the memory hierarchy of inter- and intra-node 
communication. The next thing to do is run with an initial grid that puts one 
cell per node and then do uniform refinement until you have one cell per 
process (e.g., one refinement step using 8 processes per node), partition to get 
one cell per process, then do uniform refinement to get a reasonably sized 
local problem. Alas, this is not easy to do, but it is doable.

On Wed, Mar 25, 2020 at 2:04 PM Mark Adams <mfad...@lbl.gov> wrote:
I would guess that you are saturating the memory bandwidth. After you make 
PETSc (make all) it will suggest that you test it (make test) and suggest that 
you run streams (make streams).

I see Matt answered, but let me add that when you make streams you will see the 
memory rate for 1, 2, 3, ... NP processes. If your machine is decent you should 
see very good speedup at the beginning and then it will start to saturate. You 
are seeing about 50% of perfect speedup at 16 processes. I would expect that you 
will see something similar with streams. Without knowing your machine, your 
results look typical.

On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi <aminthefr...@gmail.com> wrote:
Hi,

I ran KSP example 45 on a single node with 32 cores and 125GB memory using 1, 
16 and 32 MPI processes. Here's a comparison of the time spent during KSP.solve:

- 1 MPI process: ~98 sec, speedup: 1X
- 16 MPI processes: ~12 sec, speedup: ~8X
- 32 MPI processes: ~11 sec, speedup: ~9X

Since the problem size is large enough (8M unknowns), I expected a speedup much 
closer to 32X, rather than 9X. Is this expected? If yes, how can it be improved?

I've attached three log files for more details.

Sincerely,
Amin


Re: [petsc-users] Choosing VecScatter Method in Matrix-Vector Product

2020-01-27 Thread Zhang, Junchao via petsc-users

--Junchao Zhang


On Mon, Jan 27, 2020 at 10:09 AM Felix Huber <st107...@stud.uni-stuttgart.de> wrote:
Thank you all for you reply!

> Are you using a KSP/PC configuration which should weak scale?
Yes the system is solved with KSPSolve. There is no preconditioner yet,
but I fixed the number of CG iterations to 3 to ensure an apples to
apples comparison during the scaling measurements.

>> VecScatter has been greatly refactored (and the default implementation
>> is entirely new) since 3.7.

I now tried to use PETSc 3.11 and the code runs fine. The communication
seems to show a better weak scaling behavior now.

I'll see if we can just upgrade to 3.11.



> Anyway, I'm curious about your
> configuration and how you determine that MPI_Alltoallv/MPI_Alltoallw is
> being used.
I used the Extrae profiler which intercepts all MPI calls and logs them
into a file. This showed that Alltoall is being used for the
communication, which I found surprising. With PETSc 3.11 the Alltoall
calls are replaced by MPI_Start(all) and MPI_Wait(all), which sounds
more reasonable to me.
> This has never been a default code path, so I suspect
> something in your environment or code making this happen.

I attached some log files for some PETSc 3.7 runs on 1, 19 and 115 nodes
(24 cores each) which suggest polynomial scaling (vs logarithmic
scaling). Could it be some installation setting of the PETSc version? (I
use a preinstalled PETSc.)
I checked petsc 3.7.6 and do not think the vecscatter type could be set at 
configure time. Anyway, upgrading petsc is preferred. If that is not possible, 
we can work together to see what happened.

> Can you please send representative log files which characterize the
> lack of scaling (include the full log_view)?

"Stage 1: activation" is the stage of interest, as it wraps the
KSPSolve. The number of unkowns per rank is very small in the
measurement, so most of the time should be communication. However, I
just noticed, that the stage also contains an additional setup step
which might be the reason why the MatMul takes longer than the KSPSolve.
I can repeat the measurements if necessary.
I should add, that I put a MPI_Barrier before the KSPSolve, to avoid any
previous work imbalance to effect the KSPSolve call.

You can use -log_sync, which adds an MPI_Barrier at the beginning of each 
event. Compare log_view files with and without -log_sync. If an event has a much 
higher %T without -log_sync than with it, it means the code is not 
balanced. Alternatively, you can look at the Ratio column in the log file without 
-log_sync.
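As an aside, one way to keep the extra setup out of the solve numbers is to give the KSPSolve its own -log_view stage; a minimal sketch (the stage and variable names are illustrative) is:

#include <petscksp.h>

/* Sketch: time only the KSPSolve in a dedicated -log_view stage. */
static PetscErrorCode SolveInOwnStage(KSP ksp, Vec b, Vec x)
{
  PetscLogStage  solve_stage;
  PetscErrorCode ierr;

  ierr = PetscLogStageRegister("KSPSolveOnly", &solve_stage);CHKERRQ(ierr);
  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);  /* barrier before the solve, as described above */
  ierr = PetscLogStagePush(solve_stage);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = PetscLogStagePop();CHKERRQ(ierr);
  return 0;
}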

Best regards,
Felix



Re: [petsc-users] DMDA Error

2020-01-24 Thread Zhang, Junchao via petsc-users
fs/home/software/spack-0.10.1/opt/spack/linux-centos7-x86_64/gcc-4.8.5/intel-parallel-studio-cluster.2019.5-zqvneipqa4u52iwlyy5kx4hbsfnspz6g/compilers_and_libraries_2019.5.281/linux/mpi/intel64/libfabric/lib/libfabric.so.1
 (0x2afd30344000)
libXau.so.6 => /lib64/libXau.so.6 (0x2afd3057c000)

--Junchao Zhang


On Tue, Jan 21, 2020 at 2:25 AM Anthony Jourdon <jourdon_anth...@hotmail.fr> wrote:
Hello,

I made a test to try to reproduce the error.
To do so I modified the file $PETSC_DIR/src/dm/examples/tests/ex35.c
I attach the file in case of need.

The same error is reproduced for 1024 MPI ranks. I tested two problem sizes 
(2*512+1 x 2*64+1 x 2*256+1 and 2*1024+1 x 2*128+1 x 2*512+1) and the error occurred 
for both cases; the first case is also the one I used to run before the OS and MPI 
updates.
I also ran the code with -malloc_debug and nothing more appeared.

I attached the configure command I used to build a debug version of petsc.

Thank you for your time,
Sincerely,
Anthony Jourdon



From: Zhang, Junchao <jczh...@mcs.anl.gov>
Sent: Thursday, January 16, 2020, 16:49
To: Anthony Jourdon <jourdon_anth...@hotmail.fr>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] DMDA Error

It seems the problem is triggered by DMSetUp. You can write a small test 
creating the DMDA with the same size as your code, to see if you can reproduce 
the problem. If yes, it would be much easier for us to debug it.
--Junchao Zhang
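A standalone test along those lines might look roughly like the following sketch. The grid sizes are taken from the report in this thread; the boundary types, stencil type/width and dof are assumptions that should be matched to the real application.

#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM             da;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  /* 2*512+1 x 2*64+1 x 2*256+1 = 1025 x 129 x 513; dof = 1 and stencil width = 2 are guesses */
  ierr = DMDACreate3d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_BOX, 1025, 129, 513,
                      PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                      1, 2, NULL, NULL, NULL, &da);CHKERRQ(ierr);
  ierr = DMSetFromOptions(da);CHKERRQ(ierr);
  ierr = DMSetUp(da);CHKERRQ(ierr);  /* the reported failure happens inside DMSetUp */
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}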


On Thu, Jan 16, 2020 at 7:38 AM Anthony Jourdon <jourdon_anth...@hotmail.fr> wrote:

Dear Petsc developer,


I need assistance with an error.


I run a code that uses the DMDA related functions. I'm using petsc-3.8.4.


This code used to run very well on a supercomputer with the OS SLES11.

Petsc was built using an Intel MPI 5.1.3.223 module and Intel MKL version 
2016.0.2.181.

The code was running with no problem on 1024 and more MPI ranks.

Recently, the OS of the computer has been updated to RHEL7.

I rebuilt Petsc using the newly available versions of Intel MPI (2019U5) and MKL 
(2019.0.5.281), which are the same versions used for the compilers and MKL.

Since then I have tested the exact same code on 8, 16, 24, 48, 512 and 1024 
MPI ranks.

Below 1024 MPI ranks there is no problem, but at 1024 an error related to DMDA 
appears. I include the first lines of the error stack here; the full error 
stack is attached.


[534]PETSC ERROR: #1 PetscGatherMessageLengths() line 120 in 
/scratch2/dlp/appli_local/SCR/OROGEN/petsc3.8.4_MPI/petsc-3.8.4/src/sys/utils/mpimesg.c

[534]PETSC ERROR: #2 VecScatterCreate_PtoS() line 2288 in 
/scratch2/dlp/appli_local/SCR/OROGEN/petsc3.8.4_MPI/petsc-3.8.4/src/vec/vec/utils/vpscat.c

[534]PETSC ERROR: #3 VecScatterCreate() line 1462 in 
/scratch2/dlp/appli_local/SCR/OROGEN/petsc3.8.4_MPI/petsc-3.8.4/src/vec/vec/utils/vscat.c

[534]PETSC ERROR: #4 DMSetUp_DA_3D() line 1042 in 
/scratch2/dlp/appli_local/SCR/OROGEN/petsc3.8.4_MPI/petsc-3.8.4/src/dm/impls/da/da3.c

[534]PETSC ERROR: #5 DMSetUp_DA() line 25 in 
/scratch2/dlp/appli_local/SCR/OROGEN/petsc3.8.4_MPI/petsc-3.8.4/src/dm/impls/da/dareg.c

[534]PETSC ERROR: #6 DMSetUp() line 720 in 
/scratch2/dlp/appli_local/SCR/OROGEN/petsc3.8.4_MPI/petsc-3.8.4/src/dm/interface/dm.c



Thank you for your time,

Sincerely,


Anthony Jourdon


Re: [petsc-users] DMDA Error

2020-01-21 Thread Zhang, Junchao via petsc-users
I submitted a job and I am waiting for the result.
--Junchao Zhang


On Tue, Jan 21, 2020 at 3:03 AM Dave May <dave.mayhe...@gmail.com> wrote:
Hi Anthony,

On Tue, 21 Jan 2020 at 08:25, Anthony Jourdon <jourdon_anth...@hotmail.fr> wrote:
Hello,

I made a test to try to reproduce the error.
To do so I modified the file $PETSC_DIR/src/dm/examples/tests/ex35.c
I attach the file in case of need.

The same error is reproduced for 1024 MPI ranks. I tested two problem sizes 
(2*512+1 x 2*64+1 x 2*256+1 and 2*1024+1 x 2*128+1 x 2*512+1) and the error occurred 
for both cases; the first case is also the one I used to run before the OS and MPI 
updates.
I also ran the code with -malloc_debug and nothing more appeared.

I attached the configure command I used to build a debug version of petsc.

The error indicates the problem occurs on the bold line below (e.g. within 
MPI_Isend())


  /* Post the Isends with the message length-info */

  for (i=0,j=0; i<...

From: Zhang, Junchao <jczh...@mcs.anl.gov>
Sent: Thursday, January 16, 2020, 16:49
To: Anthony Jourdon <jourdon_anth...@hotmail.fr>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] DMDA Error

It seems the problem is triggered by DMSetUp. You can write a small test 
creating the DMDA with the same size as your code, to see if you can reproduce 
the problem. If yes, it would be much easier for us to debug it.
--Junchao Zhang




Re: [petsc-users] DMDA Error

2020-01-16 Thread Zhang, Junchao via petsc-users
It seems the problem is triggered by DMSetUp. You can write a small test 
creating the DMDA with the same size as your code, to see if you can reproduce 
the problem. If yes, it would be much easier for us to debug it.
--Junchao Zhang




Re: [petsc-users] error related to nested vector

2020-01-14 Thread Zhang, Junchao via petsc-users
Do you have a test example?
--Junchao Zhang

On Tue, Jan 14, 2020 at 4:44 AM Y. Shidi <ys...@cam.ac.uk> wrote:
Dear developers,

I have a 2x2 nested matrix and the corresponding nested vector.
When I run the code with field splitting, I get the following
errors:

[0]PETSC ERROR: PetscTrFreeDefault() called from VecRestoreArray_Nest()
line 678 in /home/ys453/Sources/petsc/src/vec/vec/impls/nest/vecnest.c
[0]PETSC ERROR: Block at address 0x3f95f60 is corrupted; cannot free;
may be block not allocated with PetscMalloc()
[0]PETSC ERROR: - Error Message
--
[0]PETSC ERROR: Memory corruption:
http://www.mcs.anl.gov/petsc/documentation/installation.html#valgrind
[0]PETSC ERROR: Bad location or corrupted memory
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html
for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.9.3, unknown
[0]PETSC ERROR: 2DPetscSpuriousTest on a arch-linux2-c-debug named
merlin by ys453 Tue Jan 14 10:36:53 2020
[0]PETSC ERROR: Configure options --download-scalapack --download-mumps
--download-parmetis --download-metis --download-ptscotch
--download-superlu_dist --download-hypre
[0]PETSC ERROR: #1 PetscTrFreeDefault() line 269 in
/home/ys453/Sources/petsc/src/sys/memory/mtr.c
[0]PETSC ERROR: #2 VecRestoreArray_Nest() line 678 in
/home/ys453/Sources/petsc/src/vec/vec/impls/nest/vecnest.c
[0]PETSC ERROR: #3 VecRestoreArrayRead() line 1835 in
/home/ys453/Sources/petsc/src/vec/vec/interface/rvector.c
[0]PETSC ERROR: #4 VecRestoreArrayPair() line 511 in
/home/ys453/Sources/petsc/include/petscvec.h
[0]PETSC ERROR: #5 VecScatterBegin_SSToSS() line 671 in
/home/ys453/Sources/petsc/src/vec/vscat/impls/vscat.c
[0]PETSC ERROR: #6 VecScatterBegin() line 1779 in
/home/ys453/Sources/petsc/src/vec/vscat/impls/vscat.c
[0]PETSC ERROR: #7 PCApply_FieldSplit() line 1010 in
/home/ys453/Sources/petsc/src/ksp/pc/impls/fieldsplit/fieldsplit.c
[0]PETSC ERROR: #8 PCApply() line 457 in
/home/ys453/Sources/petsc/src/ksp/pc/interface/precon.c
[0]PETSC ERROR: #9 KSP_PCApply() line 276 in
/home/ys453/Sources/petsc/include/petsc/private/kspimpl.h
[0]PETSC ERROR: #10 KSPFGMRESCycle() line 166 in
/home/ys453/Sources/petsc/src/ksp/ksp/impls/gmres/fgmres/fgmres.c
[0]PETSC ERROR: #11 KSPSolve_FGMRES() line 291 in
/home/ys453/Sources/petsc/src/ksp/ksp/impls/gmres/fgmres/fgmres.c
[0]PETSC ERROR: #12 KSPSolve() line 669 in
/home/ys453/Sources/petsc/src/ksp/ksp/interface/itfunc.c

I am not sure why it happens.

Thank you for your time.

Kind Regards,
Shidi


Re: [petsc-users] PetscOptionsGetBool error

2020-01-08 Thread Zhang, Junchao
A deprecated option won't cause a segfault. From 
https://www.mcs.anl.gov/petsc/petsc-current/src/dm/label/examples/tutorials/ex1f90.F90.html,
it seems you missed the first PETSC_NULL_OPTIONS argument.

--Junchao Zhang
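For comparison, here is a minimal C sketch of the corresponding call; the Fortran interface takes the same leading options-database argument, which is what the missing PETSC_NULL_OPTIONS above refers to.

#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscBool      flg_mumps_lu = PETSC_TRUE, set = PETSC_FALSE;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  /* first argument NULL = the global options database (PETSC_NULL_OPTIONS in Fortran) */
  ierr = PetscOptionsGetBool(NULL, NULL, "-use_mumps_lu", &flg_mumps_lu, &set);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "use_mumps_lu=%d set=%d\n", (int)flg_mumps_lu, (int)set);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}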


On Wed, Jan 8, 2020 at 4:02 PM Anthony Paul Haas <a...@email.arizona.edu> wrote:
Hello,

I am using Petsc 3.7.6.0 with Fortran code and I am getting a segmentation 
violation on the following line:

call 
PetscOptionsGetBool(PETSC_NULL_CHARACTER,"-use_mumps_lu",flg_mumps_lu,flg,self%ierr_ps)

in which:
flg_mumps_lu and flg are defined as PetscBool and
flg_mumps_lu = PETSC_TRUE

Is the option -use_mumps_lu deprecated?

Thanks,

Anthony



Re: [petsc-users] VecDuplicate for FFTW-Vec causes VecDestroy to fail conditionally on VecLoad

2019-11-05 Thread Zhang, Junchao via petsc-users
Fixed in https://gitlab.com/petsc/petsc/merge_requests/2262
--Junchao Zhang


On Fri, Nov 1, 2019 at 6:51 PM Sajid Ali <sajidsyed2...@u.northwestern.edu> wrote:
Hi Junchao/Barry,

It doesn't really matter what the h5 file contains, so I'm attaching a lightly 
edited script of src/vec/vec/examples/tutorials/ex10.c which should produce a 
vector to be used as input for the above test case. (I'm working with 
--with-scalar-type=complex.)

Now that I think of it, fixing this bug is not important, I can workaround the 
issue by creating a new vector with VecCreateMPI and accept the small loss in 
performance of VecPointwiseMult due to misaligned layouts. If it's a small fix 
it may be worth the time, but fixing this is not a big priority right now. If 
it's a complicated fix, this issue can serve as a note to future users.


Thank You,
Sajid Ali
Applied Physics
Northwestern University
s-sajid-ali.github.io


Re: [petsc-users] VecDuplicate for FFTW-Vec causes VecDestroy to fail conditionally on VecLoad

2019-11-01 Thread Zhang, Junchao via petsc-users
I know nothing about Vec FFTW, but if you can provide hdf5 files in your test, 
I will see if I can reproduce it.
--Junchao Zhang


On Fri, Nov 1, 2019 at 2:08 PM Sajid Ali via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hi PETSc-developers,

I'm unable to debug a crash with VecDestroy that seems to depend only on 
whether or not a VecLoad was performed on a vector that was generated by 
duplicating one generated by MatCreateVecsFFTW.

I'm attaching two examples ex1.c and ex2.c. The first one just creates vectors 
aligned as per FFTW layout, duplicates one of them and destroys all at the end. 
A bug related to this was fixed sometime between the 3.11 release and 3.12 
release. I've tested this code with the versions 3.11.1 and 3.12.1 and as 
expected it runs with no issues for 3.12.1 and fails with 3.11.1.

Now, the second one just adds a few lines which load a vector from memory to 
the duplicated vector before destroying all. For some reason, this code fails 
for both 3.11.1 and 3.12.1 versions. I'm lost as to what may cause this error 
and would appreciate any help in how to debug this. Thanks in advance for the 
help!

PS: I've attached the two codes, ex1.c/ex2.c, the log files for both make and 
run and finally a bash script that was run to compile/log and control the 
version of petsc used.


--
Sajid Ali
Applied Physics
Northwestern University
s-sajid-ali.github.io


Re: [petsc-users] Errors with ParMETIS

2019-10-18 Thread Zhang, Junchao via petsc-users
Usually this is due to uninitialized variables. You can try valgrind. Read the 
tutorial from page 3 of https://www.mcs.anl.gov/petsc/petsc-20/tutorial/PETSc1.pdf
--Junchao Zhang


On Fri, Oct 18, 2019 at 6:23 AM Shidi Yan via petsc-users <petsc-users@mcs.anl.gov> wrote:
Dear developers,

I am using ParMETIS to do dynamic load balancing for the mesh.
If my code is compiled with optimiser options (e.g., -O2, -O3), I get
the following errors when the code calls functions from ParMETIS:

***ASSERTION failed on line 176 of file 
externalpackages/git.parmetis/libparmetis/comm.c: j == nnbrs

externalpackages/git.parmetis/libparmetis/comm.c:176: libparmetis__CommSetup: 
Assertion `j == nnbrs' failed.

However, if the code is compiled in debugging mode (-g), I do
not have any errors.

I am wondering whether it is a bug on my part.

Thank you very much for your time.

Kind Regards,
Shidi


Re: [petsc-users] CUDA-Aware MPI & PETSc

2019-10-07 Thread Zhang, Junchao via petsc-users
Hello, David,
   It took a longer time than I expected to add the CUDA-aware MPI feature in 
PETSc. It is now in PETSc-3.12, released last week. I have a little fix after 
that, so you better use petsc master.  Use petsc option -use_gpu_aware_mpi to 
enable it. On Summit, you also need jsrun --smpiargs="-gpu" to enable IBM 
Spectrum MPI's CUDA support. If you run with multiple MPI ranks per GPU, you 
also need #BSUB -alloc_flags gpumps in your job script.
  My experiments (using a simple test doing repeated MatMult) on Summit are 
mixed. With one MPI rank per GPU, I saw very good performance improvement (up 
to 25%). But with multiple ranks per GPU, I did not see improvement. That 
sounds absurd since it should be easier for MPI ranks to communicate data on the 
same GPU. I'm investigating this issue.
  If you can also evaluate this feature with your production code, that would 
be helpful.
  Thanks.
--Junchao Zhang


On Thu, Aug 22, 2019 at 11:34 AM David Gutzwiller <david.gutzwil...@gmail.com> wrote:
Hello Junchao,

Spectacular news!

I have our production code running on Summit (Power9 + Nvidia V100) and on 
local x86 workstations, and I can definitely provide comparative benchmark data 
with this feature once it is ready.  Just let me know when it is available for 
testing and I'll be happy to contribute.

Thanks,

-David


On Thu, Aug 22, 2019 at 7:22 AM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
This feature is under active development. I hope I can make it usable in a 
couple of weeks. Thanks.
--Junchao Zhang


On Wed, Aug 21, 2019 at 3:21 PM David Gutzwiller via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hello,

I'm currently using PETSc for the GPU acceleration of a simple Krylov solver with 
GMRES, without preconditioning.   This is within the framework of our in-house 
multigrid solver.  I am getting a good GPU speedup on the finest grid level but 
progressively worse performance on each coarse level.   This is not surprising, 
but I still hope to squeeze out some more performance, hopefully making it 
worthwhile to run some or all of the coarse grids on the GPU.

I started investigating with nvprof / nsight and essentially came to the same 
conclusion that Xiangdong reported in a recent thread (July 16, "MemCpy (HtoD 
and DtoH) in Krylov solver").  My question is a follow-up to that thread:

The MPI communication is staged from the host, which results in some H<->D 
transfers for every mat-vec operation.   A CUDA-aware MPI implementation might 
avoid these transfers for communication between ranks that are assigned to the 
same accelerator.   Has this been implemented or tested?

In our solver we typically run with multiple MPI ranks all assigned to a single 
device, and running with a single rank is not really feasible as we still have 
a sizable amount of work for the CPU to chew through.  Thus, I think quite a 
lot of the H<->D transfers could be avoided if I can skip the MPI staging on 
the host. I am quite new to PETSc so I wanted to ask around before blindly 
digging into this.

Thanks for your help,

David



Re: [petsc-users] [petsc-maint] petsc ksp solver hangs

2019-09-28 Thread Zhang, Junchao via petsc-users
Does it hang with 2 or 4 processes? Which PETSc version do you use (using the 
latest is easier for us to debug)? Did you configure PETSc with 
--with-debugging=yes COPTFLAGS="-O0 -g" CXXOPTFLAGS="-O0 -g"?
After attaching gdb to one process, you can use bt to see its stack trace.

--Junchao Zhang


On Sat, Sep 28, 2019 at 5:33 AM Michael Wick <michael.wick.1...@gmail.com> wrote:
I attached a debugger to my run. The code just hangs without throwing an error 
message, interestingly. It uses 72 processors. I turned on the KSP monitor, and 
I can see it hangs either at the beginning or the end of a KSP iteration. I also 
used valgrind to debug my code on my local machine, which does not detect any 
issue. I use fgmres + fieldsplit, which is really a standard option.

Do you have any suggestions?

On Fri, Sep 27, 2019 at 8:17 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
How many MPI ranks did you use? If it is run on your desktop, you can just 
attach a debugger to an MPI process to see what is going on.

--Junchao Zhang


On Fri, Sep 27, 2019 at 4:24 PM Michael Wick via petsc-maint <petsc-ma...@mcs.anl.gov> wrote:
Hi PETSc:

I have been experiencing code stagnation at certain KSP iterations. This 
happens rather randomly, which means the code may stop in the middle of a KSP 
solve and hang there.

I have used valgrind and detect nothing. I just wonder if you have any 
suggestions.

Thanks!!!
M


Re: [petsc-users] Clarification of INSERT_VALUES for vec with ghost nodes

2019-09-26 Thread Zhang, Junchao via petsc-users
With VecGhostUpdateBegin(v, INSERT_VALUES, SCATTER_REVERSE), the owner will get 
updated by ghost values. So in your case 1, proc0 gets either value1 or value2 
from proc1/2; in case 2, proc0 gets either value0 or value2 from proc1/2.
In short, you could not achieve your goal with INSERT_VALUES. Though you can do 
it with other interfaces in PETSc, e.g., PetscSFReduceBegin/End, I believe it 
is better to extend VecGhostUpdate to support MAX/MIN_VALUES, because it is a 
simpler interface for you and it is very easy to add.

Could you try branch jczhang/feature-vscat-min-values to see if it works for 
you?  See the end of src/vec/vec/examples/tutorials/ex9.c for an example of the 
new functionality. Use mpirun -n 2 ./ex9 -minvalues to test it; its expected 
output is output/ex9_2.out.
Petsc will have a new release this weekend. Let's see whether I can put it in 
the new release.

Thanks.
--Junchao Zhang
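In code, the intended usage would look roughly like the sketch below; MIN_VALUES here is the insert mode added in that branch, so treat it as tentative API, and the helper name is illustrative.

#include <petscvec.h>

/* Sketch: every process that sees the ghost node first writes its candidate value
   into its local (owned or ghost) slot; the owner then takes the minimum and
   broadcasts it back to all ghost copies. */
static PetscErrorCode GhostMin(Vec v)
{
  PetscErrorCode ierr;

  ierr = VecGhostUpdateBegin(v, MIN_VALUES, SCATTER_REVERSE);CHKERRQ(ierr);    /* owner = min(owner, ghosts) */
  ierr = VecGhostUpdateEnd(v, MIN_VALUES, SCATTER_REVERSE);CHKERRQ(ierr);
  ierr = VecGhostUpdateBegin(v, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr); /* push the min back to every ghost */
  ierr = VecGhostUpdateEnd(v, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  return 0;
}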


On Thu, Sep 26, 2019 at 3:28 AM Aulisa, Eugenio <eugenio.aul...@ttu.edu> wrote:




On Wed, Sep 25, 2019 at 9:11 AM Aulisa, Eugenio via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hi,

I have a vector with ghost nodes where each process may or may not change the 
value of a specific ghost node  (using INSERT_VALUES).

At the end I would like each process that sees a particular ghost node to 
have the smallest of the set values.
Do you mean the owner of a ghost node gets the smallest value? That is, in your 
example below, proc0 gets Min(value0, value1, value2)?
If I can get the Min(value0, value1, value2) on the owner then I can scatter it 
forward with INSERT_VALUES to all processes that ghost it. And if there is an 
easy way to get Min(value0, value1, value2) on the owner (or on all processes) 
I would like to know.

Since I do not think there is a straightforward way to achieve that, I was 
looking at a workaround, and to do that I need to know the behavior of scatter 
reverse in the cases described below. Notice that I used the option 
INSERT_VALUES which I am not even sure is allowed.

I do not think there is a straightforward way to achieve this, but I would like 
to be wrong.

Any suggestion?



To build a workaround I need to understand better the behavior of 
VecGhostUpdateBegin(...);  VecGhostUpdateEnd(...).

In particular in the documentation I do not see the option

VecGhostUpdateBegin(v, INSERT_VALUES, SCATTER_REVERSE);
VecGhostUpdateEnd(v, INSERT_VALUES, SCATTER_REVERSE);

In case this is possible to be used, what is the behavior of this call in the 
following two cases?

1) Assume that node-i belongs to proc0, and is ghosted in proc1 and proc2, also
assume that the current value of node-i is value0 and proc0 does not modify it, 
but proc1 and proc2 do.

start with:
proc0 -> value0
proc1 -> value0
proc2 -> value0

change to:
proc0 -> value0
proc1 -> value1
proc2 -> value2

I assume that calling
VecGhostUpdateBegin(v, INSERT_VALUES, SCATTER_REVERSE);
VecGhostUpdateEnd(v, INSERT_VALUES, SCATTER_REVERSE);
will have an unpredictable behavior as

proc0 -> either value1 or value2
proc1 -> value1
proc2 -> value2

2) Assume now that node-i belongs to proc0, and is ghosted in proc1 and proc2, 
also
assume that the current value of node-i is value0 and proc0 and proc1 do not 
modify it, but proc2 does.

start with:
proc0 -> value0
proc1 -> value0
proc2 -> value0

change to:
proc0 -> value0
proc1 -> value0
proc2 -> value2

Is the call
VecGhostUpdateBegin(v, INSERT_VALUES, SCATTER_REVERSE);
VecGhostUpdateEnd(v, INSERT_VALUES, SCATTER_REVERSE);
still unpredictable?

proc0 -> either value0 or value2
proc1 -> value0
proc2 -> value2

or

proc0 -> value2  (since proc1 did not modify the original value, so it did not 
reverse scatter)
proc1 -> value0
proc2 -> value2

Thanks a lot for your help
Eugenio











Re: [petsc-users] Clarification of INSERT_VALUES for vec with ghost nodes

2019-09-25 Thread Zhang, Junchao via petsc-users

On Wed, Sep 25, 2019 at 9:11 AM Aulisa, Eugenio via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hi,

I have a vector with ghost nodes where each process may or may not change the 
value of a specific ghost node  (using INSERT_VALUES).

At the end I would like each process that sees a particular ghost node to 
have the smallest of the set values.
Do you mean the owner of a ghost node gets the smallest value? That is, in your 
example below, proc0 gets Min(value0, value1, value2)?


I do not think there is a straightforward way to achieve this, but I would like 
to be wrong.

Any suggestion?



To build a workaround I need to understand better the behavior of 
VecGhostUpdateBegin(...);  VecGhostUpdateEnd(...).

In particular in the documentation I do not see the option

VecGhostUpdateBegin(v, INSERT_VALUES, SCATTER_REVERSE);
VecGhostUpdateEnd(v, INSERT_VALUES, SCATTER_REVERSE);

In case this is possible to be used, what is the behavior of this call in the 
following two cases?

1) Assume that node-i belongs to proc0, and is ghosted in proc1 and proc2, also
assume that the current value of node-i is value0 and proc0 does not modify it, 
but proc1 and proc2 do.

start with:
proc0 -> value0
proc1 -> value0
proc2 -> value0

change to:
proc0 -> value0
proc1 -> value1
proc2 -> value2

I assume that calling
VecGhostUpdateBegin(v, INSERT_VALUES, SCATTER_REVERSE);
VecGhostUpdateEnd(v, INSERT_VALUES, SCATTER_REVERSE);
will have an unpredictable behavior as

proc0 -> either value1 or value2
proc1 -> value1
proc2 -> value2

2) Assume now that node-i belongs to proc0, and is ghosted in proc1 and proc2, 
also
assume that the current value of node-i is value0 and proc0 and proc1 do not 
modify it, but proc2 does.

start with:
proc0 -> value0
proc1 -> value0
proc2 -> value0

change to:
proc0 -> value0
proc1 -> value0
proc2 -> value2

Is the call
VecGhostUpdateBegin(v, INSERT_VALUES, SCATTER_REVERSE);
VecGhostUpdateEnd(v, INSERT_VALUES, SCATTER_REVERSE);
still unpredictable?

proc0 -> either value0 or value2
proc1 -> value0
proc2 -> value2

or

proc0 -> value2  (since proc1 did not modify the original value, so it did not 
reverse scatter)
proc1 -> value0
proc2 -> value2

Thanks a lot for your help
Eugenio









Re: [petsc-users] VecAssembly gets stuck

2019-09-13 Thread Zhang, Junchao via petsc-users
When processes get stuck, you can attach gdb to one process and backtrace its 
call stack to see what it is doing, so we can have a better understanding.

--Junchao Zhang


On Fri, Sep 13, 2019 at 11:31 AM José Lorenzo via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hello,

I am solving a finite element problem with Dirichlet boundary conditions using 
PETSc. In the boundary conditions there are two terms: a first one that is 
known beforehand (normally zero) and a second term that depends linearly on 
the unknown variable itself in the whole domain. Therefore, at every time step 
I need to iterate, as the boundary condition depends on the field and the latter 
depends on the BC. Moreover, the problem is nonlinear and I use a ghosted 
vector to represent the field.

Every processor manages a portion of the domain and a portion of the boundary 
(if not interior). At every Newton iteration within the time loop the way I set 
the boundary conditions is as follows:

First, each processor computes the known term of the BC (first term) and 
inserts the values into the vector

call VecSetValues(H, nedge_own, edglocglo(diredg_loc) - 1, Hdir, INSERT_VALUES, 
ierr)
call VecAssemblyBegin(H, ierr)
call VecAssemblyEnd(H, ierr)

As far as I understand, at this stage VecAssembly will not need to communicate 
with other processors, as each processor only sets values for components that 
belong to it.

Then, each processor computes its own contribution to the field-dependent term 
of the BC for the whole domain boundary as

call VecSetValues(H, nedge_all, edgappglo(diredg_app) - 1, Hself, ADD_VALUES, 
ierr)
call VecAssemblyBegin(H, ierr)
call VecAssemblyEnd(H, ierr)

In this case communication will be needed as each processor will add values to 
vector components that are not stored by it, and I guess it might get very busy 
as all the processors will need to communicate with each other.

When using this strategy I don't find any issue for problems using a small 
number of processors, but recently I've been solving with 90 processors and 
the simulation always hangs at the second VecSetValues at some random time 
step. It works fine for some time steps but at some point it just gets stuck 
and I have to cancel the simulation.

I have managed to overcome this by making each processor contribute to its own 
components using first MPI_Reduce and then doing

call VecSetValues(H, nedge_own, edgappglo(diredg_app_loc), Hself_own, 
ADD_VALUES, ierr)
call VecAssemblyBegin(H, ierr)
call VecAssemblyEnd(H, ierr)

However I would like to understand whether there is something wrong in the code 
above.

Thank you.



Re: [petsc-users] CUDA-Aware MPI & PETSc

2019-08-22 Thread Zhang, Junchao via petsc-users
Definitely I will do. Thanks.
--Junchao Zhang


On Thu, Aug 22, 2019 at 11:34 AM David Gutzwiller <david.gutzwil...@gmail.com> wrote:
Hello Junchao,

Spectacular news!

I have our production code running on Summit (Power9 + Nvidia V100) and on 
local x86 workstations, and I can definitely provide comparative benchmark data 
with this feature once it is ready.  Just let me know when it is available for 
testing and I'll be happy to contribute.

Thanks,

-David


On Thu, Aug 22, 2019 at 7:22 AM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
This feature is under active development. I hope I can make it usable in a 
couple of weeks. Thanks.
--Junchao Zhang


On Wed, Aug 21, 2019 at 3:21 PM David Gutzwiller via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hello,

I'm currently using PETSc for the GPU acceleration of a simple Krylov solver with 
GMRES, without preconditioning.   This is within the framework of our in-house 
multigrid solver.  I am getting a good GPU speedup on the finest grid level but 
progressively worse performance on each coarse level.   This is not surprising, 
but I still hope to squeeze out some more performance, hopefully making it 
worthwhile to run some or all of the coarse grids on the GPU.

I started investigating with nvprof / nsight and essentially came to the same 
conclusion that Xiangdong reported in a recent thread (July 16, "MemCpy (HtoD 
and DtoH) in Krylov solver").  My question is a follow-up to that thread:

The MPI communication is staged from the host, which results in some H<->D 
transfers for every mat-vec operation.   A CUDA-aware MPI implementation might 
avoid these transfers for communication between ranks that are assigned to the 
same accelerator.   Has this been implemented or tested?

In our solver we typically run with multiple MPI ranks all assigned to a single 
device, and running with a single rank is not really feasible as we still have 
a sizable amount of work for the CPU to chew through.  Thus, I think quite a 
lot of the H<->D transfers could be avoided if I can skip the MPI staging on 
the host. I am quite new to PETSc so I wanted to ask around before blindly 
digging into this.

Thanks for your help,

David



Re: [petsc-users] CUDA-Aware MPI & PETSc

2019-08-22 Thread Zhang, Junchao via petsc-users
This feature is under active development. I hope I can make it usable in a 
couple of weeks. Thanks.
--Junchao Zhang


On Wed, Aug 21, 2019 at 3:21 PM David Gutzwiller via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hello,

I'm currently using PETSc for the GPU acceleration of a simple Krylov solver with 
GMRES, without preconditioning.   This is within the framework of our in-house 
multigrid solver.  I am getting a good GPU speedup on the finest grid level but 
progressively worse performance on each coarse level.   This is not surprising, 
but I still hope to squeeze out some more performance, hopefully making it 
worthwhile to run some or all of the coarse grids on the GPU.

I started investigating with nvprof / nsight and essentially came to the same 
conclusion that Xiangdong reported in a recent thread (July 16, "MemCpy (HtoD 
and DtoH) in Krylov solver").  My question is a follow-up to that thread:

The MPI communication is staged from the host, which results in some H<->D 
transfers for every mat-vec operation.   A CUDA-aware MPI implementation might 
avoid these transfers for communication between ranks that are assigned to the 
same accelerator.   Has this been implemented or tested?

In our solver we typically run with multiple MPI ranks all assigned to a single 
device, and running with a single rank is not really feasible as we still have 
a sizable amount of work for the CPU to chew through.  Thus, I think quite a 
lot of the H<->D transfers could be avoided if I can skip the MPI staging on 
the host. I am quite new to PETSc so I wanted to ask around before blindly 
digging into this.

Thanks for your help,

David



Re: [petsc-users] Different behavior of code on different machines

2019-07-20 Thread Zhang, Junchao via petsc-users
Did you use the same number of MPI ranks and the same build options on your pc 
and on the cluster? If not, you can try to align the options on your pc with those 
on your cluster to see if you can reproduce the error on your pc. You can also try 
valgrind to see if there are memory errors, like use of uninitialized variables, 
etc.

--Junchao Zhang


On Sat, Jul 20, 2019 at 11:35 AM Yuyun Yang <yyan...@stanford.edu> wrote:
I already tested on my pc with multiple processors and it works fine. I used 
the command $PETSC_DIR/$PETSC_ARCH/bin/mpiexec -n 2 since I configured my PETSc 
with MPICH, but my local computer has openmpi.

Best,
Yuyun

From: Zhang, Junchao <jczh...@mcs.anl.gov>
Sent: Saturday, July 20, 2019 9:14 AM
To: Yuyun Yang <yyan...@stanford.edu>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] Different behavior of code on different machines

You need to test on your personal computer with multiple MPI processes (e.g., 
mpirun -n 2 ...) before moving to big machines. You may also need to configure 
petsc with --with-debugging=1 --COPTFLAGS="-O0 -g" etc. to ease debugging.
--Junchao Zhang


On Sat, Jul 20, 2019 at 11:03 AM Yuyun Yang via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hello team,

I’m encountering a problem with my code’s behavior on multiple processors. When 
I run it on my personal computer it works just fine, but when I use it on our 
computing cluster it produces an error (in one of the root-finding functions, 
an assert statement is not satisfied) and aborts.

If I just run on one processor then both machines can run the code just fine, 
but they give different results (maybe due to roundoff errors).

I’m not sure how to proceed with debugging (since I usually do it on my own 
computer which didn’t seem to encounter a bug) and would appreciate your 
advice. Thank you!

Best regards,
Yuyun


Re: [petsc-users] Different behavior of code on different machines

2019-07-20 Thread Zhang, Junchao via petsc-users
You need to test on your personal computer with multiple MPI processes (e.g., 
mpirun -n 2 ...) before moving to big machines. You may also need to configure 
petsc with --with-debugging=1 --COPTFLAGS="-O0 -g" etc. to ease debugging.
--Junchao Zhang


On Sat, Jul 20, 2019 at 11:03 AM Yuyun Yang via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hello team,

I’m encountering a problem with my code’s behavior on multiple processors. When 
I run it on my personal computer it works just fine, but when I use it on our 
computing cluster it produces an error (in one of the root-finding functions, 
an assert statement is not satisfied) and aborts.

If I just run on one processor then both machines can run the code just fine, 
but they give different results (maybe due to roundoff errors).

I’m not sure how to proceed with debugging (since I usually do it on my own 
computer which didn’t seem to encounter a bug) and would appreciate your 
advice. Thank you!

Best regards,
Yuyun


Re: [petsc-users] VecGhostRestoreLocalForm

2019-07-20 Thread Zhang, Junchao via petsc-users



On Sat, Jul 20, 2019 at 5:47 AM José Lorenzo via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hello,

I am not sure I understand the function VecGhostRestoreLocalForm. If I proceed 
as stated in the manual,



VecGhostUpdateBegin(x,INSERT_VALUES,SCATTER_FORWARD);
VecGhostUpdateEnd(x,INSERT_VALUES,SCATTER_FORWARD);
VecGhostGetLocalForm(x,&xlocal);
VecGetArray(xlocal,&xvalues);
   // access the non-ghost values in locations xvalues[0:n-1] and ghost values in locations xvalues[n:n+nghost];
VecRestoreArray(xlocal,&xvalues);
VecGhostRestoreLocalForm(x,&xlocal);


Does VecRestoreArray update the values in the local vector xlocal, and then 
VecGhostRestoreLocalForm update the values of the global vector x?

Yes, you can think of VecRestoreArray as finalizing the updates to xlocal. 
VecGhostRestoreLocalForm does not update the global vector; it is for bookkeeping 
purposes.
x and xlocal share the same memory that contains the actual vector data. If you 
changed ghost points through xvalues[], then to get the global vector x updated you 
have to call VecGhostUpdateBegin/End after the above code, for example to ADD the 
two ghost contributions.
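A minimal sketch of that follow-up step (the helper name is illustrative, and ADD_VALUES is used here just as an example combining mode):

#include <petscvec.h>

/* Sketch: after changing ghost slots through the local form above, make the
   owned values and all ghost copies consistent again. */
static PetscErrorCode SyncGhosts(Vec x)
{
  PetscErrorCode ierr;

  ierr = VecGhostUpdateBegin(x, ADD_VALUES, SCATTER_REVERSE);CHKERRQ(ierr);    /* accumulate ghost contributions onto owners */
  ierr = VecGhostUpdateEnd(x, ADD_VALUES, SCATTER_REVERSE);CHKERRQ(ierr);
  ierr = VecGhostUpdateBegin(x, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr); /* owners re-broadcast the new values to ghosts */
  ierr = VecGhostUpdateEnd(x, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  return 0;
}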


Does one need to call these two functions?

Yes.  In PETSc, *Get and *Restore have to be paired.


Re: [petsc-users] Communication during MatAssemblyEnd

2019-07-01 Thread Zhang, Junchao via petsc-users
Jose & Ale,
   -ds_method 2 fixed the problem.   I used PETSc master (f1480a5c) and slepc 
master(675b89d7) through --download-slepc. I used MKL 
/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/
   I got the following results with 2048 processors.  MatAssemblyEnd looks 
expensive to me. I am looking into it.

--- Event Stage 5: Offdiag

BuildTwoSidedF 1 1.0 1.5201e+007345.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  13  0  0  0  0 0
MatAssemblyBegin   1 1.0 1.5201e+005371.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  13  0  0  0  0 0
MatAssemblyEnd 1 1.0 1.7720e+00 1.0 0.00e+00 0.0 5.7e+04 1.3e+05 
8.0e+00  2  0  0  0  1  39  0100100100 0
VecSet 1 1.0 2.4695e-02 7.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0

--Junchao Zhang


On Mon, Jul 1, 2019 at 4:28 AM Jose E. Roman <jro...@dsic.upv.es> wrote:
You can try the following:
- Try with a different DS method: -ds_method 1  or  -ds_method 2  (see 
DSSetMethod)
- Run with -ds_parallel synchronized (see DSSetParallel)
If it does not help, send a reproducible code to slepc-maint

Jose


> On 1 Jul 2019, at 11:10, Ale Foggia via petsc-users <petsc-users@mcs.anl.gov> wrote:
>
> Oh, I also got the same error when I switched to the newest version of SLEPc 
> (using OpenBlas), and I don't know where it is coming from.
> Can you tell me which version of SLEPc and PETSc are you using? And, are you 
> using MKL?
> Thanks for trying :)
>
> On Fri, 28 Jun 2019 at 16:57, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
> Ran with 64 nodes and 32 ranks/node, met  slepc errors and did not know how 
> to proceed :(
>
> [363]PETSC ERROR: - Error Message 
> --
> [363]PETSC ERROR: Error in external library
> [363]PETSC ERROR: Error in LAPACK subroutine steqr: info=0
> [363]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html 
> for trouble shooting.
> [363]PETSC ERROR: Petsc Development GIT revision: v3.11.2-1052-gf1480a5c  GIT 
> Date: 2019-06-22 21:39:54 +
> [363]PETSC ERROR: /tmp/main.x on a arch-cray-xc40-knl-opt named nid03387 by 
> jczhang Fri Jun 28 07:26:59 2019
> [1225]PETSC ERROR: #2 DSSolve() line 586 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [1225]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [1225]PETSC ERROR: #4 EPSSolve() line 149 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
> [240]PETSC ERROR: #2 DSSolve() line 586 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
> [240]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
> [240]PETSC ERROR: #4 EPSSolve() line 149 in 
> /global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
>
> --Junchao Zhang
>
>
On Fri, Jun 28, 2019 at 4:02 AM Ale Foggia <amfog...@gmail.com> wrote:
> Junchao,
> I'm sorry for the late response.
>
> On Wed, 26 Jun 2019 at 16:39, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
> Ale,
> The job got a chance to run but failed with out-of-memory, "Some of your 
> processes may have been killed by the cgroup out-of-memory handler."
>
> I mentioned that I used 1024 nodes and 32 processes on each node because the 
> application needs a lot of memory. I think that for a system of size 38, one 
> needs above 256 nodes for sure (assuming only 32 procs per node). I would try 
> with 512 if it's possible.
>
> I also tried with 128 cores with ./main.x 2 ... and got a weird error message
> "The size of the basis has to be at least equal to the number of MPI processes used."
>
> The error comes from the fact that you put a system size of only 2 which is 
> too small.
> I can also see the problem in the assembly with system sizes smaller than 38, 
> so you can try with like 30 (for which I also have a log). In that case I run 
> with 64 nodes and 32 processes per node. I think the problem may also fit in 
> 32 nodes.
>
> --Junchao Zhang
>
>
> On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang <jczh...@mcs.anl.gov> wrote:
> Ale,
>   I successfully built your code and submitted a job to the NE

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-28 Thread Zhang, Junchao via petsc-users
Ran with 64 nodes and 32 ranks/node, met  slepc errors and did not know how to 
proceed :(

[363]PETSC ERROR: - Error Message 
--
[363]PETSC ERROR: Error in external library
[363]PETSC ERROR: Error in LAPACK subroutine steqr: info=0
[363]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[363]PETSC ERROR: Petsc Development GIT revision: v3.11.2-1052-gf1480a5c  GIT 
Date: 2019-06-22 21:39:54 +
[363]PETSC ERROR: /tmp/main.x on a arch-cray-xc40-knl-opt named nid03387 by 
jczhang Fri Jun 28 07:26:59 2019
[1225]PETSC ERROR: #2 DSSolve() line 586 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
[1225]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
[1225]PETSC ERROR: #4 EPSSolve() line 149 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c
[240]PETSC ERROR: #2 DSSolve() line 586 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/sys/classes/ds/interface/dsops.c
[240]PETSC ERROR: #3 EPSSolve_KrylovSchur_Symm() line 55 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/impls/krylov/krylovschur/ks-symm.c
[240]PETSC ERROR: #4 EPSSolve() line 149 in 
/global/u1/j/jczhang/petsc/arch-cray-xc40-knl-opt/externalpackages/git.slepc/src/eps/interface/epssolve.c

--Junchao Zhang


On Fri, Jun 28, 2019 at 4:02 AM Ale Foggia <amfog...@gmail.com> wrote:
Junchao,
I'm sorry for the late response.

On Wed, 26 Jun 2019 at 16:39, Zhang, Junchao (<jczh...@mcs.anl.gov>) wrote:
Ale,
The job got a chance to run but failed with out-of-memory, "Some of your 
processes may have been killed by the cgroup out-of-memory handler."

I mentioned that I used 1024 nodes and 32 processes on each node because the 
application needs a lot of memory. I think that for a system of size 38, one 
needs above 256 nodes for sure (assuming only 32 procs per node). I would try 
with 512 if it's possible.

I also tried with 128 cores with ./main.x 2 ... and got a weird error message 
"The size of the basis has to be at least equal to the number of MPI processes used."

The error comes from the fact that you put a system size of only 2 which is too 
small.
I can also see the problem in the assembly with system sizes smaller than 38, 
so you can try with like 30 (for which I also have a log). In that case I run 
with 64 nodes and 32 processes per node. I think the problem may also fit in 32 
nodes.

--Junchao Zhang


On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang <jczh...@mcs.anl.gov> wrote:
Ale,
  I successfully built your code and submitted a job to the NERSC Cori machine 
requiring 32768 KNL cores and one and a half hours. It is estimated to run in 3 
days. If you also observed the same problem with less cores, what is your input 
arguments?  Currently, I use what in your log file, ./main.x 38 -nn -j1 1.0 -d1 
1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view
  The smaller the better. Thanks.
--Junchao Zhang


On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia <amfog...@gmail.com> wrote:
Yes, I used KNL nodes. If you can perform the test, that would be great. Could it be 
that I'm not using the correct configuration of the KNL nodes? These are the 
environment variables I set:
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
KMP_HW_SUBSET=1t
KMP_AFFINITY=compact
I_MPI_PIN_DOMAIN=socket
I_MPI_PIN_PROCESSOR_LIST=0-63
MKL_DYNAMIC=0

The code is in https://github.com/amfoggia/LSQuantumED and it has a readme to 
compile it and run it. When I ran the test I used only 32 processors per node, 
and I used 1024 nodes in total, and it's for nspins=38.
Thank you

El vie., 21 jun. 2019 a las 20:03, Zhang, Junchao 
(mailto:jczh...@mcs.anl.gov>>) escribió:
Ale,
  Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes  
one year ago. He used 32768 processors and called MatAssemblyEnd 118 times and 
it used only 1.5 seconds in total.  So I guess something was wrong with your 
test. If you can share your code, I can have a test on our machine to see how 
it goes.
 Thanks.
--Junchao Zhang


On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
MatAssembly was called once (in stage 5) and cost 2.5% of the total time.  Look 
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd. The ratio is 1.0. It means processors are already 
synchronized. With 32768 processors, there are 1.2e+06 messages with average 
length 1.9e+06 bytes. So each processor sends 3

Re: [petsc-users] DMPlexDistributeField

2019-06-27 Thread Zhang, Junchao via petsc-users


On Thu, Jun 27, 2019 at 4:50 PM Adrian Croucher 
mailto:a.crouc...@auckland.ac.nz>> wrote:
hi

On 28/06/19 3:14 AM, Zhang, Junchao wrote:
> You can dump relevant SFs to make sure their graph is correct.


Yes, I'm doing that, and the graphs don't look correct.
Check how the graph is created and then whether the parameters to 
PetscSFSetGraph() are correct.


- Adrian

--
Dr Adrian Croucher
Senior Research Fellow
Department of Engineering Science
University of Auckland, New Zealand
email: a.crouc...@auckland.ac.nz<mailto:a.crouc...@auckland.ac.nz>
tel: +64 (0)9 923 4611



Re: [petsc-users] DMPlexDistributeField

2019-06-27 Thread Zhang, Junchao via petsc-users


On Wed, Jun 26, 2019 at 11:12 PM Adrian Croucher 
mailto:a.crouc...@auckland.ac.nz>> wrote:

hi

On 27/06/19 4:07 PM, Zhang, Junchao wrote:

 Adrian, I am working on SF but know nothing about DMPlexDistributeField. Do 
you think SF creation or communication is wrong? If yes, I'd like to know the 
detail.  I have a branch jczhang/sf-more-opts, which adds some optimizations to 
SF.  It probably won't solve your problem. But since it changes SF a lot, it's 
better to have a try.


My suspicion is that there may be a problem in DMPlexDistribute(), so that the 
distribution SF entries for DMPlex faces are not correct when overlap > 0.

You can dump relevant SFs to make sure their graph is correct.


So it's probably not a problem with SF as such.

- Adrian

--
Dr Adrian Croucher
Senior Research Fellow
Department of Engineering Science
University of Auckland, New Zealand
email: a.crouc...@auckland.ac.nz<mailto:a.crouc...@auckland.ac.nz>
tel: +64 (0)9 923 4611



Re: [petsc-users] DMPlexDistributeField

2019-06-26 Thread Zhang, Junchao via petsc-users


On Mon, Jun 24, 2019 at 6:23 PM Adrian Croucher via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:

hi

Thanks Matt for the explanation about this.

I have been trying a test which does the following:

1) read in DMPlex from file

2) distribute it, with overlap = 1, using DMPlexDistribute()

3) create FVM cell and face geometry vectors using DMPlexComputeGeometryFVM()

4) re-distribute, again with overlap = 1, using DMPlexDistribute()

5) distribute the cell and face geometry vectors using DMPlexDistributeField()


Steps 4) and 5) should do essentially nothing, because the mesh has already 
been distributed (but in my actual non-test code, there is additional stuff 
between steps 3) and 4) where dual porosity cells are added to the DM).

So I expect the cell and face geometry vectors to be essentially unchanged from 
the redistribution. And the redistribution SF (from the second distribution) 
should be just an identity mapping on the cell and face points (except for the 
overlap ghost points).

This is true for the cells, but not the faces. I've attached the example code 
and mesh. It is a simple mesh with 10 cells in a horizontal line, each cell 
50x50x50 m.

If I run on 2 processes, there are 5 cells (points 0 - 4) on each rank, with 
centroids at 25, 75, 125, 175 and 225 m on rank 0, and 275, 325, 375, 425 and 
475 m on rank 1. The internal faces are the points 36, 42, 47 and 52 on rank 0, 
and 34, 37, 42, 47 and 52 on rank 1. On rank 0 these should have centroids at 
50, 100, 150 and 200 m respectively; on rank 1 they should be at 250, 300, 350 
and 400 m. This is true before redistribution.

After redistribution, the cells centroids are still correct, and the face data 
on rank 1 are OK, but the face data on rank 0 are all wrong.

If you look at the redistribution SF the entries for the rank 0 face data are 
36 <- (0,40), 42 <- (0,46), 47 <- (0,51), 52 <- (0,56), instead of the expected 
36 <- (0,36), 42 <- (0,42), 47 <- (0,47), 52 <- (0,52). The SF for the rank 1 
faces is OK.

 Adrian, I am working on SF but know nothing about DMPlexDistributeField. Do 
you think SF creation or communication is wrong? If yes, I'd like to know the 
detail.  I have a branch jczhang/sf-more-opts, which adds some optimizations to 
SF.  It probably won't solve your problem. But since it changes SF a lot, it's 
better to have a try.

If you change the overlap from 1 to 0, it works as expected. So it looks to me 
like something isn't quite right with the SF for faces when there is overlap. 
On rank 0 all the entries seem to be shifted up by 4.

I know you originally recommended using overlap = 0 for the initial 
distribution and only adding overlap for the redistribution. But then Stefano 
indicated that it should work with overlap now. And it would simplify my code 
if I could use overlap for the initial distribution (because if dual porosity 
cells are not being used, then there is no second redistribution).

Is this a bug or is there something I'm doing wrong?

- Adrian

On 23/06/19 4:39 PM, Matthew Knepley wrote:
On Fri, Jun 21, 2019 at 12:49 AM Adrian Croucher via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
I have been trying to get this FVM geometry data re-distribution to work
using DMPlexDistributeField().

It seems to be working OK for the cell geometry data (cell volumes and
centroids). But it is making a mess of the face geometry data (face
normals and centroids).

Should I even expect DMPlexDistributeField() to work for redistributing
a vector of data defined on mesh faces? Or is there some reason I
haven't thought of, which means that will never work?

Sorry this took a long time. The place I was in France did not have internet.
Here is how this stuff works:

  1) You start with a Section and local vector. The Section describes layout of 
data in
   the local vector by mapping mesh points to {# dof, offset}. For a small 
example,
   suppose I had two triangles sharing an edge on a sequential mesh for 2 
procs.
   The mesh points would be

 [0, 1]:  Cells
 [2, 5]:  Vertices, where 3,4 are shared
 [6, 10]: Edges, where 8 is shared

   A Section for face normals would then look like

Process 0
[0, 5]: {0, 0}   Meaning no variables lie on cells or vertices
6:   {2, 0}   One vector per face
7:   {2, 2}
8:   {2, 4}
9:   {2, 6}
10:  {2, 8}
Process 1
empty

 The vector would just have the face normal values in the canonical order. 
You can use
 PetscSectionView() to check that yours looks similar.

  2) Now we add a PetscSF describing the redistribution. An SF is a map from a 
given set of
  integers (leaves) to pairs (int, rank) called roots. Many leaves can 
point to one root. To begin,
  we provide an SF mapping mesh points to the new distribution

Process 0
0 -> {0, 0}
1 -> {2, 0}
2 -> {3, 0}
3 -> {4, 0}
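
For checking such a redistribution SF ("dump relevant SFs", as suggested elsewhere in this thread), a minimal sketch of building and viewing an SF in C follows. The leaf-to-root mapping simply mirrors the four process-0 entries listed above and is purely illustrative; it is not taken from any actual DMPlexDistribute() output.

  #include <petsc.h>

  int main(int argc, char **argv)
  {
    PetscSF        sf;
    PetscSFNode    remote[4];
    PetscInt       nroots = 11, nleaves = 4, i;
    const PetscInt rootidx[4] = {0, 2, 3, 4};  /* leaf i maps to root rootidx[i] on rank 0 */
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    for (i = 0; i < nleaves; ++i) {remote[i].rank = 0; remote[i].index = rootidx[i];}
    ierr = PetscSFCreate(PETSC_COMM_WORLD, &sf);CHKERRQ(ierr);
    /* ilocal = NULL means the leaves are 0..nleaves-1 in order */
    ierr = PetscSFSetGraph(sf, nroots, nleaves, NULL, PETSC_COPY_VALUES, remote, PETSC_COPY_VALUES);CHKERRQ(ierr);
    ierr = PetscSFView(sf, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);  /* compare against the expected mapping */
    ierr = PetscSFDestroy(&sf);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }

Viewing the SF this way is the quickest check that the PetscSFSetGraph() parameters match the intended point mapping.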

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-26 Thread Zhang, Junchao via petsc-users
Ale,
The job got a chance to run but failed with out-of-memory, "Some of your 
processes may have been killed by the cgroup out-of-memory handler."
I also tried with 128 cores with ./main.x 2 ... and got a weird error message:
"The size of the basis has to be at least equal to the number of MPI processes used."
--Junchao Zhang


On Tue, Jun 25, 2019 at 11:24 PM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
Ale,
  I successfully built your code and submitted a job to the NERSC Cori machine 
requiring 32768 KNL cores and one and a half hours. It is estimated to run in 3 
days. If you also observed the same problem with fewer cores, what are your input 
arguments? Currently, I use what is in your log file: ./main.x 38 -nn -j1 1.0 -d1 
1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view
  The smaller the better. Thanks.
--Junchao Zhang


On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia 
mailto:amfog...@gmail.com>> wrote:
Yes, I used KNL nodes. If you can perform the test, that would be great. Could it be 
that I'm not using the correct configuration of the KNL nodes? These are the 
environment variables I set:
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
KMP_HW_SUBSET=1t
KMP_AFFINITY=compact
I_MPI_PIN_DOMAIN=socket
I_MPI_PIN_PROCESSOR_LIST=0-63
MKL_DYNAMIC=0

The code is in https://github.com/amfoggia/LSQuantumED and it has a readme to 
compile it and run it. When I ran the test I used only 32 processors per node, 
and I used 1024 nodes in total, and it's for nspins=38.
Thank you

El vie., 21 jun. 2019 a las 20:03, Zhang, Junchao 
(mailto:jczh...@mcs.anl.gov>>) escribió:
Ale,
  Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes  
one year ago. He used 32768 processors and called MatAssemblyEnd 118 times and 
it used only 1.5 seconds in total.  So I guess something was wrong with your 
test. If you can share your code, I can have a test on our machine to see how 
it goes.
 Thanks.
--Junchao Zhang


On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
MatAssembly was called once (in stage 5) and cost 2.5% of the total time.  Look 
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd. The ratio is 1.0. It means processors are already 
synchronized. With 32768 processors, there are 1.2e+06 messages with average 
length 1.9e+06 bytes. So each processor sends 36 (1.2e+06/32768) ~2MB messages 
and it takes 54 seconds. Another chance is the reduction at  MatAssemblyEnd. I 
don't know why it needs 8 reductions. In my mind, one is enough. I need to look 
at the code.

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  
7.768e+06   13.1%  2.530e+02  13.0%
 1:Create Basis: 7.9234e-02   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 2:  Create Lattice: 8.3944e-05   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 3:   Create Hamilt: 1.0694e+02   1.6%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  2.000e+00   0.1%
 5: Offdiag: 1.6525e+02   2.5%  0.e+00   0.0%  1.188e+06   0.2%  
1.942e+060.0%  8.000e+00   0.4%
 6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  
7.768e+06   86.8%  1.674e+03  86.1%

--- Event Stage 5: Offdiag
BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyBegin   1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyEnd 1 1.0 5.3762e+01 1.0  0.00e+00 0.0 1.2e+06 1.9e+06 
8.0e+00  1  0  0  0  0  33  0100100100 0
VecSet 1 1.0 7.5533e-02 9.0  0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0


--Junchao Zhang


On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

   The load balance is definitely out of whack.



BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
0.0e+00  2  4 13 13  0  15 25100100  0 2935476
MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
VecCopy6 1.0

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-25 Thread Zhang, Junchao via petsc-users
Ale,
  I successfully built your code and submitted a job to the NERSC Cori machine 
requiring 32768 KNL cores and one and a half hours. It is estimated to run in 3 
days. If you also observed the same problem with fewer cores, what are your input 
arguments? Currently, I use what is in your log file: ./main.x 38 -nn -j1 1.0 -d1 
1.0 -eps_type krylovschur -eps_tol 1e-9 -log_view
  The smaller the better. Thanks.
--Junchao Zhang


On Mon, Jun 24, 2019 at 6:20 AM Ale Foggia 
mailto:amfog...@gmail.com>> wrote:
Yes, I used KNL nodes. If you can perform the test, that would be great. Could it be 
that I'm not using the correct configuration of the KNL nodes? These are the 
environment variables I set:
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
KMP_HW_SUBSET=1t
KMP_AFFINITY=compact
I_MPI_PIN_DOMAIN=socket
I_MPI_PIN_PROCESSOR_LIST=0-63
MKL_DYNAMIC=0

The code is in https://github.com/amfoggia/LSQuantumED and it has a readme to 
compile it and run it. When I ran the test I used only 32 processors per node, 
and I used 1024 nodes in total, and it's for nspins=38.
Thank you

El vie., 21 jun. 2019 a las 20:03, Zhang, Junchao 
(mailto:jczh...@mcs.anl.gov>>) escribió:
Ale,
  Did you use Intel KNL nodes?  Mr. Hong (cc'ed) did experiments on KNL nodes  
one year ago. He used 32768 processors and called MatAssemblyEnd 118 times and 
it used only 1.5 seconds in total.  So I guess something was wrong with your 
test. If you can share your code, I can have a test on our machine to see how 
it goes.
 Thanks.
--Junchao Zhang


On Fri, Jun 21, 2019 at 11:00 AM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
MatAssembly was called once (in stage 5) and cost 2.5% of the total time.  Look 
at stage 5. It says MatAssemblyBegin calls BuildTwoSidedF, which does global 
synchronization. The high max/min ratio means load imbalance. What I do not 
understand is MatAssemblyEnd. The ratio is 1.0. It means processors are already 
synchronized. With 32768 processors, there are 1.2e+06 messages with average 
length 1.9e+06 bytes. So each processor sends 36 (1.2e+06/32768) ~2MB messages 
and it takes 54 seconds. Another chance is the reduction at  MatAssemblyEnd. I 
don't know why it needs 8 reductions. In my mind, one is enough. I need to look 
at the code.

Summary of Stages:   - Time --  - Flop --  --- Messages ---  -- 
Message Lengths --  -- Reductions --
Avg %Total Avg %TotalCount   %Total 
Avg %TotalCount   %Total
 0:  Main Stage: 8.5045e+02  13.0%  3.0633e+15  14.0%  8.196e+07  13.1%  
7.768e+06   13.1%  2.530e+02  13.0%
 1:Create Basis: 7.9234e-02   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 2:  Create Lattice: 8.3944e-05   0.0%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  0.000e+00   0.0%
 3:   Create Hamilt: 1.0694e+02   1.6%  0.e+00   0.0%  0.000e+00   0.0%  
0.000e+000.0%  2.000e+00   0.1%
 5: Offdiag: 1.6525e+02   2.5%  0.e+00   0.0%  1.188e+06   0.2%  
1.942e+060.0%  8.000e+00   0.4%
 6: Phys quantities: 5.4045e+03  82.8%  1.8866e+16  86.0%  5.417e+08  86.7%  
7.768e+06   86.8%  1.674e+03  86.1%

--- Event Stage 5: Offdiag
BuildTwoSidedF 1 1.0 7.1565e+01 148448.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyBegin   1 1.0 7.1565e+01 127783.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0  28  0  0  0  0 0
MatAssemblyEnd 1 1.0 5.3762e+01 1.0  0.00e+00 0.0 1.2e+06 1.9e+06 
8.0e+00  1  0  0  0  0  33  0100100100 0
VecSet 1 1.0 7.5533e-02 9.0  0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0


--Junchao Zhang


On Fri, Jun 21, 2019 at 10:34 AM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

   The load balance is definitely out of whack.



BuildTwoSidedF 1 1.0 1.6722e-0241.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatMult  138 1.0 2.6604e+02 7.4 3.19e+10 2.1 8.2e+07 7.8e+06 
0.0e+00  2  4 13 13  0  15 25100100  0 2935476
MatAssemblyBegin   1 1.0 1.6807e-0236.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyEnd 1 1.0 3.5680e-01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecNorm2 1.0 4.4252e+0174.8 1.73e+07 1.0 0.0e+00 0.0e+00 
2.0e+00  1  0  0  0  0   5  0  0  0  1 12780
VecCopy6 1.0 6.5655e-02 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
VecAXPY2 1.0 1.3793e-02 2.7 1.73e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 41000838
VecScatterBegin  138 1.0 1.1653e+0285.8 0.00e+00 0.0 8.2e+07 7.8e+06 
0.0e+00  1  0 13 13  0   4  0100100  0 0
VecScatterEnd138 1.0 1.3653e+0222.4 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   4  0  0  0  0 0
VecSetRandom   1 

Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Zhang, Junchao via petsc-users
> ..." or "Event Stage 5: Offdiag").
>
> El vie., 21 jun. 2019 a las 16:09, Zhang, Junchao 
> (mailto:jczh...@mcs.anl.gov>>) escribió:
>
>
> On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia 
> mailto:amfog...@gmail.com>> wrote:
> Thanks both of you for your answers,
>
> El jue., 20 jun. 2019 a las 22:20, Smith, Barry F. 
> (mailto:bsm...@mcs.anl.gov>>) escribió:
>
>   Note that this is a one time cost if the nonzero structure of the matrix 
> stays the same. It will not happen in future MatAssemblies.
>
> > On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
> > mailto:petsc-users@mcs.anl.gov>> wrote:
> >
> > Those messages were used to build MatMult communication pattern for the 
> > matrix. They were not part of the matrix entries-passing you imagined, but 
> > indeed happened in MatAssemblyEnd. If you want to make sure processors do 
> > not set remote entries, you can use 
> > MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
> > error when an off-proc entry is set.
>
> I started being concerned about this when I saw that the assembly was taking 
> a few hundreds of seconds in my code, like 180 seconds, which for me is a 
> considerable time. Do you think (or maybe you need more information to answer 
> this) that this time is "reasonable" for communicating the pattern for the 
> matrix? I already checked that I'm not setting any remote entries.
> It is not reasonable. Could you send log view of that test with 180 seconds 
> MatAssembly?
>
> Also I see (in my code) that even if there are no messages being passed 
> during MatAssemblyBegin, it is taking time and the "ratio" is very big.
>
> >
> >
> > --Junchao Zhang
> >
> >
> > On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
> > mailto:petsc-users@mcs.anl.gov>> wrote:
> > Hello all!
> >
> > During the conference I showed you a problem happening during 
> > MatAssemblyEnd in a particular code that I have. Now, I tried the same with 
> > a simple code (a symmetric problem corresponding to the Laplacian operator 
> > in 1D, from the SLEPc Hands-On exercises). As I understand (and please, 
> > correct me if I'm wrong), in this case the elements of the matrix are 
> > computed locally by each process so there should not be any communication 
> > during the assembly. However, in the log I get that there are messages 
> > being passed. Also, the number of messages changes with the number of 
> > processes used and the size of the matrix. Could you please help me 
> > understand this?
> >
> > I attach the code I used and the log I get for a small problem.
> >
> > Cheers,
> > Ale
> >
>
> 



Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-21 Thread Zhang, Junchao via petsc-users


On Fri, Jun 21, 2019 at 8:07 AM Ale Foggia 
mailto:amfog...@gmail.com>> wrote:
Thanks both of you for your answers,

El jue., 20 jun. 2019 a las 22:20, Smith, Barry F. 
(mailto:bsm...@mcs.anl.gov>>) escribió:

  Note that this is a one time cost if the nonzero structure of the matrix 
stays the same. It will not happen in future MatAssemblies.

> On Jun 20, 2019, at 3:16 PM, Zhang, Junchao via petsc-users 
> mailto:petsc-users@mcs.anl.gov>> wrote:
>
> Those messages were used to build MatMult communication pattern for the 
> matrix. They were not part of the matrix entries-passing you imagined, but 
> indeed happened in MatAssemblyEnd. If you want to make sure processors do not 
> set remote entries, you can use 
> MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), which will generate an 
> error when an off-proc entry is set.

I started being concerned about this when I saw that the assembly was taking a 
few hundreds of seconds in my code, like 180 seconds, which for me is a 
considerable time. Do you think (or maybe you need more information to answer 
this) that this time is "reasonable" for communicating the pattern for the 
matrix? I already checked that I'm not setting any remote entries.
It is not reasonable. Could you send log view of that test with 180 seconds 
MatAssembly?

Also I see (in my code) that even if there are no messages being passed during 
MatAssemblyBegin, it is taking time and the "ratio" is very big.

>
>
> --Junchao Zhang
>
>
> On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
> mailto:petsc-users@mcs.anl.gov>> wrote:
> Hello all!
>
> During the conference I showed you a problem happening during MatAssemblyEnd 
> in a particular code that I have. Now, I tried the same with a simple code (a 
> symmetric problem corresponding to the Laplacian operator in 1D, from the 
> SLEPc Hands-On exercises). As I understand (and please, correct me if I'm 
> wrong), in this case the elements of the matrix are computed locally by each 
> process so there should not be any communication during the assembly. 
> However, in the log I get that there are messages being passed. Also, the 
> number of messages changes with the number of processes used and the size of 
> the matrix. Could you please help me understand this?
>
> I attach the code I used and the log I get for a small problem.
>
> Cheers,
> Ale
>



Re: [petsc-users] Communication during MatAssemblyEnd

2019-06-20 Thread Zhang, Junchao via petsc-users
Those messages were used to build MatMult communication pattern for the matrix. 
They were not part of the matrix entries-passing you imagined, but indeed 
happened in MatAssemblyEnd. If you want to make sure processors do not set 
remote entries, you can use MatSetOption(A,MAT_NO_OFF_PROC_ENTRIES,PETSC_TRUE), 
which will generate an error when an off-proc entry is set.
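
For reference, a minimal sketch of setting that option (the matrix size and the diagonal-only fill are made up for illustration):

  #include <petsc.h>

  int main(int argc, char **argv)
  {
    Mat            A;
    PetscInt       i, rstart, rend, N = 100;
    PetscScalar    v = 2.0;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
    ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, N, N, 1, NULL, 0, NULL, &A);CHKERRQ(ierr);
    /* Error out immediately if any rank sets an entry owned by another rank */
    ierr = MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);
    ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
    for (i = rstart; i < rend; ++i) {  /* purely local (diagonal) entries */
      ierr = MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);CHKERRQ(ierr);
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }

With the option set, any MatSetValues() call on a row owned by another rank aborts with an error instead of being stashed and communicated during assembly.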


--Junchao Zhang


On Thu, Jun 20, 2019 at 4:13 AM Ale Foggia via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello all!

During the conference I showed you a problem happening during MatAssemblyEnd in 
a particular code that I have. Now, I tried the same with a simple code (a 
symmetric problem corresponding to the Laplacian operator in 1D, from the SLEPc 
Hands-On exercises). As I understand (and please, correct me if I'm wrong), in 
this case the elements of the matrix are computed locally by each process so 
there should not be any communication during the assembly. However, in the log 
I get that there are messages being passed. Also, the number of messages 
changes with the number of processes used and the size of the matrix. Could you 
please help me understand this?

I attach the code I used and the log I get for a small problem.

Cheers,
Ale



Re: [petsc-users] Memory growth issue

2019-06-05 Thread Zhang, Junchao via petsc-users
Sanjay,
   You have one more reason to use VecScatter, which is heavily used and 
well-tested.
--Junchao Zhang


On Wed, Jun 5, 2019 at 5:47 PM Sanjay Govindjee 
mailto:s...@berkeley.edu>> wrote:
I found the bug (naturally in my own code).  When I made the MPI_Wait( )
changes, I missed one location where this
was needed.   See the attached graphs for openmpi and mpich using
CG+Jacobi and GMRES+BJacobi.

Interesting that openmpi did not care about this but mpich did. Also 
interesting that the memory was growing so much when the data packets 
going back and forth were just a few hundred bytes.

Thanks for your efforts and patience.

-sanjay


On 6/5/19 2:38 PM, Smith, Barry F. wrote:
>Are you reusing the same KSP the whole time, just making calls to 
> KSPSolve, or are you creating a new KSP object?
>
>Do you make any calls to KSPReset()?
>
>Are you doing any MPI_Comm_dup()?
>
>Are you attaching any attributes to MPI communicators?
>
> Thanks
>
>> On Jun 5, 2019, at 1:18 AM, Sanjay Govindjee 
>> mailto:s...@berkeley.edu>> wrote:
>>
>> Junchao,
>>
>>Attached is a graph of total RSS from my Mac using openmpi and mpich 
>> (installed with --download-openmpi and --download-mpich).
>>
>>The difference is pretty stark!  The WaitAll( ) in my part of the code 
>> fixed the run away memory
>> problem using openmpi but definitely not with mpich.
>>
>>Tomorrow I hope to get my linux box set up; unfortunately it needs an OS 
>> update :(
>> Then I can try to run there and reproduce the same (or find out it is a Mac 
>> quirk, though the
>> reason I started looking at this was that a use on an HPC system pointed it 
>> out to me).
>>
>> -sanjay
>>
>> PS: To generate the data, all I did was place a call to 
>> PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an 
>> MPI_AllReduce( ) to sum across the job (4 processors).
>>
>> On 6/4/19 4:27 PM, Zhang, Junchao wrote:
>>> Hi, Sanjay,
>>>I managed to use Valgrind massif + MPICH master + PETSc master. I ran 
>>> ex5 500 time steps with "mpirun -n 4 valgrind --tool=massif 
>>> --max-snapshots=200 --detailed-freq=1 ./ex5 -da_grid_x 512 -da_grid_y 512 
>>> -ts_type beuler -ts_max_steps 500 -malloc"
>>>I visualized the output with massif-visualizer. From the attached 
>>> picture, we can see the total heap size keeps constant most of the time and 
>>> is NOT monotonically increasing.  We can also see MPI only allocated memory 
>>> at initialization time and kept it. So it is unlikely that MPICH keeps 
>>> allocating memory in each KSPSolve call.
>>>From graphs you sent, I can only see RSS is randomly increased after 
>>> KSPSolve, but that does not mean heap size keeps increasing.  I recommend 
>>> you also profile your code with valgrind massif and visualize it. I failed 
>>> to install massif-visualizer on MacBook and CentOS. But I easily got it 
>>> installed on Ubuntu.
>>>I want you to confirm that with the MPI_Waitall fix, you still run out 
>>> of memory with MPICH (but not OpenMPI).  If needed, I can hack MPICH to get 
>>> its current memory usage so that we can calculate its difference after each 
>>> KSPSolve call.
>>>
>>> 
>>>
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Mon, Jun 3, 2019 at 6:36 PM Sanjay Govindjee 
>>> mailto:s...@berkeley.edu>> wrote:
>>> Junchao,
>>>    It won't be feasible to share the code but I will run a similar test as 
>>> you have done (large problem); I will
>>> try with both MPICH and OpenMPI.  I also agree that deltas are not ideal as 
>>> they do not account for latency in the freeing of memory
>>> etc.  But I will note that when we have the memory growth issue, latency 
>>> associated with free( ) appears not to be in play since the total
>>> memory footprint grows monotonically.
>>>
>>>I'll also have a look at massif.  If you figure out the interface, and 
>>> can send me the lines to instrument the code with that will save me
>>> some time.
>>> -sanjay
>>> On 6/3/19 3:17 PM, Zhang, Junchao wrote:
>>>> Sanjay & Barry,
>>>>Sorry, I made a mistake when I said I could reproduce Sanjay's 
>>>> experiments. I found 1) to correctly use PetscMallocGetCurrentUsage() when 
>>>> petsc is configured without debugging, I have to add -malloc to run the 
>>>> program. 2) I have to instrument the code outside of KS

Re: [petsc-users] Memory growth issue

2019-06-05 Thread Zhang, Junchao via petsc-users
OK, I see. I mistakenly read  PetscMemoryGetCurrentUsage as 
PetscMallocGetCurrentUsage.  You should also do PetscMallocGetCurrentUsage(), 
so that we know whether the increased memory is allocated by PETSc.
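
A minimal sketch of that experiment (the wrapper name is made up; note that PetscMallocGetCurrentUsage() needs the -malloc option when PETSc is configured without debugging, as mentioned later in this thread):

  #include <petsc.h>

  /* Report the RSS and PETSc-malloc deltas of one KSPSolve, summed over all ranks. */
  static PetscErrorCode KSPSolveWithMemReport(KSP ksp, Vec b, Vec x)
  {
    PetscLogDouble rss0, rss1, mal0, mal1, loc[2], glob[2];
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    ierr = PetscMemoryGetCurrentUsage(&rss0);CHKERRQ(ierr);
    ierr = PetscMallocGetCurrentUsage(&mal0);CHKERRQ(ierr);
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
    ierr = PetscMemoryGetCurrentUsage(&rss1);CHKERRQ(ierr);
    ierr = PetscMallocGetCurrentUsage(&mal1);CHKERRQ(ierr);
    loc[0] = rss1 - rss0;
    loc[1] = mal1 - mal0;
    ierr = MPI_Allreduce(loc, glob, 2, MPI_DOUBLE, MPI_SUM, PetscObjectComm((PetscObject)ksp));CHKERRQ(ierr);
    ierr = PetscPrintf(PetscObjectComm((PetscObject)ksp), "RSS Delta=%g, Malloc Delta=%g\n", glob[0], glob[1]);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }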

On Wed, Jun 5, 2019, 9:58 AM Sanjay GOVINDJEE 
mailto:s...@berkeley.edu>> wrote:
PetscMemoryGetCurrentUsage( ) is just a cover for rgetusage( ), so the use of 
the function is unrelated to Petsc.  The only difference here is mpich versus 
openmpi.
Notwithstanding, I can make a plot of the sum of the deltas around kspsolve.

Sent from my iPad

On Jun 5, 2019, at 7:22 AM, Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:

Sanjay,
  It sounds like the memory is allocated by PETSc, since you call 
PetscMemoryGetCurrentUsage().  Make sure you use the latest PETSc version. You 
can also do an experiment that puts two PetscMemoryGetCurrentUsage() before & 
after KSPSolve(), calculates the delta, and then sums over processes, so we 
know whether the memory is allocated in KSPSolve().

--Junchao Zhang


On Wed, Jun 5, 2019 at 1:19 AM Sanjay Govindjee 
mailto:s...@berkeley.edu>> wrote:
Junchao,

  Attached is a graph of total RSS from my Mac using openmpi and mpich 
(installed with --download-openmpi and --download-mpich).

  The difference is pretty stark!  The WaitAll( ) in my part of the code fixed 
the run away memory
problem using openmpi but definitely not with mpich.

  Tomorrow I hope to get my linux box set up; unfortunately it needs an OS 
update :(
Then I can try to run there and reproduce the same (or find out it is a Mac 
quirk, though the
reason I started looking at this was that a user on an HPC system pointed it out 
to me).

-sanjay

PS: To generate the data, all I did was place a call to 
PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an 
MPI_AllReduce( ) to sum across the job (4 processors).

On 6/4/19 4:27 PM, Zhang, Junchao wrote:
Hi, Sanjay,
  I managed to use Valgrind massif + MPICH master + PETSc master. I ran ex5 500 
time steps with "mpirun -n 4 valgrind --tool=massif --max-snapshots=200 
--detailed-freq=1 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler 
-ts_max_steps 500 -malloc"
  I visualized the output with massif-visualizer. From the attached picture, we 
can see the total heap size keeps constant most of the time and is NOT 
monotonically increasing.  We can also see MPI only allocated memory at 
initialization time and kept it. So it is unlikely that MPICH keeps allocating 
memory in each KSPSolve call.
  From graphs you sent, I can only see RSS is randomly increased after 
KSPSolve, but that does not mean heap size keeps increasing.  I recommend you 
also profile your code with valgrind massif and visualize it. I failed to 
install massif-visualizer on MacBook and CentOS. But I easily got it installed 
on Ubuntu.
  I want you to confirm that with the MPI_Waitall fix, you still run out of 
memory with MPICH (but not OpenMPI).  If needed, I can hack MPICH to get its 
current memory usage so that we can calculate its difference after each 
KSPSolve call.




--Junchao Zhang


On Mon, Jun 3, 2019 at 6:36 PM Sanjay Govindjee 
mailto:s...@berkeley.edu>> wrote:
Junchao,
  It won't be feasible to share the code but I will run a similar test as you 
have done (large problem); I will
try with both MPICH and OpenMPI.  I also agree that deltas are not ideal as 
they do not account for latency in the freeing of memory
etc.  But I will note that when we have the memory growth issue, latency associated 
with free( ) appears not to be in play since the total
memory footprint grows monotonically.

  I'll also have a look at massif.  If you figure out the interface, and can 
send me the lines to instrument the code with that will save me
some time.
-sanjay
On 6/3/19 3:17 PM, Zhang, Junchao wrote:
Sanjay & Barry,
  Sorry, I made a mistake when I said I could reproduce Sanjay's experiments. 
I found 1) to correctly use PetscMallocGetCurrentUsage() when petsc is 
configured without debugging, I have to add -malloc to run the program. 2) I 
have to instrument the code outside of KSPSolve(). In my case, it is in 
SNESSolve_NEWTONLS. In old experiments, I did it inside KSPSolve. Since 
KSPSolve can recursively call KSPSolve, the old results were misleading.
 With these fixes, I measured differences of RSS and Petsc malloc before/after 
KSPSolve. I did experiments on MacBook using 
src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands like 
mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler -ts_max_steps 500 
-malloc.
 I find if the grid size is small, I can see a non-zero RSS-delta randomly, 
either with one mpi rank or multiple ranks, with MPICH or OpenMPI. If I 
increase grid sizes, e.g., -da_grid_x 256 -da_grid_y 256, I only see non-zero 
RSS-delta randomly at the first few iterations (with MPICH or OpenMPI). When 
the computer workload is high by simultaneously running ex5-openmpi and 
ex5-mpich, th

Re: [petsc-users] Memory growth issue

2019-06-03 Thread Zhang, Junchao via petsc-users


On Mon, Jun 3, 2019 at 5:23 PM Stefano Zampini 
mailto:stefano.zamp...@gmail.com>> wrote:


On Jun 4, 2019, at 1:17 AM, Zhang, Junchao via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:

Sanjay & Barry,
  Sorry, I made a mistake when I said I could reproduce Sanjay's experiments. 
I found 1) to correctly use PetscMallocGetCurrentUsage() when petsc is 
configured without debugging, I have to add -malloc to run the program. 2) I 
have to instrument the code outside of KSPSolve(). In my case, it is in 
SNESSolve_NEWTONLS. In old experiments, I did it inside KSPSolve. Since 
KSPSolve can recursively call KSPSolve, the old results were misleading.
 With these fixes, I measured differences of RSS and Petsc malloc before/after 
KSPSolve. I did experiments on MacBook using 
src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands like 
mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler -ts_max_steps 500 
-malloc.
 I find if the grid size is small, I can see a non-zero RSS-delta randomly, 
either with one mpi rank or multiple ranks, with MPICH or OpenMPI. If I 
increase grid sizes, e.g., -da_grid_x 256 -da_grid_y 256, I only see non-zero 
RSS-delta randomly at the first few iterations (with MPICH or OpenMPI). When 
the computer workload is high by simultaneously running ex5-openmpi and 
ex5-mpich, the MPICH one pops up much more non-zero RSS-delta. But "Malloc 
Delta" behavior is stable across all runs. There is only one nonzero malloc 
delta value in the first KSPSolve call. All remaining are zero. Something like 
this:
mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler 
-ts_max_steps 500 -malloc
RSS Delta=   32489472, Malloc Delta=   26290304, RSS End=  136114176
RSS Delta=  32768, Malloc Delta=  0, RSS End=  138510336
RSS Delta=  0, Malloc Delta=  0, RSS End=  138522624
RSS Delta=  0, Malloc Delta=  0, RSS End=  138539008
So I think I can conclude there is no unfreed memory in KSPSolve() allocated by 
PETSc.  Has MPICH allocated unfreed memory in KSPSolve? That is possible and I 
am trying to find a way like PetscMallocGetCurrentUsage() to measure that. 
Also, I think RSS delta is not a good way to measure memory allocation. It is 
dynamic and depends on state of the computer (swap, shared libraries loaded 
etc) when running the code. We should focus on malloc instead.  If there was a 
valgrind tool, like performance profiling tools,  that can let users measure 
memory allocated but not freed in a user specified code segment, that would be 
very helpful in this case. But I have not found one.


Junchao

Have you ever tried Massif? http://valgrind.org/docs/manual/ms-manual.html

No. I came across it but not familiar with it.  I did not find APIs to call to 
get current memory usage. Will look at it further. Thanks.


Sanjay, did you say currently you could run with OpenMPI without running out of memory, 
but with MPICH, you ran out of memory?  Is it feasible to share your code so 
that I can test with? Thanks.

--Junchao Zhang

On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee 
mailto:s...@berkeley.edu>> wrote:
Barry,

If you look at the graphs I generated (on my Mac),  you will see that
OpenMPI and MPICH have very different values (along with the fact that
MPICH does not seem to adhere
to the standard (for releasing MPI_ISend resources following and MPI_Wait).

-sanjay

PS: I agree with Barry's assessment; this is really not that acceptable.

On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>Junchao,
>
>   This is insane. Either the OpenMPI library or something in the OS 
> underneath related to sockets and interprocess communication is grabbing 
> additional space for each round of MPI communication!  Does MPICH have the 
> same values or different values than OpenMP? When you run on Linux do you get 
> the same values as Apple or different. --- Same values seem to indicate the 
> issue is inside OpenMPI/MPICH different values indicates problem is more 
> likely at the OS level. Does this happen only with the default VecScatter 
> that uses blocking MPI, what happens with PetscSF under Vec? Is it somehow 
> related to PETSc's use of nonblocking sends and receives? One could 
> presumably use valgrind to see exactly what lines in what code are causing 
> these increases. I don't think we can just shrug and say this is the way it 
> is, we need to track down and understand the cause (and if possible fix).
>
>Barry
>
>
>> On May 31, 2019, at 2:53 PM, Zhang, Junchao 
>> mailto:jczh...@mcs.anl.gov>> wrote:
>>
>> Sanjay,
>> I tried petsc with MPICH and OpenMPI on my Macbook. I inserted 
>> PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and 
>> end of KSPSolve and then computed the delta and summed over processes. Th

Re: [petsc-users] Memory growth issue

2019-06-03 Thread Zhang, Junchao via petsc-users
Sanjay & Barry,
  Sorry, I made a mistake when I said I could reproduce Sanjay's experiments. 
I found 1) to correctly use PetscMallocGetCurrentUsage() when petsc is 
configured without debugging, I have to add -malloc to run the program. 2) I 
have to instrument the code outside of KSPSolve(). In my case, it is in 
SNESSolve_NEWTONLS. In old experiments, I did it inside KSPSolve. Since 
KSPSolve can recursively call KSPSolve, the old results were misleading.
 With these fixes, I measured differences of RSS and Petsc malloc before/after 
KSPSolve. I did experiments on MacBook using 
src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands like 
mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler -ts_max_steps 500 
-malloc.
 I find if the grid size is small, I can see a non-zero RSS-delta randomly, 
either with one mpi rank or multiple ranks, with MPICH or OpenMPI. If I 
increase grid sizes, e.g., -da_grid_x 256 -da_grid_y 256, I only see non-zero 
RSS-delta randomly at the first few iterations (with MPICH or OpenMPI). When 
the computer workload is high by simultaneously running ex5-openmpi and 
ex5-mpich, the MPICH one pops up much more non-zero RSS-delta. But "Malloc 
Delta" behavior is stable across all runs. There is only one nonzero malloc 
delta value in the first KSPSolve call. All remaining are zero. Something like 
this:
mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler 
-ts_max_steps 500 -malloc
RSS Delta=   32489472, Malloc Delta=   26290304, RSS End=  136114176
RSS Delta=  32768, Malloc Delta=  0, RSS End=  138510336
RSS Delta=  0, Malloc Delta=  0, RSS End=  138522624
RSS Delta=  0, Malloc Delta=  0, RSS End=  138539008
So I think I can conclude there is no unfreed memory in KSPSolve() allocated by 
PETSc.  Has MPICH allocated unfreed memory in KSPSolve? That is possible and I 
am trying to find a way like PetscMallocGetCurrentUsage() to measure that. 
Also, I think RSS delta is not a good way to measure memory allocation. It is 
dynamic and depends on state of the computer (swap, shared libraries loaded 
etc) when running the code. We should focus on malloc instead.  If there was a 
valgrind tool, like performance profiling tools,  that can let users measure 
memory allocated but not freed in a user specified code segment, that would be 
very helpful in this case. But I have not found one.

Sanjay, did you say currently you could run with OpenMPI without running out of memory, 
but with MPICH, you ran out of memory?  Is it feasible to share your code so 
that I can test with? Thanks.

--Junchao Zhang

On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee 
mailto:s...@berkeley.edu>> wrote:
Barry,

If you look at the graphs I generated (on my Mac),  you will see that
OpenMPI and MPICH have very different values (along with the fact that
MPICH does not seem to adhere
to the standard (for releasing MPI_ISend resources following an MPI_Wait).

-sanjay

PS: I agree with Barry's assessment; this is really not that acceptable.

On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>Junchao,
>
>   This is insane. Either the OpenMPI library or something in the OS 
> underneath related to sockets and interprocess communication is grabbing 
> additional space for each round of MPI communication!  Does MPICH have the 
> same values or different values than OpenMPI? When you run on Linux do you get 
> the same values as Apple or different. --- Same values seem to indicate the 
> issue is inside OpenMPI/MPICH different values indicates problem is more 
> likely at the OS level. Does this happen only with the default VecScatter 
> that uses blocking MPI, what happens with PetscSF under Vec? Is it somehow 
> related to PETSc's use of nonblocking sends and receives? One could 
> presumably use valgrind to see exactly what lines in what code are causing 
> these increases. I don't think we can just shrug and say this is the way it 
> is, we need to track down and understand the cause (and if possible fix).
>
>Barry
>
>
>> On May 31, 2019, at 2:53 PM, Zhang, Junchao 
>> mailto:jczh...@mcs.anl.gov>> wrote:
>>
>> Sanjay,
>> I tried petsc with MPICH and OpenMPI on my Macbook. I inserted 
>> PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and 
>> end of KSPSolve and then computed the delta and summed over processes. Then 
>> I tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
>> With OpenMPI,
>> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler 
>> -ts_max_steps 500 > 128.log
>> grep -n -v "RSS Delta= 0, Malloc Delta= 0" 128.log
>> 1:RSS Delta= 69632, Malloc Delta= 0
>> 2:RSS Delta= 69632, Malloc Delta= 0
>> 3:

Re: [petsc-users] Memory growth issue

2019-06-01 Thread Zhang, Junchao via petsc-users


On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee 
mailto:s...@berkeley.edu>> wrote:
Barry,

If you look at the graphs I generated (on my Mac),  you will see that
OpenMPI and MPICH have very different values (along with the fact that
MPICH does not seem to adhere
to the standard (for releasing MPI_ISend resources following an MPI_Wait).

-sanjay
PS: I agree with Barry's assessment; this is really not that acceptable.

I also agree. I am doing various experiments to know why.

On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>Junchao,
>
>   This is insane. Either the OpenMPI library or something in the OS 
> underneath related to sockets and interprocess communication is grabbing 
> additional space for each round of MPI communication!  Does MPICH have the 
> same values or different values than OpenMPI? When you run on Linux do you get 
> the same values as Apple or different. --- Same values seem to indicate the 
> issue is inside OpenMPI/MPICH different values indicates problem is more 
> likely at the OS level. Does this happen only with the default VecScatter 
> that uses blocking MPI, what happens with PetscSF under Vec? Is it somehow 
> related to PETSc's use of nonblocking sends and receives? One could 
> presumably use valgrind to see exactly what lines in what code are causing 
> these increases. I don't think we can just shrug and say this is the way it 
> is, we need to track down and understand the cause (and if possible fix).
>
>Barry
>
>
>> On May 31, 2019, at 2:53 PM, Zhang, Junchao 
>> mailto:jczh...@mcs.anl.gov>> wrote:
>>
>> Sanjay,
>> I tried petsc with MPICH and OpenMPI on my Macbook. I inserted 
>> PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and 
>> end of KSPSolve and then computed the delta and summed over processes. Then 
>> I tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
>> With OpenMPI,
>> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler 
>> -ts_max_steps 500 > 128.log
>> grep -n -v "RSS Delta= 0, Malloc Delta= 0" 128.log
>> 1:RSS Delta= 69632, Malloc Delta= 0
>> 2:RSS Delta= 69632, Malloc Delta= 0
>> 3:RSS Delta= 69632, Malloc Delta= 0
>> 4:RSS Delta= 69632, Malloc Delta= 0
>> 9:RSS Delta=9.25286e+06, Malloc Delta= 0
>> 22:RSS Delta= 49152, Malloc Delta= 0
>> 44:RSS Delta= 20480, Malloc Delta= 0
>> 53:RSS Delta= 49152, Malloc Delta= 0
>> 66:RSS Delta=  4096, Malloc Delta= 0
>> 97:RSS Delta= 16384, Malloc Delta= 0
>> 119:RSS Delta= 20480, Malloc Delta= 0
>> 141:RSS Delta= 53248, Malloc Delta= 0
>> 176:RSS Delta= 16384, Malloc Delta= 0
>> 308:RSS Delta= 16384, Malloc Delta= 0
>> 352:RSS Delta= 16384, Malloc Delta= 0
>> 550:RSS Delta= 16384, Malloc Delta= 0
>> 572:RSS Delta= 16384, Malloc Delta= 0
>> 669:RSS Delta= 40960, Malloc Delta= 0
>> 924:RSS Delta= 32768, Malloc Delta= 0
>> 1694:RSS Delta= 20480, Malloc Delta= 0
>> 2099:RSS Delta= 16384, Malloc Delta= 0
>> 2244:RSS Delta= 20480, Malloc Delta= 0
>> 3001:RSS Delta= 16384, Malloc Delta= 0
>> 5883:RSS Delta= 16384, Malloc Delta= 0
>>
>> If I increased the grid
>> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler 
>> -ts_max_steps 500 -malloc_test >512.log
>> grep -n -v "RSS Delta= 0, Malloc Delta= 0" 512.log
>> 1:RSS Delta=1.05267e+06, Malloc Delta= 0
>> 2:RSS Delta=1.05267e+06, Malloc Delta= 0
>> 3:RSS Delta=1.05267e+06, Malloc Delta= 0
>> 4:RSS Delta=1.05267e+06, Malloc Delta= 0
>> 13:RSS Delta=1.24932e+08, Malloc Delta= 0
>>
>> So we did see RSS increase in 4k-page sizes after KSPSolve. As long as there 
>> are no memory leaks, why do you care about it? Is it because you run out of memory?
>>
>> On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. 
>> mailto:bsm...@mcs.anl.gov>> wrote:
>>
>> Thanks for the update. So the current conclusions are that using the 
>> Waitall in your code
>>
>> 1) solves the memory issue with OpenMPI in your code
>>
>> 2) does not solve the memory issue with PETSc KSPSolve
>>
>> 3) MPICH has memory issues both for your code and PETSc KSPSolve (despite 
>> the wait all fix)?
>>
>> If you literally just comment out the call to KSPSolve() with OpenMPI is 
>> there no growth in memory usage?

Re: [petsc-users] Memory growth issue

2019-05-31 Thread Zhang, Junchao via petsc-users


On Fri, May 31, 2019 at 3:48 PM Sanjay Govindjee via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Thanks Stefano.

Reading the manual pages a bit more carefully,
I think I can see what I should be doing.  Which should be roughly to

1. Set up target Seq vectors on PETSC_COMM_SELF
2. Use ISCreateGeneral to create ISs for the target Vecs  and the source Vec 
which will be MPI on PETSC_COMM_WORLD.
3. Create the scatter context with VecScatterCreate
4. Call VecScatterBegin/End on each process (instead of using my prior routine).

Lingering questions:

a. Is there any performance advantage/disadvantage to creating a single 
parallel target Vec instead
of multiple target Seq Vecs (in terms of the scatter operation)?
No performance difference. But pay attention, if you use seq vec, the indices 
in IS are locally numbered; if you use MPI vec, the indices are globally 
numbered.
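
A minimal sketch of steps 1-4 above, with a sequential target Vec on each rank (the index array idx_global stands in for the "complex but easily computable mapping" and is an assumption):

  #include <petsc.h>

  /* Gather globally numbered entries of the MPI Vec sol into a local Seq Vec on each rank. */
  static PetscErrorCode GatherEntries(Vec sol, PetscInt n, const PetscInt idx_global[], Vec *loc)
  {
    IS             from, to;
    VecScatter     sct;
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    ierr = VecCreateSeq(PETSC_COMM_SELF, n, loc);CHKERRQ(ierr);                             /* 1. target Seq Vec */
    ierr = ISCreateGeneral(PETSC_COMM_SELF, n, idx_global, PETSC_COPY_VALUES, &from);CHKERRQ(ierr); /* 2. source IS, global numbering */
    ierr = ISCreateStride(PETSC_COMM_SELF, n, 0, 1, &to);CHKERRQ(ierr);                     /* 2. target IS, local numbering */
    ierr = VecScatterCreate(sol, from, *loc, to, &sct);CHKERRQ(ierr);                       /* 3. scatter context */
    ierr = VecScatterBegin(sct, sol, *loc, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);   /* 4. move the data */
    ierr = VecScatterEnd(sct, sol, *loc, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
    ierr = VecScatterDestroy(&sct);CHKERRQ(ierr);
    ierr = ISDestroy(&from);CHKERRQ(ierr);
    ierr = ISDestroy(&to);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

If the target were a single parallel Vec instead, the target IS would use global indices, per the note above.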


b. The data that ends up in the target on each processor needs to be in an 
application
array.  Is there a clever way to 'move' the data from the scatter target to the 
array (short
of just running a loop over it and copying)?

See VecGetArray, VecGetArrayRead etc, which pull the data out of Vecs without 
memory copying.
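
For example, continuing the sketch above (the application array locarr is hypothetical):

  #include <petsc.h>

  /* Copy the gathered values from the Seq Vec into a plain application array. */
  static PetscErrorCode CopyToAppArray(Vec loc, PetscScalar locarr[])
  {
    const PetscScalar *a;
    PetscInt           n, i;
    PetscErrorCode     ierr;

    PetscFunctionBeginUser;
    ierr = VecGetLocalSize(loc, &n);CHKERRQ(ierr);
    ierr = VecGetArrayRead(loc, &a);CHKERRQ(ierr);   /* read-only pointer, no copy made */
    for (i = 0; i < n; ++i) locarr[i] = a[i];        /* or work on a[] in place and skip the copy */
    ierr = VecRestoreArrayRead(loc, &a);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }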
 -sanjay



On 5/31/19 12:02 PM, Stefano Zampini wrote:


On May 31, 2019, at 9:50 PM, Sanjay Govindjee via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:

Matt,
  Here is the process as it currently stands:

1) I have a PETSc Vec (sol), which come from a KSPSolve

2) Each processor grabs its section of sol via VecGetOwnershipRange and 
VecGetArrayReadF90
and inserts parts of its section of sol in a local array (locarr) using a 
complex but easily computable mapping.

3) The routine you are looking at then exchanges various parts of the locarr 
between the processors.


You need a VecScatter object 
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecScatterCreate.html#VecScatterCreate

4) Each processor then does computations using its updated locarr.

Typing it out this way, I guess the answer to your question is "yes."  I have a 
global Vec and I want its values
sent in a complex but computable way to local vectors on each process.

-sanjay
On 5/31/19 3:37 AM, Matthew Knepley wrote:
On Thu, May 30, 2019 at 11:55 PM Sanjay Govindjee via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hi Junchao,
Thanks for the hints below, they will take some time to absorb as the vectors 
that are being  moved around
are actually partly petsc vectors and partly local process vectors.

Is this code just doing a global-to-local map? Meaning, does it just map all 
the local unknowns to some global
unknown on some process? We have an even simpler interface for that, where we 
make the VecScatter
automatically,

  
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/IS/ISLocalToGlobalMappingCreate.html#ISLocalToGlobalMappingCreate

Then you can use it with Vecs, Mats, etc.
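
A minimal sketch of that interface (the local-to-global index array, rows, and values are all placeholders):

  #include <petsc.h>

  /* Attach a local-to-global map to v so entries can be set with local numbering;
     PETSc handles the off-process communication during VecAssemblyBegin/End. */
  static PetscErrorCode SetWithLocalIndices(Vec v, PetscInt nlocal, const PetscInt ltog_idx[],
                                            PetscInt nvals, const PetscInt rows[], const PetscScalar vals[])
  {
    ISLocalToGlobalMapping ltog;
    PetscErrorCode         ierr;

    PetscFunctionBeginUser;
    ierr = ISLocalToGlobalMappingCreate(PetscObjectComm((PetscObject)v), 1, nlocal, ltog_idx, PETSC_COPY_VALUES, &ltog);CHKERRQ(ierr);
    ierr = VecSetLocalToGlobalMapping(v, ltog);CHKERRQ(ierr);
    ierr = VecSetValuesLocal(v, nvals, rows, vals, INSERT_VALUES);CHKERRQ(ierr);
    ierr = VecAssemblyBegin(v);CHKERRQ(ierr);
    ierr = VecAssemblyEnd(v);CHKERRQ(ierr);
    ierr = ISLocalToGlobalMappingDestroy(&ltog);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }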

  Thanks,

 Matt

Attached is the modified routine that now works (no leaking memory) with 
openmpi.

-sanjay
On 5/30/19 8:41 PM, Zhang, Junchao wrote:

Hi, Sanjay,
  Could you send your modified data exchange code (psetb.F) with MPI_Waitall? 
See other inlined comments below. Thanks.

On Thu, May 30, 2019 at 1:49 PM Sanjay Govindjee via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Lawrence,
Thanks for taking a look!  This is what I had been wondering about -- my
knowledge of MPI is pretty minimal and
this origins of the routine were from a programmer we hired a decade+
back from NERSC.  I'll have to look into
VecScatter.  It will be great to dispense with our roll-your-own
routines (we even have our own reduceALL scattered around the code).
Petsc VecScatter has a very simple interface and you definitely should go with it. 
 With VecScatter, you can think in familiar vectors and indices instead of the 
low level MPI_Send/Recv. Besides that, PETSc has optimized VecScatter so that 
communication is efficient.

Interestingly, the MPI_WaitALL has solved the problem when using OpenMPI
but it still persists with MPICH.  Graphs attached.
I'm going to run with openmpi for now (but I guess I really still need
to figure out what is wrong with MPICH and WaitALL;
I'll try Barry's suggestion of
--download-mpich-configure-arguments="--enable-error-messages=all
--enable-g" later today and report back).

Regarding MPI_Barrier, it was put in due to a problem that some processes
were finishing up sending and receiving and exiting the subroutine
before the receiving processes had completed (which resulted in data
loss as the buffers are freed after the call to the routine).
MPI_Barrier was the solution proposed
to us.  I don't think I can dispense with it, but will think about some
more.
After MPI_Send(), or after MPI_Isend(..,req) and MPI_Wait(req), you can safely 
free the send buffer without worry that the receive has not completed. MPI 
guarantees the receiver can get the data, for examp

Re: [petsc-users] Memory growth issue

2019-05-31 Thread Zhang, Junchao via petsc-users
Sanjay,
I tried petsc with MPICH and OpenMPI on my Macbook. I inserted 
PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and end 
of KSPSolve and then computed the delta and summed over processes. Then I 
tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
With OpenMPI,
mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 
500 > 128.log
grep -n -v "RSS Delta= 0, Malloc Delta= 0" 128.log
1:RSS Delta= 69632, Malloc Delta= 0
2:RSS Delta= 69632, Malloc Delta= 0
3:RSS Delta= 69632, Malloc Delta= 0
4:RSS Delta= 69632, Malloc Delta= 0
9:RSS Delta=9.25286e+06, Malloc Delta= 0
22:RSS Delta= 49152, Malloc Delta= 0
44:RSS Delta= 20480, Malloc Delta= 0
53:RSS Delta= 49152, Malloc Delta= 0
66:RSS Delta=  4096, Malloc Delta= 0
97:RSS Delta= 16384, Malloc Delta= 0
119:RSS Delta= 20480, Malloc Delta= 0
141:RSS Delta= 53248, Malloc Delta= 0
176:RSS Delta= 16384, Malloc Delta= 0
308:RSS Delta= 16384, Malloc Delta= 0
352:RSS Delta= 16384, Malloc Delta= 0
550:RSS Delta= 16384, Malloc Delta= 0
572:RSS Delta= 16384, Malloc Delta= 0
669:RSS Delta= 40960, Malloc Delta= 0
924:RSS Delta= 32768, Malloc Delta= 0
1694:RSS Delta= 20480, Malloc Delta= 0
2099:RSS Delta= 16384, Malloc Delta= 0
2244:RSS Delta= 20480, Malloc Delta= 0
3001:RSS Delta= 16384, Malloc Delta= 0
5883:RSS Delta= 16384, Malloc Delta= 0

If I increased the grid
mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 
500 -malloc_test >512.log
grep -n -v "RSS Delta= 0, Malloc Delta= 0" 512.log
1:RSS Delta=1.05267e+06, Malloc Delta= 0
2:RSS Delta=1.05267e+06, Malloc Delta= 0
3:RSS Delta=1.05267e+06, Malloc Delta= 0
4:RSS Delta=1.05267e+06, Malloc Delta= 0
13:RSS Delta=1.24932e+08, Malloc Delta= 0

So we did see RSS increase in 4k-page sizes after KSPSolve. As long as there are 
no memory leaks, why do you care about it? Is it because you run out of memory?

On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. 
mailto:bsm...@mcs.anl.gov>> wrote:

   Thanks for the update. So the current conclusions are that using the Waitall 
in your code

1) solves the memory issue with OpenMPI in your code

2) does not solve the memory issue with PETSc KSPSolve

3) MPICH has memory issues both for your code and PETSc KSPSolve (despite the 
wait all fix)?

If you literally just comment out the call to KSPSolve() with OpenMPI is there 
no growth in memory usage?


Both 2 and 3 are concerning; they indicate possible memory leak bugs in MPICH and 
not freeing all MPI resources in KSPSolve()

Junchao, can you please investigate 2 and 3 with, for example, a TS example 
that uses the linear solver (like with -ts_type beuler)? Thanks


  Barry



> On May 30, 2019, at 1:47 PM, Sanjay Govindjee 
> mailto:s...@berkeley.edu>> wrote:
>
> Lawrence,
> Thanks for taking a look!  This is what I had been wondering about -- my 
> knowledge of MPI is pretty minimal and
> this origins of the routine were from a programmer we hired a decade+ back 
> from NERSC.  I'll have to look into
> VecScatter.  It will be great to dispense with our roll-your-own routines (we 
> even have our own reduceALL scattered around the code).
>
> Interestingly, the MPI_WaitALL has solved the problem when using OpenMPI but 
> it still persists with MPICH.  Graphs attached.
> I'm going to run with openmpi for now (but I guess I really still need to 
> figure out what is wrong with MPICH and WaitALL;
> I'll try Barry's suggestion of 
> --download-mpich-configure-arguments="--enable-error-messages=all --enable-g" 
> later today and report back).
>
> Regarding MPI_Barrier, it was put in due to a problem that some processes were 
> finishing up sending and receiving and exiting the subroutine
> before the receiving processes had completed (which resulted in data loss as 
> the buffers are freed after the call to the routine). MPI_Barrier was the 
> solution proposed
> to us.  I don't think I can dispense with it, but will think about some more.
>
> I'm not so sure about using MPI_IRecv as it will require a bit of rewriting 
> since right now I process the received
> data sequentially after each blocking MPI_Recv -- clearly slower but easier 
> to code.
>
> Thanks again for the help.
>
> -sanjay
>
> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>> Hi Sanjay,
>>
>>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users 
>>> mailto:petsc-users@mcs.anl.gov>> wrote:
>>>
>>> The problem seems to persist but with a different signature.  Graphs 
>>> attached as before.
>>>
>>> Totals with MPICH (NB: single run)
>>>
>>> For the CG/Jacobi  data_exchange_total = 41,385,984; kspsolve_total 
>>> = 38,289,408
>>> For the GMRES/BJACOBI  data_exchange_total = 41,324,544; kspsolve_total = 41,324,544

Re: [petsc-users] Memory growth issue

2019-05-30 Thread Zhang, Junchao via petsc-users

Hi, Sanjay,
  Could you send your modified data exchange code (psetb.F) with MPI_Waitall? 
See other inlined comments below. Thanks.

On Thu, May 30, 2019 at 1:49 PM Sanjay Govindjee via petsc-users wrote:
Lawrence,
Thanks for taking a look!  This is what I had been wondering about -- my
knowledge of MPI is pretty minimal and
the origins of the routine trace to a programmer we hired a decade+
back from NERSC.  I'll have to look into
VecScatter.  It will be great to dispense with our roll-your-own
routines (we even have our own reduceALL scattered around the code).
PETSc's VecScatter has a very simple interface and you should definitely go with it. 
With VecScatter you can think in terms of familiar vectors and indices instead of the 
low-level MPI_Send/Recv. Besides that, PETSc has optimized VecScatter so that 
communication is efficient.
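
For reference, a minimal sketch of such a VecScatter-based exchange (illustrative only, not the poster's routine; the caller supplies the number of needed remote entries and their global indices):

#include <petscvec.h>

/* Sketch: gather the entries of a parallel vector xglobal listed in ghosts[]
 * (nghost global indices on this rank) into a new sequential vector xlocal. */
PetscErrorCode GatherGhosts(Vec xglobal, PetscInt nghost, const PetscInt ghosts[], Vec *xlocal)
{
  IS             is_from, is_to;
  VecScatter     scat;
  PetscErrorCode ierr;

  ierr = VecCreateSeq(PETSC_COMM_SELF, nghost, xlocal);CHKERRQ(ierr);
  ierr = ISCreateGeneral(PETSC_COMM_SELF, nghost, ghosts, PETSC_COPY_VALUES, &is_from);CHKERRQ(ierr); /* global indices into xglobal */
  ierr = ISCreateStride(PETSC_COMM_SELF, nghost, 0, 1, &is_to);CHKERRQ(ierr);                         /* local indices into xlocal  */
  ierr = VecScatterCreate(xglobal, is_from, *xlocal, is_to, &scat);CHKERRQ(ierr);
  ierr = VecScatterBegin(scat, xglobal, *xlocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(scat, xglobal, *xlocal, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterDestroy(&scat);CHKERRQ(ierr);
  ierr = ISDestroy(&is_from);CHKERRQ(ierr);
  ierr = ISDestroy(&is_to);CHKERRQ(ierr);
  return 0;
}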

Interestingly, the MPI_WaitALL has solved the problem when using OpenMPI
but it still persists with MPICH.  Graphs attached.
I'm going to run with openmpi for now (but I guess I really still need
to figure out what is wrong with MPICH and WaitALL;
I'll try Barry's suggestion of
--download-mpich-configure-arguments="--enable-error-messages=all
--enable-g" later today and report back).

Regarding MPI_Barrier, it was put in due to a problem where some processes
were finishing up sending and receiving and exiting the subroutine
before the receiving processes had completed (which resulted in data
loss as the buffers are freed after the call to the routine).
MPI_Barrier was the solution proposed
to us.  I don't think I can dispense with it, but will think about some
more.
After MPI_Send(), or after MPI_Isend(.., req) followed by MPI_Wait(req), you can safely 
free the send buffer without worrying about whether the receive has completed. MPI 
guarantees the receiver will still get the data, for example through internal 
buffering.
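
To make the nonblocking pattern concrete, here is a hedged sketch in plain C (not the poster's Fortran psetb.F; the buffer and count arrays are illustrative). All sends and receives are posted first, then a single MPI_Waitall completes them; after that the send buffers may be freed and no barrier is needed.

#include <mpi.h>
#include <stdlib.h>

void exchange(double **sendbuf, const int *sendlen, const int *sendto, int nsend,
              double **recvbuf, const int *recvlen, const int *recvfrom, int nrecv,
              MPI_Comm comm)
{
  MPI_Request *req = malloc((size_t)(nsend + nrecv) * sizeof(*req));
  int i;

  for (i = 0; i < nsend; i++)      /* post all nonblocking sends */
    MPI_Isend(sendbuf[i], sendlen[i], MPI_DOUBLE, sendto[i], 0, comm, &req[i]);
  for (i = 0; i < nrecv; i++)      /* post all nonblocking receives */
    MPI_Irecv(recvbuf[i], recvlen[i], MPI_DOUBLE, recvfrom[i], 0, comm, &req[nsend + i]);

  MPI_Waitall(nsend + nrecv, req, MPI_STATUSES_IGNORE);  /* completes every send and receive */
  free(req);
  /* send buffers can now be reused or freed; no MPI_Barrier is required */
}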

I'm not so sure about using MPI_IRecv as it will require a bit of
rewriting since right now I process the received
data sequentially after each blocking MPI_Recv -- clearly slower but
easier to code.

Thanks again for the help.

-sanjay

On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
> Hi Sanjay,
>
>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users wrote:
>>
>> The problem seems to persist but with a different signature.  Graphs 
>> attached as before.
>>
>> Totals with MPICH (NB: single run)
>>
>> For the CG/Jacobi  data_exchange_total = 41,385,984; kspsolve_total 
>> = 38,289,408
>> For the GMRES/BJACOBI  data_exchange_total = 41,324,544; kspsolve_total 
>> = 41,324,544
>>
>> Just reading the MPI docs I am wondering if I need some sort of 
>> MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine?
>> I would have thought that with the blocking receives and the MPI_Barrier 
>> that everything will have fully completed and cleaned up before
>> all processes exited the routine, but perhaps I am wrong on that.
>
> Skimming the fortran code you sent you do:
>
> for i in ...:
> call MPI_Isend(..., req, ierr)
>
> for i in ...:
> call MPI_Recv(..., ierr)
>
> But you never call MPI_Wait on the request you got back from the Isend. So 
> the MPI library will never free the data structures it created.
>
> The usual pattern for these non-blocking communications is to allocate an 
> array for the requests of length nsend+nrecv and then do:
>
> for i in nsend:
> call MPI_Isend(..., req[i], ierr)
> for j in nrecv:
> call MPI_Irecv(..., req[nsend+j], ierr)
>
> call MPI_Waitall(req, ..., ierr)
>
> I note also there's no need for the Barrier at the end of the routine, this 
> kind of communication does neighbourwise synchronisation, no need to add 
> (unnecessary) global synchronisation too.
>
> As an aside, is there a reason you don't use PETSc's VecScatter to manage 
> this global to local exchange?
>
> Cheers,
>
> Lawrence



Re: [petsc-users] Nonzero I-j locations

2019-05-29 Thread Zhang, Junchao via petsc-users
Yes, see MatGetRow 
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatGetRow.html
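
A minimal sketch of using it on the locally owned rows of a parallel AIJ matrix (illustrative, not from the thread):

#include <petscmat.h>

/* Sketch: print the nonzero (i, j) locations of the locally owned rows. */
PetscErrorCode PrintLocalNonzeros(Mat A)
{
  MPI_Comm        comm;
  PetscInt        rstart, rend, row, ncols, j;
  const PetscInt *cols;
  PetscErrorCode  ierr;

  ierr = PetscObjectGetComm((PetscObject)A, &comm);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  for (row = rstart; row < rend; row++) {
    ierr = MatGetRow(A, row, &ncols, &cols, NULL);CHKERRQ(ierr);   /* column indices only, no values */
    for (j = 0; j < ncols; j++) {
      ierr = PetscSynchronizedPrintf(comm, "(%D, %D)\n", row, cols[j]);CHKERRQ(ierr);
    }
    ierr = MatRestoreRow(A, row, &ncols, &cols, NULL);CHKERRQ(ierr);
  }
  ierr = PetscSynchronizedFlush(comm, PETSC_STDOUT);CHKERRQ(ierr);
  return 0;
}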
--Junchao Zhang


On Wed, May 29, 2019 at 2:28 PM Manav Bhatia via petsc-users wrote:
Hi,

   Once an MPIAIJ matrix has been assembled, is there a method to get the 
nonzero I-J locations? I see one for sequential matrices here: 
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/MatGetRowIJ.html
 , but not for parallel matrices.

Regards,
Manav




Re: [petsc-users] How do I supply the compiler PIC flag via CFLAGS, CXXXFLAGS, and FCFLAGS

2019-05-28 Thread Zhang, Junchao via petsc-users
Also works with PathScale EKOPath Compiler Suite installed on MCS machines.

$ pathcc -c check-pic.c -fPIC
$ pathcc -c check-pic.c
check-pic.c:2:2: error: "no-PIC"
#error "no-PIC"
 ^
1 error generated.

--Junchao Zhang


On Tue, May 28, 2019 at 1:54 PM Smith, Barry F. via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:

  Works for Intel and PGI compiles (the version I checked)

bsmith@es:~$ pgcc check-pic.c  -PIC
pgcc-Error-Unknown switch: -PIC
bsmith@es:~$ pgcc check-pic.c  -fPIC
bsmith@es:~$ pgcc check-pic.c
PGC-F-0249-#error --  "no-PIC" (check-pic.c: 2)
PGC/x86-64 Linux 19.3-0: compilation aborted
bsmith@es:~$ icc check-pic.c
check-pic.c(2): error: #error directive: "no-PIC"
  #error "no-PIC"
   ^

compilation aborted for check-pic.c (code 2)
bsmith@es:~$ icc check-pic.c -PIC
icc: command line warning #10006: ignoring unknown option '-PIC'
check-pic.c(2): error: #error directive: "no-PIC"
  #error "no-PIC"
   ^

compilation aborted for check-pic.c (code 2)
bsmith@es:~$ icc check-pic.c -fPIC
bsmith@es:~$


You are the man!


> On May 28, 2019, at 12:29 PM, Lisandro Dalcin via petsc-users 
> mailto:petsc-users@mcs.anl.gov>> wrote:
>
>
>
> On Tue, 28 May 2019 at 18:19, Jed Brown 
> mailto:j...@jedbrown.org>> wrote:
> Lisandro Dalcin via petsc-users 
> mailto:petsc-users@mcs.anl.gov>> writes:
>
> > On Tue, 28 May 2019 at 17:31, Balay, Satish via petsc-users <
> > petsc-users@mcs.anl.gov> wrote:
> >
> >> Configure.log shows '--with-pic=1' - hence this error.
> >>
> >> Remove '--with-pic=1' and retry.
> >>
> >>
> > Nonsense. Why this behavior? Building a static library with PIC code is a
> > perfectly valid use case.
>
> And that's what will happen because Inge passed -fPIC in CFLAGS et al.
>
> Do you know how we could confirm that PIC code is generated without
> attempting to use shared libraries?
>
>
> I know how to do it with the `readelf` command for ELF objects. I even know 
> how to do it compile-time for GCC and clang. Maybe Intel also works this way. 
> I do not know about a general solution, though.
>
> $ cat check-pic.c
> #ifndef __PIC__
> #error "no-PIC"
> #endif
>
> $ gcc -c check-pic.c -fPIC
>
> $ clang -c check-pic.c -fPIC
>
> $ gcc -c check-pic.c
> check-pic.c:2:2: error: #error "no-PIC"
> 2 | #error "no-PIC"
>   |  ^
>
> $ clang -c check-pic.c
> check-pic.c:2:2: error: "no-PIC"
> #error "no-PIC"
>  ^
> 1 error generated.
>
> --
> Lisandro Dalcin
> 
> Research Scientist
> Extreme Computing Research Center (ECRC)
> King Abdullah University of Science and Technology (KAUST)
> http://ecrc.kaust.edu.sa/



Re: [petsc-users] Question about parallel Vectors and communicators

2019-05-13 Thread Zhang, Junchao via petsc-users
 The index sets provide possible i, j in scatter "y[j] = x[i]". Each process 
provides a portion of the i and j of the whole scatter. The only requirement of 
VecScatterCreate is that on each process, local sizes of ix and iy must be 
equal (a process can provide empty ix and iy).  A process's i and j can point 
to anywhere in its vector (it is not constrained to the vector's local part).
 The interpretation of ix and iy does not depend on their communicator; it depends 
on their associated vector. Let P and S stand for parallel and sequential vectors 
respectively; there are four combinations of vecscatters: PtoP, PtoS, StoP and StoS. 
The convention is: if x is parallel, then ix contains global indices of x; if x is 
sequential, ix contains local indices of x. Similarly for y and iy.
 So, index sets created with PETSC_COMM_SELF can perfectly include global 
indices. That is why I always use PETSC_COMM_SELF to create index sets for 
VecScatter. It makes things easier to understand.
 The quote you gave is also confusing to me. If you use PETSC_COMM_SELF, it 
means only the current process uses the IS. That sounds ok since other 
processes can not get a reference to this IS.
 Maybe, other petsc developers can explain when parallel communicators are 
useful for index sets.  My feeling is that they are useless at least for 
VecScatter.

--Junchao Zhang


On Mon, May 13, 2019 at 9:07 AM GIRET Jean-Christophe wrote:
Hello,

Thank you all for your answers and examples, it's now very clear: the trick is 
to alias a Vec on a subcomm with a Vec on the parent comm, and to do the communication 
through a Scatter on the parent comm. I have also been able to implement it with 
petsc4py.
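
For readers of the archive, a hedged sketch of that aliasing trick (illustrative only; vsub is the sub-communicator vector and is NULL on ranks outside the subcomm, and vdest is assumed to be a parent-comm vector with the same global size):

#include <petscvec.h>

PetscErrorCode SubCommToParentComm(MPI_Comm parent, Vec vsub, Vec vdest)
{
  Vec                valias;
  VecScatter         scat;
  const PetscScalar *array  = NULL;
  PetscInt           nlocal = 0;
  PetscErrorCode     ierr;

  if (vsub) {                       /* only ranks that belong to the subcomm contribute entries */
    ierr = VecGetLocalSize(vsub, &nlocal);CHKERRQ(ierr);
    ierr = VecGetArrayRead(vsub, &array);CHKERRQ(ierr);
  }
  /* wrap the subcomm vector's local array in a vector that lives on the parent communicator */
  ierr = VecCreateMPIWithArray(parent, 1, nlocal, PETSC_DECIDE, array, &valias);CHKERRQ(ierr);
  ierr = VecScatterCreate(valias, NULL, vdest, NULL, &scat);CHKERRQ(ierr);  /* NULL IS = all entries, in order */
  ierr = VecScatterBegin(scat, valias, vdest, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(scat, valias, vdest, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterDestroy(&scat);CHKERRQ(ierr);
  ierr = VecDestroy(&valias);CHKERRQ(ierr);
  if (vsub) { ierr = VecRestoreArrayRead(vsub, &array);CHKERRQ(ierr); }
  return 0;
}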

Junchao, thank you for your example. It is indeed very clear. Although I 
understand how the exchanges are made through the Vecs defined on the parent 
comms, I am wondering why ISCreateStride is defined on the communicator 
PETSC_COMM_SELF and not on the parent communicator spanning the Vecs used for 
the Scatter operations.

When I read the documentation, I see: “The communicator, comm, should consist 
of all processes that will be using the IS.” I would say in that case that it 
is the same communicator used for the ‘exchange’ vectors.

I am surely misunderstanding something here, but I didn’t find any answer while 
googling. Any hint on that?

Again, thank you all for your great support,
Best,
JC



From: Zhang, Junchao [mailto:jczh...@mcs.anl.gov]
Sent: Friday, May 10, 2019 22:01
To: GIRET Jean-Christophe
Cc: Mark Adams; petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] Question about parallel Vectors and communicators

Jean-Christophe,
  I added a petsc example at 
https://bitbucket.org/petsc/petsc/pull-requests/1652/add-an-example-to-show-transfer-vectors/diff#chg-src/vec/vscat/examples/ex9.c
  It shows how to transfer vectors from a parent communicator to vectors on a 
child communicator. It also shows how to transfer vectors from a subcomm to 
vectors on another subcomm. The two subcomms are not required to cover all 
processes in PETSC_COMM_WORLD.
  Hope it helps you better understand Vec and VecScatter.
--Junchao Zhang


On Thu, May 9, 2019 at 11:34 AM GIRET Jean-Christophe via petsc-users wrote:
Hello,

Thanks Mark and Jed for your quick answers.

So the idea is to define all the Vecs on the world communicator, and perform 
the communications using traditional scatter objects? The data would still be 
accessible on the two sub-communicators as they are both subsets of the 
MPI_COMM_WORLD communicator, but they would be used while creating the Vecs or 
the IS for the scatter. Is that right?

I’m currently trying, without success, to perform a Scatter from a MPI Vec 
defined on a subcomm to another Vec defined on the world comm, and vice-versa. 
But I don’t know if it’s possible.

I can imagine that trying to do that seems a bit strange. However, I'm dealing 
with code coupling (and linear algebra for the main part of the code), and my 
idea was trying to use the Vec data structures to perform data exchange between 
some parts of the software which would have their own communicator. It would 
eliminate the need to re-implement an ad-hoc solution.

One option would be to stick with the world communicator for all the PETSc part, 
but I could face situations where my Vecs are small while I would have to run the 
whole simulation on a large number of cores for the coupled part. I imagine that 
may not really serve the linear-system-solving part in terms of performance. Another 
option would be to perform all the PETSc operations on a sub-communicator and use 
"raw" MPI communication between the communicators to perform the data exchange for 
the coupling part.

Thanks again for your support,
Best regards,
Jean-Christophe

From: Mark Adams [

Re: [petsc-users] Question about parallel Vectors and communicators

2019-05-10 Thread Zhang, Junchao via petsc-users
Jean-Christophe,
  I added a petsc example at 
https://bitbucket.org/petsc/petsc/pull-requests/1652/add-an-example-to-show-transfer-vectors/diff#chg-src/vec/vscat/examples/ex9.c
  It shows how to transfer vectors from a parent communicator to vectors on a 
child communicator. It also shows how to transfer vectors from a subcomm to 
vectors on another subcomm. The two subcomms are not required to cover all 
processes in PETSC_COMM_WORLD.
  Hope it helps you better understand Vec and VecScatter.
--Junchao Zhang


On Thu, May 9, 2019 at 11:34 AM GIRET Jean-Christophe via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello,

Thanks Mark and Jed for your quick answers.

So the idea is to define all the Vecs on the world communicator, and perform 
the communications using traditional scatter objects? The data would still be 
accessible on the two sub-communicators as they are both subsets of the 
MPI_COMM_WORLD communicator, but they would be used while creating the Vecs or 
the IS for the scatter. Is that right?

I’m currently trying, without success, to perform a Scatter from a MPI Vec 
defined on a subcomm to another Vec defined on the world comm, and vice-versa. 
But I don’t know if it’s possible.

I can imagine that trying to do that seems a bit strange. However, I'm dealing 
with code coupling (and linear algebra for the main part of the code), and my 
idea was trying to use the Vec data structures to perform data exchange between 
some parts of the software which would have their own communicator. It would 
eliminate the need to re-implement an ad-hoc solution.

One option would be to stick with the world communicator for all the PETSc part, 
but I could face situations where my Vecs are small while I would have to run the 
whole simulation on a large number of cores for the coupled part. I imagine that 
may not really serve the linear-system-solving part in terms of performance. Another 
option would be to perform all the PETSc operations on a sub-communicator and use 
"raw" MPI communication between the communicators to perform the data exchange for 
the coupling part.

Thanks again for your support,
Best regards,
Jean-Christophe

From: Mark Adams [mailto:mfad...@lbl.gov]
Sent: Tuesday, May 7, 2019 21:39
To: GIRET Jean-Christophe
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] Question about parallel Vectors and communicators



On Tue, May 7, 2019 at 11:38 AM GIRET Jean-Christophe via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Dear PETSc users,

I would like to use Petsc4Py for a project extension, which consists mainly of:

-  Storing data and matrices on several rank/nodes which could not fit 
on a single node.

-  Performing some linear algebra in a parallel fashion (solving sparse 
linear system for instance)

-  Exchanging those data structures (parallel vectors) between 
non-overlapping MPI communicators, created for instance by splitting 
MPI_COMM_WORLD.

While the two first items seems to be well addressed by PETSc, I am wondering 
about the last one.

Is it possible to access the data of a vector, defined on a communicator from 
another, non-overlapping communicator? From what I have seen from the 
documentation and the several threads on the user mailing-list, I would say no. 
But maybe I am missing something? If not, is it possible to transfer a vector 
defined on a given communicator on a communicator which is a subset of the 
previous one?

If you are sending to a subset of processes then VecGetSubVec + Jed's tricks 
might work.

https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecGetSubVector.html


Best regards,
Jean-Christophe




Re: [petsc-users] Command line option -memory_info

2019-05-07 Thread Zhang, Junchao via petsc-users
https://www.mcs.anl.gov/petsc/documentation/changes/37.html has
  PetscMemoryShowUsage() and -memory_info changed to PetscMemoryView() and 
-memory_view
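
A minimal sketch of the programmatic counterpart (illustrative):

#include <petscsys.h>

/* Print a memory-usage summary wherever it is needed, mirroring -memory_view. */
PetscErrorCode ReportMemory(void)
{
  PetscErrorCode ierr;
  ierr = PetscMemoryView(PETSC_VIEWER_STDOUT_WORLD, "Current memory usage:\n");CHKERRQ(ierr);
  return 0;
}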

--Junchao Zhang


On Tue, May 7, 2019 at 6:56 PM Sanjay Govindjee via petsc-users wrote:
I was trying to clean up some old scripts we have for running our codes
which include the command line option -memory_info.
I went digging in the manuals to try and figure out what this used to do
and what has replaced its functionality but I wasn't able
to figure it out.  Does anyone recall the earlier functionality for this
option? and/or know its "replacement"?
-sanjay



Re: [petsc-users] Quick question about ISCreateGeneral

2019-04-30 Thread Zhang, Junchao via petsc-users


On Tue, Apr 30, 2019 at 11:42 AM Sajid Ali via petsc-users wrote:
Hi PETSc Developers,

I see that in the examples for ISCreateGeneral, the index sets are created by 
copying values from int arrays (which were created by PetscMalloc1 which is not 
collective).

If ISCreateGeneral is called with PETSC_COMM_WORLD and the int arrays on 
each rank are independently created, does the index set created concatenate all 
the int-arrays into one ? If not, what needs to be done to get such an index 
set ?
From my understanding, they are independently created and not concatenated.  I 
like index sets created with PETSC_COMM_SELF. They are easy to understand.

PS: For context, I want to write a fftshift convenience function (like numpy, 
MATLAB) but for large distributed vectors. I thought that I could do this with 
VecScatter and two index sets, one shifted and one un-shifted.
To achieve this, index sets created with PETSC_COMM_SELF are enough. They just 
need to contain global indices to describe the MPI-vector-to-MPI-vector scatter. 
You can think of it as each process providing one piece of the scatter.
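
A hedged sketch of such an fftshift (illustrative; it assumes an even global length N and a pre-created destination vector y with the same size and layout as x):

#include <petscvec.h>

/* Sketch: entry j of y receives entry (j + N/2) mod N of x.  The index sets live
 * on PETSC_COMM_SELF but hold global indices, as described above. */
PetscErrorCode VecFFTShift(Vec x, Vec y)
{
  IS             isfrom, isto;
  VecScatter     scat;
  PetscInt       rstart, rend, N, n, i, *from;
  PetscErrorCode ierr;

  ierr = VecGetSize(x, &N);CHKERRQ(ierr);
  ierr = VecGetOwnershipRange(y, &rstart, &rend);CHKERRQ(ierr);
  n    = rend - rstart;
  ierr = PetscMalloc1(n, &from);CHKERRQ(ierr);
  for (i = 0; i < n; i++) from[i] = (rstart + i + N/2) % N;      /* shifted source index */
  ierr = ISCreateGeneral(PETSC_COMM_SELF, n, from, PETSC_OWN_POINTER, &isfrom);CHKERRQ(ierr);
  ierr = ISCreateStride(PETSC_COMM_SELF, n, rstart, 1, &isto);CHKERRQ(ierr);
  ierr = VecScatterCreate(x, isfrom, y, isto, &scat);CHKERRQ(ierr);
  ierr = VecScatterBegin(scat, x, y, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(scat, x, y, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterDestroy(&scat);CHKERRQ(ierr);
  ierr = ISDestroy(&isfrom);CHKERRQ(ierr);
  ierr = ISDestroy(&isto);CHKERRQ(ierr);
  return 0;
}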

Thank You,
Sajid Ali
Applied Physics
Northwestern University


Re: [petsc-users] questions regarding simple petsc matrix vector operation

2019-04-24 Thread Zhang, Junchao via petsc-users
How many MPI ranks do you use? The following line is suspicious.  I guess you 
do not want a vector of global length 1.
66   VecSetSizes(b,PETSC_DECIDE,1);

--Junchao Zhang


On Wed, Apr 24, 2019 at 4:14 PM Karl Lin via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hi, there

I have been trying to get a simple program run with the following code:

 12 int main(int argc,char **args)
 13 {
 14   PetscErrorCode ierr;
 15   MatA;
 16   Mat   AT;
 17   MatN;
 18   char   name[1024];
 19   char   vname[1024];
 20   char   pass[1024];
 21   PetscBool  flg;
 22   Vecb,x,u,Ab,Au;
 23   PetscViewerviewer;/* viewer */
 24   PetscMPIIntrank,size;
 25
 26   KSPQRsolver;
 27   PC pc;
 28   PetscInt   its;
 29   PetscReal  norm;
 30
 31   PetscInt   n1, n2, n3, np1, np2, np3, p, jj;
 32
 33   PetscInt   *cols, *dnz, *onz;
 34   PetscScalar*vals;
 35
 36   ierr = PetscInitialize(,,0,help);if (ierr) return ierr;
 37
 38   ierr = MPI_Comm_size(PETSC_COMM_WORLD,);CHKERRQ(ierr);
 39   ierr = MPI_Comm_rank(PETSC_COMM_WORLD,);CHKERRQ(ierr);
 40
 41   PetscMalloc1(1, );
 42   PetscMalloc1(1, );
 43
 44   dnz[0]=2;
 45   onz[0]=1;
 46
 47   MatCreateMPIAIJMKL(PETSC_COMM_WORLD, 1, 2, 1, 2, 2, dnz, 2, onz, ); 
CHKERRQ(ierr);
 48
 49   PetscMalloc1(2, );
 50   PetscMalloc1(2, );
 51
 52   jj = rank;
 53   cols[0]=0; cols[1]=1;
 54   vals[0]=1.0;vals[1]=1.0;
 55
 56   MatSetValues(A, 1, , 2, cols, vals, INSERT_VALUES);
 57
 58   MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
 59   MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
 60
 61   VecCreate(PETSC_COMM_WORLD,);
 62   VecSetSizes(x,PETSC_DECIDE,2);
 63   VecSetFromOptions(x);
 64
 65   VecCreate(PETSC_COMM_WORLD,);
 66   VecSetSizes(b,PETSC_DECIDE,1);
 67   VecSetFromOptions(b);
 68
 69   VecCreate(PETSC_COMM_WORLD,);
 70   VecSetSizes(u,PETSC_DECIDE,1);
 71   VecSetFromOptions(u);
 72
 73   VecSet(b, 2.0);
 74   VecSet(u, 0.0);
 75   VecSet(x, 0.0);
 76
 77   MatMult(A, x, u);
 78
 79   VecView(x, PETSC_VIEWER_STDOUT_WORLD);
 80   VecView(b, PETSC_VIEWER_STDOUT_WORLD);
 81   VecView(u, PETSC_VIEWER_STDOUT_WORLD);
 82
 83   VecAXPY(u,-1.0,b);

However, it always crashes at line 83 even with single process saying:
[0]PETSC ERROR: 

[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see 
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to 
find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.10.4, Feb, 26, 2019

I can't figure out why this would happen. The printout from VecView shows every 
vec value is correct. I will greatly appreciate any tips.

Regards,
Karl



Re: [petsc-users] Preallocation of sequential matrix

2019-04-23 Thread Zhang, Junchao via petsc-users
The error message has
[0]PETSC ERROR: New nonzero at (61,124) caused a malloc
[0]PETSC ERROR: New nonzero at (124,186) caused a malloc
You can check your code to see if you allocated spots for these nonzeros.
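
As a generic illustration of exact preallocation (a 5-point-stencil sketch, not the poster's 9-point code): count the entries of each row first, pass exactly those counts as nnz, and later insert only entries that were counted.

#include <petscmat.h>

PetscErrorCode CreateLaplacian5pt(PetscInt m, PetscInt n, Mat *A)
{
  PetscInt      *nnz, i, j, k;
  PetscErrorCode ierr;

  ierr = PetscMalloc1(m*n, &nnz);CHKERRQ(ierr);
  for (j = 0; j < n; j++) {
    for (i = 0; i < m; i++) {
      k      = j*m + i;
      nnz[k] = 1;                    /* diagonal entry  */
      if (i > 0)   nnz[k]++;         /* west neighbor   */
      if (i < m-1) nnz[k]++;         /* east neighbor   */
      if (j > 0)   nnz[k]++;         /* south neighbor  */
      if (j < n-1) nnz[k]++;         /* north neighbor  */
    }
  }
  ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, m*n, m*n, 0, nnz, A);CHKERRQ(ierr);  /* nz is ignored when nnz is given */
  ierr = PetscFree(nnz);CHKERRQ(ierr);
  /* ... then MatSetValues with exactly the counted (row, col) pattern and assemble ... */
  return 0;
}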

--Junchao Zhang


On Tue, Apr 23, 2019 at 8:57 PM Maahi Talukder via petsc-users wrote:
Dear All,


I am trying to preallocate the number of nonzeros in my matrix using the parameter 
'nnz'. Here 'row' is the array passed as 'nnz'. The part of the code that does that 
is the following-

..
Do j = 2,xmax-1

Do i = 2,ymax-1

a = (ymax-2)*(j-2)+i-1-1

If(j.eq.2 .and. i .ge. 3 .and. i .le. (ymax-2))then
row(a) = 6

else if (j.eq.(xmax-1) .and. i.ge.3 .and. i .le. (ymax-2)) then
row(a) = 6

else if(i.eq.2 .and. j.ge.3 .and. j.le.(xmax-2))then
row(a) = 6

else if(i.eq.(ymax-1) .and. j.ge.3 .and. j.le.(xmax-2)) then
row(a) = 6

else if(i.eq.2 .and. j.eq.2) then
row(a) = 4

else if (i.eq.2 .and. j .eq. (xmax-1)) then
row(a)= 4

else if (i.eq.(ymax-1) .and. j .eq. 2) then
row(a) = 4

else if (i .eq. (ymax-1) .and. j .eq. (xmax-1)) then
row(a) = 4

else
row(a) = 9

end if


end do

end do


call MatCreateSeqAIJ(PETSC_COMM_SELF,N,N,ze,row,Mp,ierr)

.

But I get the following error message :

[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Argument out of range
[0]PETSC ERROR: New nonzero at (61,124) caused a malloc
Use MatSetOption(A, MAT_NEW_NONZERO_ALLOCATION_ERR, PETSC_FALSE) to turn off 
this check
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[0]PETSC ERROR: ./Test5 on a arch-opt named CB272PP-THINK1 by maahi Tue Apr 23 
21:39:26 2019
[0]PETSC ERROR: Configure options --with-debugging=0 --download-fblaslapack=1 
PETSC_ARCH=arch-opt
[0]PETSC ERROR: #1 MatSetValues_SeqAIJ() line 481 in 
/home/maahi/petsc/src/mat/impls/aij/seq/aij.c
[0]PETSC ERROR: #2 MatSetValues() line 1349 in 
/home/maahi/petsc/src/mat/interface/matrix.c
[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Argument out of range
[0]PETSC ERROR: New nonzero at (124,186) caused a malloc
Use MatSetOption(A, MAT_NEW_NONZERO_ALLOCATION_ERR, PETSC_FALSE) to turn off 
this check
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[0]PETSC ERROR: ./Test5 on a arch-opt named CB272PP-THINK1 by maahi Tue Apr 23 
21:39:26 2019
[0]PETSC ERROR: Configure options --with-debugging=0 --download-fblaslapack=1 
PETSC_ARCH=arch-opt
[0]PETSC ERROR: #3 MatSetValues_SeqAIJ() line 481 in 
/home/maahi/petsc/src/mat/impls/aij/seq/aij.c
[0]PETSC ERROR: #4 MatSetValues() line 1349 in 
/home/maahi/petsc/src/mat/interface/matrix.c
[0]PETSC ERROR: - Error Message 
--
.

But instead of using 'nnz', if I put an upper bound for 'nz', the code works 
fine.

Any idea what went wrong?

Thanks,
Maahi Talukder



Re: [petsc-users] PetscSFReduceBegin can not handle MPI_CHAR?

2019-04-04 Thread Zhang, Junchao via petsc-users
I updated the branch and made a PR. I tried to do MPI_SUM on MPI_CHAR. We do 
not have UnpackAdd on this type (which is correct). But unfortunately, MPICH's 
MPI_Reduce_local did not report an error (it should have), so we did not generate 
an error either.

--Junchao Zhang


On Thu, Apr 4, 2019 at 10:37 AM Jed Brown wrote:
Fande Kong via petsc-users writes:

> Hi Jed,
>
> One more question. Is it fine to use the same SF to exchange two groups of
> data at the same time? What is the better way to do this

This should work due to the non-overtaking property defined by MPI.

> Fande Kong,
>
>  ierr =
> PetscSFReduceBegin(ptap->sf,MPIU_INT,rmtspace,space,MPIU_REPLACE);CHKERRQ(ierr);
>  ierr =
> PetscSFReduceBegin(ptap->sf,MPI_CHAR,rmtspace2,space2,MPIU_REPLACE);CHKERRQ(ierr);
>  Doing some calculations
>  ierr =
> PetscSFReduceEnd(ptap->sf,MPIU_INT,rmtspace,space,MPIU_REPLACE);CHKERRQ(ierr);
>  ierr =
> PetscSFReduceEnd(ptap->sf,MPI_CHAR,rmtspace2,space2,MPIU_REPLACE);CHKERRQ(ierr);


Re: [petsc-users] PetscSFReduceBegin can not handle MPI_CHAR?

2019-04-03 Thread Zhang, Junchao via petsc-users


On Wed, Apr 3, 2019, 10:29 PM Fande Kong 
mailto:fdkong...@gmail.com>> wrote:
Thanks for the reply.  It is not necessary for me to use MPI_SUM.  I think the 
better choice is MPIU_REPLACE. Doesn’t MPIU_REPLACE work for any mpi_datatype?
Yes.
Fande


On Apr 3, 2019, at 9:15 PM, Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:


On Wed, Apr 3, 2019 at 3:41 AM Lisandro Dalcin via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
IIRC, MPI_CHAR is for ASCII text data. Also, remember that in C the signedness 
of plain `char` is implementation (or platform?) dependent.
 I'm not sure MPI_Reduce() is supposed to / should  handle MPI_CHAR, you should 
use MPI_{SIGNED|UNSIGNED}_CHAR for that. Note however that MPI_SIGNED_CHAR is 
from MPI 2.0.

MPI standard chapter 5.9.3, says "MPI_CHAR, MPI_WCHAR, and MPI_CHARACTER (which 
represent printable characters) cannot be used in reduction operations"
So Fande's code and Jed's branch have problems. To fix that, we have to add 
support for signed char, unsigned char, and char in PetscSF.  The first two 
types support add, mult, logical and bitwise operations; the last supports only 
pack/unpack. With this fix, PetscSF/MPI would raise an error on Fande's code. 
I can come up with a fix tomorrow.


On Wed, 3 Apr 2019 at 07:01, Fande Kong via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hi All,

There were some error messages when using PetscSFReduceBegin with MPI_CHAR.

ierr = 
PetscSFReduceBegin(ptap->sf,MPI_CHAR,rmtspace,space,MPI_SUM);CHKERRQ(ierr);


My question would be: Does PetscSFReduceBegin suppose work with MPI_CHAR? If 
not, should we document somewhere?

Thanks

Fande,


[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: No support for this operation for this object type
[0]PETSC ERROR: No support for type size not divisible by 4
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.10.4-1989-gd816d1587e  GIT 
Date: 2019-04-02 17:37:18 -0600
[0]PETSC ERROR: [1]PETSC ERROR: - Error Message 
--
[1]PETSC ERROR: No support for this operation for this object type
[1]PETSC ERROR: No support for type size not divisible by 4
[1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[1]PETSC ERROR: Petsc Development GIT revision: v3.10.4-1989-gd816d1587e  GIT 
Date: 2019-04-02 17:37:18 -0600
[1]PETSC ERROR: ./ex90 on a arch-linux2-c-dbg-feature-ptap-all-at-once named 
fn605731.local by kongf Tue Apr  2 21:48:41 2019
[1]PETSC ERROR: Configure options --download-hypre=1 --with-debugging=yes 
--with-shared-libraries=1 --download-fblaslapack=1 --download-metis=1 
--download-parmetis=1 --download-superlu_dist=1 
PETSC_ARCH=arch-linux2-c-dbg-feature-ptap-all-at-once --download-ptscotch 
--download-party --download-chaco --with-cxx-dialect=C++11
[1]PETSC ERROR: #1 PetscSFBasicPackTypeSetup() line 678 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
[1]PETSC ERROR: #2 PetscSFBasicGetPack() line 804 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
[1]PETSC ERROR: #3 PetscSFReduceBegin_Basic() line 1024 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
./ex90 on a arch-linux2-c-dbg-feature-ptap-all-at-once named fn605731.local by 
kongf Tue Apr  2 21:48:41 2019
[0]PETSC ERROR: Configure options --download-hypre=1 --with-debugging=yes 
--with-shared-libraries=1 --download-fblaslapack=1 --download-metis=1 
--download-parmetis=1 --download-superlu_dist=1 
PETSC_ARCH=arch-linux2-c-dbg-feature-ptap-all-at-once --download-ptscotch 
--download-party --download-chaco --with-cxx-dialect=C++11
[0]PETSC ERROR: #1 PetscSFBasicPackTypeSetup() line 678 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
[0]PETSC ERROR: #2 PetscSFBasicGetPack() line 804 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
[0]PETSC ERROR: #3 PetscSFReduceBegin_Basic() line 1024 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
[0]PETSC ERROR: #4 PetscSFReduceBegin() line 1208 in 
/Users/kongf/projects/petsc/src/vec/is/sf/interface/sf.c
[0]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIAIJ_allatonce() line 850 in 
/Users/kongf/projects/petsc/src/mat/impls/aij/mpi/mpiptap.c
[0]PETSC ERROR: #6 MatPtAP_MPIAIJ_MPIAIJ() line 202 in 
/Users/kongf/projects/petsc/src/mat/impls/aij/mpi/mpiptap.c
[0]PETSC ERROR: #7 MatPtAP() line 9429 in 
/Users/kongf/projects/petsc/src/mat/interface/matrix.c
[0]PETSC ERROR: #8 main() line 58 in 
/Users/kongf/projects/petsc/src/mat/examples/tests/ex90.c
[0]PETSC ERROR: PETSc Option Table entries:
[0]PETSC ERROR: -matptap_via allatonce
[0]PETSC ERROR: End of Error Message ---send entire error 
message to petsc-ma...@mcs.a

Re: [petsc-users] PetscSFReduceBegin can not handle MPI_CHAR?

2019-04-03 Thread Zhang, Junchao via petsc-users

On Wed, Apr 3, 2019 at 3:41 AM Lisandro Dalcin via petsc-users wrote:
IIRC, MPI_CHAR is for ASCII text data. Also, remember that in C the signedness 
of plain `char` is implementation (or platform?) dependent.
 I'm not sure MPI_Reduce() is supposed to / should  handle MPI_CHAR, you should 
use MPI_{SIGNED|UNSIGNED}_CHAR for that. Note however that MPI_SIGNED_CHAR is 
from MPI 2.0.

MPI standard chapter 5.9.3, says "MPI_CHAR, MPI_WCHAR, and MPI_CHARACTER (which 
represent printable characters) cannot be used in reduction operations"
So Fande's code and Jed's branch have problems. To fix that, we have to add 
support for signed char, unsigned char, and char in PetscSF.  The first two 
types support add, mult, logical and bitwise operations; the last supports only 
pack/unpack. With this fix, PetscSF/MPI would raise an error on Fande's code. 
I can come up with a fix tomorrow.
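
A small illustration of the restriction (hedged sketch; per the standard, a conforming MPI implementation should raise an error for the commented-out MPI_CHAR reduction):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  signed char local = 1, sum = 0;
  MPI_Init(&argc, &argv);
  MPI_Allreduce(&local, &sum, 1, MPI_SIGNED_CHAR, MPI_SUM, MPI_COMM_WORLD);   /* valid: integer type */
  /* MPI_Allreduce(&local, &sum, 1, MPI_CHAR, MPI_SUM, MPI_COMM_WORLD);          invalid per MPI 5.9.3 */
  printf("sum = %d\n", (int)sum);
  MPI_Finalize();
  return 0;
}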


On Wed, 3 Apr 2019 at 07:01, Fande Kong via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hi All,

There were some error messages when using PetscSFReduceBegin with MPI_CHAR.

ierr = 
PetscSFReduceBegin(ptap->sf,MPI_CHAR,rmtspace,space,MPI_SUM);CHKERRQ(ierr);


My question would be: Does PetscSFReduceBegin suppose work with MPI_CHAR? If 
not, should we document somewhere?

Thanks

Fande,


[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: No support for this operation for this object type
[0]PETSC ERROR: No support for type size not divisible by 4
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.10.4-1989-gd816d1587e  GIT 
Date: 2019-04-02 17:37:18 -0600
[0]PETSC ERROR: [1]PETSC ERROR: - Error Message 
--
[1]PETSC ERROR: No support for this operation for this object type
[1]PETSC ERROR: No support for type size not divisible by 4
[1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[1]PETSC ERROR: Petsc Development GIT revision: v3.10.4-1989-gd816d1587e  GIT 
Date: 2019-04-02 17:37:18 -0600
[1]PETSC ERROR: ./ex90 on a arch-linux2-c-dbg-feature-ptap-all-at-once named 
fn605731.local by kongf Tue Apr  2 21:48:41 2019
[1]PETSC ERROR: Configure options --download-hypre=1 --with-debugging=yes 
--with-shared-libraries=1 --download-fblaslapack=1 --download-metis=1 
--download-parmetis=1 --download-superlu_dist=1 
PETSC_ARCH=arch-linux2-c-dbg-feature-ptap-all-at-once --download-ptscotch 
--download-party --download-chaco --with-cxx-dialect=C++11
[1]PETSC ERROR: #1 PetscSFBasicPackTypeSetup() line 678 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
[1]PETSC ERROR: #2 PetscSFBasicGetPack() line 804 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
[1]PETSC ERROR: #3 PetscSFReduceBegin_Basic() line 1024 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
./ex90 on a arch-linux2-c-dbg-feature-ptap-all-at-once named fn605731.local by 
kongf Tue Apr  2 21:48:41 2019
[0]PETSC ERROR: Configure options --download-hypre=1 --with-debugging=yes 
--with-shared-libraries=1 --download-fblaslapack=1 --download-metis=1 
--download-parmetis=1 --download-superlu_dist=1 
PETSC_ARCH=arch-linux2-c-dbg-feature-ptap-all-at-once --download-ptscotch 
--download-party --download-chaco --with-cxx-dialect=C++11
[0]PETSC ERROR: #1 PetscSFBasicPackTypeSetup() line 678 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
[0]PETSC ERROR: #2 PetscSFBasicGetPack() line 804 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
[0]PETSC ERROR: #3 PetscSFReduceBegin_Basic() line 1024 in 
/Users/kongf/projects/petsc/src/vec/is/sf/impls/basic/sfbasic.c
[0]PETSC ERROR: #4 PetscSFReduceBegin() line 1208 in 
/Users/kongf/projects/petsc/src/vec/is/sf/interface/sf.c
[0]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIAIJ_allatonce() line 850 in 
/Users/kongf/projects/petsc/src/mat/impls/aij/mpi/mpiptap.c
[0]PETSC ERROR: #6 MatPtAP_MPIAIJ_MPIAIJ() line 202 in 
/Users/kongf/projects/petsc/src/mat/impls/aij/mpi/mpiptap.c
[0]PETSC ERROR: #7 MatPtAP() line 9429 in 
/Users/kongf/projects/petsc/src/mat/interface/matrix.c
[0]PETSC ERROR: #8 main() line 58 in 
/Users/kongf/projects/petsc/src/mat/examples/tests/ex90.c
[0]PETSC ERROR: PETSc Option Table entries:
[0]PETSC ERROR: -matptap_via allatonce
[0]PETSC ERROR: End of Error Message ---send entire error 
message to petsc-ma...@mcs.anl.gov--
[1]PETSC ERROR: #4 PetscSFReduceBegin() line 1208 in 
/Users/kongf/projects/petsc/src/vec/is/sf/interface/sf.c
[1]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIAIJ_allatonce() line 850 in 
/Users/kongf/projects/petsc/src/mat/impls/aij/mpi/mpiptap.c
[1]PETSC ERROR: #6 MatPtAP_MPIAIJ_MPIAIJ() line 202 in 

Re: [petsc-users] MPI Communication times

2019-03-23 Thread Zhang, Junchao via petsc-users
Before further looking into it, can you try these:
 * It seems you used petsc 3.9.4. Could you update to petsc master branch? We 
have an optimization (after 3.9.4) that is very useful for VecScatter on DMDA 
vectors.
* To measure performance, you do not want that many printfs.
* Only measure the parallel part of your program, i.e., skip the init and I/O 
part. You can use petsc log stages (see the sketch after this list); src/vec/vscat/examples/ex4.c also has an example
* Since your grid is 3000 x 200 x 100, so can you measure with 60 and 240 
processors? It is easy to do analysis with balanced partition.
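
A minimal sketch of the log-stage suggestion (illustrative; ksp, b, x and nsteps are assumed to come from the caller):

#include <petscksp.h>

/* Push a named stage around the part you want -log_view to report separately. */
PetscErrorCode TimedSolveLoop(KSP ksp, Vec b, Vec x, PetscInt nsteps)
{
  PetscLogStage  stage;
  PetscInt       i;
  PetscErrorCode ierr;

  ierr = PetscLogStageRegister("SolveOnly", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  for (i = 0; i < nsteps; i++) {
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  }
  ierr = PetscLogStagePop();CHKERRQ(ierr);   /* everything in between is reported under "SolveOnly" */
  return 0;
}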

Thanks.
--Junchao Zhang


On Fri, Mar 22, 2019 at 6:53 PM Manuel Valera wrote:
This is a 3D fluid dynamics code, it uses arakawa C type grids and curvilinear 
coordinates in nonhydrostatic navier stokes, we also add realistic 
stratification (Temperature / Density) and subgrid scale for turbulence. What 
we are solving here is just a seamount with a velocity forcing from one side 
and is just 5 pressure solvers or iterations.

PETSc is used via the DMDAs to set up the grids and arrays and do (almost) 
every calculation in a distributed manner, the pressure solver is implicit and 
carried out with the KSP module. I/O is still serial.

I am attaching the run outputs with the format 60mNP.txt with NP the number of 
processors used. These are large files you can read with tail -n 140 [filename] 
for the -log_view part

Thanks for your help,



On Fri, Mar 22, 2019 at 3:40 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:

On Fri, Mar 22, 2019 at 4:55 PM Manuel Valera 
mailto:mvaler...@sdsu.edu>> wrote:
No, is the same problem running with different number of processors, i have 
data from 1 to 20 processors in increments of 20 processors/1 node, and 
additionally for 1 processor.

That means you used strong scaling. If we combine VecScatterBegin/End, it took 2%, 
13%, and 18% of the execution time at 20, 100, and 200 cores respectively. That 
looks very unscalable; I do not know why.
VecScatterBegin took the same time with 100 and 200 cores. My explanation is that 
VecScatterBegin just packs data and then calls non-blocking MPI_Isend, whereas 
VecScatterEnd has to wait for the data to arrive.
Could you tell us more about your problem, for example, is it 2D or 3D, what is 
the communication pattern, how many neighbors each rank has. Also attach the 
whole log files for -log_view so that we can know the problem better.
Thanks.

On Fri, Mar 22, 2019 at 2:48 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Did you change problem size with different runs?

On Fri, Mar 22, 2019 at 4:09 PM Manuel Valera 
mailto:mvaler...@sdsu.edu>> wrote:
Hello,

I repeated the timings with the -log_sync option and now i get for 200 
processors / 20 nodes:


Event Count  Time (sec) Flop
 --- Global ---  --- Stage ---   Total
   Max Ratio  Max Ratio   Max  Ratio  Mess   
Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s


VecScatterBarrie3014 1.0 5.6771e+01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  5  0  0  0  0   5  0  0  0  0 0
VecScatterBegin3014 1.0 3.1684e+01 2.0 0.00e+00 0.0 4.2e+06 1.1e+06 2.8e+01 
 4  0 63 56  0   4  0 63 56  0 0
VecScatterEnd   2976 1.0 1.1383e+02 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00 14  0  0  0  0  14  0  0  0  0 0

With 100 processors / 10 nodes:

VecScatterBarrie3010 1.0 7.4430e+01 5.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  7  0  0  0  0   7  0  0  0  0 0
VecScatterBegin3010 1.0 3.8504e+01 2.4 0.00e+00 0.0 1.6e+06 2.0e+06 2.8e+01 
 4  0 71 66  0   4  0 71 66  0 0
VecScatterEnd   2972 1.0 8.5158e+01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  9  0  0  0  0   9  0  0  0  0 0

And with 20 processors / 1 node:

VecScatterBarrie2596 1.0 4.0614e+01 7.3 0.00e+00 0.0  0.0e+00 0.0e+00 
0.0e+00  4  0  0  0  0   4  0  0  0  0 0
VecScatterBegin 2596 1.0 1.4970e+01 1.3 0.00e+00 0.0 1.2e+05 4.0e+06 
3.0e+01  1  0 81 61  0   1  0 81 61  0 0
VecScatterEnd   2558 1.0 1.4903e+01 1.3 0.00e+00 0.0  0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   1  0  0  0  0 0

Can you help me interpret this? what i see is the End portion taking more 
relative time and Begin staying the same beyond one node, also Barrier and 
Begin counts are the same every time, but how do i estimate communication times 
from here?

Thanks,


On Wed, Mar 20, 2019 at 3:24 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Forgot to mention that a long VecScatter time might also be due to local memory copies. 
If the communication pattern has a large local-to-local (self-to-self) scatter, 
which often happens thanks to locality, then the memory copy time is counted in 
VecScatter. You can analyze your code's communication pattern to see if that is the case.

Re: [petsc-users] MPI Communication times

2019-03-22 Thread Zhang, Junchao via petsc-users

On Fri, Mar 22, 2019 at 4:55 PM Manuel Valera 
mailto:mvaler...@sdsu.edu>> wrote:
No, is the same problem running with different number of processors, i have 
data from 1 to 20 processors in increments of 20 processors/1 node, and 
additionally for 1 processor.

That means you used strong scaling. If we combine VecScatterBegin/End, it took 2%, 
13%, and 18% of the execution time at 20, 100, and 200 cores respectively. That 
looks very unscalable; I do not know why.
VecScatterBegin took the same time with 100 and 200 cores. My explanation is that 
VecScatterBegin just packs data and then calls non-blocking MPI_Isend, whereas 
VecScatterEnd has to wait for the data to arrive.
Could you tell us more about your problem, for example, is it 2D or 3D, what is 
the communication pattern, how many neighbors each rank has. Also attach the 
whole log files for -log_view so that we can know the problem better.
Thanks.

On Fri, Mar 22, 2019 at 2:48 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Did you change problem size with different runs?

On Fri, Mar 22, 2019 at 4:09 PM Manuel Valera 
mailto:mvaler...@sdsu.edu>> wrote:
Hello,

I repeated the timings with the -log_sync option and now i get for 200 
processors / 20 nodes:


Event Count  Time (sec) Flop
 --- Global ---  --- Stage ---   Total
   Max Ratio  Max Ratio   Max  Ratio  Mess   
Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s


VecScatterBarrie3014 1.0 5.6771e+01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  5  0  0  0  0   5  0  0  0  0 0
VecScatterBegin3014 1.0 3.1684e+01 2.0 0.00e+00 0.0 4.2e+06 1.1e+06 2.8e+01 
 4  0 63 56  0   4  0 63 56  0 0
VecScatterEnd   2976 1.0 1.1383e+02 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00 14  0  0  0  0  14  0  0  0  0 0

With 100 processors / 10 nodes:

VecScatterBarrie3010 1.0 7.4430e+01 5.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  7  0  0  0  0   7  0  0  0  0 0
VecScatterBegin3010 1.0 3.8504e+01 2.4 0.00e+00 0.0 1.6e+06 2.0e+06 2.8e+01 
 4  0 71 66  0   4  0 71 66  0 0
VecScatterEnd   2972 1.0 8.5158e+01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  9  0  0  0  0   9  0  0  0  0 0

And with 20 processors / 1 node:

VecScatterBarrie2596 1.0 4.0614e+01 7.3 0.00e+00 0.0  0.0e+00 0.0e+00 
0.0e+00  4  0  0  0  0   4  0  0  0  0 0
VecScatterBegin 2596 1.0 1.4970e+01 1.3 0.00e+00 0.0 1.2e+05 4.0e+06 
3.0e+01  1  0 81 61  0   1  0 81 61  0 0
VecScatterEnd   2558 1.0 1.4903e+01 1.3 0.00e+00 0.0  0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   1  0  0  0  0 0

Can you help me interpret this? what i see is the End portion taking more 
relative time and Begin staying the same beyond one node, also Barrier and 
Begin counts are the same every time, but how do i estimate communication times 
from here?

Thanks,


On Wed, Mar 20, 2019 at 3:24 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Forgot to mention that a long VecScatter time might also be due to local memory copies. 
If the communication pattern has large local to local (self to self)  scatter, 
which often happens thanks to locality, then the memory copy time is counted in 
VecScatter. You can analyze your code's communication pattern to see if it is 
the case.

--Junchao Zhang


On Wed, Mar 20, 2019 at 4:44 PM Zhang, Junchao via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:


On Wed, Mar 20, 2019 at 4:18 PM Manuel Valera 
mailto:mvaler...@sdsu.edu>> wrote:
Thanks for your answer, so for example i have a log for 200 cores across 10 
nodes that reads:


Event   Count  Time (sec) Flop  
   --- Global ---  --- Stage ---   Total
Max Ratio  Max Ratio   Max  Ratio  Mess   
Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
--
VecScatterBegin 3014 1.0 4.5550e+01 2.6 0.00e+00 0.0 4.2e+06 1.1e+06 
2.8e+01  4  0 63 56  0   4  0 63 56  0 0
VecScatterEnd   2976 1.0 1.2143e+02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00 14  0  0  0  0  14  0  0  0  0 0

While for 20 nodes at one node i have:
 What does that mean?
VecScatterBegin 2596 1.0 2.9142e+01 2.1 0.00e+00 0.0 1.2e+05 4.0e+06 
3.0e+01  2  0 81 61  0   2  0 81 61  0 0
VecScatterEnd   2558 1.0 8.0344e+01 7.9 0.00e+00 0.0  0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0   3  0  0  0  0 0

Where do i see the max/min ratio in here? and why End step is all 0.

Re: [petsc-users] MPI Communication times

2019-03-22 Thread Zhang, Junchao via petsc-users
Did you change problem size with different runs?

On Fri, Mar 22, 2019 at 4:09 PM Manuel Valera 
mailto:mvaler...@sdsu.edu>> wrote:
Hello,

I repeated the timings with the -log_sync option and now i get for 200 
processors / 20 nodes:


Event Count  Time (sec) Flop
 --- Global ---  --- Stage ---   Total
   Max Ratio  Max Ratio   Max  Ratio  Mess   
Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s


VecScatterBarrie3014 1.0 5.6771e+01 3.9 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  5  0  0  0  0   5  0  0  0  0 0
VecScatterBegin3014 1.0 3.1684e+01 2.0 0.00e+00 0.0 4.2e+06 1.1e+06 2.8e+01 
 4  0 63 56  0   4  0 63 56  0 0
VecScatterEnd   2976 1.0 1.1383e+02 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00 14  0  0  0  0  14  0  0  0  0 0

With 100 processors / 10 nodes:

VecScatterBarrie3010 1.0 7.4430e+01 5.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  7  0  0  0  0   7  0  0  0  0 0
VecScatterBegin3010 1.0 3.8504e+01 2.4 0.00e+00 0.0 1.6e+06 2.0e+06 2.8e+01 
 4  0 71 66  0   4  0 71 66  0 0
VecScatterEnd   2972 1.0 8.5158e+01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  9  0  0  0  0   9  0  0  0  0 0

And with 20 processors / 1 node:

VecScatterBarrie2596 1.0 4.0614e+01 7.3 0.00e+00 0.0  0.0e+00 0.0e+00 
0.0e+00  4  0  0  0  0   4  0  0  0  0 0
VecScatterBegin 2596 1.0 1.4970e+01 1.3 0.00e+00 0.0 1.2e+05 4.0e+06 
3.0e+01  1  0 81 61  0   1  0 81 61  0 0
VecScatterEnd   2558 1.0 1.4903e+01 1.3 0.00e+00 0.0  0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   1  0  0  0  0 0

Can you help me interpret this? what i see is the End portion taking more 
relative time and Begin staying the same beyond one node, also Barrier and 
Begin counts are the same every time, but how do i estimate communication times 
from here?

Thanks,


On Wed, Mar 20, 2019 at 3:24 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Forgot to mention that a long VecScatter time might also be due to local memory copies. 
If the communication pattern has large local to local (self to self)  scatter, 
which often happens thanks to locality, then the memory copy time is counted in 
VecScatter. You can analyze your code's communication pattern to see if it is 
the case.

--Junchao Zhang


On Wed, Mar 20, 2019 at 4:44 PM Zhang, Junchao via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:


On Wed, Mar 20, 2019 at 4:18 PM Manuel Valera 
mailto:mvaler...@sdsu.edu>> wrote:
Thanks for your answer, so for example i have a log for 200 cores across 10 
nodes that reads:


Event   Count  Time (sec) Flop  
   --- Global ---  --- Stage ---   Total
Max Ratio  Max Ratio   Max  Ratio  Mess   
Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
--
VecScatterBegin 3014 1.0 4.5550e+01 2.6 0.00e+00 0.0 4.2e+06 1.1e+06 
2.8e+01  4  0 63 56  0   4  0 63 56  0 0
VecScatterEnd   2976 1.0 1.2143e+02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00 14  0  0  0  0  14  0  0  0  0 0

While for 20 nodes at one node i have:
 What does that mean?
VecScatterBegin 2596 1.0 2.9142e+01 2.1 0.00e+00 0.0 1.2e+05 4.0e+06 
3.0e+01  2  0 81 61  0   2  0 81 61  0 0
VecScatterEnd   2558 1.0 8.0344e+01 7.9 0.00e+00 0.0  0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0   3  0  0  0  0 0

Where do i see the max/min ratio in here? and why End step is all 0.0e00 in 
both but still grows from 3% to 14% of total time? It seems i would need to run 
again with the -log_sync option, is this correct?

The max/min ratios are the columns such as 2.1 and 7.9. The MPI send/recv calls are 
issued in VecScatterBegin(); VecScatterEnd() only does MPI_Wait, which is why it shows 
zero messages. Yes, run with -log_sync and see what happens.

Different question, can't i estimate the total communication time if i had a 
typical communication time per MPI message times the number of MPI messages 
reported in the log? or it doesn't work like that?

That probably won't work, because you have multiple processes doing send/recv at the 
same time and they might saturate the bandwidth. PETSc also overlaps computation 
and communication.

Thanks.





On Wed, Mar 20, 2019 at 2:02 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
See the "Mess   AvgLen  Reduct" number in each log stage.  Mess is the total 
number of messages sent in an event over all processes.  AvgLen is average 
message len. Reduct is the num

Re: [petsc-users] Valgrind Issue With Ghosted Vectors

2019-03-21 Thread Zhang, Junchao via petsc-users


On Thu, Mar 21, 2019 at 1:57 PM Derek Gaston via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
It sounds like you already tracked this down... but for completeness here is 
what track-origins gives:

==262923== Conditional jump or move depends on uninitialised value(s)
==262923==at 0x73C6548: VecScatterMemcpyPlanCreate_Index (vscat.c:294)
==262923==by 0x73DBD97: VecScatterMemcpyPlanCreate_PtoP (vpscat_mpi1.c:312)
==262923==by 0x73DE6AE: VecScatterCreateCommon_PtoS_MPI1 
(vpscat_mpi1.c:2328)
==262923==by 0x73DFFEA: VecScatterCreateLocal_PtoS_MPI1 (vpscat_mpi1.c:2202)
==262923==by 0x73C7A51: VecScatterCreate_PtoS (vscat.c:608)
==262923==by 0x73C9E8A: VecScatterSetUp_vectype_private (vscat.c:857)
==262923==by 0x73CBE5D: VecScatterSetUp_MPI1 (vpscat_mpi1.c:2543)
==262923==by 0x7413D39: VecScatterSetUp (vscatfce.c:212)
==262923==by 0x7412D73: VecScatterCreateWithData (vscreate.c:333)
==262923==by 0x747A232: VecCreateGhostWithArray (pbvec.c:685)
==262923==by 0x747A90D: VecCreateGhost (pbvec.c:741)
==262923==by 0x5C7FFD6: libMesh::PetscVector::init(unsigned long, 
unsigned long, std::vector > 
const&, bool, libMesh::ParallelType) (petsc_vector.h:752)
==262923==  Uninitialised value was created by a heap allocation

I checked the code but could not figure out what was wrong.  Perhaps you should 
use 64-bit integers and see whether the warning still exists.  Please remember 
to incorporate Stefano's bug fix.

==262923==at 0x402DDC6: memalign (vg_replace_malloc.c:899)
==262923==by 0x7359702: PetscMallocAlign (mal.c:41)
==262923==by 0x7359C70: PetscMallocA (mal.c:390)
==262923==by 0x73DECF0: VecScatterCreateLocal_PtoS_MPI1 (vpscat_mpi1.c:2061)
==262923==by 0x73C7A51: VecScatterCreate_PtoS (vscat.c:608)
==262923==by 0x73C9E8A: VecScatterSetUp_vectype_private (vscat.c:857)
==262923==by 0x73CBE5D: VecScatterSetUp_MPI1 (vpscat_mpi1.c:2543)
==262923==by 0x7413D39: VecScatterSetUp (vscatfce.c:212)
==262923==by 0x7412D73: VecScatterCreateWithData (vscreate.c:333)
==262923==by 0x747A232: VecCreateGhostWithArray (pbvec.c:685)
==262923==by 0x747A90D: VecCreateGhost (pbvec.c:741)
==262923==by 0x5C7FFD6: libMesh::PetscVector::init(unsigned long, 
unsigned long, std::vector > 
const&, bool, libMesh::ParallelType) (petsc_vector.h:752)


BTW: This turned out not to be my actual problem.  My actual problem was just 
some stupidity on my part... just a simple input parameter issue to my code 
(should have had better error checking!).

But: It sounds like my digging may have uncovered something real here... so it 
wasn't completely useless :-)

Thanks for your help everyone!

Derek



On Thu, Mar 21, 2019 at 10:38 AM Stefano Zampini 
mailto:stefano.zamp...@gmail.com>> wrote:


Il giorno mer 20 mar 2019 alle ore 23:40 Derek Gaston via petsc-users 
mailto:petsc-users@mcs.anl.gov>> ha scritto:
Trying to track down some memory corruption I'm seeing on larger scale runs 
(3.5B+ unknowns).

Uhm are you using 32bit indices? is it possible there's integer overflow 
somewhere?


Was able to run Valgrind on it... and I'm seeing quite a lot of uninitialized 
value errors coming from ghost updating.  Here are some of the traces:

==87695== Conditional jump or move depends on uninitialised value(s)
==87695==at 0x73236D3: PetscMallocAlign (mal.c:28)
==87695==by 0x7323C70: PetscMallocA (mal.c:390)
==87695==by 0x739048E: VecScatterMemcpyPlanCreate_Index (vscat.c:284)
==87695==by 0x73A5D97: VecScatterMemcpyPlanCreate_PtoP (vpscat_mpi1.c:312)
==64730==by 0x7393E8A: VecScatterSetUp_vectype_private (vscat.c:857)
==64730==by 0x7395E5D: VecScatterSetUp_MPI1 (vpscat_mpi1.c:2543)
==64730==by 0x73DDD39: VecScatterSetUp (vscatfce.c:212)
==64730==by 0x73DCD73: VecScatterCreateWithData (vscreate.c:333)
==64730==by 0x7444232: VecCreateGhostWithArray (pbvec.c:685)
==64730==by 0x744490D: VecCreateGhost (pbvec.c:741)

==133582== Conditional jump or move depends on uninitialised value(s)
==133582==at 0x4030384: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1034)
==133582==by 0x739E4F9: PetscMemcpy (petscsys.h:1649)
==133582==by 0x739E4F9: VecScatterMemcpyPlanExecute_Pack 
(vecscatterimpl.h:150)
==133582==by 0x739E4F9: VecScatterBeginMPI1_1 (vpscat_mpi1.h:69)
==133582==by 0x73DD964: VecScatterBegin (vscatfce.c:110)
==133582==by 0x744E195: VecGhostUpdateBegin (commonmpvec.c:225)

This is from a Git checkout of PETSc... the hash I branched from is: 
0e667e8fea4aa from December 23rd (updating would be really hard at this point 
as I've completed 90% of my dissertation with this version... and changing 
PETSc now would be pretty painful!).

Any ideas?  Is it possible it's in my code?  Is it possible that there are 
later PETSc commits that already fix this?

Thanks for any help,
Derek



--
Stefano


Re: [petsc-users] Valgrind Issue With Ghosted Vectors

2019-03-21 Thread Zhang, Junchao via petsc-users
Yes, it does.  It is a bug.
--Junchao Zhang


On Thu, Mar 21, 2019 at 11:16 AM Balay, Satish 
mailto:ba...@mcs.anl.gov>> wrote:
Does maint also need this fix?

Satish

On Thu, 21 Mar 2019, Stefano Zampini via petsc-users wrote:

> Derek
>
> I have fixed the optimized plan few weeks ago
>
> https://bitbucket.org/petsc/petsc/commits/c3caad8634d376283f7053f3b388606b45b3122c
>
> Maybe this will fix your problem too?
>
> Stefano
>
>
> Il Gio 21 Mar 2019, 04:21 Zhang, Junchao via petsc-users <
> petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>> ha scritto:
>
> > Hi, Derek,
> >   Try to apply this tiny (but dirty) patch on your version of PETSc to
> > disable the VecScatterMemcpyPlan optimization to see if it helps.
> >   Thanks.
> > --Junchao Zhang
> >
> > On Wed, Mar 20, 2019 at 6:33 PM Junchao Zhang 
> > mailto:jczh...@mcs.anl.gov>> wrote:
> >
> >> Did you see the warning with small scale runs?  Is it possible to provide
> >> a test code?
> >> You mentioned "changing PETSc now would be pretty painful". Is it because
> >> it will affect your performance (but not your code)?  If yes, could you try
> >> PETSc master and run you code with or without -vecscatter_type sf.  I want
> >> to isolate the problem and see if it is due to possible bugs in VecScatter.
> >> If the above suggestion is not feasible, I will disable VecScatterMemcpy.
> >> It is an optimization I added. Sorry I did not have an option to turn off
> >> it because I thought it was always useful:)  I will provide you a patch
> >> later to disable it. With that you can run again to isolate possible bugs
> >> in VecScatterMemcpy.
> >> Thanks.
> >> --Junchao Zhang
> >>
> >>
> >> On Wed, Mar 20, 2019 at 5:40 PM Derek Gaston via petsc-users <
> >> petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>> wrote:
> >>
> >>> Trying to track down some memory corruption I'm seeing on larger scale
> >>> runs (3.5B+ unknowns).  Was able to run Valgrind on it... and I'm seeing
> >>> quite a lot of uninitialized value errors coming from ghost updating.  
> >>> Here
> >>> are some of the traces:
> >>>
> >>> ==87695== Conditional jump or move depends on uninitialised value(s)
> >>> ==87695==at 0x73236D3: PetscMallocAlign (mal.c:28)
> >>> ==87695==by 0x7323C70: PetscMallocA (mal.c:390)
> >>> ==87695==by 0x739048E: VecScatterMemcpyPlanCreate_Index (vscat.c:284)
> >>> ==87695==by 0x73A5D97: VecScatterMemcpyPlanCreate_PtoP
> >>> (vpscat_mpi1.c:312)
> >>> ==64730==by 0x7393E8A: VecScatterSetUp_vectype_private (vscat.c:857)
> >>> ==64730==by 0x7395E5D: VecScatterSetUp_MPI1 (vpscat_mpi1.c:2543)
> >>> ==64730==by 0x73DDD39: VecScatterSetUp (vscatfce.c:212)
> >>> ==64730==by 0x73DCD73: VecScatterCreateWithData (vscreate.c:333)
> >>> ==64730==by 0x7444232: VecCreateGhostWithArray (pbvec.c:685)
> >>> ==64730==by 0x744490D: VecCreateGhost (pbvec.c:741)
> >>>
> >>> ==133582== Conditional jump or move depends on uninitialised value(s)
> >>> ==133582==at 0x4030384: memcpy@@GLIBC_2.14
> >>> (vg_replace_strmem.c:1034)
> >>> ==133582==by 0x739E4F9: PetscMemcpy (petscsys.h:1649)
> >>> ==133582==by 0x739E4F9: VecScatterMemcpyPlanExecute_Pack
> >>> (vecscatterimpl.h:150)
> >>> ==133582==by 0x739E4F9: VecScatterBeginMPI1_1 (vpscat_mpi1.h:69)
> >>> ==133582==by 0x73DD964: VecScatterBegin (vscatfce.c:110)
> >>> ==133582==by 0x744E195: VecGhostUpdateBegin (commonmpvec.c:225)
> >>>
> >>> This is from a Git checkout of PETSc... the hash I branched from is:
> >>> 0e667e8fea4aa from December 23rd (updating would be really hard at this
> >>> point as I've completed 90% of my dissertation with this version... and
> >>> changing PETSc now would be pretty painful!).
> >>>
> >>> Any ideas?  Is it possible it's in my code?  Is it possible that there
> >>> are later PETSc commits that already fix this?
> >>>
> >>> Thanks for any help,
> >>> Derek
> >>>
> >>>
>



Re: [petsc-users] Valgrind Issue With Ghosted Vectors

2019-03-21 Thread Zhang, Junchao via petsc-users
Thanks to Stefano for fixing this bug.  His fix is easy to apply (two-line 
change) and therefore should be tried first.

--Junchao Zhang


On Thu, Mar 21, 2019 at 3:02 AM Stefano Zampini 
mailto:stefano.zamp...@gmail.com>> wrote:
Derek

I have fixed the optimized plan few weeks ago

https://bitbucket.org/petsc/petsc/commits/c3caad8634d376283f7053f3b388606b45b3122c

Maybe this will fix your problem too?

Stefano


Il Gio 21 Mar 2019, 04:21 Zhang, Junchao via petsc-users 
mailto:petsc-users@mcs.anl.gov>> ha scritto:
Hi, Derek,
  Try to apply this tiny (but dirty) patch on your version of PETSc to disable 
the VecScatterMemcpyPlan optimization to see if it helps.
  Thanks.
--Junchao Zhang

On Wed, Mar 20, 2019 at 6:33 PM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
Did you see the warning with small scale runs?  Is it possible to provide a 
test code?
You mentioned "changing PETSc now would be pretty painful". Is it because it 
will affect your performance (but not your code)?  If yes, could you try PETSc 
master and run your code with or without -vecscatter_type sf.  I want to isolate 
the problem and see if it is due to possible bugs in VecScatter.
If the above suggestion is not feasible, I will disable VecScatterMemcpy. It is 
an optimization I added. Sorry I did not have an option to turn it off because 
I thought it was always useful:)  I will provide you a patch later to disable 
it. With that you can run again to isolate possible bugs in VecScatterMemcpy.
Thanks.
--Junchao Zhang


On Wed, Mar 20, 2019 at 5:40 PM Derek Gaston via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Trying to track down some memory corruption I'm seeing on larger scale runs 
(3.5B+ unknowns).  Was able to run Valgrind on it... and I'm seeing quite a lot 
of uninitialized value errors coming from ghost updating.  Here are some of the 
traces:

==87695== Conditional jump or move depends on uninitialised value(s)
==87695==at 0x73236D3: PetscMallocAlign (mal.c:28)
==87695==by 0x7323C70: PetscMallocA (mal.c:390)
==87695==by 0x739048E: VecScatterMemcpyPlanCreate_Index (vscat.c:284)
==87695==by 0x73A5D97: VecScatterMemcpyPlanCreate_PtoP (vpscat_mpi1.c:312)
==64730==by 0x7393E8A: VecScatterSetUp_vectype_private (vscat.c:857)
==64730==by 0x7395E5D: VecScatterSetUp_MPI1 (vpscat_mpi1.c:2543)
==64730==by 0x73DDD39: VecScatterSetUp (vscatfce.c:212)
==64730==by 0x73DCD73: VecScatterCreateWithData (vscreate.c:333)
==64730==by 0x7444232: VecCreateGhostWithArray (pbvec.c:685)
==64730==by 0x744490D: VecCreateGhost (pbvec.c:741)

==133582== Conditional jump or move depends on uninitialised value(s)
==133582==at 0x4030384: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1034)
==133582==by 0x739E4F9: PetscMemcpy (petscsys.h:1649)
==133582==by 0x739E4F9: VecScatterMemcpyPlanExecute_Pack 
(vecscatterimpl.h:150)
==133582==by 0x739E4F9: VecScatterBeginMPI1_1 (vpscat_mpi1.h:69)
==133582==by 0x73DD964: VecScatterBegin (vscatfce.c:110)
==133582==by 0x744E195: VecGhostUpdateBegin (commonmpvec.c:225)

This is from a Git checkout of PETSc... the hash I branched from is: 
0e667e8fea4aa from December 23rd (updating would be really hard at this point 
as I've completed 90% of my dissertation with this version... and changing 
PETSc now would be pretty painful!).

Any ideas?  Is it possible it's in my code?  Is it possible that there are 
later PETSc commits that already fix this?

Thanks for any help,
Derek



Re: [petsc-users] Valgrind Issue With Ghosted Vectors

2019-03-20 Thread Zhang, Junchao via petsc-users
Hi, Derek,
  Try to apply this tiny (but dirty) patch on your version of PETSc to disable 
the VecScatterMemcpyPlan optimization to see if it helps.
  Thanks.
--Junchao Zhang

On Wed, Mar 20, 2019 at 6:33 PM Junchao Zhang 
mailto:jczh...@mcs.anl.gov>> wrote:
Did you see the warning with small scale runs?  Is it possible to provide a 
test code?
You mentioned "changing PETSc now would be pretty painful". Is it because it 
will affect your performance (but not your code)?  If yes, could you try PETSc 
master and run your code with or without -vecscatter_type sf.  I want to isolate 
the problem and see if it is due to possible bugs in VecScatter.
If the above suggestion is not feasible, I will disable VecScatterMemcpy. It is 
an optimization I added. Sorry I did not have an option to turn it off because 
I thought it was always useful:)  I will provide you a patch later to disable 
it. With that you can run again to isolate possible bugs in VecScatterMemcpy.
Thanks.
--Junchao Zhang


On Wed, Mar 20, 2019 at 5:40 PM Derek Gaston via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Trying to track down some memory corruption I'm seeing on larger scale runs 
(3.5B+ unknowns).  Was able to run Valgrind on it... and I'm seeing quite a lot 
of uninitialized value errors coming from ghost updating.  Here are some of the 
traces:

==87695== Conditional jump or move depends on uninitialised value(s)
==87695==at 0x73236D3: PetscMallocAlign (mal.c:28)
==87695==by 0x7323C70: PetscMallocA (mal.c:390)
==87695==by 0x739048E: VecScatterMemcpyPlanCreate_Index (vscat.c:284)
==87695==by 0x73A5D97: VecScatterMemcpyPlanCreate_PtoP (vpscat_mpi1.c:312)
==64730==by 0x7393E8A: VecScatterSetUp_vectype_private (vscat.c:857)
==64730==by 0x7395E5D: VecScatterSetUp_MPI1 (vpscat_mpi1.c:2543)
==64730==by 0x73DDD39: VecScatterSetUp (vscatfce.c:212)
==64730==by 0x73DCD73: VecScatterCreateWithData (vscreate.c:333)
==64730==by 0x7444232: VecCreateGhostWithArray (pbvec.c:685)
==64730==by 0x744490D: VecCreateGhost (pbvec.c:741)

==133582== Conditional jump or move depends on uninitialised value(s)
==133582==at 0x4030384: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1034)
==133582==by 0x739E4F9: PetscMemcpy (petscsys.h:1649)
==133582==by 0x739E4F9: VecScatterMemcpyPlanExecute_Pack 
(vecscatterimpl.h:150)
==133582==by 0x739E4F9: VecScatterBeginMPI1_1 (vpscat_mpi1.h:69)
==133582==by 0x73DD964: VecScatterBegin (vscatfce.c:110)
==133582==by 0x744E195: VecGhostUpdateBegin (commonmpvec.c:225)

This is from a Git checkout of PETSc... the hash I branched from is: 
0e667e8fea4aa from December 23rd (updating would be really hard at this point 
as I've completed 90% of my dissertation with this version... and changing 
PETSc now would be pretty painful!).

Any ideas?  Is it possible it's in my code?  Is it possible that there are 
later PETSc commits that already fix this?

Thanks for any help,
Derek



vscat.patch
Description: vscat.patch


Re: [petsc-users] Valgrind Issue With Ghosted Vectors

2019-03-20 Thread Zhang, Junchao via petsc-users
Did you see the warning with small scale runs?  Is it possible to provide a 
test code?
You mentioned "changing PETSc now would be pretty painful". Is it because it 
will affect your performance (but not your code)?  If yes, could you try PETSc 
master and run your code with or without -vecscatter_type sf.  I want to isolate 
the problem and see if it is due to possible bugs in VecScatter.
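For example, something along these lines (the executable name and rank count here are only placeholders):

  mpirun -n 8 ./your_app <your usual options> -vecscatter_type sf

and then the same run without -vecscatter_type sf, to see whether the valgrind warnings change.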
If the above suggestion is not feasible, I will disable VecScatterMemcpy. It is 
an optimization I added. Sorry I did not have an option to turn it off because 
I thought it was always useful:)  I will provide you a patch later to disable 
it. With that you can run again to isolate possible bugs in VecScatterMemcpy.
Thanks.
--Junchao Zhang


On Wed, Mar 20, 2019 at 5:40 PM Derek Gaston via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Trying to track down some memory corruption I'm seeing on larger scale runs 
(3.5B+ unknowns).  Was able to run Valgrind on it... and I'm seeing quite a lot 
of uninitialized value errors coming from ghost updating.  Here are some of the 
traces:

==87695== Conditional jump or move depends on uninitialised value(s)
==87695==at 0x73236D3: PetscMallocAlign (mal.c:28)
==87695==by 0x7323C70: PetscMallocA (mal.c:390)
==87695==by 0x739048E: VecScatterMemcpyPlanCreate_Index (vscat.c:284)
==87695==by 0x73A5D97: VecScatterMemcpyPlanCreate_PtoP (vpscat_mpi1.c:312)
==64730==by 0x7393E8A: VecScatterSetUp_vectype_private (vscat.c:857)
==64730==by 0x7395E5D: VecScatterSetUp_MPI1 (vpscat_mpi1.c:2543)
==64730==by 0x73DDD39: VecScatterSetUp (vscatfce.c:212)
==64730==by 0x73DCD73: VecScatterCreateWithData (vscreate.c:333)
==64730==by 0x7444232: VecCreateGhostWithArray (pbvec.c:685)
==64730==by 0x744490D: VecCreateGhost (pbvec.c:741)

==133582== Conditional jump or move depends on uninitialised value(s)
==133582==at 0x4030384: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1034)
==133582==by 0x739E4F9: PetscMemcpy (petscsys.h:1649)
==133582==by 0x739E4F9: VecScatterMemcpyPlanExecute_Pack 
(vecscatterimpl.h:150)
==133582==by 0x739E4F9: VecScatterBeginMPI1_1 (vpscat_mpi1.h:69)
==133582==by 0x73DD964: VecScatterBegin (vscatfce.c:110)
==133582==by 0x744E195: VecGhostUpdateBegin (commonmpvec.c:225)

This is from a Git checkout of PETSc... the hash I branched from is: 
0e667e8fea4aa from December 23rd (updating would be really hard at this point 
as I've completed 90% of my dissertation with this version... and changing 
PETSc now would be pretty painful!).

Any ideas?  Is it possible it's in my code?  Is it possible that there are 
later PETSc commits that already fix this?

Thanks for any help,
Derek



Re: [petsc-users] MPI Communication times

2019-03-20 Thread Zhang, Junchao via petsc-users
Forgot to mention that long VecScatter time might also be due to local memory copies. 
If the communication pattern has a large local-to-local (self-to-self) scatter, 
which often happens thanks to locality, then the memory copy time is counted in 
VecScatter. You can analyze your code's communication pattern to see if it is 
the case.

--Junchao Zhang


On Wed, Mar 20, 2019 at 4:44 PM Zhang, Junchao via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:


On Wed, Mar 20, 2019 at 4:18 PM Manuel Valera 
mailto:mvaler...@sdsu.edu>> wrote:
Thanks for your answer, so for example i have a log for 200 cores across 10 
nodes that reads:


Event   Count  Time (sec) Flop  
   --- Global ---  --- Stage ---   Total
Max Ratio  Max Ratio   Max  Ratio  Mess   
Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
--
VecScatterBegin 3014 1.0 4.5550e+01 2.6 0.00e+00 0.0 4.2e+06 1.1e+06 
2.8e+01  4  0 63 56  0   4  0 63 56  0 0
VecScatterEnd   2976 1.0 1.2143e+02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00 14  0  0  0  0  14  0  0  0  0 0

While for 20 nodes at one node i have:
 What does that mean?
VecScatterBegin 2596 1.0 2.9142e+01 2.1 0.00e+00 0.0 1.2e+05 4.0e+06 
3.0e+01  2  0 81 61  0   2  0 81 61  0 0
VecScatterEnd   2558 1.0 8.0344e+01 7.9 0.00e+00 0.0  0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0   3  0  0  0  0 0

Where do i see the max/min ratio in here? and why End step is all 0.0e00 in 
both but still grows from 3% to 14% of total time? It seems i would need to run 
again with the -log_sync option, is this correct?

e.g., the 2.1 and 7.9 in your output are the max/min ratios. MPI send/recv are in 
VecScatterBegin(); VecScatterEnd() only does MPI_Wait, which is why it has zero 
messages. Yes, run with -log_sync and see what happens.

Different question, can't i estimate the total communication time if i had a 
typical communication time per MPI message times the number of MPI messages 
reported in the log? or it doesn't work like that?

That probably won't work, because you have multiple processes doing send/recv at the 
same time and they might saturate the bandwidth. PETSc also does 
computation/communication overlapping.

Thanks.





On Wed, Mar 20, 2019 at 2:02 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
See the "Mess   AvgLen  Reduct" number in each log stage.  Mess is the total 
number of messages sent in an event over all processes.  AvgLen is average 
message length. Reduct is the number of global reductions.
Each event like VecScatterBegin/End has a maximal execution time over all 
processes, and a max/min ratio.  %T is sum(execution time of the event on each 
process)/sum(execution time of the stage on each process). %T indicates how 
expensive the event is. It is a number you should pay attention to.
If your code is imbalanced (i.e., with a big max/min ratio), then the 
performance number is skewed and becomes misleading because some processes are 
just waiting for others. Then, besides -log_view, you can add -log_sync, which 
adds an extra MPI_Barrier for each event to let them start at the same time. 
With that, it is easier to interpret the number.
src/vec/vscat/examples/ex4.c is a tiny example for VecScatter logging.

--Junchao Zhang


On Wed, Mar 20, 2019 at 2:58 PM Manuel Valera via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello,

I am working on timing my model, which we made MPI scalable using petsc DMDAs, 
i want to know more about the output log and how to calculate a total 
communication times for my runs, so far i see we have "MPI Messages" and "MPI 
Messages Lengths" in the log, along VecScatterEnd and VecScatterBegin reports.

My question is, how do i interpret these number to get a rough estimate on how 
much overhead we have just from MPI communications times in my model runs?

Thanks,




Re: [petsc-users] MPI Communication times

2019-03-20 Thread Zhang, Junchao via petsc-users


On Wed, Mar 20, 2019 at 4:18 PM Manuel Valera 
mailto:mvaler...@sdsu.edu>> wrote:
Thanks for your answer, so for example i have a log for 200 cores across 10 
nodes that reads:


Event   Count  Time (sec) Flop  
   --- Global ---  --- Stage ---   Total
Max Ratio  Max Ratio   Max  Ratio  Mess   
Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
--
VecScatterBegin 3014 1.0 4.5550e+01 2.6 0.00e+00 0.0 4.2e+06 1.1e+06 
2.8e+01  4  0 63 56  0   4  0 63 56  0 0
VecScatterEnd   2976 1.0 1.2143e+02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00 14  0  0  0  0  14  0  0  0  0 0

While for 20 nodes at one node i have:
 What does that mean?
VecScatterBegin 2596 1.0 2.9142e+01 2.1 0.00e+00 0.0 1.2e+05 4.0e+06 
3.0e+01  2  0 81 61  0   2  0 81 61  0 0
VecScatterEnd   2558 1.0 8.0344e+01 7.9 0.00e+00 0.0  0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0   3  0  0  0  0 0

Where do i see the max/min ratio in here? and why End step is all 0.0e00 in 
both but still grows from 3% to 14% of total time? It seems i would need to run 
again with the -log_sync option, is this correct?

e.g., the 2.1 and 7.9 in your output are the max/min ratios. MPI send/recv are in 
VecScatterBegin(); VecScatterEnd() only does MPI_Wait, which is why it has zero 
messages. Yes, run with -log_sync and see what happens.

Different question, can't i estimate the total communication time if i had a 
typical communication time per MPI message times the number of MPI messages 
reported in the log? or it doesn't work like that?

That probably won't work, because you have multiple processes doing send/recv at the 
same time and they might saturate the bandwidth. PETSc also does 
computation/communication overlapping.

Thanks.





On Wed, Mar 20, 2019 at 2:02 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
See the "Mess   AvgLen  Reduct" number in each log stage.  Mess is the total 
number of messages sent in an event over all processes.  AvgLen is average 
message length. Reduct is the number of global reductions.
Each event like VecScatterBegin/End has a maximal execution time over all 
processes, and a max/min ratio.  %T is sum(execution time of the event on each 
process)/sum(execution time of the stage on each process). %T indicates how 
expensive the event is. It is a number you should pay attention to.
If your code is imbalanced (i.e., with a big max/min ratio), then the 
performance number is skewed and becomes misleading because some processes are 
just waiting for others. Then, besides -log_view, you can add -log_sync, which 
adds an extra MPI_Barrier for each event to let them start at the same time. 
With that, it is easier to interpret the number.
src/vec/vscat/examples/ex4.c is a tiny example for VecScatter logging.

--Junchao Zhang


On Wed, Mar 20, 2019 at 2:58 PM Manuel Valera via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello,

I am working on timing my model, which we made MPI scalable using petsc DMDAs, 
i want to know more about the output log and how to calculate a total 
communication times for my runs, so far i see we have "MPI Messages" and "MPI 
Messages Lengths" in the log, along VecScatterEnd and VecScatterBegin reports.

My question is, how do i interpret these number to get a rough estimate on how 
much overhead we have just from MPI communications times in my model runs?

Thanks,




Re: [petsc-users] MPI Communication times

2019-03-20 Thread Zhang, Junchao via petsc-users
See the "Mess   AvgLen  Reduct" number in each log stage.  Mess is the total 
number of messages sent in an event over all processes.  AvgLen is average 
message length. Reduct is the number of global reductions.
Each event like VecScatterBegin/End has a maximal execution time over all 
processes, and a max/min ratio.  %T is sum(execution time of the event on each 
process)/sum(execution time of the stage on each process). %T indicates how 
expensive the event is. It is a number you should pay attention to.
If your code is imbalanced (i.e., with a big max/min ratio), then the 
performance number is skewed and becomes misleading because some processes are 
just waiting for others. Then, besides -log_view, you can add -log_sync, which 
adds an extra MPI_Barrier for each event to let them start at the same time. 
With that, it is easier to interpret the number.
src/vec/vscat/examples/ex4.c is a tiny example for VecScatter logging.
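For example, a run along these lines gives the synchronized log (the executable name is a placeholder; both options are standard PETSc runtime options):

  mpirun -n 8 ./your_app -log_view -log_sync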

--Junchao Zhang


On Wed, Mar 20, 2019 at 2:58 PM Manuel Valera via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello,

I am working on timing my model, which we made MPI scalable using petsc DMDAs, 
i want to know more about the output log and how to calculate a total 
communication times for my runs, so far i see we have "MPI Messages" and "MPI 
Messages Lengths" in the log, along VecScatterEnd and VecScatterBegin reports.

My question is, how do i interpret these number to get a rough estimate on how 
much overhead we have just from MPI communications times in my model runs?

Thanks,




Re: [petsc-users] PCFieldSplit with MatNest

2019-03-13 Thread Zhang, Junchao via petsc-users
Manuel,
  Could you try to add this line
 sbaij->free_imax_ilen = PETSC_TRUE;
 after line 2431 in 
/opt/PETSc_library/petsc-3.10.4/src/mat/impls/sbaij/seq/sbaij.c

 PS: Matt, this bug looks unrelated to my VecRestoreArrayRead_Nest fix.

--Junchao Zhang


On Wed, Mar 13, 2019 at 9:05 AM Matthew Knepley 
mailto:knep...@gmail.com>> wrote:
On Wed, Mar 13, 2019 at 9:44 AM Manuel Colera Rico via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Yes:

[ 0]8416 bytes MatCreateSeqSBAIJWithArrays() line 2431 in
/opt/PETSc_library/petsc-3.10.4/src/mat/impls/sbaij/seq/sbaij.c
[ 0]8416 bytes MatCreateSeqSBAIJWithArrays() line 2431 in
/opt/PETSc_library/petsc-3.10.4/src/mat/impls/sbaij/seq/sbaij.c
[ 0]4544 bytes MatCreateSeqSBAIJWithArrays() line 2431 in
/opt/PETSc_library/petsc-3.10.4/src/mat/impls/sbaij/seq/sbaij.c
[ 0]4544 bytes MatCreateSeqSBAIJWithArrays() line 2431 in
/opt/PETSc_library/petsc-3.10.4/src/mat/impls/sbaij/seq/sbaij.c

Junchao, do imax and ilen get missed in the Destroy when the user provides 
arrays?

  
https://bitbucket.org/petsc/petsc/src/06a3e802b3873ffbfd04b71a0821522327dd9b04/src/mat/impls/sbaij/seq/sbaij.c#lines-2431

Matt

I have checked that I have destroyed all the MatNest matrices and all
the submatrices individually.

Manuel

---

On 3/13/19 2:28 PM, Jed Brown wrote:
> Is there any output if you run with -malloc_dump?
>
> Manuel Colera Rico via petsc-users 
> mailto:petsc-users@mcs.anl.gov>> writes:
>
>> Hi, Junchao,
>>
>> I have installed the newest version of PETSc and it works fine. I just
>> get the following memory leak warning:
>>
>> Direct leak of 28608 byte(s) in 12 object(s) allocated from:
>>   #0 0x7f1ddd5caa38 in __interceptor_memalign
>> ../../../../gcc-8.1.0/libsanitizer/asan/asan_malloc_linux.cc:111
>>   #1 0x7f1ddbef1213 in PetscMallocAlign
>> (/opt/PETSc_library/petsc-3.10.4/mcr_20190313/lib/libpetsc.so.3.10+0x150213)
>>
>> Thank you,
>>
>> Manuel
>>
>> ---
>>
>> On 3/12/19 7:08 PM, Zhang, Junchao wrote:
>>> Hi, Manuel,
>>>I recently fixed a problem in VecRestoreArrayRead. Basically, I
>>> added VecRestoreArrayRead_Nest. Could you try the master branch of
>>> PETSc to see if it fixes your problem?
>>>Thanks.
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Mon, Mar 11, 2019 at 6:56 AM Manuel Colera Rico via petsc-users
>>> mailto:petsc-users@mcs.anl.gov> 
>>> <mailto:petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>>> wrote:
>>>
>>>  Hello,
>>>
>>>  I need to solve a 2*2 block linear system. The matrices A_00, A_01,
>>>  A_10, A_11 are constructed separately via
>>>  MatCreateSeqAIJWithArrays and
>>>  MatCreateSeqSBAIJWithArrays. Then, I construct the full system matrix
>>>  with MatCreateNest, and use MatNestGetISs and PCFieldSplitSetIS to
>>>  set
>>>  up the PC, trying to follow the procedure described here:
>>>  
>>> https://www.mcs.anl.gov/petsc/petsc-current/src/snes/examples/tutorials/ex70.c.html.
>>>
>>>  However, when I run the code with Leak Sanitizer, I get the
>>>  following error:
>>>
>>>  =
>>>  ==54927==ERROR: AddressSanitizer: attempting free on address which
>>>  was
>>>  not malloc()-ed: 0x62751ab8 in thread T0
>>>   #0 0x7fbd95c08f30 in __interceptor_free
>>>  ../../../../gcc-8.1.0/libsanitizer/asan/asan_malloc_linux.cc:66
>>>   #1 0x7fbd92b99dcd in PetscFreeAlign
>>>  
>>> (/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x146dcd)
>>>   #2 0x7fbd92ce0178 in VecRestoreArray_Nest
>>>  
>>> (/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x28d178)
>>>   #3 0x7fbd92cd627d in VecRestoreArrayRead
>>>  
>>> (/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x28327d)
>>>   #4 0x7fbd92d1189e in VecScatterBegin_SSToSS
>>>  
>>> (/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x2be89e)
>>>   #5 0x7fbd92d1a414 in VecScatterBegin
>>>  
>>> (/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x2c7414)
>>>   #6 0x7fbd934a999c in PCApply_FieldSplit
>>>  
>>> (/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0xa5699c)
>>>   #7 0x7fbd93369071 in PCApply
>>>  
>>> (/opt/

Re: [petsc-users] PetscScatterCreate type mismatch after update.

2019-03-12 Thread Zhang, Junchao via petsc-users
Maybe you should delete your PETSC_ARCH directory and recompile?  I tested 
my branch; it should not fail that easily :)

--Junchao Zhang


On Tue, Mar 12, 2019 at 8:20 PM Manuel Valera 
mailto:mvaler...@sdsu.edu>> wrote:
Hi Mr Zhang, thanks for your reply,

I just checked your branch out, reconfigured and recompiled and i am still 
getting the same error from my last email (null argument, when expected a valid 
pointer), do you have any idea why this can be happening?

Thanks so much,

Manuel

On Tue, Mar 12, 2019 at 6:09 PM Zhang, Junchao 
mailto:jczh...@mcs.anl.gov>> wrote:
Manuel,
I was working on a branch to revert the VecScatterCreate to 
VecScatterCreateWithData change. The change broke the PETSc API and I think we do 
not need it. I had planned to do a pull request after another PR of mine is merged.
But since it already affects you,  you can try this branch now, which is 
jczhang/fix-vecscattercreate-api

Thanks.
--Junchao Zhang


On Tue, Mar 12, 2019 at 5:58 PM Jed Brown via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Did you just update to 'master'?  See VecScatter changes:

https://www.mcs.anl.gov/petsc/documentation/changes/dev.html

Manuel Valera via petsc-users 
mailto:petsc-users@mcs.anl.gov>> writes:

> Hello,
>
> I just updated petsc from the repo to the latest master branch version, and
> a compilation problem popped up, it seems like the variable types are not
> being acknowledged properly, what i have in a minimum working example
> fashion is:
>
> #include <petsc/finclude/petscvec.h>
>> #include <petsc/finclude/petscdmda.h>
>> #include <petsc/finclude/petscdm.h>
>> #include <petsc/finclude/petscis.h>
>> #include <petsc/finclude/petscksp.h>
>> USE petscvec
>> USE petscdmda
>> USE petscdm
>> USE petscis
>> USE petscksp
>> IS :: ScalarIS
>> IS :: DummyIS
>> VecScatter :: LargerToSmaller,to0,from0
>> VecScatter :: SmallerToLarger
>> PetscInt, ALLOCATABLE  :: pScalarDA(:), pDummyDA(:)
>> PetscScalar:: rtol
>> Vec:: Vec1
>> Vec:: Vec2
>> ! Create index sets
>> allocate( pScalarDA(0:(gridx-1)*(gridy-1)*(gridz-1)-1) ,
>> pDummyDA(0:(gridx-1)*(gridy-1)*(gridz-1)-1) )
>> iter=0
>> do k=0,gridz-2
>> kplane = k*gridx*gridy
>> do j=0,gridy-2
>> do i=0,gridx-2
>> pScalarDA(iter) = kplane + j*(gridx) + i
>> iter = iter+1
>> enddo
>> enddo
>> enddo
>> pDummyDA = (/ (ind, ind=0,((gridx-1)*(gridy-1)*(gridz-1))-1) /)
>> call
>> ISCreateGeneral(PETSC_COMM_WORLD,(gridx-1)*(gridy-1)*(gridz-1), &
>>
>>  pScalarDA,PETSC_COPY_VALUES,ScalarIS,ierr)
>> call
>> ISCreateGeneral(PETSC_COMM_WORLD,(gridx-1)*(gridy-1)*(gridz-1), &
>>
>>  pDummyDA,PETSC_COPY_VALUES,DummyIS,ierr)
>> deallocate(pScalarDA,pDummyDA, STAT=ierr)
>> ! Create VecScatter contexts: LargerToSmaller & SmallerToLarger
>> call DMDACreateNaturalVector(daScalars,Vec1,ierr)
>> call DMDACreateNaturalVector(daDummy,Vec2,ierr)
>> call
>> VecScatterCreate(Vec1,ScalarIS,Vec2,DummyIS,LargerToSmaller,ierr)
>> call
>> VecScatterCreate(Vec2,DummyIS,Vec1,ScalarIS,SmallerToLarger,ierr)
>> call VecDestroy(Vec1,ierr)
>> call VecDestroy(Vec2,ierr)
>
>
> And the error i get is the part i cannot really understand:
>
> matrixobjs.f90:99.34:
>> call
>> VecScatterCreate(Vec1,ScalarIS,Vec2,DummyIS,LargerToSmaller,ie
>>  1
>> Error: Type mismatch in argument 'a' at (1); passed TYPE(tvec) to
>> INTEGER(4)
>> matrixobjs.f90:100.34:
>> call
>> VecScatterCreate(Vec2,DummyIS,Vec1,ScalarIS,SmallerToLarger,ie
>>  1
>> Error: Type mismatch in argument 'a' at (1); passed TYPE(tvec) to
>> INTEGER(4)
>> make[1]: *** [matrixobjs.o] Error 1
>> make[1]: Leaving directory `/usr/scratch/valera/ParGCCOM-Master/Src'
>> make: *** [gcmSeamount] Error 2
>
>
> What i find hard to understand is why/where my code is finding an integer
> type? as you can see from the MWE header the variables types look correct,
>
> Any help is appreaciated,
>
> Thanks,


Re: [petsc-users] PetscScatterCreate type mismatch after update.

2019-03-12 Thread Zhang, Junchao via petsc-users
Manuel,
I was working on a branch to revert the VecScatterCreate to 
VecScatterCreateWithData change. The change broke the PETSc API and I think we do 
not need it. I had planned to do a pull request after another PR of mine is merged.
But since it already affects you,  you can try this branch now, which is 
jczhang/fix-vecscattercreate-api

Thanks.
--Junchao Zhang


On Tue, Mar 12, 2019 at 5:58 PM Jed Brown via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Did you just update to 'master'?  See VecScatter changes:

https://www.mcs.anl.gov/petsc/documentation/changes/dev.html

Manuel Valera via petsc-users 
mailto:petsc-users@mcs.anl.gov>> writes:

> Hello,
>
> I just updated petsc from the repo to the latest master branch version, and
> a compilation problem popped up, it seems like the variable types are not
> being acknowledged properly, what i have in a minimum working example
> fashion is:
>
> #include <petsc/finclude/petscvec.h>
>> #include <petsc/finclude/petscdmda.h>
>> #include <petsc/finclude/petscdm.h>
>> #include <petsc/finclude/petscis.h>
>> #include <petsc/finclude/petscksp.h>
>> USE petscvec
>> USE petscdmda
>> USE petscdm
>> USE petscis
>> USE petscksp
>> IS :: ScalarIS
>> IS :: DummyIS
>> VecScatter :: LargerToSmaller,to0,from0
>> VecScatter :: SmallerToLarger
>> PetscInt, ALLOCATABLE  :: pScalarDA(:), pDummyDA(:)
>> PetscScalar:: rtol
>> Vec:: Vec1
>> Vec:: Vec2
>> ! Create index sets
>> allocate( pScalarDA(0:(gridx-1)*(gridy-1)*(gridz-1)-1) ,
>> pDummyDA(0:(gridx-1)*(gridy-1)*(gridz-1)-1) )
>> iter=0
>> do k=0,gridz-2
>> kplane = k*gridx*gridy
>> do j=0,gridy-2
>> do i=0,gridx-2
>> pScalarDA(iter) = kplane + j*(gridx) + i
>> iter = iter+1
>> enddo
>> enddo
>> enddo
>> pDummyDA = (/ (ind, ind=0,((gridx-1)*(gridy-1)*(gridz-1))-1) /)
>> call
>> ISCreateGeneral(PETSC_COMM_WORLD,(gridx-1)*(gridy-1)*(gridz-1), &
>>
>>  pScalarDA,PETSC_COPY_VALUES,ScalarIS,ierr)
>> call
>> ISCreateGeneral(PETSC_COMM_WORLD,(gridx-1)*(gridy-1)*(gridz-1), &
>>
>>  pDummyDA,PETSC_COPY_VALUES,DummyIS,ierr)
>> deallocate(pScalarDA,pDummyDA, STAT=ierr)
>> ! Create VecScatter contexts: LargerToSmaller & SmallerToLarger
>> call DMDACreateNaturalVector(daScalars,Vec1,ierr)
>> call DMDACreateNaturalVector(daDummy,Vec2,ierr)
>> call
>> VecScatterCreate(Vec1,ScalarIS,Vec2,DummyIS,LargerToSmaller,ierr)
>> call
>> VecScatterCreate(Vec2,DummyIS,Vec1,ScalarIS,SmallerToLarger,ierr)
>> call VecDestroy(Vec1,ierr)
>> call VecDestroy(Vec2,ierr)
>
>
> And the error i get is the part i cannot really understand:
>
> matrixobjs.f90:99.34:
>> call
>> VecScatterCreate(Vec1,ScalarIS,Vec2,DummyIS,LargerToSmaller,ie
>>  1
>> Error: Type mismatch in argument 'a' at (1); passed TYPE(tvec) to
>> INTEGER(4)
>> matrixobjs.f90:100.34:
>> call
>> VecScatterCreate(Vec2,DummyIS,Vec1,ScalarIS,SmallerToLarger,ie
>>  1
>> Error: Type mismatch in argument 'a' at (1); passed TYPE(tvec) to
>> INTEGER(4)
>> make[1]: *** [matrixobjs.o] Error 1
>> make[1]: Leaving directory `/usr/scratch/valera/ParGCCOM-Master/Src'
>> make: *** [gcmSeamount] Error 2
>
>
> What i find hard to understand is why/where my code is finding an integer
> type? as you can see from the MWE header the variables types look correct,
>
> Any help is appreaciated,
>
> Thanks,


Re: [petsc-users] PCFieldSplit with MatNest

2019-03-12 Thread Zhang, Junchao via petsc-users
Hi, Manuel,
  I recently fixed a problem in VecRestoreArrayRead. Basically, I added 
VecRestoreArrayRead_Nest. Could you try the master branch of PETSc to see if it 
fixes your problem?
  Thanks.

--Junchao Zhang


On Mon, Mar 11, 2019 at 6:56 AM Manuel Colera Rico via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello,

I need to solve a 2*2 block linear system. The matrices A_00, A_01,
A_10, A_11 are constructed separately via MatCreateSeqAIJWithArrays and
MatCreateSeqSBAIJWithArrays. Then, I construct the full system matrix
with MatCreateNest, and use MatNestGetISs and PCFieldSplitSetIS to set
up the PC, trying to follow the procedure described here:
https://www.mcs.anl.gov/petsc/petsc-current/src/snes/examples/tutorials/ex70.c.html.

However, when I run the code with Leak Sanitizer, I get the following error:

=
==54927==ERROR: AddressSanitizer: attempting free on address which was
not malloc()-ed: 0x62751ab8 in thread T0
 #0 0x7fbd95c08f30 in __interceptor_free
../../../../gcc-8.1.0/libsanitizer/asan/asan_malloc_linux.cc:66
 #1 0x7fbd92b99dcd in PetscFreeAlign
(/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x146dcd)
 #2 0x7fbd92ce0178 in VecRestoreArray_Nest
(/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x28d178)
 #3 0x7fbd92cd627d in VecRestoreArrayRead
(/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x28327d)
 #4 0x7fbd92d1189e in VecScatterBegin_SSToSS
(/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x2be89e)
 #5 0x7fbd92d1a414 in VecScatterBegin
(/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x2c7414)
 #6 0x7fbd934a999c in PCApply_FieldSplit
(/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0xa5699c)
 #7 0x7fbd93369071 in PCApply
(/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0x916071)
 #8 0x7fbd934efe77 in KSPInitialResidual
(/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0xa9ce77)
 #9 0x7fbd9350272c in KSPSolve_GMRES
(/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0xaaf72c)
 #10 0x7fbd934e3c01 in KSPSolve
(/opt/PETSc_library/petsc/manuel_OpenBLAS_petsc/lib/libpetsc.so.3.8+0xa90c01)

Disabling Leak Sanitizer also outputs an "invalid pointer" error.

Did I forget something when writing the code?

Thank you,

Manuel

---



Re: [petsc-users] Compute the sum of the absolute values of the off-block diagonal entries of each row

2019-03-04 Thread Zhang, Junchao via petsc-users


On Mon, Mar 4, 2019 at 10:39 AM Matthew Knepley via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
On Mon, Mar 4, 2019 at 11:28 AM Cyrill Vonplanta via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Dear Petsc Users,

I am trying to implement a variant of the $l^1$-Gauss-Seidel smoother from 
https://doi.org/10.1137/100798806 (eq. 6.1 and below). One of the main issues 
is that I need to compute the sum $\sum_j |a_{ij}|$ of the matrix entries 
that are not part of the local diagonal block. I was looking for something like 
MatGetRowSumAbs but it looks like it hasn't been made yet.

I guess I have to come up with something myself, but would you know of some 
workaround for this without going too deep into PETSc?

PetscInt           r, rS, rE, ncols, c;
const PetscInt    *cols;
const PetscScalar *vals;
PetscReal          sum;

MatGetOwnershipRange(A, &rS, &rE);
for (r = rS; r < rE; ++r) {
  sum = 0.0;
  MatGetRow(A, r, &ncols, &cols, &vals);
  for (c = 0; c < ncols; ++c) if ((cols[c] < rS) || (cols[c] >= rE)) sum += PetscAbsScalar(vals[c]);
  MatRestoreRow(A, r, &ncols, &cols, &vals);
}
Perhaps PETSc should have a MatGetRemoteRow (or MatGetRowOffDiagonalBlock)(A, 
r, &ncols, &cols, &vals).  MatGetRow() internally has to allocate memory and 
sort indices and values from the local diagonal block and the off-diagonal block. It is 
totally a waste in this case -- users do not care about the column indices or the local 
block.  With MatGetRemoteRow(A, r, &ncols, NULL, &vals), PETSc would just need to 
set an integer and a pointer.
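In the meantime, a minimal sketch that skips the diagonal block entirely, assuming the matrix is stored as MATMPIAIJ (Ao below is the sequential matrix holding only the off-diagonal-block part of the local rows):

Mat                Ad, Ao;
const PetscInt    *garray;
PetscInt           r, m, ncols, c;
const PetscScalar *vals;
PetscReal          sum;

MatMPIAIJGetSeqAIJ(A, &Ad, &Ao, &garray);  /* Ad: diagonal block, Ao: off-diagonal block */
MatGetLocalSize(Ao, &m, NULL);             /* Ao has one row per local row of A */
for (r = 0; r < m; ++r) {
  sum = 0.0;
  MatGetRow(Ao, r, &ncols, NULL, &vals);   /* column indices are not needed here */
  for (c = 0; c < ncols; ++c) sum += PetscAbsScalar(vals[c]);
  MatRestoreRow(Ao, r, &ncols, NULL, &vals);
  /* sum now holds the sum of |a_ij| over the off-block-diagonal entries of local row r */
}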


  Thanks,

 Matt

Best Cyrill
--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


Re: [petsc-users] Problem in loading Matrix Market format

2019-02-28 Thread Zhang, Junchao via petsc-users
Eda,
  An update to ex72 was merged into the PETSc master branch just now. It can now read 
matrices, either symmetric or non-symmetric, in Matrix Market format, and output 
a PETSc binary matrix in MATSBAIJ format (for symmetric) or MATAIJ format (for 
non-symmetric). See the help in the source code for usage.
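Once you have the binary file, reading it back in a PETSc program is just a MatLoad() call, roughly like this (the file name is only an illustration):

Mat         A;
PetscViewer viewer;

PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrix.petsc", FILE_MODE_READ, &viewer);
MatCreate(PETSC_COMM_WORLD, &A);
MatLoad(A, viewer);
PetscViewerDestroy(&viewer);
/* ... use A ... */
MatDestroy(&A);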
--Junchao Zhang


On Tue, Feb 12, 2019 at 1:50 AM Eda Oktay via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello,

I am trying to load matrix in Matrix Market format. I found an example on mat 
file  (ex78) whih can be tested by using .dat file. Since .dat file and .mtx 
file are similar  in structure (specially afiro_A.dat file is similar to 
amesos2_test_mat0.mtx since they both have 3 columns and the columns represent 
the same properties), I tried to run ex78 by using amesos2_test_mat0.mtx 
instead of afiro_A.dat. However, I got the error "Badly formatted input file". 
Here is the full error message:

[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Badly formatted input file

[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.10.3, Dec, 18, 2018
[0]PETSC ERROR: ./ex78 on a arch-linux2-c-debug named 
7330.wls.metu.edu.tr by edaoktay Tue Feb 12 
10:47:58 2019
[0]PETSC ERROR: Configure options --with-cc=gcc --with-cxx=g++ 
--with-fc=gfortran --with-cxx-dialect=C++11 --download-openblas 
--download-metis --download-parmetis --download-superlu_dist --download-slepc 
--download-mpich
[0]PETSC ERROR: #1 main() line 73 in 
/home/edaoktay/petsc-3.10.3/src/mat/examples/tests/ex78.c
[0]PETSC ERROR: PETSc Option Table entries:
[0]PETSC ERROR: -Ain 
/home/edaoktay/petsc-3.10.3/share/petsc/datafiles/matrices/amesos2_test_mat0.mtx
[0]PETSC ERROR: End of Error Message ---send entire error 
message to petsc-ma...@mcs.anl.gov--
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor

I know there is also an example (ex72) for Matrix Market format but in 
description, it is only proper for symmmetric and lower triangle, so I decided 
to use ex78.

Best regards,

Eda


Re: [petsc-users] AddressSanitizer: attempting free on address which was not malloc()-ed

2019-02-27 Thread Zhang, Junchao via petsc-users
Try the following to see if you can catch the bug easily: 1) Get error code for 
each petsc function and check it with CHKERRQ; 2) Link your code with a petsc 
library with debugging enabled (configured with --with-debugging=1); 3) Run 
your code with valgrind

--Junchao Zhang


On Wed, Feb 27, 2019 at 9:04 PM Yuyun Yang 
mailto:yyan...@stanford.edu>> wrote:
Hi Junchao,

This code actually involves a lot of classes and is pretty big. Might be an 
overkill for me to send everything to you. I'd like to know if I see this sort 
of error message, which points to this domain file, is it possible that the 
problem happens in another file (whose operations are linked to this one)? If 
so, I'll debug a little more and maybe send you more useful information later.

Best regards,
Yuyun

____
From: Zhang, Junchao mailto:jczh...@mcs.anl.gov>>
Sent: Wednesday, February 27, 2019 6:24:13 PM
To: Yuyun Yang
Cc: Matthew Knepley; petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] AddressSanitizer: attempting free on address which 
was not malloc()-ed

Could you provide a compilable and runnable test so I can try it?
--Junchao Zhang


On Wed, Feb 27, 2019 at 7:34 PM Yuyun Yang 
mailto:yyan...@stanford.edu>> wrote:
Thanks, I fixed that, but I’m not actually calling the testScatters() function 
in my implementation (in the constructor, the only functions I called are 
setFields and setScatters). So the problem couldn’t have been that?

Best,
Yuyun

From: Zhang, Junchao mailto:jczh...@mcs.anl.gov>>
Sent: Wednesday, February 27, 2019 10:50 AM
To: Yuyun Yang mailto:yyan...@stanford.edu>>
Cc: Matthew Knepley mailto:knep...@gmail.com>>; 
petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] AddressSanitizer: attempting free on address which 
was not malloc()-ed


On Wed, Feb 27, 2019 at 10:41 AM Yuyun Yang via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
I called VecDestroy() in the destructor for this object – is that not the right 
way to do it?
In Domain::testScatters(), you have many VecDuplicate() calls. You need to 
VecDestroy() the previous vector before doing a new VecDuplicate() into it.
How do I implement CHECK ALL RETURN CODES?
For each PETSc function, do ierr = ...;  CHKERRQ(ierr);

From: Matthew Knepley mailto:knep...@gmail.com>>
Sent: Wednesday, February 27, 2019 7:24 AM
To: Yuyun Yang mailto:yyan...@stanford.edu>>
Cc: petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] AddressSanitizer: attempting free on address which 
was not malloc()-ed

You call VecDuplicate() a bunch, but VecDestroy() only once in the bottom 
function. This is wrong.
Also, CHECK ALL RETURN CODES. This is the fastest way to find errors.

   Matt

On Wed, Feb 27, 2019 at 2:06 AM Yuyun Yang via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello team,

I ran into the address sanitizer error that I hope you could help me with. I 
don’t really know what’s wrong with the way the code frees memory. The relevant 
code file is attached. The line number following domain.cpp specifically 
referenced to the vector _q, which seems a little odd, since some other vectors 
are constructed and freed the same way.

==1719==ERROR: AddressSanitizer: attempting free on address which was not 
malloc()-ed: 0x61f076c0 in thread T0
#0 0x7fbf195282ca in __interceptor_free 
(/usr/lib/x86_64-linux-gnu/libasan.so.2+0x982ca)
#1 0x7fbf1706f895 in PetscFreeAlign 
/home/yyy910805/petsc/src/sys/memory/mal.c:87
#2 0x7fbf1731a898 in VecDestroy_Seq 
/home/yyy910805/petsc/src/vec/vec/impls/seq/bvec2.c:788
#3 0x7fbf1735f795 in VecDestroy 
/home/yyy910805/petsc/src/vec/vec/interface/vector.c:408
#4 0x40dd0a in Domain::~Domain() 
/home/yyy910805/scycle/source/domain.cpp:132
#5 0x40b479 in main /home/yyy910805/scycle/source/main.cpp:242
#6 0x7fbf14d2082f in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#7 0x4075d8 in _start (/home/yyy910805/scycle/source/main+0x4075d8)

0x61f076c0 is located 1600 bytes inside of 3220-byte region 
[0x61f07080,0x61f07d14)
allocated by thread T0 here:
#0 0x7fbf19528b32 in __interceptor_memalign 
(/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98b32)
#1 0x7fbf1706f7e0 in PetscMallocAlign 
/home/yyy910805/petsc/src/sys/memory/mal.c:41
#2 0x7fbf17073022 in PetscTrMallocDefault 
/home/yyy910805/petsc/src/sys/memory/mtr.c:183
#3 0x7fbf170710a1 in PetscMallocA 
/home/yyy910805/petsc/src/sys/memory/mal.c:397
#4 0x7fbf17326fb0 in VecCreate_Seq 
/home/yyy910805/petsc/src/vec/vec/impls/seq/bvec3.c:35
#5 0x7fbf1736f560 in VecSetType 
/home/yyy910805/petsc/src/vec/vec/interface/vecreg.c:51
#6 0x7fbf1731afae in VecDuplicate_Seq 
/home/yyy910805/petsc/src/vec/vec/impls/seq/bvec2.c:807
#7 0x7fbf1735eff7 in VecDuplicate 
/home/yyy910805/petsc/src/vec/vec/i

Re: [petsc-users] AddressSanitizer: attempting free on address which was not malloc()-ed

2019-02-27 Thread Zhang, Junchao via petsc-users
Could you provide a compilable and runnable test so I can try it?
--Junchao Zhang


On Wed, Feb 27, 2019 at 7:34 PM Yuyun Yang 
mailto:yyan...@stanford.edu>> wrote:
Thanks, I fixed that, but I’m not actually calling the testScatters() function 
in my implementation (in the constructor, the only functions I called are 
setFields and setScatters). So the problem couldn’t have been that?

Best,
Yuyun

From: Zhang, Junchao mailto:jczh...@mcs.anl.gov>>
Sent: Wednesday, February 27, 2019 10:50 AM
To: Yuyun Yang mailto:yyan...@stanford.edu>>
Cc: Matthew Knepley mailto:knep...@gmail.com>>; 
petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] AddressSanitizer: attempting free on address which 
was not malloc()-ed


On Wed, Feb 27, 2019 at 10:41 AM Yuyun Yang via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
I called VecDestroy() in the destructor for this object – is that not the right 
way to do it?
In Domain::testScatters(), you have many VecDuplicate() calls. You need to 
VecDestroy() the previous vector before doing a new VecDuplicate() into it.
How do I implement CHECK ALL RETURN CODES?
For each PETSc function, do ierr = ...;  CHKERRQ(ierr);

From: Matthew Knepley mailto:knep...@gmail.com>>
Sent: Wednesday, February 27, 2019 7:24 AM
To: Yuyun Yang mailto:yyan...@stanford.edu>>
Cc: petsc-users@mcs.anl.gov<mailto:petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] AddressSanitizer: attempting free on address which 
was not malloc()-ed

You call VecDuplicate() a bunch, but VecDestroy() only once in the bottom 
function. This is wrong.
Also, CHECK ALL RETURN CODES. This is the fastest way to find errors.

   Matt

On Wed, Feb 27, 2019 at 2:06 AM Yuyun Yang via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello team,

I ran into the address sanitizer error that I hope you could help me with. I 
don’t really know what’s wrong with the way the code frees memory. The relevant 
code file is attached. The line number following domain.cpp specifically 
referenced to the vector _q, which seems a little odd, since some other vectors 
are constructed and freed the same way.

==1719==ERROR: AddressSanitizer: attempting free on address which was not 
malloc()-ed: 0x61f076c0 in thread T0
#0 0x7fbf195282ca in __interceptor_free 
(/usr/lib/x86_64-linux-gnu/libasan.so.2+0x982ca)
#1 0x7fbf1706f895 in PetscFreeAlign 
/home/yyy910805/petsc/src/sys/memory/mal.c:87
#2 0x7fbf1731a898 in VecDestroy_Seq 
/home/yyy910805/petsc/src/vec/vec/impls/seq/bvec2.c:788
#3 0x7fbf1735f795 in VecDestroy 
/home/yyy910805/petsc/src/vec/vec/interface/vector.c:408
#4 0x40dd0a in Domain::~Domain() 
/home/yyy910805/scycle/source/domain.cpp:132
#5 0x40b479 in main /home/yyy910805/scycle/source/main.cpp:242
#6 0x7fbf14d2082f in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#7 0x4075d8 in _start (/home/yyy910805/scycle/source/main+0x4075d8)

0x61f076c0 is located 1600 bytes inside of 3220-byte region 
[0x61f07080,0x61f07d14)
allocated by thread T0 here:
#0 0x7fbf19528b32 in __interceptor_memalign 
(/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98b32)
#1 0x7fbf1706f7e0 in PetscMallocAlign 
/home/yyy910805/petsc/src/sys/memory/mal.c:41
#2 0x7fbf17073022 in PetscTrMallocDefault 
/home/yyy910805/petsc/src/sys/memory/mtr.c:183
#3 0x7fbf170710a1 in PetscMallocA 
/home/yyy910805/petsc/src/sys/memory/mal.c:397
#4 0x7fbf17326fb0 in VecCreate_Seq 
/home/yyy910805/petsc/src/vec/vec/impls/seq/bvec3.c:35
#5 0x7fbf1736f560 in VecSetType 
/home/yyy910805/petsc/src/vec/vec/interface/vecreg.c:51
#6 0x7fbf1731afae in VecDuplicate_Seq 
/home/yyy910805/petsc/src/vec/vec/impls/seq/bvec2.c:807
#7 0x7fbf1735eff7 in VecDuplicate 
/home/yyy910805/petsc/src/vec/vec/interface/vector.c:379
#8 0x4130de in Domain::setFields() 
/home/yyy910805/scycle/source/domain.cpp:431
#9 0x40c60a in Domain::Domain(char const*) 
/home/yyy910805/scycle/source/domain.cpp:57
#10 0x40b433 in main /home/yyy910805/scycle/source/main.cpp:242
#11 0x7fbf14d2082f in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

SUMMARY: AddressSanitizer: bad-free ??:0 __interceptor_free
==1719==ABORTING

Thanks very much!
Yuyun


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>


Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL

2019-02-27 Thread Zhang, Junchao via petsc-users



On Wed, Feb 27, 2019 at 7:03 PM Sajid Ali 
mailto:sajidsyed2...@u.northwestern.edu>> 
wrote:

Hi Junchao,

I’m confused with the syntax. If I submit the following as my job script, I get 
an error :

#!/bin/bash
#SBATCH --job-name=petsc_test
#SBATCH -N 1
#SBATCH -C knl,quad,flat
#SBATCH -p apsxrmd
#SBATCH --time=1:00:00

module load intel/18.0.3-d6gtsxs
module load intel-parallel-studio/cluster.2018.3-xvnfrfz
module load numactl-2.0.12-intel-18.0.3-wh44iog
srun -n 64 -c 64 --cpu_bind=cores numactl -m 1 aps ./ex_modify -ts_type cn 
-prop_steps 25 -pc_type gamg -ts_monitor -log_view


The error is :
srun: cluster configuration lacks support for cpu binding

This cluster does not support cpu binding.  You need to remove 
--cpu_bind=cores. In addition, I don't know what the 'aps' argument is.


srun: error: Unable to create step for job 916208: More processors requested 
than permitted

I remember the product of -n and -c has to be 256.  You can try srun -n 64 -c 4 
numactl -m 1 ./ex_modify ...


I’m following the advice as given at slide 33 of 
https://www.nersc.gov/assets/Uploads/02-using-cori-knl-nodes-20170609.pdf

For further info, I’m using LCRC at ANL.

Thank You,
Sajid Ali
Applied Physics
Northwestern University


Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL

2019-02-27 Thread Zhang, Junchao via petsc-users
Use srun  numactl -m 1 ./app OR srun  numactl -p 1 
./app
See bottom of 
https://www.nersc.gov/users/computational-systems/cori/configuration/knl-processor-modes/
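For example, on a flat-mode KNL node (assuming MCDRAM shows up as NUMA node 1, which numactl -H will confirm; the executable name is a placeholder):

  srun -n 64 numactl -m 1 ./your_app -log_view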

--Junchao Zhang


On Wed, Feb 27, 2019 at 4:16 PM Sajid Ali via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hi,

I ran a TS integrator for 25 steps on a Broadwell-Xeon and Xeon-Phi (KNL). The 
problem size is 5000x5000 and I'm using scalar=complex.

The program takes 125 seconds to run on Xeon and 451 seconds on KNL !

The first thing I want to change is to convert the memory access for the 
program on KNL from DRAM to MCDRAM. I did run the problem in an interactive 
SLURM job and specified -C quad,flat and yet I see DRAM is being used.

I'm attaching the PETSc log files and Intel APS reports as well. Any help on 
how I should change my runtime parameters on KNL will be highly appreciated. 
Thanks in advance.

--
Sajid Ali
Applied Physics
Northwestern University


Re: [petsc-users] AddressSanitizer: attempting free on address which was not malloc()-ed

2019-02-27 Thread Zhang, Junchao via petsc-users

On Wed, Feb 27, 2019 at 10:41 AM Yuyun Yang via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
I called VecDestroy() in the destructor for this object – is that not the right 
way to do it?
In Domain::testScatters(), you have many VecDuplicate() calls. You need to 
VecDestroy() the previous vector before doing a new VecDuplicate() into it.
How do I implement CHECK ALL RETURN CODES?
For each PETSc function, do ierr = ...;  CHKERRQ(ierr);
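A minimal sketch of that pattern (x is an existing Vec; the names here are only illustrative):

PetscErrorCode ierr;
Vec            y;

ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
/* ... use y ... */
ierr = VecDestroy(&y);CHKERRQ(ierr);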

From: Matthew Knepley mailto:knep...@gmail.com>>
Sent: Wednesday, February 27, 2019 7:24 AM
To: Yuyun Yang mailto:yyan...@stanford.edu>>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] AddressSanitizer: attempting free on address which 
was not malloc()-ed

You call VecDuplicate() a bunch, but VecDestroy() only once in the bottom 
function. This is wrong.
Also, CHECK ALL RETURN CODES. This is the fastest way to find errors.

   Matt

On Wed, Feb 27, 2019 at 2:06 AM Yuyun Yang via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello team,

I ran into the address sanitizer error that I hope you could help me with. I 
don’t really know what’s wrong with the way the code frees memory. The relevant 
code file is attached. The line number following domain.cpp specifically 
referenced to the vector _q, which seems a little odd, since some other vectors 
are constructed and freed the same way.

==1719==ERROR: AddressSanitizer: attempting free on address which was not 
malloc()-ed: 0x61f076c0 in thread T0
#0 0x7fbf195282ca in __interceptor_free 
(/usr/lib/x86_64-linux-gnu/libasan.so.2+0x982ca)
#1 0x7fbf1706f895 in PetscFreeAlign 
/home/yyy910805/petsc/src/sys/memory/mal.c:87
#2 0x7fbf1731a898 in VecDestroy_Seq 
/home/yyy910805/petsc/src/vec/vec/impls/seq/bvec2.c:788
#3 0x7fbf1735f795 in VecDestroy 
/home/yyy910805/petsc/src/vec/vec/interface/vector.c:408
#4 0x40dd0a in Domain::~Domain() 
/home/yyy910805/scycle/source/domain.cpp:132
#5 0x40b479 in main /home/yyy910805/scycle/source/main.cpp:242
#6 0x7fbf14d2082f in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#7 0x4075d8 in _start (/home/yyy910805/scycle/source/main+0x4075d8)

0x61f076c0 is located 1600 bytes inside of 3220-byte region 
[0x61f07080,0x61f07d14)
allocated by thread T0 here:
#0 0x7fbf19528b32 in __interceptor_memalign 
(/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98b32)
#1 0x7fbf1706f7e0 in PetscMallocAlign 
/home/yyy910805/petsc/src/sys/memory/mal.c:41
#2 0x7fbf17073022 in PetscTrMallocDefault 
/home/yyy910805/petsc/src/sys/memory/mtr.c:183
#3 0x7fbf170710a1 in PetscMallocA 
/home/yyy910805/petsc/src/sys/memory/mal.c:397
#4 0x7fbf17326fb0 in VecCreate_Seq 
/home/yyy910805/petsc/src/vec/vec/impls/seq/bvec3.c:35
#5 0x7fbf1736f560 in VecSetType 
/home/yyy910805/petsc/src/vec/vec/interface/vecreg.c:51
#6 0x7fbf1731afae in VecDuplicate_Seq 
/home/yyy910805/petsc/src/vec/vec/impls/seq/bvec2.c:807
#7 0x7fbf1735eff7 in VecDuplicate 
/home/yyy910805/petsc/src/vec/vec/interface/vector.c:379
#8 0x4130de in Domain::setFields() 
/home/yyy910805/scycle/source/domain.cpp:431
#9 0x40c60a in Domain::Domain(char const*) 
/home/yyy910805/scycle/source/domain.cpp:57
#10 0x40b433 in main /home/yyy910805/scycle/source/main.cpp:242
#11 0x7fbf14d2082f in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

SUMMARY: AddressSanitizer: bad-free ??:0 __interceptor_free
==1719==ABORTING

Thanks very much!
Yuyun


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


Re: [petsc-users] Problem in loading Matrix Market format

2019-02-12 Thread Zhang, Junchao via petsc-users
Sure.
--Junchao Zhang


On Tue, Feb 12, 2019 at 9:47 AM Matthew Knepley 
mailto:knep...@gmail.com>> wrote:
Hi Junchao,

Could you fix the MM example in PETSc to have this full support? That way we 
will always have it.

 Thanks,

Matt

On Tue, Feb 12, 2019 at 10:27 AM Zhang, Junchao via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Eda,
  I have a code that can read in Matrix Market and write out PETSc binary 
files.  Usage:  mpirun -n 1 ./mm2petsc -fin <Matrix Market file> -fout <PETSc binary file>.  You can 
have a try.
--Junchao Zhang


On Tue, Feb 12, 2019 at 1:50 AM Eda Oktay via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello,

I am trying to load matrix in Matrix Market format. I found an example on mat 
file  (ex78) whih can be tested by using .dat file. Since .dat file and .mtx 
file are similar  in structure (specially afiro_A.dat file is similar to 
amesos2_test_mat0.mtx since they both have 3 columns and the columns represent 
the same properties), I tried to run ex78 by using amesos2_test_mat0.mtx 
instead of afiro_A.dat. However, I got the error "Badly formatted input file". 
Here is the full error message:

[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Badly formatted input file

[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.10.3, Dec, 18, 2018
[0]PETSC ERROR: ./ex78 on a arch-linux2-c-debug named 
7330.wls.metu.edu.tr<http://7330.wls.metu.edu.tr> by edaoktay Tue Feb 12 
10:47:58 2019
[0]PETSC ERROR: Configure options --with-cc=gcc --with-cxx=g++ 
--with-fc=gfortran --with-cxx-dialect=C++11 --download-openblas 
--download-metis --download-parmetis --download-superlu_dist --download-slepc 
--download-mpich
[0]PETSC ERROR: #1 main() line 73 in 
/home/edaoktay/petsc-3.10.3/src/mat/examples/tests/ex78.c
[0]PETSC ERROR: PETSc Option Table entries:
[0]PETSC ERROR: -Ain 
/home/edaoktay/petsc-3.10.3/share/petsc/datafiles/matrices/amesos2_test_mat0.mtx
[0]PETSC ERROR: End of Error Message ---send entire error 
message to petsc-ma...@mcs.anl.gov--
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor

I know there is also an example (ex72) for Matrix Market format but in 
description, it is only proper for symmmetric and lower triangle, so I decided 
to use ex78.

Best regards,

Eda


--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/<http://www.cse.buffalo.edu/~knepley/>


Re: [petsc-users] Problem in loading Matrix Market format

2019-02-12 Thread Zhang, Junchao via petsc-users
Eda,
  I have a code that can read in Matrix Market and write out PETSc binary 
files.  Usage:  mpirun -n 1 ./mm2petsc -fin <Matrix Market file> -fout <PETSc binary file>.  You can 
have a try.
--Junchao Zhang


On Tue, Feb 12, 2019 at 1:50 AM Eda Oktay via petsc-users 
mailto:petsc-users@mcs.anl.gov>> wrote:
Hello,

I am trying to load matrix in Matrix Market format. I found an example on mat 
file  (ex78) whih can be tested by using .dat file. Since .dat file and .mtx 
file are similar  in structure (specially afiro_A.dat file is similar to 
amesos2_test_mat0.mtx since they both have 3 columns and the columns represent 
the same properties), I tried to run ex78 by using amesos2_test_mat0.mtx 
instead of afiro_A.dat. However, I got the error "Badly formatted input file". 
Here is the full error message:

[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Badly formatted input file

[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.10.3, Dec, 18, 2018
[0]PETSC ERROR: ./ex78 on a arch-linux2-c-debug named 
7330.wls.metu.edu.tr by edaoktay Tue Feb 12 
10:47:58 2019
[0]PETSC ERROR: Configure options --with-cc=gcc --with-cxx=g++ 
--with-fc=gfortran --with-cxx-dialect=C++11 --download-openblas 
--download-metis --download-parmetis --download-superlu_dist --download-slepc 
--download-mpich
[0]PETSC ERROR: #1 main() line 73 in 
/home/edaoktay/petsc-3.10.3/src/mat/examples/tests/ex78.c
[0]PETSC ERROR: PETSc Option Table entries:
[0]PETSC ERROR: -Ain 
/home/edaoktay/petsc-3.10.3/share/petsc/datafiles/matrices/amesos2_test_mat0.mtx
[0]PETSC ERROR: End of Error Message ---send entire error 
message to petsc-ma...@mcs.anl.gov--
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor

I know there is also an example (ex72) for the Matrix Market format, but according 
to its description it only handles symmetric, lower-triangular input, so I decided 
to use ex78.

Best regards,

Eda


matrixmarket2petsc.tgz
Description: matrixmarket2petsc.tgz


Re: [petsc-users] Slow linear solver via MUMPS

2019-01-26 Thread Zhang, Junchao via petsc-users



On Fri, Jan 25, 2019 at 8:07 PM Mohammad Gohardoust via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Hi,

I am trying to modify a "pure MPI" code for solving the water movement equation in 
soils, which employs KSP iterative solvers. This code gets really slow on the HPC 
system I am testing as I increase the number of compute nodes (each node 
has 28 cores), even from 1 to 2. So I went for a hybrid "MPI-OpenMP" solution 
like MUMPS. I did this inside PETSc with:

KSPSetType(ksp, KSPPREONLY);
PCSetType(pc, PCLU);
PCFactorSetMatSolverType(pc, MATSOLVERMUMPS);
KSPSolve(ksp, ...

and I run it through:

export OMP_NUM_THREADS=16 && mpirun -n 2 ~/Programs/my_programs

In some cases, I saw multithreaded MUMPS improve performance by about 
30%. I guess something was wrong in your tests. You need to compile MUMPS with 
OpenMP support. If you installed MUMPS through PETSc, you need the PETSc configure 
option --with-openmp=1. In addition, you need a multithreaded BLAS; you can get 
one through --download-openblas.

But first of all, you should add --log_view to report your performance results.
The code is working (on my own PC) but it is too slow (maybe about 50 times 
slower). Since I am not an expert, I would like to know whether this is what I 
should expect from MUMPS.

Thanks,
Mohammad
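
For reference, here is a minimal sketch of the direct-solve setup discussed above, 
with an options hook and an explicit convergence check (A, b and x are assumed to 
be an already assembled Mat and Vecs; the function name is illustrative):

#include <petscksp.h>

/* Sketch: solve A x = b with MUMPS through PETSc's LU factorization interface. */
PetscErrorCode SolveWithMUMPS(Mat A, Vec b, Vec x)
{
  KSP                ksp;
  PC                 pc;
  KSPConvergedReason reason;
  PetscErrorCode     ierr;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPPREONLY); CHKERRQ(ierr);    /* one application of the preconditioner */
  ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
  ierr = PCSetType(pc, PCLU); CHKERRQ(ierr);
  ierr = PCFactorSetMatSolverType(pc, MATSOLVERMUMPS); CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);         /* picks up -ksp_view, -ksp_error_if_not_converged, ... */
  ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
  ierr = KSPGetConvergedReason(ksp, &reason); CHKERRQ(ierr);
  if (reason < 0) { ierr = PetscPrintf(PETSC_COMM_WORLD, "Solve failed: reason %d\n", (int)reason); CHKERRQ(ierr); }
  ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
  return 0;
}

Running with -log_view (and OMP_NUM_THREADS set) then shows where the time goes.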



Re: [petsc-users] MPI Iterative solver crash on HPC

2019-01-25 Thread Zhang, Junchao via petsc-users
Hi, Sal Am,
 I did some tests with your matrix and vector. It is a complex matrix with 
N=4.7M and nnz=417M. First, I tested on a machine with 36 cores and 128GB of 
memory on each compute node. I tried both a direct solver and an iterative solver, 
but both failed. For example, with 36 ranks on one compute node, I got
[9]PETSC ERROR: [9] SuperLU_DIST:pzgssvx line 465 
/blues/gpfs/home/jczhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
[9]PETSC ERROR: [9] MatLUFactorNumeric_SuperLU_DIST line 314 
/blues/gpfs/home/jczhang/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c
  With 16 nodes, 576 ranks. I got
SUPERLU_MALLOC fails for GAstore->rnzval[] at line 240 in file 
/blues/gpfs/home/jczhang/petsc/bdw-dbg-complex/externalpackages/git.superlu_dist/SRC/pzutil.c

  Next, I moved to another single-node machine with 1.5TB of memory. It did not 
fail this time. It ran overnight and is still doing the SuperLU factorization. 
Using the top command, I found that at peak it consumed almost all of the memory. 
In the stable period, with 36 ranks, each rank consumed about 20GB of memory. When 
I switched to an iterative solver with -ksp_type bcgs -pc_type gamg 
-mattransposematmult_via scalable, I did not hit the errors seen on the 
smaller-memory machine, but the residual did not converge.
  So, I think the errors you met were simply out-of-memory errors, either in 
SuperLU or in PETSc. If you have machines with large memory, you can try them. 
Otherwise, I will let other PETSc developers suggest better iterative solvers to 
you.
  Thanks.

--Junchao Zhang


On Wed, Jan 23, 2019 at 2:52 AM Sal Am <tempoho...@gmail.com> wrote:
Sorry it took long; I had to see if I could shrink the problem files down from 
50GB to something smaller (now ~10GB).
Can you compress your matrix and upload it to google drive, so we can try to 
reproduce the error.

How I ran the problem: mpiexec valgrind --tool=memcheck 
--suppressions=$HOME/valgrind/valgrind-openmpi.supp -q --num-callers=20 
--log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres -pc_type 
lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it 1 
-ksp_monitor_true_residual -log_view -ksp_error_if_not_converged

here is the link to matrix A and vector b: 
https://drive.google.com/drive/folders/16YQPTK6TfXC6pV5RMdJ9g7X-ZiqbvwU8?usp=sharing

I redid the problem (twice) by trying to solve a 1M-finite-element problem, 
corresponding to ~4M unknowns and 417M nonzero matrix elements, on the login node, 
which has ~550GB of memory, but it failed. The first time it failed with a bus 
error; the second time it was killed. I have attached the valgrind files from both 
runs.

OpenMPI is not my favorite. You need to use a suppressions file to get rid of 
all of that noise. Here is one:

Thanks, I have been using it, but sometimes I still see the same amount of errors.



On Fri, Jan 18, 2019 at 3:12 AM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
Usually when I meet a SEGV error, I will run it again with a parallel debugger 
like DDT and wait for it to segfault, and then examine the stack trace to see 
what is wrong.
Can you compress your matrix and upload it to google drive, so we can try to 
reproduce the error.
--Junchao Zhang


On Thu, Jan 17, 2019 at 10:44 AM Sal Am via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
I did two runs, one with SuperLU_DIST and one with bcgs using jacobi; attached are 
the results of one of the valgrind reports from one random processor (out of the 
128 files).

DS = direct solver
IS = iterative solver

There are an awful lot of errors.

how I initiated the two runs:
mpiexec valgrind --tool=memcheck -q --num-callers=20 
--log-file=valgrind.log-IS.%p ./solveCSys -malloc off -ksp_type bcgs -pc_type 
jacobi -mattransposematmult_via scalable -build_twosided allreduce -ksp_monitor 
-log_view

mpiexec valgrind --tool=memcheck -q --num-callers=20 
--log-file=valgrind.log-DS.%p ./solveCSys -malloc off -ksp_type gmres -pc_type 
lu -pc_factor_mat_solver_type superlu_dist -ksp_max_it 1 
-ksp_monitor_true_residual -log_view -ksp_error_if_not_converged

Thank you

On Thu, Jan 17, 2019 at 4:24 PM Matthew Knepley <knep...@gmail.com> wrote:
On Thu, Jan 17, 2019 at 9:18 AM Sal Am <tempoho...@gmail.com> wrote:
1) Running out of memory

2) You passed an invalid array
I have select=4:ncpus=32:mpiprocs=32:mem=300GB in the job script, i.e. using 
300GB/node, a total of 1200GB memory, using 4 nodes and 32 processors per node 
(128 processors in total).
I am not sure what would constitute an invalid array or how I can check that. I 
am using the same procedure as when dealing with the smaller matrix, i.e. 
generate matrix A and vector b with the FEM software, convert the matrix and 
vector with a Python script into a form PETSc can read, then read them in PETSc 
and solve.

Are you running with 64-bit ints here?
Yes, I have PETSc configured with --with-64-bit-indices and in debugging mode, 
which is what this run used.

It sounds like you have enough memory, but the fact t

Re: [petsc-users] MPI Iterative solver crash on HPC

2019-01-17 Thread Zhang, Junchao via petsc-users
--
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch 
system) has told this process to end
[0]PETSC ERROR: [1]PETSC ERROR: 

[2]PETSC ERROR: 

[2]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch 
system) has told this process to end
[3]PETSC ERROR: 

[4]PETSC ERROR: 

[4]PETSC ERROR: [5]PETSC ERROR: [6]PETSC ERROR: 

[8]PETSC ERROR: 

[12]PETSC ERROR: 

[12]PETSC ERROR: [14]PETSC ERROR: 

[14]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch 
system) has told this process to end
--
mpiexec noticed that process rank 10 with PID 0 on node r03n01 exited on signal 
9 (Killed).

Now I ran this with valgrind, as someone had previously suggested, and the 16 
files created all contain the same type of error:

Okay, it's possible that there are bugs in the MPI implementation. So

1) Try using -build_twosided allreduce on this run

2) Is it possible to get something that fails here but that we can run? None of 
our tests show this problem.

  Thanks,

 Matt

==25940== Invalid read of size 8
==25940==at 0x5103326: PetscCheckPointer (checkptr.c:81)
==25940==by 0x4F42058: PetscCommGetNewTag (tagm.c:77)
==25940==by 0x4FC952D: PetscCommBuildTwoSidedFReq_Ibarrier (mpits.c:373)
==25940==by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)
==25940==by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
==25940==by 0x52D6B42: VecAssemblyBegin (vector.c:140)
==25940==by 0x5328C97: VecLoad_Binary (vecio.c:141)
==25940==by 0x5329051: VecLoad_Default (vecio.c:516)
==25940==by 0x52E0BAB: VecLoad (vector.c:933)
==25940==by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
==25940==  Address 0x19f807fc is 12 bytes inside a block of size 16 alloc'd
==25940==at 0x4C2A603: memalign (vg_replace_malloc.c:899)
==25940==by 0x4FD0B0E: PetscMallocAlign (mal.c:41)
==25940==by 0x4FD23E7: PetscMallocA (mal.c:397)
==25940==by 0x4FC948E: PetscCommBuildTwoSidedFReq_Ibarrier (mpits.c:371)
==25940==by 0x4FCB29B: PetscCommBuildTwoSidedFReq (mpits.c:572)
==25940==by 0x52BBFF4: VecAssemblyBegin_MPI_BTS (pbvec.c:251)
==25940==by 0x52D6B42: VecAssemblyBegin (vector.c:140)
==25940==by 0x5328C97: VecLoad_Binary (vecio.c:141)
==25940==by 0x5329051: VecLoad_Default (vecio.c:516)
==25940==by 0x52E0BAB: VecLoad (vector.c:933)
==25940==by 0x4013D5: main (solveCmplxLinearSys.cpp:31)
==25940==


On Mon, Jan 14, 2019 at 7:29 PM Zhang, Hong <hzh...@mcs.anl.gov> wrote:
Fande:
According to this PR 
https://bitbucket.org/petsc/petsc/pull-requests/1061/a_selinger-feature-faster-scalable/diff

Should we set the scalable algorithm as default?
Sure, we can. But I feel we need to do more tests to compare the scalable and 
non-scalable algorithms.
In theory, for small to medium matrices, the non-scalable matmatmult() algorithm 
enables more efficient data access. Andreas optimized the scalable implementation. 
Our non-scalable implementation might have room for further optimization.
Hong

On Fri, Jan 11, 2019 at 10:34 AM Zhang, Hong via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Add option '-mattransposematmult_via scalable'
Hong
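
If you prefer to set this in code rather than on the command line, one way (a 
sketch; it has to run before the preconditioner is set up) is:

/* Equivalent to passing -mattransposematmult_via scalable on the command line. */
ierr = PetscOptionsSetValue(NULL, "-mattransposematmult_via", "scalable"); CHKERRQ(ierr);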

On Fri, Jan 11, 2019 at 9:52 AM Zhang, Junchao via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
I saw the following error message in your first email.
[0]PETSC ERROR: Out of memory. This could be due to allocating
[0]PETSC ERROR: too large an object or bleeding by not properly
[0]PETSC ERROR: destroying unneeded objects.
Probably the matrix is too large. You can try with more compute nodes, for 
example, use 8 nodes instead of 2, and see what happens.

--Junchao Zhang


On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Using a larger problem set with 2B non-zero elements and a matrix of 25M x 25M 
I get the following error:
[4]PETSC ERROR: 

[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[4]PETSC ERROR: or see 
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to 
find memory 

Re: [petsc-users] MPI Iterative solver crash on HPC

2019-01-11 Thread Zhang, Junchao via petsc-users
I saw the following error message in your first email.
[0]PETSC ERROR: Out of memory. This could be due to allocating
[0]PETSC ERROR: too large an object or bleeding by not properly
[0]PETSC ERROR: destroying unneeded objects.
Probably the matrix is too large. You can try with more compute nodes, for 
example, use 8 nodes instead of 2, and see what happens.

--Junchao Zhang


On Fri, Jan 11, 2019 at 7:45 AM Sal Am via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Using a larger problem set with 2B non-zero elements and a matrix of 25M x 25M 
I get the following error:
[4]PETSC ERROR: 

[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[4]PETSC ERROR: or see 
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[4]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to 
find memory corruption errors
[4]PETSC ERROR: likely location of problem given in stack below
[4]PETSC ERROR: -  Stack Frames 

[4]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[4]PETSC ERROR:   INSTEAD the line number of the start of the function
[4]PETSC ERROR:   is given.
[4]PETSC ERROR: [4] MatCreateSeqAIJWithArrays line 4422 
/lustre/home/vef002/petsc/src/mat/impls/aij/seq/aij.c
[4]PETSC ERROR: [4] MatMatMultSymbolic_SeqAIJ_SeqAIJ line 747 
/lustre/home/vef002/petsc/src/mat/impls/aij/seq/matmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable line 
1256 /lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMult_MPIAIJ_MPIAIJ line 1156 
/lustre/home/vef002/petsc/src/mat/impls/aij/mpi/mpimatmatmult.c
[4]PETSC ERROR: [4] MatTransposeMatMult line 9950 
/lustre/home/vef002/petsc/src/mat/interface/matrix.c
[4]PETSC ERROR: [4] PCGAMGCoarsen_AGG line 871 
/lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/agg.c
[4]PETSC ERROR: [4] PCSetUp_GAMG line 428 
/lustre/home/vef002/petsc/src/ksp/pc/impls/gamg/gamg.c
[4]PETSC ERROR: [4] PCSetUp line 894 
/lustre/home/vef002/petsc/src/ksp/pc/interface/precon.c
[4]PETSC ERROR: [4] KSPSetUp line 304 
/lustre/home/vef002/petsc/src/ksp/ksp/interface/itfunc.c
[4]PETSC ERROR: - Error Message 
--
[4]PETSC ERROR: Signal received
[4]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[4]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[4]PETSC ERROR: ./solveCSys on a linux-cumulus-debug named r02g03 by vef002 Fri 
Jan 11 09:13:23 2019
[4]PETSC ERROR: Configure options PETSC_ARCH=linux-cumulus-debug 
--with-cc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicc 
--with-fc=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpifort 
--with-cxx=/usr/local/depot/openmpi-3.1.1-gcc-7.3.0/bin/mpicxx 
--download-parmetis --download-metis --download-ptscotch 
--download-superlu_dist --download-mumps --with-scalar-type=complex 
--with-debugging=yes --download-scalapack --download-superlu 
--download-fblaslapack=1 --download-cmake
[4]PETSC ERROR: #1 User provided function() line 0 in  unknown file
--
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
[0]PETSC ERROR: 

[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch 
system) has told this process to end
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see 
http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind

Looking at just one of the valgrind files, the following error was reported:

==9053== Invalid read of size 4
==9053==at 0x5B8067E: MatCreateSeqAIJWithArrays (aij.c:4445)
==9053==by 0x5BC2608: MatMatMultSymbolic_SeqAIJ_SeqAIJ (matmatmult.c:790)
==9053==by 0x5D106F8: MatTransposeMatMultSymbolic_MPIAIJ_MPIAIJ_nonscalable 
(mpimatmatmult.c:1337)
==9053==by 0x5D0E84E: MatTransposeMatMult_MPIAIJ_MPIAIJ 
(mpimatmatmult.c:1186)
==9053==by 0x5457C57: MatTransposeMatMult (matrix.c:9984)
==9053==by 0x64DD99D: PCGAMGCoarsen_AGG (agg.c:882)
==9053==by 0x64C7527: PCSetUp_GAMG (gamg.c:522)
==9053==by 0x6592AA0: PCSetUp (precon.c:932)
==9053==by 0x66B1267: KSPSetUp (itfunc.c:391)
==9053==by 0x4019A2: main (solveCmplxLinearSys.cpp:68)
==9053==  Address 0x8386997f4 is not stack'd, malloc'd or (recently) free'd
==9053==


On Fri, Jan 11, 2019 

Re: [petsc-users] Dynamically resize the existing PetscVector

2018-12-17 Thread Zhang, Junchao via petsc-users
Or, you can have your own array and then create PETSc vectors with 
VecCreateGhostWithArray, so that the memory resizing is managed by yourself.
--Junchao Zhang
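
A rough sketch of that approach (names and sizes are illustrative; on the first 
call *v and *buf are assumed to be NULL, and the user-owned buffer must hold the 
owned entries plus the ghost entries):

#include <petscvec.h>

/* Sketch: keep the storage yourself and wrap it in a ghosted Vec.  When the mesh
   changes, destroy only the Vec wrapper, resize your own buffer, and wrap it
   again with the new local and ghost sizes. */
PetscErrorCode RebuildGhostVec(PetscInt nlocal, PetscInt nghost, const PetscInt ghosts[],
                               PetscScalar **buf, Vec *v)
{
  PetscErrorCode ierr;

  if (*v)   { ierr = VecDestroy(v); CHKERRQ(ierr); }        /* drop the old wrapper only */
  if (*buf) { ierr = PetscFree(*buf); CHKERRQ(ierr); }      /* resize the user-owned storage */
  ierr = PetscMalloc1(nlocal + nghost, buf); CHKERRQ(ierr); /* owned + ghost entries */
  ierr = VecCreateGhostWithArray(PETSC_COMM_WORLD, nlocal, PETSC_DECIDE,
                                 nghost, ghosts, *buf, v); CHKERRQ(ierr);
  return 0;
}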


On Mon, Dec 17, 2018 at 9:15 AM Shidi Yan via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Hello,

I am working on adaptive moving mesh problems. Therefore, the PETSc
vector size is constantly changing.
The way I am currently dealing with this change is to destroy the PETSc vector
first with VecDestroy() and then create a new vector with VecCreateGhost().
But I think this is not a very efficient way. So I am wondering whether there is
any way to resize an existing PETSc Vec dynamically.

Thank you for your time.

Kind Regards,
Shidi


Re: [petsc-users] MUMPS Error

2018-12-12 Thread Zhang, Junchao via petsc-users


On Wed, Dec 12, 2018 at 7:14 AM Sal Am via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Hi, I am getting an error using MUMPS.
This is how I run it:
bash-4.2$ mpiexec -n 16 ./solveCSys -ksp_type richardson -pc_type lu 
-pc_factor_mat_solver_type mumps -ksp_max_it 1 -ksp_monitor_true_residual 
-log_view -ksp_error_if_not_converged

The error output:
[0]PETSC ERROR: - Error Message 
--
[0]PETSC ERROR: Error in external library
[0]PETSC ERROR: [1]PETSC ERROR: - Error Message 
--
[1]PETSC ERROR: Error in external library
[1]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: 
INFOG(1)=-13, INFO(2)=0

[1]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[6]PETSC ERROR: - Error Message 
--
[6]PETSC ERROR: Error in external library
[6]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: 
INFOG(1)=-13, INFO(2)=0

[6]PETSC ERROR: [8]PETSC ERROR: - Error Message 
--
[8]PETSC ERROR: Error in external library
[8]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: 
INFOG(1)=-13, INFO(2)=0

[8]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[8]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[8]PETSC ERROR: [9]PETSC ERROR: [15]PETSC ERROR: - Error 
Message --
[15]PETSC ERROR: Error in external library
[15]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: 
INFOG(1)=-13, INFO(2)=-36536

[15]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[15]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[15]PETSC ERROR: ./solveCSys on a linux-opt named r02g03 by vef002 Wed Dec 12 
10:44:13 2018
[15]PETSC ERROR: Configure options PETSC_ARCH=linux-opt --with-cc=gcc 
--with-fc=gfortran --with-cxx=g++ --with-clanguage=cxx --download-superlu_dist 
--download-mumps --with-scalar-type=complex --with-debugging=no 
--download-scalapack --download-superlu --download-mpich 
--download-fblaslapack=1 --download-cmake
[15]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: 
INFOG(1)=-13, INFO(2)=-81813

[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[1]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[1]PETSC ERROR: ./solveCSys on a linux-opt named r02g03 by vef002 Wed Dec 12 
10:44:13 2018
[1]PETSC ERROR: Configure options PETSC_ARCH=linux-opt --with-cc=gcc 
--with-fc=gfortran --with-cxx=g++ --with-clanguage=cxx --download-superlu_dist 
--download-mumps --with-scalar-type=complex --with-debugging=no 
--download-scalapack --download-superlu --download-mpich 
--download-fblaslapack=1 --download-cmake
[1]PETSC ERROR: [2]PETSC ERROR: - Error Message 
--
[2]PETSC ERROR: Error in external library
[2]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: 
INFOG(1)=-13, INFO(2)=0

[2]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[2]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[2]PETSC ERROR: ./solveCSys on a linux-opt named r02g03 by vef002 Wed Dec 12 
10:44:13 2018
[2]PETSC ERROR: Configure options PETSC_ARCH=linux-opt --with-cc=gcc 
--with-fc=gfortran --with-cxx=g++ --with-clanguage=cxx --download-superlu_dist 
--download-mumps --with-scalar-type=complex --with-debugging=no 
--download-scalapack --download-superlu --download-mpich 
--download-fblaslapack=1 --download-cmake
[2]PETSC ERROR: [3]PETSC ERROR: - Error Message 
--
[3]PETSC ERROR: Error in external library
[3]PETSC ERROR: Error reported by MUMPS in numerical factorization phase: 
INFOG(1)=-13, INFO(2)=-28194

[3]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for 
trouble shooting.
[3]PETSC ERROR: Petsc Release Version 3.10.2, unknown
[3]PETSC ERROR: ./solveCSys on a linux-opt named r02g03 by vef002 Wed Dec 12 
10:44:13 2018
[3]PETSC ERROR: Configure options PETSC_ARCH=linux-opt --with-cc=gcc 
--with-fc=gfortran --with-cxx=g++ --with-clanguage=cxx --download-superlu_dist 
--download-mumps --with-scalar-type=complex --with-debugging=no 
--download-scalapack --download-superlu --download-mpich 
--download-fblaslapack=1 --download-cmake
[3]PETSC ERROR: [4]PETSC ERROR: - Error Message 
--
[4]PETSC ERROR: Error in external library
[4]PETSC ERROR: Error 

Re: [petsc-users] Compile petsc using intel mpi

2018-11-27 Thread Zhang, Junchao via petsc-users


On Tue, Nov 27, 2018 at 5:25 AM Edoardo alinovi via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Dear users,

I have installed Intel Parallel Studio on my workstation, and thus I would like 
to take advantage of the Intel compiler.

Before messing up my installation, do you have any guidelines for surviving this 
attempt? I found the following instructions here in the mailing list:

--with-cc=icc --with-fc=ifort --with-mpi-include=/path-to-intel 
--with-mpi-lib=/path-to-intel

Are they correct?
I think so. But you may also add PETSC_DIR=/path-to-petsc 
PETSC_ARCH=name-for-this-build


Also, I have an existing, clean installation of PETSc using OpenMPI. 
I would like to retain this installation, since it is working very well, and be 
able to switch between the two somehow. Any tips on this?
Use PETSC_ARCH in PETSc configure to differentiate different builds, and export 
PETSC_ARCH in your environment to select the one you want to use. See more at 
https://www.mcs.anl.gov/petsc/documentation/installation.html
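
For example (the arch names are placeholders, and /path-to-intel is the same 
placeholder as above), the two builds can live side by side under one PETSC_DIR:

./configure PETSC_ARCH=arch-openmpi-opt --with-cc=mpicc --with-fc=mpif90 --with-debugging=0
./configure PETSC_ARCH=arch-intel-opt --with-cc=icc --with-fc=ifort --with-mpi-include=/path-to-intel --with-mpi-lib=/path-to-intel --with-debugging=0
export PETSC_ARCH=arch-intel-opt

Each configure (followed by the usual make for that PETSC_ARCH) builds into its 
own subdirectory, so switching back is just a matter of exporting the other 
PETSC_ARCH.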


I will never stop saying thank you for your precious support!

Edoardo

--

Edoardo Alinovi, Ph.D.

DICCA, Scuola Politecnica,
Universita' degli Studi di Genova,
1, via Montallegro,
16145 Genova, Italy




Re: [petsc-users] PetscInt overflow

2018-10-19 Thread Zhang, Junchao

On Fri, Oct 19, 2018 at 4:02 AM Jan Grießer <griesser@googlemail.com> wrote:
With more than 1 MPI process, do you mean I should use spectrum slicing to divide 
the full problem into smaller subproblems?
The --with-64-bit-indices option is not a possibility for me, since I configured 
PETSc with MUMPS, which does not allow using the 64-bit version (at least this was 
the error message when I tried to configure PETSc).

MUMPS 5.1.2 manual chapter 2.4.2 says it supports "Selective 64-bit integer 
feature" and "full 64-bit integer version" as well.
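
A quick check of the numbers in that message: 4001 x 768000 = 3,072,768,000, which 
exceeds the 32-bit PetscInt maximum of 2^31 - 1 = 2,147,483,647. With -bv_type vecs 
each column is a separate Vec of length 768000, so no single allocation of that 
product is needed; with --with-64-bit-indices the product fits in a PetscInt.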

On Wed, Oct 17, 2018 at 18:24, Jose E. Roman <jro...@dsic.upv.es> wrote:
To use BVVECS, just add the command-line option -bv_type vecs.
This causes it to use a separate Vec for each column, instead of a single long Vec 
of size n*m. But it is considerably slower than the default.

Anyway, for such large problems you should consider using more than 1 MPI 
process. In that case the error may disappear because the local size is smaller 
than 768000.

Jose


> On Oct 17, 2018, at 17:58, Matthew Knepley <knep...@gmail.com> wrote:
>
> On Wed, Oct 17, 2018 at 11:54 AM Jan Grießer <griesser@googlemail.com> wrote:
> Hi all,
> I am using slepc4py and petsc4py to solve for the smallest real eigenvalues 
> and eigenvectors. For my test cases with a matrix A of size 30k x 30k, 
> solving for the smallest solutions works quite well, but when I increase the 
> dimension of my system to around A = 768000 x 768000 or 3 million x 3 million 
> and ask for the smallest 3000 real eigenvalues and eigenvectors (the number 
> increases with increasing system size), I get the output (for the 768000 case):
>  The product 4001 times 768000 overflows the size of PetscInt; consider 
> reducing the number of columns, or use BVVECS instead
> I understand that the requested number of eigenvectors and eigenvalues is 
> causing an overflow, but I do not understand the solution to the problem that 
> is stated in the error message. Can someone tell me what exactly BVVECS is 
> and how I can use it? Or is there any other solution to my problem?
>
> You can also reconfigure with 64-bit integers: --with-64-bit-indices
>
>   Thanks,
>
> Matt
>
> Thank you very much in advance,
> Jan
>
>
>
> --
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/



Re: [petsc-users] Failure of MUMPS

2018-10-09 Thread Zhang, Junchao
OK, I found -ksp_error_if_not_converged will trigger PETSc to fail in this case.

--Junchao Zhang


On Tue, Oct 9, 2018 at 3:38 PM Junchao Zhang <jczh...@mcs.anl.gov> wrote:
I met a case where MUMPS returned an out-of-memory code but PETSc continued to 
run.  When PETSc calls MUMPS, it checks if (A->erroriffailure). I added 
-mat_error_if_failure, but it did not work since it was overwritten by 
MatSetErrorIfFailure(pc->pmat,pc->erroriffailure)
Does it suggest we should add a new option -pc_factor_error_if_failure and 
check it in PCSetFromOptions_Factor()?

--Junchao Zhang
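
Until an option like that exists, here is a user-side sketch of checking for a 
failed MUMPS factorization after the solve (it assumes the PC is an LU/Cholesky 
factorization backed by MUMPS; the function name is illustrative):

#include <petscksp.h>

/* Sketch: after KSPSolve, query both the KSP convergence reason and MUMPS's own
   INFOG(1) status on the factored matrix, rather than relying on the
   erroriffailure flags. */
PetscErrorCode CheckMumpsStatus(KSP ksp)
{
  PC                 pc;
  Mat                F;
  KSPConvergedReason reason;
  PetscInt           infog1;
  PetscErrorCode     ierr;

  ierr = KSPGetConvergedReason(ksp, &reason); CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
  ierr = PCFactorGetMatrix(pc, &F); CHKERRQ(ierr);       /* the factored matrix held by the PC */
  ierr = MatMumpsGetInfog(F, 1, &infog1); CHKERRQ(ierr); /* MUMPS INFOG(1) < 0 signals an error, e.g. -13 */
  if (reason < 0 || infog1 < 0) {
    ierr = PetscPrintf(PETSC_COMM_WORLD, "Solve failed: KSP reason %d, MUMPS INFOG(1) %d\n",
                       (int)reason, (int)infog1); CHKERRQ(ierr);
  }
  return 0;
}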

On Fri, Oct 5, 2018 at 8:12 PM Zhang, Hong <hzh...@mcs.anl.gov> wrote:
Mike:
Hello PETSc team:

I am trying to solve a PDE problem with high-order finite elements. The matrix 
is getting denser and my experience is that MUMPS just outperforms iterative 
solvers.

For certain problems, MUMPS just fails in the middle for no clear reason. I just 
wonder if there is any suggestion to improve the robustness of MUMPS? Or, in 
general, any suggestion for an iterative solver with very high-order finite 
elements?

What error message do you get when MUMPS fails? Out of memory, zero pivoting, 
or something?
 Hong


Re: [petsc-users] Failure of MUMPS

2018-10-09 Thread Zhang, Junchao
I met a case where MUMPS returned an out-of-memory code but PETSc continued to 
run.  When PETSc calls MUMPS, it checks if (A->erroriffailure). I added 
-mat_error_if_failure, but it did not work since it was overwritten by 
MatSetErrorIfFailure(pc->pmat,pc->erroriffailure)
Does it suggest we should add a new option -pc_factor_error_if_failure and 
check it in PCSetFromOptions_Factor()?

--Junchao Zhang

On Fri, Oct 5, 2018 at 8:12 PM Zhang, Hong <hzh...@mcs.anl.gov> wrote:
Mike:
Hello PETSc team:

I am trying to solve a PDE problem with high-order finite elements. The matrix 
is getting denser and my experience is that MUMPS just outperforms iterative 
solvers.

For certain problems, MUMPS just fails in the middle for no clear reason. I just 
wonder if there is any suggestion to improve the robustness of MUMPS? Or, in 
general, any suggestion for an iterative solver with very high-order finite 
elements?

What error message do you get when MUMPS fails? Out of memory, zero pivoting, 
or something?
 Hong