Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Fande Kong
In case someone wants to learn more about the hierarchical partitioning 
algorithm, here is a reference:

https://arxiv.org/pdf/1809.02666.pdf

Thanks 

Fande 
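
For anyone who wants to try it, here is a minimal sketch of selecting the
hierarchical partitioner from the options database, for a code that partitions
its mesh or matrix through PETSc's MatPartitioning interface (ex45 uses a
structured DMDA, so it does not call MatPartitioning itself). The application
name and the exact option spellings below are assumptions; verify them with
-help or the MATPARTITIONINGHIERARCH manual page linked later in this thread.

  # Assumed option names: ncoarseparts ~ number of nodes,
  # nfineparts ~ MPI ranks per node; my_app is a placeholder executable.
  mpirun -n 64 ./my_app \
    -mat_partitioning_type hierarch \
    -mat_partitioning_hierarchical_ncoarseparts 8 \
    -mat_partitioning_hierarchical_nfineparts 8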


> On Mar 25, 2020, at 5:18 PM, Mark Adams  wrote:
> 
> 
> 
> 
>> On Wed, Mar 25, 2020 at 6:40 PM Fande Kong  wrote:
>>> 
>>> 
 On Wed, Mar 25, 2020 at 12:18 PM Mark Adams  wrote:
 Also, a better test is to see where streams pretty much saturates, then run 
 that many processors per node and do the same test by increasing the 
 nodes. This will tell you how well your network communication is doing.
 
 But this result has a lot of stuff in "network communication" that can be 
 further evaluated. The worst thing about this, I would think, is that the 
 partitioning is blind to the memory hierarchy of inter and intra node 
 communication.
>>> 
>>> Hierarchical partitioning was designed for this purpose. 
>>> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/MatOrderings/MATPARTITIONINGHIERARCH.html#MATPARTITIONINGHIERARCH
>>> 
>> 
>> That's fantastic!
>>  
>> Fande,
>>  
>>> The next thing to do is run with an initial grid that puts one cell per 
>>> node and then do uniform refinement, until you have one cell per process 
>>> (eg, one refinement step using 8 processes per node), partition to get one 
>>> cell per process, then do uniform refinement to get a reasonably sized 
>>> local problem. Alas, this is not easy to do, but it is doable.
>>> 
 On Wed, Mar 25, 2020 at 2:04 PM Mark Adams  wrote:
 I would guess that you are saturating the memory bandwidth. After you make 
 PETSc (make all) it will suggest that you test it (make test) and suggest 
 that you run streams (make streams).
 
 I see Matt answered but let me add that when you make streams you will 
 see the memory rate for 1,2,3, ... NP processes. If your machine is 
 decent you should see very good speed up at the beginning and then it will 
 start to saturate. You are seeing about 50% of perfect speedup at 16 
 processes. I would expect that you will see something similar with streams. 
 Without knowing your machine, your results look typical.
 
> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi  
> wrote:
> Hi,
> 
> I ran KSP example 45 on a single node with 32 cores and 125GB memory 
> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent 
> during KSP.solve:
> 
> - 1 MPI process: ~98 sec, speedup: 1X
> - 16 MPI processes: ~12 sec, speedup: ~8X
> - 32 MPI processes: ~11 sec, speedup: ~9X
> 
> Since the problem size is large enough (8M unknowns), I expected a 
> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how 
> can it be improved?
> 
> I've attached three log files for more details. 
> 
> Sincerely,
> Amin


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Mark Adams
On Wed, Mar 25, 2020 at 6:40 PM Fande Kong  wrote:

>
>
> On Wed, Mar 25, 2020 at 12:18 PM Mark Adams  wrote:
>
>> Also, a better test is to see where streams pretty much saturates, then run
>> that many processors per node and do the same test by increasing the nodes.
>> This will tell you how well your network communication is doing.
>>
>> But this result has a lot of stuff in "network communication" that can be
>> further evaluated. The worst thing about this, I would think, is that the
>> partitioning is blind to the memory hierarchy of inter and intra node
>> communication.
>>
>
> Hierarchical partitioning was designed for this purpose.
> https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/MatOrderings/MATPARTITIONINGHIERARCH.html#MATPARTITIONINGHIERARCH
>
>
That's fantastic!


> Fande,
>
>
>> The next thing to do is run with an initial grid that puts one cell per
>> node and then do uniform refinement, until you have one cell per process
>> (eg, one refinement step using 8 processes per node), partition to get one
>> cell per process, then do uniform refinement to get a reasonably sized
>> local problem. Alas, this is not easy to do, but it is doable.
>>
>> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams  wrote:
>>
>>> I would guess that you are saturating the memory bandwidth. After
>>> you make PETSc (make all) it will suggest that you test it (make test) and
>>> suggest that you run streams (make streams).
>>>
>>> I see Matt answered but let me add that when you make streams you will
>>> see the memory rate for 1,2,3, ... NP processes. If your machine is decent
>>> you should see very good speed up at the beginning and then it will start
>>> to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
>>> would expect that you will see something similar with streams. Without
>>> knowing your machine, your results look typical.
>>>
>>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi 
>>> wrote:
>>>
 Hi,

 I ran KSP example 45 on a single node with 32 cores and 125GB memory
 using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
 during KSP.solve:

 - 1 MPI process: ~98 sec, speedup: 1X
 - 16 MPI processes: ~12 sec, speedup: ~8X
 - 32 MPI processes: ~11 sec, speedup: ~9X

 Since the problem size is large enough (8M unknowns), I expected a
 speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
 can it be improved?

 I've attached three log files for more details.

 Sincerely,
 Amin

>>>


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Zhang, Junchao via petsc-users

MPI rank distribution (e.g., 8 ranks per node or 16 ranks per node) is usually 
managed by workload managers such as Slurm or PBS through your job scripts, which is 
out of PETSc's control.
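
For example, a minimal Slurm batch script that places 8 MPI ranks on each of 4
nodes could look like the sketch below (account/partition directives, module
loads, and the choice of srun vs. mpirun are cluster-specific and omitted):

  #!/bin/bash
  #SBATCH --nodes=4             # number of nodes
  #SBATCH --ntasks-per-node=8   # MPI ranks per node
  #SBATCH --time=00:30:00

  # 4 nodes x 8 ranks/node = 32 MPI ranks in total
  srun ./ex45 -da_grid_x 200 -da_grid_y 200 -da_grid_z 200 -ksp_monitor -log_view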

From: Amin Sadeghi 
Date: Wednesday, March 25, 2020 at 4:40 PM
To: Junchao Zhang 
Cc: Mark Adams , PETSc users list 
Subject: Re: [petsc-users] Poor speed up for KSP example 45

Junchao, thank you for doing the experiment. I guess TACC Frontera nodes have 
higher memory bandwidth (maybe a more modern CPU architecture, although I'm not 
sure which hardware features affect memory bandwidth) than Compute Canada's 
Graham.

Mark, I did as you suggested. As you suspected, running make streams yielded 
the same results, indicating that the memory bandwidth saturated at around 8 
MPI processes. I ran the experiment on multiple nodes but only requested 8 
cores per node, and here is the result:

1 node (8 cores total): 17.5s, 6X speedup
2 nodes (16 cores total): 13.5s, 7X speedup
3 nodes (24 cores total): 9.4s, 10X speedup
4 nodes (32 cores total): 8.3s, 12X speedup
5 nodes (40 cores total): 7.0s, 14X speedup
6 nodes (48 cores total): 61.4s, 2X speedup [!!!]
7 nodes (56 cores total): 4.3s, 23X speedup
8 nodes (64 cores total): 3.7s, 27X speedup

Note: as you can see, the experiment with 6 nodes showed extremely poor 
scaling, which I guess was an outlier, maybe due to some connection problem?

I also ran another experiment, requesting 2 full nodes, i.e. 64 cores, and 
here's the result:

2 nodes (64 cores total): 6.0s, 16X speedup [32 cores each node]

So, it turns out that given a fixed number of cores, i.e. 64 in our case, much 
better speedups (27X vs. 16X in our case) can be achieved if they are 
distributed among separate nodes.

Anyways, I really appreciate all your inputs.

One final question: From what I understand from Mark's comment, PETSc at the 
moment is blind to the memory hierarchy. Is it feasible to make PETSc aware of 
inter- and intra-node communication so that partitioning is done to maximize 
performance? Or, to put it differently, is this something that PETSc devs have 
their eyes on for the future?


Sincerely,
Amin


On Wed, Mar 25, 2020 at 3:51 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
I repeated your experiment on one node of TACC Frontera,
1 rank: 85.0s
16 ranks: 8.2s, 10x speedup
32 ranks: 5.7s, 15x speedup

--Junchao Zhang


On Wed, Mar 25, 2020 at 1:18 PM Mark Adams <mfad...@lbl.gov> wrote:
Also, a better test is to see where streams pretty much saturates, then run that 
many processors per node and do the same test by increasing the nodes. This 
will tell you how well your network communication is doing.

But this result has a lot of stuff in "network communication" that can be 
further evaluated. The worst thing about this, I would think, is that the 
partitioning is blind to the memory hierarchy of inter and intra node 
communication. The next thing to do is run with an initial grid that puts one 
cell per node and then do uniform refinement, until you have one cell per 
process (eg, one refinement step using 8 processes per node), partition to get 
one cell per process, then do uniform refinement to get a reasonably sized 
local problem. Alas, this is not easy to do, but it is doable.

On Wed, Mar 25, 2020 at 2:04 PM Mark Adams <mfad...@lbl.gov> wrote:
I would guess that you are saturating the memory bandwidth. After you make 
PETSc (make all) it will suggest that you test it (make test) and suggest that 
you run streams (make streams).

I see Matt answered but let me add that when you make streams you will see the 
memory rate for 1,2,3, ... NP processes. If your machine is decent you should 
see very good speed up at the beginning and then it will start to saturate. You 
are seeing about 50% of perfect speedup at 16 processes. I would expect that you 
will see something similar with streams. Without knowing your machine, your 
results look typical.

On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi <aminthefr...@gmail.com> wrote:
Hi,

I ran KSP example 45 on a single node with 32 cores and 125GB memory using 1, 
16 and 32 MPI processes. Here's a comparison of the time spent during KSP.solve:

- 1 MPI process: ~98 sec, speedup: 1X
- 16 MPI processes: ~12 sec, speedup: ~8X
- 32 MPI processes: ~11 sec, speedup: ~9X

Since the problem size is large enough (8M unknowns), I expected a speedup much 
closer to 32X, rather than 9X. Is this expected? If yes, how can it be improved?

I've attached three log files for more details.

Sincerely,
Amin


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Fande Kong
On Wed, Mar 25, 2020 at 12:18 PM Mark Adams  wrote:

> Also, a better test is to see where streams pretty much saturates, then run
> that many processors per node and do the same test by increasing the nodes.
> This will tell you how well your network communication is doing.
>
> But this result has a lot of stuff in "network communication" that can be
> further evaluated. The worst thing about this, I would think, is that the
> partitioning is blind to the memory hierarchy of inter and intra node
> communication.
>

Hierarchical partitioning was designed for this purpose.
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/MatOrderings/MATPARTITIONINGHIERARCH.html#MATPARTITIONINGHIERARCH

Fande,


> The next thing to do is run with an initial grid that puts one cell per
> node and then do uniform refinement, until you have one cell per process
> (eg, one refinement step using 8 processes per node), partition to get one
> cell per process, then do uniform refinement to get a reasonably sized
> local problem. Alas, this is not easy to do, but it is doable.
>
> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams  wrote:
>
>> I would guess that you are saturating the memory bandwidth. After
>> you make PETSc (make all) it will suggest that you test it (make test) and
>> suggest that you run streams (make streams).
>>
>> I see Matt answered but let me add that when you make streams you will
>> see the memory rate for 1,2,3, ... NP processes. If your machine is decent
>> you should see very good speed up at the beginning and then it will start
>> to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
>> would expect that you will see something similar with streams. Without
>> knowing your machine, your results look typical.
>>
>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi 
>> wrote:
>>
>>> Hi,
>>>
>>> I ran KSP example 45 on a single node with 32 cores and 125GB memory
>>> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
>>> during KSP.solve:
>>>
>>> - 1 MPI process: ~98 sec, speedup: 1X
>>> - 16 MPI processes: ~12 sec, speedup: ~8X
>>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>>
>>> Since the problem size is large enough (8M unknowns), I expected a
>>> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
>>> can it be improved?
>>>
>>> I've attached three log files for more details.
>>>
>>> Sincerely,
>>> Amin
>>>
>>


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Amin Sadeghi
That's great. Thanks for creating this great piece of software!

Amin

On Wed, Mar 25, 2020 at 5:56 PM Matthew Knepley  wrote:

> On Wed, Mar 25, 2020 at 5:41 PM Amin Sadeghi 
> wrote:
>
>> Junchao, thank you for doing the experiment. I guess TACC Frontera nodes
>> have higher memory bandwidth (maybe a more modern CPU architecture, although
>> I'm not sure which hardware features affect memory bandwidth) than Compute
>> Canada's Graham.
>>
>> Mark, I did as you suggested. As you suspected, running make streams
>> yielded the same results, indicating that the memory bandwidth saturated at
>> around 8 MPI processes. I ran the experiment on multiple nodes but only
>> requested 8 cores per node, and here is the result:
>>
>> 1 node (8 cores total): 17.5s, 6X speedup
>> 2 nodes (16 cores total): 13.5s, 7X speedup
>> 3 nodes (24 cores total): 9.4s, 10X speedup
>> 4 nodes (32 cores total): 8.3s, 12X speedup
>> 5 nodes (40 cores total): 7.0s, 14X speedup
>> 6 nodes (48 cores total): 61.4s, 2X speedup [!!!]
>> 7 nodes (56 cores total): 4.3s, 23X speedup
>> 8 nodes (64 cores total): 3.7s, 27X speedup
>>
>> *Note:* as you can see, the experiment with 6 nodes showed extremely
>> poor scaling, which I guess was an outlier, maybe due to some connection
>> problem?
>>
>> I also ran another experiment, requesting 2 full nodes, i.e. 64 cores,
>> and here's the result:
>>
>> 2 nodes (64 cores total): 6.0s, 16X speedup [32 cores each node]
>>
>> So, it turns out that given a fixed number of cores, i.e. 64 in our case,
>> much better speedups (27X vs. 16X in our case) can be achieved if they are
>> distributed among separate nodes.
>>
>> Anyways, I really appreciate all your inputs.
>>
>> *One final question:* From what I understand from Mark's comment, PETSc
>> at the moment is blind to the memory hierarchy. Is it feasible to make PETSc
>> aware of inter- and intra-node communication so that partitioning is
>> done to maximize performance? Or, to put it differently, is this something
>> that PETSc devs have their eyes on for the future?
>>
>
> There is already stuff in VecScatter that knows about the memory
> hierarchy, which Junchao put in. We are actively working on some other
> node-aware algorithms.
>
>   Thanks,
>
>  Matt
>
>
>> Sincerely,
>> Amin
>>
>>
>> On Wed, Mar 25, 2020 at 3:51 PM Junchao Zhang 
>> wrote:
>>
>>> I repeated your experiment on one node of TACC Frontera,
>>> 1 rank: 85.0s
>>> 16 ranks: 8.2s, 10x speedup
>>> 32 ranks: 5.7s, 15x speedup
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Wed, Mar 25, 2020 at 1:18 PM Mark Adams  wrote:
>>>
 Also, a better test is to see where streams pretty much saturates, then
 run that many processors per node and do the same test by increasing the
 nodes. This will tell you how well your network communication is doing.

 But this result has a lot of stuff in "network communication" that can
 be further evaluated. The worst thing about this, I would think, is that
 the partitioning is blind to the memory hierarchy of inter and intra node
 communication. The next thing to do is run with an initial grid that puts
 one cell per node and then do uniform refinement, until you have one cell
 per process (eg, one refinement step using 8 processes per node), partition
 to get one cell per process, then do uniform refinement to get a
 reasonably sized local problem. Alas, this is not easy to do, but it is
 doable.

 On Wed, Mar 25, 2020 at 2:04 PM Mark Adams  wrote:

> I would guess that you are saturating the memory bandwidth. After
> you make PETSc (make all) it will suggest that you test it (make test) and
> suggest that you run streams (make streams).
>
> I see Matt answered but let me add that when you make streams you will
> see the memory rate for 1,2,3, ... NP processes. If your machine is decent
> you should see very good speed up at the beginning and then it will start
> to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
> would expect that you will see something similar with streams. Without
> knowing your machine, your results look typical.
>
> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi 
> wrote:
>
>> Hi,
>>
>> I ran KSP example 45 on a single node with 32 cores and 125GB memory
>> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
>> during KSP.solve:
>>
>> - 1 MPI process: ~98 sec, speedup: 1X
>> - 16 MPI processes: ~12 sec, speedup: ~8X
>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>
>> Since the problem size is large enough (8M unknowns), I expected a
>> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
>> can it be improved?
>>
>> I've attached three log files for more details.
>>
>> Sincerely,
>> Amin
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener

Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Matthew Knepley
On Wed, Mar 25, 2020 at 5:41 PM Amin Sadeghi  wrote:

> Junchao, thank you for doing the experiment. I guess TACC Frontera nodes
> have higher memory bandwidth (maybe a more modern CPU architecture, although
> I'm not sure which hardware features affect memory bandwidth) than Compute
> Canada's Graham.
>
> Mark, I did as you suggested. As you suspected, running make streams
> yielded the same results, indicating that the memory bandwidth saturated at
> around 8 MPI processes. I ran the experiment on multiple nodes but only
> requested 8 cores per node, and here is the result:
>
> 1 node (8 cores total): 17.5s, 6X speedup
> 2 nodes (16 cores total): 13.5s, 7X speedup
> 3 nodes (24 cores total): 9.4s, 10X speedup
> 4 nodes (32 cores total): 8.3s, 12X speedup
> 5 nodes (40 cores total): 7.0s, 14X speedup
> 6 nodes (48 cores total): 61.4s, 2X speedup [!!!]
> 7 nodes (56 cores total): 4.3s, 23X speedup
> 8 nodes (64 cores total): 3.7s, 27X speedup
>
> *Note:* as you can see, the experiment with 6 nodes showed extremely poor
> scaling, which I guess was an outlier, maybe due to some connection problem?
>
> I also ran another experiment, requesting 2 full nodes, i.e. 64 cores, and
> here's the result:
>
> 2 nodes (64 cores total): 6.0s, 16X speedup [32 cores each node]
>
> So, it turns out that given a fixed number of cores, i.e. 64 in our case,
> much better speedups (27X vs. 16X in our case) can be achieved if they are
> distributed among separate nodes.
>
> Anyways, I really appreciate all your inputs.
>
> *One final question:* From what I understand from Mark's comment, PETSc
> at the moment is blind to the memory hierarchy. Is it feasible to make PETSc
> aware of inter- and intra-node communication so that partitioning is
> done to maximize performance? Or, to put it differently, is this something
> that PETSc devs have their eyes on for the future?
>

There is already stuff in VecScatter that knows about the memory hierarchy,
which Junchao put in. We are actively working on some other node-aware
algorithms.

  Thanks,

 Matt


> Sincerely,
> Amin
>
>
> On Wed, Mar 25, 2020 at 3:51 PM Junchao Zhang 
> wrote:
>
>> I repeated your experiment on one node of TACC Frontera,
>> 1 rank: 85.0s
>> 16 ranks: 8.2s, 10x speedup
>> 32 ranks: 5.7s, 15x speedup
>>
>> --Junchao Zhang
>>
>>
>> On Wed, Mar 25, 2020 at 1:18 PM Mark Adams  wrote:
>>
>>> Also, a better test is to see where streams pretty much saturates, then run
>>> that many processors per node and do the same test by increasing the nodes.
>>> This will tell you how well your network communication is doing.
>>>
>>> But this result has a lot of stuff in "network communication" that can
>>> be further evaluated. The worst thing about this, I would think, is that
>>> the partitioning is blind to the memory hierarchy of inter and intra node
>>> communication. The next thing to do is run with an initial grid that puts
>>> one cell per node and then do uniform refinement, until you have one cell
>>> per process (eg, one refinement step using 8 processes per node), partition
>>> to get one cell per process, then do uniform refinement to get a
>>> reasonably sized local problem. Alas, this is not easy to do, but it is
>>> doable.
>>>
>>> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams  wrote:
>>>
 I would guess that you are saturating the memory bandwidth. After
 you make PETSc (make all) it will suggest that you test it (make test) and
 suggest that you run streams (make streams).

 I see Matt answered but let me add that when you make streams you will
 see the memory rate for 1,2,3, ... NP processes. If your machine is decent
 you should see very good speed up at the beginning and then it will start
 to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
 would expect that you will see something similar with streams. Without
 knowing your machine, your results look typical.

 On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi 
 wrote:

> Hi,
>
> I ran KSP example 45 on a single node with 32 cores and 125GB memory
> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
> during KSP.solve:
>
> - 1 MPI process: ~98 sec, speedup: 1X
> - 16 MPI processes: ~12 sec, speedup: ~8X
> - 32 MPI processes: ~11 sec, speedup: ~9X
>
> Since the problem size is large enough (8M unknowns), I expected a
> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
> can it be improved?
>
> I've attached three log files for more details.
>
> Sincerely,
> Amin
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Amin Sadeghi
Junchao, thank you for doing the experiment. I guess TACC Frontera nodes
have higher memory bandwidth (maybe a more modern CPU architecture, although
I'm not sure which hardware features affect memory bandwidth) than Compute
Canada's Graham.

Mark, I did as you suggested. As you suspected, running make streams
yielded the same results, indicating that the memory bandwidth saturated at
around 8 MPI processes. I ran the experiment on multiple nodes but only
requested 8 cores per node, and here is the result:

1 node (8 cores total): 17.5s, 6X speedup
2 nodes (16 cores total): 13.5s, 7X speedup
3 nodes (24 cores total): 9.4s, 10X speedup
4 nodes (32 cores total): 8.3s, 12X speedup
5 nodes (40 cores total): 7.0s, 14X speedup
6 nodes (48 cores total): 61.4s, 2X speedup [!!!]
7 nodes (56 cores total): 4.3s, 23X speedup
8 nodes (64 cores total): 3.7s, 27X speedup

*Note:* as you can see, the experiment with 6 nodes showed extremely poor
scaling, which I guess was an outlier, maybe due to some connection problem?

I also ran another experiment, requesting 2 full nodes, i.e. 64 cores, and
here's the result:

2 nodes (64 cores total): 6.0s, 16X speedup [32 cores each node]

So, it turns out that given a fixed number of cores, i.e. 64 in our case,
much better speedups (27X vs. 16X in our case) can be achieved if they are
distributed among separate nodes.

Anyways, I really appreciate all your inputs.

*One final question:* From what I understand from Mark's comment, PETSc at
the moment is blind to the memory hierarchy. Is it feasible to make PETSc aware
of inter- and intra-node communication so that partitioning is done to
maximize performance? Or, to put it differently, is this something that
PETSc devs have their eyes on for the future?

Sincerely,
Amin


On Wed, Mar 25, 2020 at 3:51 PM Junchao Zhang 
wrote:

> I repeated your experiment on one node of TACC Frontera,
> 1 rank: 85.0s
> 16 ranks: 8.2s, 10x speedup
> 32 ranks: 5.7s, 15x speedup
>
> --Junchao Zhang
>
>
> On Wed, Mar 25, 2020 at 1:18 PM Mark Adams  wrote:
>
>> Also, a better test is to see where streams pretty much saturates, then run
>> that many processors per node and do the same test by increasing the nodes.
>> This will tell you how well your network communication is doing.
>>
>> But this result has a lot of stuff in "network communication" that can be
>> further evaluated. The worst thing about this, I would think, is that the
>> partitioning is blind to the memory hierarchy of inter and intra node
>> communication. The next thing to do is run with an initial grid that puts
>> one cell per node and then do uniform refinement, until you have one cell
>> per process (eg, one refinement step using 8 processes per node), partition
>> to get one cell per process, then do uniform refinement to get a
>> reasonably sized local problem. Alas, this is not easy to do, but it is
>> doable.
>>
>> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams  wrote:
>>
>>> I would guess that you are saturating the memory bandwidth. After
>>> you make PETSc (make all) it will suggest that you test it (make test) and
>>> suggest that you run streams (make streams).
>>>
>>> I see Matt answered but let me add that when you make streams you will
>>> see the memory rate for 1,2,3, ... NP processes. If your machine is decent
>>> you should see very good speed up at the beginning and then it will start
>>> to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
>>> would expect that you will see something similar with streams. Without
>>> knowing your machine, your results look typical.
>>>
>>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi 
>>> wrote:
>>>
 Hi,

 I ran KSP example 45 on a single node with 32 cores and 125GB memory
 using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
 during KSP.solve:

 - 1 MPI process: ~98 sec, speedup: 1X
 - 16 MPI processes: ~12 sec, speedup: ~8X
 - 32 MPI processes: ~11 sec, speedup: ~9X

 Since the problem size is large enough (8M unknowns), I expected a
 speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
 can it be improved?

 I've attached three log files for more details.

 Sincerely,
 Amin

>>>


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Junchao Zhang
I repeated your experiment on one node of TACC Frontera,
1 rank: 85.0s
16 ranks: 8.2s, 10x speedup
32 ranks: 5.7s, 15x speedup

--Junchao Zhang


On Wed, Mar 25, 2020 at 1:18 PM Mark Adams  wrote:

> Also, a better test is to see where streams pretty much saturates, then run
> that many processors per node and do the same test by increasing the nodes.
> This will tell you how well your network communication is doing.
>
> But this result has a lot of stuff in "network communication" that can be
> further evaluated. The worst thing about this, I would think, is that the
> partitioning is blind to the memory hierarchy of inter and intra node
> communication. The next thing to do is run with an initial grid that puts
> one cell per node and then do uniform refinement, until you have one cell
> per process (eg, one refinement step using 8 processes per node), partition
> to get one cell per process, then do uniform refinement to get a
> reasonably sized local problem. Alas, this is not easy to do, but it is
> doable.
>
> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams  wrote:
>
>> I would guess that you are saturating the memory bandwidth. After
>> you make PETSc (make all) it will suggest that you test it (make test) and
>> suggest that you run streams (make streams).
>>
>> I see Matt answered but let me add that when you make streams you will
>> see the memory rate for 1,2,3, ... NP processes. If your machine is decent
>> you should see very good speed up at the beginning and then it will start
>> to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
>> would expect that you will see something similar with streams. Without
>> knowing your machine, your results look typical.
>>
>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi 
>> wrote:
>>
>>> Hi,
>>>
>>> I ran KSP example 45 on a single node with 32 cores and 125GB memory
>>> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
>>> during KSP.solve:
>>>
>>> - 1 MPI process: ~98 sec, speedup: 1X
>>> - 16 MPI processes: ~12 sec, speedup: ~8X
>>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>>
>>> Since the problem size is large enough (8M unknowns), I expected a
>>> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
>>> can it be improved?
>>>
>>> I've attached three log files for more details.
>>>
>>> Sincerely,
>>> Amin
>>>
>>


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Mark Adams
Also, a better test is to see where streams pretty much saturates, then run
that many processors per node and do the same test by increasing the nodes.
This will tell you how well your network communication is doing.
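
Concretely, assuming a Slurm cluster and that streams saturates at about 8
ranks per node (as it seems to here), a sketch of that test is:

  # Hold ranks-per-node fixed at the streams saturation point and scale out by
  # adding nodes; any remaining loss of speedup is then mostly communication.
  for nodes in 1 2 4 8; do
    sbatch --nodes=$nodes --ntasks-per-node=8 --time=00:30:00 \
           --wrap="srun ./ex45 -da_grid_x 200 -da_grid_y 200 -da_grid_z 200 -log_view"
  done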

But this result has a lot of stuff in "network communication" that can be
further evaluated. The worst thing about this, I would think, is that the
partitioning is blind to the memory hierarchy of inter and intra node
communication. The next thing to do is run with an initial grid that puts
one cell per node and then do uniform refinement, until you have one cell
per process (eg, one refinement step using 8 processes per node), partition
to get one cell per process, then do uniform refinement to get a
reasonably sized local problem. Alas, this is not easy to do, but it is
doable.
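
With a DMDA example like ex45 one can approximate this idea from the command
line (a sketch: -da_grid_* and -da_refine are standard DMDA options, but check
with -help that your build of ex45 honors them): start from a coarse grid sized
to the machine layout and refine it uniformly, e.g.

  # 64 ranks = 8 nodes x 8 ranks/node. Start from a small coarse grid and let
  # uniform refinement (roughly doubling each direction per level) build the
  # fine grid, so the decomposition follows the coarse-grid layout.
  mpirun -n 64 ./ex45 -da_grid_x 8 -da_grid_y 8 -da_grid_z 8 -da_refine 5 -log_view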

On Wed, Mar 25, 2020 at 2:04 PM Mark Adams  wrote:

> I would guess that you are saturating the memory bandwidth. After you make
> PETSc (make all) it will suggest that you test it (make test) and suggest
> that you run streams (make streams).
>
> I see Matt answered but let me add that when you make streams you will
> see the memory rate for 1,2,3, ... NP processes. If your machine is decent
> you should see very good speed up at the beginning and then it will start
> to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
> would expect that you will see something similar with streams. Without
> knowing your machine, your results look typical.
>
> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi 
> wrote:
>
>> Hi,
>>
>> I ran KSP example 45 on a single node with 32 cores and 125GB memory
>> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
>> during KSP.solve:
>>
>> - 1 MPI process: ~98 sec, speedup: 1X
>> - 16 MPI processes: ~12 sec, speedup: ~8X
>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>
>> Since the problem size is large enough (8M unknowns), I expected a
>> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
>> can it be improved?
>>
>> I've attached three log files for more details.
>>
>> Sincerely,
>> Amin
>>
>


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Matthew Knepley
On Wed, Mar 25, 2020 at 2:11 PM Amin Sadeghi  wrote:

> Thank you Matt and Mark for the explanation. That makes sense. Please
> correct me if I'm wrong: I think that instead of asking for the whole node with
> 32 cores, if I ask for more nodes, say 4 or 8, each with 8 cores, then
> I should see much better speedups. Is that correct?
>

Yes, exactly

  Matt


> On Wed, Mar 25, 2020 at 2:04 PM Mark Adams  wrote:
>
>> I would guess that you are saturating the memory bandwidth. After
>> you make PETSc (make all) it will suggest that you test it (make test) and
>> suggest that you run streams (make streams).
>>
>> I see Matt answered but let me add that when you make streams you will
>> see the memory rate for 1,2,3, ... NP processes. If your machine is decent
>> you should see very good speed up at the beginning and then it will start
>> to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
>> would expect that you will see something similar with streams. Without
>> knowing your machine, your results look typical.
>>
>> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi 
>> wrote:
>>
>>> Hi,
>>>
>>> I ran KSP example 45 on a single node with 32 cores and 125GB memory
>>> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
>>> during KSP.solve:
>>>
>>> - 1 MPI process: ~98 sec, speedup: 1X
>>> - 16 MPI processes: ~12 sec, speedup: ~8X
>>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>>
>>> Since the problem size is large enough (8M unknowns), I expected a
>>> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
>>> can it be improved?
>>>
>>> I've attached three log files for more details.
>>>
>>> Sincerely,
>>> Amin
>>>
>>

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Amin Sadeghi
Thank you Matt and Mark for the explanation. That makes sense. Please
correct me if I'm wrong: I think that instead of asking for the whole node with
32 cores, if I ask for more nodes, say 4 or 8, each with 8 cores, then
I should see much better speedups. Is that correct?

On Wed, Mar 25, 2020 at 2:04 PM Mark Adams  wrote:

> I would guess that you are saturating the memory bandwidth. After you make
> PETSc (make all) it will suggest that you test it (make test) and suggest
> that you run streams (make streams).
>
> I see Matt answered but let me add that when you make streams you will
> see the memory rate for 1,2,3, ... NP processes. If your machine is decent
> you should see very good speed up at the beginning and then it will start
> to saturate. You are seeing about 50% of perfect speedup at 16 processes. I
> would expect that you will see something similar with streams. Without
> knowing your machine, your results look typical.
>
> On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi 
> wrote:
>
>> Hi,
>>
>> I ran KSP example 45 on a single node with 32 cores and 125GB memory
>> using 1, 16 and 32 MPI processes. Here's a comparison of the time spent
>> during KSP.solve:
>>
>> - 1 MPI process: ~98 sec, speedup: 1X
>> - 16 MPI processes: ~12 sec, speedup: ~8X
>> - 32 MPI processes: ~11 sec, speedup: ~9X
>>
>> Since the problem size is large enough (8M unknowns), I expected a
>> speedup much closer to 32X, rather than 9X. Is this expected? If yes, how
>> can it be improved?
>>
>> I've attached three log files for more details.
>>
>> Sincerely,
>> Amin
>>
>


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Mark Adams
I would guess that you are saturating the memory bandwidth. After you make
PETSc (make all) it will suggest that you test it (make test) and suggest
that you run streams (make streams).

I see Matt answered but let me add that when you make streams you will see
the memory rate for 1,2,3, ... NP processes. If your machine is decent you
should see very good speed up at the beginning and then it will start to
saturate. You are seeing about 50% of perfect speedup at 16 processes. I
would expect that you will see something similar with streams. Without
knowing your machine, your results look typical.
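
For reference, the streams benchmark is run from the PETSc source directory;
NPMAX is how the makefile is usually told the largest rank count to test
(confirm against the instructions your PETSc version prints):

  cd $PETSC_DIR
  make streams NPMAX=32   # report the aggregate memory rate for 1..32 MPI ranks

The rank count at which the reported rate stops growing is roughly where the
solve stops speeding up on one node.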

On Wed, Mar 25, 2020 at 1:05 PM Amin Sadeghi  wrote:

> Hi,
>
> I ran KSP example 45 on a single node with 32 cores and 125GB memory using
> 1, 16 and 32 MPI processes. Here's a comparison of the time spent during
> KSP.solve:
>
> - 1 MPI process: ~98 sec, speedup: 1X
> - 16 MPI processes: ~12 sec, speedup: ~8X
> - 32 MPI processes: ~11 sec, speedup: ~9X
>
> Since the problem size is large enough (8M unknowns), I expected a speedup
> much closer to 32X, rather than 9X. Is this expected? If yes, how can it be
> improved?
>
> I've attached three log files for more details.
>
> Sincerely,
> Amin
>


Re: [petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Matthew Knepley
On Wed, Mar 25, 2020 at 1:01 PM Amin Sadeghi  wrote:

> Hi,
>
> I ran KSP example 45 on a single node with 32 cores and 125GB memory using
> 1, 16 and 32 MPI processes. Here's a comparison of the time spent during
> KSP.solve:
>
> - 1 MPI process: ~98 sec, speedup: 1X
> - 16 MPI processes: ~12 sec, speedup: ~8X
> - 32 MPI processes: ~11 sec, speedup: ~9X
>
> Since the problem size is large enough (8M unknowns), I expected a speedup
> much closer to 32X, rather than 9X. Is this expected? If yes, how can it be
> improved?
>
> I've attached three log files for more details.
>

We have answered this here:
https://www.mcs.anl.gov/petsc/documentation/faq.html#computers

However, I can briefly summarize it. The bottleneck here is not computing
power; it is memory bandwidth. The node
you are running on has enough bandwidth for about 8 processes, not 32. It
probably takes 12-16 processes to saturate
the memory bandwidth, but not 32. That is why you see no speedup after 16.
There is no way to improve this by optimization.
The only thing to do is change the algorithm you are using. This behavior
has been extensively documented and talked about
for two decades. See, for example, the Roofline Performance Model.
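
As a rough rule of thumb for bandwidth-bound solves, with B(p) the aggregate
STREAMS rate measured with p MPI ranks,

  \mathrm{speedup}(p) \;\lesssim\; \frac{B(p)}{B(1)},

so once B(p) flattens out (around 8-16 ranks on this node) the solve time does
too; the observed 98 s / 11 s, i.e. about 9x at 32 ranks, is consistent with that.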

  Thanks,

Matt


> Sincerely,
> Amin
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ 


[petsc-users] Poor speed up for KSP example 45

2020-03-25 Thread Amin Sadeghi
Hi,

I ran KSP example 45 on a single node with 32 cores and 125GB memory using
1, 16 and 32 MPI processes. Here's a comparison of the time spent during
KSP.solve:

- 1 MPI process: ~98 sec, speedup: 1X
- 16 MPI processes: ~12 sec, speedup: ~8X
- 32 MPI processes: ~11 sec, speedup: ~9X

Since the problem size is large enough (8M unknowns), I expected a speedup
much closer to 32X, rather than 9X. Is this expected? If yes, how can it be
improved?

I've attached three log files for more details.

Sincerely,
Amin
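
For anyone reproducing this: with -log_view the time spent in KSP.solve is
reported on the KSPSolve line of the event summary, so the timings above can be
read off directly, e.g. (run from the directory containing ex45; the tutorials
path depends on the PETSc version):

  mpirun -n 16 ./ex45 -da_grid_x 200 -da_grid_y 200 -da_grid_z 200 -log_view | grep KSPSolve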
[aminsad@gra798 tutorials]$ export OMP_NUM_THREADS=1; time mpirun -n 1 ./ex45 -da_grid_x 200 -da_grid_y 200 -da_grid_z 200 -ksp_monitor -log_view

[Attached -ksp_monitor output, truncated in the archive: the KSP residual norm
decreases monotonically from 5.258064405273e+02 at iteration 0 to
1.450996365325e-01 at iteration 104, at which point the log is cut off.]