Slow speed after changing from serial to parallel (with ex2f.F)

2008-04-20 Thread Ben Tay
Hi Satish,

First of all, I forgot to inform you that I've changed m and n to 800. I 
would like to see if the larger value makes the scaling better. If 
required, I can redo the test with m,n=600.

I can install MPICH, but I don't think I can choose to run on a single 
machine using 1 to 8 procs. In order to run the code, I usually 
have to use the command

bsub -o log -q linux64 ./a.out   (for a single proc)

bsub -o log -q mcore_parallel -n $ -a mvapich mpirun.lsf ./a.out
(for multiple procs, where $ = no. of procs)

After that, when the job is running, I'll be told which servers my 
job runs on, e.g. atlas3-c10 (1 proc) or 2*atlas3-c10 + 2*atlas3-c12 (4 
procs) or 2*atlas3-c10 + 2*atlas3-c12 + 2*atlas3-c11 + 2*atlas3-c13 (8 
procs). I was told that 2*atlas3-c10 doesn't mean that it is running on 
a single dual-core CPU.

Btw, are you saying that I should first install the latest MPICH2 build 
with the option

./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

and then install PETSc with that MPICH2?

So after that, do you know how to do what you've suggested on my servers? 
I don't really understand what you mean. Am I supposed to run 4 jobs on 
1 quad-core? Or 1 job using 4 cores on 1 quad-core? Well, I do know that 
atlas3-c00 to c03 are where the quad cores are located. I can force the 
job to use them with

bsub -o log -q mcore_parallel -n $ -m quadcore -a mvapich mpirun.lsf ./a.out

Lastly, I made a mistake about the different times reported by the same 
compiler. Sorry about that.

Thank you very much.



Satish Balay wrote:
 On Sat, 19 Apr 2008, Ben Tay wrote:

   
 Btw, I'm not able to try the latest mpich2 because I do not have the
 administrator rights. I was told that some special configuration is
 required.
 

 You don't need admin rights to install/use MPICH with the options I
 mentioned. I was suggesting just running in SMP mode on a single
 machine [from 1-8 procs on Quad-Core Intel Xeon X5355, to compare with
 my SMP runs] with:

 ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

   
 Btw, should there be any difference in speed whether I use mpiuni and
 ifort or mpi and mpif90? I tried ex2f (below) and there's only a
 small difference. If there is a large difference (mpi being slower),
 does it mean there's something wrong in the code?
 

 For one - you are not using MPIUNI. You are using
 --with-mpi-dir=/lsftmp/g0306332/mpich2. However - if the compilers are
 the same and the compiler options are the same, I would expect the same
 performance in both cases. Do you get such different times for
 different runs of the same binary?

 MatMult 384 vs 423

 What if you run both of the binaries on the same machine? [as a single
 job?].

 If you are using the PBS scheduler - suggest doing:
 - qsub -I [to get interactive access to the nodes]
 - log in to each node - to check no one else is using the scheduled nodes.
 - run multiple jobs during this single allocation for comparison.

 These are general tips to help you debug performance on your cluster.

 BTW: I get:
 ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 
 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   397

 You get:
 log.1:MatMult 1879 1.0 2.8137e+01 1.0 3.84e+08 1.0 0.0e+00 
 0.0e+00 0.0e+00 12 11  0  0  0  12 11  0  0  0   384


 There is a difference in number of iterations. Are you sure you are
 using the same ex2f with -m 600 -n 600 options?

 Satish




Slow speed after changing from serial to parallel (with ex2f.F)

2008-04-19 Thread Satish Balay
On Sat, 19 Apr 2008, Ben Tay wrote:

 Btw, I'm not able to try the latest mpich2 because I do not have the
 administrator rights. I was told that some special configuration is
 required.

You don't need admin rights to install/use MPICH with the options I
mentioned. I was suggesting just running in SMP mode on a single
machine [from 1-8 procs on Quad-Core Intel Xeon X5355, to compare with
my SMP runs] with:

./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

 Btw, should there be any difference in speed whether I use mpiuni and
 ifort or mpi and mpif90? I tried ex2f (below) and there's only a
 small difference. If there is a large difference (mpi being slower),
 does it mean there's something wrong in the code?

For one - you are not using MPIUNI. You are using
--with-mpi-dir=/lsftmp/g0306332/mpich2. However - if the compilers are
the same and the compiler options are the same, I would expect the same
performance in both cases. Do you get such different times for
different runs of the same binary?

MatMult 384 vs 423

What if you run both of the binaries on the same machine? [as a single
job?].

If you are using the PBS scheduler - suggest doing:
- qsub -I [to get interactive access to the nodes]
- log in to each node - to check no one else is using the scheduled nodes.
- run multiple jobs during this single allocation for comparison [sketch below].
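
For eg - with PBS that could look like the following [the resource syntax
is site specific, so treat this only as a sketch]:

qsub -I -l nodes=1:ppn=8 -l walltime=01:00:00    # one full node, interactively

# once the interactive shell opens on the allocated node:
who                                              # check nobody else is logged in
mpiexec -n 1 ./ex2f -m 600 -n 600 -log_summary > log.1
mpiexec -n 2 ./ex2f -m 600 -n 600 -log_summary > log.2
mpiexec -n 4 ./ex2f -m 600 -n 600 -log_summary > log.4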

These are general tips to help you debug performance on your cluster.

BTW: I get:
ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 
0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   397

You get:
log.1:MatMult 1879 1.0 2.8137e+01 1.0 3.84e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 11? 0? 0? 0? 12 11? 0? 0? 0?? 384


There is a difference in number of iterations. Are you sure you are
using the same ex2f with -m 600 -n 600 options?

Satish


Slow speed after changing from serial to parallel (with ex2f.F)

2008-04-19 Thread Satish Balay
Ben,

This conversation is getting long and winding. And we are getting
into your cluster administration - which is not PETSc related.

I'll suggest you figure out, with your system admin, how to use the
cluster and bsub.

http://www.vub.ac.be/BFUCC/LSF/man/bsub.1.html

However I'll point out the following things.

- I'll suggest learning about scheduling an interactive job on your
  cluster. This will help you with running multiple jobs on the same
  machine.

- When making comparisons, keep the changes between the runs you
  compare to a minimum.

 * For eg: you are comparing runs between different queues, '-q
 linux64' and '-q mcore_parallel'. There might be differences here that
 can result in different performance.

 * If you are getting part of a machine [for -n 1 jobs] - verify
 whether you are sharing the other part with some other job. Without this
 verification - your numbers are not meaningful. [depending upon how
 the queue is configured - it can allocate either part of a node or a
 full node]

 * you should be able to request 4 procs [i.e. 1 complete machine] but
 then run with -np 1, 2 or 4 on that allocation. [This is
 easier to do in interactive mode]. This ensures nobody else is using
 the machine. And you can run your code multiple times - to see if
 you are getting consistent results.
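
 For eg - something like the following [the bsub flags and queue names
 are site specific, so this is only a sketch - check with your admin]:

 # grab one whole machine interactively
 bsub -Is -n 8 -q mcore_parallel -m quadcore /bin/bash

 # in the shell that opens on the allocated node:
 hostname; who        # confirm which node you got, and that it is idle
 # then run the same binary with -np 1, 2, 4 on this one allocation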

Regarding the primary issue you've had - performance debugging
your PETSc application in *SMP mode* - we've observed performance
anomalies in your log_summary for both your code and ex2f.F. This
could be due to one or more of the following:

- issues in your code
- issues with the MPI you are using
- issues with the cluster you are using.

To narrow this down - the comparisons I suggest:

- compare my ex2f.F with the *exact* same runs on your machine [you've
claimed that you also have access to a 2x quad-core Intel Xeon X5355
machine]. So you should be able to reproduce the exact same experiment
as me - and compare the results. This keeps the software the same -
and shows up differences in system software etc.


  No of Nodes   Processors                   Qty per node   Total cores per node   Memory per node
   4            Quad-Core Intel Xeon X5355   2              8                      16 GB   <-- these
  60            Dual-Core Intel Xeon 5160    2              4                      8 GB


i.e. configure the latest mpich2 with [default compilers gcc/gfortran]:
./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

Build PETSc with this MPI [and same compilers]
./config/configure.py --with-mpi-dir= --with-debugging=0

And run ex2f.F 600x600 on 1, 2, 4, 8 procs on a *single* X5355
machine. [it might have a different queue name]
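
i.e. something like this end-to-end [the install prefix and the ex2f
path below are placeholders from a petsc-2.3.3 tree - adjust to your setup]:

# MPICH2 into your home dir [no admin rights needed]
./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker \
    --prefix=$HOME/mpich2-gforker
make && make install

# PETSc against that MPI, optimized build
./config/configure.py --with-mpi-dir=$HOME/mpich2-gforker --with-debugging=0
make all

# ex2f, 600x600, on 1/2/4/8 procs of one machine
cd src/ksp/ksp/examples/tutorials && make ex2f
for np in 1 2 4 8; do
  $HOME/mpich2-gforker/bin/mpiexec -n $np ./ex2f -m 600 -n 600 -log_summary > log.$np
done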

- Now compare ex2f.F performance with MPICH [as built above] and the
current MPI you are using. This should identify the performance
differences between MPI implementations within the box [within the SMP
box].

- Now compare runs between ex2f.F and your application.

At each of the above steps of comparison - we are hoping to identify
the reason for differences and rectify them. Perhaps this is not possible
on your cluster and you can't improve on what you already have.

If you can't debug the SMP performance issues, you can avoid SMP
completely, and use 1 MPI task per machine [or 1 MPI task per memory
bank = 2 per machine]. But you'll still have to do a similar analysis
to make sure there are no performance anomalies in the tool chain.

[i.e hardware, system software, MPI, application]
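
For eg - if your LSF queue honours span[] resource requests, runs like
these would place the tasks that way [unverified for your site - a sketch]:

# 4 MPI tasks spread out as 1 task per machine
bsub -o log -q mcore_parallel -n 4 -R "span[ptile=1]" -a mvapich mpirun.lsf ./a.out

# 8 tasks at 2 per machine [roughly 1 per memory bank]
bsub -o log -q mcore_parallel -n 8 -R "span[ptile=2]" -a mvapich mpirun.lsf ./a.out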

If you are willing to do the above steps, we can help with the
comparisons. As mentioned - this is getting long and winding. If you
have further questions in this regard - we should continue them at
petsc-maint at mcs.anl.gov

Satish

On Sat, 19 Apr 2008, Ben Tay wrote:

 Hi Satish,
 
 First of all, I forgot to inform you that I've changed m and n to 800. I would
 like to see if the larger value makes the scaling better. If required, I can
 redo the test with m,n=600.
 
 I can install MPICH but I don't think I can choose to run on a single machine
 using from 1 to 8 procs. In order to run the code, I usually have to use the
 command
 
 bsub -o log -q linux64 ./a.out   for single procs
 
 bsub -o log -q mcore_parallel -n $ -a mvapich mpirun.lsf ./a.out where $=no.
 of procs.   for multiple procs
 
 After that, when the job is running, I'll be given the server which my job
 runs on e.g. atlas3-c10 (1 procs) or 2*atlas3-c10 + 2*atlas3-c12 (4 procs) or
 2*atlas3-c10 + 2*atlas3-c12 +2*atlas3-c11 + 2*atlas3-c13 (8 procs). I was told
 that 2*atlas3-c10 doesn't mean that it is running on a dual core single cpu.
 
 Btw, are you saying that I should first install the latest MPICH2 build with the
 option:
 
 ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker, and then install
 PETSc with that MPICH2?
 
 After that, do you know how to do what you've suggested on my servers? I
 don't really understand what you mean. Am I supposed to run 4 jobs on 1
 quad-core? Or 1 job using 4 cores on 1 quad-core? Well, I do know that
 atlas3-c00 to c03 are where the quad cores are located. I can force the job to use them
 

Slow speed after changing from serial to parallel

2008-04-18 Thread Ben Tay
An HTML attachment was scrubbed...
URL: 
http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20080418/ca9caf8e/attachment.htm


Slow speed after changing from serial to parallel

2008-04-18 Thread Satish Balay
On Fri, 18 Apr 2008, Ben Tay wrote:

 Hi,
 
 I've emailed my school's supercomputing staff and they told me that the queue 
 which I'm using is meant for testing, hence its
 handling of the workload is not good. I've sent my job to another queue and it's 
 now run on 4 processors. It's my own code because there seems
 to be something wrong with the server displaying the summary when using 
 -log_summary with ex2f.F. I'm trying it again.

That's weird. We should first make sure ex2f [or ex2] runs
properly before looking at your code.

 
 Anyway, comparing just KSPSolve between the two, the speedup is about 2.7. 
 However, I noticed that for the 4-processor one, its
 MatAssemblyBegin is 1.5158e+02, which is more than KSPSolve's 4.7041e+00. So 
 is MatAssemblyBegin's time included in KSPSolve? If not,
 does it mean that there's something wrong with my MatAssemblyBegin?

MatAssemblyBegin is not included in KSPSolve(). Something weird is
going on here. There are 2 possibilities:

- whatever code you have before matrix assembly is unbalanced, so
  MatAssemblyBegin() acts as a barrier.

- MPI communication is not optimal within the node.

It's best to first make sure ex2 or ex2f runs fine. As recommended
earlier - you should try latest mpich2 with --with-device=ch3:nemesis:newtcp
and compare ex2/ex2f performance with your current MPI.

Satish


Slow speed after changing from serial to parallel

2008-04-16 Thread Ben Tay
Oh sorry, here's the full information. I'm currently using 2 processors:


*** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r 
-fCourier9' to print this document***


-- PETSc Performance 
Summary: --

./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 
Tue Apr 15 23:03:09 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 
HG revision: 414581156e67e55c761739b0deb119f7590d0f4b

 Max   Max/MinAvg  Total
Time (sec):   1.114e+03  1.00054   1.114e+03
Objects:  5.400e+01  1.0   5.400e+01
Flops:1.574e+11  1.0   1.574e+11  3.147e+11
Flops/sec:1.414e+08  1.00054   1.413e+08  2.826e+08
MPI Messages: 8.777e+03  1.0   8.777e+03  1.755e+04
MPI Message Lengths:  4.213e+07  1.0   4.800e+03  8.425e+07
MPI Reductions:   8.644e+03  1.0

Flop counting convention: 1 flop = 1 real number operation of type 
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N 
-- 2N flops
and VecAXPY() for complex vectors of length 
N -- 8N flops

Summary of Stages:   - Time --  - Flops -  --- Messages 
---  -- Message Lengths --  -- Reductions --
Avg %Total Avg %Total   counts   
%Total Avg %Total   counts   %Total
 0:  Main Stage: 1.1136e+03 100.0%  3.1475e+11 100.0%  1.755e+04 
100.0%  4.800e+03  100.0%  1.729e+04 100.0%


See the 'Profiling' chapter of the users' manual for details on 
interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops/sec: Max - maximum over all processors
   Ratio - ratio of maximum to minimum over all 
processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() 
and PetscLogStagePop().
  %T - percent time in this phase %F - percent flops in this 
phase
  %M - percent messages in this phase %L - percent message 
lengths in this phase
  %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time 
over all processors)



  ##
  ##
  #  WARNING!!!#
  ##
  #   This code was run without the PreLoadBegin() #
  #   macros. To get timing results we always recommend#
  #   preloading. otherwise timing numbers may be  #
  #   meaningless. #
  ##


EventCount  Time (sec) 
Flops/sec --- Global ---  --- Stage ---   Total
   Max Ratio  Max Ratio   Max  Ratio  Mess   Avg len 
Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s


--- Event Stage 0: Main Stage

MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 
0.0e+00 10 11100100  0  10 11100100  0   217
MatSolve8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 
0.0e+00 17 11  0  0  0  17 11  0  0  0   120
MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0   140
MatILUFactorSym1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
1.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyBegin   1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 
0.0e+00 2.0e+00  3  0  0  0  0   3  0  0  0  0 0
MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 
7.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatGetRowIJ1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 
2.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 

Slow speed after changing from serial to parallel

2008-04-16 Thread Ben Tay
Hi,

Here's the summary for 1 processor. Seems like it's also taking a long 
time... Can someone tell me where my mistakes possibly lie? Thank you 
very much!


*** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r 
-fCourier9' to print this document***


-- PETSc Performance 
Summary: --

./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 
Wed Apr 16 00:39:22 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 
HG revision: 414581156e67e55c761739b0deb119f7590d0f4b

 Max   Max/MinAvg  Total
Time (sec):   1.088e+03  1.0   1.088e+03
Objects:  4.300e+01  1.0   4.300e+01
Flops:2.658e+11  1.0   2.658e+11  2.658e+11
Flops/sec:2.444e+08  1.0   2.444e+08  2.444e+08
MPI Messages: 0.000e+00  0.0   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00  0.0   0.000e+00  0.000e+00
MPI Reductions:   1.460e+04  1.0

Flop counting convention: 1 flop = 1 real number operation of type 
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N 
-- 2N flops
and VecAXPY() for complex vectors of length 
N -- 8N flops

Summary of Stages:   - Time --  - Flops -  --- Messages 
---  -- Message Lengths --  -- Reductions --
Avg %Total Avg %Total   counts   
%Total Avg %Total   counts   %Total
 0:  Main Stage: 1.0877e+03 100.0%  2.6584e+11 100.0%  0.000e+00   
0.0%  0.000e+000.0%  1.460e+04 100.0%


See the 'Profiling' chapter of the users' manual for details on 
interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops/sec: Max - maximum over all processors
   Ratio - ratio of maximum to minimum over all 
processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() 
and PetscLogStagePop().
  %T - percent time in this phase %F - percent flops in this 
phase
  %M - percent messages in this phase %L - percent message 
lengths in this phase
  %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time 
over all processors)



  ##
  ##
  #  WARNING!!!#
  ##
  #   This code was run without the PreLoadBegin() #
  #   macros. To get timing results we always recommend#
  #   preloading. otherwise timing numbers may be  #
  #   meaningless. #
  ##


EventCount  Time (sec) 
Flops/sec --- Global ---  --- Stage ---   Total
   Max Ratio  Max Ratio   Max  Ratio  Mess   Avg len 
Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s


--- Event Stage 0: Main Stage

MatMult 7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 12 11  0  0  0  12 11  0  0  0   216
MatSolve7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 25 11  0  0  0  25 11  0  0  0   107
MatLUFactorNum 1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  088
MatILUFactorSym1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
1.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyBegin   1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyEnd 1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatGetRowIJ1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  

Slow speed after changing from serial to parallel

2008-04-16 Thread Ben Tay
Hi,

I was initially using LU and Hypre to solve my serial code. I switched 
to the default GMRES when I converted to the parallel code. I've now redone 
the test using KSPBCGS and also Hypre BoomerAMG. Seems like 
MatAssemblyBegin, VecAYPX and VecScatterEnd (in bold) are the problems. 
What should I be checking? Here are the results for 1 and 2 processors for 
each solver. Thank you so much!
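
(For reference - assuming the code calls KSPSetFromOptions(), these
combinations can also be selected per run from the command line, e.g.

./a.out -ksp_type bcgs -log_summary
./a.out -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg -log_summary

so the same binary can be reused for each test.)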

*1 processor KSPBCGS *


*** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r 
-fCourier9' to print this document***


-- PETSc Performance 
Summary: --

./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 
Wed Apr 16 08:32:21 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 
HG revision: 414581156e67e55c761739b0deb119f7590d0f4b

 Max   Max/MinAvg  Total
Time (sec):   8.176e+01  1.0   8.176e+01
Objects:  2.700e+01  1.0   2.700e+01
Flops:1.893e+10  1.0   1.893e+10  1.893e+10
Flops/sec:2.315e+08  1.0   2.315e+08  2.315e+08
MPI Messages: 0.000e+00  0.0   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00  0.0   0.000e+00  0.000e+00
MPI Reductions:   3.743e+03  1.0

Flop counting convention: 1 flop = 1 real number operation of type 
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N 
-- 2N flops
and VecAXPY() for complex vectors of length 
N -- 8N flops

Summary of Stages:   - Time --  - Flops -  --- Messages 
---  -- Message Lengths --  -- Reductions --
Avg %Total Avg %Total   counts   
%Total Avg %Total   counts   %Total
 0:  Main Stage: 8.1756e+01 100.0%  1.8925e+10 100.0%  0.000e+00   
0.0%  0.000e+000.0%  3.743e+03 100.0%


See the 'Profiling' chapter of the users' manual for details on 
interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops/sec: Max - maximum over all processors
   Ratio - ratio of maximum to minimum over all 
processors
   Mess: number of messages sent
   Avg. len: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() 
and PetscLogStagePop().
  %T - percent time in this phase %F - percent flops in this 
phase
  %M - percent messages in this phase %L - percent message 
lengths in this phase
  %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time 
over all processors)


 
  ##
  ##
  #  WARNING!!!#
  ##
  #   This code was run without the PreLoadBegin() #
  #   macros. To get timing results we always recommend#
  #   preloading. otherwise timing numbers may be  #
  #   meaningless. #
  ##




EventCount  Time (sec) 
Flops/sec --- Global ---  --- Stage ---   Total
   Max Ratio  Max Ratio   Max  Ratio  Mess   Avg len 
Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s


--- Event Stage 0: Main Stage

MatMult 1498 1.0 1.6548e+01 1.0 3.55e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 20 31  0  0  0  20 31  0  0  0   355
MatSolve1500 1.0 3.2228e+01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 39 31  0  0  0  39 31  0  0  0   183
MatLUFactorNum 2 1.0 2.0642e-01 1.0 1.02e+08 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0   102
MatILUFactorSym2 1.0 2.0250e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
2.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyBegin   2 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0  0 0
MatAssemblyEnd 2 1.0 1.7963e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  

Slow speed after changing from serial to parallel (with ex2f.F)

2008-04-16 Thread Ben Tay
An HTML attachment was scrubbed...
URL: 
http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20080416/0f3ee54b/attachment.htm


Slow speed after changing from serial to parallel (with ex2f.F)

2008-04-16 Thread Satish Balay
On Wed, 16 Apr 2008, Ben Tay wrote:

 Hi Satish, thank you very much for helping me run the ex2f.F code.
 
 I think I've a clearer picture now. I believe I'm running on Dual-Core Intel
 Xeon 5160. The quad cores are only on atlas3-01 to 04 and there are only 4 of
 them. I guess that the lower peak is because I'm using Xeon 5160, while you
 are using Xeon X5355.

I'm still a bit puzzled. I just ran the same binary on a 2x dual-core
Xeon 5130 machine [which should be similar to your 5160 machine] and
get the following:

[balay at n001 ~]$ grep MatMult log*
log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 
0.0e+00 14 11  0  0  0  14 11  0  0  0   364
log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 
0.0e+00 14 11100100  0  14 11100100  0   615
log.4:MatMult  969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 
0.0e+00 14 11100100  0  14 11100100  0   656
[balay at n001 ~]$ 

 You mention the speedups for MatMult and compare KSPSolve between runs. Are
 these the only things we have to look at? Because I see that some other event
 such as VecMAXPY also takes up a sizable % of the time. To get an accurate
 speedup, do I just compare the time taken by KSPSolve between different no. of
 processors or do I have to look at other events such as MatMult as well?

Sometimes we look at individual components like MatMult() and VecMAXPY()
to understand what's happening in each stage - and at KSPSolve() to
look at the aggregate performance for the whole solve [which includes
MatMult, VecMAXPY etc.]. Perhaps I should have also looked at
VecMDot() as well - at 48% of runtime it's the biggest contributor to
KSPSolve() for your run.

It's easy to get lost in the details of log_summary. Looking for
anomalies is one thing. Plotting scalability charts for the solver is
something else.

 In summary, due to load imbalance, my speedup is quite bad. So maybe I'll just
 send your results to my school's engineer and see if they could do anything.
 For my part, I guess I'll just have to wait?

Yes - load imbalance at the MatMult level is bad. On the 4-proc run you have
ratio = 3.6. This implies that one of the mpi-tasks is 3.6
times slower than the other tasks [so all speedup is lost here].

You could try the latest mpich2 [1.0.7] - just for this SMP
experiment, and see if it makes a difference. I've built mpich2 with
[default gcc/gfortran and]:

./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

There could be something else going on on this machine that's messing
up load balance for a basic petsc example.

Satish




Slow speed after changing from serial to parallel (with ex2f.F)

2008-04-16 Thread Matthew Knepley
On Wed, Apr 16, 2008 at 8:44 AM, Ben Tay zonexo at gmail.com wrote:
 Hi,

  Am I right to say that despite all the hype about multi-core processors,
 they can't speed up solving of linear eqns? It's not possible to get a 2x
 speedup when using 2 cores. And is this true for all types of linear
 equation solvers besides PETSc? What about parallel direct solvers (e.g.
 MUMPS) or those which use OpenMP instead of MPI? Well, I just can't help
 feeling disappointed if that's the case...

Notice that Satish got much, much better scaling than you did on our box here.
I think something is really wrong, either with the installation of MPI
on that box or something hardware-wise.

  Matt

  Also, with a smart enough LSF scheduler, I will be assured of getting
 separate processors ie 1 core from each different processor instead of 2-4
 cores from just 1 processor. In that case, if I use 1 core from processor A
 and 1 core from processor B, I should be able to get a decent speedup of
 more than 1, is that so? This option is also better than using 2 or even 4
 cores from the same processor.

  Thank you very much.

  Satish Balay wrote:

  On Wed, 16 Apr 2008, Ben Tay wrote:
 
 
 
   Hi Satish, thank you very much for helping me run the ex2f.F code.
  
   I think I've a clearer picture now. I believe I'm running on Dual-Core
 Intel
   Xeon 5160. The quad core is only on atlas3-01 to 04 and there's only 4
 of
   them. I guess that the lower peak is because I'm using Xeon 5160, while
 you
   are using Xeon X5355.
  
  
 
  I'm still a bit puzzled. I just ran the same binary on a 2 dualcore
  xeon 5130 machine [which should be similar to your 5160 machine] and
  get the following:
 
  [balay at n001 ~]$ grep MatMult log*
  log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00
 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   364
  log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03
 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   615
  log.4:MatMult  969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03
 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   656
  [balay at n001 ~]$
 
 
   You mention about the speedups for MatMult and compare between KSPSolve.
 Are
   these the only things we have to look at? Because I see that some other
 event
   such as VecMAXPY also takes up a sizable % of the time. To get an
 accurate
   speedup, do I just compare the time taken by KSPSolve between different
 no. of
   processors or do I have to look at other events such as MatMult as well?
  
  
 
  Sometimes we look at individual components like MatMult() VecMAXPY()
  to understand whats hapenning in each stage - and at KSPSolve() to
  look at the agregate performance for the whole solve [which includes
  MatMult VecMAXPY etc..]. Perhaps I should have also looked at
  VecMDot() aswell - at 48% of runtime - its the biggest contributor to
  KSPSolve() for your run.
 
  Its easy to get lost in the details of log_summary. Looking for
  anamolies is one thing. Plotting scalability charts for the solver is
  something else..
 
 
 
   In summary, due to load imbalance, my speedup is quite bad. So maybe
 I'll just
   send your results to my school's engineer and see if they could do
 anything.
   For my part, I guess I'll just 've to wait?
  
  
 
  Yes - load imbalance at MatMult level is bad. On 4 proc run you have
  ratio = 3.6 . This implies - there is one of the mpi-tasks is 3.6
  times slower than the other task [so all speedup is lost here]
 
  You could try the latest mpich2 [1.0.7] - just for this SMP
  experiment, and see if it makes a difference. I've built mpich2 with
  [default gcc/gfortran and]:
 
  ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker
 
  There could be something else going on on this machine thats messing
  up load-balance for basic petsc example..
 
  Satish
 
 
 
 





-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener




Slow speed after changing from serial to parallel

2008-04-15 Thread Matthew Knepley
1) Please never cut out parts of the summary. All the information is valuable,
and most times necessary.

2) You seem to have a huge load imbalance (look at VecNorm). Do you partition
the system yourself? How many processes is this?

3) You seem to be setting a huge number of off-process values in the matrix
(see MatAssemblyBegin). Is this true? I would reorganize this part.

  Matt

On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay zonexo at gmail.com wrote:
 Hi,

  I have converted the poisson eqn part of the CFD code to parallel. The grid
 size tested is 600x720. For the momentum eqn, I used another serial linear
 solver (nspcg) to prevent mixing of results. Here's the output summary:

  --- Event Stage 0: Main Stage

  MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03
 0.0e+00 10 11100100  0  10 11100100  0   217
  MatSolve8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00
 0.0e+00 17 11  0  0  0  17 11  0  0  0   120
  MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0   140
  MatILUFactorSym1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
 1.0e+00  0  0  0  0  0   0  0  0  0  0 0
  *MatAssemblyBegin   1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00
 0.0e+00 2.0e+00  3  0  0  0  0   3  0  0  0  0 0*
  MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03
 7.0e+00  0  0  0  0  0   0  0  0  0  0 0
  MatGetRowIJ1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
  MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
 2.0e+00  0  0  0  0  0   0  0  0  0  0 0
  MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
  KSPGMRESOrthog  8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00
 8.5e+03 50 72  0  0 49  50 72  0  0 49   363
  KSPSetup   2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
  KSPSolve   1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03
 1.7e+04 89100100100100  89100100100100   317
  PCSetUp2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00
 3.0e+00  0  0  0  0  0   0  0  0  0  069
  PCSetUpOnBlocks1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00
 3.0e+00  0  0  0  0  0   0  0  0  0  069
  PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00
 0.0e+00 18 11  0  0  0  18 11  0  0  0   114
  VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00
 8.5e+03 35 36  0  0 49  35 36  0  0 49   213
  *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00
 8.8e+03  9  2  0  0 51   9  2  0  0 5142*
  *VecScale8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00
 0.0e+00  0  1  0  0  0   0  1  0  0  0   636*
  VecCopy  284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
  VecSet  9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  1  0  0  0  0   1  0  0  0  0 0
  VecAXPY  567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0   346
  VecMAXPY8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00
 0.0e+00 16 38  0  0  0  16 38  0  0  0   453
  VecAssemblyBegin   2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
 6.0e+00  0  0  0  0  0   0  0  0  0  0 0
  VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
  *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03
 0.0e+00  0  0100100  0   0  0100100  0 0*
  *VecScatterEnd   8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00
 0.0e+00  1  0  0  0  0   1  0  0  0  0 0*
  *VecNormalize8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00
 8.8e+03  9  4  0  0 51   9  4  0  0 5162*

 
   Memory usage is given in bytes:
   Object Type  Creations   Destructions   Memory  Descendants' Mem.
 --- Event Stage 0: Main Stage
  Matrix 4  4   49227380 0
   Krylov Solver 2  2  17216 0
  Preconditioner 2  2256 0
   Index Set 5  52596120 0
 Vec40 40   62243224 0
 Vec Scatter 1  1  0 0
 
  Average time to get PetscTime(): 4.05312e-07  Average time
 for MPI_Barrier(): 7.62939e-07
  Average time for zero size MPI_Send(): 2.02656e-06
  OptionTable: -log_summary


  The PETSc manual states that ratio 

Slow speed after changing from serial to parallel (with ex2f.F)

2008-04-15 Thread Matthew Knepley
On Tue, Apr 15, 2008 at 9:08 PM, Ben Tay zonexo at gmail.com wrote:

  Hi,

  I just tested the ex2f.F example, changing m and n to 600. Here's the
 result for 1, 2 and 4 processors. Interestingly, MatAssemblyBegin,
 MatGetOrdering and KSPSetup have ratios > 1. The time taken seems to
 decrease as the number of processors increases, although speedup is not
 1:1. I thought
 that this example should scale well, shouldn't it? Is there something wrong
 with my installation then?

1) Notice that the events that are unbalanced take 0.01% of the time.
Not important.

2) The speedup really stinks, even though this is a small problem. Are
you sure that you are actually running on two processors with separate
memory pipes and not on 1 dual core?
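
One quick way to check, on the node the job actually ran on [Linux
specific - just a sketch]:

grep 'physical id' /proc/cpuinfo | sort -u   # distinct ids = distinct sockets
grep -c processor /proc/cpuinfo              # total cores the OS sees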

Matt

  Thank you.

  1 processor:

  Norm of error 0.3371E+01 iterations  1153

 
  *** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r
 -fCourier9' to print this document***

 

  -- PETSc Performance Summary:
 --

  ./a.out on a atlas3-mp named atlas3-c58 with 1 processor, by g0306332 Wed
 Apr 16 10:03:12 2008
  Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG
 revision: 414581156e67e55c761739b0deb119f7590d0f4b

   Max   Max/MinAvg  Total
  Time (sec):   1.222e+02  1.0   1.222e+02
  Objects:  4.400e+01  1.0   4.400e+01
  Flops:3.547e+10  1.0   3.547e+10  3.547e+10
  Flops/sec:2.903e+08  1.0   2.903e+08  2.903e+08
  MPI Messages: 0.000e+00  0.0   0.000e+00  0.000e+00
  MPI Message Lengths:  0.000e+00  0.0   0.000e+00  0.000e+00
  MPI Reductions:   2.349e+03  1.0

  Flop counting convention: 1 flop = 1 real number operation of type
 (multiply/divide/add/subtract)
  e.g., VecAXPY() for real vectors of length N
 -- 2N flops
  and VecAXPY() for complex vectors of length N
 -- 8N flops

  Summary of Stages:   - Time --  - Flops -  --- Messages ---
 -- Message Lengths --  -- Reductions --
  Avg %Total Avg %Total   counts   %Total
 Avg %Total   counts   %Total
   0:  Main Stage: 1.2216e+02 100.0%  3.5466e+10 100.0%  0.000e+00   0.0%
 0.000e+000.0%  2.349e+03 100.0%


 
  See the 'Profiling' chapter of the users' manual for details on
 interpreting output.
  Phase summary info:
 Count: number of times phase was executed
 Time and Flops/sec: Max - maximum over all processors
 Ratio - ratio of maximum to minimum over all
 processors
 Mess: number of messages sent
 Avg. len: average message length
 Reduct: number of global reductions
 Global: entire computation
 Stage: stages of a computation. Set stages with PetscLogStagePush() and
 PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this
 phase
%M - percent messages in this phase %L - percent message lengths
 in this phase
%R - percent reductions in this phase
 Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over
 all processors)

 


##
##
#  WARNING!!!#
##
#   This code was run without the PreLoadBegin() #
#   macros. To get timing results we always recommend#
#   preloading. otherwise timing numbers may be  #
#   meaningless. #
##

  EventCount  Time (sec) Flops/sec
 --- Global ---  --- Stage ---   Total
 Max Ratio  Max Ratio   Max  Ratio  Mess   Avg len
 Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s

 

  --- Event Stage 0: Main Stage

  MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00
 0.0e+00 13 11  0  0  0  13 11  0  0  0   239
  MatSolve1192 1.0 3.1017e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00
 0.0e+00 25 11  

Slow speed after changing from serial to parallel (with ex2f.F)

2008-04-15 Thread Satish Balay
On Wed, 16 Apr 2008, Ben Tay wrote:

 I think you may be right. My school uses :

   No of Nodes   Processors                   Qty per node   Total cores per node   Memory per node
    4            Quad-Core Intel Xeon X5355   2              8                      16 GB
   60            Dual-Core Intel Xeon 5160    2              4                      8 GB


I've attempted to run the same ex2f on a 2x quad-core Intel Xeon X5355
machine [with gcc/gfortran and the latest mpich2 built with
--with-device=ch3:nemesis:newtcp] - and I 
get the following:

 Logs for my run are attached 

asterix:/home/balay/download-pine> grep MatMult *
ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 
0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   397
ex2f-600-2p.log:MatMult 1217 1.0 6.2256e+00 1.0 1.97e+09 1.0 
2.4e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   632
ex2f-600-4p.log:MatMult  969 1.0 4.3311e+00 1.0 7.84e+08 1.0 
5.8e+03 4.8e+03 0.0e+00 15 11100100  0  15 11100100  0   724
ex2f-600-8p.log:MatMult 1318 1.0 5.6966e+00 1.0 5.33e+08 1.0 
1.8e+04 4.8e+03 0.0e+00 16 11100100  0  16 11100100  0   749
asterix:/home/balay/download-pine> grep KSPSolve *
ex2f-600-1p.log:KSPSolve   1 1.0 6.9165e+01 1.0 3.55e+10 1.0 
0.0e+00 0.0e+00 2.3e+03100100  0  0100 100100  0  0100   513
ex2f-600-2p.log:KSPSolve   1 1.0 4.4005e+01 1.0 1.81e+10 1.0 
2.4e+03 4.8e+03 2.4e+03100100100100100 100100100100100   824
ex2f-600-4p.log:KSPSolve   1 1.0 2.8139e+01 1.0 7.21e+09 1.0 
5.8e+03 4.8e+03 1.9e+03100100100100 99 100100100100 99  1024
ex2f-600-8p.log:KSPSolve   1 1.0 3.6260e+01 1.0 4.90e+09 1.0 
1.8e+04 4.8e+03 2.6e+03100100100100100 100100100100100  1081
asterix:/home/balay/download-pine


You get the following [with intel compilers?]:

asterix:/home/balay/download-pine/x> grep MatMult *
log.1:MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 13 11  0  0  0  13 11  0  0  0   239
log.2:MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 
0.0e+00 11 11100100  0  11 11100100  0   315
log.4:MatMult  969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 
0.0e+00  8 11100100  0   8 11100100  0   321
asterix:/home/balay/download-pine/x> grep KSPSolve *
log.1:KSPSolve   1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 
2.3e+03100100  0  0100 100100  0  0100   292
log.2:KSPSolve   1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 
2.4e+03 99100100100100  99100100100100   352
log.4:KSPSolve   1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 
1.9e+03 98100100100 99  98100100100 99   461
asterix:/home/balay/download-pine/x

What exact CPU was this run on?

A couple of comments:
- my runs for MatMult have 1.0 ratio for 2,4,8 proc runs, while yours have 1.2, 
3.6 for 2,4 proc runs [so higher
  load imbalance on your machine]
- The peaks are also lower - not sure why. 397 for 1p-MatMult for me - vs 239 
for you
- Speedups I see for MatMult [computed from the MatMult Mflop/s rates
relative to the 1-proc run] are:

np   me     you

2    1.59   1.32
4    1.82   1.34
8    1.88

--

The primary issue is expecting a speedup of 4 from 4 cores and 8 from 8 cores.

As Matt indicated [perhaps in the 'general question on speed using quad
core Xeons' thread], for sparse linear algebra the performance is limited
by memory bandwidth - not CPU.

So one has to look at the hardware memory architecture of the machine
if you expect scalability.

The 2x quad-core has a memory architecture that gives 11 GB/s if one
CPU socket is used, but 22 GB/s when both CPU sockets are used
[irrespective of the number of cores in each CPU socket]. One
inference is that a max speedup of 2 can be obtained from such a machine
[due to the 2-memory-bank architecture].

So if you have 2 such machines [i.e. 4 memory banks] - then you can
expect a theoretical max speedup of 4.

We are generally used to evaluating performance/cpu [or core]. Here
the scalability numbers suck.

However if you do performance/number-of-memory-banks - then things look better.

It's just that we are used to always expecting scalability per node and
assuming it translates to scalability per core. [However, scalability
per node was really about scalability per memory bank - before
multicore CPUs took over.]


There is also another measure - performance/dollar spent. Generally
the extra cores are practically free - so here this measure also holds
up ok.

Satish
-- next part --

*** WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r 
-fCourier9' to print this document***


-- PETSc Performance Summary: 
--

./ex2f on a linux-tes named intel-loaner1 with 1 processor, by balay Tue Apr 15 
22:02:38 2008
Using Petsc Development Version 

Slow speed after changing from serial to parallel

2008-04-14 Thread Ben Tay
Thank you Matthew. Sorry to trouble you again.

I tried to run it with -log_summary output and I found that there are some 
errors in the execution. Well, I was busy with other things and I just 
came back to this problem. Some of my files on the server have also been 
deleted. It has been a while and I remember that it worked before, 
only much slower.

Anyway, most of the serial code has been updated and maybe it's easier 
to convert the new serial code instead of debugging on the old parallel 
code now. I believe I can still reuse part of the old parallel code. 
However, I hope I can approach it better this time.

So suppose I need to start converting my new serial code to parallel. 
There are 2 eqns to be solved using PETSc, the momentum and poisson. I 
also need to parallelize other parts of my code. I wonder which route is 
the best:

1. Don't change the PETSc part, i.e. continue using PETSC_COMM_SELF; modify 
other parts of my code to be parallel, e.g. looping, updating of values etc. 
Once the execution is fine and speedup is reasonable, then modify the 
PETSc part - poisson eqn first, followed by the momentum eqn.

2. Reverse the above order, i.e. modify the PETSc part - poisson eqn first, 
followed by the momentum eqn. Then do other parts of my code.

I'm not sure if the above 2 methods can work or if there will be conflicts. 
Of course, an alternative will be:

3. Do the poisson, momentum eqns and other parts of the code separately. 
That is, code a standalone parallel poisson eqn and use sample values 
to test it. Same for the momentum and other parts of the code. When each 
of them is working, combine them to form the full parallel code. 
However, this will be much more troublesome.

I hope someone can give me some recommendations.

Thank you once again.

Matthew Knepley wrote:
 1) There is no way to have any idea what is going on in your code
 without -log_summary output

 2) Looking at that output, look at the percentage taken by the solver
 KSPSolve event. I suspect it is not the biggest component, because
it is very scalable.

Matt

 On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay zonexo at gmail.com wrote:
   
 Hi,

 I've a serial 2D CFD code. As my grid size requirement increases, the
 simulation takes longer. Also, memory requirement becomes a problem. Grid
 size 've reached 1200x1200. Going higher is not possible due to memory
 problem.

 I tried to convert my code to a parallel one, following the examples given.
 I also need to restructure parts of my code to enable parallel looping. I
 1st changed the PETSc solver to be parallel enabled and then I restructured
 parts of my code. I proceed on as longer as the answer for a simple test
 case is correct. I thought it's not really possible to do any speed testing
 since the code is not fully parallelized yet. When I finished during most of
 the conversion, I found that in the actual run that it is much slower,
 although the answer is correct.

 So what is the remedy now? I wonder what I should do to check what's wrong.
 Must I restart everything again? Btw, my grid size is 1200x1200. I believed
 it should be suitable for parallel run of 4 processors? Is that so?

 Thank you.
 



   




Slow speed after changing from serial to parallel

2008-04-14 Thread Matthew Knepley
I am not sure why you would ever have two codes. I never do this. PETSc
is designed to write one code to run in serial and parallel. The PETSc part
should look identical. To test, run the code yo uhave verified in serial and
output PETSc data structures (like Mat and Vec) using a binary viewer.
Then run in parallel with the same code, which will output the same
structures. Take the two files and write a small verification code that
loads both versions and calls MatEqual and VecEqual.

  Matt

On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay zonexo at gmail.com wrote:
 Thank you Matthew. Sorry to trouble you again.

  I tried to run it with -log_summary output and I found that there's some
 errors in the execution. Well, I was busy with other things and I just came
 back to this problem. Some of my files on the server has also been deleted.
 It has been a while and I  remember that  it worked before, only much
 slower.

  Anyway, most of the serial code has been updated and maybe it's easier to
 convert the new serial code instead of debugging on the old parallel code
 now. I believe I can still reuse part of the old parallel code. However, I
 hope I can approach it better this time.

  So supposed I need to start converting my new serial code to parallel.
 There's 2 eqns to be solved using PETSc, the momentum and poisson. I also
 need to parallelize other parts of my code. I wonder which route is the
 best:

  1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, modify
 other parts of my code to parallel e.g. looping, updating of values etc.
 Once the execution is fine and speedup is reasonable, then modify the PETSc
 part - poisson eqn 1st followed by the momentum eqn.

  2. Reverse the above order ie modify the PETSc part - poisson eqn 1st
 followed by the momentum eqn. Then do other parts of my code.

  I'm not sure if the above 2 mtds can work or if there will be conflicts. Of
 course, an alternative will be:

  3. Do the poisson, momentum eqns and other parts of the code separately.
 That is, code a standalone parallel poisson eqn and use samples values to
 test it. Same for the momentum and other parts of the code. When each of
 them is working, combine them to form the full parallel code. However, this
 will be much more troublesome.

  I hope someone can give me some recommendations.

  Thank you once again.



  Matthew Knepley wrote:

  1) There is no way to have any idea what is going on in your code
 without -log_summary output
 
  2) Looking at that output, look at the percentage taken by the solver
 KSPSolve event. I suspect it is not the biggest component, because
it is very scalable.
 
Matt
 
  On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay zonexo at gmail.com wrote:
 
 
   Hi,
  
   I've a serial 2D CFD code. As my grid size requirement increases, the
   simulation takes longer. Also, memory requirement becomes a problem.
 Grid
   size 've reached 1200x1200. Going higher is not possible due to memory
   problem.
  
   I tried to convert my code to a parallel one, following the examples
 given.
   I also need to restructure parts of my code to enable parallel looping.
 I
   1st changed the PETSc solver to be parallel enabled and then I
 restructured
   parts of my code. I proceed on as longer as the answer for a simple test
   case is correct. I thought it's not really possible to do any speed
 testing
   since the code is not fully parallelized yet. When I finished during
 most of
   the conversion, I found that in the actual run that it is much slower,
   although the answer is correct.
  
   So what is the remedy now? I wonder what I should do to check what's
 wrong.
   Must I restart everything again? Btw, my grid size is 1200x1200. I
 believed
   it should be suitable for parallel run of 4 processors? Is that so?
  
   Thank you.
  
  
 
 
 
 
 





-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener




Slow speed after changing from serial to parallel

2008-04-14 Thread Ben Tay
Hi Matthew,

I think you've misunderstood what I meant. What I'm trying to say is 
initially I had a serial code. I tried to convert it to a parallel one. 
Then I tested it and it was pretty slow. Due to some work requirements, I 
needed to go back to make some changes to my code. Since the parallel one 
was not working well, I updated and changed the serial one.

Well, that was a while ago and now, due to the updates and changes, the 
serial code is different from the old converted parallel code. Some 
files were also deleted and I can't seem to get it working now. So I 
thought I might as well convert the new serial code to parallel. But I'm 
not very sure what I should do first.

Maybe I should rephrase my question: if I just convert my poisson 
equation subroutine from the serial PETSc version to a parallel PETSc 
version, will it work? Should I expect a speedup? The rest of my code is 
still serial.

Thank you very much.

Matthew Knepley wrote:
 I am not sure why you would ever have two codes. I never do this. PETSc
 is designed to write one code to run in serial and parallel. The PETSc part
  should look identical. To test, run the code you have verified in serial and
 output PETSc data structures (like Mat and Vec) using a binary viewer.
 Then run in parallel with the same code, which will output the same
 structures. Take the two files and write a small verification code that
 loads both versions and calls MatEqual and VecEqual.

   Matt

 On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay zonexo at gmail.com wrote:
   
 Thank you Matthew. Sorry to trouble you again.

  I tried to run it with -log_summary output and I found that there's some
 errors in the execution. Well, I was busy with other things and I just came
 back to this problem. Some of my files on the server has also been deleted.
 It has been a while and I  remember that  it worked before, only much
 slower.

  Anyway, most of the serial code has been updated and maybe it's easier to
 convert the new serial code instead of debugging on the old parallel code
 now. I believe I can still reuse part of the old parallel code. However, I
 hope I can approach it better this time.

  So supposed I need to start converting my new serial code to parallel.
 There's 2 eqns to be solved using PETSc, the momentum and poisson. I also
 need to parallelize other parts of my code. I wonder which route is the
 best:

  1. Don't change the PETSc part ie continue using PETSC_COMM_SELF, modify
 other parts of my code to parallel e.g. looping, updating of values etc.
 Once the execution is fine and speedup is reasonable, then modify the PETSc
 part - poisson eqn 1st followed by the momentum eqn.

  2. Reverse the above order ie modify the PETSc part - poisson eqn 1st
 followed by the momentum eqn. Then do other parts of my code.

  I'm not sure if the above 2 mtds can work or if there will be conflicts. Of
 course, an alternative will be:

  3. Do the poisson, momentum eqns and other parts of the code separately.
 That is, code a standalone parallel poisson eqn and use samples values to
 test it. Same for the momentum and other parts of the code. When each of
 them is working, combine them to form the full parallel code. However, this
 will be much more troublesome.

  I hope someone can give me some recommendations.

  Thank you once again.



  Matthew Knepley wrote:

 
 1) There is no way to have any idea what is going on in your code
without -log_summary output

 2) Looking at that output, look at the percentage taken by the solver
KSPSolve event. I suspect it is not the biggest component, because
   it is very scalable.

   Matt

 On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay zonexo at gmail.com wrote:


   
 Hi,

 I've a serial 2D CFD code. As my grid size requirement increases, the
 simulation takes longer. Also, memory requirement becomes a problem.
 
 Grid
 
 size 've reached 1200x1200. Going higher is not possible due to memory
 problem.

 I tried to convert my code to a parallel one, following the examples
 
 given.
 
 I also need to restructure parts of my code to enable parallel looping.
 
 I
 
 1st changed the PETSc solver to be parallel enabled and then I
 
 restructured
 
 parts of my code. I proceed on as longer as the answer for a simple test
 case is correct. I thought it's not really possible to do any speed
 
 testing
 
 since the code is not fully parallelized yet. When I finished during
 
 most of
 
 the conversion, I found that in the actual run that it is much slower,
 although the answer is correct.

 So what is the remedy now? I wonder what I should do to check what's
 
 wrong.
 
 Must I restart everything again? Btw, my grid size is 1200x1200. I
 
 believed
 
 it should be suitable for parallel run of 4 processors? Is that so?

 Thank you.


 



   
 



   




Slow speed after changing from serial to parallel

2008-04-14 Thread Matthew Knepley
On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay zonexo at gmail.com wrote:
 Hi Matthew,

  I think you've misunderstood what I meant. What I'm trying to say is
 initially I've got a serial code. I tried to convert to a parallel one. Then
 I tested it and it was pretty slow. Due to some work requirement, I need to
 go back to make some changes to my code. Since the parallel is not working
 well, I updated and changed the serial one.

  Well, that was a while ago and now, due to the updates and changes, the
 serial code is different from the old converted parallel code. Some files
 were also deleted and I can't seem to get it working now. So I thought I
 might as well convert the new serial code to parallel. But I'm not very sure
 what I should do 1st.

  Maybe I should rephrase my question in that if I just convert my poisson
 equation subroutine from a serial PETSc to a parallel PETSc version, will it
 work? Should I expect a speedup? The rest of my code is still serial.

You should, of course, only expect speedup in the parallel parts.
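
Roughly, Amdahl's law gives the bound: if a fraction f of the runtime is
in the part you parallelize over N processes,

  speedup(N) <= 1 / ( (1 - f) + f/N )

so e.g. with f = 0.5 and N = 4 the whole code can get at most ~1.6x faster,
no matter how well the PETSc part scales.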

  Matt

  Thank you very much.



  Matthew Knepley wrote:

  I am not sure why you would ever have two codes. I never do this. PETSc
  is designed to write one code to run in serial and parallel. The PETSc
 part
  should look identical. To test, run the code you have verified in serial
 and
  output PETSc data structures (like Mat and Vec) using a binary viewer.
  Then run in parallel with the same code, which will output the same
  structures. Take the two files and write a small verification code that
  loads both versions and calls MatEqual and VecEqual.
 
   Matt
 
  On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay zonexo at gmail.com wrote:
 
 
   Thank you Matthew. Sorry to trouble you again.
  
I tried to run it with -log_summary output and I found that there's
 some
   errors in the execution. Well, I was busy with other things and I just
 came
   back to this problem. Some of my files on the server has also been
 deleted.
   It has been a while and I  remember that  it worked before, only much
   slower.
  
Anyway, most of the serial code has been updated and maybe it's easier
 to
   convert the new serial code instead of debugging on the old parallel
 code
   now. I believe I can still reuse part of the old parallel code. However,
 I
   hope I can approach it better this time.
  
So supposed I need to start converting my new serial code to parallel.
   There's 2 eqns to be solved using PETSc, the momentum and poisson. I
 also
   need to parallelize other parts of my code. I wonder which route is the
   best:
  
1. Don't change the PETSc part ie continue using PETSC_COMM_SELF,
 modify
   other parts of my code to parallel e.g. looping, updating of values etc.
   Once the execution is fine and speedup is reasonable, then modify the
 PETSc
   part - poisson eqn 1st followed by the momentum eqn.
  
2. Reverse the above order ie modify the PETSc part - poisson eqn 1st
   followed by the momentum eqn. Then do other parts of my code.
  
I'm not sure if the above 2 mtds can work or if there will be
 conflicts. Of
   course, an alternative will be:
  
3. Do the poisson, momentum eqns and other parts of the code
 separately.
   That is, code a standalone parallel poisson eqn and use samples values
 to
   test it. Same for the momentum and other parts of the code. When each of
   them is working, combine them to form the full parallel code. However,
 this
   will be much more troublesome.
  
I hope someone can give me some recommendations.
  
Thank you once again.
  
  
  
Matthew Knepley wrote:
  
  
  
1) There is no way to have any idea what is going on in your code
  without -log_summary output
   
2) Looking at that output, look at the percentage taken by the solver
  KSPSolve event. I suspect it is not the biggest component, because
 it is very scalable.
   
 Matt
   
On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay zonexo at gmail.com wrote:
   
   
   
   
 Hi,

 I've a serial 2D CFD code. As my grid size requirement increases,
 the
 simulation takes longer. Also, memory requirement becomes a problem.


   
   Grid
  
  
   
 size 've reached 1200x1200. Going higher is not possible due to
 memory
 problem.

 I tried to convert my code to a parallel one, following the examples


   
   given.
  
  
   
 I also need to restructure parts of my code to enable parallel
 looping.


   
   I
  
  
   
 1st changed the PETSc solver to be parallel enabled and then I


   
   restructured
  
  
   
 parts of my code. I proceed on as longer as the answer for a simple
 test
 case is correct. I thought it's not really possible to do any speed


   
   testing
  
  
   
 since the code is not fully parallelized yet. When I finished during


   
   most of
  
  
   
 the conversion, I found that in the actual run that it is much
 

Slow speed after changing from serial to parallel

2008-04-13 Thread Ben Tay
Hi,

I've a serial 2D CFD code. As my grid size requirement increases, the
simulation takes longer. Also, memory requirement becomes a problem. Grid
size has reached 1200x1200. Going higher is not possible due to memory
problems.

I tried to convert my code to a parallel one, following the examples given.
I also needed to restructure parts of my code to enable parallel looping. I
first changed the PETSc solver to be parallel enabled and then I restructured
parts of my code. I proceeded as long as the answer for a simple test
case was correct. I thought it's not really possible to do any speed testing
since the code is not fully parallelized yet. When I had finished most of
the conversion, I found in the actual run that it is much slower,
although the answer is correct.

So what is the remedy now? I wonder what I should do to check what's wrong.
Must I restart everything again? Btw, my grid size is 1200x1200. I believe
it should be suitable for a parallel run on 4 processors? Is that so?

Thank you.
-- next part --
An HTML attachment was scrubbed...
URL: 
http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20080413/9c4ce213/attachment.htm


Slow speed after changing from serial to parallel

2008-04-13 Thread Matthew Knepley
1) There is no way to have any idea what is going on in your code
without -log_summary output

2) Looking at that output, look at the percentage taken by the solver
KSPSolve event. I suspect it is not the biggest component, because
   it is very scalable.

   Matt

On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay zonexo at gmail.com wrote:


 Hi,

 I've a serial 2D CFD code. As my grid size requirement increases, the
 simulation takes longer. Also, memory requirement becomes a problem. Grid
 size 've reached 1200x1200. Going higher is not possible due to memory
 problem.

 I tried to convert my code to a parallel one, following the examples given.
 I also need to restructure parts of my code to enable parallel looping. I
 1st changed the PETSc solver to be parallel enabled and then I restructured
 parts of my code. I proceed on as longer as the answer for a simple test
 case is correct. I thought it's not really possible to do any speed testing
 since the code is not fully parallelized yet. When I finished during most of
 the conversion, I found that in the actual run that it is much slower,
 although the answer is correct.

 So what is the remedy now? I wonder what I should do to check what's wrong.
 Must I restart everything again? Btw, my grid size is 1200x1200. I believed
 it should be suitable for parallel run of 4 processors? Is that so?

 Thank you.



-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener