Slow speed after changing from serial to parallel (with ex2f.F)
Hi Satish,

First of all, I forgot to inform you that I've changed m and n to 800. I would like to see if the larger value makes the scaling better. If required, I can redo the test with m,n = 600.

I can install MPICH, but I don't think I can choose to run on a single machine using from 1 to 8 procs. In order to run the code, I usually have to use the commands

  bsub -o log -q linux64 ./a.out                                      (single proc)
  bsub -o log -q mcore_parallel -n $ -a mvapich mpirun.lsf ./a.out    (multiple procs, where $ = no. of procs)

After that, when the job is running, I'll be given the server which my job runs on, e.g. atlas3-c10 (1 proc), or 2*atlas3-c10 + 2*atlas3-c12 (4 procs), or 2*atlas3-c10 + 2*atlas3-c12 + 2*atlas3-c11 + 2*atlas3-c13 (8 procs). I was told that 2*atlas3-c10 doesn't mean that it is running on a dual-core single CPU.

Btw, are you saying that I should first install the latest MPICH2 build with the option

  ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

and then install PETSc with that MPICH2? And after that, do you know how to do what you've suggested on my servers? I don't really understand what you mean. Am I supposed to run 4 jobs on 1 quad core? Or 1 job using 4 cores on 1 quad core? Well, I do know that atlas3-c00 to c03 are the locations of the quad cores. I can force the job to use them with

  bsub -o log -q mcore_parallel -n $ -m quadcore -a mvapich mpirun.lsf ./a.out

Lastly, I made a mistake about the different times reported by the same compiler. Sorry about that. Thank you very much.

Satish Balay wrote:

On Sat, 19 Apr 2008, Ben Tay wrote:

Btw, I'm not able to try the latest mpich2 because I do not have administrator rights. I was told that some special configuration is required.

You don't need admin rights to install/use MPICH with the options I mentioned. I was suggesting just running in SMP mode on a single machine [from 1 to 8 procs on a Quad-Core Intel Xeon X5355, to compare with my SMP runs] with:

  ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

Btw, should there be any difference in speed whether I use MPIUNI and ifort, or MPI and mpif90? I tried ex2f (below) and there's only a small difference. If there is a large difference (MPI being slower), does it mean there's something wrong in the code?

For one - you are not using MPIUNI. You are using --with-mpi-dir=/lsftmp/g0306332/mpich2. However - if the compilers are the same and the compiler options are the same, I would expect the same performance in both cases. Do you get such different times for different runs of the same binary? MatMult 384 vs 423.

What if you run both binaries on the same machine [as a single job]? If you are using the PBS scheduler - I suggest:
- qsub -I [to get interactive access to the nodes]
- log in to each node - to check no one else is using the scheduled nodes
- run multiple jobs during this single allocation for comparison.
These are general tips to help you debug performance on your cluster.

BTW: I get:
ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   397
You get:
log.1:MatMult 1879 1.0 2.8137e+01 1.0 3.84e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11  0  0  0  12 11  0  0  0   384
There is a difference in the number of iterations. Are you sure you are using the same ex2f with the -m 600 -n 600 options?

Satish
Slow speed after changing from serial to parallel (with ex2f.F)
On Sat, 19 Apr 2008, Ben Tay wrote:

Btw, I'm not able to try the latest mpich2 because I do not have administrator rights. I was told that some special configuration is required.

You don't need admin rights to install/use MPICH with the options I mentioned. I was suggesting just running in SMP mode on a single machine [from 1 to 8 procs on a Quad-Core Intel Xeon X5355, to compare with my SMP runs] with:

  ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

Btw, should there be any difference in speed whether I use MPIUNI and ifort, or MPI and mpif90? I tried ex2f (below) and there's only a small difference. If there is a large difference (MPI being slower), does it mean there's something wrong in the code?

For one - you are not using MPIUNI. You are using --with-mpi-dir=/lsftmp/g0306332/mpich2. However - if the compilers are the same and the compiler options are the same, I would expect the same performance in both cases. Do you get such different times for different runs of the same binary? MatMult 384 vs 423.

What if you run both binaries on the same machine [as a single job]? If you are using the PBS scheduler - I suggest:
- qsub -I [to get interactive access to the nodes]
- log in to each node - to check no one else is using the scheduled nodes
- run multiple jobs during this single allocation for comparison.
These are general tips to help you debug performance on your cluster.

BTW: I get:
ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   397
You get:
log.1:MatMult 1879 1.0 2.8137e+01 1.0 3.84e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11  0  0  0  12 11  0  0  0   384
There is a difference in the number of iterations. Are you sure you are using the same ex2f with the -m 600 -n 600 options?

Satish
Slow speed after changing from serial to parallel (with ex2f.F)
Ben,

This conversation is getting long and winding, and we are getting into your cluster administration - which is not PETSc related. I'll suggest you figure out how to use the cluster and bsub from your system admin. http://www.vub.ac.be/BFUCC/LSF/man/bsub.1.html

However I'll point out the following things.

- I suggest learning about scheduling an interactive job on your cluster. This will help you with running multiple jobs on the same machine.

- When making comparisons, have minimal changes between the runs you compare.

  * For example: you are comparing runs between different queues '-q linux64' and '-q mcore_parallel'. There might be differences here that can result in different performance.

  * If you are getting part of a machine [for -n 1 jobs] - verify whether you are sharing the other part with some other job. Without this verification your numbers are not meaningful. [Depending upon how the queue is configured - it can allocate either part of the node or the full node.]

  * You should be able to request 4 procs [i.e. 1 complete machine] but run either -np 1, 2 or 4 on that allocation. [This is easier to do in interactive mode.] This ensures nobody else is using the machine, and you can run your code multiple times - to see if you are getting consistent results.

Regarding the primary issue you've had - performance-debugging your PETSc application in *SMP mode* - we've observed performance anomalies in your log_summary for both your code and ex2f.F. This could be due to one or more of the following:
- issues in your code
- issues with the MPI you are using
- issues with the cluster you are using

To narrow down - the comparisons I suggest:

- Compare my ex2f.F results with the *exact* same runs on your machine. [You've claimed that you also have access to a 2x quad-core Intel Xeon X5355 machine, so you should be able to reproduce the exact same experiment as me and compare the results. This keeps the software the same - and shows up differences in system software etc.]

    No. of Nodes   Processors                   Qty per node   Total cores per node   Memory per node
    4              Quad-Core Intel Xeon X5355   2              8                      16 GB    <---
    60             Dual-Core Intel Xeon 5160    2              4                      8 GB

  i.e. configure the latest mpich2 with [default compilers gcc/gfortran]:

    ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

  Build PETSc with this MPI [and the same compilers]:

    ./config/configure.py --with-mpi-dir= --with-debugging=0

  And run ex2f.F 600x600 on 1, 2, 4, 8 procs on a *single* X5355 machine. [It might have a different queue name.]

- Now compare ex2f.F performance with MPICH [as built above] against the MPI you are currently using. This should identify the performance differences between MPI implementations within the box [within the SMP box].

- Now compare runs between ex2f.F and your application.

At each of the above comparison steps we hope to identify the reason for the differences and rectify it. Perhaps this is not possible on your cluster and you can't improve on what you already have. If you can't debug the SMP performance issues, you can avoid SMP completely and use 1 MPI task per machine [or 1 MPI task per memory bank = 2 per machine]. But you'll still have to do similar analysis to make sure there are no performance anomalies in the tool chain [i.e. hardware, system software, MPI, application].

If you are willing to do the above steps, we can help with the comparisons. As mentioned - this is getting long and winding.
If you have further questions in this regard - we should continue at petsc-maint at mcs.anl.gov.

Satish

On Sat, 19 Apr 2008, Ben Tay wrote:

Hi Satish, first of all, I forgot to inform you that I've changed m and n to 800. I would like to see if the larger value makes the scaling better. If required, I can redo the test with m,n = 600. I can install MPICH but I don't think I can choose to run on a single machine using from 1 to 8 procs. In order to run the code, I usually have to use the commands

  bsub -o log -q linux64 ./a.out                                      (single proc)
  bsub -o log -q mcore_parallel -n $ -a mvapich mpirun.lsf ./a.out    (multiple procs, where $ = no. of procs)

After that, when the job is running, I'll be given the server which my job runs on, e.g. atlas3-c10 (1 proc), or 2*atlas3-c10 + 2*atlas3-c12 (4 procs), or 2*atlas3-c10 + 2*atlas3-c12 + 2*atlas3-c11 + 2*atlas3-c13 (8 procs). I was told that 2*atlas3-c10 doesn't mean that it is running on a dual-core single CPU. Btw, are you saying that I should first install the latest MPICH2 build with the option

  ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

and then install PETSc with that MPICH2? So after that do you know how to do what you've suggested on my servers? I don't really understand what you mean. Am I supposed to run 4 jobs on 1 quad core? Or 1 job using 4 cores on 1 quad core? Well, I do know that atlas3-c00 to c03 are the locations of the quad cores. I can force to use them
Slow speed after changing from serial to parallel
On Fri, 18 Apr 2008, Ben Tay wrote:

Hi, I've emailed my school's supercomputing staff and they told me that the queue which I'm using is one meant for testing; hence its handling of the workload is not good. I've sent my job to another queue and it's run on 4 processors. It's my own code because there seems to be something wrong with the server displaying the summary when using -log_summary with ex2f.F. I'm trying it again.

That's weird. We should first make sure ex2f [or ex2] runs properly before looking at your code.

Anyway, comparing just KSPSolve between the two, the speedup is about 2.7. However, I noticed that for the 4-processor run, MatAssemblyBegin is 1.5158e+02, which is more than KSPSolve's 4.7041e+00. So is MatAssemblyBegin's time included in KSPSolve? If not, does it mean that there's something wrong with my MatAssemblyBegin?

MatAssemblyBegin is not included in KSPSolve(). Something weird is going on here. There are 2 possibilities:
- whatever code you have before matrix assembly is unbalanced, so MatAssemblyBegin() acts as a barrier [a diagnostic sketch follows this message]
- MPI communication is not optimal within the node.

It's best to first make sure ex2 or ex2f runs fine. As recommended earlier - you should try the latest mpich2 with --with-device=ch3:nemesis:newtcp and compare ex2/ex2f performance with your current MPI.

Satish
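One way to test the first possibility above (a sketch, not from the original thread): time an explicit barrier just before assembly. If the barrier absorbs the ~150 s and MatAssemblyBegin() then shows a ratio near 1 in -log_summary, the imbalance is in the code that fills the matrix rather than in the MPI layer. C is shown with current PETSc signatures (the 2.3.x-era names differ slightly, e.g. MatCreateMPIAIJ and PetscGetTime; the Fortran calls are analogous):

  #include <petscmat.h>
  #include <petsctime.h>

  int main(int argc, char **argv)
  {
    Mat            A;
    PetscInt       i, Istart, Iend, n = 1000;
    PetscMPIInt    rank;
    PetscLogDouble t0, t1;
    PetscScalar    one = 1.0;

    PetscInitialize(&argc, &argv, NULL, NULL);
    MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

    /* placeholder matrix; in the real code all the per-process filling happens here */
    MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 1, NULL, 0, NULL, &A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) MatSetValues(A, 1, &i, 1, &i, &one, INSERT_VALUES);

    PetscTime(&t0);
    MPI_Barrier(PETSC_COMM_WORLD);     /* absorbs any load imbalance from the code above */
    PetscTime(&t1);
    PetscPrintf(PETSC_COMM_SELF, "[%d] waited %g s at barrier\n", rank, (double)(t1 - t0));

    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);   /* should now show a ratio near 1 */
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatDestroy(&A);
    PetscFinalize();
    return 0;
  }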
Slow speed after changing from serial to parallel
Oh sorry, here's the whole information. I'm using 2 processors currently:

*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***

---------------- PETSc Performance Summary: ----------------

./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 Tue Apr 15 23:03:09 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b

                       Max         Max/Min   Avg         Total
Time (sec):            1.114e+03   1.00054   1.114e+03
Objects:               5.400e+01   1.0       5.400e+01
Flops:                 1.574e+11   1.0       1.574e+11   3.147e+11
Flops/sec:             1.414e+08   1.00054   1.413e+08   2.826e+08
MPI Messages:          8.777e+03   1.0       8.777e+03   1.755e+04
MPI Message Lengths:   4.213e+07   1.0       4.800e+03   8.425e+07
MPI Reductions:        8.644e+03   1.0

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:  ---- Time ----  ---- Flops ----  --- Messages ---  -- Message Lengths --  -- Reductions --
                    Avg    %Total   Avg     %Total   counts   %Total   Avg          %Total    counts   %Total
 0: Main Stage:  1.1136e+03 100.0%  3.1475e+11 100.0%  1.755e+04 100.0%  4.800e+03 100.0%  1.729e+04 100.0%

See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
  Count: number of times phase was executed
  Time and Flops/sec: Max - maximum over all processors
                      Ratio - ratio of maximum to minimum over all processors
  Mess: number of messages sent
  Avg. len: average message length
  Reduct: number of global reductions
  Global: entire computation
  Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
    %T - percent time in this phase         %F - percent flops in this phase
    %M - percent messages in this phase     %L - percent message lengths in this phase
    %R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)

  ##########################################################
  #                      WARNING!!!                        #
  #  This code was run without the PreLoadBegin()          #
  #  macros. To get timing results we always recommend     #
  #  preloading. otherwise timing numbers may be           #
  #  meaningless.                                          #
  ##########################################################

Event              Count      Time (sec)   Flops/sec   --- Global ---  --- Stage ---  Total
                   Max Ratio  Max   Ratio  Max  Ratio  Mess  Avg len  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s

--- Event Stage 0: Main Stage

MatMult           8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100  0  10 11100100  0   217
MatSolve          8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11  0  0  0  17 11  0  0  0   120
MatLUFactorNum       1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   140
MatILUFactorSym      1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin     1 1.0 5.6334e+01 853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  3  0  0  0  0   3  0  0  0  0     0
MatAssemblyEnd       1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ          1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering       1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries       1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
Slow speed after changing from serial to parallel
Hi,

Here's the summary for 1 processor. Seems like it's also taking a long time... Can someone tell me where my mistakes possibly lie? Thank you very much!

*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***

---------------- PETSc Performance Summary: ----------------

./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed Apr 16 00:39:22 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b

                       Max         Max/Min   Avg         Total
Time (sec):            1.088e+03   1.0       1.088e+03
Objects:               4.300e+01   1.0       4.300e+01
Flops:                 2.658e+11   1.0       2.658e+11   2.658e+11
Flops/sec:             2.444e+08   1.0       2.444e+08   2.444e+08
MPI Messages:          0.000e+00   0.0       0.000e+00   0.000e+00
MPI Message Lengths:   0.000e+00   0.0       0.000e+00   0.000e+00
MPI Reductions:        1.460e+04   1.0

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:  ---- Time ----  ---- Flops ----  --- Messages ---  -- Message Lengths --  -- Reductions --
                    Avg    %Total   Avg     %Total   counts   %Total   Avg          %Total    counts   %Total
 0: Main Stage:  1.0877e+03 100.0%  2.6584e+11 100.0%  0.000e+00 0.0%  0.000e+00 0.0%  1.460e+04 100.0%

See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
  Count: number of times phase was executed
  Time and Flops/sec: Max - maximum over all processors
                      Ratio - ratio of maximum to minimum over all processors
  Mess: number of messages sent
  Avg. len: average message length
  Reduct: number of global reductions
  Global: entire computation
  Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
    %T - percent time in this phase         %F - percent flops in this phase
    %M - percent messages in this phase     %L - percent message lengths in this phase
    %R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)

  ##########################################################
  #                      WARNING!!!                        #
  #  This code was run without the PreLoadBegin()          #
  #  macros. To get timing results we always recommend     #
  #  preloading. otherwise timing numbers may be           #
  #  meaningless.                                          #
  ##########################################################

Event              Count      Time (sec)   Flops/sec   --- Global ---  --- Stage ---  Total
                   Max Ratio  Max   Ratio  Max  Ratio  Mess  Avg len  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s

--- Event Stage 0: Main Stage

MatMult           7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 11  0  0  0  12 11  0  0  0   216
MatSolve          7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 0.0e+00 25 11  0  0  0  25 11  0  0  0   107
MatLUFactorNum       1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    88
MatILUFactorSym      1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin     1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd       1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ          1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0
Slow speed after changing from serial to parallel
Hi,

I was initially using LU and Hypre to solve my serial code. I switched to the default GMRES when I converted to the parallel code. I've now redone the test using KSPBCGS and also Hypre BoomerAMG. Seems like MatAssemblyBegin, VecAYPX, VecScatterEnd (in bold) are the problems. What should I be checking? Here are the results for 1 and 2 processors for each solver. Thank you so much!

*1 processor KSPBCGS*

*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***

---------------- PETSc Performance Summary: ----------------

./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed Apr 16 08:32:21 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b

                       Max         Max/Min   Avg         Total
Time (sec):            8.176e+01   1.0       8.176e+01
Objects:               2.700e+01   1.0       2.700e+01
Flops:                 1.893e+10   1.0       1.893e+10   1.893e+10
Flops/sec:             2.315e+08   1.0       2.315e+08   2.315e+08
MPI Messages:          0.000e+00   0.0       0.000e+00   0.000e+00
MPI Message Lengths:   0.000e+00   0.0       0.000e+00   0.000e+00
MPI Reductions:        3.743e+03   1.0

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:  ---- Time ----  ---- Flops ----  --- Messages ---  -- Message Lengths --  -- Reductions --
                    Avg    %Total   Avg     %Total   counts   %Total   Avg          %Total    counts   %Total
 0: Main Stage:  8.1756e+01 100.0%  1.8925e+10 100.0%  0.000e+00 0.0%  0.000e+00 0.0%  3.743e+03 100.0%

See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
  Count: number of times phase was executed
  Time and Flops/sec: Max - maximum over all processors
                      Ratio - ratio of maximum to minimum over all processors
  Mess: number of messages sent
  Avg. len: average message length
  Reduct: number of global reductions
  Global: entire computation
  Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
    %T - percent time in this phase         %F - percent flops in this phase
    %M - percent messages in this phase     %L - percent message lengths in this phase
    %R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)

  ##########################################################
  #                      WARNING!!!                        #
  #  This code was run without the PreLoadBegin()          #
  #  macros. To get timing results we always recommend     #
  #  preloading. otherwise timing numbers may be           #
  #  meaningless.                                          #
  ##########################################################

Event              Count      Time (sec)   Flops/sec   --- Global ---  --- Stage ---  Total
                   Max Ratio  Max   Ratio  Max  Ratio  Mess  Avg len  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s

--- Event Stage 0: Main Stage

MatMult           1498 1.0 1.6548e+01 1.0 3.55e+08 1.0 0.0e+00 0.0e+00 0.0e+00 20 31  0  0  0  20 31  0  0  0   355
MatSolve          1500 1.0 3.2228e+01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 0.0e+00 39 31  0  0  0  39 31  0  0  0   183
MatLUFactorNum       2 1.0 2.0642e-01 1.0 1.02e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   102
MatILUFactorSym      2 1.0 2.0250e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin     2 1.0 2.1458e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd       2 1.0 1.7963e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0
Slow speed after changing from serial to parallel (with ex2f.F)
On Wed, 16 Apr 2008, Ben Tay wrote:

Hi Satish, thank you very much for helping me run the ex2f.F code. I think I've a clearer picture now. I believe I'm running on a Dual-Core Intel Xeon 5160. The quad cores are only on atlas3-01 to 04, and there are only 4 of them. I guess that the lower peak is because I'm using the Xeon 5160, while you are using the Xeon X5355.

I'm still a bit puzzled. I just ran the same binary on a 2x dual-core Xeon 5130 machine [which should be similar to your 5160 machine] and get the following:

[balay at n001 ~]$ grep MatMult log*
log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   364
log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   615
log.4:MatMult  969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   656
[balay at n001 ~]$

You mention the speedups for MatMult and compare between KSPSolve runs. Are these the only things we have to look at? Because I see that some other events such as VecMAXPY also take up a sizable % of the time. To get an accurate speedup, do I just compare the time taken by KSPSolve between different numbers of processors, or do I have to look at other events such as MatMult as well?

Sometimes we look at individual components like MatMult() and VecMAXPY() to understand what's happening in each stage - and at KSPSolve() to look at the aggregate performance for the whole solve [which includes MatMult, VecMAXPY etc.]. Perhaps I should have also looked at VecMDot() as well - at 48% of runtime it's the biggest contributor to KSPSolve() for your run. It's easy to get lost in the details of log_summary. Looking for anomalies is one thing; plotting scalability charts for the solver is something else. [A sketch of isolating the solve in its own logging stage follows this message.]

In summary, due to load imbalance, my speedup is quite bad. So maybe I'll just send your results to my school's engineers and see if they can do anything. For my part, I guess I'll just have to wait?

Yes - load imbalance at the MatMult level is bad. On the 4-proc run you have ratio = 3.6. This implies that one of the MPI tasks is 3.6 times slower than the others [so all speedup is lost there]. You could try the latest mpich2 [1.0.7] - just for this SMP experiment - and see if it makes a difference. I've built mpich2 with [default gcc/gfortran and]:

  ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

There could be something else going on on this machine that's messing up load balance for a basic PETSc example.

Satish
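For readers who want the per-phase view discussed above: wrapping the solve in its own logging stage makes -log_summary report it apart from setup and assembly. A minimal sketch in C with current PETSc signatures (older 2.3.x calls differ slightly, e.g. KSPSetOperators took a matrix-structure flag; the Fortran bindings are analogous, and the trivial 2*I system is only a placeholder):

  #include <petscksp.h>

  int main(int argc, char **argv)
  {
    Mat           A;
    Vec           x, b;
    KSP           ksp;
    PetscInt      i, Istart, Iend, n = 1000;
    PetscScalar   two = 2.0;
    PetscLogStage stage;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* placeholder system: 2*I x = b */
    MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 1, NULL, 0, NULL, &A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; i++) MatSetValues(A, 1, &i, 1, &i, &two, INSERT_VALUES);
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);            /* choose solver with -ksp_type, -pc_type */

    PetscLogStageRegister("Solve", &stage);
    PetscLogStagePush(stage);          /* everything until the pop is logged as its own stage */
    KSPSolve(ksp, b, x);
    PetscLogStagePop();

    KSPDestroy(&ksp); VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
    PetscFinalize();
    return 0;
  }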
Slow speed after changing from serial to parallel (with ex2f.F)
On Wed, Apr 16, 2008 at 8:44 AM, Ben Tay zonexo at gmail.com wrote:

Hi, am I right to say that despite all the hype about multi-core processors, they can't speed up the solving of linear equations? It's not possible to get a 2x speedup when using 2 cores. And is this true for all types of linear equation solvers besides PETSc? What about parallel direct solvers (e.g. MUMPS) or those which use OpenMP instead of MPI? Well, I just can't help feeling disappointed if that's the case...

Notice that Satish got much, much better scaling than you did on our box here. I think something is really wrong, either with the installation of MPI on that box or something hardware-wise.

Matt

Also, with a smart enough LSF scheduler, I will be assured of getting separate processors, i.e. 1 core from each different processor, instead of 2-4 cores from just 1 processor. In that case, if I use 1 core from processor A and 1 core from processor B, I should be able to get a decent speedup of more than 1, is that so? This option is also better than using 2 or even 4 cores from the same processor. Thank you very much.

Satish Balay wrote:

On Wed, 16 Apr 2008, Ben Tay wrote:

Hi Satish, thank you very much for helping me run the ex2f.F code. I think I've a clearer picture now. I believe I'm running on a Dual-Core Intel Xeon 5160. The quad cores are only on atlas3-01 to 04, and there are only 4 of them. I guess that the lower peak is because I'm using the Xeon 5160, while you are using the Xeon X5355.

I'm still a bit puzzled. I just ran the same binary on a 2x dual-core Xeon 5130 machine [which should be similar to your 5160 machine] and get the following:

[balay at n001 ~]$ grep MatMult log*
log.1:MatMult 1192 1.0 1.0591e+01 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   364
log.2:MatMult 1217 1.0 6.3982e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   615
log.4:MatMult  969 1.0 4.7780e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   656
[balay at n001 ~]$

You mention the speedups for MatMult and compare between KSPSolve runs. Are these the only things we have to look at? Because I see that some other events such as VecMAXPY also take up a sizable % of the time. To get an accurate speedup, do I just compare the time taken by KSPSolve between different numbers of processors, or do I have to look at other events such as MatMult as well?

Sometimes we look at individual components like MatMult() and VecMAXPY() to understand what's happening in each stage - and at KSPSolve() to look at the aggregate performance for the whole solve [which includes MatMult, VecMAXPY etc.]. Perhaps I should have also looked at VecMDot() as well - at 48% of runtime it's the biggest contributor to KSPSolve() for your run. It's easy to get lost in the details of log_summary. Looking for anomalies is one thing; plotting scalability charts for the solver is something else.

In summary, due to load imbalance, my speedup is quite bad. So maybe I'll just send your results to my school's engineers and see if they can do anything. For my part, I guess I'll just have to wait?

Yes - load imbalance at the MatMult level is bad. On the 4-proc run you have ratio = 3.6. This implies that one of the MPI tasks is 3.6 times slower than the others [so all speedup is lost there]. You could try the latest mpich2 [1.0.7] - just for this SMP experiment - and see if it makes a difference.
I've built mpich2 with [default gcc/gfortran and]:

  ./configure --with-device=ch3:nemesis:newtcp -with-pm=gforker

There could be something else going on on this machine that's messing up load balance for a basic PETSc example.

Satish

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
Slow speed after changing from serial to parallel
1) Please never cut out parts of the summary. All the information is valuable, and most times, necessary.

2) You seem to have a huge load imbalance (look at VecNorm). Do you partition the system yourself? How many processes is this?

3) You seem to be setting a huge number of off-process values in the matrix (see MatAssemblyBegin). Is this true? I would reorganize this part [a sketch of owner-only assembly follows this message].

Matt

On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay zonexo at gmail.com wrote:

Hi, I have converted the poisson eqn part of the CFD code to parallel. The grid size tested is 600x720. For the momentum eqn, I used another serial linear solver (nspcg) to prevent mixing of results. Here's the output summary:

--- Event Stage 0: Main Stage

MatMult           8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 0.0e+00 10 11100100  0  10 11100100  0   217
MatSolve          8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 0.0e+00 17 11  0  0  0  17 11  0  0  0   120
MatLUFactorNum       1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   140
MatILUFactorSym      1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
*MatAssemblyBegin    1 1.0 5.6334e+01 853005.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  3  0  0  0  0   3  0  0  0  0     0*
MatAssemblyEnd       1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 7.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ          1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering       1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries       1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPGMRESOrthog    8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 8.5e+03 50 72  0  0 49  50 72  0  0 49   363
KSPSetup             2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve             1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 1.7e+04 89100100100100  89100100100100   317
PCSetUp              2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
PCSetUpOnBlocks      1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 3.0e+00  0  0  0  0  0   0  0  0  0  0    69
PCApply           8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 0.0e+00 18 11  0  0  0  18 11  0  0  0   114
VecMDot           8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 8.5e+03 35 36  0  0 49  35 36  0  0 49   213
*VecNorm           8777 1.0 1.8237e+02 10.2 2.13e+08 10.2 0.0e+00 0.0e+00 8.8e+03  9  2  0  0 51   9  2  0  0 51    42*
*VecScale          8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0   636*
VecCopy            284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet            9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecAXPY            567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   346
VecMAXPY          8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 0.0e+00 16 38  0  0  0  16 38  0  0  0   453
VecAssemblyBegin     2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd       2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
*VecScatterBegin   8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 0.0e+00  0  0100100  0   0  0100100  0     0*
*VecScatterEnd     8776 1.0 1.7747e+01 30.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0*
*VecNormalize      8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 8.8e+03  9  4  0  0 51   9  4  0  0 51    62*

Memory usage is given in bytes:

Object Type        Creations   Destructions   Memory   Descendants' Mem.
--- Event Stage 0: Main Stage

Matrix              4    4    49227380   0
Krylov Solver       2    2    17216      0
Preconditioner      2    2    256        0
Index Set           5    5    2596120    0
Vec                 40   40   62243224   0
Vec Scatter         1    1    0          0

Average time to get PetscTime(): 4.05312e-07
Average time for MPI_Barrier(): 7.62939e-07
Average time for zero size MPI_Send(): 2.02656e-06
OptionTable: -log_summary

The PETSc manual states that ratio
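On Matt's point 3 above: the usual fix is for each process to insert only (or mostly) the rows it owns, so MatAssemblyBegin() has almost nothing to communicate. A sketch in C with current PETSc signatures, using an illustrative 5-point Laplacian on the 600x720 grid mentioned above (this is the pattern the PETSc tutorial examples use, not the poster's actual code):

  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    Mat         A;
    PetscInt    i, j, row, col, Istart, Iend, m = 600, n = 720;
    PetscScalar v;

    PetscInitialize(&argc, &argv, NULL, NULL);

    MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, m * n, m * n,
                 5, NULL, 2, NULL, &A);

    /* Ask which rows this process owns and fill only those: nearly all
       inserted values are then local, so assembly ships (almost) no data. */
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (row = Istart; row < Iend; row++) {
      i = row / n; j = row % n;                     /* grid point (i,j) */
      v = -1.0;
      if (i > 0)     { col = row - n; MatSetValues(A, 1, &row, 1, &col, &v, INSERT_VALUES); }
      if (i < m - 1) { col = row + n; MatSetValues(A, 1, &row, 1, &col, &v, INSERT_VALUES); }
      if (j > 0)     { col = row - 1; MatSetValues(A, 1, &row, 1, &col, &v, INSERT_VALUES); }
      if (j < n - 1) { col = row + 1; MatSetValues(A, 1, &row, 1, &col, &v, INSERT_VALUES); }
      v = 4.0;
      MatSetValues(A, 1, &row, 1, &row, &v, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatDestroy(&A);
    PetscFinalize();
    return 0;
  }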
Slow speed after changing from serial to parallel (with ex2f.F)
On Tue, Apr 15, 2008 at 9:08 PM, Ben Tay zonexo at gmail.com wrote:

Hi, I just tested the ex2f.F example, changing m and n to 600. Here are the results for 1, 2 and 4 processors. Interestingly, MatAssemblyBegin, MatGetOrdering and KSPSetup have ratios > 1. The time taken gets shorter as the number of processors increases, although the speedup is not 1:1. I thought that this example should scale well, shouldn't it? Is there something wrong with my installation then?

1) Notice that the events that are unbalanced take 0.01% of the time. Not important.

2) The speedup really stinks, even though this is a small problem. Are you sure that you are actually running on two processors with separate memory pipes and not on 1 dual core? [A quick host-per-rank check follows this message.]

Matt

Thank you.

1 processor:

Norm of error 0.3371E+01 iterations 1153

*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***

---------------- PETSc Performance Summary: ----------------

./a.out on a atlas3-mp named atlas3-c58 with 1 processor, by g0306332 Wed Apr 16 10:03:12 2008
Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b

                       Max         Max/Min   Avg         Total
Time (sec):            1.222e+02   1.0       1.222e+02
Objects:               4.400e+01   1.0       4.400e+01
Flops:                 3.547e+10   1.0       3.547e+10   3.547e+10
Flops/sec:             2.903e+08   1.0       2.903e+08   2.903e+08
MPI Messages:          0.000e+00   0.0       0.000e+00   0.000e+00
MPI Message Lengths:   0.000e+00   0.0       0.000e+00   0.000e+00
MPI Reductions:        2.349e+03   1.0

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flops and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:  ---- Time ----  ---- Flops ----  --- Messages ---  -- Message Lengths --  -- Reductions --
                    Avg    %Total   Avg     %Total   counts   %Total   Avg          %Total    counts   %Total
 0: Main Stage:  1.2216e+02 100.0%  3.5466e+10 100.0%  0.000e+00 0.0%  0.000e+00 0.0%  2.349e+03 100.0%

See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
  Count: number of times phase was executed
  Time and Flops/sec: Max - maximum over all processors
                      Ratio - ratio of maximum to minimum over all processors
  Mess: number of messages sent
  Avg. len: average message length
  Reduct: number of global reductions
  Global: entire computation
  Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
    %T - percent time in this phase         %F - percent flops in this phase
    %M - percent messages in this phase     %L - percent message lengths in this phase
    %R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)

  ##########################################################
  #                      WARNING!!!                        #
  #  This code was run without the PreLoadBegin()          #
  #  macros. To get timing results we always recommend     #
  #  preloading. otherwise timing numbers may be           #
  #  meaningless.                                          #
  ##########################################################

Event              Count      Time (sec)   Flops/sec   --- Global ---  --- Stage ---  Total
                   Max Ratio  Max   Ratio  Max  Ratio  Mess  Avg len  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s

--- Event Stage 0: Main Stage

MatMult           1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 11  0  0  0  13 11  0  0  0   239
MatSolve          1192 1.0 3.1017e+01 1.0 1.24e+08 1.0 0.0e+00 0.0e+00 0.0e+00 25 11
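One quick way to answer Matt's placement question (a sketch using only standard MPI calls; it reports the host each rank lands on, though not the socket or core within a host):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
    char name[MPI_MAX_PROCESSOR_NAME];
    int  len, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);
    printf("rank %d runs on host %s\n", rank, name);   /* same host for all ranks => one box */
    MPI_Finalize();
    return 0;
  }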
Slow speed after changing from serial to parallel (with ex2f.F)
On Wed, 16 Apr 2008, Ben Tay wrote:

I think you may be right. My school uses:

  No. of Nodes   Processors                   Qty per node   Total cores per node   Memory per node
  4              Quad-Core Intel Xeon X5355   2              8                      16 GB
  60             Dual-Core Intel Xeon 5160    2              4                      8 GB

I've attempted to run the same ex2f on a 2x quad-core Intel Xeon X5355 machine [with gcc / latest mpich2 with --with-device=ch3:nemesis:newtcp] - and I get the following. Logs for my run are attached.

asterix:/home/balay/download-pine> grep MatMult *
ex2f-600-1p.log:MatMult 1192 1.0 9.7109e+00 1.0 3.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 14 11  0  0  0  14 11  0  0  0   397
ex2f-600-2p.log:MatMult 1217 1.0 6.2256e+00 1.0 1.97e+09 1.0 2.4e+03 4.8e+03 0.0e+00 14 11100100  0  14 11100100  0   632
ex2f-600-4p.log:MatMult  969 1.0 4.3311e+00 1.0 7.84e+08 1.0 5.8e+03 4.8e+03 0.0e+00 15 11100100  0  15 11100100  0   724
ex2f-600-8p.log:MatMult 1318 1.0 5.6966e+00 1.0 5.33e+08 1.0 1.8e+04 4.8e+03 0.0e+00 16 11100100  0  16 11100100  0   749
asterix:/home/balay/download-pine> grep KSPSolve *
ex2f-600-1p.log:KSPSolve 1 1.0 6.9165e+01 1.0 3.55e+10 1.0 0.0e+00 0.0e+00 2.3e+03 100100  0  0 100  100100  0  0 100   513
ex2f-600-2p.log:KSPSolve 1 1.0 4.4005e+01 1.0 1.81e+10 1.0 2.4e+03 4.8e+03 2.4e+03 100100100100100  100100100100100   824
ex2f-600-4p.log:KSPSolve 1 1.0 2.8139e+01 1.0 7.21e+09 1.0 5.8e+03 4.8e+03 1.9e+03 100100100100 99  100100100100 99  1024
ex2f-600-8p.log:KSPSolve 1 1.0 3.6260e+01 1.0 4.90e+09 1.0 1.8e+04 4.8e+03 2.6e+03 100100100100100  100100100100100  1081
asterix:/home/balay/download-pine>

You get the following [with intel compilers?]:

asterix:/home/balay/download-pine/x> grep MatMult *
log.1:MatMult 1192 1.0 1.6115e+01 1.0 2.39e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 11  0  0  0  13 11  0  0  0   239
log.2:MatMult 1217 1.0 1.2502e+01 1.2 1.88e+08 1.2 2.4e+03 4.8e+03 0.0e+00 11 11100100  0  11 11100100  0   315
log.4:MatMult  969 1.0 9.7564e+00 3.6 2.87e+08 3.6 5.8e+03 4.8e+03 0.0e+00  8 11100100  0   8 11100100  0   321
asterix:/home/balay/download-pine/x> grep KSPSolve *
log.1:KSPSolve 1 1.0 1.2159e+02 1.0 2.92e+08 1.0 0.0e+00 0.0e+00 2.3e+03 100100  0  0 100  100100  0  0 100   292
log.2:KSPSolve 1 1.0 1.0289e+02 1.0 1.76e+08 1.0 2.4e+03 4.8e+03 2.4e+03  99100100100100   99100100100100   352
log.4:KSPSolve 1 1.0 6.2496e+01 1.0 1.15e+08 1.0 5.8e+03 4.8e+03 1.9e+03  98100100100 99   98100100100 99   461
asterix:/home/balay/download-pine/x>

What exact CPU was this run on? A couple of comments:

- My runs for MatMult have a 1.0 ratio for the 2, 4 and 8 proc runs, while yours have 1.2 and 3.6 for the 2 and 4 proc runs [so there is higher load imbalance on your machine].

- The peaks are also lower - not sure why: 397 for the 1p MatMult for me - vs 239 for you.

- Speedups I see for MatMult are:

    np   me     you
    2    1.59   1.32
    4    1.82   1.34
    8    1.88   --

The primary issue is expecting a speedup of 4 from 4 cores and 8 from 8 cores. As Matt indicated, perhaps in the "general question on speed using quad core Xeons" thread, for sparse linear algebra the performance is limited by memory bandwidth - not CPU [a simple bandwidth test sketch follows this message]. So one has to look at the hardware memory architecture of the machine if you expect scalability. The 2x quad-core has a memory architecture that gives 11 GB/s if one CPU socket is used, but 22 GB/s when both CPU sockets are used [irrespective of the number of cores in each CPU socket]. One inference is that a max speedup of 2 can be obtained from such a machine [due to the 2-memory-bank architecture]. So if you have 2 such machines [i.e. 4 memory banks] - then you can expect a theoretical max speedup of 4. We are generally used to evaluating performance per CPU [or core].
Here the scalability numbers suck. However, if you look at performance per number-of-memory-banks - then things look better. It's just that we are used to expecting scalability per node and assume it translates to scalability per core. [However, scalability per node was really about scalability per memory bank - before multicore CPUs took over.] There is also another measure - performance per dollar spent. Generally the extra cores are practically free - so this measure also holds up OK.

Satish

-------------- next part --------------
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***

---------------- PETSc Performance Summary: ----------------

./ex2f on a linux-tes named intel-loaner1 with 1 processor, by balay Tue Apr 15 22:02:38 2008
Using Petsc Development Version
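The memory-bandwidth argument above is easy to test directly. Below is a crude STREAM-style triad in C (a sketch, not the official STREAM benchmark; the array size and trial count are arbitrary): run one copy, then one copy per socket, and see whether the reported per-process bandwidth holds up when the second memory bank comes into play.

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  #define N (20 * 1000 * 1000)   /* 3 arrays x 160 MB: far larger than any cache */

  int main(int argc, char **argv)
  {
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b), *c = malloc(N * sizeof *c);
    double  t, best = 1e30;
    int     i, rep, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    for (rep = 0; rep < 10; rep++) {                       /* best of 10 trials */
      MPI_Barrier(MPI_COMM_WORLD);
      t = MPI_Wtime();
      for (i = 0; i < N; i++) a[i] = b[i] + 3.0 * c[i];    /* ~24 bytes of traffic per i */
      t = MPI_Wtime() - t;
      if (t < best) best = t;
    }
    printf("[%d] triad bandwidth: %.2f GB/s per process\n", rank, 24.0 * N / best / 1e9);

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
  }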
Slow speed after changing from serial to parallel
Thank you Matthew. Sorry to trouble you again. I tried to run it with -log_summary output and I found that there are some errors in the execution. Well, I was busy with other things and I just came back to this problem. Some of my files on the server have also been deleted. It has been a while, and I remember that it worked before, only much slower.

Anyway, most of the serial code has been updated, and maybe it's easier to convert the new serial code instead of debugging the old parallel code now. I believe I can still reuse part of the old parallel code. However, I hope I can approach it better this time. So suppose I need to start converting my new serial code to parallel. There are 2 eqns to be solved using PETSc, the momentum and poisson. I also need to parallelize other parts of my code. I wonder which route is the best:

1. Don't change the PETSc part, i.e. continue using PETSC_COMM_SELF, and modify the other parts of my code to parallel, e.g. looping, updating of values etc. Once the execution is fine and the speedup is reasonable, then modify the PETSc part - poisson eqn first, followed by the momentum eqn.

2. Reverse the above order, i.e. modify the PETSc part - poisson eqn first, followed by the momentum eqn. Then do the other parts of my code.

I'm not sure if the above 2 methods can work, or if there will be conflicts. Of course, an alternative would be:

3. Do the poisson and momentum eqns and the other parts of the code separately. That is, code a standalone parallel poisson eqn and use sample values to test it. Same for the momentum eqn and the other parts of the code. When each of them is working, combine them to form the full parallel code. However, this will be much more troublesome.

I hope someone can give me some recommendations. Thank you once again.

Matthew Knepley wrote:

1) There is no way to have any idea what is going on in your code without -log_summary output.

2) Looking at that output, look at the percentage taken by the solver KSPSolve event. I suspect it is not the biggest component, because it is very scalable.

Matt

On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay zonexo at gmail.com wrote:

Hi, I've a serial 2D CFD code. As my grid size requirement increases, the simulation takes longer. Also, the memory requirement becomes a problem. The grid size has reached 1200x1200; going higher is not possible due to memory problems. I tried to convert my code to a parallel one, following the examples given. I also needed to restructure parts of my code to enable parallel looping. I first changed the PETSc solver to be parallel enabled, and then I restructured parts of my code. I proceeded as long as the answer for a simple test case was correct. I thought it was not really possible to do any speed testing since the code was not fully parallelized yet. When I finished most of the conversion, I found in the actual run that it is much slower, although the answer is correct. So what is the remedy now? I wonder what I should do to check what's wrong. Must I restart everything again? Btw, my grid size is 1200x1200. I believe it should be suitable for a parallel run on 4 processors? Is that so? Thank you.
Slow speed after changing from serial to parallel
I am not sure why you would ever have two codes. I never do this. PETSc is designed so that you write one code that runs in serial and in parallel. The PETSc part should look identical. To test, run the code you have verified in serial and output PETSc data structures (like Mat and Vec) using a binary viewer. Then run in parallel with the same code, which will output the same structures. Take the two files and write a small verification code that loads both versions and calls MatEqual and VecEqual [a sketch follows this message].

Matt

On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay zonexo at gmail.com wrote:

Thank you Matthew. Sorry to trouble you again. I tried to run it with -log_summary output and I found that there are some errors in the execution. Well, I was busy with other things and I just came back to this problem. Some of my files on the server have also been deleted. It has been a while, and I remember that it worked before, only much slower. Anyway, most of the serial code has been updated, and maybe it's easier to convert the new serial code instead of debugging the old parallel code now. I believe I can still reuse part of the old parallel code. However, I hope I can approach it better this time. So suppose I need to start converting my new serial code to parallel. There are 2 eqns to be solved using PETSc, the momentum and poisson. I also need to parallelize other parts of my code. I wonder which route is the best:

1. Don't change the PETSc part, i.e. continue using PETSC_COMM_SELF, and modify the other parts of my code to parallel, e.g. looping, updating of values etc. Once the execution is fine and the speedup is reasonable, then modify the PETSc part - poisson eqn first, followed by the momentum eqn.

2. Reverse the above order, i.e. modify the PETSc part - poisson eqn first, followed by the momentum eqn. Then do the other parts of my code.

I'm not sure if the above 2 methods can work, or if there will be conflicts. Of course, an alternative would be:

3. Do the poisson and momentum eqns and the other parts of the code separately. That is, code a standalone parallel poisson eqn and use sample values to test it. Same for the momentum eqn and the other parts of the code. When each of them is working, combine them to form the full parallel code. However, this will be much more troublesome.

I hope someone can give me some recommendations. Thank you once again.

Matthew Knepley wrote: 1) There is no way to have any idea what is going on in your code without -log_summary output. 2) Looking at that output, look at the percentage taken by the solver KSPSolve event. I suspect it is not the biggest component, because it is very scalable. Matt

On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay zonexo at gmail.com wrote: Hi, I've a serial 2D CFD code. As my grid size requirement increases, the simulation takes longer. Also, the memory requirement becomes a problem. The grid size has reached 1200x1200; going higher is not possible due to memory problems. I tried to convert my code to a parallel one, following the examples given. I also needed to restructure parts of my code to enable parallel looping. I first changed the PETSc solver to be parallel enabled, and then I restructured parts of my code. I proceeded as long as the answer for a simple test case was correct. I thought it was not really possible to do any speed testing since the code was not fully parallelized yet. When I finished most of the conversion, I found in the actual run that it is much slower, although the answer is correct. So what is the remedy now? I wonder what I should do to check what's wrong. Must I restart everything again?
Btw, my grid size is 1200x1200. I believe it should be suitable for a parallel run on 4 processors? Is that so? Thank you.

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
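A sketch of the verification Matthew describes, in C with current PETSc signatures (in 2.3.x MatLoad took the viewer first and created the matrix, so the calls differ slightly; the file names here are made up). The serial and parallel runs would first each dump their matrix with PetscViewerBinaryOpen(..., FILE_MODE_WRITE, ...) and MatView(); this program then loads and compares the two dumps:

  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    Mat         As, Ap;
    PetscViewer v;
    PetscBool   eq;

    PetscInitialize(&argc, &argv, NULL, NULL);

    MatCreate(PETSC_COMM_WORLD, &As);
    PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A_serial.dat", FILE_MODE_READ, &v);
    MatLoad(As, v);
    PetscViewerDestroy(&v);

    MatCreate(PETSC_COMM_WORLD, &Ap);
    PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A_parallel.dat", FILE_MODE_READ, &v);
    MatLoad(Ap, v);
    PetscViewerDestroy(&v);

    MatEqual(As, Ap, &eq);   /* VecEqual() does the same for dumped right-hand sides */
    PetscPrintf(PETSC_COMM_WORLD, "matrices %s\n", eq ? "match" : "DIFFER");

    MatDestroy(&As); MatDestroy(&Ap);
    PetscFinalize();
    return 0;
  }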
Slow speed after changing from serial to parallel
Hi Matthew,

I think you've misunderstood what I meant. What I'm trying to say is that initially I had a serial code. I tried to convert it to a parallel one. Then I tested it and it was pretty slow. Due to some work requirements, I needed to go back and make some changes to my code. Since the parallel version was not working well, I updated and changed the serial one. Well, that was a while ago, and now, due to the updates and changes, the serial code is different from the old converted parallel code. Some files were also deleted and I can't seem to get it working now. So I thought I might as well convert the new serial code to parallel. But I'm not very sure what I should do first.

Maybe I should rephrase my question: if I just convert my poisson equation subroutine from a serial PETSc to a parallel PETSc version, will it work? Should I expect a speedup? The rest of my code is still serial.

Thank you very much.

Matthew Knepley wrote:

I am not sure why you would ever have two codes. I never do this. PETSc is designed so that you write one code that runs in serial and in parallel. The PETSc part should look identical. To test, run the code you have verified in serial and output PETSc data structures (like Mat and Vec) using a binary viewer. Then run in parallel with the same code, which will output the same structures. Take the two files and write a small verification code that loads both versions and calls MatEqual and VecEqual.

Matt

On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay zonexo at gmail.com wrote:

Thank you Matthew. Sorry to trouble you again. I tried to run it with -log_summary output and I found that there are some errors in the execution. Well, I was busy with other things and I just came back to this problem. Some of my files on the server have also been deleted. It has been a while, and I remember that it worked before, only much slower. Anyway, most of the serial code has been updated, and maybe it's easier to convert the new serial code instead of debugging the old parallel code now. I believe I can still reuse part of the old parallel code. However, I hope I can approach it better this time. So suppose I need to start converting my new serial code to parallel. There are 2 eqns to be solved using PETSc, the momentum and poisson. I also need to parallelize other parts of my code. I wonder which route is the best:

1. Don't change the PETSc part, i.e. continue using PETSC_COMM_SELF, and modify the other parts of my code to parallel, e.g. looping, updating of values etc. Once the execution is fine and the speedup is reasonable, then modify the PETSc part - poisson eqn first, followed by the momentum eqn.

2. Reverse the above order, i.e. modify the PETSc part - poisson eqn first, followed by the momentum eqn. Then do the other parts of my code.

I'm not sure if the above 2 methods can work, or if there will be conflicts. Of course, an alternative would be:

3. Do the poisson and momentum eqns and the other parts of the code separately. That is, code a standalone parallel poisson eqn and use sample values to test it. Same for the momentum eqn and the other parts of the code. When each of them is working, combine them to form the full parallel code. However, this will be much more troublesome.

I hope someone can give me some recommendations. Thank you once again.

Matthew Knepley wrote: 1) There is no way to have any idea what is going on in your code without -log_summary output. 2) Looking at that output, look at the percentage taken by the solver KSPSolve event. I suspect it is not the biggest component, because it is very scalable.
Matt

On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay zonexo at gmail.com wrote:

Hi, I've a serial 2D CFD code. As my grid size requirement increases, the simulation takes longer. Also, the memory requirement becomes a problem. The grid size has reached 1200x1200; going higher is not possible due to memory problems. I tried to convert my code to a parallel one, following the examples given. I also needed to restructure parts of my code to enable parallel looping. I first changed the PETSc solver to be parallel enabled, and then I restructured parts of my code. I proceeded as long as the answer for a simple test case was correct. I thought it was not really possible to do any speed testing since the code was not fully parallelized yet. When I finished most of the conversion, I found in the actual run that it is much slower, although the answer is correct. So what is the remedy now? I wonder what I should do to check what's wrong. Must I restart everything again? Btw, my grid size is 1200x1200. I believe it should be suitable for a parallel run on 4 processors? Is that so? Thank you.
Slow speed after changing from serial to parallel
On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay zonexo at gmail.com wrote:

Hi Matthew, I think you've misunderstood what I meant. What I'm trying to say is that initially I had a serial code. I tried to convert it to a parallel one. Then I tested it and it was pretty slow. Due to some work requirements, I needed to go back and make some changes to my code. Since the parallel version was not working well, I updated and changed the serial one. Well, that was a while ago, and now, due to the updates and changes, the serial code is different from the old converted parallel code. Some files were also deleted and I can't seem to get it working now. So I thought I might as well convert the new serial code to parallel. But I'm not very sure what I should do first. Maybe I should rephrase my question: if I just convert my poisson equation subroutine from a serial PETSc to a parallel PETSc version, will it work? Should I expect a speedup? The rest of my code is still serial.

You should, of course, only expect speedup in the parallel parts [see the note after this message].

Matt

Thank you very much.

Matthew Knepley wrote:

I am not sure why you would ever have two codes. I never do this. PETSc is designed so that you write one code that runs in serial and in parallel. The PETSc part should look identical. To test, run the code you have verified in serial and output PETSc data structures (like Mat and Vec) using a binary viewer. Then run in parallel with the same code, which will output the same structures. Take the two files and write a small verification code that loads both versions and calls MatEqual and VecEqual.

Matt

On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay zonexo at gmail.com wrote:

Thank you Matthew. Sorry to trouble you again. I tried to run it with -log_summary output and I found that there are some errors in the execution. Well, I was busy with other things and I just came back to this problem. Some of my files on the server have also been deleted. It has been a while, and I remember that it worked before, only much slower. Anyway, most of the serial code has been updated, and maybe it's easier to convert the new serial code instead of debugging the old parallel code now. I believe I can still reuse part of the old parallel code. However, I hope I can approach it better this time. So suppose I need to start converting my new serial code to parallel. There are 2 eqns to be solved using PETSc, the momentum and poisson. I also need to parallelize other parts of my code. I wonder which route is the best:

1. Don't change the PETSc part, i.e. continue using PETSC_COMM_SELF, and modify the other parts of my code to parallel, e.g. looping, updating of values etc. Once the execution is fine and the speedup is reasonable, then modify the PETSc part - poisson eqn first, followed by the momentum eqn.

2. Reverse the above order, i.e. modify the PETSc part - poisson eqn first, followed by the momentum eqn. Then do the other parts of my code.

I'm not sure if the above 2 methods can work, or if there will be conflicts. Of course, an alternative would be:

3. Do the poisson and momentum eqns and the other parts of the code separately. That is, code a standalone parallel poisson eqn and use sample values to test it. Same for the momentum eqn and the other parts of the code. When each of them is working, combine them to form the full parallel code. However, this will be much more troublesome.

I hope someone can give me some recommendations. Thank you once again.
Matthew Knepley wrote: 1) There is no way to have any idea what is going on in your code without -log_summary output. 2) Looking at that output, look at the percentage taken by the solver KSPSolve event. I suspect it is not the biggest component, because it is very scalable. Matt

On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay zonexo at gmail.com wrote: Hi, I've a serial 2D CFD code. As my grid size requirement increases, the simulation takes longer. Also, the memory requirement becomes a problem. The grid size has reached 1200x1200; going higher is not possible due to memory problems. I tried to convert my code to a parallel one, following the examples given. I also needed to restructure parts of my code to enable parallel looping. I first changed the PETSc solver to be parallel enabled, and then I restructured parts of my code. I proceeded as long as the answer for a simple test case was correct. I thought it was not really possible to do any speed testing since the code was not fully parallelized yet. When I finished most of the conversion, I found in the actual run that it is much
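To put a number on "speedup only in the parallel parts": if a fraction p of the runtime is parallelized and the rest stays serial, Amdahl's law bounds the overall speedup on N processes:

  S(N) = 1 / ((1 - p) + p/N)

For example, if the Poisson solve is half the runtime (p = 0.5) and scales perfectly on N = 4 processes, the whole code speeds up by only 1 / (0.5 + 0.125) = 1.6, no matter how well the solver itself scales.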
Slow speed after changing from serial to parallel
Hi,

I've a serial 2D CFD code. As my grid size requirement increases, the simulation takes longer. Also, the memory requirement becomes a problem. The grid size has reached 1200x1200; going higher is not possible due to memory problems. I tried to convert my code to a parallel one, following the examples given. I also needed to restructure parts of my code to enable parallel looping. I first changed the PETSc solver to be parallel enabled, and then I restructured parts of my code. I proceeded as long as the answer for a simple test case was correct. I thought it was not really possible to do any speed testing since the code was not fully parallelized yet. When I finished most of the conversion, I found in the actual run that it is much slower, although the answer is correct.

So what is the remedy now? I wonder what I should do to check what's wrong. Must I restart everything again? Btw, my grid size is 1200x1200. I believe it should be suitable for a parallel run on 4 processors? Is that so?

Thank you.
Slow speed after changing from serial to parallel
1) There is no way to have any idea what is going on in your code without -log_summary output.

2) Looking at that output, look at the percentage taken by the solver KSPSolve event. I suspect it is not the biggest component, because it is very scalable.

Matt

On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay zonexo at gmail.com wrote:

Hi, I've a serial 2D CFD code. As my grid size requirement increases, the simulation takes longer. Also, the memory requirement becomes a problem. The grid size has reached 1200x1200; going higher is not possible due to memory problems. I tried to convert my code to a parallel one, following the examples given. I also needed to restructure parts of my code to enable parallel looping. I first changed the PETSc solver to be parallel enabled, and then I restructured parts of my code. I proceeded as long as the answer for a simple test case was correct. I thought it was not really possible to do any speed testing since the code was not fully parallelized yet. When I finished most of the conversion, I found in the actual run that it is much slower, although the answer is correct. So what is the remedy now? I wonder what I should do to check what's wrong. Must I restart everything again? Btw, my grid size is 1200x1200. I believe it should be suitable for a parallel run on 4 processors? Is that so? Thank you.

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener