Re: [Yade-dev] parallel collider - testing needed
Thanks Matthias,

Actually, I don't understand your benchmark results. You are the first one to find no speedup on the colliding part. It seems the results below were not obtained with the parallel collider, since the collider takes exactly the same time for every number of threads. What version is that (displayed at yade startup)?

Bruno

On 16/04/14 17:14, Matthias Frank wrote:
> hi bruno,
>
> i use your first version of the parallel collider for quite a while
> during model development and also calibration. i saw no differences
> between yade-1.07 and your version.
>
> i did some benchmarks with 4 to 16 sandy bridge cores at our bull
> cluster. getting more than 16 cores for openmp applications is quite
> difficult.
> done on an exclusively used 16 core node
>
> [...]
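The point about identical collider times can be checked directly from the InsertionSortCollider rows Matthias posted for 1, 4, 8 and 12 threads. A minimal sketch follows; the helper name and the 1% threshold are assumptions chosen for illustration and are not part of Yade.

```python
# InsertionSortCollider times (microseconds) from the posted timing tables,
# keyed by the number of threads used for the run.
collider_us = {1: 15686671, 4: 15675024, 8: 15682352, 12: 15713441}


def relative_spread(times):
    """Return (max - min) / mean for a collection of timings."""
    values = list(times)
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean


spread = relative_spread(collider_us.values())
# Hypothetical threshold: below 1% the timings are treated as thread-independent.
is_flat = spread < 0.01
print(f"relative spread = {spread:.4%}, flat = {is_flat}")
```

A spread this small across 1 to 12 threads is what one would expect from a sequential collider, which is why the Yade version matters here.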
Re: [Yade-dev] parallel collider - testing needed
hi bruno,

i use your first version of the parallel collider for quite a while during model development and also calibration. i saw no differences between yade-1.07 and your version.

i did some benchmarks with 4 to 16 sandy bridge cores at our bull cluster. getting more than 16 cores for openmp applications is quite difficult.
done on an exclusively used 16 core node

=== 1 threads =
number of bodies 200813

Elapsed 47.6222550869 sec
Performance 4.19971712039 iter/sec
Extrapolation on 1e5 iters 6.6142020954 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                       Count            Time    Rel. time
---
ForceResetter                200        594120us        1.25%
InsertionSortCollider          7      15686671us       32.95%
InteractionLoop              200      21787610us       45.76%
NewtonIntegrator             200       9541243us       20.04%
TOTAL                                 47609645us      100.00%

Common time 1383.60180092 s

5037 spheres, velocity= 103.875852973 +- 6.56561134015 %
25103 spheres, velocity= 31.681069095 +- 3.69992939292 %
50250 spheres, velocity= 15.6112167455 +- 0.651579666153 %
100467 spheres, velocity= 7.65955209926 +- 0.740064173207 %
Calculation velocity is unstable, try to close all programs and start performance tests again
200813 spheres, velocity= 4.52368811131 +- 12.3907756519 %

SCORE: 6055
Number of threads 1

=== 4 threads =
number of bodies 200813

Elapsed 29.6409780979 sec
Performance 6.7474156669 iter/sec
Extrapolation on 1e5 iters 4.1168025136 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                       Count            Time    Rel. time
---
ForceResetter                200       2919976us        9.85%
InsertionSortCollider          7      15675024us       52.89%
InteractionLoop              200       5309648us       17.92%
NewtonIntegrator             200       5730646us       19.34%
TOTAL                                 29635295us      100.00%

Common time 641.693111897 s

Calculation velocity is unstable, try to close all programs and start performance tests again
5037 spheres, velocity= 232.725838879 +- 14.3014472878 %
Calculation velocity is unstable, try to close all programs and start performance tests again
25103 spheres, velocity= 72.3475644141 +- 12.8106054968 %
50250 spheres, velocity= 50.2926096116 +- 3.01250915287 %
100467 spheres, velocity= 18.9664279425 +- 1.40241049531 %
200813 spheres, velocity= 6.95879166249 +- 2.72955035307 %

SCORE: 13080
Number of threads 4

=== 8 threads =
number of bodies 200813

Elapsed 28.8497908115 sec
Performance 6.9324592787 iter/sec
Extrapolation on 1e5 iters 4.00691539049 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                       Count            Time    Rel. time
---
ForceResetter                200       4760739us       16.51%
InsertionSortCollider          7      15682352us       54.38%
InteractionLoop              200       3398981us       11.79%
NewtonIntegrator             200       4997676us       17.33%
TOTAL                                 28839750us      100.00%

Common time 629.34264183 s

Calculation velocity is unstable, try to close all programs and start performance tests again
5037 spheres, velocity= 242.232297207 +- 18.7054194438 %
25103 spheres, velocity= 78.2112705997 +- 4.19360243937 %
50250 spheres, velocity= 46.6877664726 +- 2.81481812835 %
100467 spheres, velocity= 19.9932164704 +- 3.06039659404 %
200813 spheres, velocity= 6.92396036557 +- 0.361116951928 %

SCORE: 13272
Number of threads 8

=== 12 threads =
number of bodies 200813

Elapsed 29.2484679222 sec
Performance 6.83796500151 iter/sec
Extrapolation on 1e5 iters 4.06228721142 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                       Count            Time    Rel. time
---
ForceResetter                200       7943958us       27.17%
InsertionSortCollider          7      15713441us       53.75%
InteractionLoop              200       2522508us        8.63%
NewtonIntegrator             200       3055652u
Re: [Yade-dev] parallel collider - testing needed
On 10/04/14 02:01, Klaus Thoeni wrote:
> just to clarify, Test 2 is done by increasing the number of iterations
> (1x, 3x and 12x the number of iterations specified in checkPerf.py).
> This means the number of interactions should increase as well and,
> hence, particle velocities should decrease because of more interactions.

That is what I was thinking. And more interactions means less (relative) time spent in the collider.

> I added a table with the collider scaling factor for 1 million
> particles and iter x 12.

Thanks! So there is still an optimum near 12-14. It may be possible to improve it (by choosing appropriate chunk sizes internally), but that needs serious testing.

> Note your T(j8)=T(j1)/5.8 is actually T(j8)=T(j1)/4.8. Where did you
> get the number from? You must look into the uploaded files in order to
> get these numbers

I used the x1 line since I was not expecting the number of steps to have any influence on the collider's performance: 187/20=5.8
Now I see it is different with other lines. Weird.

Bruno

___
Mailing list: https://launchpad.net/~yade-dev
Post to     : yade-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yade-dev
More help   : https://help.launchpad.net/ListHelp
Re: [Yade-dev] parallel collider - testing needed
Hi Bruno,

just to clarify, Test 2 is done by increasing the number of iterations (1x, 3x and 12x the number of iterations specified in checkPerf.py). This means the number of interactions should increase as well and, hence, particle velocities should decrease because of more interactions.

I added a table with the collider scaling factor for 1 million particles and iter x 12.

Note your T(j8)=T(j1)/5.8 is actually T(j8)=T(j1)/4.8. Where did you get the number from? You must look into the uploaded files in order to get these numbers ;-)

Cheers
Klaus

On Wednesday 09 April 2014 14:58:19 Bruno Chareyre wrote:
> Thanks!
> If I understand correctly, particle velocities are decreasing with
> iterations. So, more iterations means less weight for the collider
> overall (hence less effect of parallelizing it).
> From your results with 1 million particles, I see for the collider
> T(j8)=T(j1)/5.8.
> Could you tell if the collider time alone still decreases with j>8 for
> 1 million particles?
>
> Bruno
>
> On 09/04/14 14:32, Klaus Thoeni wrote:
> > Hi guys,
> >
> > just to let you know. I updated the results on the wiki [1]. Still
> > performance test but with more iterations and up to 1 million particles.
> >
> > Cheers,
> > Klaus
> >
> > [1] https://yade-dem.org/wiki/Performance_Test#Test_2
Re: [Yade-dev] parallel collider - testing needed
Hi guys,

just to let you know: I updated the results on the wiki [1]. It is still the performance test, but with more iterations and up to 1 million particles.

Cheers,
Klaus

[1] https://yade-dem.org/wiki/Performance_Test#Test_2
Re: [Yade-dev] parallel collider - testing needed
Thanks!
If I understand correctly, particle velocities are decreasing with iterations. So, more iterations means less weight for the collider overall (hence less effect of parallelizing it).
From your results with 1 million particles, I see for the collider T(j8)=T(j1)/5.8.
Could you tell if the collider time alone still decreases with j>8 for 1 million particles?

Bruno

On 09/04/14 14:32, Klaus Thoeni wrote:
> Hi guys,
>
> just to let you know. I updated the results on the wiki [1]. Still
> performance test but with more iterations and up to 1 million particles.
>
> Cheers,
> Klaus
>
> [1] https://yade-dem.org/wiki/Performance_Test#Test_2

--
Bruno Chareyre
Associate Professor
ENSE³ - Grenoble INP
Lab. 3SR
BP 53
38041 Grenoble cedex 9
Tél : +33 4 56 52 86 21
Fax : +33 4 76 82 70 43
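The remark that a lighter collider share reduces the benefit of parallelizing it is the usual Amdahl-type argument. A minimal sketch, assuming that only the collider gets faster and everything else is unchanged; the function name and the sample fractions are illustrative, not measured values from this thread.

```python
def overall_speedup(collider_fraction, collider_speedup):
    """Amdahl-style estimate: only the collider part of the run gets faster."""
    if not 0.0 <= collider_fraction <= 1.0:
        raise ValueError("collider_fraction must be in [0, 1]")
    if collider_speedup <= 0.0:
        raise ValueError("collider_speedup must be positive")
    other_part = 1.0 - collider_fraction
    return 1.0 / (other_part + collider_fraction / collider_speedup)


# Illustrative values only: same collider speedup, decreasing collider weight.
for fraction in (0.64, 0.33, 0.10):
    print(fraction, round(overall_speedup(fraction, 4.0), 2))
```

With a fixed collider speedup, the overall gain shrinks toward 1 as the collider's share of the run time drops, which matches the expectation for runs with more iterations.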
Re: [Yade-dev] parallel collider - testing needed
2014-03-31 10:29 GMT+02:00 Bruno Chareyre:
>> I think we can include this code in the master branch in git.
>> Let's check the code more closely and merge it.
> For me the code is in its final version and ready to merge if nobody
> finds bugs (at least you could run your QS simulation without a crash -
> the good part!).
> But if someone wants to review it, that is never bad.

I have done some more tests (sorry, again without timings) and I do not see any problems with the code. So I think we can safely merge it.

Thanks for your work on this

Anton
Re: [Yade-dev] parallel collider - testing needed
Hi guys,

I ran some dynamic tests with my mesh too (some time ago, but I forgot to check). The implementation is fine and the speedup is only about 6-8%. However, the simulation has just about 30 particles.

I even have more results for the performance check (with 1 Mio particles), which I will put on the wiki at some stage (if I find time to analyse the results :-)). But one thing I can tell you: the maximum scaling I get for 1 Mio particles is about 3-4.

Cheers,
Klaus

On Monday 31 March 2014 10:29:03 Bruno Chareyre wrote:
> > I have tested this version of the collider and got a speedup of
> > about 5..10% with 2..6 cores. But these were quasi-static
> > simulations, so the contact list is not updated very often.
>
> Thanks Anton for the feedback. Testing in quasi-static cases is indeed
> not very interesting.
> Or, in that case, it needs to report the collider's timing, not the
> wall clock time of yade as a whole.
>
> > I think we can include this code in the master branch in git.
> > Let's check the code more closely and merge it.
>
> For me the code is in its final version and ready to merge if nobody
> finds bugs (at least you could run your QS simulation without a crash -
> the good part!).
> But if someone wants to review it, that is never bad.
>
> Bruno
Re: [Yade-dev] parallel collider - testing needed
> I have tested this version of the collider and got a speedup of
> about 5..10% with 2..6 cores. But these were quasi-static
> simulations, so the contact list is not updated very often.

Thanks Anton for the feedback. Testing in quasi-static cases is indeed not very interesting.
Or, in that case, it needs to report the collider's timing, not the wall clock time of yade as a whole.

> I think we can include this code in the master branch in git.
> Let's check the code more closely and merge it.

For me the code is in its final version and ready to merge if nobody finds bugs (at least you could run your QS simulation without a crash - the good part!).
But if someone wants to review it, that is never bad.

Bruno
Re: [Yade-dev] parallel collider - testing needed
Hi Bruno,

I have tested this version of the collider and got a speedup of about 5..10% with 2..6 cores. But these were quasi-static simulations, so the contact list is not updated very often.

I think we can include this code in the master branch in git. Let's check the code more closely and merge it.

Thank you!

Anton

2014-02-24 16:36 GMT+01:00 Bruno Chareyre:
> Hi there,
> I implemented a parallel version of the InsertionSortCollider. It is
> almost ready but not yet pushed to the main trunk, as I have a few
> things to check before that.
> It would be helpful if some of you could 1/ test that your scripts work
> correctly and 2/ benchmark this for N>100k and j>4.
> If you run benchmarks, please remember to always activate timing and
> report the result of timing.stats(). It gives much more interesting data
> than the wall clock time.
>
> Preliminary benchmark results are below (from my laptop...), showing a
> speedup by a factor 2 on the total computation time for j4/200k
> particles (compared to the sequential collider).
> The speedup on the collider alone is in fact of the order of x3.68 for 4
> threads. Nearly linear, at least for such a small number of threads.
>
> My expectation is that it should change almost nothing for a small number
> of particles (say, N<10k), where colliding is an inexpensive step.
> For 1 million particles OTOH, there could be a significant speedup,
> since the collider takes most of the time.
>
> You can get the "pc" branch at my github repo:
> git clone -b pc https://github.com/bchareyre/trunk.git
>
> Results of yade -j4 --performance are below (I7 quad-core with
> hyperthreading enabled, lightly loaded by background tasks - j>4 not
> reported as hyperthreading is probably doing no good).
>
> Happy benchmarking. :)
>
> Bruno
>
>
> ./yade-trunk -j4 --performance (the current trunk)
> ...
> number of bodies 200813
>
> Elapsed 29.4102840424 sec
> Performance 6.80034234664 iter/sec
> Extrapolation on 1e5 iters 4.08476167255 hours
> =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
> Name                       Count            Time    Rel. time
> ---
> ForceResetter                200        700881us        2.38%
> InsertionSortCollider          7      18816625us       64.02%
> InteractionLoop              200       6581283us       22.39%
> NewtonIntegrator             200       3293119us       11.20%
> TOTAL                                 29391910us      100.00%
>
> Common time 597.731503963 s
>
> 5037 spheres, velocity= 327.689688709 +- 5.13604387635 %
> 25103 spheres, velocity= 81.2726909754 +- 1.0105334405 %
> 50250 spheres, velocity= 45.4114521341 +- 3.02333274436 %
> 100467 spheres, velocity= 19.0287424005 +- 2.26073439157 %
> 200813 spheres, velocity= 6.51664351023 +- 4.03351515402 %
>
> SCORE: 13777
> Number of threads 4
>
>
> ./yade-parallel -j4 --performance (my "pc" branch)
>
> number of bodies 200813
>
> Elapsed 15.4320101738 sec
> Performance 12.9600744004 iter/sec
> Extrapolation on 1e5 iters 2.14333474636 hours
> =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
> Name                       Count            Time    Rel. time
> ---
> ForceResetter                200        671157us        4.36%
> InsertionSortCollider          7       5145114us       33.42%
>   boundDispatcher              7         93186us        1.81%
>   bound                        7            12us        0.00%
>   copy                         7        160891us        3.13%
>   erase                        7         66932us        1.30%
>   sort&collide                 7       4824071us       93.76%
>   TOTAL                       35       5145095us      100.00%
> InteractionLoop              200       6545848us       42.52%
> NewtonIntegrator             200       3030989us       19.69%
> TOTAL                                 15393110us      100.00%
>
> Common time 460.37680912 s
>
> 5037 spheres, velocity= 365.599773471 +- 8.02397068512 %
> 25103 spheres, velocity= 92.0077536966 +- 3.81069496509 %
> 50250 spheres, velocity= 54.1683980588 +- 0.528288534811 %
> 100467 spheres, velocity= 25.7134767981 +- 1.0796373464 %
> 200813 spheres, velocity= 12.6488486429 +- 4.66276699319 %
>
> SCORE: 18800
> Number of threads 4
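The per-engine speedups in the quoted announcement can be recomputed from the two -j4 timing tables. A minimal sketch in plain Python, with the microsecond values copied from those tables; the dictionary and function names are made up for this example.

```python
# Timings in microseconds, copied from the two -j4 tables quoted above.
trunk_us = {
    "ForceResetter": 700881,
    "InsertionSortCollider": 18816625,
    "InteractionLoop": 6581283,
    "NewtonIntegrator": 3293119,
    "TOTAL": 29391910,
}
pc_us = {
    "ForceResetter": 671157,
    "InsertionSortCollider": 5145114,
    "InteractionLoop": 6545848,
    "NewtonIntegrator": 3030989,
    "TOTAL": 15393110,
}


def speedups(before, after):
    """Return the before/after time ratio for every engine present in both runs."""
    return {name: before[name] / after[name] for name in before if name in after}


ratios = speedups(trunk_us, pc_us)
for name, ratio in ratios.items():
    print(f"{name:<24}{ratio:6.2f}x")
```

From these two tables the collider ratio comes out at roughly 3.66 and the total at roughly 1.91, close to the "x3.68" and "factor 2" quoted in the announcement; the small difference for the collider may simply come from a different run.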
Re: [Yade-dev] parallel collider - testing needed
> > https://yade-dem.org/wiki/Performance_Test
>
> Wow! Speed x6 for 500k particles?!
> It was definitely worth trying with larger numbers, it changes the
> picture completely when the last points are included.
>
> Very nice page.
> Could you also give some absolute timings for completeness? A convenient
> value could be Cundall's number: Np*Nt/Tcpu
> With Np the number of bodies, Nt the number of iterations, Tcpu the
> computation time.

I just updated the page:

https://yade-dem.org/wiki/Performance_Test
Re: [Yade-dev] parallel collider - testing needed
> https://yade-dem.org/wiki/Performance_Test

Wow! Speed x6 for 500k particles?!
It was definitely worth trying with larger numbers, it changes the picture completely when the last points are included.

Very nice page.
Could you also give some absolute timings for completeness? A convenient value could be Cundall's number: Np*Nt/Tcpu
With Np the number of bodies, Nt the number of iterations, Tcpu the computation time.

Bruno
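For anyone adding absolute timings, Cundall's number as defined here is a one-line computation. A minimal sketch; the function name is hypothetical, and the sample figures are the ones posted in this thread for the "pc" branch at -j4 (200813 bodies, 200 iterations as given by the Count column, 15.4320101738 s elapsed).

```python
def cundall_number(n_bodies, n_iterations, cpu_time_s):
    """Np * Nt / Tcpu: body-iterations processed per second of computation."""
    if cpu_time_s <= 0:
        raise ValueError("cpu_time_s must be positive")
    return n_bodies * n_iterations / cpu_time_s


# Figures taken from the -j4 "pc" branch benchmark posted in this thread.
print(f"{cundall_number(200813, 200, 15.4320101738):.3e}")
```

Because it normalizes by both problem size and iteration count, the value can be compared across runs with different numbers of particles or steps.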
Re: [Yade-dev] parallel collider - testing needed
(forwarding to yade-dev)

On 28/02/14 10:13, Klaus Thoeni wrote:
> Hi guys,
>
> have a look at this:
>
> https://yade-dem.org/wiki/Performance_Test
>
> Feel free to add your own tests. If you want I can provide the scripts
> for the graphs.
>
> Cheers
> Klaus
Re: [Yade-dev] parallel collider - testing needed
Quoting Klaus Thoeni:

> Hi Bruno,
>
>> 2/ Hyperthreading is completely useless for heavy computing tasks,
>> actually even bad, as your results suggest.
>
> I did some tests by enabling and disabling hyperthreading some time ago.
> Conclusions: always disable hyperthreading, as you say it makes no sense
> for the kind of things we are doing. Maybe we should mention it somewhere
> on our web page. Any suggestions where?

Good idea. Maybe this page would be a nice place:

https://yade-dem.org/wiki/Multicore_Performance

There is also one thing I am missing on the wiki: a comparison of different CPUs/hardware combinations. We could put one or two benchmark scripts in the examples folder. Users could follow the instructions and provide benchmark results for the hardware they are using, and the results would be published on the wiki.

What do you think about that?
Re: [Yade-dev] parallel collider - testing needed
Hi Bruno,

> 2/ Hyperthreading is completely useless for heavy computing tasks,
> actually even bad, as your results suggest.

I did some tests by enabling and disabling hyperthreading some time ago. Conclusions: always disable hyperthreading, as you say it makes no sense for the kind of things we are doing. Maybe we should mention it somewhere on our web page. Any suggestions where?

> Benchmarking 8 threads via this technique is irrelevant for this reason.
> What I would really like to see is how the collider scales with 8
> non-virtual cores and more.
> I think they can do that in Freiberg and Newcastle (in Grenoble as well,
> in fact, I just didn't find the time).

I did some testing with --performance on our grid on 3 different nodes and with various numbers of cores. I am rerunning the test with 50 particles at the moment and will try to post a summary of all the results here or on the wiki later. In the meantime, some results from our slow AMD Opteron Processor 6282 SE:

yade -j4
5037 spheres, velocity= 94.8073682494 +- 3.55139591623 %
25103 spheres, velocity= 27.7389795715 +- 8.63375047506 %
50250 spheres, velocity= 16.0519684282 +- 5.60688183622 %
100467 spheres, velocity= 6.67235752786 +- 8.84758076674 %
200813 spheres, velocity= 2.66158958354 +- 7.70653861779 %

yade-pc -j4
5037 spheres, velocity= 78.264605326 +- 4.06741633055 %
25103 spheres, velocity= 26.0879865929 +- 2.61754448363 %
50250 spheres, velocity= 15.7245773611 +- 2.24679654566 %
100467 spheres, velocity= 7.64762330727 +- 2.59000324319 %
200813 spheres, velocity= 3.64194000319 +- 1.80798282427 %

yade -j8
5037 spheres, velocity= 138.024763661 +- 14.7299332104 %
25103 spheres, velocity= 35.7526851013 +- 4.24184671794 %
50250 spheres, velocity= 22.0071042904 +- 8.36195041437 %
100467 spheres, velocity= 11.1704832541 +- 11.725537817 %
200813 spheres, velocity= 3.54394003786 +- 5.48119712335 %

yade-pc -j8
5037 spheres, velocity= 133.311680084 +- 1.88168292497 %
25103 spheres, velocity= 34.3688804144 +- 7.43189318211 %
50250 spheres, velocity= 21.3620031259 +- 3.8532356508 %
100467 spheres, velocity= 11.3218727607 +- 3.77428592406 %
200813 spheres, velocity= 6.16209240352 +- 6.24680400297 %

yade -j16
5037 spheres, velocity= 71.8232644642 +- 41.7059425388 %
25103 spheres, velocity= 24.6342039841 +- 3.98148164778 %
50250 spheres, velocity= 16.1247061321 +- 4.73981941981 %
100467 spheres, velocity= 9.23509237236 +- 2.14822969955 %
200813 spheres, velocity= 2.91721702399 +- 3.88145803663 %

yade-pc -j16
5037 spheres, velocity= 129.908588625 +- 15.6874714595 %
25103 spheres, velocity= 33.526601121 +- 13.7594343427 %
50250 spheres, velocity= 17.7898704143 +- 7.7469432427 %
100467 spheres, velocity= 11.3877154372 +- 1.74832633634 %
200813 spheres, velocity= 6.95545612967 +- 2.35988760251 %

yade -j32
5037 spheres, velocity= 59.0283160736 +- 51.2569740982 %
25103 spheres, velocity= 18.7622567759 +- 6.54660223453 %
50250 spheres, velocity= 12.3588048445 +- 8.49295845839 %
100467 spheres, velocity= 7.6569548227 +- 6.71719242602 %
200813 spheres, velocity= 2.47982732752 +- 10.4129796959 %

yade-pc -j32
5037 spheres, velocity= 88.990043 +- 15.7295668423 %
25103 spheres, velocity= 18.1857423869 +- 1.17387945175 %
50250 spheres, velocity= 12.6321967406 +- 5.31792620843 %
100467 spheres, velocity= 8.98513348696 +- 4.48699885744 %
200813 spheres, velocity= 6.12495571697 +- 1.48933071382 %

Summary for 200k particles:
-> -j4: scale = 1.37
-> -j8: scale = 1.74
-> -j16: scale = 2.38
-> -j32: scale = 2.47

These numbers might look different on our Intel nodes, I still have to check.

> What I need also before pushing to trunk is more testing with real
> scripts, not just --performance.
> I only covered a narrow range of situations with my own scripts, I would
> like to be sure that it will not break in other cases.

Maybe ask mister Fu, he really seems to be keen on increasing his computing scale ;-)

Cheers
Klaus
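The scale factors in Klaus's summary match the ratio of the yade-pc to yade "velocity" values in the 200813-sphere rows. A minimal sketch reproducing them; the dictionary names are made up for this example, and reading "scale" as that ratio is an inference from the numbers rather than something stated in the message.

```python
# "velocity" values reported for the 200813-sphere case, keyed by thread count.
trunk = {4: 2.66158958354, 8: 3.54394003786, 16: 2.91721702399, 32: 2.47982732752}
pc = {4: 3.64194000319, 8: 6.16209240352, 16: 6.95545612967, 32: 6.12495571697}

# Assumed definition: scale = parallel-collider build / trunk build.
scale = {threads: pc[threads] / trunk[threads] for threads in trunk}
for threads, factor in sorted(scale.items()):
    print(f"-j{threads}: scale = {factor:.2f}")
```

The same ratio can be taken for the smaller sphere counts; there the two builds are much closer, in line with the expectation that the collider matters less for small N.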
Re: [Yade-dev] parallel collider - testing needed
Thanks! Comments below.

On 26/02/14 13:33, Matthias Frank wrote:
> i have also some benchmark results:
>
> for 1 thread
> ---
> InsertionSortCollider          7      21314382us       51.34%
> InteractionLoop              200      14890015us       35.87%
> NewtonIntegrator             200       5084295us       12.25%
> TOTAL                                 41513619us      100.00%
>
> for 4 threads
> ---
> InsertionSortCollider          7       8374089us       44.57%
> InteractionLoop              200       6866564us       36.55%
> NewtonIntegrator             200       2915176us       15.52%
> TOTAL                                 18787178us      100.00%
>
> for 8 threads
> ---
> InsertionSortCollider          7       7577257us       39.74%
> InteractionLoop              200       6923126us       36.31%
> NewtonIntegrator             200       3186823us       16.71%
> TOTAL                                 19067561us      100.00%

You are confirming my timings.
1/ ISC scales much better than interaction loop and newton.
2/ Hyperthreading is completely useless for heavy computing tasks, actually even bad, as your results suggest.

Benchmarking 8 threads via this technique is irrelevant for this reason.
What I would really like to see is how the collider scales with 8 non-virtual cores and more.
I think they can do that in Freiberg and Newcastle (in Grenoble as well, in fact, I just didn't find the time).

What I need also before pushing to trunk is more testing with real scripts, not just --performance.
I only covered a narrow range of situations with my own scripts, I would like to be sure that it will not break in other cases.

Cheers.

Bruno
Re: [Yade-dev] parallel collider - testing needed
hi guys,

i have also some benchmark results:

for 1 thread

200801
number of bodies 200813

Elapsed 41.6678731441 sec
Performance 4.79986101782 iter/sec
Extrapolation on 1e5 iters 5.78720460335 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                       Count            Time    Rel. time
---
ForceResetter                200        224925us        0.54%
InsertionSortCollider          7      21314382us       51.34%
InteractionLoop              200      14890015us       35.87%
NewtonIntegrator             200       5084295us       12.25%
TOTAL                                 41513619us      100.00%

Common time 1013.57112694 s

5037 spheres, velocity= 140.463364272 +- 1.28620387158 %
25103 spheres, velocity= 41.138472944 +- 2.34750742651 %
50250 spheres, velocity= 24.1614197693 +- 0.709212706826 %
100467 spheres, velocity= 11.7041352478 +- 0.681390348657 %
200813 spheres, velocity= 5.20881044621 +- 5.57298683259 %

SCORE: 7993
Number of threads 1

for 4 threads

200801
number of bodies 200813

Elapsed 18.8133409023 sec
Performance 10.6307540505 iter/sec
Extrapolation on 1e5 iters 2.61296401421 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                       Count            Time    Rel. time
---
ForceResetter                200        631347us        3.36%
InsertionSortCollider          7       8374089us       44.57%
InteractionLoop              200       6866564us       36.55%
NewtonIntegrator             200       2915176us       15.52%
TOTAL                                 18787178us      100.00%

Common time 443.513967991 s

5037 spheres, velocity= 404.919400864 +- 0.912571165941 %
25103 spheres, velocity= 105.118936499 +- 2.36368208547 %
50250 spheres, velocity= 61.4143580936 +- 1.40115209383 %
100467 spheres, velocity= 25.7654736657 +- 2.93262637568 %
200813 spheres, velocity= 12.2452664182 +- 9.39816092272 %

SCORE: 19832
Number of threads 4

for 8 threads

200801
number of bodies 200813

Elapsed 19.0994348526 sec
Performance 10.4715140287 iter/sec
Extrapolation on 1e5 iters 2.65269928508 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                       Count            Time    Rel. time
---
ForceResetter                200       1380352us        7.24%
InsertionSortCollider          7       7577257us       39.74%
InteractionLoop              200       6923126us       36.31%
NewtonIntegrator             200       3186823us       16.71%
TOTAL                                 19067561us      100.00%

Common time 479.59920001 s

5037 spheres, velocity= 355.829004066 +- 2.37547928463 %
25103 spheres, velocity= 87.4558634849 +- 2.63148596504 %
50250 spheres, velocity= 56.1805332982 +- 2.18028212667 %
100467 spheres, velocity= 26.26403263 +- 9.82416513972 %
200813 spheres, velocity= 11.736613584 +- 8.6342992153 %

SCORE: 18265
Number of threads 8

4 threads without virtualization

200801
number of bodies 200813

Elapsed 23.8045229912 sec
Performance 8.40176465935 iter/sec
Extrapolation on 1e5 iters 3.30618374878 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                       Count            Time    Rel. time
---
ForceResetter                200        523769us        2.20%
InsertionSortCollider          7      15676590us       65.87%
InteractionLoop              200       5077634us       21.33%
NewtonIntegrator             200       2522054us       10.60%
TOTAL                                 23800048us      100.00%

Common time 437.141875982 s

5037 spheres, velocity= 611.163145541 +- 0.257590873987 %
25103 spheres, velocity= 1
Re: [Yade-dev] parallel collider - testing needed
> after running "make install" in my build folder I start yade using
> "python yadeparallel -j4 --performance"

Why "python" in the first place?!
I would not be surprised if the number of cores allocated to python was 1, which may cause "yade -j4" to run in a single-thread context.

B
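One way to test the suspicion that the process is confined to a single core is to print the CPU affinity from inside the interpreter that runs the benchmark. A minimal sketch using only the Python standard library; the helper name is made up, and this is a generic diagnostic, not a Yade feature. Whether OMP_NUM_THREADS or the affinity mask actually explains the slowdown on that machine is not established in this thread.

```python
import os


def usable_cores():
    """Number of CPU cores the current process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1  # fallback: total core count, ignores affinity


print("usable cores:", usable_cores())
print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS", "<unset>"))
```

If this reports 1 usable core while the machine has several, the threads requested with -j4 would all be scheduled on the same core, which would be consistent with -j4 being slower than -j1.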
Re: [Yade-dev] parallel collider - testing needed
>>> It is a good benchmark overall, the problem is that it is hardly
>>> reproducible. Each run can give a really different total time (more
>>> than a factor 2 between two measured times); didn't you see that too?
>> when i run the script with num_balls1D = 10 i get:
>>
> Mmmmh... I should try again then (I didn't save the logs).
> Thanks.

I confirm your results. The timings are stable in another try. I guess it was an effect of other tasks (internet browsing and so on).

Bruno
Re: [Yade-dev] parallel collider - testing needed
>> yes, it is faster at -j1:

So this is an independent problem. For me -j4 is always faster and effectively uses 4 cores, be it with the old or the new collider. I have no idea what can be wrong with your processor.

B
Re: [Yade-dev] parallel collider - testing needed
I can confirm this behavior for the performance test on my machine. I'm not sure whether it is caused by the way I start the compiled yade: after running "make install" in my build folder I start yade using "python yadeparallel -j4 --performance" in the ./install/bin folder. "parallel" is the DSUFFIX I added in cmake.
I will now try to run some other script (martin-niehoff's dynamic simulation) using the same kind of command.

-----Original Message-----
From: Yade-dev [mailto:yade-dev-bounces+alexander.eulitz=iwf.tu-berlin...@lists.launchpad.net] On Behalf Of Christian Jakob
Sent: Wednesday, 26 February 2014 08:54
To: yade-dev@lists.launchpad.net
Subject: Re: [Yade-dev] parallel collider - testing needed

>> There is apparently a problem with your computer/compilation option/other?
>> If you run an ordinary simulation with -j4 and many particles do you see
>> 4 cores used?

yes, for normal scripts it is running 4 threads at 4 cores, but --performance assigns all threads to one core it seems...

> Is there any difference at all on this machine, between -j1 and -j4?

yes, it is faster at -j1:

number of bodies 200813

Elapsed 69.9356219769 sec
Performance 2.85977295042 iter/sec
Extrapolation on 1e5 iters 9.71328083012 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                         Count          Time   Rel. time
-------------------------------------------------------------
ForceResetter                  200     1069504us       1.53%
InsertionSortCollider            7    21301263us      30.46%
InteractionLoop                200    29700514us      42.47%
NewtonIntegrator               200    17853603us      25.53%
TOTAL                                 69924885us     100.00%

Common time 2067.0501442 s

5037 spheres, velocity= 71.3011948258 +- 0.426132271892 %
25103 spheres, velocity= 18.804595478 +- 1.73479566756 %
50250 spheres, velocity= 10.9461326398 +- 0.367180852894 %
100467 spheres, velocity= 5.45291715221 +- 0.431878602357 %
200813 spheres, velocity= 2.85102513277 +- 0.221541088185 %

SCORE: 3959
Number of threads 1
Re: [Yade-dev] parallel collider - testing needed
>> There is apparently a problem with your computer/compilation option/other?
>> If you run an ordinary simulation with -j4 and many particles do you see
>> 4 cores used?

yes, for normal scripts it is running 4 threads at 4 cores, but --performance assigns all threads to one core it seems...

> Is there any difference at all on this machine, between -j1 and -j4?

yes, it is faster at -j1:

number of bodies 200813

Elapsed 69.9356219769 sec
Performance 2.85977295042 iter/sec
Extrapolation on 1e5 iters 9.71328083012 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                         Count          Time   Rel. time
-------------------------------------------------------------
ForceResetter                  200     1069504us       1.53%
InsertionSortCollider            7    21301263us      30.46%
InteractionLoop                200    29700514us      42.47%
NewtonIntegrator               200    17853603us      25.53%
TOTAL                                 69924885us     100.00%

Common time 2067.0501442 s

5037 spheres, velocity= 71.3011948258 +- 0.426132271892 %
25103 spheres, velocity= 18.804595478 +- 1.73479566756 %
50250 spheres, velocity= 10.9461326398 +- 0.367180852894 %
100467 spheres, velocity= 5.45291715221 +- 0.431878602357 %
200813 spheres, velocity= 2.85102513277 +- 0.221541088185 %

SCORE: 3959
Number of threads 1
Re: [Yade-dev] parallel collider - testing needed
Is there any difference at all on this machine, between -j1 and -j4?

B

On 25/02/14 18:56, Bruno Chareyre wrote:
> There is apparently a problem with your computer/compilation option/other?
> If you run an ordinary simulation with -j4 and many particles do you see
> 4 cores used?
>
> Bruno
Re: [Yade-dev] parallel collider - testing needed
There is apparently a problem with your computer/compilation option/other?
If you run an ordinary simulation with -j4 and many particles do you see 4 cores used?

Bruno
Re: [Yade-dev] parallel collider - testing needed
Hi Bruno,

I did some tests with your new collider:

My "old" machine (2 CPU sockets with 4 cores each, Intel(R) Xeon(R) CPU X5460 @ 3.16GHz) says:

yade-trunk -j4 --performance

Welcome to Yade 2014-02-18.git-af75797
.
number of bodies 200813

Elapsed 74.6882498264 sec
Performance 2.67779738399 iter/sec
Extrapolation on 1e5 iters 10.3733680314 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                         Count          Time   Rel. time
-------------------------------------------------------------
ForceResetter                  200     2625848us       3.52%
InsertionSortCollider            7    21494603us      28.79%
InteractionLoop                200    32631323us      43.70%
NewtonIntegrator               200    17913859us      23.99%
TOTAL                                 74665635us     100.00%

Common time 3845.09048295 s

Calculation velocity is unstable, try to close all programs and start performance tests again
5037 spheres, velocity= 44.7832284176 +- 60.1189421161 %
25103 spheres, velocity= 17.4121076601 +- 0.99355345037 %
50250 spheres, velocity= 10.0714940216 +- 1.5389769 %
100467 spheres, velocity= 5.05891811219 +- 0.434738330959 %
200813 spheres, velocity= 2.65826879857 +- 0.933088603948 %

SCORE: 3479
Number of threads 4

###

yade-parallel -j4 --performance (your pc branch)

Welcome to Yade 2014-02-24.git-b60d388
.
number of bodies 200813

Elapsed 75.6688189507 sec
Performance 2.64309662518 iter/sec
Extrapolation on 1e5 iters 10.5095581876 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                         Count          Time   Rel. time
-------------------------------------------------------------
ForceResetter                  200     2600100us       3.44%
InsertionSortCollider            7    20746020us      27.43%
InteractionLoop                200    34455725us      45.55%
NewtonIntegrator               200    17838205us      23.58%
TOTAL                                 75640051us     100.00%

Common time 4093.34840894 s

Calculation velocity is unstable, try to close all programs and start performance tests again
5037 spheres, velocity= 44.3999135517 +- 61.0812025756 %
25103 spheres, velocity= 16.8531534243 +- 1.32470154863 %
50250 spheres, velocity= 9.61504490252 +- 0.670186229301 %
100467 spheres, velocity= 4.86679881913 +- 0.487840014886 %
200813 spheres, velocity= 2.64490152313 +- 0.285084118261 %

SCORE: 3402
Number of threads 4

##

On my computer there seems to be nearly no speedup ...

Looking at htop tells me that -j4 --performance is using 4 threads, but just on 1 core ...

Regards,

Christian
Re: [Yade-dev] parallel collider - testing needed
> Sorry for the late reply. Feel free to share the pdf. Originally it was supposed
> to be transferred to the wiki, anyway.
> I'm thinking about a good way to measure performance for highly dynamic
> simulations, now. Maybe the script that martin-niehoff posted [1] would be
> useful. It is basically a regular cubic pack of spheres that is placed in a
> vibrating tub. The simulation runs for a single second (simulation time) and
> excitation of the tub causes the pack to disperse. It is of great interest to
> see whether such simulations benefit from the new collider, too, I think.

The more dynamic the simulation, the more you should see the effect of the parallel collider, since higher velocities will trigger collision detection more often. The benchmark you suggest should be ok.

B
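The link between velocity and collider activity can be illustrated with a toy model. The sketch below assumes that collision detection is re-run whenever the fastest body has travelled further than a fixed safety margin since the last detection pass; this is a simplification, and `count_collider_runs` and its parameters are hypothetical names, not Yade API.

```python
def count_collider_runs(max_speed, dt, margin, n_steps):
    """Count detection passes over n_steps for a body moving at max_speed.

    Assumed toy rule: a new pass is triggered once the distance travelled
    since the previous pass exceeds `margin`.
    """
    runs = 1          # one initial pass before the first step
    travelled = 0.0   # distance covered since the last pass
    for _ in range(n_steps):
        travelled += max_speed * dt
        if travelled > margin:
            runs += 1
            travelled = 0.0
    return runs


if __name__ == "__main__":
    # Same time step and margin, increasing velocity.
    for speed in (0.1, 1.0, 10.0):
        print(speed, count_collider_runs(speed, dt=1e-3, margin=0.02, n_steps=200))
```

Under this assumption the number of passes grows roughly in proportion to the velocity, so the collider takes a larger share of the run time in dynamic cases.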
Re: [Yade-dev] parallel collider - testing needed
Sorry for the late reply. Feel free to share the pdf. Originally it was supposed to be transferred to the wiki, anyway.
I'm thinking about a good way to measure performance for highly dynamic simulations, now. Maybe the script that martin-niehoff posted [1] would be useful. It is basically a regular cubic pack of spheres that is placed in a vibrating tub. The simulation runs for a single second (simulation time) and excitation of the tub causes the pack to disperse. It is of great interest to see whether such simulations benefit from the new collider, too, I think.

Alex

[1] https://answers.launchpad.net/yade/+question/242644 answer #10

-----Original Message-----
From: Yade-dev [mailto:yade-dev-bounces+alexander.eulitz=iwf.tu-berlin...@lists.launchpad.net] On Behalf Of Bruno Chareyre
Sent: Monday, 24 February 2014 16:57
To: yade-dev@lists.launchpad.net
Subject: Re: [Yade-dev] parallel collider - testing needed

I forgot to mention two things:

1- I tried the benchmark used by Christian for comparisons with PFC [1], however it seems that this test is very special. I get large differences between two runs. Basically, it seems the simulation only depends on truncation errors: vertical columns of spheres remain stable until a small bit of horizontal noise makes them fall down one by one. If you look at the simulation in the GUI it looks strange. I did not insist with this one, I think it could be improved by replacing the lattice by disordered packings.

2- The benchmark done by Alexander some time ago (on the same problem but with -j>1) is not visible anywhere if I'm not wrong. I have a copy of the pdf, is it ok to upload it on the wiki? It is an interesting starting point for evaluating the parallel collider.

Bruno

[1] https://www.yade-dem.org/wiki/Comparisons_with_PFC3D
Re: [Yade-dev] parallel collider - testing needed
On 25/02/14 10:17, Christian Jakob wrote:
>> It is a good benchmark overall; the problem is that it is hardly
>> reproducible. Each run can give a really different total time (more than
>> a factor of 2 between two measured times), didn't you see that too?
>
> When I run the script with num_balls1D = 10 I get:

Mmmmh... I should try again then (I didn't save the logs).
Thanks.

B
Re: [Yade-dev] parallel collider - testing needed
> It is a good benchmark overall; the problem is that it is hardly
> reproducible. Each run can give a really different total time (more than
> a factor of 2 between two measured times), didn't you see that too?

When I run the script with num_balls1D = 10 I get:

Welcome to Yade 2014-02-18.git-af75797
TCP python prompt on localhost:9001, auth cookie `cydaeu'
XMLRPC info provider on http://localhost:21001
Running script calc-time-YADE2014.py
--- run 1 of 20 --- 1.041
--- run 2 of 20 --- 0.961
--- run 3 of 20 --- 0.961
--- run 4 of 20 --- 0.961
--- run 5 of 20 --- 0.961
--- run 6 of 20 --- 0.961
--- run 7 of 20 --- 1.001
--- run 8 of 20 --- 1.041
--- run 9 of 20 --- 1.041
--- run 10 of 20 --- 1.001
--- run 11 of 20 --- 0.961
--- run 12 of 20 --- 1.001
--- run 13 of 20 --- 1.001
--- run 14 of 20 --- 0.961
--- run 15 of 20 --- 0.961
--- run 16 of 20 --- 1.001
--- run 17 of 20 --- 1.041
--- run 18 of 20 --- 1.041
--- run 19 of 20 --- 1.001
--- run 20 of 20 --- 1.041

I do not see a factor of 2, do you? Can you run the original script and post your results?

c
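To put a number on the run-to-run spread, the twenty timings listed in this message can be summarised with the standard library. This is just a small sketch; the helper name `spread` is made up, and the values are copied from the log above in whatever unit the script prints.

```python
import statistics

# The 20 timings reported in the message above, in run order.
timings = [
    1.041, 0.961, 0.961, 0.961, 0.961, 0.961, 1.001, 1.041, 1.041, 1.001,
    0.961, 1.001, 1.001, 0.961, 0.961, 1.001, 1.041, 1.041, 1.001, 1.041,
]


def spread(values):
    """Return mean, sample standard deviation and slowest/fastest ratio."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "max_over_min": max(values) / min(values),
    }


if __name__ == "__main__":
    for key, value in spread(timings).items():
        print("%s = %.4f" % (key, value))
```

The slowest/fastest ratio is the quantity to compare against the "factor 2" mentioned in the quoted text.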
Re: [Yade-dev] parallel collider - testing needed
> I rotated the wall below a little bit to make it slightly aslope. This
> is the reason why columns can collapse (not because of truncation error):

I see.

> As you mentioned in a previous post we should define two benchmarking
> scripts. One for quasi-static simulations and one for dynamic ones.
> The one I used for comparison to PFC is quasi-static at the beginning
> and turns into a dynamic one.
> It seems not to be the best choice for a benchmark.

It is a good benchmark overall; the problem is that it is hardly reproducible. Each run can give a really different total time (more than a factor of 2 between two measured times; didn't you see that too? Or is it my computer that failed for some reason?), so it needs a very large number of runs to get a relevant average. It also means that subtle differences in the code could lead to systematic bias. I would change the benchmark a little, with some randomness in the initial positions.
I could prepare another benchmark of the triaxial type. It can combine dynamics in the initial steps and static situations after enough steps.

Bruno
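One possible way to add "some randomness in the initial positions" is to jitter each lattice site by a small random offset while keeping the seed fixed, so that every run starts from the same perturbed packing. The sketch below is plain Python and only generates sphere centres; `jittered_lattice` and its parameters are hypothetical, and feeding the centres into an actual Yade script is left out.

```python
import random


def jittered_lattice(n_per_side, spacing, jitter, seed=0):
    """Return centres of a cubic lattice, each coordinate shifted by a
    uniform random offset in [-jitter, +jitter].

    To avoid initial overlaps, jitter should stay below half the gap
    between neighbouring spheres (spacing minus one sphere diameter).
    """
    rng = random.Random(seed)  # fixed seed keeps the benchmark reproducible
    centres = []
    for i in range(n_per_side):
        for j in range(n_per_side):
            for k in range(n_per_side):
                centres.append(tuple(
                    spacing * index + rng.uniform(-jitter, jitter)
                    for index in (i, j, k)
                ))
    return centres


if __name__ == "__main__":
    for centre in jittered_lattice(2, spacing=1.0, jitter=0.05):
        print(centre)
```

Changing the seed gives a different but statistically similar packing, which makes it possible to average timings over several realisations.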
Re: [Yade-dev] parallel collider - testing needed
Quoting Bruno Chareyre:

> I forgot to mention two things:
>
> 1- I tried the benchmark used by Christian for comparisons with PFC [1],
> however it seems that this test is very special. I get large differences
> between two runs. Basically, it seems the simulation only depends on
> truncation errors: vertical columns of spheres remain stable until a small
> bit of horizontal noise makes them fall down one by one. If you

I rotated the wall below a little bit to make it slightly aslope. This is the reason why columns can collapse (not because of truncation error):

    # rotation quaternion:
    orientationWall = Quaternion(Vector3(.01,.01,1),math.pi)
    # create box:
    id_box=O.bodies.append(utils.box((origin_wall,origin_wall,-.5),(200,200,.5),orientationWall,fixed=True,material=WallMat))

As you mentioned in a previous post we should define two benchmarking scripts. One for quasi-static simulations and one for dynamic ones.
The one I used for comparison to PFC is quasi-static at the beginning and turns into a dynamic one.
It seems not to be the best choice for a benchmark.

> look at the simulation in the GUI it looks strange. I did not insist with
> this one, I think it could be improved by replacing the lattice by
> disordered packings.
>
> 2- The benchmark done by Alexander some time ago (on the same problem but
> with -j>1) is not visible anywhere if I'm not wrong. I have a copy of the
> pdf, is it ok to upload it on the wiki? It is an interesting starting
> point for evaluating the parallel collider.
>
> Bruno
>
> [1] https://www.yade-dem.org/wiki/Comparisons_with_PFC3D
Re: [Yade-dev] parallel collider - testing needed
I forgot to mention two things:

1- I tried the benchmark used by Christian for comparisons with PFC [1]; however, this test seems to be very special. I get large differences between two runs. Basically, it seems the simulation only depends on truncation errors: vertical columns of spheres remain stable until a small bit of horizontal noise makes them fall down one by one. If you look at the simulation in the GUI it looks strange. I did not insist with this one; I think it could be improved by replacing the lattice by disordered packings.

2- The benchmark done by Alexander some time ago (on the same problem but with -j>1) is not visible anywhere, if I'm not wrong. I have a copy of the pdf; is it ok to upload it to the wiki? It is an interesting starting point for evaluating the parallel collider.

Bruno

[1] https://www.yade-dem.org/wiki/Comparisons_with_PFC3D
[Yade-dev] parallel collider - testing needed
Hi there,
I implemented a parallel version of the InsertionSortCollider. It is
almost ready but not yet pushed to the main trunk, as I have a few
things to check before that.
It would be helpful if some of you could 1/ test that your scripts work
correctly and 2/ benchmark this for N>100k and j>4.
If you run benchmarks, please remember to always activate timing and
report the result of timing.stats(). It gives much more interesting data
than the wall clock time.

Preliminary benchmark results are below (from my laptop...), showing a
speedup by a factor of 2 on the total computation time for j4/200k
particles (compared to the sequential collider).
The speedup on the collider alone is in fact of the order of x3.68 for 4
threads, i.e. nearly linear, at least for such a small number of threads.

My expectation is that it should change almost nothing for a small number
of particles (say, N<10k), where colliding is an inexpensive step.
For 1 million particles, OTOH, there could be a significant speedup,
since the collider takes most of the time.

You can get the "pc" branch at my github repo:
git clone -b pc https://github.com/bchareyre/trunk.git

Results of yade -j4 --performance are below (i7 quad-core with
hyperthreading enabled, lightly loaded by background tasks - j>4 not
reported as hyperthreading is probably doing no good).

Happy benchmarking. :)

Bruno


./yade-trunk -j4 --performance (the current trunk)
...
number of bodies 200813

Elapsed 29.4102840424 sec
Performance 6.80034234664 iter/sec
Extrapolation on 1e5 iters 4.08476167255 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                       Count          Time   Rel. time
----------------------------------------------------------
ForceResetter                200      700881us       2.38%
InsertionSortCollider          7    18816625us      64.02%
InteractionLoop              200     6581283us      22.39%
NewtonIntegrator             200     3293119us      11.20%
TOTAL                               29391910us     100.00%

Common time 597.731503963 s

5037 spheres, velocity= 327.689688709 +- 5.13604387635 %
25103 spheres, velocity= 81.2726909754 +- 1.0105334405 %
50250 spheres, velocity= 45.4114521341 +- 3.02333274436 %
100467 spheres, velocity= 19.0287424005 +- 2.26073439157 %
200813 spheres, velocity= 6.51664351023 +- 4.03351515402 %

SCORE: 13777
Number of threads 4


./yade-parallel -j4 --performance (my "pc" branch)

number of bodies 200813

Elapsed 15.4320101738 sec
Performance 12.9600744004 iter/sec
Extrapolation on 1e5 iters 2.14333474636 hours
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*
Name                       Count          Time   Rel. time
----------------------------------------------------------
ForceResetter                200      671157us       4.36%
InsertionSortCollider          7     5145114us      33.42%
  boundDispatcher              7       93186us       1.81%
  bound                        7          12us       0.00%
  copy                         7      160891us       3.13%
  erase                        7       66932us       1.30%
  sort&collide                 7     4824071us      93.76%
  TOTAL                       35     5145095us     100.00%
InteractionLoop              200     6545848us      42.52%
NewtonIntegrator             200     3030989us      19.69%
TOTAL                               15393110us     100.00%

Common time 460.37680912 s

5037 spheres, velocity= 365.599773471 +- 8.02397068512 %
25103 spheres, velocity= 92.0077536966 +- 3.81069496509 %
50250 spheres, velocity= 54.1683980588 +- 0.528288534811 %
100467 spheres, velocity= 25.7134767981 +- 1.0796373464 %
200813 spheres, velocity= 12.6488486429 +- 4.66276699319 %

SCORE: 18800
Number of threads 4

___
Mailing list: https://launchpad.net/~yade-dev
Post to     : yade-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yade-dev
More help   : https://help.launchpad.net/ListHelp
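The speedup figures quoted in the message can be checked against the two timing.stats() tables it reports. The sketch below is plain Python and needs no Yade install. The dictionary and variable names are only illustrative, and the timings are copied from the top-level rows of the two -j4 runs. The last step is an Amdahl-style estimate, under the assumption that only the collider changed between the two runs.

```python
# Per-engine times in microseconds, copied from the two -j4 tables in the message.
trunk = {
    "ForceResetter": 700881,
    "InsertionSortCollider": 18816625,
    "InteractionLoop": 6581283,
    "NewtonIntegrator": 3293119,
}
pc_branch = {
    "ForceResetter": 671157,
    "InsertionSortCollider": 5145114,
    "InteractionLoop": 6545848,
    "NewtonIntegrator": 3030989,
}


def speedup(before_us, after_us):
    """Ratio of old time to new time; a value above 1 means the new run is faster."""
    return before_us / after_us


collider_speedup = speedup(trunk["InsertionSortCollider"],
                           pc_branch["InsertionSortCollider"])
total_speedup = speedup(sum(trunk.values()), sum(pc_branch.values()))

# Fraction of the trunk run spent in the collider.
collider_share = trunk["InsertionSortCollider"] / sum(trunk.values())

# Amdahl-style estimate: total speedup if only the collider had become faster.
predicted_total = 1.0 / ((1.0 - collider_share) + collider_share / collider_speedup)

print(f"collider speedup:        x{collider_speedup:.2f}")
print(f"total speedup:           x{total_speedup:.2f}")
print(f"collider share (trunk):  {collider_share:.1%}")
print(f"predicted total speedup: x{predicted_total:.2f}")
```

With the numbers as printed in the tables, this gives roughly x3.66 for the collider and x1.91 for the whole run, in line with the "of the order of x3.68" and "factor of 2" figures in the message. The predicted total comes out slightly below the measured one because the other engines were also marginally faster in the "pc" run.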