On 04/02/2015 14:29, Michael Black wrote:

Hi Mike,

> I don't think you'll find any gain using FFTW openmp. WSJT-X does not do
> big enough FFTs to overtake the thread create/delete overhead.

That's not so, Mike. Joe has already determined that FFTW3, given roughly
2 threads, has a small performance gain on the larger FFTs in the decoders.
There may be some confusion here. The talk about using the OpenMP version
of FFTW3 is as an alternative to the native/pthreads version. Both are
multi-threaded and have similar performance. The OpenMP version has the
benefit that it is aware of the threads being used elsewhere in the
application and therefore plays well with the dynamic number of threads
algorithm in OpenMP. This is currently not relevant to us, as we are simply
dividing the work in half and running the two halves (JT65 decode and JT9
decode) in parallel; the thread allocation for that is trivial, i.e. 2 if
there are at least 2 CPUs available. We also have direct control of the
number of threads FFTW3 uses, so we can allocate any spare CPUs, above the
two used for the decoder threads, to the larger FFT plans.

> I worked a job a few years ago on a 512-core machine doing FFTs on synthetic
> aperture radar systems. Using FFTW with OpenMp did very little good. Using
> OpenMP at the layer above did...which is the same thing I think we'll find
> here.
> OpenMP inside FFTW for small FFTs will have the overhead dominate and defeat
> it.

wsjtx/jt9 use a number of different FFT sizes. Currently the '-m #'
argument is being used as the thread count for all of them. We probably
need to use more than one thread only for the large FFTs, as you are
correct that thread synchronization overhead is proportionally high for
small FFTs, but FFTW3 does address this internally by only using multiple
threads on FFTs larger than ~2^11.

> We're already seeing only a 20-25% improvement in openmp at this level which
> is a clear indication to me that we're not getting anywhere near 100% gain
> for threading so doing it at a lower level isn't worth it.

That is comparing Apples and Screwdrivers ;) The threading strategy for the
decoders is one task per thread, whereas the FFT strategy is a true divide
and conquer algorithm with a recursive distribution to threads.
They are both able to deliver performance improvements in the same
application given enough CPUs to run on (2 for decoders + ~2 for FFTs has
been shown to be optimal). Note that there is absolutely no threading
contention or overhead between the FFTs and the decoders, even though the
latter use the former. So, given that the average low-end PC these days is
usually at least a dual-core hyper-threaded Intel processor or equivalent,
we can assume that 4 CPUs are available. Not achieving 100% improvement at
this stage from parallel decoding is likely to be due to overheads that we
can and should address, like not having the correct granularity on locks
and being too pessimistic about data sharing controls. The FFTW3
concurrency is in and working, but the direct use of OpenMP for parallel
decoding is yet to be fully implemented.

> When I did my 512-core system I was getting over 90% gain for each thread I
> added.
> Much like "don't sweat the small stuff" I think you'll find "don't
> multi-thread the small FFTs" is a good paradigm...
> When you got a ~50% gain then it's time to look at multi-threading below
> that level.

OK, but the FFTW3 threading is almost free in terms of complexity: the FFTW
developers have done all the hard stuff, we just need to turn it on. That
means even quite small gains are cost effective. OTOH the direct use of
OpenMP in the decoder is adding a lot of complexity, since we have to
design and implement or eliminate the data sharing controls. The potential
gain is large, so it is probably worth the cost in development effort and
complexity.

> Mike W9MDB

73
Bill
G4WJS.
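
As a rough illustration of the thread budgeting discussed above, here is a
minimal sketch in C. It is not code from WSJT-X and every name in it is
hypothetical. It assumes two CPUs are reserved for the JT65 and JT9 decoder
threads, takes the ~2^11 figure mentioned above as the size below which an
FFT stays single-threaded, and caps FFT threads at 2 in line with the
"~2 for FFTs" observation. With the FFTW3 threads library, the returned
value would presumably be passed to fftwf_plan_with_nthreads() before the
plan is created, after a one-time call to fftwf_init_threads().

```c
#include <assert.h>

/* Hypothetical constants, chosen to mirror the figures in the message. */
#define DECODER_THREADS 2     /* one thread each for JT65 and JT9 decode  */
#define LARGE_FFT_MIN   2048  /* ~2^11: smaller FFTs stay single-threaded */
#define MAX_FFT_THREADS 2     /* "~2 for FFTs" reported as optimal        */

/*
 * Return the number of threads to request for an FFT plan of length n
 * when `cpus` CPUs are available to the whole process.
 */
int fft_threads_for(int n, int cpus)
{
    assert(n >= 1 && cpus >= 1);

    /* CPUs left over once the decoder threads have theirs. */
    int spare = cpus - DECODER_THREADS;

    /* Small FFT, or nothing spare: threading overhead would dominate. */
    if (n < LARGE_FFT_MIN || spare < 1)
        return 1;

    /* Large FFT: use the spare CPUs, up to the cap. */
    return spare < MAX_FFT_THREADS ? spare : MAX_FFT_THREADS;
}
```

On the assumed 4-CPU machine this gives 1 thread for a 1024-point FFT and
2 threads for a 65536-point one; on a 2-CPU machine every FFT gets 1 thread,
leaving both CPUs to the decoders.
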
>
> -----Original Message-----
> From: Bill Somerville [mailto:[email protected]]
> Sent: Wednesday, February 04, 2015 8:21 AM
> To: [email protected]
> Subject: Re: [wsjt-devel] v4926 OpenMP
>
> On 04/02/2015 14:05, John Nelson wrote:
>> Hi Bill and Joe,
> Hi John,
>> With regard to Mac builds, your [Bill] code test with workspace and
>> workspace_mt executes correctly with my gfortran compiler. However, as you
>> point out the current clang/clang++ do not [yet] have OpenMP support.
>> So when I compile fftw_3.3.4 with --enable-threads, I cannot also use
>> --with-openmp. I also get:
>> -- Try OpenMP C flag = [ ]
>> -- Performing Test OpenMP_FLAG_DETECTED
>> -- Performing Test OpenMP_FLAG_DETECTED - Failed
> I am experimenting with the MacPorts gcc 4.9 suite with building WSJT-X.
> That needs changes to the CMake script which I have not committed yet.
> So far it doesn't seem to be necessary to build or use the OpenMP version of
> FFTW3, the native/pthreads version is working well and seems to be
> compatible with an OpenMP program. I believe the only issue is that we need
> to control the number of threads used by FFTW3 and OpenMP manually to a
> certain extent. If it does become necessary to use the OpenMP version of
> FFTW3, that can be built on Mac, again I have the MacPorts version
> available.
>
> There also appears to be a bug in CMake that is causing it not to pass on
> the portability options to the gcc compilers/linker (MAC_OSX_SYSROOT and
> MAC_OSX_DEPLOYMENT_TARGET). This is not serious and can be worked around if
> necessary but I want to get it sorted out properly if possible.
>
> My current focus apart from v1.4 issues is to help Joe with multi-threading
> hazards in jt9 but I am working on the Mac builds with OpenMP as well.
>> when building WSJT-X r4928 which is currently executing successfully - and
>> certainly decodes rapidly.
> You are getting the latest performance increases which are significant.
> The OpenMP jt9, which is not in WSJT-X yet, has the potential to almost halve
> decoding times in dual JT65+JT9 mode when there is equivalent work to be
> done in each mode.
>> --- John G4KLA
> 73
> Bill
> G4WJS.
>
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming. The Go Parallel Website,
> sponsored by Intel and developed in partnership with Slashdot Media, is your
> hub for all things parallel software development, from weekly thought
> leadership blogs to news, videos, case studies, tutorials and more. Take a
> look and join the conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> wsjt-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/wsjt-devel
