I have a recent iMac with 4 physical cores (8 logical cores with hyper-threading). I would have expected peakflops(N), for a large enough N, to increase with the number of threads I allow BLAS to use. I do find that peakflops(N) with 1 thread is about half of peakflops(N) with 2 threads, but there is no further gain from going to 4 threads. Are my expectations wrong here, or is it possible that BLAS is somehow configured incorrectly on my machine? In the example below, N = 6755, a number relevant for my work, but the results are similar with 5000 or 10000.
Here is my versioninfo():

julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753* (2016-09-19 18:14 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

Here is an example peakflops() exercise:

julia> BLAS.set_num_threads(1)

julia> mean(peakflops(6755) for i=1:10)
5.225580459387056e10

julia> BLAS.set_num_threads(2)

julia> mean(peakflops(6755) for i=1:10)
1.004317640281997e11

julia> BLAS.set_num_threads(4)

julia> mean(peakflops(6755) for i=1:10)
9.838116463900085e10
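For anyone who wants to reproduce this, the same measurement can be written as a loop over thread counts. This is only a minimal sketch, assuming Julia 0.5, where `peakflops`, `mean`, and `BLAS` are available without imports. The helper name `sweep_threads`, the `reps` keyword, and the extra run with 8 threads are my own additions for illustration and are not part of the session above.

```julia
# Sketch for Julia 0.5. On Julia 1.x this would likely need
# `using LinearAlgebra, Statistics`, `LinearAlgebra.peakflops`,
# and `round(x, digits = 1)` instead of `round(x, 1)`.

# Hypothetical helper: mean GFLOPS over `reps` runs for each BLAS thread count.
function sweep_threads(N, counts; reps = 10)
    results = Float64[]
    for n in counts
        BLAS.set_num_threads(n)
        push!(results, mean(peakflops(N) for i = 1:reps) / 1e9)
    end
    return results
end

counts = [1, 2, 4, 8]          # 8 is included only to see what the hyper-threads do
gflops = sweep_threads(6755, counts)

# Report each result along with its speedup relative to the single-thread run.
for (n, g) in zip(counts, gflops)
    println(n, " thread(s): ", round(g, 1), " GFLOPS, speedup ",
            round(g / gflops[1], 2), "x")
end
```

Reporting the speedup relative to one thread makes the plateau easy to spot: with the numbers above, 2 threads give roughly 1.9x and 4 threads also give roughly 1.9x.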