[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-06-23 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

I've sent https://reviews.llvm.org/D82352 to clean up some of the logic in 
clangd.




Comment at: clang-tools-extra/clangd/index/Background.cpp:154
   assert(this->IndexStorageFactory && "Storage factory can not be null!");
-  for (unsigned I = 0; I < ThreadPoolSize; ++I) {
+  for (unsigned I = 0; I < Rebuilder.TUsBeforeFirstBuild; ++I) {
 ThreadPool.runAsync("background-worker-" + llvm::Twine(I + 1), [this] {

aganea wrote:
> sammccall wrote:
> > Hmm, I finally stumbled across this today when editing the rebuild policy.
> > I do wish this hadn't been changed without understanding what this code is 
> > doing or getting review from an owner.
> > 
> > After this change, modifying the background-index rebuild frequency (which 
> > was initially tuned to be roughly "after each thread built one file") has 
> > the side-effect of changing the number of threads used for background 
> > indexing!
> > 
> > Less seriously, the use of zero to mean "all the threads" is problematic 
> > here because in the other thread roots in this project we use zero to mean 
> > "no threads, work synchronously".
> > 
> > I'll look into reverting the clangd parts of this patch.
> Sorry about that Sam. Do you think 'one' could be used in clangd instead? 
> That is the common value used across other parts of LLVM to signify 'no 
> threads', and also when `LLVM_ENABLE_THREADS` is off. 'zero' means to use the 
> default settings for the thread strategy. That is, 
> `llvm::hardware_concurrency(0)` means to use all hardware threads; or 
> `llvm::heavyweight_hardware_concurrency(0)` means to use all hardware cores, 
> but only one `std::thread` per core.
No worries, it happens and nothing came of it except a little head-scratching. 
Sorry for being grumpy, I shouldn't send email late at night...

> Do you think 'one' could be used in clangd instead? That is the common value 
> used across other parts of LLVM to signify 'no threads', and also when 
> LLVM_ENABLE_THREADS is off

One can't be used here, it means "spawn one background thread". (Clangd has 
several distinct components that spawn threads).

FWIW, clangd can't be built without LLVM_ENABLE_THREADS - it needs concurrency 
to be useful, and we designed around threads. Most of the threading can be 
turned off *at runtime* for certain tests (that's what a threadpool size of 
zero means) but not the background index - we just disable it instead.
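
For illustration, here is a minimal sketch of keeping the two knobs separate, so
that tuning the rebuild policy can never change the worker count (hypothetical
stand-in types, not the actual clangd code):

  // Sketch only: 'ThreadPoolSize' controls how many workers are spawned; the
  // rebuild policy (e.g. TUsBeforeFirstBuild) is configured elsewhere and
  // never feeds into this loop.
  #include <functional>
  #include <string>
  #include <thread>
  #include <vector>

  struct TaskRunner { // stand-in for clangd's async task runner
    std::vector<std::thread> Workers;
    void runAsync(const std::string &Name, std::function<void()> Action) {
      Workers.emplace_back(std::move(Action)); // thread name ignored here
    }
    ~TaskRunner() {
      for (auto &W : Workers)
        W.join();
    }
  };

  void startBackgroundWorkers(TaskRunner &Pool, unsigned ThreadPoolSize) {
    for (unsigned I = 0; I < ThreadPoolSize; ++I)
      Pool.runAsync("background-worker-" + std::to_string(I + 1),
                    [] { /* drain the indexing queue */ });
  }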


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-06-23 Thread Alexandre Ganea via Phabricator via cfe-commits
aganea marked an inline comment as done.
aganea added inline comments.



Comment at: clang-tools-extra/clangd/index/Background.cpp:154
   assert(this->IndexStorageFactory && "Storage factory can not be null!");
-  for (unsigned I = 0; I < ThreadPoolSize; ++I) {
+  for (unsigned I = 0; I < Rebuilder.TUsBeforeFirstBuild; ++I) {
 ThreadPool.runAsync("background-worker-" + llvm::Twine(I + 1), [this] {

sammccall wrote:
> Hmm, I finally stumbled across this today when editing the rebuild policy.
> I do wish this hadn't been changed without understanding what this code is 
> doing or getting review from an owner.
> 
> After this change, modifying the background-index rebuild frequency (which 
> was initially tuned to be roughly "after each thread built one file") has the 
> side-effect of changing the number of threads used for background indexing!
> 
> Less seriously, the use of zero to mean "all the threads" is problematic here 
> because in the other thread roots in this project we use zero to mean "no 
> threads, work synchronously".
> 
> I'll look into reverting the clangd parts of this patch.
Sorry about that Sam. Do you think 'one' could be used in clangd instead? That 
is the common value used across other parts of LLVM to signify 'no threads', 
and also when `LLVM_ENABLE_THREADS` is off. 'zero' means to use the default 
settings for the thread strategy. That is, `llvm::hardware_concurrency(0)` 
means to use all hardware threads; or 
`llvm::heavyweight_hardware_concurrency(0)` means to use all hardware cores, 
but only one `std::thread` per core.
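
For readers unfamiliar with the convention described above, a brief sketch of
how these strategies are meant to be consumed after this patch (illustrative
usage only, not normative documentation):

  #include "llvm/Support/ThreadPool.h"
  #include "llvm/Support/Threading.h"

  void strategyExamples() {
    // 0 (the default) means "use the strategy's default": all hardware threads.
    llvm::ThreadPool AllThreads(llvm::hardware_concurrency(0));
    // One std::thread per physical core, hyper-threads excluded.
    llvm::ThreadPool OnePerCore(llvm::heavyweight_hardware_concurrency(0));
    // An explicit request, e.g. coming from a --threads=N option.
    llvm::ThreadPool Four(llvm::hardware_concurrency(4));
  }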


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-06-23 Thread Sam McCall via Phabricator via cfe-commits
sammccall added inline comments.
Herald added subscribers: msifontes, jurahul, Kayjukh, stephenneuendorffer, 
aartbik.
Herald added a project: MLIR.



Comment at: clang-tools-extra/clangd/index/Background.cpp:154
   assert(this->IndexStorageFactory && "Storage factory can not be null!");
-  for (unsigned I = 0; I < ThreadPoolSize; ++I) {
+  for (unsigned I = 0; I < Rebuilder.TUsBeforeFirstBuild; ++I) {
 ThreadPool.runAsync("background-worker-" + llvm::Twine(I + 1), [this] {

Hmm, I finally stumbled across this today when editing the rebuild policy.
I do wish this hadn't been changed without understanding what this code is 
doing or getting review from an owner.

After this change, modifying the background-index rebuild frequency (which was 
initially tuned to be roughly "after each thread built one file") has the 
side-effect of changing the number of threads used for background indexing!

Less seriously, the use of zero to mean "all the threads" is problematic here 
because in the other thread roots in this project we use zero to mean "no 
threads, work synchronously".

I'll look into reverting the clangd parts of this patch.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-04-15 Thread John Brawn via Phabricator via cfe-commits
john.brawn added a comment.
Herald added subscribers: frgossen, grosul1.
Herald added a reviewer: MaskRay.

With this patch, the Threading.PhysicalConcurrency unit test fails when run with 
an affinity mask covering fewer CPUs than the number of physical cores. I've 
raised https://bugs.llvm.org/show_bug.cgi?id=45556.
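
For context, a small Linux-only sketch of the mismatch the test trips over: an
affinity mask can shrink the number of usable CPUs below the machine's
physical-core count (illustrative only, not the test itself):

  #include <cstdio>
  #include <sched.h>
  #include <unistd.h>

  int main() {
    cpu_set_t Set;
    CPU_ZERO(&Set);
    if (sched_getaffinity(0, sizeof(Set), &Set) == 0) {
      // CPUs this process is actually allowed to run on (honors taskset).
      int Allowed = CPU_COUNT(&Set);
      // CPUs the machine reports as online, ignoring the affinity mask.
      long Online = sysconf(_SC_NPROCESSORS_ONLN);
      std::printf("allowed=%d online=%ld\n", Allowed, Online);
      // A physical-core count derived from /proc/cpuinfo behaves like
      // 'Online': under a restrictive mask it can exceed 'Allowed', so an
      // assertion of the form "physical <= allowed" will fail.
    }
    return 0;
  }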


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-02-28 Thread Mehdi AMINI via Phabricator via cfe-commits
mehdi_amini added inline comments.



Comment at: llvm/lib/Support/Threading.cpp:94
+  // uselessly induce more context-switching and cache eviction.
+  if (!ThreadsRequested || ThreadsRequested > (unsigned)MaxThreadCount)
+return MaxThreadCount;

aganea wrote:
> @mehdi_amini You mean this? Testing was showing degraded performance if 
> ThreadsRequested > MaxThreadCount, so I thought it'd maybe be better to 
> prevent that situation. More software threads than hardware threads means 
> you'd pay for context switching, cache eviction, and extra memory pressure 
> (even more if the allocator has per-thread pools).
> Do you see a case where not having this check would be valuable? Providing 
> `--threads=50` and creating a 50-thread ThreadPool when your CPU only 
> supports 12 hardware threads? The only case I can think of is using async 
> operations in threads, but then your ThreadPool sounds more like a queue, and 
> maybe it's not the right hammer for the nail. Should we support that case and 
> explicitly tag the ThreadPool constructor in client code with something like 
> `ThreadPool(50, AcknowledgeDegradedPerformance)`?
>  Testing was showing degraded performance

I don't understand why this is relevant to anything other than the default. This 
is a library, and if the caller overrides the default then IMO this layer should 
just let it be!

The current issue with Bulldozer is a perfect example of this: the user is 
*requesting* more threads, so why wouldn't you honor that request? If they want 
to shoot themselves in the foot, I'd let them (in most cases the user knows 
something you don't, like here).

I don't think we need the extra `AcknowledgeDegradedPerformance` flag: if the 
client overrides the default, they can be assumed to know what they're doing (or 
they shouldn't be using such an API in the first place). This seems more in line 
with the rest of the system (and with C++ APIs in general, I believe).
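
To make the two positions concrete, a hedged sketch of the thread-count
computation (names follow the patch, but this is a simplification, not the
actual code): the patch clamps every request, while the suggestion is to clamp
only the default.

  // Value used when the pool spawns its std::threads.
  unsigned computeThreadCount(unsigned ThreadsRequested,
                              unsigned MaxThreadCount) {
    // Patch behavior: 0 means "use the default", and explicit requests are
    // capped at the detected hardware limit.
    if (!ThreadsRequested || ThreadsRequested > MaxThreadCount)
      return MaxThreadCount;
    return ThreadsRequested;
    // Suggested alternative: derive only the default from the hardware and
    // honor any explicit request verbatim, oversubscription included:
    //   return ThreadsRequested ? ThreadsRequested : MaxThreadCount;
  }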


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-02-27 Thread Alexandre Ganea via Phabricator via cfe-commits
aganea marked an inline comment as done.
aganea added inline comments.



Comment at: llvm/lib/Support/Threading.cpp:94
+  // uselessly induce more context-switching and cache eviction.
+  if (!ThreadsRequested || ThreadsRequested > (unsigned)MaxThreadCount)
+return MaxThreadCount;

@mehdi_amini You mean this? Testing was showing degraded performance if 
ThreadsRequested > MaxThreadCount, so I thought it'd maybe be better to prevent 
that situation. More software threads than hardware threads means you'd pay for 
context switching, cache eviction, and extra memory pressure (even more if the 
allocator has per-thread pools).
Do you see a case where not having this check would be valuable? Providing 
`--threads=50` and creating a 50-thread ThreadPool when your CPU only supports 
12 hardware threads? The only case I can think of is using async operations in 
threads, but then your ThreadPool sounds more like a queue, and maybe it's not 
the right hammer for the nail. Should we support that case and explicitly tag 
the ThreadPool constructor in client code with something like 
`ThreadPool(50, AcknowledgeDegradedPerformance)`?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-02-26 Thread Robert Richmond via Phabricator via cfe-commits
RobRich999 added a comment.

I too can see how SMT might not afford much performance difference for LTO 
codegen.

CMT appears to be more significant. I do not have exact numbers right now as my 
build box is busy, but the change added roughly an hour to locally building the 
Chromium browser with ThinLTO optimizations enabled, on Win10 running on a 
32-core Opteron Piledriver (bdver2) system.

I definitely agree that something like `/opt:lldltojobs=all` in a separate patch 
would be a good solution, if possible, for this particular (corner) case.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-02-26 Thread Mehdi AMINI via Phabricator via cfe-commits
mehdi_amini added a comment.

There is something puzzling to me in the way the ThreadPool was modified by this 
patch: the client of the thread pool can no longer request a number of threads, 
only a *maximum* number of threads. I don't really understand why. Using the 
OS-reported concurrency was always just a default; isn't this change orthogonal 
to, and unrelated to, the stated goal of this patch?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-02-26 Thread Alexandre Ganea via Phabricator via cfe-commits
aganea added a comment.

@thakis I think this is a side-effect of implementing 
`computeHostNumPhysicalCores()` for Windows, which previously returned -1, 
which in turn made `llvm::heavyweight_hardware_concurrency()` previously behave 
like `llvm::hardware_concurrency()` (on Windows only).
At first sight, Linux seems to report the same thing as Windows (2 cores per 
"module"). If I read this correctly, the behavior on Linux for an AMD Opteron 
is/was the same as it currently is on Windows after this patch, that is, only 
one core out of two would be used for `llvm::heavyweight_hardware_concurrency()`.

I think this is a wider problem, outside the scope of this patch: some users 
want to use 100% of the cores/hyper-threads when using ThinLTO, and currently 
there's no way to do that. An algorithm using 
`llvm::heavyweight_hardware_concurrency()` explicitly states "I only want one 
core out of two". It'd be nice if we had a command-line flag to override this 
and say "I want to use all hyper-threads/cores/modules", e.g. 
`/opt:lldltojobs=all`? The impact is small on a modern Intel Skylake CPU, but 
the gain could be larger depending on the CPU (AMD Bulldozer).
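
As a rough illustration of that idea (hypothetical option handling, not LLD's
actual code), such a flag could map `all` onto a strategy that keeps the
hyper-threads instead of the heavyweight default:

  #include "llvm/ADT/StringRef.h"
  #include "llvm/Support/Threading.h"

  // Hypothetical parser for /opt:lldltojobs=<N|all>.
  llvm::ThreadPoolStrategy parseLTOJobs(llvm::StringRef Arg) {
    if (Arg == "all")
      return llvm::hardware_concurrency(0);             // every hardware thread
    unsigned N = 0;
    if (!Arg.getAsInteger(10, N) && N > 0)
      return llvm::heavyweight_hardware_concurrency(N); // explicit job count
    return llvm::heavyweight_hardware_concurrency(0);   // default: one per core
  }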

F11420112: image.png 


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-02-25 Thread Nico Weber via Phabricator via cfe-commits
thakis added a comment.

https://bugs.chromium.org/p/chromium/issues/detail?id=1051578#c12 :

"""
FYI for those building on AMD Bulldozer family of processors and its various 
iterations after this commit:

https://reviews.llvm.org/D71775

Building with ThinLTO on Bulldozer and similar appears to now be capped to how 
Windows reports cores versus logical processors, thus now halving the number of 
LTO threads available when building. Manually setting /opt:lldltojobs= for LLD 
does not override it, as that only sets an upper limit.

Found out as I locally build on a 32-core Opteron system. Windows treats it as 
16 cores and 32 logical processors, but it is not an SMT setup like Intel 
Hyper-Threading. In particular:

"A module consists of a coupling of two "conventional" x86 out of order 
processing cores. The processing core shares the early pipeline stages (e.g. 
L1i, fetch, decode), the FPUs, and the L2 cache with the rest of the module."

https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)

Naturally, build times have increased dramatically. YMMV.
"""

Sounds like this patch might have some drawbacks.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-02-14 Thread Alexandre Ganea via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rG8404aeb56a73: [Support] On Windows, ensure 
hardware_concurrency() extends to all CPU sockets… (authored by aganea).

Changed prior to commit:
  https://reviews.llvm.org/D71775?vs=237197&id=244664#toc

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775

Files:
  clang-tools-extra/clang-doc/tool/ClangDocMain.cpp
  clang-tools-extra/clangd/TUScheduler.cpp
  clang-tools-extra/clangd/index/Background.cpp
  clang-tools-extra/clangd/index/Background.h
  clang-tools-extra/clangd/index/BackgroundRebuild.h
  clang/lib/Tooling/AllTUsExecution.cpp
  clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
  clang/tools/clang-scan-deps/ClangScanDeps.cpp
  lld/ELF/SyntheticSections.cpp
  llvm/include/llvm/LTO/LTO.h
  llvm/include/llvm/Support/ThreadPool.h
  llvm/include/llvm/Support/Threading.h
  llvm/lib/CodeGen/ParallelCG.cpp
  llvm/lib/DWARFLinker/DWARFLinker.cpp
  llvm/lib/DebugInfo/GSYM/DwarfTransformer.cpp
  llvm/lib/ExecutionEngine/Orc/LLJIT.cpp
  llvm/lib/LTO/LTO.cpp
  llvm/lib/LTO/LTOBackend.cpp
  llvm/lib/LTO/ThinLTOCodeGenerator.cpp
  llvm/lib/Support/Host.cpp
  llvm/lib/Support/Parallel.cpp
  llvm/lib/Support/ThreadPool.cpp
  llvm/lib/Support/Threading.cpp
  llvm/lib/Support/Unix/Threading.inc
  llvm/lib/Support/Windows/Threading.inc
  llvm/tools/dsymutil/dsymutil.cpp
  llvm/tools/gold/gold-plugin.cpp
  llvm/tools/llvm-cov/CodeCoverage.cpp
  llvm/tools/llvm-cov/CoverageExporterJson.cpp
  llvm/tools/llvm-cov/CoverageReport.cpp
  llvm/tools/llvm-lto2/llvm-lto2.cpp
  llvm/tools/llvm-profdata/llvm-profdata.cpp
  llvm/unittests/Support/Host.cpp
  llvm/unittests/Support/TaskQueueTest.cpp
  llvm/unittests/Support/ThreadPool.cpp
  llvm/unittests/Support/Threading.cpp
  mlir/lib/Pass/Pass.cpp

Index: mlir/lib/Pass/Pass.cpp
===
--- mlir/lib/Pass/Pass.cpp
+++ mlir/lib/Pass/Pass.cpp
@@ -411,7 +411,8 @@
   // Create the async executors if they haven't been created, or if the main
   // pipeline has changed.
   if (asyncExecutors.empty() || hasSizeMismatch(asyncExecutors.front(), mgrs))
-asyncExecutors.assign(llvm::hardware_concurrency(), mgrs);
+asyncExecutors.assign(llvm::hardware_concurrency().compute_thread_count(),
+  mgrs);
 
   // Run a prepass over the module to collect the operations to execute over.
   // This ensures that an analysis manager exists for each operation, as well as
Index: llvm/unittests/Support/Threading.cpp
===
--- llvm/unittests/Support/Threading.cpp
+++ llvm/unittests/Support/Threading.cpp
@@ -21,7 +21,8 @@
   auto Num = heavyweight_hardware_concurrency();
   // Since Num is unsigned this will also catch us trying to
   // return -1.
-  ASSERT_LE(Num, thread::hardware_concurrency());
+  ASSERT_LE(Num.compute_thread_count(),
+hardware_concurrency().compute_thread_count());
 }
 
 #if LLVM_ENABLE_THREADS
Index: llvm/unittests/Support/ThreadPool.cpp
===
--- llvm/unittests/Support/ThreadPool.cpp
+++ llvm/unittests/Support/ThreadPool.cpp
@@ -8,11 +8,13 @@
 
 #include "llvm/Support/ThreadPool.h"
 
+#include "llvm/ADT/DenseSet.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Triple.h"
 #include "llvm/Support/Host.h"
 #include "llvm/Support/TargetSelect.h"
+#include "llvm/Support/Threading.h"
 
 #include "gtest/gtest.h"
 
@@ -69,6 +71,8 @@
 
   void SetUp() override { MainThreadReady = false; }
 
+  void TestAllThreads(ThreadPoolStrategy S);
+
   std::condition_variable WaitMainThread;
   std::mutex WaitMainThreadMutex;
   bool MainThreadReady = false;
@@ -131,7 +135,7 @@
 
 TEST_F(ThreadPoolTest, GetFuture) {
   CHECK_UNSUPPORTED();
-  ThreadPool Pool{2};
+  ThreadPool Pool(hardware_concurrency(2));
   std::atomic_int i{0};
 Pool.async([this, &i] {
 waitForMainThread();
@@ -162,3 +166,45 @@
   }
   ASSERT_EQ(5, checked_in);
 }
+
+#if LLVM_ENABLE_THREADS == 1
+
+void ThreadPoolTest::TestAllThreads(ThreadPoolStrategy S) {
+  // FIXME: Skip these tests on non-Windows because multi-socket system were not
+  // tested on Unix yet, and llvm::get_thread_affinity_mask() isn't implemented
+  // for Unix.
+  Triple Host(Triple::normalize(sys::getProcessTriple()));
+  if (!Host.isOSWindows())
+return;
+
+  llvm::DenseSet<llvm::BitVector> ThreadsUsed;
+  std::mutex Lock;
+  unsigned Threads = 0;
+  {
+ThreadPool Pool(S);
+Threads = Pool.getThreadCount();
+for (size_t I = 0; I < 1; ++I) {
+  Pool.async([&] {
+waitForMainThread();
+std::lock_guard<std::mutex> Guard(Lock);
+auto Mask = llvm::get_thread_affinity_mask();
+ThreadsUsed.insert(Mask);
+  });
+}
+ASSERT_EQ(true, ThreadsUsed.empty());
+

[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-02-13 Thread River Riddle via Phabricator via cfe-commits
rriddle accepted this revision.
rriddle added a comment.
This revision is now accepted and ready to land.

(Appeasing herald for MLIR changes).


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-02-11 Thread Reid Kleckner via Phabricator via cfe-commits
rnk accepted this revision.
rnk added a comment.
This revision is now accepted and ready to land.
Herald added a reviewer: rriddle.
Herald added a subscriber: Joonsoo.
This revision now requires review to proceed.

lgtm




Comment at: llvm/lib/Support/Unix/Threading.inc:273
+
+int computeHostNumPhysicalThreads() {
+#if defined(HAVE_SCHED_GETAFFINITY) && defined(HAVE_CPU_COUNT)

aganea wrote:
> rnk wrote:
> > I'm not sure it makes sense to say "physical threads". I think "physical 
> > cores" was meant to distinguish between hardware threads and cores. 
> > Describing hardware execution resources as physical or non-physical isn't 
> > very precise or meaningful in the first place, but I don't think it should 
> > apply to hyper threads.
> `computeHostNumHardwareThreads()` ?
lgtm


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-01-16 Thread Alexandre Ganea via Phabricator via cfe-commits
aganea added a comment.
Herald added a reviewer: nicolasvasilache.
Herald added a subscriber: liufengdb.

Ping! Any further comments?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-01-09 Thread Alexandre Ganea via Phabricator via cfe-commits
aganea added inline comments.



Comment at: llvm/lib/Support/Unix/Threading.inc:273
+
+int computeHostNumPhysicalThreads() {
+#if defined(HAVE_SCHED_GETAFFINITY) && defined(HAVE_CPU_COUNT)

rnk wrote:
> I'm not sure it makes sense to say "physical threads". I think "physical 
> cores" was meant to distinguish between hardware threads and cores. 
> Describing hardware execution resources as physical or non-physical isn't 
> very precise or meaningful in the first place, but I don't think it should 
> apply to hyper threads.
`computeHostNumHardwareThreads()` ?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2020-01-09 Thread Alexandre Ganea via Phabricator via cfe-commits
aganea updated this revision to Diff 237197.
aganea marked 13 inline comments as done.
aganea added a comment.
Herald added subscribers: lucyrfox, mgester, arpith-jacob, nicolasvasilache, 
antiagainst, shauheen, burmako, jpienaar, rriddle.

Updated as suggested by @rnk.
I've also removed `ThreadPoolStrategy::PinThreads` because it wasn't 
conclusive. In the best case, pinning threads to a core / hyper-thread was 
similar in performance to using a full CPU-socket affinity; in the worst case, 
it degraded performance. The NT scheduler seems to be doing a pretty good job 
here, so we'll stick with it :-)


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775

Files:
  clang-tools-extra/clang-doc/tool/ClangDocMain.cpp
  clang-tools-extra/clangd/TUScheduler.cpp
  clang-tools-extra/clangd/index/Background.cpp
  clang-tools-extra/clangd/index/Background.h
  clang-tools-extra/clangd/index/BackgroundRebuild.h
  clang/lib/Tooling/AllTUsExecution.cpp
  clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
  clang/tools/clang-scan-deps/ClangScanDeps.cpp
  lld/ELF/SyntheticSections.cpp
  llvm/include/llvm/ADT/BitVector.h
  llvm/include/llvm/ADT/SmallBitVector.h
  llvm/include/llvm/LTO/LTO.h
  llvm/include/llvm/Support/ThreadPool.h
  llvm/include/llvm/Support/Threading.h
  llvm/lib/CodeGen/ParallelCG.cpp
  llvm/lib/ExecutionEngine/Orc/LLJIT.cpp
  llvm/lib/LTO/LTO.cpp
  llvm/lib/LTO/LTOBackend.cpp
  llvm/lib/LTO/ThinLTOCodeGenerator.cpp
  llvm/lib/Support/Host.cpp
  llvm/lib/Support/Parallel.cpp
  llvm/lib/Support/ThreadPool.cpp
  llvm/lib/Support/Threading.cpp
  llvm/lib/Support/Unix/Threading.inc
  llvm/lib/Support/Windows/Threading.inc
  llvm/tools/dsymutil/DwarfLinkerForBinary.cpp
  llvm/tools/dsymutil/dsymutil.cpp
  llvm/tools/gold/gold-plugin.cpp
  llvm/tools/llvm-cov/CodeCoverage.cpp
  llvm/tools/llvm-cov/CoverageExporterJson.cpp
  llvm/tools/llvm-cov/CoverageReport.cpp
  llvm/tools/llvm-lto2/llvm-lto2.cpp
  llvm/tools/llvm-profdata/llvm-profdata.cpp
  llvm/unittests/Support/Host.cpp
  llvm/unittests/Support/TaskQueueTest.cpp
  llvm/unittests/Support/ThreadPool.cpp
  llvm/unittests/Support/Threading.cpp
  mlir/lib/Pass/Pass.cpp

Index: mlir/lib/Pass/Pass.cpp
===
--- mlir/lib/Pass/Pass.cpp
+++ mlir/lib/Pass/Pass.cpp
@@ -411,7 +411,8 @@
   // Create the async executors if they haven't been created, or if the main
   // pipeline has changed.
   if (asyncExecutors.empty() || hasSizeMismatch(asyncExecutors.front(), mgrs))
-asyncExecutors.assign(llvm::hardware_concurrency(), mgrs);
+asyncExecutors.assign(llvm::hardware_concurrency().compute_thread_count(),
+  mgrs);
 
   // Run a prepass over the module to collect the operations to execute over.
   // This ensures that an analysis manager exists for each operation, as well as
Index: llvm/unittests/Support/Threading.cpp
===
--- llvm/unittests/Support/Threading.cpp
+++ llvm/unittests/Support/Threading.cpp
@@ -21,7 +21,8 @@
   auto Num = heavyweight_hardware_concurrency();
   // Since Num is unsigned this will also catch us trying to
   // return -1.
-  ASSERT_LE(Num, thread::hardware_concurrency());
+  ASSERT_LE(Num.compute_thread_count(),
+hardware_concurrency().compute_thread_count());
 }
 
 #if LLVM_ENABLE_THREADS
Index: llvm/unittests/Support/ThreadPool.cpp
===
--- llvm/unittests/Support/ThreadPool.cpp
+++ llvm/unittests/Support/ThreadPool.cpp
@@ -8,11 +8,13 @@
 
 #include "llvm/Support/ThreadPool.h"
 
+#include "llvm/ADT/DenseSet.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Triple.h"
 #include "llvm/Support/Host.h"
 #include "llvm/Support/TargetSelect.h"
+#include "llvm/Support/Threading.h"
 
 #include "gtest/gtest.h"
 
@@ -69,6 +71,8 @@
 
   void SetUp() override { MainThreadReady = false; }
 
+  void TestAllThreads(ThreadPoolStrategy S);
+
   std::condition_variable WaitMainThread;
   std::mutex WaitMainThreadMutex;
   bool MainThreadReady = false;
@@ -131,7 +135,7 @@
 
 TEST_F(ThreadPoolTest, GetFuture) {
   CHECK_UNSUPPORTED();
-  ThreadPool Pool{2};
+  ThreadPool Pool(hardware_concurrency(2));
   std::atomic_int i{0};
 Pool.async([this, &i] {
 waitForMainThread();
@@ -162,3 +166,45 @@
   }
   ASSERT_EQ(5, checked_in);
 }
+
+#if LLVM_ENABLE_THREADS == 1
+
+void ThreadPoolTest::TestAllThreads(ThreadPoolStrategy S) {
+  // FIXME: Skip these tests on non-Windows because multi-socket system were not
+  // tested on Unix yet, and llvm::get_thread_affinity_mask() isn't implemented
+  // for Unix.
+  Triple Host(Triple::normalize(sys::getProcessTriple()));
+  if (!Host.isOSWindows())
+return;
+
+  llvm::DenseSet<llvm::BitVector> ThreadsUsed;
+  std::mutex Lock;
+  unsigned Threads = 0;
+  {
+ThreadPool Pool(S);
+Threads 

[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2019-12-30 Thread Reid Kleckner via Phabricator via cfe-commits
rnk added subscribers: inglorion, gbiv.
rnk added a comment.
Herald added a subscriber: george.burgess.iv.

+@gbiv @inglorion, other ThinLTO users.

> To solve this situation, we capture (and retain) the initial intention until 
> the point of usage, through a new ThreadPoolStrategy class. The number of 
> threads to use is deferred as late as possible, until the moment where the 
> std::threads are created (ThreadPool in the case of ThinLTO).

That seems reasonable to me.




Comment at: llvm/include/llvm/ADT/SmallBitVector.h:487
+
+  int compare(const SmallBitVector &RHS) const {
+for (unsigned I = 0; I < std::max(size(), RHS.size()); ++I) {

I guess we don't need these changes, since it looks like you use BitVector only 
below.



Comment at: llvm/include/llvm/ADT/SmallBitVector.h:500
+
+  bool operator<(const SmallBitVector &RHS) const { return compare(RHS) == -1; }
+

You return `A - B` but then compare for equality to -1 and 1. I guess it works 
because you are doing it bit by bit, but it's exciting.
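
For clarity, the per-bit subtraction only ever yields -1, 0 or 1 because each
operand is a single bit; a sketch of a more explicit three-way compare
(illustrative, not the committed code):

  #include "llvm/ADT/SmallBitVector.h"

  // Returns -1, 0 or 1; missing high bits are treated as zero.
  int compareBitVectors(const llvm::SmallBitVector &A,
                        const llvm::SmallBitVector &B) {
    unsigned ASize = A.size(), BSize = B.size();
    unsigned N = ASize > BSize ? ASize : BSize;
    for (unsigned I = 0; I < N; ++I) {
      bool ABit = I < ASize && A[I];
      bool BBit = I < BSize && B[I];
      if (ABit != BBit)
        return ABit ? 1 : -1;
    }
    return 0;
  }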



Comment at: llvm/include/llvm/Support/Threading.h:153
+// runtime might use a lower value (not higher).
+unsigned ThreadCount = 0;
+

Let's name the fields in a way that indicates that these numbers are the 
requested number of threads, not the final number. So, `ThreadsRequested` or 
something like that.



Comment at: llvm/include/llvm/Support/Threading.h:158
+// specific core).
+bool HyperThreads = true;
+

This could be `UseHyperThreads`. The first time I read it, I guessed that it 
indicated if the system has hyperthreads.



Comment at: llvm/lib/Support/Host.cpp:1277
  << "/proc/cpuinfo: " << EC.message() << "\n";
 return -1;
   }

Another -1 case.



Comment at: llvm/lib/Support/Host.cpp:1334
 // On other systems, return -1 to indicate unknown.
 static int computeHostNumPhysicalCores() { return -1; }
 #endif

Note: the -1 case.



Comment at: llvm/lib/Support/Threading.cpp:85-86
 
-#include <thread>
-unsigned llvm::heavyweight_hardware_concurrency() {
-  // Since we can't get here unless LLVM_ENABLE_THREADS == 1, it is safe to use
-  // `std::thread` directly instead of `llvm::thread` (and indeed, doing so
-  // allows us to not define `thread` in the llvm namespace, which conflicts
-  // with some platforms such as FreeBSD whose headers also define a struct
-  // called `thread` in the global namespace which can cause ambiguity due to
-  // ADL.
-  int NumPhysical = sys::getHostNumPhysicalCores();
-  if (NumPhysical == -1)
-return std::thread::hardware_concurrency();
-  return NumPhysical;
-}
+int computeHostNumPhysicalCores();
+int computeHostNumPhysicalThreads();
 

We already have a public API, getHostNumPhysicalCores. Can this use that?



Comment at: llvm/lib/Support/Threading.cpp:89
+unsigned llvm::ThreadPoolStrategy::compute_thread_count() const {
+  unsigned MaxThreadCount = HyperThreads ? computeHostNumPhysicalThreads()
+ : computeHostNumPhysicalCores();

I see these `computeHostNum*` methods return int, and some versions seem to 
return -1 to indicate errors. I think you'll want to use `int` here and check 
if `MaxThreadCount <= 0`. Otherwise below we may do a signed/unsigned 
comparison mismatch, and return ~0U for MaxThreadCount.
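
A hedged sketch of the guard being suggested, keeping the count signed until the
error case is handled (the computeHostNum* helpers are the patch's own; this is
not the committed code):

  // Declared in the patch; both may return -1 when the count is unknown.
  int computeHostNumHardwareThreads();
  int computeHostNumPhysicalCores();

  unsigned computeThreadCount(bool UseHyperThreads, unsigned ThreadsRequested) {
    int MaxThreadCount = UseHyperThreads ? computeHostNumHardwareThreads()
                                         : computeHostNumPhysicalCores();
    // Handle -1 / 0 while the value is still signed, so the comparison below
    // never mixes signed and unsigned and never returns ~0U.
    if (MaxThreadCount <= 0)
      MaxThreadCount = 1;
    if (!ThreadsRequested || ThreadsRequested > unsigned(MaxThreadCount))
      return unsigned(MaxThreadCount);
    return ThreadsRequested;
  }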



Comment at: llvm/lib/Support/Unix/Threading.inc:273
+
+int computeHostNumPhysicalThreads() {
+#if defined(HAVE_SCHED_GETAFFINITY) && defined(HAVE_CPU_COUNT)

I'm not sure it makes sense to say "physical threads". I think "physical cores" 
was meant to distinguish between hardware threads and cores. Describing 
hardware execution resources as physical or non-physical isn't very precise or 
meaningful in the first place, but I don't think it should apply to hyper 
threads.



Comment at: llvm/lib/Support/Windows/Threading.inc:160
+
+static ArrayRef<ProcessorGroup> computeProcessorGroups() {
+  auto computeGroups = []() {

This is cached, so maybe `getProcessorGroups` to indicate that it is not 
intended to be expensive.



Comment at: llvm/lib/Support/Windows/Threading.inc:212
+template <typename R, typename UnaryPredicate>
+unsigned aggregate(R &&Range, UnaryPredicate P) {
+  unsigned I{};

Seems like `static` can be used here.



Comment at: llvm/unittests/Support/ThreadPool.cpp:31
 protected:
   // This is intended for platform as a temporary "XFAIL"
   bool isUnsupportedOSOrEnvironment() {

"Temporary" >_>



Comment at: llvm/unittests/Support/ThreadPool.cpp:180
+
+  //std::set ThreadsUsed;
+  llvm::DenseSet ThreadsUsed;

I guess this is why you added comparison operators. In any case, let's remove 
the commented out code in the 

[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2019-12-21 Thread Alexandre Ganea via Phabricator via cfe-commits
aganea added a comment.

In D71775#1793767, @mehdi_amini wrote:

> > Will it make sense to say "I don't want hyper-threads" ?
>
> Not sure I remember correctly, but I believe one motivation behind avoiding 
> "hyper-threads" and other virtual cores was that while they improve slightly 
> the performance, they also increase the peak memory requirements: using 
> heavyweight_hardware_concurrency() seemed like a good default tradeoff for 
> most end-users.


It all makes sense. After this patch, memory consumption doubles when using 
both CPU sockets. There's then also a discussion to be had about memory 
bandwidth, which in my case doesn't scale when using both sockets (an AMD Epyc 
could possibly do better because it has more memory channels). This is also why 
enabling the second socket only marginally decreases the timings.

F5603: 6140_two_sockets.PNG 

In Ubisoft's case, time is immensely more valuable (both compute and human) 
than memory sticks. Historically we didn't really use LTO on game productions 
because it was really slow and often introduced undesirable bugs or 
side-effects. The graphs in D71786 are for Rainbow 6: Siege, which is a 
"smaller" codebase. For larger games, LTO link time is more in the range of 
1h 20min, both for MSVC and for previous versions of Clang. If there's an 
LTO-specific bug in a final build, it is very hard to iterate with that kind of 
timings. In addition, there are hundreds of builds every day on the build 
system, and we want to keep all of its cores busy. This is why both build and 
link times are important to us.

In D71775#1793768, @mehdi_amini wrote:

> Also: using heavyweight_hardware_concurrency() in the linker but having 
> multiple linker jobs schedules by the build system was another reason (I 
> think LLVM CMake default to 2 parallel link jobs when using ThinLTO for 
> instance).


Understood. If one sets the CPU affinity when starting the application, i.e. 
`start /affinity XXX lld-link.exe ...`, then this patch disables dispatching to 
other "processor groups", even if they are available. However, there doesn't 
seem to be a way to //restrain// the application to a single "processor group".


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2019-12-20 Thread Mehdi AMINI via Phabricator via cfe-commits
mehdi_amini added a comment.

> Will it make sense to say "I don't want hyper-threads" ?

Not sure I remember correctly, but I believe one motivation behind avoiding 
"hyper-threads" and other virtual cores was that, while they slightly improve 
performance, they also increase the peak memory requirements: using 
heavyweight_hardware_concurrency() seemed like a good default tradeoff for most 
end-users.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2019-12-20 Thread Mehdi AMINI via Phabricator via cfe-commits
mehdi_amini added a comment.

Also: using heavyweight_hardware_concurrency() in the linker while having 
multiple linker jobs scheduled by the build system was another reason (I think 
LLVM's CMake defaults to 2 parallel link jobs when using ThinLTO, for instance).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2019-12-20 Thread pre-merge checks [bot] via Phabricator via cfe-commits
merge_guards_bot added a comment.

{icon check-circle color=green} Unit tests: pass. 61057 tests passed, 0 failed 
and 728 were skipped.

{icon times-circle color=red} clang-tidy: fail. Please fix clang-tidy findings.

{icon times-circle color=red} clang-format: fail. Please format your changes 
with clang-format by running `git-clang-format HEAD^` or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, 
console-log.txt, test-results.xml



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71775/new/

https://reviews.llvm.org/D71775





[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

2019-12-20 Thread Alexandre Ganea via Phabricator via cfe-commits
aganea created this revision.
aganea added reviewers: mehdi_amini, rnk, tejohnson, russell.gallop, dexonsmith.
Herald added subscribers: llvm-commits, cfe-commits, usaxena95, dang, jfb, 
kadircet, arphaman, steven_wu, jkorous, MaskRay, javed.absar, hiraditya, 
kristof.beyls, arichardson, emaste.
Herald added a reviewer: JDevlieghere.
Herald added a reviewer: espindola.
Herald added projects: clang, LLVM.

**TL;DR:** This patches ensures that, on Windows, all CPU sockets and all NUMA 
nodes are used by the `ThreadPool`. The goal was to have LLD/ThinLTO use all 
hardware threads in the system, which isn't the case currently on multi-socket 
or large CPU count systems.

(this could possibly be split into a few patches, but I just wanted an overall 
opinion)

Background
--

Windows doesn't have a flat `cpu_set_t` like Linux. Instead, it projects 
hardware CPUs (or NUMA nodes) to applications through a concept of "processor 
groups". A "processor" is the smallest unit of execution on a CPU: a 
hyper-thread if SMT is active, a core otherwise. There is a limit of 32 
processors on older 32-bit versions of Windows, later raised to 64 processors 
on 64-bit versions of Windows. This limit comes from the affinity mask, whose 
width historically matched `sizeof(void*)` in bits (and still does). 
Consequently, the concept of "processor groups" was introduced for dealing with 
systems with more than 64 hyper-threads.

By default, the Windows OS assigns only one "processor group" to each starting 
application, in a round-robin manner. If the application wants to use more 
processors, it needs to opt in programmatically, by assigning its threads to 
other "processor groups". This also means that affinity cannot cross "processor 
group" boundaries; one can only specify a "preferred" group on startup, but the 
application is free to allocate more groups if it wants to.

This creates a peculiar situation, where newer CPUs like the AMD EPYC 7702P 
(64-cores, 128-hyperthreads) are projected by the OS as two (2) "processor 
groups". This means that by default, an application can only use half of the 
cores. This situation will only get worse in the years to come, as dies with 
more cores will appear on the market.
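
For reference, the Win32 APIs below expose those groups to an application
(Windows 7 and later); this sketch only enumerates them and sets no affinity:

  #include <windows.h>
  #include <cstdio>

  int main() {
    WORD GroupCount = GetActiveProcessorGroupCount();
    DWORD Total = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
    std::printf("%u processor group(s), %lu logical processors total\n",
                (unsigned)GroupCount, (unsigned long)Total);
    for (WORD G = 0; G < GroupCount; ++G)
      std::printf("  group %u: %lu processors\n", (unsigned)G,
                  (unsigned long)GetActiveProcessorCount(G));
    // A thread runs in only one group at a time; spreading work across groups
    // requires explicitly assigning threads to the other groups.
    return 0;
  }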

The changes in this patch
-

Previously, the problem was that the `heavyweight_hardware_concurrency()` API 
was introduced so that only one hardware thread per core would be used. Once 
that API returns, //that original intention is lost//. Consider a situation, on 
Windows, where the system has 2 CPU sockets, 18 cores each, each core having 2 
hyper-threads, for a total of 72 hyper-threads. Both 
`heavyweight_hardware_concurrency()` and `hardware_concurrency()` currently 
return 36, because on Windows they are simply wrappers over 
`std::thread::hardware_concurrency()`, which returns only processors from the 
current "processor group".

What if we wanted to use all "processor groups"? Even if we properly 
implemented `heavyweight_hardware_concurrency()` to return 18, what should it 
then return? 18 or 36?
What if the user specified `/opt:lldltojobs=36`? Should we assign 36 threads on 
the current "processor group", or should we dispatch extra threads to the 
second "processor group"?

To solve this situation, we capture (and retain) the initial intention until 
the point of usage, through a new `ThreadPoolStrategy` class. The number of 
threads to use is deferred as late as possible, until the moment where the 
`std::thread`s are created (`ThreadPool` in the case of ThinLTO).
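
Concretely, the intention travels with the strategy object and is only turned
into a thread count at the point where the threads are created; a condensed
illustration of the pattern (not the full implementation):

  #include "llvm/Support/ThreadPool.h"
  #include "llvm/Support/Threading.h"

  void createLTOPool(unsigned JobsRequested /* e.g. from /opt:lldltojobs= */) {
    // Capture the intention: "one thread per physical core, at most
    // JobsRequested of them". No concrete number is computed yet.
    llvm::ThreadPoolStrategy S =
        llvm::heavyweight_hardware_concurrency(JobsRequested);
    // The count is resolved only here, when the std::threads are spawned,
    // which is where processor groups / NUMA placement can be considered.
    llvm::ThreadPool Pool(S);
  }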

Discussion
--

Ideally, we should consider all "processors" (on Windows) or all "CPUs" (on 
Linux) as equal, in which case `heavyweight_hardware_concurrency()` wouldn't be 
needed. I'm not sure how micro-managing threads, cores and NUMA nodes will 
scale in the years to come (probably not well). Will it make sense to say "I 
don't want hyper-threads"? Or to say `/opt:lldltojobs=whatever` when you have a 
thousand(s)-core system? How would that work with NUMA affinity? For example, 
the Fujitsu A64FX has 4x "12-core tiles" on the same die, each tile being 
connected to an internal 8-GB HBM2 memory (located internally on the CPU die). 
How would we dispatch threads in that case? The AMD EPYC uses the same concept 
of "tiles"; it doesn't have internal memory yet, but most likely the EPYC v3 
will use the same architecture.

@tejohnson : Teresa, since you added `heavyweight_hardware_concurrency()`, do 
you have a benchmark comparing ThinLTO running with 
`heavyweight_hardware_concurrency()` versus `hardware_concurrency()`? (I 
haven't done that test yet.)
It would make things a lot simpler if we didn't have that API and in general 
considered that we can use all hardware threads in the system, and that they 
can perform equally.


Repository:
  rG LLVM Github Monorepo