Revision: 204744
Author: [email protected]
Date: 2016-08-22 16:18:09 -0700 (Mon, 22 Aug 2016)
Log Message
bmalloc: speed up the lock slow path
https://bugs.webkit.org/show_bug.cgi?id=161058
Reviewed by Filip Pizlo.
It is generally accepted practice that a lock should yield instead of
spinning when a lock acquisition fails, to avoid wasting CPU and power.
There are two problems with this generally accepted practice:
(1) It's a fallacy that yielding is free. In reality, yielding itself
consumes CPU and power -- by performing a syscall, running the OS
scheduler, and possibly performing a context switch. (Instruments
traces of MallocBench show the cost of yielding.) Therefore, spinning a
little to avoid yielding can actually *save* CPU and power.
(2) std::this_thread::yield() on Darwin is way too aggressive: It not only
yields but also depresses your priority to absolute zero for 10ms. A
recent PLT trace showed a few spots where the main thread just gave up
on loading and rendering a page for 10ms so an unimportant background
task could run.
To correct these problems, this patch adds a little bit of spinning to
the bmalloc lock slow path.
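The strategy can be sketched as a standalone lock (a hypothetical illustration of the technique described above, not the bmalloc code itself; the class name `SpinThenYieldLock` is invented, and the real implementation lives in bmalloc/StaticMutex.cpp):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical sketch of a spin-then-yield lock. Only one thread spins at
// a time, since additional spinners accomplish nothing.
class SpinThenYieldLock {
public:
    bool try_lock() { return !m_flag.test_and_set(std::memory_order_acquire); }

    void lock()
    {
        // Fast path: uncontended acquisition.
        if (try_lock())
            return;

        // Slow path: spin a bounded number of times before yielding, since
        // the longest critical section is much shorter than a trip through
        // the OS scheduler.
        static const size_t aLot = 256;
        if (!m_isSpinning.test_and_set()) {
            for (size_t i = 0; i < aLot; ++i) {
                if (try_lock()) {
                    m_isSpinning.clear();
                    return;
                }
            }
            m_isSpinning.clear();
        }

        // Contention outlasted the spin budget: fall back to yielding.
        while (!try_lock())
            std::this_thread::yield();
    }

    void unlock() { m_flag.clear(std::memory_order_release); }

private:
    std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
    std::atomic_flag m_isSpinning = ATOMIC_FLAG_INIT;
};
```

The spin budget trades a little busy-waiting for fewer syscalls; the `m_isSpinning` flag caps the number of simultaneous spinners at one.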
Below are performance results on various CPUs.
Mac Pro (12 hyperthreaded cores = 24 threads):
Baseline Patch Δ
Execution Time:
message_one 173ms 173ms
message_many 953ms 927ms ^ 1.03x faster
churn --parallel 60ms 41ms ^ 1.46x faster
list_allocate --parallel 224ms 143ms ^ 1.57x faster
tree_allocate --parallel 1,190ms 758ms ^ 1.57x faster
tree_churn --parallel 1,517ms 906ms ^ 1.67x faster
facebook --parallel 6,519ms 4,580ms ^ 1.42x faster
reddit --parallel 5,097ms 3,411ms ^ 1.49x faster
flickr --parallel 4,903ms 3,501ms ^ 1.4x faster
theverge --parallel 6,641ms 4,505ms ^ 1.47x faster
<geometric mean> 1,158ms 832ms ^ 1.39x faster
<arithmetic mean> 2,728ms 1,895ms ^ 1.44x faster
<harmonic mean> 332ms 240ms ^ 1.38x faster
MacBook Air (2 hyperthreaded cores = 4 threads):
Baseline Patch Δ
Execution Time:
message_one 911ms 907ms ^ 1.0x faster
message_many 515ms 513ms ^ 1.0x faster
churn --parallel 132ms 134ms ! 1.02x slower
list_allocate --parallel 104ms 102ms ^ 1.02x faster
tree_allocate --parallel 117ms 111ms ^ 1.05x faster
tree_churn --parallel 154ms 151ms ^ 1.02x faster
facebook --parallel 719ms 687ms ^ 1.05x faster
reddit --parallel 382ms 341ms ^ 1.12x faster
flickr --parallel 372ms 345ms ^ 1.08x faster
theverge --parallel 489ms 444ms ^ 1.1x faster
<geometric mean> 299ms 287ms ^ 1.04x faster
<arithmetic mean> 390ms 374ms ^ 1.04x faster
<harmonic mean> 227ms 220ms ^ 1.03x faster
iPad (2 cores = 2 threads):
[ Doesn't run Ruby, so no pretty subtest output. ]
Baseline Patch Δ
Execution Time: 174.14ms 171.5ms ^ 1.02x faster
* bmalloc.xcodeproj/project.pbxproj:
* bmalloc/ScopeExit.h: Added. A barebones very wimpy version of
WTF::ScopeExit.
(bmalloc::ScopeExit::ScopeExit):
(bmalloc::ScopeExit::~ScopeExit):
(bmalloc::makeScopeExit):
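The helper follows the familiar RAII scope-guard pattern: the function passed in runs when the guard is destroyed. A minimal standalone re-implementation for illustration (not the bmalloc header itself):

```cpp
#include <cassert>
#include <utility>

// Minimal scope-guard sketch: runs the given function on destruction.
template<typename ExitFunction>
class ScopeExit {
public:
    explicit ScopeExit(ExitFunction&& exitFunction)
        : m_exitFunction(std::forward<ExitFunction>(exitFunction))
    {
    }

    ~ScopeExit() { m_exitFunction(); }

private:
    ExitFunction m_exitFunction;
};

template<typename ExitFunction>
ScopeExit<ExitFunction> makeScopeExit(ExitFunction&& exitFunction)
{
    return ScopeExit<ExitFunction>(std::forward<ExitFunction>(exitFunction));
}
```

This is how lockSlowCase guarantees the spinning bit gets cleared on every exit path, including early returns.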
* bmalloc/StaticMutex.cpp:
(bmalloc::StaticMutex::lockSlowCase): Spin before yielding -- that's the
speedup. Don't spin if another CPU is already spinning. In theory, more
than one spinner accomplishes nothing, and I found that there's a cutoff
around 8 or 16 spinners that becomes performance negative on Mac Pro.
(Note: Another way to accomplish a similar result, if you don't want to
use a bit of state in the lock, is to spin for a random duration between
0 and aLot. I tested a version of WTF::WeakRandom with unsynchronized
static state and it worked great. But I ultimately opted for the explicit
bit because I thought it was clearer.)
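That randomized alternative could look roughly like this (a hypothetical sketch; a plain unsynchronized linear congruential step stands in for WTF::WeakRandom, and the function name is invented):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical sketch of the randomized-spin alternative: each contended
// acquisition spins for a random duration between 0 and aLot, so on
// average only a few threads spin for long.
inline void lockSlowCaseRandomized(std::atomic_flag& flag)
{
    static const size_t aLot = 256;
    // Unsynchronized static state on purpose: data races here just add
    // entropy, as described above.
    static unsigned state = 0x9e3779b9;
    state = state * 1664525u + 1013904223u; // Linear congruential step.
    size_t spinCount = state % aLot;

    for (size_t i = 0; i < spinCount; ++i) {
        if (!flag.test_and_set(std::memory_order_acquire))
            return;
    }

    // Spin budget exhausted: fall back to yielding.
    while (flag.test_and_set(std::memory_order_acquire))
        std::this_thread::yield();
}
```

Staggering the spin durations approximates the one-spinner-at-a-time effect without storing any extra state in the lock.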
* bmalloc/StaticMutex.h:
(bmalloc::StaticMutex::init): Initialize our new bit.
* bmalloc/ThreadSwitch.h: Added.
(bmalloc::threadSwitch): Don't call yield() on Darwin because it's too
aggressive. swtch() does what we want: Go run something else, without
any other side-effects.
Diff
Modified: trunk/Source/bmalloc/ChangeLog (204743 => 204744)
--- trunk/Source/bmalloc/ChangeLog 2016-08-22 22:44:42 UTC (rev 204743)
+++ trunk/Source/bmalloc/ChangeLog 2016-08-22 23:18:09 UTC (rev 204744)
@@ -1,3 +1,105 @@
+2016-08-22 Geoffrey Garen <[email protected]>
+
+ bmalloc: speed up the lock slow path
+ https://bugs.webkit.org/show_bug.cgi?id=161058
+
+ Reviewed by Filip Pizlo.
+
+ It is generally accepted practice that a lock should yield instead of
+ spinning when a lock acquisition fails, to avoid wasting CPU and power.
+
+ There are two problems with this generally accepted practice:
+
+ (1) It's a fallacy that yielding is free. In reality, yielding itself
+ consumes CPU and power -- by performing a syscall, running the OS
+ scheduler, and possibly performing a context switch. (Instruments
+ traces of MallocBench show the cost of yielding.) Therefore, spinning a
+ little to avoid yielding can actually *save* CPU and power.
+
+ (2) std::this_thread::yield() on Darwin is way too aggressive: It not only
+ yields but also depresses your priority to absolute zero for 10ms. A
+ recent PLT trace showed a few spots where the main thread just gave up
+ on loading and rendering a page for 10ms so an unimportant background
+ task could run.
+
+ To correct these problems, this patch adds a little bit of spinning to
+ the bmalloc lock slow path.
+
+ Below are performance results on various CPUs.
+
+ Mac Pro (12 hyperthreaded cores = 24 threads):
+
+ Baseline Patch Δ
+ Execution Time:
+ message_one 173ms 173ms
+ message_many 953ms 927ms ^ 1.03x faster
+ churn --parallel 60ms 41ms ^ 1.46x faster
+ list_allocate --parallel 224ms 143ms ^ 1.57x faster
+ tree_allocate --parallel 1,190ms 758ms ^ 1.57x faster
+ tree_churn --parallel 1,517ms 906ms ^ 1.67x faster
+ facebook --parallel 6,519ms 4,580ms ^ 1.42x faster
+ reddit --parallel 5,097ms 3,411ms ^ 1.49x faster
+ flickr --parallel 4,903ms 3,501ms ^ 1.4x faster
+ theverge --parallel 6,641ms 4,505ms ^ 1.47x faster
+
+ <geometric mean> 1,158ms 832ms ^ 1.39x faster
+ <arithmetic mean> 2,728ms 1,895ms ^ 1.44x faster
+ <harmonic mean> 332ms 240ms ^ 1.38x faster
+
+ MacBook Air (2 hyperthreaded cores = 4 threads):
+
+ Baseline Patch Δ
+ Execution Time:
+ message_one 911ms 907ms ^ 1.0x faster
+ message_many 515ms 513ms ^ 1.0x faster
+ churn --parallel 132ms 134ms ! 1.02x slower
+ list_allocate --parallel 104ms 102ms ^ 1.02x faster
+ tree_allocate --parallel 117ms 111ms ^ 1.05x faster
+ tree_churn --parallel 154ms 151ms ^ 1.02x faster
+ facebook --parallel 719ms 687ms ^ 1.05x faster
+ reddit --parallel 382ms 341ms ^ 1.12x faster
+ flickr --parallel 372ms 345ms ^ 1.08x faster
+ theverge --parallel 489ms 444ms ^ 1.1x faster
+
+ <geometric mean> 299ms 287ms ^ 1.04x faster
+ <arithmetic mean> 390ms 374ms ^ 1.04x faster
+ <harmonic mean> 227ms 220ms ^ 1.03x faster
+
+ iPad (2 cores = 2 threads):
+
+ [ Doesn't run Ruby, so no pretty subtest output. ]
+
+ Baseline Patch Δ
+ Execution Time: 174.14ms 171.5ms ^ 1.02x faster
+
+ * bmalloc.xcodeproj/project.pbxproj:
+
+ * bmalloc/ScopeExit.h: Added. A barebones very wimpy version of
+ WTF::ScopeExit.
+ (bmalloc::ScopeExit::ScopeExit):
+ (bmalloc::ScopeExit::~ScopeExit):
+ (bmalloc::makeScopeExit):
+
+ * bmalloc/StaticMutex.cpp:
+ (bmalloc::StaticMutex::lockSlowCase): Spin before yielding -- that's the
+ speedup. Don't spin if another CPU is already spinning. In theory, more
+ than one spinner accomplishes nothing, and I found that there's a cutoff
+ around 8 or 16 spinners that becomes performance negative on Mac Pro.
+
+ (Note: Another way to accomplish a similar result, if you don't want to
+ use a bit of state in the lock, is to spin for a random duration between
+ 0 and aLot. I tested a version of WTF::WeakRandom with unsynchronized
+ static state and it worked great. But I ultimately opted for the explicit
+ bit because I thought it was clearer.)
+
+ * bmalloc/StaticMutex.h:
+ (bmalloc::StaticMutex::init): Initialize our new bit.
+
+ * bmalloc/ThreadSwitch.h: Added.
+ (bmalloc::threadSwitch): Don't call yield() on Darwin because it's too
+ aggressive. swtch() does what we want: Go run something else, without
+ any other side-effects.
+
2016-08-03 Geoffrey Garen <[email protected]>
[bmalloc] Merging of XLargeRanges can leak the upper range
Added: trunk/Source/bmalloc/bmalloc/ScopeExit.h (0 => 204744)
--- trunk/Source/bmalloc/bmalloc/ScopeExit.h (rev 0)
+++ trunk/Source/bmalloc/bmalloc/ScopeExit.h 2016-08-22 23:18:09 UTC (rev 204744)
@@ -0,0 +1,53 @@
+/*
+ * Copyright (C) 2016 Apple Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL APPLE INC. OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <type_traits>
+
+namespace bmalloc {
+
+template<typename ExitFunction>
+class ScopeExit {
+public:
+ explicit ScopeExit(ExitFunction&& exitFunction)
+ : m_exitFunction(exitFunction)
+ {
+ }
+
+ ~ScopeExit()
+ {
+ m_exitFunction();
+ }
+
+private:
+ ExitFunction m_exitFunction;
+};
+
+template<typename ExitFunction>
+ScopeExit<ExitFunction> makeScopeExit(ExitFunction&& exitFunction)
+{
+ return ScopeExit<ExitFunction>(std::forward<ExitFunction>(exitFunction));
+}
+
+} // namespace bmalloc
Modified: trunk/Source/bmalloc/bmalloc/StaticMutex.cpp (204743 => 204744)
--- trunk/Source/bmalloc/bmalloc/StaticMutex.cpp 2016-08-22 22:44:42 UTC (rev 204743)
+++ trunk/Source/bmalloc/bmalloc/StaticMutex.cpp 2016-08-22 23:18:09 UTC (rev 204744)
@@ -23,15 +23,31 @@
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
+#include "ScopeExit.h"
#include "StaticMutex.h"
-#include <thread>
+#include "ThreadSwitch.h"
namespace bmalloc {
void StaticMutex::lockSlowCase()
{
+ // The longest critical section in bmalloc is much shorter than the
+ // time it takes to make a system call to yield to the OS scheduler.
+ // So, we try again a lot before we yield.
+ static const size_t aLot = 256;
+
+ if (!m_isSpinning.test_and_set()) {
+ auto clear = makeScopeExit([&] { m_isSpinning.clear(); });
+
+ for (size_t i = 0; i < aLot; ++i) {
+ if (try_lock())
+ return;
+ }
+ }
+
+ // Avoid spinning pathologically.
while (!try_lock())
- std::this_thread::yield();
+ threadSwitch();
}
} // namespace bmalloc
Modified: trunk/Source/bmalloc/bmalloc/StaticMutex.h (204743 => 204744)
--- trunk/Source/bmalloc/bmalloc/StaticMutex.h 2016-08-22 22:44:42 UTC (rev 204743)
+++ trunk/Source/bmalloc/bmalloc/StaticMutex.h 2016-08-22 23:18:09 UTC (rev 204744)
@@ -52,6 +52,7 @@
void lockSlowCase();
std::atomic_flag m_flag;
+ std::atomic_flag m_isSpinning;
};
static inline void sleep(
@@ -78,6 +79,7 @@
inline void StaticMutex::init()
{
m_flag.clear();
+ m_isSpinning.clear();
}
inline bool StaticMutex::try_lock()
Added: trunk/Source/bmalloc/bmalloc/ThreadSwitch.h (0 => 204744)
--- trunk/Source/bmalloc/bmalloc/ThreadSwitch.h (rev 0)
+++ trunk/Source/bmalloc/bmalloc/ThreadSwitch.h 2016-08-22 23:18:09 UTC (rev 204744)
@@ -0,0 +1,44 @@
+/*
+ * Copyright (C) 2016 Apple Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY APPLE INC. ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL APPLE INC. OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#if BOS(DARWIN)
+#include <mach/thread_switch.h>
+#endif
+#include <thread>
+
+namespace bmalloc {
+
+inline void threadSwitch()
+{
+ // yield() on Darwin will depress your priority to absolute 0 for 10ms,
+ // and possibly clock down the CPU -- so we avoid it.
+#if BOS(DARWIN)
+ swtch();
+#else
+ std::this_thread::yield();
+#endif
+}
+
+} // namespace bmalloc
Modified: trunk/Source/bmalloc/bmalloc.xcodeproj/project.pbxproj (204743 => 204744)
--- trunk/Source/bmalloc/bmalloc.xcodeproj/project.pbxproj 2016-08-22 22:44:42 UTC (rev 204743)
+++ trunk/Source/bmalloc/bmalloc.xcodeproj/project.pbxproj 2016-08-22 23:18:09 UTC (rev 204744)
@@ -24,6 +24,8 @@
147DC6E31CA5B70B00724E8D /* Chunk.h in Headers */ = {isa = PBXBuildFile; fileRef = 147DC6E21CA5B70B00724E8D /* Chunk.h */; settings = {ATTRIBUTES = (Private, ); }; };
14895D911A3A319C0006235D /* Environment.cpp in Sources */ = {isa = PBXBuildFile; fileRef = 14895D8F1A3A319C0006235D /* Environment.cpp */; };
14895D921A3A319C0006235D /* Environment.h in Headers */ = {isa = PBXBuildFile; fileRef = 14895D901A3A319C0006235D /* Environment.h */; settings = {ATTRIBUTES = (Private, ); }; };
+ 148EFAE81D6B953B008E721E /* ScopeExit.h in Headers */ = {isa = PBXBuildFile; fileRef = 148EFAE61D6B953B008E721E /* ScopeExit.h */; };
+ 148EFAE91D6B953B008E721E /* ThreadSwitch.h in Headers */ = {isa = PBXBuildFile; fileRef = 148EFAE71D6B953B008E721E /* ThreadSwitch.h */; };
14C8992B1CC485E70027A057 /* Map.h in Headers */ = {isa = PBXBuildFile; fileRef = 14C8992A1CC485E70027A057 /* Map.h */; settings = {ATTRIBUTES = (Private, ); }; };
14C8992D1CC578330027A057 /* XLargeRange.h in Headers */ = {isa = PBXBuildFile; fileRef = 14C8992C1CC578330027A057 /* XLargeRange.h */; settings = {ATTRIBUTES = (Private, ); }; };
14C919C918FCC59F0028DB43 /* BPlatform.h in Headers */ = {isa = PBXBuildFile; fileRef = 14C919C818FCC59F0028DB43 /* BPlatform.h */; settings = {ATTRIBUTES = (Private, ); }; };
@@ -110,6 +112,8 @@
1485656018A43DBA00ED6942 /* ObjectType.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = ObjectType.h; path = bmalloc/ObjectType.h; sourceTree = "<group>"; };
14895D8F1A3A319C0006235D /* Environment.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; name = Environment.cpp; path = bmalloc/Environment.cpp; sourceTree = "<group>"; };
14895D901A3A319C0006235D /* Environment.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = Environment.h; path = bmalloc/Environment.h; sourceTree = "<group>"; };
+ 148EFAE61D6B953B008E721E /* ScopeExit.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = ScopeExit.h; path = bmalloc/ScopeExit.h; sourceTree = "<group>"; };
+ 148EFAE71D6B953B008E721E /* ThreadSwitch.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = ThreadSwitch.h; path = bmalloc/ThreadSwitch.h; sourceTree = "<group>"; };
14B650C518F39F4800751968 /* Base.xcconfig */ = {isa = PBXFileReference; lastKnownFileType = text.xcconfig; path = Base.xcconfig; sourceTree = "<group>"; };
14B650C618F39F4800751968 /* bmalloc.xcconfig */ = {isa = PBXFileReference; lastKnownFileType = text.xcconfig; path = bmalloc.xcconfig; sourceTree = "<group>"; };
14B650C718F39F4800751968 /* DebugRelease.xcconfig */ = {isa = PBXFileReference; lastKnownFileType = text.xcconfig; path = DebugRelease.xcconfig; sourceTree = "<group>"; };
@@ -262,9 +266,11 @@
14446A0717A61FA400F9EA1D /* PerProcess.h */,
144469FD17A61F1F00F9EA1D /* PerThread.h */,
145F6878179E3A4400D65598 /* Range.h */,
+ 148EFAE61D6B953B008E721E /* ScopeExit.h */,
143CB81A19022BC900B16A45 /* StaticMutex.cpp */,
143CB81B19022BC900B16A45 /* StaticMutex.h */,
1417F64F18B7280C0076FA3F /* Syscall.h */,
+ 148EFAE71D6B953B008E721E /* ThreadSwitch.h */,
1479E21217A1A255006D4E9D /* Vector.h */,
1479E21417A1A63E006D4E9D /* VMAllocate.h */,
);
@@ -309,6 +315,7 @@
4426E2831C839547008EB042 /* BSoftLinking.h in Headers */,
14DD789018F48CEB00950702 /* Sizes.h in Headers */,
14DD78C718F48D7500950702 /* BAssert.h in Headers */,
+ 148EFAE91D6B953B008E721E /* ThreadSwitch.h in Headers */,
14DD78D018F48D7500950702 /* VMAllocate.h in Headers */,
14DD78CE18F48D7500950702 /* Syscall.h in Headers */,
14DD78C618F48D7500950702 /* AsyncTask.h in Headers */,
@@ -316,6 +323,7 @@
14895D921A3A319C0006235D /* Environment.h in Headers */,
1400274A18F89C2300115C97 /* VMHeap.h in Headers */,
1400274918F89C1300115C97 /* Heap.h in Headers */,
+ 148EFAE81D6B953B008E721E /* ScopeExit.h in Headers */,
140FA00319CE429C00FFD3C8 /* BumpRange.h in Headers */,
4426E2811C838EE0008EB042 /* Logging.h in Headers */,
14DD78C518F48D7500950702 /* Algorithm.h in Headers */,