- Revision
- 206274
- Author
- fpi...@apple.com
- Date
- 2016-09-22 14:11:42 -0700 (Thu, 22 Sep 2016)
Log Message
Fences on x86 should be a lot cheaper
https://bugs.webkit.org/show_bug.cgi?id=162417
Reviewed by Mark Lam and Geoffrey Garen.
Source/_javascript_Core:
It turns out that:
lock; orl $0, (%rsp)
does everything that we wanted from:
mfence
And it's a lot faster. When I tried mfence for making object visiting concurrent-GC-TSO-
friendly, it was a 9% regression on Octane/splay. But when I tried ortop, it was neutral.
So, we should use ortop from now on.
This part of the change is for the JITs. MacroAssembler::memoryFence() appears to always
mean something like an acqrel fence, so it's safe to make this use ortop. Since B3's Fence
compiles to Air MemoryFence, which is just MacroAssembler::memoryFence(), this also changes
B3 codegen.
* assembler/MacroAssemblerX86Common.h:
(JSC::MacroAssemblerX86Common::memoryFence):
* assembler/X86Assembler.h:
(JSC::X86Assembler::lock):
* b3/testb3.cpp:
(JSC::B3::testX86MFence):
(JSC::B3::testX86CompilerFence):
Source/WTF:
It turns out that:
lock; orl $0, (%rsp)
does everything that we wanted from:
mfence
And it's a lot faster. When I tried mfence for making object visiting concurrent-GC-TSO-
friendly, it was a 9% regression on Octane/splay. But when I tried ortop, it was neutral.
So, we should use ortop from now on.
This part of the change just affects our Atomics. I also changed this in the JITs.
* wtf/Atomics.h:
(WTF::x86_ortop):
(WTF::storeLoadFence):
(WTF::x86_mfence): Deleted.
Modified Paths
Diff
Modified: trunk/Source/_javascript_Core/ChangeLog (206273 => 206274)
--- trunk/Source/_javascript_Core/ChangeLog 2016-09-22 21:05:46 UTC (rev 206273)
+++ trunk/Source/_javascript_Core/ChangeLog 2016-09-22 21:11:42 UTC (rev 206274)
@@ -1,3 +1,35 @@
+2016-09-22 Filip Pizlo <fpi...@apple.com>
+
+ Fences on x86 should be a lot cheaper
+ https://bugs.webkit.org/show_bug.cgi?id=162417
+
+ Reviewed by Mark Lam and Geoffrey Garen.
+
+ It turns out that:
+
+ lock; orl $0, (%rsp)
+
+ does everything that we wanted from:
+
+ mfence
+
+ And it's a lot faster. When I tried mfence for making object visiting concurrent-GC-TSO-
+ friendly, it was a 9% regression on Octane/splay. But when I tried ortop, it was neutral.
+ So, we should use ortop from now on.
+
+ This part of the change is for the JITs. MacroAssembler::memoryFence() appears to always
+ mean something like an acqrel fence, so it's safe to make this use ortop. Since B3's Fence
+ compiles to Air MemoryFence, which is just MacroAssembler::memoryFence(), this also changes
+ B3 codegen.
+
+ * assembler/MacroAssemblerX86Common.h:
+ (JSC::MacroAssemblerX86Common::memoryFence):
+ * assembler/X86Assembler.h:
+ (JSC::X86Assembler::lock):
+ * b3/testb3.cpp:
+ (JSC::B3::testX86MFence):
+ (JSC::B3::testX86CompilerFence):
+
2016-09-22 Joseph Pecoraro <pecor...@apple.com>
test262: Function length should be number of parameters before parameters with default values
Modified: trunk/Source/_javascript_Core/assembler/MacroAssemblerX86Common.h (206273 => 206274)
--- trunk/Source/_javascript_Core/assembler/MacroAssemblerX86Common.h 2016-09-22 21:05:46 UTC (rev 206273)
+++ trunk/Source/_javascript_Core/assembler/MacroAssemblerX86Common.h 2016-09-22 21:11:42 UTC (rev 206274)
@@ -2629,9 +2629,12 @@
m_assembler.nop();
}
+ // We take memoryFence to mean acqrel. This has acqrel semantics on x86.
void memoryFence()
{
- m_assembler.mfence();
+ // lock; orl $0, (%rsp)
+ m_assembler.lock();
+ m_assembler.orl_im(0, 0, X86Registers::esp);
}
static void replaceWithJump(CodeLocationLabel instructionStart, CodeLocationLabel destination)
Modified: trunk/Source/_javascript_Core/assembler/X86Assembler.h (206273 => 206274)
--- trunk/Source/_javascript_Core/assembler/X86Assembler.h 2016-09-22 21:05:46 UTC (rev 206273)
+++ trunk/Source/_javascript_Core/assembler/X86Assembler.h 2016-09-22 21:11:42 UTC (rev 206274)
@@ -249,6 +249,7 @@
OP_ESCAPE_DD = 0xDD,
OP_CALL_rel32 = 0xE8,
OP_JMP_rel32 = 0xE9,
+ PRE_LOCK = 0xF0,
PRE_SSE_F2 = 0xF2,
PRE_SSE_F3 = 0xF3,
OP_HLT = 0xF4,
@@ -2684,6 +2685,11 @@
m_formatter.prefix(PRE_PREDICT_BRANCH_NOT_TAKEN);
}
+ void lock()
+ {
+ m_formatter.prefix(PRE_LOCK);
+ }
+
void mfence()
{
m_formatter.threeByteOp(OP2_3BYTE_ESCAPE_AE, OP3_MFENCE);
Modified: trunk/Source/_javascript_Core/b3/testb3.cpp (206273 => 206274)
--- trunk/Source/_javascript_Core/b3/testb3.cpp 2016-09-22 21:05:46 UTC (rev 206273)
+++ trunk/Source/_javascript_Core/b3/testb3.cpp 2016-09-22 21:11:42 UTC (rev 206274)
@@ -13062,7 +13062,8 @@
root->appendNew<Value>(proc, Return, Origin());
auto code = compile(proc);
- checkUsesInstruction(*code, "mfence");
+ checkUsesInstruction(*code, "lock or $0x0, (%rsp)");
+ checkDoesNotUseInstruction(*code, "mfence");
}
void testX86CompilerFence()
@@ -13075,6 +13076,7 @@
root->appendNew<Value>(proc, Return, Origin());
auto code = compile(proc);
+ checkDoesNotUseInstruction(*code, "lock");
checkDoesNotUseInstruction(*code, "mfence");
}
Modified: trunk/Source/WTF/ChangeLog (206273 => 206274)
--- trunk/Source/WTF/ChangeLog 2016-09-22 21:05:46 UTC (rev 206273)
+++ trunk/Source/WTF/ChangeLog 2016-09-22 21:11:42 UTC (rev 206274)
@@ -1,3 +1,29 @@
+2016-09-22 Filip Pizlo <fpi...@apple.com>
+
+ Fences on x86 should be a lot cheaper
+ https://bugs.webkit.org/show_bug.cgi?id=162417
+
+ Reviewed by Mark Lam and Geoffrey Garen.
+
+ It turns out that:
+
+ lock; orl $0, (%rsp)
+
+ does everything that we wanted from:
+
+ mfence
+
+ And it's a lot faster. When I tried mfence for making object visiting concurrent-GC-TSO-
+ friendly, it was a 9% regression on Octane/splay. But when I tried ortop, it was neutral.
+ So, we should use ortop from now on.
+
+ This part of the change just affects our Atomics. I also changed this in the JITs.
+
+ * wtf/Atomics.h:
+ (WTF::x86_ortop):
+ (WTF::storeLoadFence):
+ (WTF::x86_mfence): Deleted.
+
2016-09-21 Alexey Proskuryakov <a...@apple.com>
Rolling out r206244, as it caused flaky crashes on tests.
Modified: trunk/Source/WTF/wtf/Atomics.h (206273 => 206274)
--- trunk/Source/WTF/wtf/Atomics.h 2016-09-22 21:05:46 UTC (rev 206273)
+++ trunk/Source/WTF/wtf/Atomics.h 2016-09-22 21:11:42 UTC (rev 206274)
@@ -167,7 +167,7 @@
#elif CPU(X86) || CPU(X86_64)
-inline void x86_mfence()
+inline void x86_ortop()
{
#if OS(WINDOWS)
// I think that this does the equivalent of a dummy interlocked instruction,
@@ -176,13 +176,15 @@
// investigate if that is actually better.
MemoryBarrier();
#else
- asm volatile("mfence" ::: "memory");
+ // This has acqrel semantics and is much cheaper than mfence. For exampe, in the JSC GC, using
+ // mfence as a store-load fence was a 9% slow-down on Octane/splay while using this was neutral.
+ asm volatile("lock; orl $0, (%%rsp)" ::: "memory");
#endif
}
inline void loadLoadFence() { compilerFence(); }
inline void loadStoreFence() { compilerFence(); }
-inline void storeLoadFence() { x86_mfence(); }
+inline void storeLoadFence() { x86_ortop(); }
inline void storeStoreFence() { compilerFence(); }
inline void memoryBarrierAfterLock() { compilerFence(); }
inline void memoryBarrierBeforeUnlock() { compilerFence(); }