Re: Spinlock implementation on x86_64 (was Re: [HACKERS] Better LWLocks with compare-and-swap (9.4))

2013-08-29 Thread Heikki Linnakangas

On 28.08.2013 20:21, Tom Lane wrote:

Heikki Linnakangas <hlinnakan...@vmware.com> writes:

So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
master. But I would love to get feedback from people running different
x86_64 hardware.


Surely this patch should update the existing comment at line 209?  Or at
least point out that a non-locked test in TAS_SPIN is not the same as a
non-locked test in tas() itself.


Committed with some comment changes. I also added a note to the 32-bit 
x86 implementation, suggesting that we should probably do the same there; 
no one has just gotten around to doing the testing.


- Heikki




Spinlock implementation on x86_64 (was Re: [HACKERS] Better LWLocks with compare-and-swap (9.4))

2013-08-28 Thread Heikki Linnakangas

On 21.05.2013 00:20, Heikki Linnakangas wrote:

On 16.05.2013 01:08, Daniel Farina wrote:

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when
I run pgbench -S -c64 -j64 -T60 -M prepared on a 32-core Linux
machine:

-  64.09%  postgres  postgres           [.] tas
   - tas
      - 99.83% s_lock
         - 53.22% LWLockAcquire
            + 99.87% GetSnapshotData
         - 46.78% LWLockRelease
             GetSnapshotData
            + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
theory
is that after that point all the cores are busy, and processes start
to be
sometimes context switched while holding the spinlock, which kills
performance.


I have, I also used linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes > number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature.


Stop the press! I'm getting the same speedup on that 32-core box I got
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

#define TAS(lock) tas(lock)

+#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
+
static __inline__ int
tas(volatile slock_t *lock)
{

So, on this system, doing a non-locked test before the locked xchg
instruction while spinning is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added,
as we have this comment in the tas() implementation:


/*
* On Opteron, using a non-locking test before the locking instruction
* is a huge loss. On EM64T, it appears to be a wash or small loss,
* so we needn't bother to try to distinguish the sub-architectures.
*/


On my test system, the non-locking test is a big win. I tested this
because I was reading this article from Intel:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.
It says very explicitly that the non-locking test is a good idea:


Spinning on volatile read vs. spin on lock attempt

One common mistake made by developers developing their own spin-wait
loops is attempting to spin on an atomic instruction instead of
spinning on a volatile read. Spinning on a dirty read instead of
attempting to acquire a lock consumes less time and resources. This
allows an application to only attempt to acquire a lock only when it
is free.


Now, I'm not sure what to do about this. If we put the non-locking test
in there, according to the previous testing that would be a huge loss on
Opterons.


I think we should apply the attached patch to put a non-locked test in 
TAS_SPIN() on x86_64.


Tom tested this back in 2011 on a 32-core Opteron [1] and found little 
difference. And on an 80-core Xeon (160 with hyperthreading) system, he 
saw quite a significant gain from it [2]. Reading the thread, I don't 
understand why it wasn't applied to x86_64 back then. Maybe it was out of 
fear of a negative impact on performance on old Opterons, but I strongly 
suspect that the performance would be OK even on old Opterons if you only 
do the non-locked test in TAS_SPIN() and not in TAS(). Back when the 
comment about poor performance with a non-locking test on Opteron was 
written, we used the same TAS() macro for the first attempt and while 
spinning.
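
To make the distinction concrete, here's a minimal sketch of the two 
macros (using the GCC __sync builtin as a stand-in for the real 
inline-asm xchg; this is not the actual s_lock.h code):

typedef unsigned char slock_t;

/* First attempt: go straight for the atomic exchange. */
static __inline__ int
tas(volatile slock_t *lock)
{
	return (int) __sync_lock_test_and_set(lock, 1);
}

#define TAS(lock)		tas(lock)

/*
 * While spinning: peek with a plain read first, and issue the expensive
 * locked instruction only when the lock looks free.  The dirty read keeps
 * the cache line in shared mode among the waiters, instead of bouncing it
 * around in exclusive mode on every poll.
 */
#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))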


I tried to find some sort of an authoritative guide from AMD that would 
say how spinlocks should be implemented, but didn't find any. The Intel 
guide I linked above says we should apply the patch. Linux uses a ticket 
spinlock thingie nowadays, which is slightly different, but if I'm 
reading the older pre-ticket version correctly, it used to do the same, 
i.e. a non-locking test while spinning, but not before the first 
attempt to grab the lock. Same in FreeBSD.
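
For reference, a generic ticket lock looks something like this (an 
illustrative sketch only, not the kernel's actual code):

#include <stdint.h>

typedef struct
{
	volatile uint16_t next;		/* next ticket to hand out */
	volatile uint16_t owner;	/* ticket currently being served */
} ticketlock_t;

static void
ticket_lock(ticketlock_t *t)
{
	/* atomically take a ticket number */
	uint16_t my_ticket = __sync_fetch_and_add(&t->next, 1);

	/* then spin on a plain read until our number comes up */
	while (t->owner != my_ticket)
		__asm__ __volatile__("pause");	/* x86 spin-wait hint */
}

static void
ticket_unlock(ticketlock_t *t)
{
	__sync_synchronize();		/* full barrier before handing the lock over */
	t->owner++;
}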


So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to 
master. But I would love to get feedback from people running different 
x86_64 hardware.


[1] http://www.postgresql.org/message-id/8292.1314641...@sss.pgh.pa.us
[2] http://www.postgresql.org/message-id/25989.1314734...@sss.pgh.pa.us

- Heikki
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 

Re: Spinlock implementation on x86_64 (was Re: [HACKERS] Better LWLocks with compare-and-swap (9.4))

2013-08-28 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to 
 master. But I would love to get feedback from people running different 
 x86_64 hardware.

Surely this patch should update the existing comment at line 209?  Or at
least point out that a non-locked test in TAS_SPIN is not the same as a
non-locked test in tas() itself.

Other than the commenting, I have no objection to this.  I think you're
probably right that the old tests in which this looked like a bad idea
were adding the unlocked test to tas(), not only the spin case.

regards, tom lane




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-20 Thread Dickson S. Guedes
On Sun, 2013-05-19 at 09:29 +0300, Heikki Linnakangas wrote:
 On 18.05.2013 03:52, Dickson S. Guedes wrote:
  pgbench -S is such a workload. With 9.3beta1, I'm seeing this
  profile, when I run pgbench -S -c64 -j64 -T60 -M prepared on a
  32-core Linux machine:
 
  -  64.09%  postgres  postgres           [.] tas
     - tas
        - 99.83% s_lock
           - 53.22% LWLockAcquire
              + 99.87% GetSnapshotData
           - 46.78% LWLockRelease
               GetSnapshotData
              + GetTransactionSnapshot
  +   2.97%  postgres  postgres           [.] tas
  +   1.53%  postgres  libc-2.13.so       [.] 0x119873
  +   1.44%  postgres  postgres           [.] GetSnapshotData
  +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
  +   1.18%  postgres  postgres           [.] AllocSetAlloc
  ...
 
  I'd like to test this here but I couldn't reproduce that perf output
  here on 64-core or 24-core machines; could you post the changes to
  postgresql.conf and the perf arguments that you used?
 
 Sure, here are the non-default postgresql.conf settings:


Thank you for your information.


 While pgbench was running, I ran this:
 
 perf record -p 6050 -g -e cpu-clock
 
 to connect to one of the backends. (I used cpu-clock, because the 
 default cpu-cycles event didn't work on the box)


Hmm, I suspected I was doing something wrong, but I'm getting the same
result as before even using your test case, and my results are still
different from yours:


+ 71,27% postgres postgres [.] AtEOXact_Buffers
+  7,67% postgres postgres [.] AtEOXact_CatCache
+  6,30% postgres postgres [.] AllocSetCheck
+  5,34% postgres libc-2.12.so [.] __mcount_internal
+  2,14% postgres [kernel.kallsyms][k] activate_page


It's a 64-core machine with PGDATA in a SAN.

vendor_id   : GenuineIntel
cpu family  : 6
model   : 47
model name  : Intel(R) Xeon(R) CPU E7- 4830  @ 2.13GHz
stepping: 2
cpu MHz : 1064.000
cache size  : 24576 KB
physical id : 3
siblings: 16
core id : 24
cpu cores   : 8
apicid  : 241
initial apicid  : 241
fpu : yes
fpu_exception   : yes
cpuid level : 11
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes
lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4255.87
clflush size: 64
cache_alignment : 64
address sizes   : 44 bits physical, 48 bits virtual
power management:


Would you like me to test some other configuration to try to simulate
the expected workload?


[]s
-- 
Dickson S. Guedes
mail/xmpp: gue...@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-20 Thread Andres Freund
On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:
 Hmm, I suspected I was doing something wrong, but I'm getting the same
 result as before even using your test case, and my results are still
 different from yours:
 
 
 + 71,27% postgres postgres [.] AtEOXact_Buffers
 +  7,67% postgres postgres [.] AtEOXact_CatCache
 +  6,30% postgres postgres [.] AllocSetCheck
 +  5,34% postgres libc-2.12.so [.] __mcount_internal
 +  2,14% postgres [kernel.kallsyms][k] activate_page

That looks like you have configured with --enable-cassert and probably
also --enable-profiling? The former will give completely distorted
performance results...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-20 Thread Dickson S. Guedes
On Mon, 2013-05-20 at 14:35 +0200, Andres Freund wrote:
 On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:
  Hmm, I suspected I was doing something wrong, but I'm getting the same
  result as before even using your test case, and my results are still
  different from yours:
  
  
  + 71,27% postgres postgres [.] AtEOXact_Buffers
  +  7,67% postgres postgres [.] AtEOXact_CatCache
  +  6,30% postgres postgres [.] AllocSetCheck
  +  5,34% postgres libc-2.12.so [.] __mcount_internal
  +  2,14% postgres [kernel.kallsyms][k] activate_page
 
 That looks like you have configured with --enable-cassert and probably
 also --enable-profiling? The former will give completely distorted
 performance results...


Ah! Wrong PATH, so wrong binaries. Thanks Andres.


-- 
Dickson S. Guedes
mail/xmpp: gue...@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-20 Thread Heikki Linnakangas

On 13.05.2013 17:21, Merlin Moncure wrote:

On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

The attached patch is still work-in-progress. There needs to be a configure
test and fallback to spinlock if a CAS instruction is not available. I used
the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
abstracted. Also, in the case that the wait queue needs to be manipulated,
the code spins on the CAS instruction, but there is no delay mechanism like
there is on a regular spinlock; that needs to be added in somehow.


These are really interesting results. Why is the CAS method so much
faster than TAS?


If my theory is right, the steep fall is caused by backends sometimes 
being context switched while holding the spinlock. The CAS method 
alleviates that, because the lock is never held by anyone. With a 
spinlock, if a process gets context switched while holding the lock, 
everyone else has to wait until that process gets context switched 
back in and releases the spinlock. With the CAS method, if a process gets 
switched in the CAS loop, it doesn't hurt others.
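
To spell that out, a shared acquisition boils down to a loop like this (a 
simplified sketch; the EXCLUSIVE_BIT layout is made up for illustration 
and is not the patch's actual field layout):

#include <stdint.h>
#include <stdbool.h>

#define EXCLUSIVE_BIT	((uint64_t) 1 << 63)	/* illustrative layout */

static bool
try_acquire_shared(volatile uint64_t *lockword)
{
	for (;;)
	{
		uint64_t	old = *lockword;

		if (old & EXCLUSIVE_BIT)
			return false;	/* exclusive holder; caller must queue up */

		/* succeeds only if the word is still 'old'; otherwise retry */
		if (__sync_val_compare_and_swap(lockword, old, old + 1) == old)
			return true;	/* shared count bumped; nothing was ever held */
	}
}

A process preempted anywhere in that loop holds nothing, so the next 
backend's CAS still succeeds against a consistent lock word.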


With the patch, the CAS loop still spins, just like a spinlock, when the 
wait queue has to be manipulated. So when that happens, it will behave 
more or less like the spinlock implementation. I'm not too concerned 
about optimizing that further; sleeping and signaling sleeping 
backends are heavy-weight operations anyway.



Did you see any contention?


With the current spinlock implementation? Yeah, per the profile I 
posted, a lot of CPU time was spent spinning, which is a sign of 
contention. I didn't see contention with the CAS implementation.


Attached is a new version of the patch. I cleaned it up a lot, and added 
a configure test and a fallback to the good old spinlock implementation 
if compare-and-swap is not available. Also, if the CAS loop needs to 
spin like a spinlock, it now sleeps after a while and updates 
spins_per_delay like a regular spinlock does.


- Heikki
diff --git a/configure b/configure
index 98d889a..43faaf8 100755
--- a/configure
+++ b/configure
@@ -22742,71 +22742,6 @@ fi
 done
 
 
-{ $as_echo "$as_me:$LINENO: checking for builtin locking functions" >&5
-$as_echo_n "checking for builtin locking functions... " >&6; }
-if test "${pgac_cv_gcc_int_atomics+set}" = set; then
-  $as_echo_n "(cached) " >&6
-else
-  cat >conftest.$ac_ext <<_ACEOF
-/* confdefs.h.  */
-_ACEOF
-cat confdefs.h >>conftest.$ac_ext
-cat >>conftest.$ac_ext <<_ACEOF
-/* end confdefs.h.  */
-
-int
-main ()
-{
-int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);
-  ;
-  return 0;
-}
-_ACEOF
-rm -f conftest.$ac_objext conftest$ac_exeext
-if { (ac_try="$ac_link"
-case "(($ac_try" in
-  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
-  *) ac_try_echo=$ac_try;;
-esac
-eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
-$as_echo "$ac_try_echo") >&5
-  (eval "$ac_link") 2>conftest.er1
-  ac_status=$?
-  grep -v '^ *+' conftest.er1 >conftest.err
-  rm -f conftest.er1
-  cat conftest.err >&5
-  $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
-  (exit $ac_status); } && {
-	 test -z "$ac_c_werror_flag" ||
-	 test ! -s conftest.err
-       } && test -s conftest$ac_exeext && {
-	 test "$cross_compiling" = yes ||
-	 $as_test_x conftest$ac_exeext
-       }; then
-  pgac_cv_gcc_int_atomics=yes
-else
-  $as_echo "$as_me: failed program was:" >&5
-sed 's/^/| /' conftest.$ac_ext >&5
-
-	pgac_cv_gcc_int_atomics=no
-fi
-
-rm -rf conftest.dSYM
-rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
-      conftest$ac_exeext conftest.$ac_ext
-fi
-{ $as_echo "$as_me:$LINENO: result: $pgac_cv_gcc_int_atomics" >&5
-$as_echo "$pgac_cv_gcc_int_atomics" >&6; }
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-
-cat >>confdefs.h <<\_ACEOF
-#define HAVE_GCC_INT_ATOMICS 1
-_ACEOF
-
-fi
-
 # Lastly, restore full LIBS list and check for readline/libedit symbols
 LIBS="$LIBS_including_readline"
 
@@ -28573,6 +28508,136 @@ _ACEOF
 fi
 
 
+# Check for GCC built-in atomic functions
+{ $as_echo "$as_me:$LINENO: checking for builtin test-and-set function" >&5
+$as_echo_n "checking for builtin test-and-set function... " >&6; }
+if test "${pgac_cv_gcc_int_test_and_set+set}" = set; then
+  $as_echo_n "(cached) " >&6
+else
+  cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h.  */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);
+  ;
+  return 0;
+}
+_ACEOF
+rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+  (eval "$ac_link") 2>conftest.er1
+  ac_status=$?
+  grep -v '^ *+' conftest.er1 >conftest.err
+  rm -f conftest.er1
+  cat conftest.err >&5
+  $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+  (exit $ac_status); } && {
+	 test -z "$ac_c_werror_flag" ||
+	 test 

Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-20 Thread Bruce Momjian
On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
 Stephen Frost sfr...@snowman.net writes:
  Isn't this the same issue which has prompted multiple people to propose
  (sometimes with code, as I recall) to rip out our internal spinlock
  system and replace it with kernel-backed calls which do it better,
  specifically by dealing with issues like the above?  Have you seen those
  threads in the past?  Any thoughts about moving in that direction?
 
 All of the proposals of that sort that I've seen had a flavor of
 "my OS is the only one that matters".  While I don't object to
 platform-dependent implementations of spinlocks as such, they're not
 much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization?  Seems so.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-20 Thread Alvaro Herrera
 diff --git a/configure.in b/configure.in
 index 4ea5699..ff8470e 100644
 --- a/configure.in
 +++ b/configure.in
 @@ -1445,17 +1445,6 @@ fi
  AC_CHECK_FUNCS([strtoll strtoq], [break])
  AC_CHECK_FUNCS([strtoull strtouq], [break])
  
 -AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
 -[AC_TRY_LINK([],
 -  [int lock = 0;
 -   __sync_lock_test_and_set(&lock, 1);
 -   __sync_lock_release(&lock);],
 -  [pgac_cv_gcc_int_atomics=yes],
 -  [pgac_cv_gcc_int_atomics=no])])
 -if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
 -  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have 
 -__sync_lock_test_and_set(int *) and friends.])
 -fi
 -

Careful here --- s_lock.h has some code conditional on
HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
removing the definition, unless I'm misreading.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-20 Thread Heikki Linnakangas

On 20.05.2013 23:01, Bruce Momjian wrote:

On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:

Stephen Frost <sfr...@snowman.net> writes:

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above?  Have you seen those
threads in the past?  Any thoughts about moving in that direction?


All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters".  While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.


Uh, is this an x86-64-only optimization?  Seems so.


All modern architectures have an atomic compare-and-swap instruction (or 
something more powerful that can be used to implement it). That includes 
x86, x86-64, ARM, PowerPC, among others.


There are some differences in how wide values can be swapped with it; 
386 only supported 32-bit, until Pentium, which added a 64-bit variant. 
I used the 64-bit variant in the patch, but for lwlocks, we could 
actually get away with the 32-bit variant if we packed the booleans and 
the shared lock counter more tightly.
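
For illustration, such a packing might look like this (the field widths 
are guesses, not the patch's actual layout, and bit-field packing is 
compiler-dependent):

#include <stdint.h>

typedef union
{
	struct
	{
		uint32_t	mutex:1;		/* protects the PGPROC wait queue */
		uint32_t	releaseOK:1;	/* T if ok to release waiters */
		uint32_t	waiters:1;		/* T if wait queue is not empty */
		uint32_t	exclusive:1;	/* # of exclusive holders (0 or 1) */
		uint32_t	shared:28;		/* # of shared holders (0..MaxBackends) */
	}			f;
	uint32_t	val;				/* swapped as a single 32-bit word */
} LWLock_cas32_t;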


- Heikki




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-20 Thread Heikki Linnakangas

On 20.05.2013 23:11, Alvaro Herrera wrote:

diff --git a/configure.in b/configure.in
index 4ea5699..ff8470e 100644
--- a/configure.in
+++ b/configure.in
@@ -1445,17 +1445,6 @@ fi
  AC_CHECK_FUNCS([strtoll strtoq], [break])
  AC_CHECK_FUNCS([strtoull strtouq], [break])

-AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
-[AC_TRY_LINK([],
-  [int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);],
-  [pgac_cv_gcc_int_atomics=yes],
-  [pgac_cv_gcc_int_atomics=no])])
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have 
-__sync_lock_test_and_set(int *) and friends.])
-fi
-
-


Careful here --- s_lock.h has some code conditional on
HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
removing the definition, unless I'm misreading.


Thanks, good catch. I renamed HAVE_GCC_INT_ATOMICS to 
HAVE_GCC_INT_TEST_AND_SET, because "atomics" seems too generic when we 
also test for __sync_val_compare_and_swap(p, oldval, newval).


- Heikki




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-20 Thread Bruce Momjian
On Mon, May 20, 2013 at 11:16:41PM +0300, Heikki Linnakangas wrote:
 On 20.05.2013 23:01, Bruce Momjian wrote:
 On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:
 Stephen Frost <sfr...@snowman.net> writes:
 Isn't this the same issue which has prompted multiple people to propose
 (sometimes with code, as I recall) to rip out our internal spinlock
 system and replace it with kernel-backed calls which do it better,
 specifically by dealing with issues like the above?  Have you seen those
 threads in the past?  Any thoughts about moving in that direction?
 
 All of the proposals of that sort that I've seen had a flavor of
 "my OS is the only one that matters".  While I don't object to
 platform-dependent implementations of spinlocks as such, they're not
 much of a cure for a generic performance issue.
 
 Uh, is this an x86-64-only optimization?  Seems so.
 
 All modern architectures have an atomic compare-and-swap instruction
 (or something more powerful that can be used to implement it). That
 includes x86, x86-64, ARM, PowerPC, among others.
 
 There are some differences in how wide values can be swapped with
 it; 386 only supported 32-bit, until Pentium, which added a 64-bit
 variant. I used the 64-bit variant in the patch, but for lwlocks, we
 could actually get away with the 32-bit variant if we packed the
 booleans and the shared lock counter more tightly.

So we are going to need to add this kind of assembly language
optimization for every CPU type?

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-20 Thread Heikki Linnakangas

On 16.05.2013 01:08, Daniel Farina wrote:

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
hlinnakan...@vmware.com  wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
I run pgbench -S -c64 -j64 -T60 -M prepared on a 32-core Linux machine:

-  64.09%  postgres  postgres   [.] tas
- tas
   - 99.83% s_lock
  - 53.22% LWLockAcquire
 + 99.87% GetSnapshotData
  - 46.78% LWLockRelease
   GetSnapshotData
 + GetTransactionSnapshot
+   2.97%  postgres  postgres   [.] tas
+   1.53%  postgres  libc-2.13.so   [.] 0x119873
+   1.44%  postgres  postgres   [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres   [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
is that after that point all the cores are busy, and processes start to be
sometimes context switched while holding the spinlock, which kills
performance.


I have, I also used linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes > number of processors.  It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time.  I figured preemption while in
the spinlock was to blame at the time, given the extreme nature.


Stop the press! I'm getting the same speedup on that 32-core box I got 
with the compare-and-swap patch, from this one-liner:


--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

 #define TAS(lock) tas(lock)

+#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
+
 static __inline__ int
 tas(volatile slock_t *lock)
 {

So, on this system, doing a non-locked test before the locked xchg 
instruction while spinning is a very good idea. That contradicts the 
testing that was done earlier when the x86-64 implementation was added, 
as we have this comment in the tas() implementation:



/*
 * On Opteron, using a non-locking test before the locking instruction
 * is a huge loss.  On EM64T, it appears to be a wash or small loss,
 * so we needn't bother to try to distinguish the sub-architectures.
 */


On my test system, the non-locking test is a big win. I tested this 
because I was reading this article from Intel:


http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/. 
It says very explicitly that the non-locking test is a good idea:



Spinning on volatile read vs. spin on lock attempt

One common mistake made by developers developing their own spin-wait loops is 
attempting to spin on an atomic instruction instead of spinning on a volatile 
read. Spinning on a dirty read instead of attempting to acquire a lock consumes 
less time and resources. This allows an application to only attempt to acquire 
a lock only when it is free.


Now, I'm not sure what to do about this. If we put the non-locking test 
in there, according to the previous testing that would be a huge loss on 
Opterons.


Perhaps we should just sleep earlier, i.e. lower MAX_SPINS_PER_DELAY. 
That way, even if each TAS_SPIN test is very expensive, we don't spend 
too much time spinning if the lock is really busy, or held by a process 
that is sleeping.
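
Schematically, that knob controls a loop of this shape (a simplified 
sketch of the generic spin-and-delay logic; the constant and the fixed 
usleep are placeholders, not the real s_lock.c values):

#include <unistd.h>

typedef unsigned char slock_t;

#define MAX_SPINS_PER_DELAY	1000	/* placeholder value */

/* non-locked test first; locked exchange only when the lock looks free */
#define TAS_SPIN(lock)	(*(lock) ? 1 : (int) __sync_lock_test_and_set((lock), 1))

static void
spin_until_acquired(volatile slock_t *lock)
{
	int			spins = 0;

	while (TAS_SPIN(lock))
	{
		if (++spins >= MAX_SPINS_PER_DELAY)
		{
			usleep(1000);	/* back off; the real code varies the delay */
			spins = 0;
		}
	}
}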


- Heikki




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-19 Thread Heikki Linnakangas

On 18.05.2013 03:52, Dickson S. Guedes wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when I run pgbench -S -c64 -j64 -T60 -M prepared on a
32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
   - tas
      - 99.83% s_lock
         - 53.22% LWLockAcquire
            + 99.87% GetSnapshotData
         - 46.78% LWLockRelease
             GetSnapshotData
            + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...


I'd like to test this here but I couldn't reproduce that perf output
here on 64-core or 24-core machines; could you post the changes to
postgresql.conf and the perf arguments that you used?


Sure, here are the non-default postgresql.conf settings:


            name            |             current_setting
----------------------------+------------------------------------------
 application_name           | psql
 autovacuum                 | off
 checkpoint_segments        | 70
 config_file                | /home/hlinnakangas/data/postgresql.conf
 data_directory             | /home/hlinnakangas/data
 default_text_search_config | pg_catalog.english
 hba_file                   | /home/hlinnakangas/data/pg_hba.conf
 ident_file                 | /home/hlinnakangas/data/pg_ident.conf
 lc_collate                 | en_US.UTF-8
 lc_ctype                   | en_US.UTF-8
 log_timezone               | US/Pacific
 logging_collector          | on
 max_connections            | 100
 max_stack_depth            | 2MB
 server_encoding            | UTF8
 shared_buffers             | 2GB
 synchronous_commit         | off
 TimeZone                   | US/Pacific
 transaction_deferrable     | off
 transaction_isolation      | read committed
 transaction_read_only      | off
 wal_buffers                | 16MB

While pgbench was running, I ran this:

perf record -p 6050 -g -e cpu-clock

to connect to one of the backends. (I used cpu-clock, because the 
default cpu-cycles event didn't work on the box)


- Heikki




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-17 Thread Dickson S. Guedes

On 13-05-2013 09:50, Heikki Linnakangas wrote:
 I've been working on-and-off on the WAL-insert scaling patch. It's
 in pretty good shape now, and I'll post it shortly, but one thing I
 noticed is that it benefits a lot from using an atomic
 compare-and-swap instruction for the contention-critical part.
 
 I realized that we could also use compare-and-swap to make LWLocks
 scale better. The LWLock struct is too large to compare-and-swap
 atomically, but we can still use CAS to increment/decrement the
 shared/exclusive counters, when there's no need to manipulate the
 wait queue. That would help with workloads where you have a lot of
 CPUs, and a lot of backends need to acquire the same lwlock in
 shared mode, but there's no real contention (ie. few exclusive
 lockers).
 
 pgbench -S is such a workload. With 9.3beta1, I'm seeing this
 profile, when I run pgbench -S -c64 -j64 -T60 -M prepared on a
 32-core Linux machine:
 
 -  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
              GetSnapshotData
             + GetTransactionSnapshot
 +   2.97%  postgres  postgres           [.] tas
 +   1.53%  postgres  libc-2.13.so       [.] 0x119873
 +   1.44%  postgres  postgres           [.] GetSnapshotData
 +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
 +   1.18%  postgres  postgres           [.] AllocSetAlloc
 ...

I'd like to test this here but I couldn't reproduce that perf output
here on 64-core or 24-core machines; could you post the changes to
postgresql.conf and the perf arguments that you used?

Thanks!

[]s
-- 
Dickson S. Guedes
mail/xmpp: gue...@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://github.net/guedes - twitter: @guediz




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-16 Thread Stephen Frost
* Heikki Linnakangas (hlinnakan...@vmware.com) wrote:
 My theory is that after that point all the cores are busy,
 and processes start to be sometimes context switched while holding
 the spinlock, which kills performance. Has anyone else seen that
 pattern? 

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above?  Have you seen those
threads in the past?  Any thoughts about moving in that direction?
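
For context, "kernel-backed" usually means a futex on Linux. A minimal 
sketch, following the well-known three-state futex mutex pattern 
(Linux-only, and an illustration rather than a concrete proposal):

#include <stdint.h>
#include <unistd.h>
#include <linux/futex.h>
#include <sys/syscall.h>

/* lock word states: 0 = free, 1 = locked, 2 = locked with waiters */

static void
futex_lock(volatile uint32_t *lock)
{
	uint32_t	c = __sync_val_compare_and_swap(lock, 0, 1);

	while (c != 0)
	{
		/* mark contended, then let the kernel park us until woken */
		if (c == 2 || __sync_val_compare_and_swap(lock, 1, 2) != 0)
			syscall(SYS_futex, lock, FUTEX_WAIT, 2, NULL, NULL, 0);
		c = __sync_val_compare_and_swap(lock, 0, 2);
	}
}

static void
futex_unlock(volatile uint32_t *lock)
{
	/* if anyone might be sleeping, ask the kernel to wake one waiter */
	if (__sync_fetch_and_sub(lock, 1) != 1)
	{
		*lock = 0;
		syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
	}
}

The point is that the kernel knows who is waiting on what, which is 
exactly the knowledge a userspace spin loop lacks when the lock holder 
gets preempted.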

 Curiously, I don't see that when connecting pgbench via TCP
 over localhost, only when connecting via unix domain sockets.
 Overall performance is higher over unix domain sockets, so I guess
 the TCP layer adds some overhead, hurting performance, and also
 affects scheduling somehow, making the steep drop go away.

I wonder if the kernel locks around unix domain sockets are helping us
out here, while it's not able to take advantage of such knowledge about
the process that's waiting when it's a TCP connection?  Just a hunch.

Thanks,

Stephen




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-16 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 Isn't this the same issue which has prompted multiple people to propose
 (sometimes with code, as I recall) to rip out our internal spinlock
 system and replace it with kernel-backed calls which do it better,
 specifically by dealing with issues like the above?  Have you seen those
 threads in the past?  Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters".  While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.

regards, tom lane




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-15 Thread Daniel Farina
On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
 I run pgbench -S -c64 -j64 -T60 -M prepared on a 32-core Linux machine:

 -  64.09%  postgres  postgres   [.] tas
- tas
   - 99.83% s_lock
  - 53.22% LWLockAcquire
 + 99.87% GetSnapshotData
  - 46.78% LWLockRelease
   GetSnapshotData
 + GetTransactionSnapshot
 +   2.97%  postgres  postgres   [.] tas
 +   1.53%  postgres  libc-2.13.so   [.] 0x119873
 +   1.44%  postgres  postgres   [.] GetSnapshotData
 +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
 +   1.18%  postgres  postgres   [.] AllocSetAlloc
 ...

 So, on this test, a lot of time is wasted spinning on the mutex of
 ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
 surprisingly steep drop in performance once you go beyond 29 clients
 (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
 is that after that point all the cores are busy, and processes start to be
 sometimes context switched while holding the spinlock, which kills
 performance.

I have, I also used linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes > number of processors.  It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time.  I figured preemption while in
the spinlock was to blame at the time, given the extreme nature.




Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-15 Thread Daniel Farina
On Wed, May 15, 2013 at 3:08 PM, Daniel Farina dan...@heroku.com wrote:
 On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
 I run pgbench -S -c64 -j64 -T60 -M prepared on a 32-core Linux machine:

 -  64.09%  postgres  postgres   [.] tas
- tas
   - 99.83% s_lock
  - 53.22% LWLockAcquire
 + 99.87% GetSnapshotData
  - 46.78% LWLockRelease
   GetSnapshotData
 + GetTransactionSnapshot
 +   2.97%  postgres  postgres   [.] tas
 +   1.53%  postgres  libc-2.13.so   [.] 0x119873
 +   1.44%  postgres  postgres   [.] GetSnapshotData
 +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
 +   1.18%  postgres  postgres   [.] AllocSetAlloc
 ...

 So, on this test, a lot of time is wasted spinning on the mutex of
 ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
 surprisingly steep drop in performance once you go beyond 29 clients
 (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
 is that after that point all the cores are busy, and processes start to be
 sometimes context switched while holding the spinlock, which kills
 performance.

I accidentally dropped some important last words from Heikki's mail,
which makes my correspondence pretty bizarre to understand at the
outset.  Apologies.  He wrote:

 [...] Has anyone else seen that pattern?

 I have, I also used linux perf to come to this conclusion, and my
 determination was similar: a system was undergoing increasingly heavy
 load, in this case with processes > number of processors.  It was
 also a phase-change type of event: at one moment everything would be
 going great, but once a critical threshold was hit, s_lock would
 consume an enormous amount of CPU time.  I figured preemption while in
 the spinlock was to blame at the time, given the extreme nature.




[HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-13 Thread Heikki Linnakangas
I've been working on-and-off on the WAL-insert scaling patch. It's in 
pretty good shape now, and I'll post it shortly, but one thing I noticed 
is that it benefits a lot from using an atomic compare-and-swap 
instruction for the contention-critical part.


I realized that we could also use compare-and-swap to make LWLocks scale 
better. The LWLock struct is too large to compare-and-swap atomically, 
but we can still use CAS to increment/decrement the shared/exclusive 
counters, when there's no need to manipulate the wait queue. That would 
help with workloads where you have a lot of CPUs, and a lot of backends 
need to acquire the same lwlock in shared mode, but there's no real 
contention (ie. few exclusive lockers).


pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, 
when I run pgbench -S -c64 -j64 -T60 -M prepared on a 32-core Linux 
machine:


-  64.09%  postgres  postgres   [.] tas
   - tas
  - 99.83% s_lock
 - 53.22% LWLockAcquire
+ 99.87% GetSnapshotData
 - 46.78% LWLockRelease
  GetSnapshotData
+ GetTransactionSnapshot
+   2.97%  postgres  postgres   [.] tas
+   1.53%  postgres  libc-2.13.so   [.] 0x119873
+   1.44%  postgres  postgres   [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres   [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of 
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a 
surprisingly steep drop in performance once you go beyond 29 clients 
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My 
theory is that after that point all the cores are busy, and processes 
start to be sometimes context switched while holding the spinlock, which 
kills performance. Has anyone else seen that pattern? Curiously, I don't 
see that when connecting pgbench via TCP over localhost, only when 
connecting via unix domain sockets. Overall performance is higher over 
unix domain sockets, so I guess the TCP layer adds some overhead, 
hurting performance, and also affects scheduling somehow, making the 
steep drop go away.


Using a compare-and-swap instruction in place of spinning solves the 
problem (green line in attached graph). This needs some more testing 
with different workloads to make sure it doesn't hurt other scenarios, 
but I don't think it will. I'm also waiting for a colleague to test this 
on a different machine, where he saw a similar steep drop in performance.


The attached patch is still work-in-progress. There needs to be a 
configure test and fallback to spinlock if a CAS instruction is not 
available. I used the gcc __sync_val_compare_and_swap() builtin 
directly, that needs to be abstracted. Also, in the case that the wait 
queue needs to be manipulated, the code spins on the CAS instruction, 
but there is no delay mechanism like there is on a regular spinlock; 
that needs to be added in somehow.


- Heikki
attachment: pgbench-lwlock-cas-local-clients-sets.png
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4f88d3f..0d92a30 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -36,13 +36,27 @@
 /* We use the ShmemLock spinlock to protect LWLockAssign */
 extern slock_t *ShmemLock;
 
+/*
+ * The part of LWLock struct that needs to be swappable with an atomic
+ * compare-and-swap instruction. Assuming we have an atomic 64-bit CAS
+ * instruction, sizeof(cas_t) must be = 8.
+ */
+typedef union
+{
+	struct
+	{
+		bool	mutex;			/* Protects LWLock and queue of PGPROCs */
+		bool	releaseOK;		/* T if ok to release waiters */
+		bool	waiters;		/* T if wait queue is not empty (head != NULL) */
+		char	exclusive;		/* # of exclusive holders (0 or 1) */
+		uint32	shared;			/* # of shared holders (0..MaxBackends) */
+	} f;
+	uint64 val;
+} LWLock_cas_t;
 
 typedef struct LWLock
 {
-	slock_t		mutex;			/* Protects LWLock and queue of PGPROCs */
-	bool		releaseOK;		/* T if ok to release waiters */
-	char		exclusive;		/* # of exclusive holders (0 or 1) */
-	int			shared;			/* # of shared holders (0..MaxBackends) */
+	LWLock_cas_t		casval;
 	PGPROC	   *head;			/* head of list of waiting PGPROCs */
 	PGPROC	   *tail;			/* tail of list of waiting PGPROCs */
 	/* tail is undefined when head is NULL */
@@ -267,6 +281,9 @@ CreateLWLocks(void)
 	char	   *ptr;
 	int			id;
 
+	StaticAssertStmt(sizeof(LWLock_cas_t) == sizeof(uint64),
+					 "unexpected size of LWLock_cas_t");
+
 	/* Allocate space */
 	ptr = (char *) ShmemAlloc(spaceLocks);
 
@@ -283,10 +300,10 @@ CreateLWLocks(void)
 	 */
	for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
 	{
-		SpinLockInit(&lock->lock.mutex);
-		lock->lock.releaseOK = true;
-		lock->lock.exclusive = 0;
-		lock->lock.shared = 0;
+		lock->lock.casval.val = 0; /* clear possible padding */
+		lock->lock.casval.f.releaseOK = true;
+		

Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-13 Thread Merlin Moncure
On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 The attached patch is still work-in-progress. There needs to be a configure
 test and fallback to spinlock if a CAS instruction is not available. I used
 the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
 abstracted. Also, in the case that the wait queue needs to be manipulated,
 the code spins on the CAS instruction, but there is no delay mechanism like
 there is on a regular spinlock; that needs to be added in somehow.

These are really interesting results. Why is the CAS method so much
faster than TAS?  Did you see any contention?

merlin

