[HACKERS] Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-10-13 Thread Amir Rohan
On 10/11/2015 03:20 AM, Peter Geoghegan wrote:
> On Thu, Sep 3, 2015 at 5:35 PM, David Rowley
>  wrote:
>> My test cases are:
> 
> Note that my text caching and unsigned integer comparison patches have
> moved the baseline down quite noticeably. I think that my mobile
> processor out-performs the Xeon you used for this, which seems a
> little odd even taken the change in baseline performance into account.
> 

To add a caveat not yet mentioned, the idea behind  prefetching is to
scarifice spare memory bandwidth for performance. That can be
a winning bet on a quiet box (the one we benchmark on), but can backfire
on production db when the extra memory pressure can degrade all running
queries.  Something to test for, or at least keep in mind.

>> set work_mem ='1GB';
>> create table t1 as select md5(random()::text) from
>> generate_series(1,1000);
>>
>> Times are in milliseconds. Median and average over 10 runs.
>>
>> Test 1
>

I am the reluctant owner of outmoded hardware. Namely a core2 from
around 2007 on plain spinning metal. My results (linux 64bit):

--
Test 1
--
set work_mem ='1GB';
select count(distinct md5) from t1;

== Master ==
42771.040 ms <- outlier?
41704.570 ms
41631.660 ms
41421.877 ms

== Patch ==
42469.911 ms  <- outlier?
41378.556 ms
41375.870 ms
41118.105 ms
41096.283 ms
41095.705 ms

--
Test 2
--
select sum(rn) from (select row_number() over (order by md5) rn from t1) a;

== Master ==
44186.775 ms
44137.154 ms
44111.271 ms
44061.896 ms
44109.122 ms

== Patch ==
44592.590 ms
44403.076 ms
44276.170 ms


very slight difference in an ambiguous direction, but also no perf
catastrophe.

> It's worth considering that for some (possibly legitimate) reason, the
> built-in function call is ignored by your compiler, since GCC has
> license to do that. You might try this on both master and patched
> builds:
> 
> ~/postgresql/src/backend/utils/sort$ gdb -batch -ex 'file tuplesort.o'
> -ex 'disassemble tuplesort_gettuple_common' > prefetch_disassembly.txt
> 
> ...
> 
> Notably, there is a prefetchnta instruction here.
> 

I have verified the prefetech is emitted in the disassembly.

An added benefit of owning outmoded hardware is that the MSR for this
generation is public and I can disable individual prefetcher units by
twiddeling a bit.

Disabling the "HW prefetch" or the "DCU prefetch" units on a pacthed
version gave results that look relatively unchanged, which seems promising.
Disabling them both at once on an unpatched version shows a slowdown of
5-6% in test1 (43347.181, 43898.705, 43399.428). That gives an
indication of maximum potential gains in this direction, for this box
at least.

Finally, I notice my results are 4x slower than everyone else's.
That can be very tough on a man's pride, let me tell you.

Amir



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

2015-07-16 Thread Peter Geoghegan
On Thu, Jul 16, 2015 at 4:01 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch adds a portable Postgres wrapper on the GCC intrinsic.
 It also adds a client within tuplesort.c -- a common function that
 seems like a good generic choke point. I can get a speed-up of 6% - 9%
 for all of these cases by prefetching a few SortTuples ahead, for the
 tuple proper. (In-memory sorts only.)

I added a silly bug during last minute clean-up. I attach a V2.

-- 
Peter Geoghegan
From 5d3093e9c0d10f4140072da8d18d0dddf6b84b1e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan peter.geoghega...@gmail.com
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH] Prefetch from memtuples array in tuplesort

Testing shows that prefetching the tuple proper of a slightly later
SortTuple in the memtuples array during each of many sequential,
in-logical-order SortTuple fetches speeds up various sorting intense
operations considerably.  For example, B-Tree index builds are
accelerated as leaf pages are created from the memtuples array.
(i.e.  The operation following actually performing the sort, but
before a tuplesort_end() call is made as a B-Tree spool is destroyed.)

Similarly, ordered set aggregates (all cases except the datumsort case
with a pass-by-value type), and regular heap tuplesorts benefit to about
the same degree.  The optimization is only used when sorts fit in
memory, though.

Also, prefetch a few places ahead within the analogous fetching point
in tuplestore.c.  This appears to offer similar benefits in certain
cases.  For example, queries involving large common table expressions
significantly benefit.
---
 config/c-compiler.m4| 17 +
 configure   | 30 ++
 configure.in|  1 +
 src/backend/utils/sort/tuplesort.c  | 18 ++
 src/backend/utils/sort/tuplestore.c | 11 +++
 src/include/c.h | 19 +++
 src/include/pg_config.h.in  |  3 +++
 src/include/pg_config.h.win32   |  3 +++
 8 files changed, 102 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 050bfa5..5776201 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -287,6 +287,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x$pgac_cv__builtin_prefetch = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+  [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index 77048e8..f14e65d 100755
--- a/configure
+++ b/configure
@@ -11248,6 +11248,36 @@ if test x$pgac_cv__builtin_unreachable = xyes ; then
 $as_echo #define HAVE__BUILTIN_UNREACHABLE 1 confdefs.h
 
 fi
+{ $as_echo $as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch 5
+$as_echo_n checking for __builtin_prefetch...  6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n (cached)  6
+else
+  cat confdefs.h - _ACEOF conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(i, 0, 3);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile $LINENO; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo $as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch 5
+$as_echo $pgac_cv__builtin_prefetch 6; }
+if test x$pgac_cv__builtin_prefetch = xyes ; then
+
+$as_echo #define HAVE__BUILTIN_PREFETCH 1 confdefs.h
+
+fi
 { $as_echo $as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__ 5
 $as_echo_n checking for __VA_ARGS__...  6; }
 if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index 5724a4d..2132d69 100644
--- a/configure.in
+++ b/configure.in
@@ -1318,6 +1318,7 @@ PGAC_C_TYPES_COMPATIBLE
 PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_CONSTANT_P
 PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
 PGAC_C_VA_ARGS
 PGAC_STRUCT_TIMEZONE
 PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 435041a..70b53d1 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1663,6 +1663,24 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 if (state-current  state-memtupcount)
 {
 	*stup = state-memtuples[state-current++];
+
+	/*
+	 * Perform memory prefetch of tuple proper of the
+	 * SortTuple that's