Re: [HACKERS] avoiding tuple copying in btree index builds

2014-06-08 Thread Amit Kapila
On Wed, Jun 4, 2014 at 2:08 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Jun 1, 2014 at 3:26 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

  I also think it's possible to have a similar optimization for hash
  indexes, in case the index has to spool the tuple for sorting.
 
  In function hashbuildCallback(), when buildstate->spool is true, we
  can avoid forming the index tuple.  To check for nulls before calling
  _h_spool(), we can traverse the isnull array.

 Hmm, that might work.  Arguably it's less efficient, but on the other
 hand if it avoids forming the tuple sometimes it might be MORE
 efficient.  And anyway the difference might not be enough to matter.

Considering the fact that for hash indexes this optimization would
only get triggered for large indexes (at least larger than shared_buffers),
I agree that it will not be as useful as it will be for btree indexes.
Another minor advantage would have been that we could remove
tuplesort_putindextuple() from the code if this optimization were done
for hash indexes.  However, we can leave it as it is for now, as there
doesn't seem to be any gain in doing so.
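
To illustrate the idea being discussed, a minimal sketch of what
hashbuildCallback() could do.  The _h_spool() signature taking the
values/isnull arrays is assumed here, mirroring what the patch does for
_bt_spool(); this is not actual patch code:

static void
hashbuildCallback(Relation index, HeapTuple htup, Datum *values,
                  bool *isnull, bool tupleIsAlive, void *state)
{
    HashBuildState *buildstate = (HashBuildState *) state;
    int         i;

    /* hash indexes never store NULL keys; check isnull directly */
    for (i = 0; i < RelationGetNumberOfAttributes(index); i++)
        if (isnull[i])
            return;

    if (buildstate->spool)
    {
        /* no index_form_tuple() here: spool the raw datums instead */
        _h_spool(buildstate->spool, &htup->t_self, values, isnull);
        buildstate->indtuples += 1;
    }
    /* (the non-spool path would form and insert the tuple as today) */
}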


You seem to have removed the puttuple_common() call from
tuplesort_putindextuple(); is there any reason for doing so?


Apart from that patch looks good to me.


Below performance data on various sizes of indexes shows that this
patch can improve performance by up to ~7%.  The performance
difference shrinks when the index size is very big, probably because
all the data has to be written out at the end of the operation, so for
big data sets the improvement is masked by the large I/O.  The
improvement is computed from the median of 3 runs, with times taken
via psql's \timing option.


Performance Data
---
Configuration:
IBM POWER-7, 16 cores, 64 hardware threads
RAM = 64GB
shared_buffers = 8GB

pgbench_accounts | 10 million records | index on integer | Reindex
  Master: 16043.500 ms, 16058.723 ms, 15941.057 ms
  Patch:  15525.054 ms, 15551.935 ms, 15492.879 ms
  Improvement: 3.43%

pgbench_accounts | 30 million records | index on integer | Reindex
  Master: 51258.338 ms, 50520.328 ms, 50562.022 ms
  Patch:  49610.710 ms, 49302.571 ms, 49301.390 ms
  Improvement: 2.41%

table (c1 int, c2 char(10)) | 300,000 records | index on char(10) | Reindex
  Master: 443.584 ms, 444.798 ms, 452.888 ms
  Patch:  421.554 ms, 430.528 ms, 447.558 ms
  Improvement: 3.2%

table (c1 int, c2 char(10)) | 500,000 records | index on char(10) | Reindex
  Master: 663.621 ms, 661.299 ms, 657.754 ms
  Patch:  652.325 ms, 644.782 ms, 643.218 ms
  Improvement: 2.5%

table (c1 int, c2 char(10)) | 1,000,000 records | index on char(10) | Reindex
  Master: 16554.076 ms, 16686.528 ms, 16571.129 ms
  Patch:  16556.852 ms, 16513.543 ms, 16610.615 ms
  Improvement: less than 1%

table (c1 int, c2 char(20)) | 300,000 records | index on char(20) | Reindex
  Master: 429.670 ms, 441.445 ms, 411.539 ms
  Patch:  401.801 ms, 412.716 ms, 395.002 ms
  Improvement: 6.48%

table (c1 int, c2 char(20)) | 500,000 records | index on char(20) | Reindex
  Master: 724.541 ms, 731.582 ms, 704.934 ms
  Patch:  686.004 ms, 677.361 ms, 686.915 ms
  Improvement: 5.31%

table (c1 int, c2 char(20)) | 1,000,000 records | index on char(20) | Reindex
  Master: 20728.758 ms, 20665.289 ms, 20656.375 ms
  Patch:  20594.022 ms, 20617.383 ms, 20628.181 ms
  Improvement: 0.2%

Apart from this, I have taken data for unlogged tables and for much
larger indexes, and all of it shows results similar to the above.
So I think the above data is a fair representation of the value of
this patch; however, if you or anybody else still feels that more
data is required for any other configuration, please do let me
know.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] cancelling statement due to user request error occurs but the transaction has committed.

2014-06-08 Thread Amit Kapila
On Fri, Jun 6, 2014 at 2:11 PM, Naoya Anzai anzai-na...@mxu.nes.nec.co.jp
wrote:

 Hi All,

 When log_duration is true (or log_min_duration_statement = 0),
 if a transaction that has internally been committed receives a SIGINT
 signal, then a query cancellation error is output.

 For example,
 1. A query like TRUNCATE is removing large table files.
 2. The session receives a SIGINT signal.
 3. A query cancellation error occurs.
 4. But the query has committed.


 naoya=# truncate hoge;
 Cancel request sent
 ERROR:  canceling statement due to user request
 naoya=# select count(*) from hoge;
  count
 ---
  0
 (1 row)
 ---

 This is because the ProcessInterrupts function is called by errfinish()
 (in the query-duration ereport).

 I think this cancellation request must not interrupt an internally
 committed transaction, because clients may misunderstand that the
 transaction has been rolled back.

A similar situation can arise if the server goes down (power
outage or the like) after committing a transaction: the client will
see a broken connection, so it can misunderstand that as well.
I think for such corner cases the client needs to reconfirm the
result of its action with the database before concluding anything.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Proposing pg_hibernate

2014-06-08 Thread Amit Kapila
On Fri, Jun 6, 2014 at 5:31 PM, Gurjeet Singh gurj...@singh.im wrote:
 On Thu, Jun 5, 2014 at 11:32 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

  Buffer saver process itself can crash while saving or restoring
  buffers.

 True. That may lead to a partial list of buffers being saved. And the
 code in the Reader process tries hard to read only valid data, and punts
 at the first sight of data that doesn't make sense or on an ERROR raised
 from a Postgres API call.

In spite of the Reader process trying hard, I think we should ensure by
some other means that the file saved by the buffer saver is valid (maybe
first write to a temp file and then rename it, or something else).
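
As a sketch of the write-to-temp-then-rename idea (file naming and error
handling are assumptions here; real code would use ereport() and likely
fsync the containing directory too):

#include <stdio.h>
#include <unistd.h>

static int
save_buffer_list(const char *final_path, const void *buf, size_t len)
{
    char        tmp_path[1024];
    FILE       *fp;
    int         ok;

    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", final_path);
    if ((fp = fopen(tmp_path, "wb")) == NULL)
        return -1;
    ok = (fwrite(buf, 1, len, fp) == len &&
          fflush(fp) == 0 &&
          fsync(fileno(fp)) == 0);
    ok = (fclose(fp) == 0) && ok;
    /* rename() is atomic on POSIX: readers see old or new, never partial */
    if (!ok || rename(tmp_path, final_path) != 0)
    {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}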

  IIUC on shutdown request, postmaster will send signal to BG Saver
  and BG Saver will save the buffers and then postmaster will send
  signal to checkpointer to shutdown.  So before writing Checkpoint
  record, BG Saver can crash (it might have saved half the buffers)

 Case handled as described above.

  or may BG saver saves buffers, but checkpointer crashes (due to
  power outage or any such thing).

 Checkpointer process' crash seems to be irrelevant to Postgres
 Hibernator's  workings.

Yeah, but if it crashes before writing the checkpoint record, it will lead
to recovery, which is the case we are considering.

 I think you are trying to argue the wording in my claim save-files
 are created only on clean shutdowns; not on a crash or immediate
 shutdown, by implying that a crash may occur at any time during and
 after the BufferSaver processing. I agree the wording can be improved.

Not only the wording: in your above mail, Cases 2 and 1b would need to
load buffers and perform recovery as well, so we need to decide which
one to give preference.

So if you agree that we should take the recovery data into consideration
along with the saved-file data, then I think we have the options below to
consider:

1. Have a provision for the user to specify which data (recovery or
previously cached blocks) should be considered more important,
and then load buffers before or after recovery based on that
input.

2. Always load buffers before recovery and mention in the docs that
users can expect servers to take longer to start when they enable
this extension, along with the advantages of the same.

3. Always load buffers after recovery and mention in the docs that
with this extension enabled, the cached data might be discarded by
recovery or by the initial few operations done by the user.

4. Have an API exposed by the BufMgr module such that the buffer
loader will only use buffers from the freelist.

Based on the opinions of others, I think we can decide on one of these,
or on any better alternative.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] Allowing NOT IN to use ANTI joins

2014-06-08 Thread David Rowley
Currently pull_up_sublinks_qual_recurse only changes the plan for NOT
EXISTS queries and leaves NOT IN alone. The reason is that the
values returned by a subquery in the IN clause could contain NULLs.

A simple example of this (without a subquery) is:

select 1 where 3 not in (1, 2, null); returns 0 rows because 3 <> NULL is
unknown.

The attached patch allows an ANTI-join plan to be generated in cases like:

CREATE TABLE a (id INT PRIMARY KEY, b_id INT NOT NULL);
CREATE TABLE b (id INT NOT NULL);

SELECT * FROM a WHERE b_id NOT IN(SELECT id FROM b);

To generate a plan like:
                           QUERY PLAN
-----------------------------------------------------------------
 Hash Anti Join  (cost=64.00..137.13 rows=1070 width=8)
   Hash Cond: (a.b_id = b.id)
   ->  Seq Scan on a  (cost=0.00..31.40 rows=2140 width=8)
   ->  Hash  (cost=34.00..34.00 rows=2400 width=4)
         ->  Seq Scan on b  (cost=0.00..34.00 rows=2400 width=4)

But if we then do:
ALTER TABLE b ALTER COLUMN id DROP NOT NULL;

The plan will go back to the current behaviour of:

                         QUERY PLAN
-----------------------------------------------------------
 Seq Scan on a  (cost=40.00..76.75 rows=1070 width=8)
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  Seq Scan on b  (cost=0.00..34.00 rows=2400 width=4)
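
For readers following along, the proof the planner needs is roughly the
one sketched below.  All helper names here are assumed for illustration;
in particular, get_attnotnull() is a hypothetical catalog lookup, not an
existing lsyscache function:

/*
 * A NOT IN is safe to convert to an ANTI join only if the subquery's
 * targetlist provably cannot produce a NULL.
 */
static bool
targetlist_is_provably_nonnull(Query *subquery)
{
    ListCell   *lc;

    foreach(lc, subquery->targetList)
    {
        TargetEntry *tle = (TargetEntry *) lfirst(lc);
        Var        *var;
        RangeTblEntry *rte;

        if (tle->resjunk)
            continue;
        if (!IsA(tle->expr, Var))
            return false;       /* give up: cannot prove non-nullness */
        var = (Var *) tle->expr;
        rte = rt_fetch(var->varno, subquery->rtable);
        if (rte->rtekind != RTE_RELATION)
            return false;
        if (!get_attnotnull(rte->relid, var->varattno))
            return false;       /* column lacks a NOT NULL constraint */
    }
    return true;
}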

Comments are welcome

Regards

David Rowley


not_in_anti_join_v0.4.patch
Description: Binary data



Re: [HACKERS] Scaling shared buffer eviction

2014-06-08 Thread Kevin Grittner
Amit Kapila amit.kapil...@gmail.com wrote:

 I have improved the patch by making the following changes:
 a. Improved the bgwriter logic to log the xl_running_xacts info and
    removed the hibernate logic, as bgwriter will now work only when
    there is a scarcity of buffers in the free list. The basic idea is:
    when the number of buffers on the freelist drops below the low
    threshold, the allocating backend sets the latch, and bgwriter
    wakes up and begins adding buffers to the freelist until it reaches
    the high threshold, and then goes back to sleep.

The numbers from your benchmarks are very exciting, but the above
concerns me.  My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist, but toward
preventing shared_buffers from accumulating a lot of dirty pages,
which were leading to cascades of writes between caches and thus to
write stalls.  By pushing dirty pages into the (*much* larger) OS
cache, and letting write combining happen there, where the OS could
pace based on the total number of dirty pages instead of having
some hidden and appearing rather suddenly, latency spikes were
avoided while not causing any noticeable increase in the number of
OS writes to the RAID controller's cache.

Essentially I was able to tune the bgwriter so that a dirty page
was always pushed out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter. Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Securing make check (CVE-2014-0067)

2014-06-08 Thread Noah Misch
On Sun, Mar 23, 2014 at 07:04:20PM -0400, Noah Misch wrote:
 On Thu, Mar 06, 2014 at 11:52:22PM -0500, Noah Misch wrote:
  On Thu, Mar 06, 2014 at 12:44:34PM -0500, Tom Lane wrote:
   I'm inclined to suggest that we should put the socket under $CWD by
   default, but provide some way for the user to override that choice.
   If they want to put it in /tmp, it's on their head as to how secure
   that is.  On most modern platforms it'd be fine.
  
  I am skeptical about the value of protecting systems with non-sticky
  /tmp, but long $CWD isn't of great importance, either.  I'm fine with
  your suggestion.  Though the $CWD or one of its parents could be
  world-writable, that would typically mean an attacker could just
  replace the test cases directly.
 
 Here's the patch.  The temporary data directory makes for a convenient socket
 directory; initdb already gives it mode 0700, and we have existing
 arrangements to purge it when finished.  One can override the socket directory
 by defining PG_REGRESS_SOCK_DIR in the environment.

Socket path length limitations thwarted that patch:
http://www.postgresql.org/message-id/flat/e1wtnv2-00047s...@gemulon.postgresql.org

Here's an update that places the socket in a temporary subdirectory of /tmp.
The first attached patch adds NetBSD mkdtemp() to libpgport.  The second,
principal, patch uses mkdtemp() to implement this design in pg_regress.  The
corresponding change to contrib/pg_upgrade/test.sh is based on the configure
script's arrangements for its temporary directory.
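
For anyone unfamiliar with mkdtemp(), a minimal standalone usage example
(not pg_regress code): the trailing XXXXXX is replaced in place, and the
directory is created with mode 0700.

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    char        template[] = "/tmp/pg_regress-XXXXXX";
    char       *sockdir = mkdtemp(template);

    if (sockdir == NULL)
    {
        perror("mkdtemp");
        return 1;
    }
    printf("socket directory: %s\n", sockdir);
    return 0;
}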

NetBSD's mkdtemp() has assertions, and I initially mapped its assertion macro
to our Assert().  However, a bug in our MinGW build process causes build
failures if an Assert() call appears in libpgport.  I will post about that in
a new thread.  The affected assertions were uncompelling, so I dropped them.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
diff --git a/configure b/configure
index ed1ff0a..f8232db 100755
--- a/configure
+++ b/configure
@@ -11650,6 +11650,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "mkdtemp" "ac_cv_func_mkdtemp"
+if test "x$ac_cv_func_mkdtemp" = xyes; then :
+  $as_echo "#define HAVE_MKDTEMP 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" mkdtemp.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS mkdtemp.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "random" "ac_cv_func_random"
 if test "x$ac_cv_func_random" = xyes; then :
   $as_echo "#define HAVE_RANDOM 1" >>confdefs.h
diff --git a/configure.in b/configure.in
index 80df1d7..c95e2cd 100644
--- a/configure.in
+++ b/configure.in
@@ -1357,7 +1357,7 @@ else
   AC_CHECK_FUNCS([fpclass fp_class fp_class_d class], [break])
 fi
 
-AC_REPLACE_FUNCS([crypt fls getopt getrusage inet_aton random rint srandom strerror strlcat strlcpy])
+AC_REPLACE_FUNCS([crypt fls getopt getrusage inet_aton mkdtemp random rint srandom strerror strlcat strlcpy])
 
 case $host_os in
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 5ff9e41..4fb7288 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -330,6 +330,9 @@
 /* Define to 1 if the system has the type `MINIDUMP_TYPE'. */
 #undef HAVE_MINIDUMP_TYPE
 
+/* Define to 1 if you have the `mkdtemp' function. */
+#undef HAVE_MKDTEMP
+
 /* Define to 1 if you have the netinet/in.h header file. */
 #undef HAVE_NETINET_IN_H
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index e6e3c8d..58777ca 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -249,6 +249,9 @@
 /* Define to 1 if the system has the type `MINIDUMP_TYPE'. */
 #define HAVE_MINIDUMP_TYPE 1
 
+/* Define to 1 if you have the `mkdtemp' function. */
+/* #undef HAVE_MKDTEMP */
+
 /* Define to 1 if you have the netinet/in.h header file. */
 #define HAVE_NETINET_IN_H 1
 
diff --git a/src/include/port.h b/src/include/port.h
index c9226f3..3d97481 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -462,6 +462,9 @@ extern int  pg_check_dir(const char *dir);
 /* port/pgmkdirp.c */
 extern int pg_mkdir_p(char *path, int omode);
 
+/* port/mkdtemp.c */
+extern char *mkdtemp(char *path);
+
 /* port/pqsignal.c */
 typedef void (*pqsigfunc) (int signo);
 extern pqsigfunc pqsignal(int signo, pqsigfunc func);
diff --git a/src/port/mkdtemp.c b/src/port/mkdtemp.c
new file mode 100644
index 000..a5e991f
--- /dev/null
+++ b/src/port/mkdtemp.c
@@ -0,0 +1,293 @@
+/*-
+ *
+ * mkdtemp.c
+ *   create a mode-0700 temporary directory
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   src/port/mkdtemp.c
+ *
+ * This code was taken from NetBSD to provide an implementation for platforms
+ * that lack it.  (Among compatibly-licensed implementations, the OpenBSD
+ * version better resists denial-of-service attacks.  However, it has a
+ * cryptographic dependency.)  The NetBSD copyright 

[HACKERS] configure does not check for bison or flex

2014-06-08 Thread Eric L
I am installing PostgreSQL from source on 64-bit Ubuntu 14.04, and when I run
the ./configure script it is successful, but when I run make it fails with an
error:

ERROR: `flex' is missing on your system.  It is needed to create the file
`bootscanner.c'.  You can either get flex from a GNU mirror site or download an
official distribution of PostgreSQL, which contains pre-packaged flex output

I get this same error for bison.  I can install both with sudo apt-get install
bison and sudo apt-get install flex.  However, shouldn't the ./configure
script be checking for these first?
Eric

Re: [HACKERS] configure does not check for bison or flex

2014-06-08 Thread David G Johnston
Eric L wrote
 I am installing postgresql from source on 64 bit Ubuntu 14.04 and when I
 run the ./configure script, it is successful, but when I run make it fails
 with an error:
 
 ERROR: `flex' is missing on your system.  It is needed to create the file
 `bootscanner.c'.  You can either get flex from a GNU mirror site or
 download an official distribution of PostgreSQL, which contains
 pre-packaged flex output
 I get this same error for bison.  I can install it with sudo apt-get
 install bison and sudo apt-get install flex.  However shouldn't the
 ./configure script be checking for these first?
 Eric

Google:  postgresql bison

http://postgresql.1045698.n5.nabble.com/bison-flex-and-configure-td5789215.html

David J.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] MinGW/Cygwin build snags

2014-06-08 Thread Noah Misch
First, when I tried to add an Assert() call to a file in src/port, a MinGW-w64
build failed like this:

Creating library file: libpostgres.a
../../src/port/libpgport_srv.a(fls_srv.o): In function `fls':
/home/nm/src/pg/mingw-postgresql/src/port/fls.c:63: undefined reference to `__imp_assert_enabled'

Since src/makefiles/Makefile.win32 does not set BUILDING_DLL for src/port,
PGDLLIMPORT is set improperly for code to be linked directly into the backend.
Makefile.win32 does set BUILDING_DLL for src/common.  (Makefile.cygwin has the
same discrepancy, though I haven't checked whether it causes an actual build
failure there.  The MSVC build system sets BUILDING_DLL for both src/port and
src/common files.)  This affects any reference to a data symbol from src/port.
The fix is straightforward enough: cause Makefile.{win32,cygwin} to treat
src/port like src/common.
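
For context, the mechanism involved, simplified from the Windows port
header (the exact location of the macro may differ):

/*
 * Simplified from src/include/port/win32.h: code built with
 * BUILDING_DLL exports its data symbols; everything else imports them,
 * producing the __imp_ indirection seen in the link error above.
 */
#ifdef BUILDING_DLL
#define PGDLLIMPORT __declspec(dllexport)
#else
#define PGDLLIMPORT __declspec(dllimport)
#endif

/* e.g., a backend data symbol referenced from src/port code: */
extern PGDLLIMPORT bool assert_enabled;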


Second, src/template/{win32,cygwin} completely replaces LDFLAGS, so overriding
LDFLAGS on the configure command line is ineffective.  Those scripts should
instead add to the existing LDFLAGS, like other templates do for CPPFLAGS.
Several other templates completely override CFLAGS; that's undesirable for the
same reason.  I don't have ready access to those affected configurations, so
I'm leaving the CFLAGS damage alone.  A fix for the Makefile.win32 half of the
original problem appeared as part of a larger patch:
http://www.postgresql.org/message-id/flat/86sk8845pl@ds4.des.no


Both of these changes fix bugs, but I plan not to back-patch.  Builds that
work today won't see any change, and I found no other user reports.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
diff --git a/src/makefiles/Makefile.cygwin b/src/makefiles/Makefile.cygwin
index bd83e5f..bb2efed 100644
--- a/src/makefiles/Makefile.cygwin
+++ b/src/makefiles/Makefile.cygwin
@@ -28,6 +28,10 @@ ifneq (,$(findstring src/common,$(subdir)))
 override CPPFLAGS+= -DBUILDING_DLL
 endif
 
+ifneq (,$(findstring src/port,$(subdir)))
+override CPPFLAGS+= -DBUILDING_DLL
+endif
+
 ifneq (,$(findstring timezone,$(subdir)))
 override CPPFLAGS+= -DBUILDING_DLL
 endif
diff --git a/src/makefiles/Makefile.win32 b/src/makefiles/Makefile.win32
index b18621b..9cb84f2 100644
--- a/src/makefiles/Makefile.win32
+++ b/src/makefiles/Makefile.win32
@@ -27,6 +27,10 @@ ifneq (,$(findstring src/common,$(subdir)))
 override CPPFLAGS+= -DBUILDING_DLL
 endif
 
+ifneq (,$(findstring src/port,$(subdir)))
+override CPPFLAGS+= -DBUILDING_DLL
+endif
+
 ifneq (,$(findstring timezone,$(subdir)))
 override CPPFLAGS+= -DBUILDING_DLL
 endif
diff --git a/src/template/cygwin b/src/template/cygwin
index e484fe6..b6ea0de 100644
--- a/src/template/cygwin
+++ b/src/template/cygwin
@@ -6,4 +6,4 @@ SRCH_LIB=/usr/local/lib
 # pg_toupper() etc. in both libpq and pgport
 # we'd prefer to use --disable-auto-import to match MSVC linking behavior,
 # but support for it in Cygwin is too haphazard
-LDFLAGS="-Wl,--allow-multiple-definition -Wl,--enable-auto-import"
+LDFLAGS="$LDFLAGS -Wl,--allow-multiple-definition -Wl,--enable-auto-import"
diff --git a/src/template/win32 b/src/template/win32
index dc5b77e..7da9719 100644
--- a/src/template/win32
+++ b/src/template/win32
@@ -3,4 +3,4 @@
 # --allow-multiple-definition is required to link pg_dump because it finds
 # pg_toupper() etc. in both libpq and pgport
 # --disable-auto-import is to ensure we get MSVC-like linking behavior
-LDFLAGS="-Wl,--allow-multiple-definition -Wl,--disable-auto-import"
+LDFLAGS="$LDFLAGS -Wl,--allow-multiple-definition -Wl,--disable-auto-import"



Re: [HACKERS] MinGW/Cygwin build snags

2014-06-08 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 First, when I tried to add an Assert() call to a file in src/port, a MinGW-w64
 build failed like this:

 Creating library file: libpostgres.a
 ../../src/port/libpgport_srv.a(fls_srv.o): In function `fls':
 /home/nm/src/pg/mingw-postgresql/src/port/fls.c:63: undefined reference to 
 `__imp_assert_enabled'

FWIW, I think we had consensus to remove the assert_enabled variable
entirely.  Not that it wouldn't be good to fix this, but Assert per se
won't need it after we do that.

 Since src/makefiles/Makefile.win32 does not set BUILDING_DLL for src/port,
 PGDLLIMPORT is set improperly for code to be linked directly into the backend.
 Makefile.win32 does set BUILDING_DLL for src/common.  (Makefile.cygwin has the
 same discrepancy, though I haven't checked whether it causes an actual build
 failure there.  The MSVC build system sets BUILDING_DLL for both src/port and
 src/common files.)  This affects any reference to a data symbol from src/port.
 The fix is straightforward enough: cause Makefile.{win32,cygwin} to treat
 src/port like src/common.

I wonder whether these cases shouldn't distinguish between the frontend
and backend builds of src/port/ and src/common/.  In particular, it
seems odd that we're getting this type of failure in the backend build.

 Second, src/template/{win32,cygwin} completely replaces LDFLAGS, so overriding
 LDFLAGS on the configure command line is ineffective.  Those scripts should
 instead add to the existing LDFLAGS, like other templates do for CPPFLAGS.
 Several other templates completely override CFLAGS; that's undesirable for the
 same reason.  I don't have ready access to those affected configurations, so
 I'm leaving the CFLAGS damage alone.

+1 for doing something about CFLAGS while we're at it.

 Both of these changes fix bugs, but I plan not to back-patch.

Agreed; the lack of complaints to date suggests that we should leave
well enough alone in the back branches.

regards, tom lane




[HACKERS] NUMA packaging and patch

2014-06-08 Thread Kevin Grittner
I ran into a situation where a machine with 4 NUMA memory nodes and
40 cores had performance problems due to NUMA.  The problems were
worst right after they rebooted the OS and warmed the cache by
running a script of queries to read all tables.  These were all run
on a single connection.  As it turned out, the size of the database
was just over one-quarter of the size of RAM, and with default NUMA
policies both the OS cache for the database and the PostgreSQL
shared memory allocation were placed on a single NUMA segment, so
access to the CPU package managing that segment became a
bottleneck.  On top of that, processes which happened to run on the
CPU package which had all the cached data had to allocate memory
for local use on more distant memory because there was none left in
the more local memory.

Through normal operations, things eventually tended to shift around
and get better (after several hours of heavy use with substandard
performance).  I ran some benchmarks and found that even in
long-running tests, spreading these allocations among the memory
segments showed about a 2% benefit in a read-only load.  The
biggest difference I saw in a long-running read-write load was
about a 20% hit for unbalanced allocations, but I only saw that
once.  I talked to someone at PGCon who managed to engineer much
worse performance hits for an unbalanced load, although the
circumstances were fairly artificial.  Still, fixing this seems
like something worth doing if further benchmarks confirm benefits
at this level.

By default, the OS cache and buffers are allocated in the memory
node with the shortest distance from the CPU a process is running
on.  This is determined by the cpuset associated with the
process which reads or writes the disk page.
machine starts with a single cpuset with a policy specifying this
behavior.  Fixing this aspect of things seems like an issue for
packagers, although we should probably document it for those
running from their own source builds.

To set an alternate policy for PostgreSQL, you first need to find
or create the location for cpuset specification, which uses a
filesystem in a way similar to the /proc directory.  On a machine
with more than one memory node, the appropriate filesystem is
probably already mounted, although different distributions use
different filesystem names and mount locations.  I will illustrate
the process on my Ubuntu machine.  Even though it has only one
memory node (and so, this makes no difference), I have it handy at
the moment to confirm the commands as I put them into the email.

# Sysadmin must create the root cpuset if not already done.  (On a
# system with NUMA memory, this will probably already be mounted.)
# Location and options can vary by distro.

sudo mkdir /dev/cpuset
sudo mount -t cpuset none /dev/cpuset

# Sysadmin must create a cpuset for postgres and configure
# resources.  This will normally be all cores and all RAM.  This is
# where we specify that this cpuset will spread pages among its
# memory nodes.

sudo mkdir /dev/cpuset/postgres
sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"

# Sysadmin must grant permissions to the desired setting(s).
# This could be by user or group.

sudo chown postgres /dev/cpuset/postgres/tasks

# The pid of postmaster or an ancestor process must be written to
# the tasks file of the cpuset.  This can be a shell from which
# pg_ctl is run, at least for bash shells.  It could also be
# written by the postmaster itself, essentially as an extra pid
# file.  Possible snippet from a service script:

echo $$ >/dev/cpuset/postgres/tasks
pg_ctl start ...

Where the OS cache is larger than shared_buffers, the above is
probably more important than the attached patch, which causes the
main shared memory segment to be spread among all available memory
nodes.  This patch only compiles in the relevant code if configure
is run using the --with-libnuma option, in which case a dependency
on the numa library is created.  It is v3 to avoid confusion with
earlier versions I have shared with a few people off-list.  (The
only difference from v2 is fixing bitrot.)
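
The core of such a patch can be quite small.  A minimal sketch of the
libnuma call involved, assuming the patch works roughly this way (link
with -lnuma):

#include <stddef.h>
#include <numa.h>

/*
 * Sketch: interleave the pages of an already-mapped shared memory
 * region across all allowed NUMA nodes.  Real code also needs
 * configure support and error reporting.
 */
static void
interleave_shared_memory(void *base, size_t size)
{
    if (numa_available() == -1)
        return;                 /* kernel or library lacks NUMA support */

    numa_interleave_memory(base, size, numa_all_nodes_ptr);
}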

I'll add it to the next CF.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/configure b/configure
index ed1ff0a..79a0dea 100755
--- a/configure
+++ b/configure
@@ -702,6 +702,7 @@ EGREP
 GREP
 with_zlib
 with_system_tzdata
+with_libnuma
 with_libxslt
 with_libxml
 XML2_CONFIG
@@ -831,6 +832,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libnuma
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -1518,6 +1520,7 @@ Optional Packages:
  --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libxml   build with XML support
   --with-libxslt  use XSLT support when building contrib/xml2
+  --with-libnuma  use libnuma for NUMA support

Re: [HACKERS] MinGW/Cygwin build snags

2014-06-08 Thread Noah Misch
On Sun, Jun 08, 2014 at 06:04:46PM -0400, Tom Lane wrote:
 Noah Misch n...@leadboat.com writes:
  Since src/makefiles/Makefile.win32 does not set BUILDING_DLL for src/port,
  PGDLLIMPORT is set improperly for code to be linked directly into the 
  backend.
  Makefile.win32 does set BUILDING_DLL for src/common.  (Makefile.cygwin has 
  the
  same discrepancy, though I haven't checked whether it causes an actual build
  failure there.  The MSVC build system sets BUILDING_DLL for both src/port 
  and
  src/common files.)  This affects any reference to a data symbol from 
  src/port.
  The fix is straightforward enough: cause Makefile.{win32,cygwin} to treat
  src/port like src/common.
 
 I wonder whether these cases shouldn't distinguish between the frontend
 and backend builds of src/port/ and src/common/.  In particular, it
 seems odd that we're getting this type of failure in the backend build.

Good question; they need not distinguish.  !BUILDING_DLL means that the code
being compiled will access backend symbols through dynamic linking to
postgres.exe.  Server modules, such as plpgsql, build with !BUILDING_DLL, and
normal backend code builds with BUILDING_DLL.  The setting is irrelevant for
frontend/non-server code, which does not link to backend symbols at all.

  Second, src/template/{win32,cygwin} completely replaces LDFLAGS, so 
  overriding
  LDFLAGS on the configure command line is ineffective.  Those scripts 
  should
  instead add to the existing LDFLAGS, like other templates do for CPPFLAGS.
  Several other templates completely override CFLAGS; that's undesirable for 
  the
  same reason.  I don't have ready access to those affected configurations, so
  I'm leaving the CFLAGS damage alone.
 
 +1 for doing something about CFLAGS while we're at it.

Scratch that; configure.in has logic to discard template script CFLAGS changes
if the user had specified CFLAGS.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] backup_label revisited

2014-06-08 Thread Noah Misch
On Wed, Jun 04, 2014 at 06:17:29PM +0100, Greg Stark wrote:
 On Mon, Jun 2, 2014 at 2:10 PM, Fujii Masao masao.fu...@gmail.com wrote:
  What about the case where we restore the backup to another server and
  start the recovery? In this case, ISTM inode can be the same. No?
 
 Hm, I was about to comment that that seems very unlikely. Even on the
 same server if you delete the old database root and then unpack a
 backup it could possibly reuse the exact same inode again. But it's
 really not likely to happen.
 
 However in the brave new world of filesystem snapshots there is the
 possibility someone has restored a backup by opening one of their
 snapshots read-write. In which case the backup-label would have the
 same inode number. That would still be fine if the snapshot is
 consistent but if they have tablespaces and those tablespaces were
 snapshotted separately then it wouldn't be ok.

 I think that's a fatal flaw unless anyone can see a way to distinguish
 a filesystem from a snapshot of the filesystem.

Implementations of the dump program likely have that property of preserving
inode metadata while not promising a consistent snapshot.  If we keep such
backup methods fully supported, I agree with your conclusion.

PostgreSQL can't, in general, distinguish an almost-snapshot from a master
that crashed during a backup.  But the administrator can track the difference.
If you use pg_start_backup(), your init.d script that gets control after a
crash can have 'rm -f $PGDATA/backup_label'.  Use a different script to
recover hot backups.  Perhaps what ought to change is the documentation and
contrib/start-scripts?  Maybe even add a --definitely-not-a-backup postmaster
option, so scripts need not hardcode knowledge of a semi-internal fact like
the name of the file to delete.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] updated emacs configuration

2014-06-08 Thread Noah Misch
On Wed, Aug 07, 2013 at 07:57:53AM -0400, Peter Eisentraut wrote:
 On 7/2/13 8:42 PM, Peter Eisentraut wrote:
  Updated files with changes:
  
  - adjusted fill-column to 78, per Noah
  - added c-file-style, per Andrew
  - support both postgresql and postgres directory names
  - use defun instead of lambda, per Dimitri
  - put Perl configuration back into emacs.samples, for Tom
  
  I also added configuration of c-auto-align-backslashes as well as label
  and statement-case-open to c-offsets-alist.  With those changes, the
  result of indent-region is now very very close to pgindent, with the
  main exception of the end-of-line de-indenting that pgindent does, which
  nobody likes anyway.
 
 Did anyone have any outstanding concerns about this latest version?  I
 thought it looked ready to commit.

After upgrading to GNU Emacs 23.4.1 from a version predating directory-local
variables, I saw switch/case indentation go on the fritz.  My hooks were
issuing (c-set-style "postgresql"), but .dir-locals.el set it back to plain
"bsd" style.  The most-reasonable fix I found was to locally add c-file-style
to ignored-local-variables.  c-file-style is the only setting appearing in
both emacs.samples and .dir-locals.el with non-identical values, so it alone
calls for this treatment.  Behavior looks sane in both Emacs 21.4.1 and Emacs
24.3.1, the version extrema I have at hand.  Emacs 24.3.1 is fine without this
due to recent c-before-hack-hook changes, but this does no harm.  Shall I add
this workaround to emacs.samples?

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
diff --git a/src/tools/editors/emacs.samples b/src/tools/editors/emacs.samples
index e469d55..a510afd 100644
--- a/src/tools/editors/emacs.samples
+++ b/src/tools/editors/emacs.samples
@@ -30,7 +30,12 @@
 (add-hook 'c-mode-hook
           (defun postgresql-c-mode-hook ()
             (when (string-match "/postgres\\(ql\\)?/" buffer-file-name)
-              (c-set-style "postgresql"))))
+              (c-set-style "postgresql")
+              ;; Don't override the style we just set with the style in
+              ;; `dir-locals-file'.  Emacs 23.4.1 needs this; it is obsolete,
+              ;; albeit harmless, by Emacs 24.3.1.
+              (set (make-local-variable 'ignored-local-variables)
+                   (append '(c-file-style) ignored-local-variables)))))
 
 ;;; Perl files

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] wrapping in extended mode doesn't work well with default pager

2014-06-08 Thread Noah Misch
On Fri, May 23, 2014 at 10:10:23AM -0400, Alvaro Herrera wrote:
 Sergey Muraviov wrote:
  I found some new bugs and fix them.
  And I had to make many changes.
 
 This version fixes some bugs I had noticed in expanded mode too.  For 
 instance,
 the original looked like this (five lines plus header):
 
 -[ RECORD 49 
 ]-+-
  pg_identify_object | (rule,,,_RETURN on 
 pg_catalog.pg_available_extension_versions) 
 
  pg_identify_object | 
 (view,pg_catalog,pg_available_extension_versions,pg_catalog.pg_availabl
 e.
 |._extension_versions)
  
 
 
 whereas it's correctly only three lines plus header with this patch
 applied.

I had noticed a similar-looking behavior change with aligned format and
expanded display, so I gave psql-wrapped-expanded-fix-v4.patch a spin with
this test case:

\pset expanded on
\pset format aligned
select repeat('a', 2) union all select repeat('a', 500);

The patch did not restore 9.3 behavior for that one.  Starting with commit
6513633, the first line of letters is space-padded on the right to the width
of the second line of letters.  To illustrate, I have attached raw psql output
from both commit 6513633 and its predecessor.  Also note that
psql-wrapped-expanded-fix-v4.patch expands each [ RECORD x ] header from 509
bytes to 511 bytes; 509 is the longstanding width.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
Expanded display (expanded) is on.
Output format (format) is aligned.
[raw psql output at commit 6513633; long lines truncated in this digest.
The two-character RECORD 1 value "aa" is space-padded on the right to the
width of the 500-character RECORD 2 value.]

Expanded display (expanded) is on.
Output format (format) is aligned.
[raw psql output at the predecessor commit; long lines truncated in this
digest.  RECORD 1 shows "aa" with no trailing padding, followed by the
500-character RECORD 2 value.]

Re: [HACKERS] Scaling shared buffer eviction

2014-06-08 Thread Amit Kapila
On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner kgri...@ymail.com wrote:
 Amit Kapila amit.kapil...@gmail.com wrote:

  I have improved the patch by making the following changes:
  a. Improved the bgwriter logic to log the xl_running_xacts info and
     removed the hibernate logic, as bgwriter will now work only when
     there is a scarcity of buffers in the free list. The basic idea is:
     when the number of buffers on the freelist drops below the low
     threshold, the allocating backend sets the latch, and bgwriter
     wakes up and begins adding buffers to the freelist until it reaches
     the high threshold, and then goes back to sleep.

 The numbers from your benchmarks are very exciting, but the above
 concerns me.  My tuning of the bgwriter in production has generally
 *not* been aimed at keeping pages on the freelist, but toward
 preventing shared_buffers from accumulating a lot of dirty pages,
 which were leading to cascades of writes between caches and thus to
 write stalls.  By pushing dirty pages into the (*much* larger) OS
 cache, and letting write combining happen there, where the OS could
 pace based on the total number of dirty pages instead of having
 some hidden and appearing rather suddenly, latency spikes were
 avoided while not causing any noticeable increase in the number of
 OS writes to the RAID controller's cache.

 Essentially I was able to tune the bgwriter so that a dirty page
 was always pushed out to the OS cache within three seconds, which led
 to a healthy balance of writes between the checkpoint process and
 the bgwriter.

I think it would be better if bgwriter did its writes based on the
number of buffers that get dirtied, to achieve that balance of
writes.

 Backend processes related to user connections still
 performed about 30% of the writes, and this work shows promise
 toward bringing that down, which would be great; but please don't
 eliminate the ability to prevent write stalls in the process.

I agree that for some cases, as explained by you, the current bgwriter
logic does satisfy the need; however, there are other cases where it
doesn't help much. One such case I am trying to improve (easing backend
buffer allocations); another may be constant write activity, where I am
not sure how much the current logic really helps. Part of the reason for
trying to make bgwriter respond mainly to easing backend allocations is
the previous discussion on the same; refer to the link below:
http://www.postgresql.org/message-id/ca+tgmoz7dvhc4h-ffjmzcff6vwynfoeapz021vxw61uh46r...@mail.gmail.com

However, if we want to retain the current property of bgwriter, we can
do so in one of the ways below:
a. Have separate processes for writing dirty buffers and moving buffers
to the freelist.
b. In the current bgwriter, separate the two tasks based on the need.
The need can be decided based on whether bgwriter has been woken due
to a shortage of buffers on the free list or due to BgWriterDelay.

Now, as populating the freelist and balancing writes by writing dirty
buffers are two separate responsibilities, I am not sure whether doing
both in one process is a good idea.
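
For concreteness, option (b) might look roughly like the sketch below
(all helper names are assumed; this is not the actual patch):

for (;;)
{
    int         rc;

    rc = WaitLatch(&bgwriterLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                   BgWriterDelay /* milliseconds */ );
    ResetLatch(&bgwriterLatch);

    if (rc & WL_LATCH_SET)
    {
        /* woken by a backend: refill freelist to the high watermark */
        while (FreelistLength() < freelist_high_watermark)
        {
            volatile BufferDesc *buf = ClockSweepNextVictim();

            if (buf == NULL)
                break;
            AddToFreelist(buf);
        }
    }
    else
    {
        /* timeout: traditional balanced dirty-buffer writing */
        BgBufferSync();
    }
}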

I am planning to take some more performance data, part of which will
include write load as well, but I am not sure whether it can show the
need you mentioned.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] RETURNING PRIMARY KEY syntax extension

2014-06-08 Thread Ian Barwick

Hi,

The JDBC API provides the getGeneratedKeys() method as a way of retrieving
primary key values without the need to explicitly specify the primary key
column(s). This is a widely-used feature; however, the implementation has
significant performance drawbacks.

Currently this feature is implemented in the JDBC driver by appending
RETURNING * to the supplied statement. However this means all columns of
affected rows will be returned to the client, which causes significant
performance problems, particularly on wide tables. To mitigate this, it would
be desirable to enable the JDBC driver to request only the primary key value(s).

One possible solution would be to have the driver request the primary key for
a table, but this could cause a race condition where the primary key could 
change,
and even if it does not, it would entail extra overhead.

A more elegant and universal solution, which would allow the JDBC driver to
request the primary key in a single request, would be to extend the RETURNING
clause syntax with the option PRIMARY KEY. This resolves during parse
analysis into the columns of the primary key, which can be done unambiguously
because the table is already locked by that point and the primary key cannot 
change.
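
To sketch how that resolution could work (simplified, error paths and
locking details omitted; this is illustrative, not necessarily the
patch's actual code):

/*
 * Collect the attribute numbers of the target table's primary key
 * during parse analysis; the caller turns them into RETURNING columns.
 */
static List *
get_primary_key_attnos(Relation rel)
{
    List       *result = NIL;
    ListCell   *lc;

    foreach(lc, RelationGetIndexList(rel))
    {
        Oid         indexoid = lfirst_oid(lc);
        HeapTuple   tuple;
        Form_pg_index indexForm;

        tuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indexoid));
        if (!HeapTupleIsValid(tuple))
            continue;
        indexForm = (Form_pg_index) GETSTRUCT(tuple);
        if (indexForm->indisprimary)
        {
            int         i;

            for (i = 0; i < indexForm->indnatts; i++)
                result = lappend_int(result, indexForm->indkey.values[i]);
            ReleaseSysCache(tuple);
            break;
        }
        ReleaseSysCache(tuple);
    }
    return result;              /* NIL if no primary key: raise an error */
}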

A patch is attached which implements this, and will be added to the next 
commitfest.
A separate patch will be submitted to the JDBC project. Example usage shown 
below.


Regards

Ian Barwick

/* -- */
postgres=# CREATE TABLE foo (id SERIAL PRIMARY KEY);
CREATE TABLE

postgres=# INSERT INTO foo VALUES(DEFAULT) RETURNING PRIMARY KEY;
 id
----
  1
(1 row)

INSERT 0 1

postgres=# CREATE TABLE bar (id1 INT NOT NULL, id2 INT NOT NULL, PRIMARY 
KEY(id1, id2));
CREATE TABLE
postgres=# INSERT INTO bar VALUES(1,2) RETURNING PRIMARY KEY;
 id1 | id2
-----+-----
   1 |   2
(1 row)

INSERT 0 1

postgres=# INSERT INTO bar VALUES(2,1),(2,2) RETURNING PRIMARY KEY;
 id1 | id2
-----+-----
   2 |   1
   2 |   2
(2 rows)

INSERT 0 2

postgres=# CREATE TABLE no_pkey (id SERIAL NOT NULL);
CREATE TABLE
postgres=# INSERT INTO no_pkey VALUES(DEFAULT) RETURNING id;
 id
----
  1
(1 row)

INSERT 0 1
postgres=# INSERT INTO no_pkey VALUES(DEFAULT) RETURNING PRIMARY KEY;
ERROR:  Relation does not have any primary key(s)

/* -- */

--
 Ian Barwick   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
diff --git a/doc/src/sgml/ref/delete.sgml b/doc/src/sgml/ref/delete.sgml
index 74ea907..45295d1 100644
--- a/doc/src/sgml/ref/delete.sgml
+++ b/doc/src/sgml/ref/delete.sgml
@@ -25,7 +25,7 @@ PostgreSQL documentation
 DELETE FROM [ ONLY ] <replaceable class="PARAMETER">table_name</replaceable> [ * ] [ [ AS ] <replaceable class="parameter">alias</replaceable> ]
     [ USING <replaceable class="PARAMETER">using_list</replaceable> ]
     [ WHERE <replaceable class="PARAMETER">condition</replaceable> | WHERE CURRENT OF <replaceable class="PARAMETER">cursor_name</replaceable> ]
-    [ RETURNING * | <replaceable class="parameter">output_expression</replaceable> [ [ AS ] <replaceable class="parameter">output_name</replaceable> ] [, ...] ]
+    [ RETURNING * | <replaceable class="parameter">output_expression</replaceable> [ [ AS ] <replaceable class="parameter">output_name</replaceable> ] [, ...] | PRIMARY KEY ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -182,6 +182,17 @@ DELETE FROM [ ONLY ] <replaceable class="PARAMETER">table_name</replaceable> [ *
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term><literal>PRIMARY KEY</literal></term>
+    <listitem>
+     <para>
+      Returns the table's primary key column(s) after each row is deleted.
+      Cannot be combined with an <replaceable class="PARAMETER">output_expression</replaceable>.
+     </para>
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </refsect1>
 
@@ -208,7 +219,9 @@ DELETE <replaceable class="parameter">count</replaceable>
    clause, the result will be similar to that of a <command>SELECT</>
    statement containing the columns and values defined in the
    <literal>RETURNING</> list, computed over the row(s) deleted by the
-   command.
+   command. <literal>PRIMARY KEY</> can be specified to return the
+   primary key value(s) for each deleted row. An error will be raised
+   if the table does not have a primary key.
   </para>
  </refsect1>
 
diff --git a/doc/src/sgml/ref/insert.sgml b/doc/src/sgml/ref/insert.sgml
index a3cccb9..9fbd859 100644
--- a/doc/src/sgml/ref/insert.sgml
+++ b/doc/src/sgml/ref/insert.sgml
@@ -24,7 +24,7 @@ PostgreSQL documentation
 [ WITH [ RECURSIVE ] <replaceable class="parameter">with_query</replaceable> [, ...] ]
 INSERT INTO <replaceable class="PARAMETER">table_name</replaceable> [ ( <replaceable class="PARAMETER">column_name</replaceable> [, ...] ) ]
     { DEFAULT VALUES | VALUES ( { <replaceable class="PARAMETER">expression</replaceable> | DEFAULT } [, ...] ) [, 

Re: [HACKERS] RETURNING PRIMARY KEY syntax extension

2014-06-08 Thread David G Johnston
Ian Barwick wrote
 Hi,
 
 The JDBC API provides the getGeneratedKeys() method as a way of retrieving
 primary key values without the need to explicitly specify the primary key
 column(s). This is a widely-used feature, however the implementation has
 significant
 performance drawbacks.
 
 Currently this feature is implemented in the JDBC driver by appending
 RETURNING * to the supplied statement. However this means all columns of
 affected rows will be returned to the client, which causes significant
 performance problems, particularly on wide tables. To mitigate this, it
 would
 be desirable to enable the JDBC driver to request only the primary key
 value(s).

Seems like a good idea.


  ERROR:  Relation does not have any primary key(s)

"Relation does not have a primary key."
or
"Relation has no primary key." (preferred)

By definition it cannot have more than one, so it must have none.

It could have multiple unique constraints, but I do not believe they are
considered if not tagged as primary.

David J.


