Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Robert Haas
On Sat, Jan 24, 2015 at 10:04 PM, Bruce Momjian br...@momjian.us wrote:
 On Fri, Jan 23, 2015 at 02:34:36PM -0500, Stephen Frost wrote:
   You'd have to replace the existing data directory on the master to do
   that, which pg_upgrade was designed specifically to not do, in case
   things went poorly.
 
  Why? Just rsync the new data directory onto the old directory on the
  standbys. That's fine and simple.

 That still doesn't address the need to use --size-only, it would just
 mean that you don't need to use -H.  If anything the -H part is the
 aspect which worries me the least about this approach.

 I can now confirm that it works, just as Stephen said.  I was able to
 upgrade a standby cluster that contained the regression database, and
 the pg_dump output was perfect.

 I am attaching doc instruction that I will add to all branches as soon
 as someone else confirms my results.  You will need to use rsync
 --itemize-changes to see the hard links being created, e.g.:

hf+ pgsql/data/base/16415/28188 => pgsql.old/data/base/16384/28188

My rsync manual page (on two different systems) mentions nothing about
remote_dir, so I'd be quite unable to follow your proposed directions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 01:40 PM, Tom Lane wrote:
 In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
 band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
 there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
 consider such solutions.

 Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on 
 the table what I suggested is unnecessary.

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.
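
(For context, a minimal illustration of the two inputs at issue; how each of these
is stored and printed back is precisely the point of contention:)

-- a Unicode escape designating U+0000
SELECT '"\u0000"'::jsonb;
-- an escaped backslash followed by the literal characters u0000
SELECT '"\\u0000"'::jsonb;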

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release notes

2015-01-27 Thread Andres Freund
On 2015-01-27 10:09:12 -0800, Joshua D. Drake wrote:
 Where can one find the running copy of release notes for pending releases?

In the source tree on the master once it exists. It doesn't yet for the
next set of releases.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 02:28 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 01/27/2015 01:40 PM, Tom Lane wrote:

In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.

Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on
the table what I suggested is unnecessary.

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.




I don't think we can be in the business of rejecting valid JSON.

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Release notes

2015-01-27 Thread Joshua D. Drake


Where can one find the running copy of release notes for pending releases?
--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release notes

2015-01-27 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2015-01-27 10:09:12 -0800, Joshua D. Drake wrote:
 Where can one find the running copy of release notes for pending releases?

 In the source tree on the master once it exists. It doesn't yet for the
 next set of releases.

The closest thing to a running copy is the git commit log.  Somebody
(usually Bruce or I) writes release notes from that shortly before a
release is made.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Release notes

2015-01-27 Thread Joshua D. Drake


On 01/27/2015 10:17 AM, Tom Lane wrote:

Andres Freund and...@2ndquadrant.com writes:

On 2015-01-27 10:09:12 -0800, Joshua D. Drake wrote:

Where can one find the running copy of release notes for pending releases?



In the source tree on the master once it exists. It doesn't yet for the
next set of releases.


The closest thing to a running copy is the git commit log.  Somebody
(usually Bruce or I) writes release notes from that shortly before a
release is made.


Thanks for the info.

JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Abbreviated keys for Numeric

2015-01-27 Thread Gavin Flower

On 28/01/15 06:29, Andrew Gierth wrote:

Peter == Peter Geoghegan p...@heroku.com writes:

  Peter What I find particularly interesting about this patch is that it
  Peter makes sorting numerics significantly faster than even sorting
  Peter float8 values,

Played some more with this. Testing on some different gcc versions
showed that the results were not consistent between versions; the latest
I tried (4.9) showed float8 as somewhat faster, while 4.7 showed float8
as slightly slower; the difference was all in the time of the float8
case, the time for numeric was virtually the same.

For one specific test query, taking the best time of multiple runs,

float8:   gcc4.7 = 980ms, gcc4.9 = 833ms
numeric:  gcc4.7 = 940ms, gcc4.9 = 920ms

(vs. 650ms for bigint on either version)

So honestly I think abbreviation for float8 is a complete red herring.

Also, I couldn't get any detectable benefit from inlining
DatumGetFloat8, though I may have to play more with that to be certain
(above tests did not have any float8-related modifications at all, just
the datum and numeric abbrevs patches).

Since gcc5.0 is due to be released in less than 3 months, it might be 
worth testing with that.
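
For anyone wanting to reproduce this kind of comparison, here is a rough sketch of a
benchmark in the same spirit (a hypothetical table and data set, not the exact query
used above):

create table abbrev_test as
  select random()::numeric(12,6)  as n,
         random()::float8         as f,
         (random() * 1e9)::bigint as b
  from generate_series(1, 1000000);

\timing on
select n from abbrev_test order by n offset 2000000;  -- numeric sort, returns no rows
select f from abbrev_test order by f offset 2000000;  -- float8 sort
select b from abbrev_test order by b offset 2000000;  -- bigint sort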



Cheers,
Gavin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - Assert statement

2015-01-27 Thread Pavel Stehule
2015-01-26 22:34 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/22/15 2:01 PM, Pavel Stehule wrote:


 * I would like to simplify the behavior of evaluating the message
 expression - probably I will disallow NULL there.


 Well, the only thing I could see you doing there is throwing a
 different error if the hint is null. I don't see that as an improvement.
 I'd just leave it as-is.


 I enabled a NULL - but enforced a WARNING before.


 I don't see the separate warning as being helpful. I'd just do something
 like

 +(err_hint != NULL) ? errhint("%s",
 err_hint) : errhint("Message attached to failed assertion is null") ));


done



 There should also be a test case for a NULL message.


is there, if I understand well

Regards

Pavel



  * GUC enable_asserts will be supported


 That would be good. Would that allow for enabling/disabling on a
 per-function basis too?


 sure - the only question is whether we develop a #option
 enable|disable_asserts. I have no strong idea.


 The option would be nice, but I don't think it's strictly necessary. The
 big thing is being able to control this on a per-function basis (which I
 think you can do with ALTER FUNCTION SET?)

 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com
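
For illustration, a minimal sketch of how the proposed statement and GUC could be used
once the attached patch is applied (hypothetical usage, following the documentation
hunk in the patch below):

SET enable_user_asserts = on;   -- the GUC added by the patch

DO $$
DECLARE
  cnt int;
BEGIN
  SELECT count(*) INTO cnt FROM pg_class;
  ASSERT cnt > 0, 'pg_class is unexpectedly empty';
END;
$$;

-- per-function control as discussed above would presumably look like:
-- ALTER FUNCTION my_func() SET enable_user_asserts = off;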

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index 6bcb106..5663983
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** dynamic_library_path = 'C:\tools\postgre
*** 6912,6917 
--- 6912,6931 
  
  variablelist
  
+  varlistentry id=guc-enable-user-asserts xreflabel=enable_user_asserts
+   termvarnameenable_user_asserts/varname (typeboolean/type)
+   indexterm
+primaryvarnameenable_user_asserts/ configuration parameter/primary
+   /indexterm
+   /term
+   listitem
+para
+ If true, any user assertions are evaluated.  By default, this 
+ is set to true.
+/para
+   /listitem
+  /varlistentry
+ 
   varlistentry id=guc-exit-on-error xreflabel=exit_on_error
termvarnameexit_on_error/varname (typeboolean/type)
indexterm
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 69a0885..26f7eba
*** a/doc/src/sgml/plpgsql.sgml
--- b/doc/src/sgml/plpgsql.sgml
*** END LOOP optional replaceablelabel/
*** 3372,3377 
--- 3372,3380 
sect1 id=plpgsql-errors-and-messages
 titleErrors and Messages/title
  
+   sect2 id=plpgsql-statements-raise
+ titleRAISE statement/title
+ 
 indexterm
  primaryRAISE/primary
 /indexterm
*** RAISE unique_violation USING MESSAGE = '
*** 3564,3570 
--- 3567,3599 
   the whole category.
  /para
 /note
+   /sect2
  
+   sect2 id=plpgsql-statements-assert
+ titleASSERT statement/title
+ 
+indexterm
+ primaryASSERT/primary
+/indexterm
+ 
+indexterm
+ primaryassertions/primary
+ secondaryin PL/pgSQL/secondary
+/indexterm
+ 
+para
+ Use the commandASSERT/command statement to ensure that an expected
+ predicate is always true. If the predicate is false or is null,
+ then an assertion exception is raised.
+ 
+ synopsis
+ ASSERT replaceable class=parameterexpression/replaceable optional, replaceable class=parametermessage expression/replaceable /optional;
+ /synopsis
+ 
+ The user assertions can be enabled or disabled by the 
+ xref linkend=guc-enable-user-asserts.
+/para
+   /sect2
   /sect1
  
   sect1 id=plpgsql-trigger
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
new file mode 100644
index 28c8c40..da12428
*** a/src/backend/utils/errcodes.txt
--- b/src/backend/utils/errcodes.txt
*** PEERRCODE_PLPGSQL_ERROR
*** 454,459 
--- 454,460 
  P0001EERRCODE_RAISE_EXCEPTIONraise_exception
  P0002EERRCODE_NO_DATA_FOUND  no_data_found
  P0003EERRCODE_TOO_MANY_ROWS  too_many_rows
+ P0004EERRCODE_ASSERT_EXCEPTION   assert_exception
  
  Section: Class XX - Internal Error
  
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
new file mode 100644
index c35867b..cd55e94
*** a/src/backend/utils/init/globals.c
--- b/src/backend/utils/init/globals.c
*** bool		IsBinaryUpgrade = false;
*** 99,104 
--- 99,105 
  bool		IsBackgroundWorker = false;
  
  bool		ExitOnAnyError = false;
+ bool		enable_user_asserts = true;
  
  int			DateStyle = USE_ISO_DATES;
  int			DateOrder = DATEORDER_MDY;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index f6df077..b3a2660
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c

Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
Hi

2015-01-27 11:41 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:



 2015-01-26 21:44 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/25/15 4:23 AM, Pavel Stehule wrote:


 I tested a concept iteration over array in format [key1, value1, key2,
 value2, .. ] - what is nice, it works for [[key1,value1],[key2, value2],
 ...] too

 It is only a few lines more to current code, and this change doesn't
 break a compatibility.

 Do you think, so this patch is acceptable?

 Ideas, comments?


 Aside from fixing the comments... I think this needs more tests on corner
 cases. For example, what happens when you do

 foreach a, b, c in array(array(1,2),array(3,4)) ?


 it is relative simple behave -- empty values are NULL

 array(1,2),array(3,4) -- do you think ARRAY[[1,2],[3,4]] is effectively
 ARRAY[1,2,3,4]



 Or the opposite case of

 foreach a,b in array(array(1,2,3))

 Also, what about:

 foreach a,b in '{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[] ?



  postgres=# select array(select
 unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
array
 ---
  {1,2,3,4,5,6,7,8}
 (1 row)

 so it generate pairs {1,2}{3,4},{5,6},{7,8}


I fixed situation when array has not enough elements.

More tests, simple doc

Regards

Pavel



 Regards

 Pavel Stehule


 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com
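
As a companion to the hstore test in the patch below, here is a sketch of the proposed
multi-target FOREACH over a plain two-dimensional array (this needs the patch applied;
the values are illustrative only):

do $$
declare
  a int; b int;
begin
  -- each inner array is unpacked into the target list; missing elements become NULL
  foreach a, b in array ARRAY[[1,2],[3,4],[5,6]]
  loop
    raise notice 'a: %, b: %', a, b;
  end loop;
end;
$$;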



diff --git a/contrib/hstore/expected/hstore.out b/contrib/hstore/expected/hstore.out
new file mode 100644
index 9749e45..e44532e
*** a/contrib/hstore/expected/hstore.out
--- b/contrib/hstore/expected/hstore.out
*** select %% 'aa=>1, cq=>l, b=>g, fg=>NULL'
*** 1148,1153 
--- 1148,1169 
   {b,g,aa,1,cq,l,fg,NULL}
  (1 row)
  
+ -- fast iteration over keys
+ do $$
+ declare
+   key text;
+   value text;
+ begin
+   foreach key, value in array hstore_to_array('aa=>1, cq=>l, b=>g, fg=>NULL'::hstore)
+   loop
+ raise notice 'key: %, value: %', key, value;
+   end loop;
+ end;
+ $$;
+ NOTICE:  key: b, value: g
+ NOTICE:  key: aa, value: 1
+ NOTICE:  key: cq, value: l
+ NOTICE:  key: fg, value: NULL
  select hstore_to_matrix('aa=>1, cq=>l, b=>g, fg=>NULL'::hstore);
  hstore_to_matrix 
  -
diff --git a/contrib/hstore/sql/hstore.sql b/contrib/hstore/sql/hstore.sql
new file mode 100644
index 5a9e9ee..7b9eb09
*** a/contrib/hstore/sql/hstore.sql
--- b/contrib/hstore/sql/hstore.sql
*** select avals('');
*** 257,262 
--- 257,275 
  select hstore_to_array('aa=>1, cq=>l, b=>g, fg=>NULL'::hstore);
  select %% 'aa=>1, cq=>l, b=>g, fg=>NULL';
  
+ -- fast iteration over keys
+ do $$
+ declare
+   key text;
+   value text;
+ begin
+   foreach key, value in array hstore_to_array('aa=>1, cq=>l, b=>g, fg=>NULL'::hstore)
+   loop
+ raise notice 'key: %, value: %', key, value;
+   end loop;
+ end;
+ $$;
+ 
  select hstore_to_matrix('aa=>1, cq=>l, b=>g, fg=>NULL'::hstore);
  select %# 'aa=>1, cq=>l, b=>g, fg=>NULL';
  
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index 69a0885..4ef0299
*** a/doc/src/sgml/plpgsql.sgml
--- b/doc/src/sgml/plpgsql.sgml
*** NOTICE:  row = {7,8,9}
*** 2490,2495 
--- 2490,2518 
  NOTICE:  row = {10,11,12}
  /programlisting
  /para
+ 
+ para
+  A literalFOREACH/ loop can be used to iterate over a record. You
+  need the xref linkend=hstore extension for this. In this case the
+  literalSLICE/literal clause should not be used. literalFOREACH/literal
+  statements support a list of target variables. When the source array is
+  an array of composites, each composite array element is assigned to the target
+  variables. When the array is an array of scalar values, the target
+  variables are filled item by item.
+ programlisting
+ CREATE FUNCTION trig_function() RETURNS TRIGGER AS $$
+ DECLARE
+   key text; value text;
+ BEGIN
+   FOREACH key, value IN ARRAY hstore_to_array(hstore(NEW))
+   LOOP
+ RAISE NOTICE 'key = %, value = %', key, value;
+   END LOOP;
+   RETURN NEW;
+ END;
+ $$ LANGUAGE plpgsql;
+ /programlisting
+ /para
 /sect2
  
 sect2 id=plpgsql-error-trapping
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
new file mode 100644
index ae5421f..4ab3d90
*** a/src/pl/plpgsql/src/pl_exec.c
--- b/src/pl/plpgsql/src/pl_exec.c
*** exec_stmt_foreach_a(PLpgSQL_execstate *e
*** 2242,2247 
--- 2242,2250 
  	Datum		value;
  	bool		isnull;
  
+ 
+ 	bool		multiassign = false;
+ 
  	/* get the value of the array expression */
  	value = exec_eval_expr(estate, stmt-expr, isnull, arrtype);
  	if (isnull)
*** exec_stmt_foreach_a(PLpgSQL_execstate *e
*** 2303,2308 
--- 2306,2328 
  (errcode(ERRCODE_DATATYPE_MISMATCH),
  			  errmsg(FOREACH loop variable must not be of an array type)));
  
+ 	/*
+ 	 * Proof concept -- multiassign in FOREACH cycle
+ 	 *
+ 	 * Motivation: FOREACH key, value IN ARRAY hstore_to_array(hstore(NEW)) 

Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-27 Thread Andres Freund
On 2015-01-27 07:16:27 -0500, Robert Haas wrote:
 On Mon, Jan 26, 2015 at 4:03 PM, Andres Freund and...@2ndquadrant.com wrote:
  I basically have two ideas to fix this.
 
  1) Make do_pg_start_backup() acquire a SHARE lock on
 pg_database. That'll prevent it from starting while a movedb() is
 still in progress. Then additionally add pg_backup_in_progress()
 function to xlog.c that checks (XLogCtl->Insert.exclusiveBackup ||
 XLogCtl->Insert.nonExclusiveBackups != 0). Use that in createdb() and
 movedb() to error out if a backup is in progress.
 
  Attached is a patch trying to do this. Doesn't look too bad and led me to
  discover missing recovery conflicts during an AD ST.
 
  But: It doesn't actually work on standbys, because lock.c prevents any
  stronger lock than RowExclusive from being acquired. And we need a
  lock that can conflict with WAL replay of DBASE_CREATE, to handle base
  backups that are executed on the primary. Those obviously can't detect
  whether any standby is currently doing a base backup...
 
  I currently don't have a good idea how to mangle lock.c to allow
  this. I've played with doing it like in the second patch, but that
  doesn't actually work because of some asserts around ProcSleep - leading
  to locks on database objects not working in the startup process (despite
  already being used).
 
  The easiest thing would be to just use a lwlock instead of a heavyweight
  lock - but those aren't canceleable...
 
 How about just wrapping an lwlock around a flag variable?  movedb()
 increments the variable when starting and decrements it when done
 (must use PG_ENSURE_ERROR_CLEANUP).  Starting a backup errors out (or
 waits in 1-second increments) if it's non-zero.

That'd end up essentially being a re-emulation of locks. Don't find that
all that pretty. But maybe we have to go there.


Here's an alternative approach. I think it generally is superior and
going in the right direction, but I'm not sure it's backpatchable.

It basically consists of:
1) Make GetLockConflicts() actually work.
2) Allow the startup process to actually acquire locks other than
   AccessExclusiveLocks. There already is code acquiring other locks,
   but it's currently broken because they're acquired in blocking mode
   which isn't actually supported in the startup mode. Using this
   infrastructure we can actually fix a couple small races around
   database creation/drop.
3) Allow session locks during recovery to be heavier than
   RowExclusiveLock - since relation/object session locks aren't ever
   taken out by the user (without a plain lock first) that's safe.
4) Perform streaming base backups from within a transaction command, to
   provide error handling.
5) Make walsender actually react to recovery conflict interrupts
6) Prevent access to the template database of a CREATE DATABASE during
   WAL replay.
7) Add an interlock to prevent base backups and movedb() to run
   concurrently. What we actually came here for.

Comments?

At the very least it's missing some documentation updates about the
locking changes in ALTER DATABASE, CREATE DATABASE and the base backup
sections.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 6e196d17e3dc3ae923321c1b49eb46ccd5ac75b0 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Tue, 27 Jan 2015 19:52:11 +0100
Subject: [PATCH] Fix various issues around WAL replay and ALTER DATABASE SET
 TABLESPACE.

Danger of basebackups vs. AD ST
Discussion: 20150120152819.gc24...@alap3.anarazel.de

Fix GetLockConflicts() to properly terminate the list of conflicting
backends. It's unclear why this hasn't caused more problems.

Discussion: 20150127142713.gd29...@awork2.anarazel.de

Don't acquire blocking locks on the database in dbase_redo(), not
enough state setup.
Discussion: 20150126212458.ga29...@awork2.anarazel.de

Don't allow access to the template database during the replay of a
CREATE DATABASE.
---
 src/backend/access/transam/xlog.c|  29 +
 src/backend/commands/dbcommands.c| 104 ++-
 src/backend/replication/basebackup.c |  15 +
 src/backend/replication/walsender.c  |  14 +
 src/backend/storage/ipc/standby.c| 117 ++-
 src/backend/storage/lmgr/lmgr.c  |  31 ++
 src/backend/storage/lmgr/lock.c  |  13 ++--
 src/backend/utils/init/postinit.c|   2 +-
 src/include/access/xlog.h|   1 +
 src/include/storage/lmgr.h   |   3 +
 src/include/storage/standby.h|   2 +-
 11 files changed, 240 insertions(+), 91 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..38e7dff 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -53,6 +53,7 @@
 #include storage/ipc.h
 #include storage/large_object.h
 #include storage/latch.h

Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2015-01-27 Thread Andres Freund
On 2015-01-27 08:21:57 +0100, Andreas Karlsson wrote:
 On 01/23/2015 02:58 AM, Petr Jelinek wrote:
 On 23/01/15 00:40, Andreas Karlsson wrote:
 - Renamed some things from int12 to int128, there are still some places
 with int16 which I am not sure what to do with.
 
  I'd vote for renaming them to int128 too; there are enough C functions
  that use int16 for a 16-bit integer that this is going to be confusing
  otherwise.
 
 Do you also think the SQL functions should be named numeric_int128_sum,
 numeric_int128_avg, etc?

I'm pretty sure we already decided upthread that the SQL interface is
going to keep using int4/8 and by extension int16.
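
To be concrete about that: the change is internal to the aggregate transition state,
so at the SQL level queries like the following keep their existing names and result
types (a trivial example, assuming the patch):

-- sum()/avg() over bigint are unchanged at the SQL level; only the
-- internal accumulator would use a 128-bit integer where available.
SELECT sum(i) AS total, avg(i) AS mean
FROM generate_series(1::bigint, 1000000::bigint) AS t(i);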

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Incremental backup v3: incremental PoC

2015-01-27 Thread Giuseppe Broccolo
Hi Marco,

On 16/01/15 16:55, Marco Nenciarini wrote:
 On 14/01/15 17:22, Gabriele Bartolini wrote:
 
  My opinion, Marco, is that for version 5 of this patch, you:
 
  1) update the information on the wiki (it is outdated - I know you have
  been busy with LSN map optimisation)

 Done.

  2) modify pg_basebackup in order to accept a directory (or tar file) and
  automatically detect the LSN from the backup profile

 New version of patch attached. The -I parameter now requires a backup
 profile from a previous backup. I've added a sanity check that forbids
 incremental file level backups if the base timeline is different from
 the current one.

  3) add the documentation regarding the backup profile and pg_basebackup
 

 Next on my TODO list.

  Once we have all of this, we can continue trying the patch. Some
  unexplored paths are:
 
  * tablespace usage

 I've improved my pg_restorebackup python PoC. It now supports tablespaces.

About tablespaces, I noticed that references to tablespace locations are
lost during the recovery of an incremental backup when changing the tablespace
mapping (-T option). Here are the steps I followed:

   - creating and filling a test database obtained through pgbench

   psql -c "CREATE DATABASE pgbench"
   pgbench -U postgres -i -s 5 -F 80 pgbench

   - a first base backup with pg_basebackup:

   mkdir -p backups/$(date '+%d%m%y%H%M')/data && pg_basebackup -v -F
p -D backups/$(date '+%d%m%y%H%M')/data -x

   - creation of a new tablespace, alter the table pgbench_accounts to
   set the new tablespace:

   mkdir -p /home/gbroccolo/pgsql/tbls
   psql -c "CREATE TABLESPACE tbls LOCATION '/home/gbroccolo/pgsql/tbls'"
   psql -c "ALTER TABLE pgbench_accounts SET TABLESPACE tbls" pgbench

   - Doing some work on the database:

   pgbench -U postgres -T 120 pgbench

   - a second incremental backup with pg_basebackup specifying the new
   location for the tablespace through the tablespace mapping:

   mkdir -p backups/$(date '+%d%m%y%H%M')/data backups/$(date
'+%d%m%y%H%M')/tbls && pg_basebackup -v -F p -D backups/$(date
'+%d%m%y%H%M')/data -x -I backups/2601151641/data/backup_profile -T
/home/gbroccolo/pgsql/tbls=/home/gbroccolo/pgsql/backups/$(date
'+%d%m%y%H%M')/tbls

   - a recovery based on the tool pg_restorebackup.py attached in
   http://www.postgresql.org/message-id/54b9428e.9020...@2ndquadrant.it

   ./pg_restorebackup.py backups/2601151641/data
backups/2601151707/data /tmp/data -T
/home/gbroccolo/pgsql/backups/2601151707/tbls=/tmp/tbls


In the last step, I obtained the following stack trace:

Traceback (most recent call last):
  File "./pg_restorebackup.py", line 74, in <module>
shutil.copy2(base_file, dest_file)
  File "/home/gbroccolo/.pyenv/versions/2.7.5/lib/python2.7/shutil.py",
line 130, in copy2
copyfile(src, dst)
  File "/home/gbroccolo/.pyenv/versions/2.7.5/lib/python2.7/shutil.py",
line 82, in copyfile
with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory:
'backups/2601151641/data/base/16384/16406_fsm'


Any idea on what's going wrong?

Thanks,
Giuseppe.
-- 
Giuseppe Broccolo - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
giuseppe.brocc...@2ndquadrant.it | www.2ndQuadrant.it


Re: [HACKERS] [POC] FETCH limited by bytes.

2015-01-27 Thread Kyotaro HORIGUCHI
Thank you for the comment.

The automatic way to determine the fetch_size seems to be too
much for the purpose. An example of a non-automatic way is a new
foreign table option like 'fetch_size', but this exposes the
internals too much... Which do you think is preferable?
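
For concreteness, a non-automatic knob could look something like this (purely a
hypothetical sketch of the option floated above, not an existing postgres_fdw option):

-- hypothetical: expose the fetch size as an FDW option at server or table level
ALTER SERVER sv1 OPTIONS (ADD fetch_size '10000');
ALTER FOREIGN TABLE ft1 OPTIONS (ADD fetch_size '10000');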

Thu, 22 Jan 2015 11:17:52 -0500, Tom Lane t...@sss.pgh.pa.us wrote in 
24503.1421943...@sss.pgh.pa.us
 Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes:
  Hello, as the discuttion on async fetching on postgres_fdw, FETCH
  with data-size limitation would be useful to get memory usage
  stability of postgres_fdw.
 
  Is such a feature and syntax could be allowed to be added?
 
 This seems like a lot of work, and frankly an incredibly ugly API,
 for a benefit that is entirely hypothetical.  Have you got numbers
 showing any actual performance win for postgres_fdw?

The API is rush work to make a path for the new parameter
(but, yes, I did too much for the purpose of use from
postgres_fdw..)  and it could have any saner syntax, but it's not the
time to do so yet.

The data-size limitation, whatever the size limit, would give
a significant gain, especially for small-sized rows.

This patch began from the fact that it runs about twice as fast
when the fetch size is 10000 rather than 100.

http://www.postgresql.org/message-id/20150116.171849.109146500.horiguchi.kyot...@lab.ntt.co.jp

I took exec times to get 1M rows from localhost via postgres_fdw
and it showed the following numbers.

=# SELECT a from ft1;
fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
(local)0.75s
10060  6.2s   6000 (0.006)
1  60  2.7s 60 (0.6  )
3  60  2.2s180 (2.0  )
6  60  2.4s360 (4.0  )

=# SELECT a, b, c from ft1;
fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
(local)0.8s
100   204 12  s  20400 (0.02 )
1000  204 10  s 204000 (0.2  )
1 204  5.8s204 (2)
2 204  5.9s408 (4)

=# SELECT a, b, d from ft1;
fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
(local)0.8s
100  1356 17  s 135600 (0.136)
1000 1356 15  s1356000 (1.356)
1475 1356 13  s2000100 (2.0  )
2950 1356 13  s4000200 (4.0  )

The definitions of the environment are the following.

CREATE SERVER sv1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', 
dbname 'postgres');
CREATE USER MAPPING FOR PUBLIC SERVER sv1;
CREATE TABLE lt1 (a int, b timestamp, c text, d text);
CREATE FOREIGN TABLE ft1 (a int, b timestamp, c text, d text) SERVER sv1 
OPTIONS (table_name 'lt1');
INSERT INTO lt1 (SELECT a, now(), repeat('x', 128), repeat('x', 1280) FROM 
generate_series(0, 999999) a);

The avg row size is alloced_mem/fetch_size, and the alloced_mem
is the sum of HeapTuple[fetch_size] and (HEAPTUPLESIZE +
tup->t_len) for all stored tuples on the receiver side,
fetch_more_data() in postgres_fdw.

That is about a 50% gain for the smaller tuple size and 25% for
the larger. The optimum seems to be where alloced_mem is
around 2MB, for a reason unknown to me. Anyway the difference
seems to be significant.

 Even if we wanted to do something like this, I strongly object to
 measuring size by heap_compute_data_size.  That's not a number that users
 would normally have any direct knowledge of; nor does it have anything
 at all to do with the claimed use-case, where what you'd really need to
 measure is bytes transmitted down the wire.  (The difference is not small:
 for instance, toasted values would likely still be toasted at the point
 where you're measuring.)

Sure. Finally, the attached patch #1 does the following
things.

 - Sender limits the number of tuples using the sum of the net
   length of the column values to be sent, not including protocol
   overhead. It is calculated in the added function
   slot_compute_attr_size(), using raw length for compressed
   values.

 - postgres_fdw calculates fetch limit bytes by the following
   formula,

   MAX_FETCH_MEM - MAX_FETCH_SIZE * (estimated overhead per tuple);

The result of the patch is as follows. MAX_FETCH_MEM = 2MiB and
MAX_FETCH_SIZE = 30000.

fetch_size,   avg row size(*1),   time,   max alloced_mem/fetch(Mbytes)
(auto) 60  2.4s   108 ( 1.08)
(auto)204  7.3s536400 ( 0.54)
(auto)   1356 15  s430236 ( 0.43)

This is meaningfully fast but the patch looks too big and the
meaning of the new parameter is hard to understand..:(


On the other hand the cause of the displacements of alloced_mem
shown above is per-tuple overhead, the sum of which is unknown
before execution.  

Re: [HACKERS] Hot Standby WAL reply uses heavyweight session locks, but doesn't have enough infrastructure set up

2015-01-27 Thread Andres Freund
On 2015-01-27 16:23:53 +0900, Michael Paquier wrote:
 On Tue, Jan 27, 2015 at 6:24 AM, Andres Freund and...@2ndquadrant.com wrote:
  Unfortunately that Assert()s when there's a lock conflict because
  e.g. another backend is currently connecting. That's because ProcSleep()
  does a enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout) - and
  there's no deadlock timeout (or lock timeout) handler registered.

 Yes, that could logically happen if there is a conflicting lock, as
 RowExclusiveLock or lower locks can be taken in recovery.

I don't think this specific lock (it's an object lock, not a relation lock) can easily
be taken directly by a user except during authentication.

  [...]
  afaics, that should work? Not pretty, but probably easier than starting
  to reason about the deadlock detector in the startup process.

 Wouldn't it be cleaner to simply register a dedicated handler in
 StartupXlog instead, obviously something else than DEADLOCK_TIMEOUT as
 it is reserved for backend operations? For back-branches, we may even
 consider using DEADLOCK_TIMEOUT..

What would that timeout handler actually do? Two problems:
a) We really don't want the startup process be killed/error out as that
   likely means a postmaster restart. So the default deadlock detector
   strategy is something that's really useless for us.
b) Pretty much all other (if not actually all other) heavyweight lock
   acquisitions in the startup process acquire locks using dontWait =
   true - which means that the deadlock detector isn't actually
   run. That's essentially fine because we simply kill everything in our
   way. C.f. StandbyAcquireAccessExclusiveLock() et al.

There's a dedicated 'deadlock detector' like infrastructure around
ResolveRecoveryConflictWithBufferPin(), but it deals with a class of
deadlocks that's not handled in the deadlock detector anyway.


I think this isn't particularly pretty, but it seems to be working well
enough, and changing it would be pretty invasive. So keeping in line
with all that code seems to be easier.

  We probably should also add a Assert(!InRecovery || sessionLock) to
  LockAcquireExtended() - these kind of problems are otherwise hard to
  find in a developer setting.

 So this means that locks other than session ones cannot be taken while
 a node is in recovery, but RowExclusiveLock can be taken while in
 recovery. Don't we have a problem with this assertion then?

Note that InRecovery doesn't mean what you probably think it means:
/*
 * Are we doing recovery from XLOG?
 *
 * This is only ever true in the startup process; it should be read as meaning
 * this process is replaying WAL records, rather than the system is in
 * recovery mode.  It should be examined primarily by functions that need
 * to act differently when called from a WAL redo function (e.g., to skip WAL
 * logging).  To check whether the system is in recovery regardless of which
 * process you're running in, use RecoveryInProgress() but only after shared
 * memory startup and lock initialization.
 */
boolInRecovery = false;

The assertion actually should be even stricter:
Assert(!InRecovery || (sessionLock && dontWait));
i.e. we never acquire a heavyweight lock in the startup process unless
it's a session lock (as we don't have resource managers/a xact to track
locks) and we don't wait (as we don't have the deadlock detector
infrastructure set up).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using 128-bit integers for sum, avg and statistics aggregates

2015-01-27 Thread Andreas Karlsson

On 01/27/2015 09:05 AM, Andres Freund wrote:

On 2015-01-27 08:21:57 +0100, Andreas Karlsson wrote:

On 01/23/2015 02:58 AM, Petr Jelinek wrote:

On 23/01/15 00:40, Andreas Karlsson wrote:

- Renamed some things from int12 to int128, there are still some places
with int16 which I am not sure what to do with.


I'd vote for renaming them to int128 too; there are enough C functions
that use int16 for a 16-bit integer that this is going to be confusing
otherwise.


Do you also think the SQL functions should be named numeric_int128_sum,
numeric_int128_avg, etc?


I'm pretty sure we already decided upthread that the SQL interface is
going to keep using int4/8 and by extension int16.


Excellent, then I will leave my latest patch as-is and let the reviewer 
say what he thinks.


--
Andreas Karlsson


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Abhijit Menon-Sen
At 2015-01-26 17:45:52 -0500, robertmh...@gmail.com wrote:

  Based on the recent emails, it appears there's been a shift of
  preference to having it be in-core […]
 
 Well, I'm not sure that anyone else here agreed with me on that

Sure, an in-core AUDIT command would be great. Stephen has always said
that would be the most preferable solution; and if we had the code to
implement it, I doubt anyone would prefer the contrib module instead.
But we don't have that code now, and we won't have it in time for 9.5.

We had an opportunity to work on pgaudit in its current form, and I like
to think that the result is useful. To me, the question has always been
whether some variant of that code would be acceptable for 9.5's contrib.
If so, I had some time to work on that. If not… well, hard luck. But the
option to implement AUDIT was not available to me, which is why I have
not commented much on it recently.

 The basic dynamic here seems to be you asking for changes and Abhijit
 making them but without any real confidence, and I don't feel good
 about that. 

I understand how I might have given you that impression, but I didn't
mean to, and I don't really feel that way.

I appreciate Stephen's suggestions and, although it took me some time to
understand them fully, I think the use of GRANT to provide finer-grained
auditing configuration has improved pgaudit. I am slightly concerned by
the resulting complexity, but I think that can be addressed by examples
and so on. I wouldn't be unhappy if this code were to go into contrib.

(I should point out that it is also not the case that I do not hold any
opinions and would be happy with anything pgaudit-shaped being included.
For example, I strongly prefer GRANT to the 'alice:*:*' approach.)
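
(To make the GRANT-based idea concrete, a rough sketch of that configuration style;
the role and table names here are purely illustrative, and the exact knobs are
whatever the contrib module ends up exposing:)

-- a role that exists only to describe what should be audited
CREATE ROLE auditor NOLOGIN;
-- granting a privilege to that role marks the object/operation for audit logging
GRANT SELECT, UPDATE ON accounts TO auditor;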

Anyway, I think it's reasonably clear now that pgaudit is unlikely to
make it into 9.5 in any form, so I'll find something else to do.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Safe memory allocation functions

2015-01-27 Thread Michael Paquier
Alvaro Herrera wrote:
 So how about something like

 #define ALLOCFLAG_HUGE  0x01
 #define ALLOCFLAG_NO_ERROR_ON_OOM   0x02
 void *
 MemoryContextAllocFlags(MemoryContext context, Size size, int flags);
The flag for huge allocations may be useful, but I don't actually see
much value in the flag ALLOC_NO_OOM if the stuff in aset.c returns
unconditionally NULL in case of an OOM and we let palloc complain
about an OOM when allocation returns NULL. Something I am missing
perhaps?

 I definitely do not want to push the nofail stuff via the
 MemoryContextData- API into aset.c. Imo aset.c should always return
 NULL and then mcxt.c should throw the error if in the normal palloc()
 function.

 Sure, that seems reasonable ...
Yes, this would simplify the footprint of this patch to aset.c to a
minimum by changing the ereport to NULL in a couple of places.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Safe memory allocation functions

2015-01-27 Thread Andres Freund
On 2015-01-27 17:27:53 +0900, Michael Paquier wrote:
 Alvaro Herrera wrote:
  So how about something like
 
  #define ALLOCFLAG_HUGE  0x01
  #define ALLOCFLAG_NO_ERROR_ON_OOM   0x02
  void *
  MemoryContextAllocFlags(MemoryContext context, Size size, int flags);
 The flag for huge allocations may be useful, but I don't actually see
 much value in the flag ALLOC_NO_OOM if the stuff in aset.c returns
 unconditionally NULL in case of an OOM and we let palloc complain
 about an OOM when allocation returns NULL. Something I am missing
 perhaps?

I guess the idea is to have a *user facing* MemoryContextAllocExtended()
that can do both huge and no-oom allocations. Otherwise we need palloc-like
wrappers for all combinations.
We're certainly not just going to ignore memory allocation failures
generally in MemoryContextAllocExtended().

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] alter user/role CURRENT_USER

2015-01-27 Thread Kyotaro HORIGUCHI
Hello, thank you for the comment.

 Looking at this a bit, I'm not sure completely replacing RoleId with a
 node is the best idea; some of the users of that production in the
 grammar are okay with accepting only normal strings as names, and don't
 need all the CURRENT_USER etc stuff.

You're right. The productions that don't need RoleSpec already use
char* for the role member in their *Stmt structs. I might have done
a bit too much.

Adding a new nonterminal RoleId and using it duplicates a bit of the
check code for public/none and the like, but removes other
redundant checks for != CSTRING from some productions: CREATE
ROLE/USER/GROUP and ALTER ROLE/USER/GROUP RENAME.

RoleId in the patch still has rule components for CURRENT_USER,
SESSION_USER, and CURRENT_ROLE. Without them, the parser prints
an error that users cannot understand.

| =# alter role current_user rename to PuBlic;
| ERROR:  syntax error at or near "rename"
| LINE 1: alter role current_user rename to PuBlic;
| ^

With the components, the error message becomes like this.

| =# alter role current_role rename to PuBlic;
| ERROR:  reserved word cannot be used
| LINE 1: alter role current_role rename to PuBlic;
|^


   Maybe we need a new production,
 say RoleSpec, and we modify the few productions that need the extra
 flexibility?  For instance we could have ALTER USER RoleSpec instead of
 ALTER USER RoleId.  But we would keep CREATE USER RoleId, because it
 doesn't make any sense to accept CREATE USER CURRENT_USER.
 I think that would lead to a less invasive patch also.
 Am I making sense?

I think I did what you expected. It was good in that the code got
simpler, but the invasiveness doesn't seem to be reduced..

What do you think about this?
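
For reference, the kind of statement the patch is meant to allow (a sketch assuming
the patch is applied):

-- CURRENT_USER/SESSION_USER accepted where a role name is expected
ALTER ROLE CURRENT_USER SET statement_timeout = '10s';
ALTER USER SESSION_USER RESET statement_timeout;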

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From d90b7e09968f32a5b543242469fb65e304df3318 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp
Date: Fri, 14 Nov 2014 17:37:22 +0900
Subject: [PATCH] ALTER USER CURRENT_USER v4

Added RoleId for use where CURRENT_USER-like stuff is not used.
CREATE ROLE/USER/GROUP, ALTER ROLE/USER/GROUP RENAME are changed to
use RoleId.
---
 src/backend/catalog/aclchk.c   |  30 +++--
 src/backend/commands/alter.c   |   2 +-
 src/backend/commands/extension.c   |   2 +-
 src/backend/commands/foreigncmds.c |  57 -
 src/backend/commands/schemacmds.c  |  30 -
 src/backend/commands/tablecmds.c   |   4 +-
 src/backend/commands/tablespace.c  |   2 +-
 src/backend/commands/user.c|  86 +++---
 src/backend/nodes/copyfuncs.c  |  37 --
 src/backend/nodes/equalfuncs.c |  35 --
 src/backend/parser/gram.y  | 229 +
 src/backend/parser/parse_utilcmd.c |   4 +-
 src/backend/utils/adt/acl.c|  34 ++
 src/include/commands/user.h|   2 +-
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/parsenodes.h |  48 ++--
 src/include/utils/acl.h|   2 +-
 17 files changed, 400 insertions(+), 205 deletions(-)

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 1e3888e..1c90626 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -430,13 +430,16 @@ ExecuteGrantStmt(GrantStmt *stmt)
 	foreach(cell, stmt-grantees)
 	{
 		PrivGrantee *grantee = (PrivGrantee *) lfirst(cell);
+		Oid grantee_uid = ACL_ID_PUBLIC;
 
-		if (grantee-rolname == NULL)
-			istmt.grantees = lappend_oid(istmt.grantees, ACL_ID_PUBLIC);
-		else
-			istmt.grantees =
-lappend_oid(istmt.grantees,
-			get_role_oid(grantee-rolname, false));
+		/* public is mapped to ACL_ID_PUBLIC */
+		if (grantee-role-roltype != ROLESPEC_PUBLIC)
+		{
+			grantee_uid = get_rolespec_oid(grantee-role, false);
+			if (!OidIsValid(grantee_uid))
+grantee_uid = ACL_ID_PUBLIC;
+		}
+		istmt.grantees = lappend_oid(istmt.grantees, grantee_uid);
 	}
 
 	/*
@@ -913,13 +916,16 @@ ExecAlterDefaultPrivilegesStmt(AlterDefaultPrivilegesStmt *stmt)
 	foreach(cell, action-grantees)
 	{
 		PrivGrantee *grantee = (PrivGrantee *) lfirst(cell);
+		Oid grantee_uid = ACL_ID_PUBLIC;
 
-		if (grantee-rolname == NULL)
-			iacls.grantees = lappend_oid(iacls.grantees, ACL_ID_PUBLIC);
-		else
-			iacls.grantees =
-lappend_oid(iacls.grantees,
-			get_role_oid(grantee-rolname, false));
+		/* public is mapped to ACL_ID_PUBLIC */
+		if (grantee-role-roltype != ROLESPEC_PUBLIC)
+		{
+			grantee_uid = get_rolespec_oid(grantee-role, false);
+			if (!OidIsValid(grantee_uid))
+grantee_uid = ACL_ID_PUBLIC;
+		}
+		iacls.grantees = lappend_oid(iacls.grantees, grantee_uid);
 	}
 
 	/*
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 78b54b4..1d8799b 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -679,7 +679,7 @@ AlterObjectNamespace_internal(Relation rel, Oid objid, Oid nspOid)
 Oid
 

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-01-27 Thread Abhijit Menon-Sen
At 2015-01-26 18:47:29 -0600, jim.na...@bluetreble.com wrote:

 Anything happen with this?

Nothing so far.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Partitioning: issues/ideas (Was: Re: [HACKERS] On partitioning)

2015-01-27 Thread Amit Langote
On 27-01-2015 AM 05:46, Jim Nasby wrote:
 On 1/25/15 7:42 PM, Amit Langote wrote:
 On 21-01-2015 PM 07:26, Amit Langote wrote:
 Ok, I will limit myself to focusing on following things at the moment:

 * Provide syntax in CREATE TABLE to declare partition key

 While working on this, I stumbled upon the question of how we deal with
 any index definitions following from constraints defined in a CREATE
 statement. I think we do not want to have a physical index created for a
 table that is partitioned (in other words, has no heap of itself). As
 the current mechanisms dictate, constraints like PRIMARY KEY, UNIQUE,
 EXCLUSION CONSTRAINT are enforced as indexes. It seems there are really
 two decisions to make here:

 1) how do we deal with any index definitions (either explicit or
 implicit following from constraints defined on it) - do we allow them by
 marking them specially, say, in pg_index, as being mere
 placeholders/templates or invent some other mechanism?

 2) As a short-term solution, do we simply reject creating any indexes
 (/any constraints that require them) on a table whose definition also
 includes PARTITION ON clause? Instead define them on its partitions (or
 any relations in hierarchy that are not further partitioned).

 Or maybe I'm missing something...
 
 Wasn't the idea that the parent table in a partitioned table wouldn't
 actually have a heap of its own? If there's no heap there can't be an
 index.


Yes, that's right. Perhaps we should not look at the heap-less partitioned
relation thingy so soon, as you say below.

 That said, I think this is premature optimization that could be done later.

It seems so.
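
For concreteness, the kind of declaration being discussed might look roughly like this
(a hypothetical sketch only; the exact shape of the PARTITION ON clause is what is
still being worked out):

-- hypothetical syntax sketch for a partitioned parent with no heap of its own
CREATE TABLE measurement (
    logdate   date NOT NULL,
    peaktemp  int
) PARTITION ON (logdate);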

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Safe memory allocation functions

2015-01-27 Thread Michael Paquier
On Sat, Jan 17, 2015 at 11:06 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jan 16, 2015 at 10:56 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 So how about something like

 #define ALLOCFLAG_HUGE  0x01
 #define ALLOCFLAG_NO_ERROR_ON_OOM   0x02
 void *
 MemoryContextAllocFlags(MemoryContext context, Size size, int flags);

 That sounds good, although personally I'd rather have the name be
 something like MemoryContextAllocExtended; we have precedent for using
 Extended for this sort of thing elsewhere.  Also, I'd suggest trying
 to keep the flag name short, e.g. ALLOC_HUGE and ALLOC_NO_OOM (or
 ALLOC_SOFT_FAIL?).
Yes, I think that this name makes more sense (LockAcquire[Extended],
RangeVarGetRelid[Extended]), as well as keeping the flag names short.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Stephen Frost
Robert, all,

* Robert Haas (robertmh...@gmail.com) wrote:
 There is a considerable amount of variation in the amount of time this
 takes to run based on how much of the relation is cached.  Clearly,
 there's no way for the system to cache it all, but it can cache a
 significant portion, and that affects the results to no small degree.
 dd on hydra prints information on the data transfer rate; on uncached
 1GB segments, it runs at right around 400 MB/s, but that can soar to
 upwards of 3GB/s when the relation is fully cached.  I tried flushing
 the OS cache via echo 1 > /proc/sys/vm/drop_caches, and found that
 immediately after doing that, the above command took 5m21s to run -
 i.e. ~321000 ms.  Most of your test times are faster than that, which
 means they reflect some degree of caching.  When I immediately reran
 the command a second time, it finished in 4m18s the second time, or
 ~258000 ms.  The rate was the same as the first test - about 400 MB/s
 - for most of the files, but 27 of the last 28 files went much faster,
 between 1.3 GB/s and 3.7 GB/s.

[...]

 With 0 workers, first run took 883465.352 ms, and second run took 295050.106 
 ms.
 With 8 workers, first run took 340302.250 ms, and second run took 307767.758 
 ms.
 
 This is a confusing result, because you expect parallelism to help
 more when the relation is partly cached, and make little or no
 difference when it isn't cached.  But that's not what happened.

These numbers seem to indicate that the oddball is the single-threaded
uncached run.  If I followed correctly, the uncached 'dd' took 321s,
which is relatively close to the uncached-lots-of-workers and the two
cached runs.  What in the world is the uncached single-thread case doing
that it takes an extra 543s, or over twice as long?  It's clearly not
disk i/o which is causing the slowdown, based on your dd tests.

One possibility might be round-trip latency.  The multi-threaded case is
able to keep the CPUs and the i/o system going, and the cached results
don't have as much latency since things are cached, but the
single-threaded uncached case going i/o -> cpu -> i/o -> cpu, ends up
with a lot of wait time as it switches between being on CPU and waiting
on the i/o.

Just some thoughts.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Robert Haas
On Thu, Jan 22, 2015 at 5:57 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 Script used to test is attached (parallel_count.sh)

Why does this use EXPLAIN ANALYZE instead of \timing ?
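
(For reference, the \timing route in psql would be simply:)

-- time the bare query rather than an EXPLAIN ANALYZE of it
\timing on
SELECT count(*) FROM tbl_perf;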

 IBM POWER-7 16 cores, 64 hardware threads
 RAM = 64GB

 Table Size - 120GB

 Used below statements to create table -
 create table tbl_perf(c1 int, c2 char(1000));
 insert into tbl_perf values(generate_series(1,10000000),'a');
 insert into tbl_perf values(generate_series(10000001,30000000),'a');
 insert into tbl_perf values(generate_series(30000001,110000000),'a');

I generated this table using this same method and experimented with
copying the whole file to the bit bucket using dd.  I did this on
hydra, which I think is the same machine you used.

time for i in `seq 0 119`; do if [ $i -eq 0 ]; then f=16388; else
f=16388.$i; fi; dd if=$f of=/dev/null bs=8k; done

There is a considerable amount of variation in the amount of time this
takes to run based on how much of the relation is cached.  Clearly,
there's no way for the system to cache it all, but it can cache a
significant portion, and that affects the results to no small degree.
dd on hydra prints information on the data transfer rate; on uncached
1GB segments, it runs at right around 400 MB/s, but that can soar to
upwards of 3GB/s when the relation is fully cached.  I tried flushing
the OS cache via echo 1 > /proc/sys/vm/drop_caches, and found that
immediately after doing that, the above command took 5m21s to run -
i.e. ~321000 ms.  Most of your test times are faster than that, which
means they reflect some degree of caching.  When I immediately reran
the command a second time, it finished in 4m18s the second time, or
~258000 ms.  The rate was the same as the first test - about 400 MB/s
- for most of the files, but 27 of the last 28 files went much faster,
between 1.3 GB/s and 3.7 GB/s.

This tells us that the OS cache on this machine has anti-spoliation
logic in it, probably not dissimilar to what we have in PG.  If the
data were cycled through the system cache in strict LRU fashion, any
data that was leftover from the first run would have been flushed out
by the early part of the second run, so that all the results from the
second set of runs would have hit the disk.  But in fact, that's not
what happened: the last pages from the first run remained cached even
after reading an amount of new data that exceeds the size of RAM on
that machine.  What I think this demonstrates is that we're going to
have to be very careful to control for caching effects, or we may find
that we get misleading results.  To make this simpler, I've installed
a setuid binary /usr/bin/drop_caches that you (or anyone who has an
account on that machine) can use to drop the caches; run 'drop_caches
1'.

 Block-By-Block

 No. of workers/Time (ms) 0 2
 Run-1 267798 295051
 Run-2 276646 296665
 Run-3 281364 314952
 Run-4 290231 326243
 Run-5 288890 295684

The next thing I did was run test with the block-by-block method after
having dropped the caches.  I did this with 0 workers and with 8
workers.  I dropped the caches and restarted postgres before each
test, but then ran each test a second time to see the effect of
caching by both the OS and by PostgreSQL.  I got these results:

With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
With 8 workers, first run took 340302.250 ms, and second run took 307767.758 ms.

This is a confusing result, because you expect parallelism to help
more when the relation is partly cached, and make little or no
difference when it isn't cached.  But that's not what happened.

I've also got a draft of a prefetching implementation here that I'd
like to test out, but I've just discovered that it's buggy, so I'm
going to send these results for now and work on fixing that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
Example:

postgres=# do $$
declare r record;
declare k text; v text;
begin
  for r in select * from foo loop
foreach k,v in array row_to_array(r) loop
  raise notice 'k: %, v: %', k, v;
end loop;
  end loop;
end;
$$;
NOTICE:  k: a, v: 2
NOTICE:  k: b, v: NAZDAR
NOTICE:  k: c, v: 2015-01-27
NOTICE:  k: a, v: 2
NOTICE:  k: b, v: AHOJ
NOTICE:  k: c, v: 2015-01-27
DO

Regards

Pavel

2015-01-27 21:26 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:

 Hello

here is an initial version of the row_to_array function - it transforms any row to
an array in the format proposed by Andrew.

 Regards

 Pavel

 2015-01-27 19:58 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:

 Hi

 2015-01-27 11:41 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:



 2015-01-26 21:44 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/25/15 4:23 AM, Pavel Stehule wrote:


 I tested a concept iteration over array in format [key1, value1, key2,
 value2, .. ] - what is nice, it works for [[key1,value1],[key2, value2],
 ...] too

 It is only a few lines more to current code, and this change doesn't
 break a compatibility.

 Do you think, so this patch is acceptable?

 Ideas, comments?


 Aside from fixing the comments... I think this needs more tests on
 corner cases. For example, what happens when you do

 foreach a, b, c in array(array(1,2),array(3,4)) ?


 it is relative simple behave -- empty values are NULL

 array(1,2),array(3,4) -- do you think ARRAY[[1,2],[3,4]] is effectively
 ARRAY[1,2,3,4]



 Or the opposite case of

 foreach a,b in array(array(1,2,3))

 Also, what about:

 foreach a,b in '{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[] ?



  postgres=# select array(select
 unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
array
 ---
  {1,2,3,4,5,6,7,8}
 (1 row)

 so it generate pairs {1,2}{3,4},{5,6},{7,8}


 I fixed situation when array has not enough elements.

 More tests, simple doc

 Regards

 Pavel



 Regards

 Pavel Stehule


 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com







Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
Hello

here is an initial version of the row_to_array function - it transforms any
row to an array in the format proposed by Andrew.

Regards

Pavel

2015-01-27 19:58 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:

 Hi

 2015-01-27 11:41 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:



 2015-01-26 21:44 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/25/15 4:23 AM, Pavel Stehule wrote:


 I tested a concept iteration over array in format [key1, value1, key2,
 value2, .. ] - what is nice, it works for [[key1,value1],[key2, value2],
 ...] too

 It is only a few lines more to current code, and this change doesn't
 break a compatibility.

 Do you think, so this patch is acceptable?

 Ideas, comments?


 Aside from fixing the comments... I think this needs more tests on
 corner cases. For example, what happens when you do

 foreach a, b, c in array(array(1,2),array(3,4)) ?


 it is relative simple behave -- empty values are NULL

 array(1,2),array(3,4) -- do you think ARRAY[[1,2],[3,4]] is effectively
 ARRAY[1,2,3,4]



 Or the opposite case of

 foreach a,b in array(array(1,2,3))

 Also, what about:

 foreach a,b in '{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[] ?



  postgres=# select array(select
 unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
array
 ---
  {1,2,3,4,5,6,7,8}
 (1 row)

 so it generate pairs {1,2}{3,4},{5,6},{7,8}


 I fixed situation when array has not enough elements.

 More tests, simple doc

 Regards

 Pavel



 Regards

 Pavel Stehule


 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com




diff --git a/src/backend/utils/adt/rowtypes.c b/src/backend/utils/adt/rowtypes.c
new file mode 100644
index 3dc9a84..d758d2d
*** a/src/backend/utils/adt/rowtypes.c
--- b/src/backend/utils/adt/rowtypes.c
***
*** 21,26 
--- 21,27 
  #include "catalog/pg_type.h"
  #include "funcapi.h"
  #include "libpq/pqformat.h"
+ #include "utils/array.h"
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/typcache.h"
*** btrecordimagecmp(PG_FUNCTION_ARGS)
*** 1810,1812 
--- 1811,1898 
  {
  	PG_RETURN_INT32(record_image_cmp(fcinfo));
  }
+ 
+ /*
+  * transform any record to array in format [key1, value1, key2, value2 [, ...]]
+  */
+ Datum
+ row_to_array(PG_FUNCTION_ARGS)
+ {
+ 	HeapTupleHeader		rec = PG_GETARG_HEAPTUPLEHEADER(0);
+ 	TupleDesc		rectupdesc;
+ 	Oid			rectuptyp;
+ 	int32			rectuptypmod;
+ 	HeapTupleData		rectuple;
+ 	int	ncolumns;
+ 	Datum 		*recvalues;
+ 	bool  		*recnulls;
+ 	ArrayBuildState		*builder;
+ 	int	i;
+ 
+ 	/* Extract type info from the tuple itself */
+ 	rectuptyp = HeapTupleHeaderGetTypeId(rec);
+ 	rectuptypmod = HeapTupleHeaderGetTypMod(rec);
+ 	rectupdesc = lookup_rowtype_tupdesc(rectuptyp, rectuptypmod);
+ 	ncolumns = rectupdesc->natts;
+ 
+ 	/* Build a temporary HeapTuple control structure */
+ 	rectuple.t_len = HeapTupleHeaderGetDatumLength(rec);
+ 	ItemPointerSetInvalid(&(rectuple.t_self));
+ 	rectuple.t_tableOid = InvalidOid;
+ 	rectuple.t_data = rec;
+ 
+ 	recvalues = (Datum *) palloc(ncolumns * sizeof(Datum));
+ 	recnulls = (bool *) palloc(ncolumns * sizeof(bool));
+ 
+ 	/* Break down the tuple into fields */
+ 	heap_deform_tuple(&rectuple, rectupdesc, recvalues, recnulls);
+ 
+ 	/* Prepare target array */
+ 	builder = initArrayResult(TEXTOID, CurrentMemoryContext);
+ 
+ 	for (i = 0; i < ncolumns; i++)
+ 	{
+ 		Oid	columntyp = rectupdesc->attrs[i]->atttypid;
+ 		Datum		value;
+ 		bool		isnull;
+ 
+ 		/* Ignore dropped columns */
+ 		if (rectupdesc->attrs[i]->attisdropped)
+ 			continue;
+ 
+ 		builder = accumArrayResult(builder,
+ 			CStringGetTextDatum(NameStr(rectupdesc->attrs[i]->attname)),
+ 			false,
+ 			TEXTOID,
+ 			CurrentMemoryContext);
+ 
+ 		if (!recnulls[i])
+ 		{
+ 			char *outstr;
+ 			bool		typIsVarlena;
+ 			Oid		typoutput;
+ 			FmgrInfo		proc;
+ 
+ 			getTypeOutputInfo(columntyp, &typoutput, &typIsVarlena);
+ 			fmgr_info_cxt(typoutput, &proc, CurrentMemoryContext);
+ 			outstr = OutputFunctionCall(&proc, recvalues[i]);
+ 
+ 			value = CStringGetTextDatum(outstr);
+ 			isnull = false;
+ 		}
+ 		else
+ 		{
+ 			value = (Datum) 0;
+ 			isnull = true;
+ 		}
+ 
+ 		builder = accumArrayResult(builder,
+ 		value, isnull,
+ 		TEXTOID,
+ 		CurrentMemoryContext);
+ 	}
+ 
+ 	ReleaseTupleDesc(rectupdesc);
+ 
+ 	PG_RETURN_DATUM(makeArrayResult(builder, CurrentMemoryContext));
+ }
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
new file mode 100644
index 9edfdb8..a27cf4a
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*** DATA(insert OID = 376 (  string_to_array
*** 891,896 
--- 891,898 
  DESCR(split delimited text into text[], with null string);
  DATA(insert OID = 384 (  array_to_string   PGNSP PGUID 12 1 0 0 0 f f f f f f s 3 0 25 2277 25 25 _null_ _null_ _null_ _null_ array_to_text_null _null_ _null_ _null_ ));
  DESCR(concatenate array elements, 

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 02:28 PM, Tom Lane wrote:
 Well, we can either fix it now or suffer with a broken representation
 forever.  I'm not wedded to the exact solution I described, but I think
 we'll regret it if we don't change the representation.
 
 The only other plausible answer seems to be to flat out reject \u0000.
 But I assume nobody likes that.

 I don't think we can be in the business of rejecting valid JSON.

Actually, after studying the code a bit, I wonder if we wouldn't be best
off to do exactly that, at least for 9.4.x.  At minimum we're talking
about an API change for JsonSemAction functions (which currently get the
already-de-escaped string as a C string; not gonna work for embedded
nulls).  I'm not sure if there are already third-party extensions using
that API, but it seems possible, in which case changing it in a minor
release wouldn't be nice.  Even ignoring that risk, making sure
we'd fixed everything seems like more than a day's work, which is as
much as I for one could spare before 9.4.1.
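
(To spell out the embedded-null problem: once \u0000 has been de-escaped, a
NUL-terminated C string -- which is what JsonSemAction callbacks currently
receive -- silently truncates at the null.  A toy illustration, not
PostgreSQL code:)

#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* "foo" U+0000 "bar", as it would look after de-escaping \u0000 */
    const char str[] = {'f', 'o', 'o', '\0', 'b', 'a', 'r', '\0'};

    /* every str*() consumer silently loses everything past the NUL */
    printf("%zu %s\n", strlen(str), str);   /* prints: 3 foo */
    return 0;
}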

Also, while the idea of throwing error only when a \0 needs to be
converted to text seems logically clean, it looks like that might pretty
much cripple the usability of such values anyhow, because we convert to
text at the drop of a hat.  So some investigation and probably additional
work would be needed to ensure you could do at least basic things with
such values.  (A function for direct conversion to bytea might be useful
too.)

I think the "it would mean rejecting valid JSON" argument is utter
hogwash.  We already reject, eg, \u00A0 if you're not using a UTF8
encoding.  And we reject 1e10000, not because that's invalid JSON
but because of an implementation restriction of our underlying numeric
type.  I don't see any moral superiority of that over rejecting \u0000
because of an implementation restriction of our underlying text type.

So at this point I propose that we reject \u0000 when de-escaping JSON.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.

We probably need to rethink the re-escaping behavior as well; I'm not
sure if your latest patch is the right answer for that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: searching in array function - array_position

2015-01-27 Thread Jim Nasby

On 1/27/15 4:36 AM, Pavel Stehule wrote:

2015-01-26 23:29 GMT+01:00 Jim Nasby jim.na...@bluetreble.com 
mailto:jim.na...@bluetreble.com:

On 1/26/15 4:17 PM, Pavel Stehule wrote:

 Any way to reduce the code duplication between the array and 
non-array versions? Maybe factor out the operator caching code?


I though about it - but there is different checks, different result 
processing, different result type.

I didn't find any readable and reduced form :(


Yeah, that's why I was thinking specifically of the operator caching 
code... isn't that identical? That would at least remove a dozen lines...


It is only partially identical - I would to use cache for array_offset, but it 
is not necessary for array_offsets .. depends how we would to modify current 
API to support externally cached data.


Externally cached data?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 5:55 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 BTW, I know that at least earlier versions of EnterpriseDB's version of
 Postgres (circa 2007) had an auditing feature. I never dealt with any
 customers who were using it when I was there, but perhaps some other folks
 could shed some light on what customers wanted to see an an auditing
 feature... (I'm looking at you, Jimbo!)

It's still there, but it's nothing like pgaudit.  What I hear is that
our customers are looking for something that has the capabilities of
DBMS_FGA.  I haven't researched either that or pgaudit enough to know
how similar they are.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - Assert statement

2015-01-27 Thread Jim Nasby

On 1/27/15 1:30 PM, Pavel Stehule wrote:

I don't see the separate warning as being helpful. I'd just do something 
like

+(err_hint != NULL) ? errhint("%s", err_hint) : 
errhint("Message attached to failed assertion is null") ));


done


There should also be a test case for a NULL message.


is there, if I understand well


I see it now. Looks good.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shortcoming in CLOBBER_FREED_MEMORY coverage: disk buffer pointers

2015-01-27 Thread Tom Lane
Jim Nasby jim.na...@bluetreble.com writes:
 On 1/26/15 6:11 PM, Greg Stark wrote:
 Fwiw I think our experience is that bugs where buffers are unpinned get 
 exposed pretty quickly in production. I suppose the same might not be true 
 for rarely called codepaths or in cases where the buffers are usually 
 already pinned.

 Yeah, that's what I was thinking. If there's some easy way to correctly 
 associate pins with specific code paths (owners?) then maybe it's worth doing 
 so; but I don't think it's worth much effort.

If you have a working set larger than shared_buffers, then yeah it's
likely that reference-after-unpin bugs would manifest pretty quickly.
This does not necessarily translate into something reproducible or
debuggable, however; and besides that you'd really rather that such
bugs not get into the field in the first place.

The point of my Valgrind proposal was to provide a mechanism that would
have a chance of catching such bugs in a *development* context, and
provide some hint of where in the codebase the fault is, too.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Jim Nasby

On 1/27/15 2:26 PM, Pavel Stehule wrote:

here is a initial version of row_to_array function - transform any row to array 
in format proposed by Andrew.


Please start a new thread for this... does it depend on the key-value patch?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In particular, I would like to suggest that the current representation of
 \u0000 is fundamentally broken and that we have to change it, not try to
 band-aid around it.  This will mean an on-disk incompatibility for jsonb
 data containing U+0000, but hopefully there is very little of that out
 there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
 consider such solutions.
 
 The most obvious way to store such data unambiguously is to just go ahead
 and store U+0000 as a NUL byte (\000).  The only problem with that is that
 then such a string cannot be considered to be a valid value of type TEXT,
 which would mean that we'd need to throw an error if we were asked to
 convert a JSON field containing such a character to text.

 Hm, does this include text out operations for display purposes?   If
 so, then any query selecting jsonb objects with null bytes would fail.
 How come we have to error out?  How about a warning indicating the
 string was truncated?

No, that's not a problem, because jsonb_out would produce an escaped
textual representation, so any null would come out as \u0000 again.
The trouble comes up when you do something that's supposed to produce
a *non escaped* text equivalent of a JSON string value, such as
the ->> operator.

Arguably, ->> is broken already with the current coding, in that
these results are entirely inconsistent:

regression=# select '{"a":"foo\u0040bar"}'::jsonb ->> 'a';
 ?column? 
----------
 foo@bar
(1 row)

regression=# select '{"a":"foo\u0000bar"}'::jsonb ->> 'a';
   ?column?   
--------------
 foo\u0000bar
(1 row)

regression=# select '{"a":"foo\\u0000bar"}'::jsonb ->> 'a';
   ?column?   
--------------
 foo\u0000bar
(1 row)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Jim Nasby

On 1/27/15 12:58 PM, Pavel Stehule wrote:

  postgres=# select array(select 
unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
array
---
  {1,2,3,4,5,6,7,8}
(1 row)

so it generate pairs {1,2}{3,4},{5,6},{7,8}


I fixed situation when array has not enough elements.

More tests, simple doc


Hrm, this wasn't what I was expecting:

+ select foreach_test_ab(array[1,2,3,4]);
+ NOTICE:  a: 1, b: 2
+ NOTICE:  a: 3, b: 4

I was expecting that foreach a,b array would be expecting something in the 
array to have a dimension of 2. :(

I think this is bad, because this:

foreach_test_ab('{{1,2,3},{4,5,6}}'::int[]);

will give you 1,2; 3,4; 5,6. I don't see any way that that makes sense. Even if 
it did make sense, I'm more concerned that adding this will seriously paint us 
into a corner when it comes to the (to me) more rational case of returning 
{1,2,3},{4,5,6}.

I think we need to think some more about this, at least to make sure we're not 
painting ourselves into a corner for more appropriate array iteration.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dereferenced pointers checked as NULL in btree_utils_var.c

2015-01-27 Thread Michael Paquier
On Wed, Jan 28, 2015 at 3:15 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 So I'm fine with taking out both this documentation text and the dead
 null-pointer checks; but let's do that all in one patch not piecemeal.
 In any case, this is just cosmetic cleanup; there's no actual hazard
 here.
Attached is a patch with all those things done. I have also added an
assertion in gistKeyIsEQ checking that the input datums are not NULL. I
believe this is still useful for developers; feel free to rip it
out of the patch if you think otherwise.
Regards,
-- 
Michael
From 95b051aac56de68412a7b0635484e46eb4e24ad0 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Wed, 28 Jan 2015 09:36:11 +0900
Subject: [PATCH] Remove dead NULL-pointer checks in GiST code

The key value passed to the "same" method is generated by either the
compress, union or picksplit methods, but it happens that none of them
actually passes a NULL pointer. Some compress methods do not check for a
NULL pointer, which would have resulted in a crash had one been passed.
Note that gist_poly_compress() and gist_circle_compress() do check for a
NULL, but they only return a NULL key if the input key is NULL, which
cannot happen because the input comes from a table.

This commit also removes a documentation note added in commit a0a3883
that was introduced based on the fact that some implementations of the
"same" method were doing NULL-pointer checks, and adds an assertion
in gistKeyIsEQ to ensure that no NULL keys are passed when calling the
"same" method.
---
 contrib/btree_gist/btree_utils_num.c |  8 ++
 contrib/btree_gist/btree_utils_var.c | 10 ++-
 doc/src/sgml/gist.sgml   |  5 
 src/backend/access/gist/gistproc.c   | 56 
 src/backend/access/gist/gistutil.c   |  3 ++
 5 files changed, 25 insertions(+), 57 deletions(-)

diff --git a/contrib/btree_gist/btree_utils_num.c b/contrib/btree_gist/btree_utils_num.c
index 505633c..7c5b911 100644
--- a/contrib/btree_gist/btree_utils_num.c
+++ b/contrib/btree_gist/btree_utils_num.c
@@ -147,13 +147,9 @@ gbt_num_same(const GBT_NUMKEY *a, const GBT_NUMKEY *b, const gbtree_ninfo *tinfo
 	b2.lower = &(((GBT_NUMKEY *) b)[0]);
 	b2.upper = &(((GBT_NUMKEY *) b)[tinfo->size]);
 
-	if (
-		(*tinfo->f_eq) (b1.lower, b2.lower) &&
-		(*tinfo->f_eq) (b1.upper, b2.upper)
-		)
-		return TRUE;
-	return FALSE;
 
+	return ((*tinfo->f_eq) (b1.lower, b2.lower) &&
+			(*tinfo->f_eq) (b1.upper, b2.upper));
 }
 
 
diff --git a/contrib/btree_gist/btree_utils_var.c b/contrib/btree_gist/btree_utils_var.c
index b7dd060..6ad3347 100644
--- a/contrib/btree_gist/btree_utils_var.c
+++ b/contrib/btree_gist/btree_utils_var.c
@@ -337,7 +337,6 @@ bool
 gbt_var_same(Datum d1, Datum d2, Oid collation,
 			 const gbtree_vinfo *tinfo)
 {
-	bool		result;
 	GBT_VARKEY *t1 = (GBT_VARKEY *) DatumGetPointer(d1);
 	GBT_VARKEY *t2 = (GBT_VARKEY *) DatumGetPointer(d2);
 	GBT_VARKEY_R r1,
@@ -346,13 +345,8 @@ gbt_var_same(Datum d1, Datum d2, Oid collation,
 	r1 = gbt_var_key_readable(t1);
 	r2 = gbt_var_key_readable(t2);
 
-	if (t1 && t2)
-		result = ((*tinfo->f_cmp) (r1.lower, r2.lower, collation) == 0 &&
-				  (*tinfo->f_cmp) (r1.upper, r2.upper, collation) == 0);
-	else
-		result = (t1 == NULL && t2 == NULL);
-
-	return result;
+	return ((*tinfo->f_cmp) (r1.lower, r2.lower, collation) == 0 &&
+			(*tinfo->f_cmp) (r1.upper, r2.upper, collation) == 0);
 }
 
 
diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index 5de282b..f9644b0 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -498,11 +498,6 @@ my_compress(PG_FUNCTION_ARGS)
course.
   /para
 
-  para
-Depending on your needs, you could also need to care about
-compressing literalNULL/ values in there, storing for example
-literal(Datum) 0/ like literalgist_circle_compress/ does.
-  /para
  /listitem
 /varlistentry
 
diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c
index 4decaa6..1df6357 100644
--- a/src/backend/access/gist/gistproc.c
+++ b/src/backend/access/gist/gistproc.c
@@ -1039,25 +1039,15 @@ gist_poly_compress(PG_FUNCTION_ARGS)
 
 	if (entry->leafkey)
 	{
-		retval = palloc(sizeof(GISTENTRY));
-		if (DatumGetPointer(entry->key) != NULL)
-		{
-			POLYGON    *in = DatumGetPolygonP(entry->key);
-			BOX		   *r;
+		POLYGON    *in = DatumGetPolygonP(entry->key);
+		BOX		   *r;
 
-			r = (BOX *) palloc(sizeof(BOX));
-			memcpy((void *) r, (void *) &(in->boundbox), sizeof(BOX));
-			gistentryinit(*retval, PointerGetDatum(r),
-						  entry->rel, entry->page,
-						  entry->offset, FALSE);
-
-		}
-		else
-		{
-			gistentryinit(*retval, (Datum) 0,
-						  entry->rel, entry->page,
-						  entry->offset, FALSE);
-		}
+		retval = palloc(sizeof(GISTENTRY));
+		r = (BOX *) palloc(sizeof(BOX));
+		memcpy((void *) r, (void *) &(in->boundbox), sizeof(BOX));
+		gistentryinit(*retval, PointerGetDatum(r),
+					  entry->rel, entry->page,
+					  entry->offset, FALSE);
 	}
 	else
 		retval = entry;
@@ 

Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread David Steele
On 1/27/15 6:09 PM, Jim Nasby wrote:
 The whole remote_dir discussion made me think of something... would
 --link-dest be any help here?

I'm pretty sure --link-dest would not be effective in this case.  The
problem exists on the source side and --link-dest only operates on the
destination.

-- 
- David Steele
da...@pgmasters.net




signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 6:54 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 On 1/27/15 5:06 PM, Robert Haas wrote:

 On Tue, Jan 27, 2015 at 5:55 PM, Jim Nasby jim.na...@bluetreble.com
 wrote:

 BTW, I know that at least earlier versions of EnterpriseDB's version of
 Postgres (circa 2007) had an auditing feature. I never dealt with any
 customers who were using it when I was there, but perhaps some other
 folks
 could shed some light on what customers wanted to see an an auditing
 feature... (I'm looking at you, Jimbo!)


 It's still there, but it's nothing like pgaudit.  What I hear is that
 our customers are looking for something that has the capabilities of
 DBMS_FGA.  I haven't researched either that or pgaudit enough to know
 how similar they are.


 Do they really need the full capabilities of DBMS_FGA? At first glance, it
 looks even more sophisticated than anything that's been discussed so far. To
 wit (from
 http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_fga.htm#CDEIECAG):

 DBMS_FGA.ADD_POLICY(
object_schema  VARCHAR2,
object_nameVARCHAR2,
policy_nameVARCHAR2,
audit_conditionVARCHAR2,
audit_column   VARCHAR2,
handler_schema VARCHAR2,
handler_module VARCHAR2,
enable BOOLEAN,
statement_typesVARCHAR2,
audit_trailBINARY_INTEGER IN DEFAULT,
audit_column_opts  BINARY_INTEGER IN DEFAULT);

I don't know.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 2:16 PM, Andres Freund and...@2ndquadrant.com wrote:
 That'd end up essentially being a re-emulation of locks. Don't find that
 all that pretty. But maybe we have to go there.

The advantage of it is that it is simple.  The only thing we're really
giving up is the deadlock detector, which I think isn't needed in this
case.

 Here's an alternative approach. I think it generally is superior and
 going in the right direction, but I'm not sure it's backpatchable.

I tend think this is changing too many things to back-patch.  It might
all be fine, but it's pretty wide-reaching, so the chances of
collateral damage seem non-trivial.  Even in master, I'm not sure I
see the point in rocking the apple cart to this degree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] FETCH limited by bytes.

2015-01-27 Thread Corey Huinker
Last year I was working on a patch to postgres_fdw where the fetch_size
could be set at the table level and the server level.

I was able to get the settings parsed and they would show up in
pg_foreign_table
and pg_foreign_servers. Unfortunately, I'm not very familiar with how
foreign data wrappers work, so I wasn't able to figure out how to get these
custom values passed from the PgFdwRelationInfo struct into the
query's PgFdwScanState
struct.

I bring this up only because it might be a simpler solution, in that the
table designer could set the fetch size very high for narrow tables, and
lower or default for wider tables. It's also a very clean syntax, just
another option on the table and/or server creation.

My incomplete patch is attached.



On Tue, Jan 27, 2015 at 4:24 AM, Kyotaro HORIGUCHI 
horiguchi.kyot...@lab.ntt.co.jp wrote:

 Thank you for the comment.

 The automatic way to determin the fetch_size looks become too
 much for the purpose. An example of non-automatic way is a new
 foreign table option like 'fetch_size' but this exposes the
 inside too much... Which do you think is preferable?

 Thu, 22 Jan 2015 11:17:52 -0500, Tom Lane t...@sss.pgh.pa.us wrote in 
 24503.1421943...@sss.pgh.pa.us
  Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp writes:
   Hello, as the discuttion on async fetching on postgres_fdw, FETCH
   with data-size limitation would be useful to get memory usage
   stability of postgres_fdw.
 
   Is such a feature and syntax could be allowed to be added?
 
  This seems like a lot of work, and frankly an incredibly ugly API,
  for a benefit that is entirely hypothetical.  Have you got numbers
  showing any actual performance win for postgres_fdw?

 The API is a rush work to make the path for the new parameter
 (but, yes, I did too much for the purpose that use from
 postgres_fdw..)  and it can be any saner syntax but it's not the
 time to do so yet.

 The data-size limitation, any size to limit, would give
 significant gain especially for small sized rows.

 This patch began from the fact that it runs about twice faster
 when fetch size = 10000 than 100.


 http://www.postgresql.org/message-id/20150116.171849.109146500.horiguchi.kyot...@lab.ntt.co.jp

 I took exec times to get 1M rows from localhost via postgres_fdw
 and it showed the following numbers.

 =# SELECT a from ft1;
 fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
 (local)                          0.75s
 100       60                     6.2s       6000 (0.006)
 10000     60                     2.7s     600000 (0.6  )
 30000     60                     2.2s    1800000 (2.0  )
 60000     60                     2.4s    3600000 (4.0  )

 =# SELECT a, b, c from ft1;
 fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
 (local)                          0.8s
 100      204                    12  s      20400 (0.02 )
 1000     204                    10  s     204000 (0.2  )
 10000    204                     5.8s    2040000 (2)
 20000    204                     5.9s    4080000 (4)

 =# SELECT a, b, d from ft1;
 fetch_size,   avg row size(*1),   time,   alloced_mem/fetch(Mbytes)(*1)
 (local)                          0.8s
 100     1356                    17  s     135600 (0.136)
 1000    1356                    15  s    1356000 (1.356)
 1475    1356                    13  s    2000100 (2.0  )
 2950    1356                    13  s    4000200 (4.0  )

 The definitions of the environment are the following.

 CREATE SERVER sv1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host
 'localhost', dbname 'postgres');
 CREATE USER MAPPING FOR PUBLIC SERVER sv1;
 CREATE TABLE lt1 (a int, b timestamp, c text, d text);
 CREATE FOREIGN TABLE ft1 (a int, b timestamp, c text, d text) SERVER sv1
 OPTIONS (table_name 'lt1');
 INSERT INTO lt1 (SELECT a, now(), repeat('x', 128), repeat('x', 1280) FROM
 generate_series(0, 999999) a);

 The avg row size is alloced_mem/fetch_size and the alloced_mem
 is the sum of HeapTuple[fetch_size] and (HEAPTUPLESIZE +
 tup-t_len) for all stored tuples in the receiver side,
 fetch_more_data() in postgres_fdw.

 They are about 50% gain for the smaller tuple size and 25% for
 the larger. They looks to be optimal at where alloced_mem is
 around 2MB by the reason unknown to me. Anyway the difference
 seems to be significant.

  Even if we wanted to do something like this, I strongly object to
  measuring size by heap_compute_data_size.  That's not a number that users
  would normally have any direct knowledge of; nor does it have anything
  at all to do with the claimed use-case, where what you'd really need to
  measure is bytes transmitted down the wire.  (The difference is not
 small:
  for instance, toasted values would likely still be toasted at the point
  where you're measuring.)

 Sure. Finally, the attached patch #1 which does the following
 things.

  - Sender limits the number of tuples using the sum of the net
length of 

Re: [HACKERS] Additional role attributes superuser review

2015-01-27 Thread Adam Brightwell
All,

I have attached an updated patch for review.

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
new file mode 100644
index 62305d2..fd4b9ab
*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***
*** 1445,1450 
--- 1445,1486 
   /row
  
   row
+   entrystructfieldrolbypassrls/structfield/entry
+   entrytypebool/type/entry
+   entryRole can bypass row level security/entry
+  /row
+ 
+  row
+   entrystructfieldrolexclbackup/structfield/entry
+   entrytypebool/type/entry
+   entryRole can perform on-line exclusive backup operations/entry
+  /row
+ 
+  row
+   entrystructfieldrolxlogreplay/structfield/entry
+   entrytypebool/type/entry
+   entryRole can control xlog recovery replay operations/entry
+  /row
+ 
+  row
+   entrystructfieldrollogfile/structfield/entry
+   entrytypebool/type/entry
+   entryRole can rotate log files/entry
+  /row
+ 
+  row
+   entrystructfieldrolmonitor/structfield/entry
+   entrytypebool/type/entry
+   entryRole can view pg_stat_* details/entry
+  /row
+ 
+  row
+   entrystructfieldrolsignal/structfield/entry
+   entrytypebool/type/entry
+   entryRole can signal backend processes/entry
+  /row
+ 
+  row
entrystructfieldrolconnlimit/structfield/entry
entrytypeint4/type/entry
entry
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index d57243a..096c770
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** SELECT set_config('log_statement_stats',
*** 16425,16438 
  literalfunctionpg_start_backup(parameterlabel/ typetext/ optional, parameterfast/ typeboolean/ /optional)/function/literal
  /entry
 entrytypepg_lsn/type/entry
!entryPrepare for performing on-line backup (restricted to superusers or replication roles)/entry
/row
row
 entry
  literalfunctionpg_stop_backup()/function/literal
  /entry
 entrytypepg_lsn/type/entry
!entryFinish performing on-line backup (restricted to superusers or replication roles)/entry
/row
row
 entry
--- 16425,16438 
  literalfunctionpg_start_backup(parameterlabel/ typetext/ optional, parameterfast/ typeboolean/ /optional)/function/literal
  /entry
 entrytypepg_lsn/type/entry
!entryPrepare for performing on-line exclusive backup (restricted to superusers or replication roles)/entry
/row
row
 entry
  literalfunctionpg_stop_backup()/function/literal
  /entry
 entrytypepg_lsn/type/entry
!entryFinish performing on-line exclusive backup (restricted to superusers or replication roles)/entry
/row
row
 entry
diff --git a/doc/src/sgml/ref/create_role.sgml b/doc/src/sgml/ref/create_role.sgml
new file mode 100644
index ea26027..0fc3b42
*** a/doc/src/sgml/ref/create_role.sgml
--- b/doc/src/sgml/ref/create_role.sgml
*** CREATE ROLE replaceable class=PARAMETE
*** 33,38 
--- 33,43 
  | LOGIN | NOLOGIN
  | REPLICATION | NOREPLICATION
  | BYPASSRLS | NOBYPASSRLS
+ | EXCLUSIVEBACKUP | NOEXCLUSIVEBACKUP
+ | XLOGREPLAY | NOXLOGREPLAY
+ | LOGFILE | NOLOGFILE
+ | MONITOR | NOMONITOR
+ | SIGNAL | NOSIGNAL
  | CONNECTION LIMIT replaceable class=PARAMETERconnlimit/replaceable
  | [ ENCRYPTED | UNENCRYPTED ] PASSWORD 'replaceable class=PARAMETERpassword/replaceable'
  | VALID UNTIL 'replaceable class=PARAMETERtimestamp/replaceable'
*** CREATE ROLE replaceable class=PARAMETE
*** 209,214 
--- 214,283 
 /para
/listitem
   /varlistentry
+ 
+  varlistentry
+   termliteralEXCLUSIVEBACKUP/literal/term
+   termliteralNOEXCLUSIVEBACKUP/literal/term
+   listitem
+para
+These clauses determine whether a role is allowed to start/stop an
+on-line exclusive backup.  An on-line exclusive backup is one that is
+initiated by the low-level base backup api instead of through the
+replication protocol, see xref linkend=backup-lowlevel-base-backup.
+If not specified, literalNOEXCLUSIVEBACKUP/literal is the default.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
+   termliteralXLOGREPLAY/literal/term
+   termliteralNOXLOGREPLAY/literal/term
+   listitem
+para
+These clauses determine whether a role is allowed to pause/resume xlog
+recovery.
+If not specified, literalNOXLOGREPLAY/literal is the default.
+/para
+   /listitem
+  /varlistentry
+ 
+  varlistentry
+   termliteralLOGFILE/literal/term
+   

Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 4:46 PM, Stephen Frost sfr...@snowman.net wrote:
 With 0 workers, first run took 883465.352 ms, and second run took 295050.106 
 ms.
 With 8 workers, first run took 340302.250 ms, and second run took 307767.758 
 ms.

 This is a confusing result, because you expect parallelism to help
 more when the relation is partly cached, and make little or no
 difference when it isn't cached.  But that's not what happened.

 These numbers seem to indicate that the oddball is the single-threaded
 uncached run.  If I followed correctly, the uncached 'dd' took 321s,
 which is relatively close to the uncached-lots-of-workers and the two
 cached runs.  What in the world is the uncached single-thread case doing
 that it takes an extra 543s, or over twice as long?  It's clearly not
 disk i/o which is causing the slowdown, based on your dd tests.

Yeah, I'm wondering if the disk just froze up on that run for a long
while, which has been known to occasionally happen on this machine,
because I can't reproduce that crappy number.  I did the 0-worker test
a few more times, with the block-by-block method, dropping the caches
and restarting PostgreSQL each time, and got:

32.968 ms
322873.325 ms
322967.722 ms
321759.273 ms

After that last run, I ran it a few more times without restarting
PostgreSQL or dropping the caches, and got:

257629.348 ms
289668.976 ms
290342.970 ms
258035.226 ms
284237.729 ms

Then I redid the 8-client test.  Cold cache, I got 337312.554 ms.  On
the rerun, 323423.813 ms.  Third run, 324940.785.

There is more variability than I would like here.  Clearly, it goes a
bit faster when the cache is warm, but that's about all I can say with
any confidence.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-27 Thread Andres Freund
On 2015-01-27 19:24:24 -0500, Robert Haas wrote:
 On Tue, Jan 27, 2015 at 2:16 PM, Andres Freund and...@2ndquadrant.com wrote:
  That'd end up essentially being a re-emulation of locks. Don't find that
  all that pretty. But maybe we have to go there.
 
 The advantage of it is that it is simple.  The only thing we're really
 giving up is the deadlock detector, which I think isn't needed in this
 case.

I think it's more than just the deadlock detector. Consider
pg_locks/pg_stat_activity.waiting, cancelling an acquisition, error
cleanup and recursive acquisitions. Acquiring session locks + a
surrounding transaction command deals with cleanups without
introducing PG_ENSURE blocks in a couple places. And we need recursive
acquisition so a streaming base backup can acquire the lock over the
whole runtime, while a manual pg_start_backup() does only for its own
time.

Especially the visibility in the system views is something I'd not like
to give up in additional blocks we introduce in the backbranches.

  Here's an alternative approach. I think it generally is superior and
  going in the right direction, but I'm not sure it's backpatchable.
 
 I tend think this is changing too many things to back-patch.  It might
 all be fine, but it's pretty wide-reaching, so the chances of
 collateral damage seem non-trivial.  Even in master, I'm not sure I
 see the point in rocking the apple cart to this degree.

It's big, true. But a fair amount of it stuff I think we have to do
anyway. The current code acquiring session locks in dbase_redo() is
borked - we either assert or segfault if there's a connection at the
wrong moment because there's no deadlock handler. Plus it has races that
aren't that hard to hit :(. To fix the crashes (but not the race) we can
have a separate ResolveRecoveryConflictWithObjectLock that doesn't
record the existence of the lock, but doesn't ever do a ProcSleep(). Not
pretty either :(

Seems like a situation with no nice solutions so far :(

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com wrote:
 Now, when you did what I understand to be the same test on the same
 machine, you got times ranging from 9.1 seconds to 35.4 seconds.
 Clearly, there is some difference between our test setups.  Moreover,
 I'm kind of suspicious about whether your results are actually
 physically possible.  Even in the best case where you somehow had the
 maximum possible amount of data - 64 GB on a 64 GB machine - cached,
 leaving no space for cache duplication between PG and the OS and no
 space for the operating system or postgres itself - the table is 120
 GB, so you've got to read *at least* 56 GB from disk.  Reading 56 GB
 from disk in 9 seconds represents an I/O rate of 6 GB/s. I grant that
 there could be some speedup from issuing I/O requests in parallel
 instead of serially, but that is a 15x speedup over dd, so I am a
 little suspicious that there is some problem with the test setup,
 especially because I cannot reproduce the results.

So I thought about this a little more, and I realized after some
poking around that hydra's disk subsystem is actually six disks
configured in a software RAID5[1].  So one advantage of the
chunk-by-chunk approach you are proposing is that you might be able to
get all of the disks chugging away at once, because the data is
presumably striped across all of them.  Reading one block at a time,
you'll never have more than 1 or 2 disks going, but if you do
sequential reads from a bunch of different places in the relation, you
might manage to get all 6.  So that's something to think about.

One could imagine an algorithm like this: as long as there are more
1GB segments remaining than there are workers, each worker tries to
chug through a separate 1GB segment.  When there are not enough 1GB
segments remaining for that to work, then they start ganging up on the
same segments.  That way, you get the benefit of spreading out the I/O
across multiple files (and thus hopefully multiple members of the RAID
group) when the data is coming from disk, but you can still keep
everyone busy until the end, which will be important when the data is
all in-memory and you're just limited by CPU bandwidth.
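
(A toy, self-contained sketch of that hand-out rule.  The names, the struct,
and the 1GB-at-8kB constant are illustrative assumptions only; a real version
would live in shared memory behind a lock rather than in a local struct:)

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

#define BLOCKS_PER_SEGMENT  131072          /* 1GB at the default 8kB block size */
#define NO_MORE_BLOCKS      ((BlockNumber) 0xFFFFFFFF)

typedef struct ScanSketch
{
    BlockNumber nblocks;        /* total blocks in the relation */
    int         nworkers;
    int         next_segment;   /* next whole 1GB segment to hand out */
    BlockNumber next_block;     /* shared cursor for the "ganging" phase */
} ScanSketch;

/*
 * Hand a worker its next unit of work: a whole segment while more segments
 * remain than there are workers, otherwise one block at a time from the rest.
 */
BlockNumber
next_chunk(ScanSketch *scan)
{
    int         nsegments = (scan->nblocks + BLOCKS_PER_SEGMENT - 1) / BLOCKS_PER_SEGMENT;

    if (nsegments - scan->next_segment > scan->nworkers)
    {
        BlockNumber first = (BlockNumber) scan->next_segment * BLOCKS_PER_SEGMENT;

        scan->next_segment++;
        /* the block-by-block phase starts after the segments handed out so far */
        if (scan->next_block < first + BLOCKS_PER_SEGMENT)
            scan->next_block = first + BLOCKS_PER_SEGMENT;
        return first;           /* caller scans this whole segment itself */
    }

    /* gang up: every worker pulls single blocks from whatever remains */
    if (scan->next_block < scan->nblocks)
        return scan->next_block++;

    return NO_MORE_BLOCKS;
}

int
main(void)
{
    ScanSketch  scan = {10 * BLOCKS_PER_SEGMENT, 3, 0, 0};
    BlockNumber b;
    int         shown = 0;

    /* show only the first few hand-outs; the tail is block-by-block */
    while ((b = next_chunk(&scan)) != NO_MORE_BLOCKS && shown++ < 12)
        printf("next unit starts at block %u\n", b);
    return 0;
}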

All that aside, I still can't account for the numbers you are seeing.
When I run with your patch and what I think is your test case, I get
different (slower) numbers.  And even if we've got 6 drives cranking
along at 400MB/s each, that's still only 2.4 GB/s, not 6 GB/s.  So
I'm still perplexed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Not my idea.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Jan 23, 2015 at 1:48 PM, Andres Freund and...@2ndquadrant.com wrote:
 I don't understand why that'd be better than simply fixing (yes, that's
 imo the correct term) pg_upgrade to retain relfilenodes across the
 upgrade. Afaics there's no conflict risk and it'd make the clusters much
 more similar, which would be good; independent of rsyncing standbys.

 +1.

That's certainly impossible for the system catalogs, which means you
have to be able to deal with relfilenode discrepancies for them, which
means that maintaining the same relfilenodes for user tables is of
dubious value.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] GetLockConflicts() and thus recovery conflicts seem pretty broken

2015-01-27 Thread Andres Freund
Hi,

While investigating other bugs I noticed that
ResolveRecoveryConflictWithLock() wasn't really working. Turns out
GetLockConflicts() violates its API contract which says:

 * The result array is palloc'd and is terminated with an invalid VXID.

Problem 1:
We don't actually put the terminator there. It happens to more or less
accidentally work on a master because the array is palloc0()ed there and
while a 0 is valid backend id it happens to not be a valid local
transaction id.  In HS we don't actually allocate the array every time,
but it's instead statically allocated. Without zeroing.

I have no idea why this doesn't crash
ResolveRecoveryConflictWithInterrupt() which does:
	while (!lock_acquired)
	{
		if (++num_attempts < 3)
			backends = GetLockConflicts(&locktag,
										AccessExclusiveLock);
		...
		ResolveRecoveryConflictWithVirtualXIDs(backends,
									 PROCSIG_RECOVERY_CONFLICT_LOCK);
and ResolveRecoveryConflictWithVirtualXIDs does:
	...
	while (VirtualTransactionIdIsValid(*waitlist))
	{
		/* kill kill kill */

		/* The virtual transaction is gone now, wait for the next one */
		waitlist++;
	}

I guess we just accidentally happen to come across appropriately
set memory at some point.

I'm really baffled that this hasn't caused significant problems so far.
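
(For reference, the missing terminator is easy to demonstrate in isolation:
whoever fills the array must append an invalid VXID so that loops like the
one above know where to stop.  A stand-alone illustration with simplified
stand-in types -- not the actual fix:)

#include <stdio.h>

/* simplified stand-ins for the real BackendId/LocalTransactionId types */
typedef struct VXidSketch
{
    int          backendId;
    unsigned int localTransactionId;
} VXidSketch;

#define INVALID_BACKEND_ID  (-1)
#define INVALID_LOCAL_XID   0
#define VXID_IS_VALID(v)    ((v).localTransactionId != INVALID_LOCAL_XID)

int
main(void)
{
    VXidSketch  conflicts[4];
    int         n = 0;

    /* pretend two conflicting backends were found */
    conflicts[n++] = (VXidSketch) {3, 1234};
    conflicts[n++] = (VXidSketch) {7, 5678};

    /* the step being skipped: terminate the array with an invalid VXID */
    conflicts[n].backendId = INVALID_BACKEND_ID;
    conflicts[n].localTransactionId = INVALID_LOCAL_XID;

    /* callers then walk the array up to the terminator */
    for (VXidSketch *v = conflicts; VXID_IS_VALID(*v); v++)
        printf("conflict: %d/%u\n", v->backendId, v->localTransactionId);
    return 0;
}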


Problem 2:
Since bcd8528f001 and 29eedd312274 the "the result array is palloc'd" part is
wrong because we're now doing:

	static VirtualTransactionId *vxids;

	/*
	 * Allocate memory to store results, and fill with InvalidVXID.  We only
	 * need enough space for MaxBackends + a terminator, since prepared xacts
	 * don't count. InHotStandby allocate once in TopMemoryContext.
	 */
	if (InHotStandby)
	{
		if (vxids == NULL)
			vxids = (VirtualTransactionId *)
				MemoryContextAlloc(TopMemoryContext,
								   sizeof(VirtualTransactionId) * (MaxBackends + 1));
	}
	else
		vxids = (VirtualTransactionId *)
			palloc0(sizeof(VirtualTransactionId) * (MaxBackends + 1));

Obviously that violates the API contract. I'm inclined to rip the HS
special case out and add a pfree() to the single HS caller. The commit
message introducing it says:

Use malloc() in GetLockConflicts() when called InHotStandby to avoid
repeated palloc calls. Current code assumed this was already true, so
this is a bug fix.
But I can't really believe this is relevant.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 9:36 AM, Stephen Frost sfr...@snowman.net wrote:
 The example listed works, but only when it's a local rsync:

 rsync --archive --hard-links --size-only old_dir new_dir remote_dir

 Perhaps a better example (or additional one) would be with a remote
 rsync, including clarification of old and new dir, like so:

 (run in /var/lib/postgresql)
 rsync --archive --hard-links --size-only \
   9.3/main \
   9.4/main \
   server:/var/lib/postgresql/

 Note that 9.3/main and 9.4/main are two source directories for rsync to
 copy over, while server:/var/lib/postgresql/ is a remote destination
 directory.  The above directories match a default Debian/Ubuntu install.

My point is that Bruce's patch suggests looking for remote_dir in
the rsync documentation, but no such term appears there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Robert Haas
On Tue, Jan 27, 2015 at 9:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Jan 23, 2015 at 1:48 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 I don't understand why that'd be better than simply fixing (yes, that's
 imo the correct term) pg_upgrade to retain relfilenodes across the
 upgrade. Afaics there's no conflict risk and it'd make the clusters much
 more similar, which would be good; independent of rsyncing standbys.

 +1.

 That's certainly impossible for the system catalogs, which means you
 have to be able to deal with relfilenode discrepancies for them, which
 means that maintaining the same relfilenodes for user tables is of
 dubious value.

Why is that impossible for the system catalogs?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Sat, Jan 24, 2015 at 10:04 PM, Bruce Momjian br...@momjian.us wrote:
  On Fri, Jan 23, 2015 at 02:34:36PM -0500, Stephen Frost wrote:
You'd have to replace the existing data directory on the master to do
that, which pg_upgrade was designed specifically to not do, in case
things went poorly.
  
   Why? Just rsync the new data directory onto the old directory on the
   standbys. That's fine and simple.
 
  That still doesn't address the need to use --size-only, it would just
  mean that you don't need to use -H.  If anything the -H part is the
  aspect which worries me the least about this approach.
 
  I can now confirm that it works, just as Stephen said.  I was able to
  upgrade a standby cluster that contained the regression database, and
  the pg_dump output was perfect.
 
  I am attaching doc instruction that I will add to all branches as soon
  as someone else confirms my results.  You will need to use rsync
  --itemize-changes to see the hard links being created, e.g.:
 
 hf+ pgsql/data/base/16415/28188 = 
  pgsql.old/data/base/16384/28188
 
 My rsync manual page (on two different systems) mentions nothing about
 remote_dir, so I'd be quite unable to follow your proposed directions.

The example listed works, but only when it's a local rsync:

rsync --archive --hard-links --size-only old_dir new_dir remote_dir

Perhaps a better example (or additional one) would be with a remote
rsync, including clarification of old and new dir, like so:

(run in /var/lib/postgresql)
rsync --archive --hard-links --size-only \
  9.3/main \
  9.4/main \
  server:/var/lib/postgresql/

Note that 9.3/main and 9.4/main are two source directories for rsync to
copy over, while server:/var/lib/postgresql/ is a remote destination
directory.  The above directories match a default Debian/Ubuntu install.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread David Fetter
On Tue, Jan 27, 2015 at 08:02:37AM +0100, Daniel Bausch wrote:
 Hi PG devs!
 
 Tom Lane t...@sss.pgh.pa.us writes:
 
  Wait for first IO, issue second IO request
  Compute
  Already have second IO request, issue third
  ...
 
  We'd be a lot less sensitive to IO latency.
 
  It would take about five minutes of coding to prove or disprove this:
  stick a PrefetchBuffer call into heapgetpage() to launch a request for the
  next page as soon as we've read the current one, and then see if that
  makes any obvious performance difference.  I'm not convinced that it will,
  but if it did then we could think about how to make it work for real.
 
 Sorry for dropping in so late...
 
 I have done all this two years ago.  For TPC-H Q8, Q9, Q17, Q20, and Q21
 I see a speedup of ~100% when using IndexScan prefetching + Nested-Loops
 Look-Ahead (the outer loop!).
 (On SSD with 32 Pages Prefetch/Look-Ahead + Cold Page Cache / Small RAM)

Would you be so kind as to pass along any patches (ideally applicable
to git master), tests, and specific measurements you made?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Jim Nasby

On 1/26/15 4:45 PM, Robert Haas wrote:

On Mon, Jan 26, 2015 at 5:21 PM, Stephen Frost sfr...@snowman.net wrote:

I don't disagree with you about any of that.  I don't think you disagree
with my comment either.  What I'm not entirely clear on is how consensus
could be reached.  Leaving it dormant for the better part of a year
certainly doesn't appear to have helped that situation.  We've discussed
having it be part of the main server and having it be a contrib module
and until about a week ago, I had understood that having it in contrib
would be preferrable.  Based on the recent emails, it appears there's
been a shift of preference to having it be in-core, but clearly there's
no time left to do that in this release cycle.


Well, I'm not sure that anyone else here agreed with me on that, and
one person does not a consensus make no matter who it is.  The basic
problem here is that we don't seem to have even two people here who
agree on how this ought to be done.  The basic dynamic here seems to
be you asking for changes and Abhijit making them but without any real
confidence, and I don't feel good about that.  I'm willing to defer to
an emerging consensus here when there is one, but what Abhijit likes
best is not a consensus, and neither is what you like, and neither is
what I like.  What we need is some people agreeing with each other.


BTW, I know that at least earlier versions of EnterpriseDB's version of 
Postgres (circa 2007) had an auditing feature. I never dealt with any customers 
who were using it when I was there, but perhaps some other folks could shed 
some light on what customers wanted to see an an auditing feature... (I'm 
looking at you, Jimbo!)
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-27 Thread Jim Nasby

On 1/27/15 5:06 PM, Robert Haas wrote:

On Tue, Jan 27, 2015 at 5:55 PM, Jim Nasby jim.na...@bluetreble.com wrote:

BTW, I know that at least earlier versions of EnterpriseDB's version of
Postgres (circa 2007) had an auditing feature. I never dealt with any
customers who were using it when I was there, but perhaps some other folks
could shed some light on what customers wanted to see an an auditing
feature... (I'm looking at you, Jimbo!)


It's still there, but it's nothing like pgaudit.  What I hear is that
our customers are looking for something that has the capabilities of
DBMS_FGA.  I haven't researched either that or pgaudit enough to know
how similar they are.


Do they really need the full capabilities of DBMS_FGA? At first glance, it 
looks even more sophisticated than anything that's been discussed so far. To 
wit (from 
http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_fga.htm#CDEIECAG):

DBMS_FGA.ADD_POLICY(
   object_schema  VARCHAR2,
   object_nameVARCHAR2,
   policy_nameVARCHAR2,
   audit_conditionVARCHAR2,
   audit_column   VARCHAR2,
   handler_schema VARCHAR2,
   handler_module VARCHAR2,
   enable BOOLEAN,
   statement_typesVARCHAR2,
   audit_trailBINARY_INTEGER IN DEFAULT,
   audit_column_opts  BINARY_INTEGER IN DEFAULT);
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shortcoming in CLOBBER_FREED_MEMORY coverage: disk buffer pointers

2015-01-27 Thread Jim Nasby

On 1/27/15 5:16 PM, Tom Lane wrote:

Jim Nasby jim.na...@bluetreble.com writes:

On 1/26/15 6:11 PM, Greg Stark wrote:

Fwiw I think our experience is that bugs where buffers are unpinned get exposed 
pretty quickly in production. I suppose the same might not be true for rarely 
called codepaths or in cases where the buffers are usually already pinned.



Yeah, that's what I was thinking. If there's some easy way to correctly 
associate pins with specific code paths (owners?) then maybe it's worth doing 
so; but I don't think it's worth much effort.


If you have a working set larger than shared_buffers, then yeah it's
likely that reference-after-unpin bugs would manifest pretty quickly.
This does not necessarily translate into something reproducible or
debuggable, however; and besides that you'd really rather that such
bugs not get into the field in the first place.

The point of my Valgrind proposal was to provide a mechanism that would
have a chance of catching such bugs in a *development* context, and
provide some hint of where in the codebase the fault is, too.


That's what I was looking for too; I was just wondering if there was an easy
way to also cover the case of one path forgetting to Unpin and a second path 
accidentally Unpinning (with neither dropping the refcount to 0). It sounds 
like it's just not worth worrying about that though.

Do you think there's merit to having bufmgr.c do something special when 
refcount hits 0 in a CLOBBER_FREED_MEMORY build? It seems like it's a lot 
easier to enable that than to setup valgrind (though I've never tried the 
latter).
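
(For what it's worth, the Valgrind idea can be tried outside the server with
memcheck's client requests.  A toy sketch, assuming the valgrind/memcheck.h
header is installed; the real thing would hook buffer pin/unpin in bufmgr.c:)

#include <stdlib.h>
#include <valgrind/memcheck.h>

#define BLCKSZ 8192

int
main(void)
{
    char   *page = malloc(BLCKSZ);

    page[0] = 'x';                          /* fine: the buffer is "pinned" */

    /* on last unpin, make any further access an error under valgrind */
    VALGRIND_MAKE_MEM_NOACCESS(page, BLCKSZ);

    /* reading page[0] here would be reported by memcheck as invalid */

    /* "re-pinning" would undo it */
    VALGRIND_MAKE_MEM_DEFINED(page, BLCKSZ);
    free(page);
    return 0;
}
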
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Providing catalog view to pg_hba.conf file - Patch submission

2015-01-27 Thread Jim Nasby

On 1/27/15 1:04 AM, Haribabu Kommi wrote:

On Mon, Jun 30, 2014 at 5:06 PM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:

I think having two columns would work. The columns could be called
database and database_list and user and user_list respectively.

The database column may contain one of all, sameuser, samegroup,
replication, but if it's empty, database_list will contain an array of
database names. Then (all, {}) and (, {all}) are easily separated.
Likewise for user and user_list.


Thanks for the review.

I corrected all the review comments except the one to add two columns
as (database, database_list and user, user_list). I feel this may cause
some confusion to the users.

Here I attached the latest version of the patch.
I will add this patch to the next commitfest.


Apologies if this was covered, but why isn't the IP address an inet instead of 
text?

Also, what happens if someone reloads the config in the middle of running the 
SRF? ISTM it'd be better to do something like process all of parsed_hba_lines 
into a tuplestore. Obviously there's still a race condition there, but at least 
it's a lot smaller, and AFAIK no worse than the pg_stats views.
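
For illustration, the materialize-into-a-tuplestore shape looks roughly like
this; the function name and two-column result are made up, and the walk over
parsed_hba_lines is reduced to a single placeholder row:

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "nodes/execnodes.h"
#include "utils/builtins.h"
#include "utils/tuplestore.h"

PG_MODULE_MAGIC;			/* only needed if built as a loadable module */

PG_FUNCTION_INFO_V1(pg_hba_settings_sketch);

Datum
pg_hba_settings_sketch(PG_FUNCTION_ARGS)
{
    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    TupleDesc   tupdesc;
    Tuplestorestate *tupstore;
    MemoryContext oldcontext;
    Datum       values[2];
    bool        nulls[2] = {false, false};

    /* (checks of rsinfo and rsinfo->allowedModes omitted for brevity) */
    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");

    /* Build the whole result up front, in per-query memory. */
    oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
    tupstore = tuplestore_begin_heap(true, false, work_mem);
    rsinfo->returnMode = SFRM_Materialize;
    rsinfo->setResult = tupstore;
    rsinfo->setDesc = tupdesc;
    MemoryContextSwitchTo(oldcontext);

    /*
     * The real function would loop over parsed_hba_lines here, copying each
     * field out; this sketch just emits one placeholder row.  Because every
     * row is copied into the tuplestore before the function returns, a
     * config reload during row retrieval cannot pull the data out from
     * under the caller.
     */
    values[0] = CStringGetTextDatum("host");
    values[1] = CStringGetTextDatum("all");
    tuplestore_putvalues(tupstore, tupdesc, values, nulls);

    return (Datum) 0;
}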
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Jim Nasby

On 1/26/15 11:11 PM, Amit Kapila wrote:

On Tue, Jan 27, 2015 at 3:18 AM, Jim Nasby jim.na...@bluetreble.com wrote:
 
  On 1/23/15 10:16 PM, Amit Kapila wrote:
 
  Further, if we want to just get the benefit of parallel I/O, then
  I think we can get that by parallelising partition scan where different
  table partitions reside on different disk partitions, however that is
  a matter of separate patch.
 
 
  I don't think we even have to go that far.
 
 
  We'd be a lot less sensitive to IO latency.
 
  I wonder what kind of gains we would see if every SeqScan in a query spawned 
a worker just to read tuples and shove them in a queue (or shove a pointer to a 
buffer in the queue).
 

Here, IIUC, you are saying that we should just get the read done by one
parallel worker and do all expression evaluation (qualification and
target list) in the main backend.  It seems to me that, done that way,
the benefit of parallelisation will be lost due to tuple communication
overhead (maybe the overhead is less if we just pass a pointer to the
buffer, but that has another kind of problem, namely holding buffer pins
for a longer period of time).

I can see the advantage of testing along the lines suggested by Tom Lane,
but that seems not directly related to what we want to achieve with this
patch (parallel seq scan); if you think otherwise, let me know.


There's some low-hanging fruit when it comes to improving our IO performance 
(or more specifically, decreasing our sensitivity to IO latency). Perhaps the 
way to do that is with the parallel infrastructure, perhaps not. But I think 
it's premature to look at parallelism for increasing IO performance, or 
worrying about things like how many IO threads we should have before we at 
least look at simpler things we could do. We shouldn't assume there's nothing 
to be gained short of a full parallelization implementation.

That's not to say there's nothing else we could use parallelism for. Sort, 
merge and hash operations come to mind.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Jim Nasby

On 1/27/15 3:46 PM, Stephen Frost wrote:

With 0 workers, first run took 883465.352 ms, and second run took 295050.106 ms.
With 8 workers, first run took 340302.250 ms, and second run took 307767.758 
ms.

This is a confusing result, because you expect parallelism to help
more when the relation is partly cached, and make little or no
difference when it isn't cached.  But that's not what happened.

These numbers seem to indicate that the oddball is the single-threaded
uncached run.  If I followed correctly, the uncached 'dd' took 321s,
which is relatively close to the uncached-lots-of-workers and the two
cached runs.  What in the world is the uncached single-thread case doing
that it takes an extra 543s, or over twice as long?  It's clearly not
disk i/o which is causing the slowdown, based on your dd tests.

One possibility might be round-trip latency.  The multi-threaded case is
able to keep the CPUs and the i/o system going, and the cached results
don't have as much latency since things are cached, but the
single-threaded uncached case going i/o - cpu - i/o - cpu, ends up
with a lot of wait time as it switches between being on CPU and waiting
on the i/o.


This exactly mirrors what I've seen on production systems. On a single SeqScan 
I can't get anywhere close to the IO performance I could get with dd. Once I 
got up to 4-8 SeqScans of different tables running together, I saw iostat 
numbers that were similar to what a single dd bs=8k would do. I've tested this 
with iSCSI SAN volumes on both 1Gbit and 10Gbit ethernet.

This is why I think that when it comes to IO performance, before we start 
worrying about real parallelization we should investigate ways to do some kind 
of async IO.

I only have my SSD laptop and a really old server to test on, but I'll try 
Tom's suggestion of adding a PrefetchBuffer call into heapgetpage() unless 
someone beats me to it. I should be able to do it tomorrow.
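
For reference, one shape such an experiment could take is a small helper called
from heapgetpage() right after the current block's buffer is read; this is an
untested sketch with an arbitrary prefetch distance of one block, not a real
patch:

#include "postgres.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical helper: ask the kernel to start reading the block after
 * 'page' while we process 'page' itself.
 */
static void
prefetch_next_heap_block(HeapScanDesc scan, BlockNumber page)
{
#ifdef USE_PREFETCH
    BlockNumber next;

    if (scan->rs_nblocks == 0)
        return;

    /* wrap around (for synchronized-scan starts), stop at end of scan */
    next = page + 1;
    if (next >= scan->rs_nblocks)
        next = 0;
    if (next != scan->rs_startblock)
        PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, next);
#endif
}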
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dereferenced pointers checked as NULL in btree_utils_var.c

2015-01-27 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 I think you are confusing NULL pointers with an SQL NULL. 
 gistgetadjusted checks that it's not an SQL NULL (!oldisnull[i]), but it 
 does not check if it's a NULL pointer 
 (DatumGetPointer(oldentries[i].key) != NULL). The case we're worried 
 about is that the value is not an SQL NULL, i.e. isnull flag is false, 
 but the Datum is a NULL pointer.

Actually both of these deserve to be worried about; though it's fairly
clear from looking at the core GIST code that it avoids calling
gistKeyIsEQ on SQL NULLs.

 Nevertheless, looking at the code, I don't that a NULL pointer can ever 
 be passed to the same-method, for any of the built-in or contrib 
 opclasses, but it's very confusing because some functions check for 
 that. My proof goes like this:

 1. The key value passed as argument must've been originally formed by 
 the compress, union, or picksplit methods.

 1.1. Some compress methods do nothing, and just return the passed-in 
 key, which comes from the table and cannot be a NULL pointer (the 
 compress method is never called for SQL NULLs). Other compress methods 
 don't check for a NULL pointer, and would crash if there was one. 
 gist_poly_compress() and gist_circle_compress() do check for a NULL, but 
 they only return a NULL key if the input key is NULL, which cannot 
 happen because the input comes from a table. So I believe the 
 NULL-checks in those functions are dead code, and none of the compress 
 methods ever return a NULL key.

 1.2. None of the union methods return a NULL pointer (nor expect one as 
 input).

 1.3. None of the picksplit methods return a NULL pointer. They all 
 return one of the original values, or one formed with the union method. 
 The picksplit method can return a NULL pointer in the spl_ldatum or 
 spl_rdatum field, in the degenerate case that it puts all keys on the 
 same halve. However, the caller, gistUserPickSplit checks for that and 
 immediately overwrites spl_ldatum and spl_rdatum with sane values in 
 that case.

Sounds good to me.

 I don't understand what this sentence in the documentation on the 
 compress method means:

 Depending on your needs, you could also need to care about
 compressing NULL values in there, storing for example (Datum) 0 like
 gist_circle_compress does.

I believe you're right that I added this because there were checks for
null pointers in some but not all of the opclass support functions.
It looked to me like some opclasses might be intending to pass around null
pointers as valid (not-SQL-NULL) values.  I think your analysis above
eliminates that idea though.  It's a sufficiently weird concept that
I don't feel a need to document or support it.

So I'm fine with taking out both this documentation text and the dead
null-pointer checks; but let's do that all in one patch not piecemeal.
In any case, this is just cosmetic cleanup; there's no actual hazard
here.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 12:23 PM, Tom Lane wrote:
 I think coding anything is premature until we decide how we're going to
 deal with the fundamental ambiguity.

 The input \\uabcd will be stored correctly as \uabcd, but this will in 
 turn be rendered as \uabcd, whereas it should be rendered as \\uabcd. 
 That's what the patch fixes.
 There are two problems here and this addresses one of them. The other 
 problem is the ambiguity regarding \\u and \u.

It's the same problem really, and until we have an answer about
what to do with \u, I think any patch is half-baked and possibly
counterproductive.

In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.

The most obvious way to store such data unambiguously is to just go ahead
and store U+0000 as a NUL byte (\000).  The only problem with that is that
then such a string cannot be considered to be a valid value of type TEXT,
which would mean that we'd need to throw an error if we were asked to
convert a JSON field containing such a character to text.  I don't
particularly have a problem with that, except possibly for the time cost
of checking for \000 before allowing a conversion to occur.  While a
memchr() check might be cheap enough, we could also consider inventing a
new JEntry type code for string-containing-null, so that there's a
distinction in the type system between strings that are coercible to text
and those that are not.
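
The memchr() guard itself would be trivial; a sketch with made-up naming:

#include "postgres.h"
#include <string.h>

/*
 * Hypothetical check applied when a jsonb string value is converted to text:
 * a string containing an embedded NUL (the proposed storage for a JSON
 * \u0000 escape) cannot become a valid text datum, so raise an error.
 */
static void
check_no_embedded_nul(const char *val, int len)
{
    if (memchr(val, '\0', len) != NULL)
        ereport(ERROR,
                (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
                 errmsg("JSON string with \\u0000 cannot be converted to text")));
}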

If we went down a path like that, the currently proposed patch would be
quite useless.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 01:40 PM, Tom Lane wrote:


In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.




Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on 
the table what I suggested is unnecessary.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Robert Haas
On Fri, Jan 23, 2015 at 6:42 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 Fixed-Chunks

 No. of workers/Time (ms)      0       2       4       8      16      24      32
 Run-1                    250536  266279  251263  234347   87930   50474   35474
 Run-2                    249587  230628  225648  193340   83036   35140    9100
 Run-3                    234963  220671  230002  256183  105382   62493   27903
 Run-4                    239111  245448  224057  189196  123780   63794   24746
 Run-5                    239937  222820  219025  220478  114007   77965   39766

I cannot reproduce these results.  I applied your fixed-chunk size
patch and ran SELECT parallel_count('tbl_perf', 32) a few times.  The
first thing I notice is that, as I predicted, there's an issue with
different workers finishing at different times.  For example, from my
first run:

2015-01-27 22:13:09 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34700) exited with exit code 0
2015-01-27 22:13:09 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34698) exited with exit code 0
2015-01-27 22:13:09 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34701) exited with exit code 0
2015-01-27 22:13:10 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34699) exited with exit code 0
2015-01-27 22:15:00 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34683) exited with exit code 0
2015-01-27 22:15:29 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34673) exited with exit code 0
2015-01-27 22:15:58 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34679) exited with exit code 0
2015-01-27 22:16:38 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34689) exited with exit code 0
2015-01-27 22:16:39 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34671) exited with exit code 0
2015-01-27 22:16:47 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34677) exited with exit code 0
2015-01-27 22:16:47 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34672) exited with exit code 0
2015-01-27 22:16:48 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34680) exited with exit code 0
2015-01-27 22:16:50 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34686) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34670) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34690) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34674) exited with exit code 0
2015-01-27 22:16:52 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34684) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34675) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34682) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34691) exited with exit code 0
2015-01-27 22:16:54 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34676) exited with exit code 0
2015-01-27 22:16:54 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34685) exited with exit code 0
2015-01-27 22:16:55 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34692) exited with exit code 0
2015-01-27 22:16:56 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34687) exited with exit code 0
2015-01-27 22:16:56 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34678) exited with exit code 0
2015-01-27 22:16:57 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34681) exited with exit code 0
2015-01-27 22:16:57 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34688) exited with exit code 0
2015-01-27 22:16:59 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34694) exited with exit code 0
2015-01-27 22:16:59 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34693) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34695) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34697) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG:  worker process: parallel worker
for PID 34668 (PID 34696) exited with exit code 0

That run started at 22:13:01.  Within 4 seconds, 4 workers exited.  So
clearly we are not getting the promised 32-way parallelism for the
whole test.  Granted, in this instance, *most* of the workers run
until the end, but I think we'll find that there are
uncomfortably-frequent cases where we get significantly less
parallelism than we planned on because the work isn't divided evenly.

But leaving that aside, I've run this test 6 times in a row now, with
a warm 

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-01-27 Thread Jim Nasby

On 1/27/15 3:56 AM, Abhijit Menon-Sen wrote:

At 2015-01-26 18:47:29 -0600, jim.na...@bluetreble.com wrote:


Anything happen with this?


Nothing so far.


It would be best to get this into a commit fest so it's not lost.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shortcoming in CLOBBER_FREED_MEMORY coverage: disk buffer pointers

2015-01-27 Thread Jim Nasby

On 1/26/15 6:11 PM, Greg Stark wrote:


On Tue, Jan 27, 2015 at 12:03 AM, Jim Nasby jim.na...@bluetreble.com wrote:

But one backend can effectively pin a buffer more than once, no? If so, 
then ISTM there's some risk that code path A pins and forgets to unpin, but path B 
accidentally unpins for A.


The danger is that there's a codepath that refers to memory it doesn't have a 
pin on but that there is no actual test in our regression suite that doesn't 
actually have a second pin on the same buffer. If there is in fact no reachable 
code path at all without the second pin then there's no active bug, only a bad 
coding practice. But if there are code paths that we just aren't testing then 
that's a real bug.

IIRC CLOBBER_FREED_MEMORY only affects palloc'd blocks. Do we not have a mode 
that automatically removes any buffer as soon as it's not pinned? That seems 
like it would be a valuable addition.


By setting BufferDesc.tag to 0?

On a related note... I'm confused by this part of UnpinBuffer. How is refcount 
ending up > 0??

	Assert(ref->refcount > 0);
	ref->refcount--;
	if (ref->refcount == 0)
	{
		/* I'd better not still hold any locks on the buffer */
		Assert(!LWLockHeldByMe(buf->content_lock));
		Assert(!LWLockHeldByMe(buf->io_in_progress_lock));

		LockBufHdr(buf);

		/* Decrement the shared reference count */
		Assert(buf->refcount > 0);
		buf->refcount--;


BTW, I certainly see no evidence of CLOBBER_FREED_MEMORY coming into play here.


Fwiw I think our experience is that bugs where buffers are unpinned get exposed 
pretty quickly in production. I suppose the same might not be true for rarely 
called codepaths or in cases where the buffers are usually already pinned.


Yeah, that's what I was thinking. If there's some easy way to correctly 
associate pins with specific code paths (owners?) then maybe it's worth doing 
so; but I don't think it's worth much effort.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Jim Nasby

On 1/27/15 9:29 AM, Stephen Frost wrote:

My point is that Bruce's patch suggests looking for remote_dir in
the rsync documentation, but no such term appears there.

Ah, well, perhaps we could simply add a bit of clarification to this:

for details on specifying <option>remote_dir</>


The whole remote_dir discussion made me think of something... would --link-dest 
be any help here?

   --link-dest=DIR
          This option behaves like --copy-dest, but unchanged files are
          hard linked from DIR to the destination directory.  The files
          must be identical in all preserved attributes (e.g. permissions,
          possibly ownership) in order for the files to be linked
          together.  An example:

            rsync -av --link-dest=$PWD/prior_dir host:src_dir/ new_dir/

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Merlin Moncure
On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 12:23 PM, Tom Lane wrote:
 I think coding anything is premature until we decide how we're going to
 deal with the fundamental ambiguity.

 The input \\uabcd will be stored correctly as \uabcd, but this will in
 turn be rendered as \uabcd, whereas it should be rendered as \\uabcd.
 That's what the patch fixes.
 There are two problems here and this addresses one of them. The other
 problem is the ambiguity regarding \\u and \u.

 It's the same problem really, and until we have an answer about
 what to do with \u, I think any patch is half-baked and possibly
 counterproductive.

 In particular, I would like to suggest that the current representation of
 \u0000 is fundamentally broken and that we have to change it, not try to
 band-aid around it.  This will mean an on-disk incompatibility for jsonb
 data containing U+0000, but hopefully there is very little of that out
 there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
 consider such solutions.

 The most obvious way to store such data unambiguously is to just go ahead
 and store U+0000 as a NUL byte (\000).  The only problem with that is that
 then such a string cannot be considered to be a valid value of type TEXT,
 which would mean that we'd need to throw an error if we were asked to
 convert a JSON field containing such a character to text.

Hm, does this include text out operations for display purposes?   If
so, then any query selecting jsonb objects with null bytes would fail.
How come we have to error out?  How about a warning indicating the
string was truncated?

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
2015-01-26 21:44 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/25/15 4:23 AM, Pavel Stehule wrote:


 I tested a concept iteration over array in format [key1, value1, key2,
 value2, .. ] - what is nice, it works for [[key1,value1],[key2, value2],
 ...] too

 It is only a few lines more to current code, and this change doesn't
 break a compatibility.

 Do you think, so this patch is acceptable?

 Ideas, comments?


 Aside from fixing the comments... I think this needs more tests on corner
 cases. For example, what happens when you do

 foreach a, b, c in array(array(1,2),array(3,4)) ?


The behavior is relatively simple -- missing values are NULL.

array(1,2),array(3,4) -- do you mean ARRAY[[1,2],[3,4]]? That is effectively
processed as ARRAY[1,2,3,4].



 Or the opposite case of

 foreach a,b in array(array(1,2,3))

 Also, what about:

 foreach a,b in '{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[] ?



 postgres=# select array(select
unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
   array
---
 {1,2,3,4,5,6,7,8}
(1 row)

so it generates the pairs {1,2},{3,4},{5,6},{7,8}

Regards

Pavel Stehule


 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] proposal: searching in array function - array_position

2015-01-27 Thread Pavel Stehule
2015-01-26 23:29 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/26/15 4:17 PM, Pavel Stehule wrote:

 Any way to reduce the code duplication between the array and
 non-array versions? Maybe factor out the operator caching code?


 I though about it - but there is different checks, different result
 processing, different result type.

 I didn't find any readable and reduced form :(


 Yeah, that's why I was thinking specifically of the operator caching
 code... isn't that identical? That would at least remove a dozen lines...


It is only partially identical - I would use a cache for array_offset, but
it is not necessary for array_offsets .. it depends on how we would modify
the current API to support externally cached data.

Regards

Pavel




 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-27 Thread Robert Haas
On Mon, Jan 26, 2015 at 4:03 PM, Andres Freund and...@2ndquadrant.com wrote:
 I basically have two ideas to fix this.

 1) Make do_pg_start_backup() acquire a SHARE lock on
pg_database. That'll prevent it from starting while a movedb() is
still in progress. Then additionally add pg_backup_in_progress()
function to xlog.c that checks (XLogCtl-Insert.exclusiveBackup ||
XLogCtl-Insert.nonExclusiveBackups != 0). Use that in createdb() and
movedb() to error out if a backup is in progress.

 Attached is a patch trying to this. Doesn't look too bad and lead me to
 discover missing recovery conflicts during a AD ST.

 But: It doesn't actually work on standbys, because lock.c prevents any
 stronger lock than RowExclusive from being acquired. And we need need a
 lock that can conflict with WAL replay of DBASE_CREATE, to handle base
 backups that are executed on the primary. Those obviously can't detect
 whether any standby is currently doing a base backup...

 I currently don't have a good idea how to mangle lock.c to allow
 this. I've played with doing it like in the second patch, but that
 doesn't actually work because of some asserts around ProcSleep - leading
 to locks on database objects not working in the startup process (despite
 already being used).

 The easiest thing would be to just use a lwlock instead of a heavyweight
 lock - but those aren't canceleable...

How about just wrapping an lwlock around a flag variable?  movedb()
increments the variable when starting and decrements it when done
(must use PG_ENSURE_ERROR_CLEANUP).  Starting a backup errors out (or
waits in 1-second increments) if it's non-zero.
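
Roughly like this, with made-up names, the shared-memory initialization elided,
and error handling reduced to the bare minimum:

#include "postgres.h"
#include "miscadmin.h"
#include "storage/lwlock.h"

/* Hypothetical shared state guarded by an LWLock; not a real patch. */
typedef struct MoveDbState
{
    LWLock     *lock;           /* assigned at shmem init, elided here */
    int         movedb_count;   /* movedb()/createdb() calls in flight */
} MoveDbState;

static MoveDbState *MoveDb;     /* points into shared memory */

/* movedb() would bracket its work with these two calls. */
static void
movedb_begin(void)
{
    LWLockAcquire(MoveDb->lock, LW_EXCLUSIVE);
    MoveDb->movedb_count++;
    LWLockRelease(MoveDb->lock);
}

/* shaped as a PG_ENSURE_ERROR_CLEANUP callback, so aborts decrement too */
static void
movedb_end(int code, Datum arg)
{
    LWLockAcquire(MoveDb->lock, LW_EXCLUSIVE);
    Assert(MoveDb->movedb_count > 0);
    MoveDb->movedb_count--;
    LWLockRelease(MoveDb->lock);
}

/* do_pg_start_backup() would poll the counter before proceeding. */
static void
wait_for_no_movedb(void)
{
    for (;;)
    {
        bool        busy;

        LWLockAcquire(MoveDb->lock, LW_SHARED);
        busy = (MoveDb->movedb_count != 0);
        LWLockRelease(MoveDb->lock);
        if (!busy)
            break;
        CHECK_FOR_INTERRUPTS();
        pg_usleep(1000000L);    /* retry roughly once per second */
    }
}

A real version would presumably also register the running backup under the same
lock, so that movedb()/createdb() can refuse to start while a backup is in
progress rather than only the other way around.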

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Bruce Momjian
On Mon, Jan 26, 2015 at 05:41:59PM -0500, Stephen Frost wrote:
 I've thought about it a fair bit actually and I agree that there is some
 risk to using rsync for *incremental* base backups.  That is, you have
 a setup where you loop with:
 
 pg_start_backup
 rsync - dest
 pg_stop_backup
 
 without using -I, changing what 'dest' is, or making sure it's empty
 every time.  The problem is the 1s-level granularity used on the
 timestamp.  A possible set of operations, all within 1s, is:
 
 file changed
 rsync starts copying the file
 file changed again (somewhere prior to where rsync is at)
 rsync finishes the file copy
 
 Now, this isn't actually a problem for the first time that file is
 backed up- the issue is if that file isn't changed again.  rsync won't
 re-copy it, but that change that rsync missed won't be in the WAL
 history for the *second* backup that's done (only the first), leading to
 a case where that file would end up corrupted.

Interesting problem, but doesn't rsync use sub-second accuracy?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread David Steele
On 1/27/15 9:51 PM, Bruce Momjian wrote:
 According to my empirical testing on Linux and OSX the answer is no:
 rsync does not use sub-second accuracy.  This seems to be true even on
 file systems like ext4 that support millisecond mod times, at least it
 was true on Ubuntu 12.04 running ext4.

 Even on my laptop there is a full half-second of vulnerability for
 rsync.  Faster systems may have a larger window.
 OK, bummer.  Well, I don't think we ever recommend to run rsync without
 checksums, but the big problem is that rsync doesn't do checksums by
 default.  :-(

 pg_upgrade recommends using two rsyncs:

   To make a valid copy of the old cluster, use <command>rsync</> to create
a dirty copy of the old cluster while the server is running, then shut
down the old server and run <command>rsync</> again to update the copy
with any changes to make it consistent.  You might want to exclude some

 I am afraid that will not work as it could miss changes, right?  When
 would the default mod-time checking ever be safe?

According to my testing the default mod-time checking is never
completely safe in rsync.  I've worked around this in PgBackRest by
building the manifest and then waiting until the start of the next
second before starting to copy.  It was the only way I could make the
incremental backups reliable without requiring checksums (which are
optional as in rsync for performance).  Of course, you still have to
trust the clock for this to work.
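
For illustration, the wait-until-the-next-second trick is tiny (a standalone
sketch, not PgBackRest code):

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Sleep until the wall clock has moved into the next whole second.  If the
 * backup manifest (with recorded mtimes) is built before this call, any file
 * modified during or after the copy phase gets an mtime in a later whole
 * second, so a subsequent incremental pass cannot mistake it for unchanged.
 * Real code would loop to guard against early wakeups and clock adjustments.
 */
static void
wait_for_next_second(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    if (tv.tv_usec > 0)
        usleep(1000000 - tv.tv_usec);
}

int
main(void)
{
    wait_for_next_second();
    printf("copy phase can start now\n");
    return 0;
}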

This is definitely an edge case.  Not only does the file have to be
modified in the same second *after* rsync has done the copy, but the
file also has to not be modified in *any other subsequent second* before
the next incremental backup.  If the file is busy enough to have a
collision with rsync in that second, then it is very likely to be
modified before the next incremental backup which is generally a day or
so later.  And, of course, the backup where the issue occurs is fine -
it's the next backup that is invalid.

However, the hot/cold backup scheme as documented does make the race
condition more likely since the two backups are done in close proximity
temporally.  Ultimately, the most reliable method is to use checksums.

For me the biggest issue is that there is no way to discover if a db is
consistent, no matter how much time/resources you are willing to spend. 
I could live with the idea of the occasional bad backup (since I keep as
many as possible), but having no way to know whether it is good or not
is very frustrating.  I know data checksums are a step in that
direction, but they are a long way from providing the optimal solution. 
I've implemented rigorous checksums in PgBackRest but something closer
to the source would be even better.

-- 
- David Steele
da...@pgmasters.net




signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
2015-01-28 6:49 GMT+01:00 Pavel Stehule pavel.steh...@gmail.com:


 Dne 28.1.2015 0:25 Jim Nasby jim.na...@bluetreble.com napsal(a):
 
  On 1/27/15 12:58 PM, Pavel Stehule wrote:
 
postgres=# select array(select
 unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
  array
  ---
{1,2,3,4,5,6,7,8}
  (1 row)
 
  so it generates the pairs {1,2},{3,4},{5,6},{7,8}
 
 
  I fixed situation when array has not enough elements.
 
  More tests, simple doc
 
 
  Hrm, this wasn't what I was expecting:
 
  + select foreach_test_ab(array[1,2,3,4]);
  + NOTICE:  a: 1, b: 2
  + NOTICE:  a: 3, b: 4
 
  I was expecting that foreach a,b array would be expecting something in
 the array to have a dimension of 2. :(

 Your expectation is inconsistent with the current implementation of FOREACH.
 Without SLICE it doesn't produce an array, and it doesn't take dimensions
 into account.

 I would not change this rule. It is not ambiguous, and it allows working
 with 1D, 2D, and 3D arrays. You can process Andrew's format well and my
 proposed format (a 2D array) well too.

one small example

CREATE OR REPLACE FUNCTION iterate_over_pairs(text[])
RETURNS void AS $$
DECLARE v1 text; v2 text; e text; i int := 0;
BEGIN
  FOREACH e IN ARRAY $1 LOOP
IF i % 2 = 0 THEN v1 := e;
ELSE v2 := e; RAISE NOTICE 'v1: %, v2: %', v1, v2; END IF;
i := i + 1;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

postgres=# SELECT iterate_over_pairs(ARRAY[1,2,3,4]::text[]);
NOTICE:  v1: 1, v2: 2
NOTICE:  v1: 3, v2: 4
 iterate_over_pairs


(1 row)

postgres=# SELECT iterate_over_pairs(ARRAY[[1,2],[3,4]]::text[]);
NOTICE:  v1: 1, v2: 2
NOTICE:  v1: 3, v2: 4
 iterate_over_pairs


(1 row)

I can use iterate_over_pairs for 1D or 2D arrays equally well -- FOREACH was
designed in this direction: without SLICE, the dimension data are
unimportant.

The discussed enhancement of FOREACH is a faster and shorter (more readable)
form of the iterate_over_pairs use case.

FOREACH v1, v2 IN ARRAY $1 LOOP
  ..
END LOOP;

It is consistent with the current design.

You can look at the patch - at this moment SLICE > 0 is disallowed for the
situation when the target variable is a ROW and the source is not a ROW.

Regards

Pavel

The behavior can be different when SLICE is used; there we can iterate
 exactly by dimensions. Should we design the behavior for that case?

 
  I think this is bad, because this:
 
  foreach_test_ab('{{1,2,3},{4,5,6}}'::int[]);
 
  will give you 1,2; 3,4; 5,6. I don't see any way that that makes sense.
 Even if it did make sense, I'm more concerned that adding this will
 seriously paint us into a corner when it comes to the (to me) more rational
 case of returning {1,2,3},{4,5,6}.
 
  I think we need to think some more about this, at least to make sure
 we're not painting ourselves into a corner for more appropriate array
 iteration.
 
  --
  Jim Nasby, Data Architect, Blue Treble Consulting
  Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Amit Kapila
On Wed, Jan 28, 2015 at 12:38 PM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:

 On 01/28/2015 04:16 AM, Robert Haas wrote:

 On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com
wrote:

 Now, when you did what I understand to be the same test on the same
 machine, you got times ranging from 9.1 seconds to 35.4 seconds.
 Clearly, there is some difference between our test setups.  Moreover,
 I'm kind of suspicious about whether your results are actually
 physically possible.  Even in the best case where you somehow had the
 maximum possible amount of data - 64 GB on a 64 GB machine - cached,
 leaving no space for cache duplication between PG and the OS and no
 space for the operating system or postgres itself - the table is 120
 GB, so you've got to read *at least* 56 GB from disk.  Reading 56 GB
 from disk in 9 seconds represents an I/O rate of 6 GB/s. I grant that
 there could be some speedup from issuing I/O requests in parallel
 instead of serially, but that is a 15x speedup over dd, so I am a
 little suspicious that there is some problem with the test setup,
 especially because I cannot reproduce the results.


 So I thought about this a little more, and I realized after some
 poking around that hydra's disk subsystem is actually six disks
 configured in a software RAID5[1].  So one advantage of the
 chunk-by-chunk approach you are proposing is that you might be able to
 get all of the disks chugging away at once, because the data is
 presumably striped across all of them.  Reading one block at a time,
 you'll never have more than 1 or 2 disks going, but if you do
 sequential reads from a bunch of different places in the relation, you
 might manage to get all 6.  So that's something to think about.

 One could imagine an algorithm like this: as long as there are more
 1GB segments remaining than there are workers, each worker tries to
 chug through a separate 1GB segment.  When there are not enough 1GB
 segments remaining for that to work, then they start ganging up on the
 same segments.  That way, you get the benefit of spreading out the I/O
 across multiple files (and thus hopefully multiple members of the RAID
 group) when the data is coming from disk, but you can still keep
 everyone busy until the end, which will be important when the data is
 all in-memory and you're just limited by CPU bandwidth.


 OTOH, spreading the I/O across multiple files is not a good thing, if you
don't have a RAID setup like that. With a single spindle, you'll just
induce more seeks.


Yeah, if such a thing happens then there is less chance that the user
will get any major benefit from a parallel sequential scan, unless the
qualification expressions or other expressions used in the statement are
costly.  So one way could be for the user to configure the parallel
sequential scan parameters such that a parallel scan is performed only
when it can be beneficial (something like increasing
parallel_tuple_comm_cost, or we could have some other parameter), or
simply not to use parallel sequential scan at all
(parallel_seqscan_degree=0).
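
For what it's worth, the per-segment bookkeeping Robert describes above could
look something like this standalone toy sketch (purely illustrative, not patch
code): workers prefer an untouched ~1GB segment, and only gang up once none are
left.

#include <stdio.h>

#define NSEGMENTS 8

typedef struct
{
    int     blocks_left;    /* unscanned blocks in this ~1GB segment */
    int     nworkers;       /* workers currently scanning it */
} SegState;

/*
 * Pick a segment for a worker that has just become idle: take an untouched
 * segment if one exists (keeps the I/O spread across files), otherwise gang
 * up on whichever segment has the most remaining work per attached worker.
 */
static int
pick_segment(SegState *seg, int nseg)
{
    int     best = -1;
    int     i;

    for (i = 0; i < nseg; i++)
        if (seg[i].blocks_left > 0 && seg[i].nworkers == 0)
            return i;

    for (i = 0; i < nseg; i++)
    {
        if (seg[i].blocks_left == 0)
            continue;
        if (best < 0 ||
            seg[i].blocks_left * (seg[best].nworkers + 1) >
            seg[best].blocks_left * (seg[i].nworkers + 1))
            best = i;
    }
    return best;                /* -1 means the scan is finished */
}

int
main(void)
{
    SegState    seg[NSEGMENTS];
    int         i;

    for (i = 0; i < NSEGMENTS; i++)
    {
        seg[i].blocks_left = 131072;    /* 1GB worth of 8kB blocks */
        seg[i].nworkers = 0;
    }
    i = pick_segment(seg, NSEGMENTS);
    printf("first idle worker attaches to segment %d\n", i);
    return 0;
}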


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] proposal: searching in array function - array_position

2015-01-27 Thread Pavel Stehule
2015-01-28 0:01 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/27/15 4:36 AM, Pavel Stehule wrote:

 2015-01-26 23:29 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/26/15 4:17 PM, Pavel Stehule wrote:

  Any way to reduce the code duplication between the array and
 non-array versions? Maybe factor out the operator caching code?


 I though about it - but there is different checks, different
 result processing, different result type.

 I didn't find any readable and reduced form :(


 Yeah, that's why I was thinking specifically of the operator caching
 code... isn't that identical? That would at least remove a dozen lines...


 It is only partially identical - I would to use cache for array_offset,
 but it is not necessary for array_offsets .. depends how we would to modify
 current API to support externally cached data.


 Externally cached data?


Some of these functions have their own caches to minimize access to the
typcache (array_map and array_cmp, for example). In the first case I am
trying to pass this information via fn_extra; in the second case I don't,
because I don't expect repeated calls (and I expect the type cache to be
enough).

I plan to do some benchmarking to determine whether we have to do this, or
whether we can use the type cache only.
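
For illustration, the kind of fn_extra caching under discussion looks roughly
like this (a sketch with abbreviated error handling, not the actual
array_offset code):

#include "postgres.h"
#include "fmgr.h"
#include "utils/typcache.h"

/*
 * Look up, once per query, the equality-operator FmgrInfo for the array's
 * element type, keeping it in fn_extra so repeated calls skip the typcache
 * lookup.
 */
static FmgrInfo *
get_element_eq_proc(FunctionCallInfo fcinfo, Oid element_type)
{
    FmgrInfo   *eqproc = (FmgrInfo *) fcinfo->flinfo->fn_extra;

    if (eqproc == NULL)
    {
        TypeCacheEntry *typentry;

        typentry = lookup_type_cache(element_type, TYPECACHE_EQ_OPR_FINFO);
        if (!OidIsValid(typentry->eq_opr_finfo.fn_oid))
            ereport(ERROR,
                    (errcode(ERRCODE_UNDEFINED_FUNCTION),
                     errmsg("could not identify an equality operator for the element type")));

        eqproc = (FmgrInfo *) MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
                                                 sizeof(FmgrInfo));
        fmgr_info_copy(eqproc, &typentry->eq_opr_finfo,
                       fcinfo->flinfo->fn_mcxt);
        fcinfo->flinfo->fn_extra = eqproc;
    }
    return eqproc;
}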

Pavel



 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] TABLESAMPLE patch

2015-01-27 Thread Kyotaro HORIGUCHI
Hi, I took a look at this and found it nice.

By the way, the parameter for REPEATABLE seems to be allowed to be an
expression in ParseTableSample, but the grammar rejects it.

The following change seems to be enough.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 4578b5e..8cf09d5 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10590,7 +10590,7 @@ tablesample_clause:
;
 
 opt_repeatable_clause:
-   REPEATABLE '(' AexprConst ')'   { $$ = (Node *) $3; }
+   REPEATABLE '(' a_expr ')'   { $$ = (Node *) $3; }
| /*EMPTY*/ 
{ $$ = NULL; }
;


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Providing catalog view to pg_hba.conf file - Patch submission

2015-01-27 Thread Haribabu Kommi
On Wed, Jan 28, 2015 at 9:47 AM, Jim Nasby jim.na...@bluetreble.com wrote:
 On 1/27/15 1:04 AM, Haribabu Kommi wrote:

 Here I attached the latest version of the patch.
 I will add this patch to the next commitfest.


 Apologies if this was covered, but why isn't the IP address an inet instead
 of text?

Corrected the address type to inet instead of text. The updated patch is attached.

 Also, what happens if someone reloads the config in the middle of running
 the SRF?

hba entries are reloaded only in postmaster process, not in every backend.
So there shouldn't be any problem with a config file reload. Am I
missing something?

Regards,
Hari Babu
Fujitsu Australia


Catalog_view_to_HBA_settings_patch_V4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby WAL reply uses heavyweight session locks, but doesn't have enough infrastructure set up

2015-01-27 Thread Michael Paquier
Andres Freund wrote:
 I think this isn't particularly pretty, but it seems to be working well
 enough, and changing it would be pretty invasive. So keeping in line
 with all that code seems to be easier.
OK, I'm convinced by this part: remove the call of
LockSharedObjectForSession that uses dontWait and replace it with a loop
in ResolveRecoveryConflictWithDatabase.

 Note that InRecovery doesn't mean what you probably think it means:
 [stuff]
 bool		InRecovery = false;
Yes, right. I misunderstood with RecoveryInProgress().

 The assertion actually should be even stricter:
  Assert(!InRecovery || (sessionLock && dontWait));
 i.e. we never acquire a heavyweight lock in the startup process unless
 it's a session lock (as we don't have resource managers/a xact to track
 locks) and we don't wait (as we don't have the deadlock detector
 infrastructure set up).
No problems with this assertion here.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-27 Thread Heikki Linnakangas

On 01/28/2015 04:16 AM, Robert Haas wrote:

On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas robertmh...@gmail.com wrote:

Now, when you did what I understand to be the same test on the same
machine, you got times ranging from 9.1 seconds to 35.4 seconds.
Clearly, there is some difference between our test setups.  Moreover,
I'm kind of suspicious about whether your results are actually
physically possible.  Even in the best case where you somehow had the
maximum possible amount of data - 64 GB on a 64 GB machine - cached,
leaving no space for cache duplication between PG and the OS and no
space for the operating system or postgres itself - the table is 120
GB, so you've got to read *at least* 56 GB from disk.  Reading 56 GB
from disk in 9 seconds represents an I/O rate of 6 GB/s. I grant that
there could be some speedup from issuing I/O requests in parallel
instead of serially, but that is a 15x speedup over dd, so I am a
little suspicious that there is some problem with the test setup,
especially because I cannot reproduce the results.


So I thought about this a little more, and I realized after some
poking around that hydra's disk subsystem is actually six disks
configured in a software RAID5[1].  So one advantage of the
chunk-by-chunk approach you are proposing is that you might be able to
get all of the disks chugging away at once, because the data is
presumably striped across all of them.  Reading one block at a time,
you'll never have more than 1 or 2 disks going, but if you do
sequential reads from a bunch of different places in the relation, you
might manage to get all 6.  So that's something to think about.

One could imagine an algorithm like this: as long as there are more
1GB segments remaining than there are workers, each worker tries to
chug through a separate 1GB segment.  When there are not enough 1GB
segments remaining for that to work, then they start ganging up on the
same segments.  That way, you get the benefit of spreading out the I/O
across multiple files (and thus hopefully multiple members of the RAID
group) when the data is coming from disk, but you can still keep
everyone busy until the end, which will be important when the data is
all in-memory and you're just limited by CPU bandwidth.


OTOH, spreading the I/O across multiple files is not a good thing, if 
you don't have a RAID setup like that. With a single spindle, you'll 
just induce more seeks.


Perhaps the OS is smart enough to read in large-enough chunks that the 
occasional seek doesn't hurt much. But then again, why isn't the OS 
smart enough to read in large-enough chunks to take advantage of the 
RAID even when you read just a single file?


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - Assert statement

2015-01-27 Thread Pavel Stehule
2015-01-28 0:13 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/27/15 1:30 PM, Pavel Stehule wrote:

 I don't see the separate warning as being helpful. I'd just do
 something like

  +(err_hint != NULL) ? errhint("%s",
  err_hint) : errhint("Message attached to failed assertion is null") ));


 done


 There should also be a test case for a NULL message.


 is there, if I understand well


 I see it now. Looks good.


Please, can you enhance the documentation part - I am sorry, my English
skills are not good enough.

Thank you

Pavel


 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
2015-01-28 0:16 GMT+01:00 Jim Nasby jim.na...@bluetreble.com:

 On 1/27/15 2:26 PM, Pavel Stehule wrote:

 here is an initial version of the row_to_array function - it transforms any
 row to an array in the format proposed by Andrew.


 Please start a new thread for this... does it depend on the key-value
 patch?


Partially - the selected format should be well supported by the FOREACH statement.

Regards

Pavel



 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: [HACKERS] proposal: row_to_array function

2015-01-27 Thread Pavel Stehule
Dne 28.1.2015 0:25 Jim Nasby jim.na...@bluetreble.com napsal(a):

 On 1/27/15 12:58 PM, Pavel Stehule wrote:

   postgres=# select array(select
unnest('{{{1,2},{3,4}},{{5,6},{7,8}}}'::int[]));
 array
 ---
   {1,2,3,4,5,6,7,8}
 (1 row)

 so it generates the pairs {1,2},{3,4},{5,6},{7,8}


 I fixed situation when array has not enough elements.

 More tests, simple doc


 Hrm, this wasn't what I was expecting:

 + select foreach_test_ab(array[1,2,3,4]);
 + NOTICE:  a: 1, b: 2
 + NOTICE:  a: 3, b: 4

 I was expecting that foreach a,b array would be expecting something in
the array to have a dimension of 2. :(

Your expectation is inconsistent with the current implementation of FOREACH.
Without SLICE it doesn't produce an array, and it doesn't take dimensions
into account.

I would not change this rule. It is not ambiguous, and it allows working
with 1D, 2D, and 3D arrays. You can process Andrew's format well and my
proposed format (a 2D array) well too.

The behavior can be different when SLICE is used; there we can iterate
exactly by dimensions. Should we design the behavior for that case?


 I think this is bad, because this:

 foreach_test_ab('{{1,2,3},{4,5,6}}'::int[]);

 will give you 1,2; 3,4; 5,6. I don't see any way that that makes sense.
Even if it did make sense, I'm more concerned that adding this will
seriously paint us into a corner when it comes to the (to me) more rational
case of returning {1,2,3},{4,5,6}.

 I think we need to think some more about this, at least to make sure
we're not painting ourselves into a corner for more appropriate array
iteration.

 --
 Jim Nasby, Data Architect, Blue Treble Consulting
 Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Noah Misch
On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
 Andrew Dunstan and...@dunslane.net writes:
  On 01/27/2015 02:28 PM, Tom Lane wrote:
  Well, we can either fix it now or suffer with a broken representation
  forever.  I'm not wedded to the exact solution I described, but I think
  we'll regret it if we don't change the representation.

 So at this point I propose that we reject \u0000 when de-escaping JSON.

I would have agreed on 2014-12-09, and this release is the last chance to make
such a change.  It is a bold wager that could pay off, but -1 from me anyway.
I can already envision the blog post from the DBA staying on 9.4.0 because
9.4.1 pulled his ability to store U+0000 in jsonb.  jsonb was *the* top-billed
9.4 feature, and this thread started with Andrew conveying a field report of a
scenario more obscure than storing U+0000.  Therefore, we have to assume many
users will notice the change.  This move would also add to the growing
evidence that our .0 releases are really beta(N+1) releases in disguise.

 Anybody who's seriously unhappy with that can propose a patch to fix it
 properly in 9.5 or later.

Someone can still do that by introducing a V2 of the jsonb binary format and
preserving the ability to read both formats.  (Too bad Andres's proposal to
include a format version didn't inform the final format, but we can wing it.)
I agree that storing U+0000 as 0x00 is the best end state.

 We probably need to rethink the re-escaping behavior as well; I'm not
 sure if your latest patch is the right answer for that.

Yes, we do.  No change to the representation of U+0000 is going to fix the
following bug, but that patch does fix it:

[local] test=# select
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb,
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb::text::jsonb;
 ?column? | ?column? 
--+--
 t| f
(1 row)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: dynahash replacement for buffer table

2015-01-27 Thread Andres Freund
On 2015-01-27 10:36:35 -0500, Robert Haas wrote:
 On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas robertmh...@gmail.com wrote:
  This developed a slight merge conflict.  I've rebased the attached
  version, and I also took the step of getting rid of buf_table.c, as I
  think I proposed somewhere upthread.  This avoids the overhead of
  constructing a BufferTag only to copy it into a BufferLookupEnt, plus
  some function calls and so forth.  A quick-and-dirty test suggests
  this might not have cut down on the 1-client overhead much, but I
  think it's worth doing anyway: it's certainly saving a few cycles, and
  I don't think it's complicating anything measurably.
 
 So here are median-of-three results for 5-minute read-only pgbench
 runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
 various client counts.

 That's extremely unimpressive.  You (Andres) showed a much bigger
 benefit on a different machine with a much higher scale factor (5000)
 so I don't know whether the modest benefit here is because of the
 different machine, the different scale factor, or what.

Based on my test on hydra some time back the bottleneck is nearly
entirely in GetSnapshotData() at higher client counts. So I'm not that
surprised you don't see the big benefit here.  IIRC on hydra the results
for using a large enough shared_buffers setting for a fully cached run
at that scale is pretty close to your master results - so there's really
not much performance increase to be expected by making the buf table
lockless.

I guess you would see a slightly bigger difference at a different
shared_buffer/scale combination. IIRC a scale 1000 is about 15GB large?
So about half of the data set fit in shared buffers. In the scale 5k
results I posted it was about 1/5.

I had also tested on a four-socket x86 machine where GetSnapshotData()
was a big, but not the sole, bottleneck.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Windows buildfarm animals are still not happy with abbreviated keys patch

2015-01-27 Thread Robert Haas
On Mon, Jan 26, 2015 at 9:52 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Jan 27, 2015 at 4:21 AM, Robert Haas robertmh...@gmail.com wrote:
 That's what I hope to find out.  :-)
 Buildfarm seems happy now. I just gave a try to that on one of my
 small Windows VMs and compared the performance with 9.4 for this
 simple test case when building with MSVC 2010:
 create table aa as select random()::text as a, 'filler filler filler'
 as b from generate_series(1,100);
 create index aai on aa(a);
 On 9.4, the index creation took 26.5s, while on master it took 18s.
 That's nice, particularly for things like a restore from a dump.

Cool.  That's a little bit smaller win than I would have expected
given my Linux results, but it's certainly respectable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Robert Haas
On Fri, Jan 23, 2015 at 1:48 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2015-01-22 20:54:47 -0500, Stephen Frost wrote:
 * Bruce Momjian (br...@momjian.us) wrote:
  On Fri, Jan 23, 2015 at 01:19:33AM +0100, Andres Freund wrote:
   Or do you - as the text edited in your patch, but not the quote above -
   mean to run pg_upgrade just on the primary and then rsync?
 
  No, I was going to run it on both, then rsync.

 I'm pretty sure this is all a lot easier than you believe it to be.  If
 you want to recreate what pg_upgrade does to a cluster then the simplest
 thing to do is rsync before removing any of the hard links.  rsync will
 simply recreate the same hard link tree that pg_upgrade created when it
 ran, and update files which were actually changed (the catalog tables).

 I don't understand why that'd be better than simply fixing (yes, that's
 imo the correct term) pg_upgrade to retain relfilenodes across the
 upgrade. Afaics there's no conflict risk and it'd make the clusters much
 more similar, which would be good; independent of rsyncing standbys.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 27, 2015 at 9:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 That's certainly impossible for the system catalogs, which means you
 have to be able to deal with relfilenode discrepancies for them, which
 means that maintaining the same relfilenodes for user tables is of
 dubious value.

 Why is that impossible for the system catalogs?

New versions aren't guaranteed to have the same system catalogs, let alone
the same relfilenodes for them.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Andres Freund
On 2015-01-27 10:20:48 -0500, Robert Haas wrote:
 On Tue, Jan 27, 2015 at 9:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  On Fri, Jan 23, 2015 at 1:48 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
  I don't understand why that'd be better than simply fixing (yes, that's
  imo the correct term) pg_upgrade to retain relfilenodes across the
  upgrade. Afaics there's no conflict risk and it'd make the clusters much
  more similar, which would be good; independent of rsyncing standbys.
 
  +1.
 
  That's certainly impossible for the system catalogs, which means you
  have to be able to deal with relfilenode discrepancies for them, which
  means that maintaining the same relfilenodes for user tables is of
  dubious value.
 
 Why is that impossible for the system catalogs?

Maybe it's not impossible for existing catalogs, but it's certainly
complicated. But I don't think it's all that desirable anyway - they're
not the same relation after the pg_upgrade anyway (initdb/pg_dump
filled them). That's different for the user defined relations.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: dynahash replacement for buffer table

2015-01-27 Thread Robert Haas
On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas robertmh...@gmail.com wrote:
 This developed a slight merge conflict.  I've rebased the attached
 version, and I also took the step of getting rid of buf_table.c, as I
 think I proposed somewhere upthread.  This avoids the overhead of
 constructing a BufferTag only to copy it into a BufferLookupEnt, plus
 some function calls and so forth.  A quick-and-dirty test suggests
 this might not have cut down on the 1-client overhead much, but I
 think it's worth doing anyway: it's certainly saving a few cycles, and
 I don't think it's complicating anything measurably.

So here are median-of-three results for 5-minute read-only pgbench
runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
various client counts.

clients - master@4b2a25 - master+chash-buftable-v2
1 8319.254199 8105.479632
2 15485.561948 15138.227533
3 23690.186576 23264.702784
4 32157.346740 31536.870881
5 40879.532747 40455.794841
6 49063.279970 49625.681573
7 57518.374517 57492.275197
8 65351.415323 65622.211763
16 126166.416498 126668.793798
24 155727.916112 155587.414299
32 180329.012019 179543.424754
40 201384.186317 200109.614362
48 222352.265595 225688.574611
56 240400.659338 254398.144976
64 253699.031075 266624.224699
72 261198.336625 270652.578322
80 264569.862257 270332.738084

That's extremely unimpressive.  You (Andres) showed a much bigger
benefit on a different machine with a much higher scale factor (5000),
so I don't know whether the modest benefit here is because of the
different machine, the different scale factor, or what.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Tue, Jan 27, 2015 at 9:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  That's certainly impossible for the system catalogs, which means you
  have to be able to deal with relfilenode discrepancies for them, which
  means that maintaining the same relfilenodes for user tables is of
  dubious value.
 
  Why is that impossible for the system catalogs?
 
 New versions aren't guaranteed to have the same system catalogs, let alone
 the same relfilenodes for them.

Indeed, new versions almost certainly have wholly new system catalogs.

While there might be a reason to keep the relfilenodes the same, it
doesn't actually help with the pg_upgrade use-case we're currently
discussing (at least, not without additional help).  The problem is that
we certainly must transfer all the new catalogs, but how would rsync
know that those catalog files have to be transferred but not the user
relations?  Using --size-only would mean that system catalogs whose
sizes happen to match after the upgrade wouldn't be transferred and that
would certainly lead to a corrupt situation.

Andres proposed a helper script which would go through the entire tree
on the remote side and set all the timestamps on the remote side to
match those on the local side (prior to the pg_upgrade).  If all the
relfilenodes remained the same and the timestamps on the catalog tables
all changed then it might work to do (without using --size-only):

stop-cluster
set-timestamp-script
pg_upgrade
rsync new_data_dir - remote:existing_cluster

This would mean that any other files which happened to be changed by
pg_upgrade beyond the catalog tables would also get copied across.  The
issue that I see with that is that if the pg_upgrade process does touch
anything outside of the system catalogs, then its documented revert
mechanism (rename the control file and start the old cluster back up,
prior to having started the new cluster) wouldn't be valid.  Requiring
an extra script which runs around changing timestamps on files is a bit
awkward too, though I suppose possible, and then we'd also have to
document that this process only works with $version of pg_upgrade that
does the preservation of the relfilenodes.

I suppose there's also technically a race condition to consider, if the
whole thing is scripted and pg_upgrade manages to change an existing
file in the same second that the old cluster did then that file wouldn't
be recognized by rsync as having been updated.  That's not too hard
to address, though: just wait a second somewhere in there.  Still, I'm
not really sure that this approach gains us much over the one
Bruce is proposing.
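
For concreteness, a minimal sketch of the kind of timestamp-matching helper
described above (purely illustrative: the function and paths are hypothetical,
it assumes both trees are reachable as local paths, and a real helper would
have to run on, or push its results to, the standby):

    #!/usr/bin/env python3
    # Hypothetical sketch: copy file mtimes from the master's old cluster onto
    # the standby's copy of it, so that a later rsync run without --size-only
    # only re-transfers the files pg_upgrade actually rewrote.
    import os
    import sys

    def sync_mtimes(src_root, dst_root):
        for dirpath, _dirs, files in os.walk(src_root):
            rel = os.path.relpath(dirpath, src_root)
            for name in files:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dst_root, rel, name)
                if os.path.exists(dst):
                    st = os.stat(src)
                    # apply the source's atime/mtime to the matching file
                    os.utime(dst, (st.st_atime, st.st_mtime))

    if __name__ == '__main__':
        sync_mtimes(sys.argv[1], sys.argv[2])

The idea, as described above, would be to run something like this before
pg_upgrade, so that afterwards only the files pg_upgrade rewrote look newer
to a plain rsync.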

Thanks,

Stephen




Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Tue, Jan 27, 2015 at 9:36 AM, Stephen Frost sfr...@snowman.net wrote:
  The example listed works, but only when it's a local rsync:
 
  rsync --archive --hard-links --size-only old_dir new_dir remote_dir
 
  Perhaps a better example (or additional one) would be with a remote
  rsync, including clarification of old and new dir, like so:
 
  (run in /var/lib/postgresql)
  rsync --archive --hard-links --size-only \
9.3/main \
9.4/main \
server:/var/lib/postgresql/
 
  Note that 9.3/main and 9.4/main are two source directories for rsync to
  copy over, while server:/var/lib/postgresql/ is a remote destination
  directory.  The above directories match a default Debian/Ubuntu install.
 
 My point is that Bruce's patch suggests looking for remote_dir in
 the rsync documentation, but no such term appears there.

Ah, well, perhaps we could simply add a bit of clarification to this:

for details on specifying optionremote_dir/

like so:

for details on specifying the destination optionremote_dir/

?

On my system, the rsync man page has '[DEST]' in the synopsis, but it
doesn't actually go on to specifically define what 'DEST' is, rather
referring to it later as 'destination' or 'remote directory'.

I'm sure other suggestions would be welcome if they'd help clarify.

Thanks,

Stephen




Re: [HACKERS] numeric access out of bounds

2015-01-27 Thread Tom Lane
I wrote:
 Andrew Gierth and...@tao11.riddles.org.uk writes:
 I can see two possible fixes: one to correct the assumptions in the
 macros, the other to check for NaN before calling init_var_from_num in
 numeric_send (all the other functions seem to do this check explicitly).
 Which would be preferable?

 I'm inclined to think special-casing NaN in numeric_send is the thing to
 do, since it is that way everywhere else.  If we could push down all the
 special casing into init_var_from_num then that would probably be better,
 but in most cases that looks unattractive.

After taking a closer look I realized that you were right with your first
alternative: the field access macros are just plain wrong, and we should
fix 'em.  Done now.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Abbreviated keys for Numeric

2015-01-27 Thread Andrew Gierth
 Peter == Peter Geoghegan p...@heroku.com writes:

 Peter What I find particularly interesting about this patch is that it
 Peter makes sorting numerics significantly faster than even sorting
 Peter float8 values,

Played some more with this. Testing on some different gcc versions
showed that the results were not consistent between versions; the latest
I tried (4.9) showed float8 as somewhat faster, while 4.7 showed float8
as slightly slower; the difference was all in the time of the float8
case, the time for numeric was virtually the same.

For one specific test query, taking the best time of multiple runs,

float8:   gcc4.7 = 980ms, gcc4.9 = 833ms
numeric:  gcc4.7 = 940ms, gcc4.9 = 920ms

(vs. 650ms for bigint on either version)

So honestly I think abbreviation for float8 is a complete red herring.

Also, I couldn't get any detectable benefit from inlining
DatumGetFloat8, though I may have to play more with that to be certain
(above tests did not have any float8-related modifications at all, just
the datum and numeric abbrevs patches).

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 12:24 AM, Noah Misch wrote:

For the moment, maybe I could commit the fix for the non-U+0000 case for
escape_json, and we could think some more about detecting and warning about
the problem strings.

+1 for splitting development that way.  Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.




OK, so something like this patch? I'm mildly concerned that you and I 
are the only ones commenting on this.


cheers

andrew


diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..8d2b38f 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -94,6 +94,8 @@ static void datum_to_json(Datum val, bool is_null, StringInfo result,
 static void add_json(Datum val, bool is_null, StringInfo result,
 		 Oid val_type, bool key_scalar);
 static text *catenate_stringinfo_string(StringInfo buffer, const char *addon);
+static void  escape_json_work(StringInfo buf, const char *str,
+			  bool jsonb_unicode);
 
 /* the null action object used for pure validation */
 static JsonSemAction nullSemAction =
@@ -2356,6 +2358,18 @@ json_object_two_arg(PG_FUNCTION_ARGS)
 void
 escape_json(StringInfo buf, const char *str)
 {
+	escape_json_work( buf, str, false);
+}
+
+void
+escape_jsonb(StringInfo buf, const char *str)
+{
+	escape_json_work( buf, str, true);
+}
+
+static void
+escape_json_work(StringInfo buf, const char *str, bool jsonb_unicode)
+{
 	const char *p;
 
 	appendStringInfoCharMacro(buf, '\"');
@@ -2398,7 +2412,9 @@ escape_json(StringInfo buf, const char *str)
  * unicode escape that should be present is \u0000, all the
  * other unicode escapes will have been resolved.
  */
-if (p[1] == 'u' &&
+if (jsonb_unicode && strncmp(p+1, "u0000", 5) == 0)
+	appendStringInfoCharMacro(buf, *p);
+else if (!jsonb_unicode && p[1] == 'u' &&
 	isxdigit((unsigned char) p[2]) &&
 	isxdigit((unsigned char) p[3]) &&
 	isxdigit((unsigned char) p[4]) &&
diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 644ea6d..22e1263 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -305,7 +305,7 @@ jsonb_put_escaped_value(StringInfo out, JsonbValue *scalarVal)
 			appendBinaryStringInfo(out, null, 4);
 			break;
 		case jbvString:
-			escape_json(out, pnstrdup(scalarVal->val.string.val, scalarVal->val.string.len));
+			escape_jsonb(out, pnstrdup(scalarVal->val.string.val, scalarVal->val.string.len));
 			break;
 		case jbvNumeric:
 			appendStringInfoString(out,
diff --git a/src/include/utils/json.h b/src/include/utils/json.h
index 6f69403..661d7bd 100644
--- a/src/include/utils/json.h
+++ b/src/include/utils/json.h
@@ -42,7 +42,13 @@ extern Datum json_build_array_noargs(PG_FUNCTION_ARGS);
 extern Datum json_object(PG_FUNCTION_ARGS);
 extern Datum json_object_two_arg(PG_FUNCTION_ARGS);
 
+/*
+ * escape_jsonb is more strict about unicode escapes, and will only not
+ * escape \u0000, as that is the only unicode escape that should still be
+ * present.
+ */
 extern void escape_json(StringInfo buf, const char *str);
+extern void escape_jsonb(StringInfo buf, const char *str);
 
 extern Datum json_typeof(PG_FUNCTION_ARGS);
 
diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out
index aa5686f..b5457c4 100644
--- a/src/test/regress/expected/jsonb.out
+++ b/src/test/regress/expected/jsonb.out
@@ -324,11 +324,20 @@ select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
+-- unicode null - backslash not escaped
+-- note: using to_jsonb here bypasses the jsonb input routine.
+select jsonb '\u', to_jsonb(text '\u');
+  jsonb   | to_jsonb 
+--+--
+ \u | \u
+(1 row)
+
+-- any other unicode-looking escape - backslash is escaped
+-- all unicode characters should have been resolved
 select to_jsonb(text '\uabcd');
- to_jsonb 
---
- \uabcd
+ to_jsonb  
+---
+ \\uabcd
 (1 row)
 
 -- any other backslash is escaped
@@ -338,6 +347,14 @@ select to_jsonb(text '\abcd');
  \\abcd
 (1 row)
 
+-- doubled backslash should be reproduced
+-- this isn't a unicode escape, it's an escaped backslash followed by 'u'
+select jsonb '\\uabcd';
+   jsonb   
+---
+ \\uabcd
+(1 row)
+
 --jsonb_agg
 CREATE TEMP TABLE rows AS
 SELECT x, 'txt' || x as y
diff --git a/src/test/regress/expected/jsonb_1.out b/src/test/regress/expected/jsonb_1.out
index 687ae63..10aaac8 100644
--- a/src/test/regress/expected/jsonb_1.out
+++ b/src/test/regress/expected/jsonb_1.out
@@ -324,11 +324,20 @@ select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
+-- unicode null - backslash not escaped
+-- note: using to_jsonb here bypasses the jsonb input routine.
+select jsonb '\u', to_jsonb(text '\u');
+  jsonb   | to_jsonb 
+--+--
+ \u | \u
+(1 row)
+
+-- any 

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 12:24 AM, Noah Misch wrote:
 +1 for splitting development that way.  Fixing the use of escape_json() is
 objective, but the choices around the warning are more subtle.

 OK, so something like this patch? I'm mildly concerned that you and I 
 are the only ones commenting on this.

Doesn't seem to me like this fixes anything.  If the content of a jsonb
value is correct, the output will be the same with or without this patch;
and if it's not, this isn't going to really improve matters.

I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] File based incremental backup v6

2015-01-27 Thread Marco Nenciarini
Hi,

here it is another version of the file based incremental backup patch.

Changelog from the previous one:

* the pg_basebackup --incremental option now takes the directory containing
  the base backup instead of the backup profile file
* rename the backup_profile file at the same time as the backup_label file
  when starting from a backup for the first time
* handle pg_basebackup -D - by appending the backup profile to the
  resulting tar stream
* added documentation for -I/--incremental option to pg_basebackup doc
* updated replication protocol documentation


The rationale for moving the backup_profile out of the way during
recovery is to avoid using a data directory that has already been
started as the base of a backup.

I've also lightly improved the pg_restorebackup PoC implementing the
syntax advised by Gabriele:

./pg_restorebackup.py DESTINATION BACKUP_1 BACKUP_2 [BACKUP_3, ...]

It also supports relocation of tablespaces with the -T option.
The -T option is mandatory if any tablespaces were defined in the
PostgreSQL instance when the incremental backup was taken.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
From 56fed6e250280f8e5d5c17252db631f33a3c9d8f Mon Sep 17 00:00:00 2001
From: Marco Nenciarini marco.nenciar...@2ndquadrant.it
Date: Tue, 14 Oct 2014 14:31:28 +0100
Subject: [PATCH] File-based incremental backup v6

Add backup profile to pg_basebackup
INCREMENTAL option implementation
---
 doc/src/sgml/protocol.sgml |  86 -
 doc/src/sgml/ref/pg_basebackup.sgml|  31 ++-
 src/backend/access/transam/xlog.c  |  18 +-
 src/backend/access/transam/xlogfuncs.c |   2 +-
 src/backend/replication/basebackup.c   | 335 +++--
 src/backend/replication/repl_gram.y|   6 +
 src/backend/replication/repl_scanner.l |   1 +
 src/bin/pg_basebackup/pg_basebackup.c  | 191 +--
 src/include/access/xlog.h  |   3 +-
 src/include/replication/basebackup.h   |   5 +
 10 files changed, 639 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index efe75ea..fc24648 100644
*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
*** The commands accepted in walsender mode 
*** 1882,1888 
/varlistentry
  
varlistentry
! termBASE_BACKUP [literalLABEL/literal 
replaceable'label'/replaceable] [literalPROGRESS/literal] 
[literalFAST/literal] [literalWAL/literal] [literalNOWAIT/literal] 
[literalMAX_RATE/literal replaceablerate/replaceable]
   indextermprimaryBASE_BACKUP/primary/indexterm
  /term
  listitem
--- 1882,1888 
/varlistentry
  
varlistentry
! termBASE_BACKUP [literalLABEL/literal 
replaceable'label'/replaceable] [literalINCREMENTAL/literal 
replaceable'start_lsn'/replaceable] [literalPROGRESS/literal] 
[literalFAST/literal] [literalWAL/literal] [literalNOWAIT/literal] 
[literalMAX_RATE/literal replaceablerate/replaceable]
   indextermprimaryBASE_BACKUP/primary/indexterm
  /term
  listitem
*** The commands accepted in walsender mode 
*** 1905,1910 
--- 1905,1928 
 /varlistentry
  
 varlistentry
+ termliteralINCREMENTAL/literal 
replaceable'start_lsn'/replaceable/term
+ listitem
+  para
+   Requests a file-level incremental backup of all files changed after
+   replaceablestart_lsn/replaceable. When operating with
+   literalINCREMENTAL/literal, the content of every block-organised
+   file will be analyzed and the file will be sent if at least one
+   block has a LSN higher than or equal to the provided
+   replaceablestart_lsn/replaceable.
+  /para
+  para
+   The filenamebackup_profile/filename will contain information on
+   every file that has been analyzed, even those that have not been 
sent.
+  /para
+ /listitem
+/varlistentry
+ 
+varlistentry
  termliteralPROGRESS//term
  listitem
   para
*** The commands accepted in walsender mode 
*** 2022,2028 
quoteustar interchange format/ specified in the POSIX 1003.1-2008
standard) dump of the tablespace contents, except that the two trailing
blocks of zeroes specified in the standard are omitted.
!   After the tar data is complete, a final ordinary result set will be 
sent,
containing the WAL end position of the backup, in the same format as
the start position.
   /para
--- 2040,2046 
quoteustar interchange format/ specified in the POSIX 1003.1-2008
standard) dump of the tablespace contents, except that the two trailing
blocks of zeroes specified in the standard are omitted.
!   After the tar data is complete, an ordinary result set will be sent,
containing the WAL end position of the 

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 12:23 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 01/27/2015 12:24 AM, Noah Misch wrote:

+1 for splitting development that way.  Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.

OK, so something like this patch? I'm mildly concerned that you and I
are the only ones commenting on this.

Doesn't seem to me like this fixes anything.  If the content of a jsonb
value is correct, the output will be the same with or without this patch;
and if it's not, this isn't going to really improve matters.

I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.





The input \\uabcd will be stored correctly as \uabcd, but this will in 
turn be rendered as \uabcd, whereas it should be rendered as \\uabcd. 
That's what the patch fixes.


There are two problems here and this addresses one of them. The other 
problem is the ambiguity regarding \\u0000 and \u0000.
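
As an aside, the non-ambiguous half of this (re-escaping the stored backslash
on output) can be illustrated outside PostgreSQL with Python's json module
(illustrative only, not PostgreSQL code):

    import json

    # Input 1: six characters - a literal backslash followed by "uabcd"
    s1 = json.loads(r'"\\uabcd"')
    # Input 2: the single code point U+ABCD, written as a unicode escape
    s2 = json.loads(r'"\uabcd"')

    print(s1 == s2)          # False - the stored strings differ
    print(json.dumps(s1))    # "\\uabcd"  (backslash re-escaped on output)
    print(json.dumps(s2))    # "\uabcd"   (escape for the non-ASCII character)

Since the stored values differ for these two inputs, correct output is just a
matter of escaping the backslash; the remaining ambiguity discussed in this
thread is confined to the one escape jsonb cannot resolve at input time.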


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] File based Incremental backup v7

2015-01-27 Thread Marco Nenciarini
On 27/01/15 10:25, Giuseppe Broccolo wrote: Hi Marco,

 On 16/01/15 16:55, Marco Nenciarini wrote:
 On 14/01/15 17:22, Gabriele Bartolini wrote:
 
  My opinion, Marco, is that for version 5 of this patch, you:
 
  1) update the information on the wiki (it is outdated - I know you have
  been busy with LSN map optimisation)

 Done.

  2) modify pg_basebackup in order to accept a directory (or tar
file) and
  automatically detect the LSN from the backup profile

 New version of patch attached. The -I parameter now requires a backup
 profile from a previous backup. I've added a sanity check that forbids
 file-level incremental backups if the base timeline is different from
 the current one.

  3) add the documentation regarding the backup profile and pg_basebackup
 

 Next on my TODO list.

  Once we have all of this, we can continue trying the patch. Some
  unexplored paths are:
 
  * tablespace usage

 I've improved my pg_restorebackup python PoC. It now supports
tablespaces.

 About tablespaces, I noticed that any reference to tablespace locations
 is lost during the recovery of an incremental backup that changes the
 tablespace mapping (-T option). Here are the steps I followed:

   * creating and filling a test database obtained through pgbench

 psql -c "CREATE DATABASE pgbench"
 pgbench -U postgres -i -s 5 -F 80 pgbench

   * a first base backup with pg_basebackup:

 mkdir -p backups/$(date '+%d%m%y%H%M')/data && pg_basebackup -v -F
p -D backups/$(date '+%d%m%y%H%M')/data -x

   * creation of a new tablespace, alter the table pgbench_accounts to
 set the new tablespace:

 mkdir -p /home/gbroccolo/pgsql/tbls
 psql -c "CREATE TABLESPACE tbls LOCATION
'/home/gbroccolo/pgsql/tbls'"
 psql -c "ALTER TABLE pgbench_accounts SET TABLESPACE tbls" pgbench

   * Doing some work on the database:

 pgbench -U postgres -T 120 pgbench

   * a second incremental backup with pg_basebackup specifying the new
 location for the tablespace through the tablespace mapping:

 mkdir -p backups/$(date '+%d%m%y%H%M')/data backups/$(date
'+%d%m%y%H%M')/tbls && pg_basebackup -v -F p -D backups/$(date
'+%d%m%y%H%M')/data -x -I backups/2601151641/data/backup_profile -T
/home/gbroccolo/pgsql/tbls=/home/gbroccolo/pgsql/backups/$(date
'+%d%m%y%H%M')/tbls

   * a recovery based on the tool pg_restorebackup.py attached in
 http://www.postgresql.org/message-id/54b9428e.9020...@2ndquadrant.it

 ./pg_restorebackup.py backups/2601151641/data
backups/2601151707/data /tmp/data -T
/home/gbroccolo/pgsql/backups/2601151707/tbls=/tmp/tbls

 In the last step, I obtained the following stack trace:

 Traceback (most recent call last):
   File ./pg_restorebackup.py, line 74, in module
 shutil.copy2(base_file, dest_file)
   File
/home/gbroccolo/.pyenv/versions/2.7.5/lib/python2.7/shutil.py, line
130, in copy2
 copyfile(src, dst)
   File
/home/gbroccolo/.pyenv/versions/2.7.5/lib/python2.7/shutil.py, line
82, in copyfile
 with open(src, 'rb') as fsrc:
 IOError: [Errno 2] No such file or directory:
'backups/2601151641/data/base/16384/16406_fsm'


 Any idea on what's going wrong?


I've done some tests and it looks like FSM pages always have
InvalidXLogRecPtr as their LSN.

I've updated the patch to always include files if all of their pages have
LSN == InvalidXLogRecPtr.
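
For anyone trying to reproduce the check outside the patch, here is a rough
sketch of the per-block test (illustrative Python, not the patch's C code; it
assumes the default 8 kB block size, a little-endian build, and relation files
that are a whole number of blocks):

    import struct

    BLCKSZ = 8192        # default PostgreSQL block size
    INVALID_LSN = 0      # InvalidXLogRecPtr

    def file_needs_backup(path, start_lsn):
        """True if any block's pd_lsn is >= start_lsn, or if every block
        has an invalid (zero) LSN, as seems to happen with FSM pages."""
        all_invalid = True
        with open(path, 'rb') as f:
            while True:
                block = f.read(BLCKSZ)
                if len(block) < 8:
                    break
                # pd_lsn is the first 8 bytes of the page header: two
                # little-endian uint32s, xlogid (high half), xrecoff (low half)
                xlogid, xrecoff = struct.unpack_from('<II', block, 0)
                lsn = (xlogid << 32) | xrecoff
                if lsn != INVALID_LSN:
                    all_invalid = False
                    if lsn >= start_lsn:
                        return True
        return all_invalid

With the v7 behaviour, a file whose pages are all at InvalidXLogRecPtr is
always included, which sidesteps the missing _fsm file reported above.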

Updated patch v7 attached.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
From bffcdf0d5c3258c8848215011eb8e8b3377d9f18 Mon Sep 17 00:00:00 2001
From: Marco Nenciarini marco.nenciar...@2ndquadrant.it
Date: Tue, 14 Oct 2014 14:31:28 +0100
Subject: [PATCH] File-based incremental backup v7

Add backup profiles and --incremental to pg_basebackup
---
 doc/src/sgml/protocol.sgml |  86 -
 doc/src/sgml/ref/pg_basebackup.sgml|  31 ++-
 src/backend/access/transam/xlog.c  |  18 +-
 src/backend/access/transam/xlogfuncs.c |   2 +-
 src/backend/replication/basebackup.c   | 344 +++--
 src/backend/replication/repl_gram.y|   6 +
 src/backend/replication/repl_scanner.l |   1 +
 src/bin/pg_basebackup/pg_basebackup.c  | 191 --
 src/include/access/xlog.h  |   3 +-
 src/include/replication/basebackup.h   |   5 +
 10 files changed, 648 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index efe75ea..fc24648 100644
*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
*** The commands accepted in walsender mode 
*** 1882,1888 
/varlistentry
  
varlistentry
! termBASE_BACKUP [literalLABEL/literal 
replaceable'label'/replaceable] [literalPROGRESS/literal] 
[literalFAST/literal] [literalWAL/literal] [literalNOWAIT/literal] 
[literalMAX_RATE/literal replaceablerate/replaceable]
   indextermprimaryBASE_BACKUP/primary/indexterm
  /term
  listitem
--- 1882,1888 
/varlistentry
  
   

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2015-01-27 Thread Abhijit Menon-Sen
At 2015-01-27 17:00:27 -0600, jim.na...@bluetreble.com wrote:

 It would be best to get this into a commit fest so it's not lost.

It's there already.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Bruce Momjian
On Tue, Jan 27, 2015 at 09:36:58AM -0500, Stephen Frost wrote:
 The example listed works, but only when it's a local rsync:
 
 rsync --archive --hard-links --size-only old_dir new_dir remote_dir
 
 Perhaps a better example (or additional one) would be with a remote
 rsync, including clarification of old and new dir, like so:
 
 (run in /var/lib/postgresql)
 rsync --archive --hard-links --size-only \
   9.3/main \
   9.4/main \
   server:/var/lib/postgresql/
 
 Note that 9.3/main and 9.4/main are two source directories for rsync to
 copy over, while server:/var/lib/postgresql/ is a remote destination
 directory.  The above directories match a default Debian/Ubuntu install.

OK, sorry everyone was confused by 'remote_dir'.  Does this new patch
help?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/doc/src/sgml/pgupgrade.sgml b/doc/src/sgml/pgupgrade.sgml
new file mode 100644
index e1cd260..4e4fe64
*** a/doc/src/sgml/pgupgrade.sgml
--- b/doc/src/sgml/pgupgrade.sgml
*** pg_upgrade.exe
*** 409,414 
--- 409,486 
 /step
  
 step
+ titleUpgrade any Log-Shipping Standby Servers/title
+ 
+ para
+  If you have Log-Shipping Standby Servers (xref
+  linkend=warm-standby), follow these steps to upgrade them (before
+  starting any servers):
+ /para
+ 
+ procedure
+ 
+  step
+   titleInstall the new PostgreSQL binaries on standby servers/title
+ 
+   para
+Make sure the new binaries and support files are installed
+on all the standby servers.  Do emphasisnot/ run
+applicationinitdb/.  If applicationinitdb/ was run, delete
+the standby server data directories.  Also, install any custom
+shared object files on the new standbys that you installed in the
+new master cluster.
+   /para
+  /step
+ 
+  step
+   titleRun applicationrsync//title
+ 
+   para
+From a directory that is above the old and new database cluster
+directories, run this for each slave:
+ 
+ programlisting
+rsync --archive --hard-links --size-only old_dir new_dir remote_dir
+ /programlisting
+ 
+where optionold_dir/ and optionnew_dir/ are relative
+to the current directory, and optionremote_dir/ is
+emphasisabove/ the old and new cluster directories on
+the standby server.  The old and new relative cluster paths
+must match on the master and standby server.  Consult the
+applicationrsync/ manual page for details on specifying the
+remote directory, e.g. literalslavehost:/var/lib/postgresql//.
+applicationrsync/ will be fast when option--link/ mode is
+used because it will create hard links on the remote server rather
+than transferring user data.
+   /para
+  /step
+ 
+  step
+   titleConfigure log-shipping to standby servers/title
+ 
+   para
+Configure the servers for log shipping.  (You do not need to run
+functionpg_start_backup()/ and functionpg_stop_backup()/
+or take a file system backup as the slaves are still sychronized
+with the master.)
+   /para
+  /step
+ 
+ /procedure
+ 
+/step
+ 
+step
+ titleStart the new server/title
+ 
+ para
+  The new server and any applicationrsync/'ed standby servers can
+  now be safely started.
+ /para
+/step
+ 
+step
  titlePost-Upgrade processing/title
  
  para
*** psql --username postgres --file script.s
*** 548,562 
/para
  
para
-A Log-Shipping Standby Server (xref linkend=warm-standby) cannot
-be upgraded because the server must allow writes.  The simplest way
-is to upgrade the primary and use commandrsync/ to rebuild the
-standbys.  You can run commandrsync/ while the primary is down,
-or as part of a base backup (xref linkend=backup-base-backup)
-which overwrites the old standby cluster.
-   /para
- 
-   para
 If you want to use link mode and you do not want your old cluster
 to be modified when the new cluster is started, make a copy of the
 old cluster and upgrade that in link mode. To make a valid copy
--- 620,625 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread Bruce Momjian
On Tue, Jan 27, 2015 at 09:44:51PM -0500, David Steele wrote:
 On 1/27/15 9:32 PM, Bruce Momjian wrote
  Now, this isn't actually a problem for the first time that file is
  backed up- the issue is if that file isn't changed again.  rsync won't
  re-copy it, but that change that rsync missed won't be in the WAL
  history for the *second* backup that's done (only the first), leading to
  a case where that file would end up corrupted.
  Interesting problem, but doesn't rsync use sub-second accuracy?
 
 According to my empirical testing on Linux and OSX the answer is no:
 rsync does not use sub-second accuracy.  This seems to be true even on
 file systems like ext4 that support millisecond mod times, at least it
 was true on Ubuntu 12.04 running ext4.
 
 Even on my laptop there is a full half-second of vulnerability for
 rsync.  Faster systems may have a larger window.

OK, bummer.  Well, I don't think we ever recommend running rsync without
checksums, but the big problem is that rsync doesn't do checksums by
default.  :-(

pg_upgrade recommends using two rsyncs:

   To make a valid copy of the old cluster, use commandrsync/ to create
   a dirty copy of the old cluster while the server is running, then shut
   down the old server and run commandrsync/ again to update the copy
   with any changes to make it consistent.  You might want to exclude some

I am afraid that will not work, as it could miss changes, right?  When
would the default mod-time checking ever be safe?
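
One possible hedge, at the cost of reading every file on both sides, is to
force content checksums for the second pass instead of trusting size and
mtime; a sketch, with hypothetical paths (--checksum is a standard rsync
option):

    import subprocess

    # Hypothetical second pass: --checksum makes rsync compare file contents
    # rather than size + mtime, closing the same-second window at the price
    # of reading every file on both ends.
    subprocess.check_call([
        'rsync', '--archive', '--checksum',
        '/var/lib/postgresql/9.3/main/',        # hypothetical old cluster
        '/var/lib/postgresql/9.3/main_copy/',   # hypothetical local copy
    ])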

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-27 Thread David Steele
On 1/27/15 9:32 PM, Bruce Momjian wrote
 Now, this isn't actually a problem for the first time that file is
 backed up- the issue is if that file isn't changed again.  rsync won't
 re-copy it, but that change that rsync missed won't be in the WAL
 history for the *second* backup that's done (only the first), leading to
 a case where that file would end up corrupted.
 Interesting problem, but doesn't rsync use sub-second accuracy?

According to my empirical testing on Linux and OSX the answer is no:
rsync does not use sub-second accuracy.  This seems to be true even on
file systems like ext4 that support millisecond mod times, at least it
was true on Ubuntu 12.04 running ext4.

Even on my laptop there is a full half-second of vulnerability for
rsync.  Faster systems may have a larger window.
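
For anyone who wants to repeat that check on their own platform, a quick
sketch (file paths are hypothetical and assume an rsync run has already
copied the source file to the destination):

    import os

    # Compare nanosecond mtimes of a source file and its rsync'ed copy to
    # see whether sub-second precision survived the transfer.
    src = '/tmp/rsync_src/somefile'   # hypothetical source file
    dst = '/tmp/rsync_dst/somefile'   # hypothetical rsync'ed copy

    s, d = os.stat(src), os.stat(dst)
    print('source mtime_ns:', s.st_mtime_ns)
    print('copy   mtime_ns:', d.st_mtime_ns)
    print('sub-second part preserved:',
          s.st_mtime_ns % 10**9 == d.st_mtime_ns % 10**9)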

-- 
- David Steele
da...@pgmasters.net



