Re: [HACKERS] Curing plpgsql's memory leaks for statement-lifespan values

2016-08-19 Thread Jim Nasby

On 8/19/16 10:42 AM, Tom Lane wrote:

It appeared to me after collecting some stats about the functions present
in the regression tests that for larger functions, the extra space eaten
is just some small multiple (like 2x-3x) of the function body string
length.  So it's not *that* much data, even for big functions, and it's
only going to be eaten on the first usage of a function within a session.


I wonder if freeing that memory would help speed up further allocations, 
and how much of the expense of the final free for the context gets 
reduced by an early reset. Probably not enough to negate the cost of 
resetting on every call though...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Exporting more function in libpq

2016-08-19 Thread Tatsuo Ishii
> I do not think this is a good idea.  If the purpose of libpq is not
> to abstract away the wire-level protocol, then what is its purpose?

IMHO, what the current libpq API does is handle a limited set of use
cases; it is not an abstraction of the protocol.

> And how could such a tool avoid breaking libpq, anyway?  For one
> example, successfully sending any command message normally results in
> an internal state change in libpq (so that it knows what to do with
> the response).  Your proposed API here doesn't cover that.  Nor does
> it cover actually dealing with the response, which I think would be
> needed in most scenarios where you're trying to deal in custom messages.

Yes, I did not propose anything about handling the message responses. That's
another story.

> If you feel a need to be sending your own messages, I think a locally
> hacked fork of libpq is a better answer.

I have already done that. I just thought it would be useful to share
this in case someone else wants to do the same thing.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] errno clobbering in reorderbuffer

2016-08-19 Thread Alvaro Herrera
Andres Freund wrote:

> Not at all, thanks.

Done, thanks both for the look-over.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol

2016-08-19 Thread Michael Paquier
On Fri, Aug 19, 2016 at 1:51 AM, Heikki Linnakangas  wrote:
> On 08/18/2016 03:45 PM, Michael Paquier wrote:
>>
>> On Thu, Aug 18, 2016 at 9:28 PM, Heikki Linnakangas 
>> wrote:
>> For the current ip.c, I don't have a better idea than putting in
>> src/common/ip.c the set of routines used by both the frontend and
>> backend, and have fe_ip.c the new file that has the frontend-only
>> things. Need a patch?
>
>
> Yes, please. I don't think there's anything there that's needed by only the
> frontend, but some of the functions are needed by only the backend. So I
> think we'll end up with src/common/ip.c, and src/backend/libpq/be-ip.c. (Not
> sure about those names, pick something that makes sense, given what's left
> in the files.)

OK, so let's do that first, correctly. Attached are two patches:
- 0001 moves md5.c to src/common
- 0002 does the same for ip.c.
By the way, it seems to me that having be-ip.c is not really worth it.
I am noticing that only pg_range_sockaddr could be marked as
backend-only. pg_foreach_ifaddr is also used by tools/ifaddrs/, and
that one calls pg_sockaddr_cidr_mask as well. Or is there still some
utility in having src/tools/ifaddrs? If not, we could make
pg_sockaddr_cidr_mask and pg_foreach_ifaddr backend-only. With
pg_range_sockaddr, that would make half of the routines backend-only.

I have not rebased the whole SCRAM series yet... I'll do that after we
agree on those two patches, with the two commits you have already done
cleaned up, of course (thanks, by the way, for those!).
-- 
Michael
From bdc5c5b6d6361a5e5d2de5b5fd749ceb93045aa0 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 19 Aug 2016 11:57:49 +0900
Subject: [PATCH 1/2] Bundle md5.c into src/common/

---
 contrib/passwordcheck/passwordcheck.c |  2 +-
 src/backend/commands/user.c   |  2 +-
 src/backend/libpq/Makefile|  2 +-
 src/backend/libpq/auth.c  |  2 +-
 src/backend/libpq/crypt.c |  2 +-
 src/backend/utils/adt/varlena.c   |  2 +-
 src/common/Makefile   |  2 +-
 src/{backend/libpq => common}/md5.c   | 11 +++
 src/include/{libpq => common}/md5.h   |  2 +-
 src/interfaces/libpq/Makefile |  9 +++--
 src/interfaces/libpq/fe-auth.c|  2 +-
 src/tools/msvc/Mkvcbuild.pm   |  2 +-
 12 files changed, 24 insertions(+), 16 deletions(-)
 rename src/{backend/libpq => common}/md5.c (98%)
 rename src/include/{libpq => common}/md5.h (96%)

diff --git a/contrib/passwordcheck/passwordcheck.c b/contrib/passwordcheck/passwordcheck.c
index b4c1ce0..b38d1e3 100644
--- a/contrib/passwordcheck/passwordcheck.c
+++ b/contrib/passwordcheck/passwordcheck.c
@@ -20,9 +20,9 @@
 #include 
 #endif
 
+#include "common/md5.h"
 #include "commands/user.h"
 #include "fmgr.h"
-#include "libpq/md5.h"
 
 PG_MODULE_MAGIC;
 
diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index b6ea950..821dce3 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -29,7 +29,7 @@
 #include "commands/dbcommands.h"
 #include "commands/seclabel.h"
 #include "commands/user.h"
-#include "libpq/md5.h"
+#include "common/md5.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "utils/acl.h"
diff --git a/src/backend/libpq/Makefile b/src/backend/libpq/Makefile
index 09410c4..82d424f 100644
--- a/src/backend/libpq/Makefile
+++ b/src/backend/libpq/Makefile
@@ -14,7 +14,7 @@ include $(top_builddir)/src/Makefile.global
 
 # be-fsstubs is here for historical reasons, probably belongs elsewhere
 
-OBJS = be-fsstubs.o be-secure.o auth.o crypt.o hba.o ip.o md5.o pqcomm.o \
+OBJS = be-fsstubs.o be-secure.o auth.o crypt.o hba.o ip.o pqcomm.o \
pqformat.o pqmq.o pqsignal.o
 
 ifeq ($(with_openssl),yes)
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index fc8b99b..fd4bc4b 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -21,12 +21,12 @@
 #include 
 #include 
 
+#include "common/md5.h"
 #include "libpq/auth.h"
 #include "libpq/crypt.h"
 #include "libpq/ip.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
-#include "libpq/md5.h"
 #include "miscadmin.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
diff --git a/src/backend/libpq/crypt.c b/src/backend/libpq/crypt.c
index d79f5a2..d84a180 100644
--- a/src/backend/libpq/crypt.c
+++ b/src/backend/libpq/crypt.c
@@ -21,8 +21,8 @@
 #endif
 
 #include "catalog/pg_authid.h"
+#include "common/md5.h"
 #include "libpq/crypt.h"
-#include "libpq/md5.h"
 #include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index bf7c0cd..582d3e4 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -21,8 +21,8 @@
 #include "access/tuptoaster.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
+#include 

Re: [HACKERS] Slowness of extended protocol

2016-08-19 Thread Tatsuo Ishii
> Tatsuo>understanding it always uses unnamed portal even if the SQL is like
> "BEGIN" or "COMMIT" (no parameters). They are too often used. Why not
> doing like this?
> 
> Does it actually work?
> 
> The documentation says named portals last till the end of the transaction:
> 
> https://www.postgresql.org/docs/9.5/static/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY
> 
> 
> doc>If successfully created, a named portal object lasts till the end of
> the current transaction, unless explicitly destroyed

Oh, OK. I had misunderstood and thought that named portals survive beyond the
transaction boundary.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] pg_basebackup wish list

2016-08-19 Thread Michael Paquier
On Fri, Aug 19, 2016 at 2:04 PM, Masahiko Sawada  wrote:
> I agree with adding this as an option and removing directory by default.
> And it looks good to me except for missing new line in usage output.
>
> printf(_("  -l, --label=LABEL  set backup label\n"));
> +   printf(_("  -n, --noclean  do not clean up after errors"));
> printf(_("  -P, --progress show progress information\n"));
>
> Registered this patch to CF1.

+1 for the idea. Looking at the patch it is taking a sane approach.
-- 
Michael




Re: [HACKERS] Slowness of extended protocol

2016-08-19 Thread Vladimir Sitnikov
Tatsuo>understanding it always uses unnamed portal even if the SQL is like
"BEGIN" or "COMMIT" (no parameters). They are too often used. Why not
doing like this?

Does it actually work?

The documentation says named portals last till the end of the transaction:

https://www.postgresql.org/docs/9.5/static/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY


doc>If successfully created, a named portal object lasts till the end of
the current transaction, unless explicitly destroyed

Vladimir


[HACKERS] Logical decoding restart problems

2016-08-19 Thread konstantin knizhnik
Hi,

We are using logical decoding in multimaster and we are facing a problem
where inconsistent transactions are sent to the replica.
Briefly, multimaster uses logical decoding in this way:
1. Each multimaster node is connected to every other node through a logical
decoding channel, so each pair of nodes has its own replication slot.
2. In the normal scenario each replication channel is used to replicate only
those transactions which originated at the source node.
We are using the origin mechanism to skip "foreign" transactions.
3. When an offline cluster node is brought back into the multimaster, we need
to recover this node to the current cluster state.
Recovery is performed from one of the cluster's nodes, so we use only one
replication channel to receive all (self and foreign) transactions.
Only in this case can we guarantee a consistent order of applying transactions
at the recovered node.
After recovery completes, we need to recreate the replication slots with all
other cluster nodes (because we have already replayed transactions from those
nodes).
To restart logical decoding we first drop the existing slot, then create a new
one, and then start logical replication from WAL position 0/0 (invalid LSN).
In this case recovery should start from the last consistent point.

The problem is that for some reason this consistent point is not so consistent,
and we get partially decoded transactions.
I.e., a transaction body consists of two UPDATEs, but the reorder buffer
extracts only one (the last) update and sends this truncated transaction to
the destination, causing a consistency violation at the replica.  I started
investigating the logical decoding code and found several things which I do
not understand.

Assume that we have transactions T1={start_lsn=100, end_lsn=400} and
T2={start_lsn=200, end_lsn=300}.
Transaction T2 is sent to the replica and the replica confirms flush_lsn=300.
If we now want to restart logical decoding, we cannot start at a position less
than 300, because CreateDecodingContext doesn't allow it:

 * start_lsn
 *	The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
 *	from the slot's confirmed_flush; otherwise, start from the specified
 *	location (but move it forwards to confirmed_flush if it's older than
 *	that, see below).
 *
else if (start_lsn < slot->data.confirmed_flush)
{
	/*
	 * It might seem like we should error out in this case, but it's
	 * pretty common for a client to acknowledge a LSN it doesn't have to
	 * do anything for, and thus didn't store persistently, because the
	 * xlog records didn't result in anything relevant for logical
	 * decoding. Clients have to be able to do that to support synchronous
	 * replication.
	 */

So does this mean that we have no chance to restore T1?
What is worse, if there are valid T1 transaction records with lsn >= 300, then
we can partly decode T1 and send this truncated T1' to the replica.
Am I missing something here?

Is there any alternative way to "seek" the slot to the proper position without
actually fetching data from it or recreating the slot?
Is there any mechanism in the xlog which can enforce consistent decoding of a
transaction (so that no transaction records are missed)?
Maybe I missed something, but I didn't find any "record_number" or anything
else which can identify the first record of a transaction.

Thanks in advance,
Konstantin Knizhnik,
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] synchronous_commit = remote_flush

2016-08-19 Thread Masahiko Sawada
On Fri, Aug 19, 2016 at 5:25 AM, Robert Haas  wrote:
> On Thu, Aug 18, 2016 at 12:22 AM, Thomas Munro
>  wrote:
>> To do something about the confusion I keep seeing about what exactly
>> "on" means, I've often wished we had "remote_flush".  But it's not
>> obvious how the backwards compatibility could work, ie how to keep the
>> people happy who use "local" vs "on" to control syncrep, and also the
>> people who use "off" vs "on" to control asynchronous commit on
>> single-node systems.  Is there any sensible way to do that, or is it
>> not broken and I should pipe down, or is it just far too entrenched
>> and never going to change?
>
> I don't see why we can't add "remote_flush" as a synonym for "on".  Do
> you have something else in mind?
>

+1 for adding "remote_flush" as a synonym for "on".
It doesn't break backward compatibility.
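
For reference, a sketch of what that could look like in guc.c's
synchronous_commit_options table (existing entries assumed to match the
current sources; the only new line is the synonym):

static const struct config_enum_entry synchronous_commit_options[] = {
	{"local", SYNCHRONOUS_COMMIT_LOCAL_FLUSH, false},
	{"remote_write", SYNCHRONOUS_COMMIT_REMOTE_WRITE, false},
	{"remote_apply", SYNCHRONOUS_COMMIT_REMOTE_APPLY, false},
	{"remote_flush", SYNCHRONOUS_COMMIT_ON, false},	/* proposed synonym for "on" */
	{"on", SYNCHRONOUS_COMMIT_ON, false},
	{"off", SYNCHRONOUS_COMMIT_OFF, false},
	{"true", SYNCHRONOUS_COMMIT_ON, true},
	{"false", SYNCHRONOUS_COMMIT_OFF, true},
	{"yes", SYNCHRONOUS_COMMIT_ON, true},
	{"no", SYNCHRONOUS_COMMIT_OFF, true},
	{"1", SYNCHRONOUS_COMMIT_ON, true},
	{"0", SYNCHRONOUS_COMMIT_OFF, true},
	{NULL, 0, false}
};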

Regards,

--
Masahiko Sawada




[HACKERS] patch: function xmltable

2016-08-19 Thread Pavel Stehule
Hi

I am sending an implementation of the xmltable function. The code should be
of near-final quality and it is available for testing.

I would welcome any help with documentation and testing.

Regards

Pavel
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 169a385..a6334b6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10099,6 +10099,47 @@ SELECT xmlroot(xmlparse(document 'abc'),
 

 
+   
+xmltable
+
+   
+xmltable
+   
+
+
+xmltable(xmlnamespaces(namespace uri AS namespace name|DEFAULT namespace uri , ...) rowexpr PASSING BY REF xml BY REF COLUMNS name type PATH columnexpr DEFAULT expr NOT NULL|NULL , ...)
+
+
+
+  The xmltable produces table based on passed XML value.
+
+
+
+
+
+   
+

 xmlagg
 
diff --git a/src/backend/executor/execQual.c b/src/backend/executor/execQual.c
index 743e7d6..77a06da 100644
--- a/src/backend/executor/execQual.c
+++ b/src/backend/executor/execQual.c
@@ -189,6 +189,9 @@ static Datum ExecEvalCurrentOfExpr(ExprState *exprstate, ExprContext *econtext,
 static Datum ExecEvalGroupingFuncExpr(GroupingFuncExprState *gstate,
 		 ExprContext *econtext,
 		 bool *isNull, ExprDoneCond *isDone);
+static Datum ExecEvalTableExpr(TableExprState *tstate,
+		ExprContext *econtext,
+		bool *isnull, ExprDoneCond *isDone);
 
 
 /* 
@@ -4500,6 +4503,203 @@ ExecEvalCurrentOfExpr(ExprState *exprstate, ExprContext *econtext,
 	return 0;	/* keep compiler quiet */
 }
 
+/* 
+ *		ExecEvalTableExpr
+ *
+ * 
+ */
+
+#define XMLTABLE_DEFAULT_NAMESPACE			"pgdefxmlns"
+
+static Datum
+execEvalTableExpr(TableExprState *tstate,
+		ExprContext *econtext,
+		bool *isNull, ExprDoneCond *isDone)
+{
+	TupleDesc tupdesc;
+	HeapTuple		tuple;
+	HeapTupleHeader		result;
+	int	i;
+	Datum		*values;
+	bool		*nulls;
+	Datum	value;
+	bool	isnull;
+	xmltype		   *xmlval;
+	text		   *row_path;
+
+	tupdesc = tstate->tupdesc;
+
+	if (tstate->firstRow)
+	{
+		ListCell	*res;
+
+		/* Evaluate expression first */
+		value = ExecEvalExpr(tstate->expr, econtext, &isnull, NULL);
+		if (isnull)
+		{
+			*isDone = ExprSingleResult;
+			*isNull = true;
+			return (Datum) 0;
+		}
+		xmlval = DatumGetXmlP(value);
+
+		value = ExecEvalExpr(tstate->row_path_expr, econtext, &isnull, NULL);
+		if (isnull)
+ereport(ERROR,
+		(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+	  errmsg("row query must not be null")));
+		row_path = DatumGetTextP(value);
+
+		Assert(tstate->xmltableCxt == NULL);
+		tstate->xmltableCxt = initXmlTableContext(xmlval,
+	tstate->used_dns ?
+		XMLTABLE_DEFAULT_NAMESPACE : NULL,
+	tstate->ncols,
+			  tstate->in_functions,
+			  tstate->typioparams,
+	econtext->ecxt_per_query_memory);
+
+		foreach(res, tstate->namespaces)
+		{
+			ResTarget *rt = (ResTarget *) lfirst(res);
+			char	*ns_uri;
+
+			value = ExecEvalExpr((ExprState *) rt->val, econtext, &isnull, NULL);
+			if (isnull)
+ereport(ERROR,
+		(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+	  errmsg("namespace uri must not be null")));
+			ns_uri = text_to_cstring(DatumGetTextP(value));
+
+			XmlTableSetNs(tstate->xmltableCxt, rt->name, ns_uri);
+		}
+
+		XmlTableSetRowPath(tstate->xmltableCxt, row_path);
+
+		/* Path calculation */
+		for (i = 0; i < tstate->ncols; i++)
+		{
+			char *col_path;
+
+			if (tstate->col_path_expr[i] != NULL)
+			{
+value = ExecEvalExpr(tstate->col_path_expr[i], econtext, &isnull, NULL);
+if (isnull)
+	ereport(ERROR,
+			(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+		  errmsg("column path for column \"%s\" must not be null",
+			  NameStr(tupdesc->attrs[i]->attname))));
+col_path = text_to_cstring(DatumGetTextP(value));
+			}
+			else
+col_path = NameStr(tupdesc->attrs[i]->attname);
+
+			XmlTableSetColumnPath(tstate->xmltableCxt, i,
+			tupdesc->attrs[i]->atttypid, col_path);
+		}
+		tstate->firstRow = false;
+	}
+
+	values = tstate->values;
+	nulls = tstate->nulls;
+
+	if (XmlTableFetchRow(tstate->xmltableCxt))
+	{
+		if (tstate->ncols > 0)
+		{
+			for (i = 0; i < tstate->ncols; i++)
+			{
+if (i != tstate->for_ordinality_col - 1)
+{
+	values[i] = XmlTableGetValue(tstate->xmltableCxt, i,
+  tupdesc->attrs[i]->atttypid,
+  tupdesc->attrs[i]->atttypmod,
+			  &isnull);
+	if (isnull && tstate->def_expr[i] != NULL)
+		values[i] = ExecEvalExpr(tstate->def_expr[i], econtext, &isnull, NULL);
+
+	if (isnull && tstate->not_null[i])
+		ereport(ERROR,
+(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+ errmsg("null is not allowed in column \"%s\"",
+	  NameStr(tupdesc->attrs[i]->attname))));
+	nulls[i] = isnull;
+}
+else
+{
+	values[i] = 

[HACKERS] Exporting more function in libpq

2016-08-19 Thread Tatsuo Ishii
I would like to propose to export these functions in libpq.

pqPutMsgStart
pqPutMsgEnd
pqPutc
pqPuts
pqPutInt
pqPutnchar
pqFlush
pqHandleSendFailure

I think this would be useful to create a tool/library which needs to
handle frontend/backend protocol messages in detail.
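
For example, a tool could then do something like this to push a bare Sync
message onto the wire (a rough sketch only; I am assuming the signatures stay
as they currently are in fe-misc.c):

#include "libpq-fe.h"
#include "libpq-int.h"		/* would no longer be needed once these are exported */

static int
send_sync_message(PGconn *conn)
{
	/* 'S' is the frontend Sync message type byte */
	if (pqPutMsgStart('S', false, conn) != 0)
		return -1;
	if (pqPutMsgEnd(conn) != 0)		/* fills in the message length word */
		return -1;
	return pqFlush(conn);			/* actually send the buffered data */
}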

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Slowness of extended protocol

2016-08-19 Thread Tatsuo Ishii
BTW, there seems to be room to enhance the JDBC driver's performance. In my
understanding it always uses the unnamed portal, even if the SQL is something
like "BEGIN" or "COMMIT" (no parameters). Those are used very often. Why not
do something like this?

Prepare(stmt=S1,query="BEGIN")
Bind(stmt=S1,portal=P1)
Execute(portal=P1)
:
:
Execute(portal=P1)
:
:
Execute(portal=P1)

Instead of:

Prepare(stmt=S1,query="BEGIN")
Bind(stmt=S1,portal=null)
Execute(portal=null)
:
:
Bind(stmt=s1,portal=null)
Execute(portal=null)
:
:
Bind(stmt=s1,portal=null)
Execute(portal=null)

This way, we could save a bunch of Bind messages.

I don't know what other drivers do though.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




[HACKERS] Should we cacheline align PGXACT?

2016-08-19 Thread Alexander Korotkov
Hackers,

Originally this idea was proposed by Andres Freund while experimenting with
lockfree Pin/UnpinBuffer [1].
The patch is attached, as well as results of pgbench -S on a 72-core
machine.  As before, it shows a huge benefit in this case.
For sure, we should validate that it doesn't cause a performance regression
in other cases.  At least we should test read-write workloads and smaller
machines.  Any other ideas?

1.
https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6...@alap3.anarazel.de

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pgxact-align-1.patch
Description: Binary data



Re: [HACKERS] LWLocks in DSM memory

2016-08-19 Thread Mithun Cy
On Fri, Aug 19, 2016 at 4:44 PM, Amit Kapila 
wrote:
>Can you specify the m/c details as Andres wants tests to be conducted on
some high socket m/c?

As I specified in the last line of my mail, it is an 8-socket Intel
machine.

available: 8 nodes (0-7)
node 0 cpus: 0 65 66 67 68 69 70 71 96 97 98 99 100 101 102 103
node 0 size: 65498 MB
node 0 free: 50308 MB
node 1 cpus: 72 73 74 75 76 77 78 79 104 105 106 107 108 109 110 111
node 1 size: 65536 MB
node 1 free: 44594 MB
node 2 cpus: 80 81 82 83 84 85 86 87 112 113 114 115 116 117 118 119
node 2 size: 65536 MB
node 2 free: 61814 MB
node 3 cpus: 88 89 90 91 92 93 94 95 120 121 122 123 124 125 126 127
node 3 size: 65536 MB
node 3 free: 55258 MB
node 4 cpus: 1 2 3 4 5 6 7 8 33 34 35 36 37 38 39 40
node 4 size: 65536 MB
node 4 free: 46705 MB
node 5 cpus: 9 10 11 12 13 14 15 16 41 42 43 44 45 46 47 48
node 5 size: 65536 MB
node 5 free: 40948 MB
node 6 cpus: 17 18 19 20 21 22 23 24 49 50 51 52 53 54 55 56
node 6 size: 65536 MB
node 6 free: 47468 MB
node 7 cpus: 25 26 27 28 29 30 31 32 57 58 59 60 61 62 63 64
node 7 size: 65536 MB
node 7 free: 17285 MB



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Should we cacheline align PGXACT?

2016-08-19 Thread Amit Kapila
On Fri, Aug 19, 2016 at 11:42 AM, Alexander Korotkov
 wrote:
> Hackers,
>
> originally this idea was proposed by Andres Freund while experimenting with
> lockfree Pin/UnpinBuffer [1].
> The patch is attached as well as results of pgbench -S on 72-cores machine.
> As before it shows huge benefit in this case.
> For sure, we should validate that it doesn't cause performance regression in
> other cases.  At least we should test read-write and smaller machines.
> Any other ideas?
>

Maybe test on a Power m/c as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [PATCH] bigint txids vs 'xid' type, new txid_recent(bigint) => xid

2016-08-19 Thread Peter Eisentraut
On 8/18/16 9:20 PM, Craig Ringer wrote:
> On 19 August 2016 at 02:35, Jim Nasby wrote:
> I think we need to either add real types for handling XID/epoch/TXID
> or finally create uint types. It's *way* too easy to screw things up
> the way they are today.
> 
> Hm. Large scope increase there. Especially introducing unsigned types.
> There's a reason that hasn't been done already - it's not just copying a
> whole pile of code, it's also defining all the signed/unsigned
> interactions and conversions carefully.

https://github.com/petere/pguint ;-)

> I'm not against adding a 'bigxid' or 'epoch_xid' or something,
> internally a uint64. It wouldn't need all the opclasses, casts, function
> overloads, etc that uint8 would.

That sounds much better.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Should we cacheline align PGXACT?

2016-08-19 Thread Tom Lane
Alexander Korotkov  writes:
> originally this idea was proposed by Andres Freund while experimenting with
> lockfree Pin/UnpinBuffer [1].
> The patch is attached as well as results of pgbench -S on 72-cores
> machine.  As before it shows huge benefit in this case.

That's one mighty ugly patch.  Can't you do it without needing to
introduce the additional layer of struct nesting?

regards, tom lane




Re: [HACKERS] LSN as a recovery target

2016-08-19 Thread Adrien Nayrat
On 06/09/2016 02:33 PM, Michael Paquier wrote:
> On Wed, May 25, 2016 at 1:32 AM, Michael Paquier
>  wrote:
>> On Tue, May 24, 2016 at 9:29 AM, Alvaro Herrera
>>  wrote:
>>> Christoph Berg wrote:
 Re: Michael Paquier 2016-05-24 
 

Re: [HACKERS] WIP: About CMake v2

2016-08-19 Thread Yury Zhuravlev

Christian Convey wrote:

I'm interested in helping with your CMake effort.  I don't have any
experience contributing to PG, but I do have some free time at the
moment.  Please let me know if I can help.
I am glad to hear it. I suppose you can just try to build Postgres and send
any problems to the GitHub tracker.

https://github.com/stalkerg/postgres_cmake/issues

Big thanks! 


--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] WIP: About CMake v2

2016-08-19 Thread Yury Zhuravlev

Stefan Kaltenbrunner wrote:

On 08/18/2016 09:30 PM, Christian Convey wrote:

Hi Karl,

I'll need to let Yury answer your original question regarding the best
way to report CMake-related bugs.

Regarding the errors you're getting...  I just looked at CMake's
online documentation regarding your "target_compile_definitions" ...


Well - "too old" is a relative term - cmake 2.8.10 was released in only
october 2012 and cmake 2.8.11 in may 2013 so it is not even 4 years old,
the oldest currently supported (though for not much longer) postgresql
release 9.1 was released in september 2011 and 9.2 was also released
before october 2012.
So while Cmake compat might only make it for v10, I dont think that we
can depend on bleeding edge version like that for our buildtools...


1. I don't know, but nobody objects if we install the latest CMake under
Windows or Mac OS X, while many do object if it's under Linux/*BSD. I have
built CMake recently under an old Solaris without any problems.

2. I use a "graceful degradation" method and the current target is version
3.0, BUT from time to time I test and backport the code. I hope I can make
all features work with 2.8.4.

3. Please do not compare CMake and GNU Make. CMake is like Autotools, with
library-detection scripts out of the box. Every new version brings not only
new functions and syntax constructions; most fixes are in the find_library
scripts. Besides, you can't build the project without GNU make, make, nmake,
ninja, etc., because CMake generates files for the other make systems. On
this basis, CMake cannot be held to the same version requirements as GNU
Make. It is quite a different thing.


PS: Specifically for Postgres, in CMake 3.6 I implemented features to replace
the gen_def.pl script under MSVC.
https://gitlab.kitware.com/cmake/cmake/merge_requests/29


--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Exporting more function in libpq

2016-08-19 Thread Tom Lane
Tatsuo Ishii  writes:
> I would like to propose to export these functions in libpq.
> pqPutMsgStart
> pqPutMsgEnd
> pqPutc
> pqPuts
> pqPutInt
> pqPutnchar
> pqFlush
> pqHandleSendFailure

> I think this would be useful to create a tool/library which needs to
> handle frontend/backend protocol messages in detail.

I do not think this is a good idea.  If the purpose of libpq is not
to abstract away the wire-level protocol, then what is its purpose?
And how could such a tool avoid breaking libpq, anyway?  For one
example, successfully sending any command message normally results in
an internal state change in libpq (so that it knows what to do with
the response).  Your proposed API here doesn't cover that.  Nor does
it cover actually dealing with the response, which I think would be
needed in most scenarios where you're trying to deal in custom messages.

If you feel a need to be sending your own messages, I think a locally
hacked fork of libpq is a better answer.

regards, tom lane




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-19 Thread Kevin Grittner
On Thu, Aug 18, 2016 at 5:04 PM, Claudio Freire  wrote:

> But the choice of split point may make a difference for future
> insertions, so I'll look into that.

A database product I worked on a long time ago had an interesting
optimization for indexes that had multiple insertion points, each
of which was increasing.  (Picture, for example 100 case types,
with case numbers inserted in sequence within each case type -- so
the first three felony cases would be CF01, CF02, and
CF03; while the first three small claims cases would be
SC01, SC02, and SC03.)  That's not a terribly rare
usage pattern, in my experience.  An index on such data is most
efficient if you split at the point of the index tuple being
inserted -- making it the last tuple on the left-hand page.  I
don't know whether we might want to consider making that an option
somehow -- it just came to mind because of this discussion.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: About CMake v2

2016-08-19 Thread Yury Zhuravlev

Stefan Kaltenbrunner wrote:

hmm how do you actually want reports on how it works?

I just played with it on spoonbill (OpenBSD 5.3/sparc64) and got this:

CMake Error at CMakeLists.txt:1250 (file):
  file does not recognize sub-command GENERATE


CMake Error at src/port/CMakeLists.txt:156 (target_compile_definitions):
  Unknown CMake command "target_compile_definitions".


-- Configuring incomplete, errors occurred!


there is also a ton of stuff like:


CMake Error: Internal CMake error, TryCompile generation of cmake failed
-- Looking for opterr - not found
-- Looking for optreset
CMake Error at CMakeLists.txt:10 (ADD_EXECUTABLE):
  Target "cmTryCompileExec3458204847" links to item " m" which has
leading or
  trailing whitespace.  This is now an error according to policy CMP0004.


CMake Error: Internal CMake error, TryCompile generation of cmake failed
-- Looking for optreset - not found
-- Looking for fseeko
CMake Error at CMakeLists.txt:10 (ADD_EXECUTABLE):
  Target "cmTryCompileExec2628321539" links to item " m" which has
leading or
  trailing whitespace.  This is now an error according to policy CMP0004.


CMake Error: Internal CMake error, TryCompile generation of cmake failed


which I have no idea whether they are an actual problem or just
"configure" noise


Can you file an issue on GitHub?
https://github.com/stalkerg/postgres_cmake/issues


--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Why we lost Uber as a user

2016-08-19 Thread Merlin Moncure
On Wed, Aug 17, 2016 at 5:18 PM, Jim Nasby  wrote:
> On 8/17/16 2:51 PM, Simon Riggs wrote:
>> On 17 August 2016 at 12:19, Greg Stark  wrote:
>>> Yes, this is exactly what it should be doing and exactly why it's
>>> useful. Physical replication accurately replicates the data from the
>>> master including "corruption" whereas a logical replication system
>>> will not, causing divergence and possible issues during a failover.
>>
>>
>> Yay! Completely agree.
>>
>> Physical replication, as used by DRBD and all other block-level HA
>> solutions, and also used by other databases, such as Oracle.
>>
>> Corruption on the master would often cause errors that would prevent
>> writes and therefore those changes wouldn't even be made, let alone be
>> replicated.
>
>
> My experience has been that you discover corruption after it's already
> safely on disk, and more than once I've been able to recover by using data
> on a londiste replica.
>
> As I said originally, it's critical to understand the different solutions
> and the pros and cons of each. There is no magic bullet.

Data point: in the half or so cases I've experienced corruption on
replicated systems, in all cases but one the standby was clean.  The
'unclean' case was actually an 8.2 warm standby; the source of the corruption
was a very significant bug where prepared statements would write back
corrupted data if the table definitions changed under the statement
(fixed in 8.3).  In that particular case the corruption was very
unfortunately quite widespread and passed directly along to the
standby server.  This bug nearly cost us a user as well, although not
nearly so famous as Uber :-).

In the few modern cases I've seen I've not been able to trace it back
to any bug in postgres (in particular multixact was ruled out) and
I've chalked it up to media or (more likely I think) filesystem
problems in the face of a -9 reset.

merlin




Re: [HACKERS] WIP: About CMake v2

2016-08-19 Thread Yury Zhuravlev

Andres Freund wrote:

The benefit cmake brings to the table, from my pov, is that it allows to
get rid of somewhat a parallel buildsystem (msvc / windows, which
sources most of its information from the makefiles). If we continue to
have two, especially if they're entirely separate, I see little benefit
in this whole endeavor.


I fully agree with this, although I still allow for the possibility that for
a while (a couple of months) CMake will coexist with autoconf, solely for
testing and verification.


--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] patch proposal

2016-08-19 Thread Venkata B Nagothi
On Thu, Aug 18, 2016 at 7:20 PM, Stephen Frost  wrote:

> * Venkata B Nagothi (nag1...@gmail.com) wrote:
> > On Wed, Aug 17, 2016 at 11:27 PM, Stephen Frost 
> wrote:
> > > * Venkata B Nagothi (nag1...@gmail.com) wrote:
> > > > Agreed. Additional option like "pause" would. As long as there is an
> > > option
> > > > to ensure following happens if the recovery target is not reached -
> > > >
> > > >  a) PG pauses the recovery at the end of the WAL
> > > >  b) Generates a warning in the log file saying that recovery target
> point
> > > > is not reached (there is a patch being worked upon on by Thom on
> this)
> > > >  c) Does not open-up the database exiting from the recovery process
> by
> > > > giving room to resume the replay of WALs
> > >
> > > One thing to consider is just how different this is from simply
> bringing
> > > PG up as a warm standby instead, with the warning added to indicate if
> > > the recovery point wasn't reached.
> >
> > I am referring to a specific scenario (performing point-in time recovery)
> > where-in a DBA attempts to bring up a standalone PG instance by restoring
> > the backup and performing recovery to a particular recover target (XID,
> > time or a named restore point) in the past by replaying all the available
> > WAL archives, which is not quite possible through a warm-standby setup.
> >
> > Warm standby is more of a high-availability solution and i do not think
> so,
> > it addresses PITR kind of situation.
>
> No, a warm standby is one which just plays through the WAL but doesn't
> bring the database up or start its own set of WAL, which is what you're
> asking about.
>

I understand that in a warm standby, PG continuously replays WAL without
bringing up the database, since the database stays in standby mode, which
sounds ideal.

While recovering the database to a particular recovery target point (and
there is no requirement to set up a standby), there is a risk that the
database will get promoted due to missing or corrupt WAL. That means there is
no way to avoid the promotion and resume WAL replay. An option like "pause",
with a new parameter, would be ideal to prevent database promotion at the
default recovery target, which is "immediate" or "end-of-WAL".

Regards,

Venkata B N
Fujitsu Australia


Re: [HACKERS] Re: PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-08-19 Thread Aleksander Alekseev
Heikki, Peter, thanks a lot for code review!

> What's going on here? Surely pg_atomic_init_u64() should initialize
> the value?

It's because of how pg_atomic_exchange_u64_impl is implemented:

```
while (true)
{   
old = ptr->value; /* <-- reading of uninitialized value! */
if (pg_atomic_compare_exchange_u64_impl(ptr, &old, xchg_))
break;
}
```

Currently pg_atomic_init_u64 works like this:

pg_atomic_init_u64
`- pg_atomic_init_u64_impl
   `- pg_atomic_write_u64_impl
  `- pg_atomic_exchange_u64_impl

I suspect there is actually no need to do an atomic exchange during the
initialization of an atomic variable. A regular `mov` should be enough
(IIRC there is no need to do a `lock mov` since `mov` is already atomic).
Anyway, I don't feel brave enough right now to mess with the atomic
operations since that involves all sorts of portability issues. So I
removed this change for now.
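
To illustrate what I mean, here is a minimal standalone sketch (made-up
names, not the actual src/include/port/atomics code) where init is a plain
store, so nothing ever reads the still-uninitialized value:

#include <stdint.h>

typedef struct
{
	volatile uint64_t value;
} my_atomic_u64;

static inline void
my_atomic_init_u64(my_atomic_u64 *ptr, uint64_t val)
{
	/* plain store: ptr->value is never read before it is written */
	ptr->value = val;
}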

> Why does MemorySanitizer complain about that? Calling stat(2) is 
> supposed to fill in all the fields we look at, right?

> In addition to what Heikki wrote, I think the above is not necessary.

It has been my observation that MemorySanitizer often complains about passing
structures with uninitialized padding bytes to system calls. I'm not
sure whether libc really reads/copies the structures in these cases or
whether MemorySanitizer just doesn't like the idea in general.

Although it's true that MemorySanitizer is maybe too pedantic in these
cases, in some respects it's right. Since the operating system is a black box
from our perspective (not to mention that there are _many_ OSes that
change all the time), in the general case passing a structure with
uninitialized padding bytes to a system call can in theory cause
non-repeatable behavior, for instance if caching or hash
calculation is involved.

Also, as Chapman noted previously [1], according to the PostgreSQL
documentation, using structures with uninitialized padding bytes should
be avoided anyway.

I strongly believe that the benefit of passing all MemorySanitizer checks
(the possibility of discovering many complicated bugs automatically) far
outweighs the drawback of a tiny memset overhead in these specific cases.
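
For illustration, this is the kind of pattern I mean (a standalone sketch
with made-up names, mirroring the memset(&xlrec, 0, sizeof(xlrec)) calls in
the attached patch):

#include <string.h>

/* A struct like this has padding bytes between "flag" and "counter". */
typedef struct
{
	char	flag;
	long	counter;
} example_rec;

static void
fill_example(example_rec *rec)
{
	/*
	 * Zero the whole struct first so the padding bytes are defined too;
	 * assigning the fields one by one would leave the padding
	 * uninitialized.
	 */
	memset(rec, 0, sizeof(*rec));
	rec->flag = 1;
	rec->counter = 42;
}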

> I think this goes too far. You're zeroing all palloc'd memory, even
> if  it's going to be passed to palloc0(), and zeroed there. It might
> even  silence legitimate warnings, if there's code somewhere that
> does  palloc(), and accesses some of it before initializing. Plus
> it's a performance hit.

Completely agree, my bad. I removed these changes.

> Right after this, we have:
> VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
> Do we need both?

Apparently we don't. I removed the VALGRIND_MAKE_MEM_DEFINED here since now
there are no uninitialized padding bytes in &msg.


A corrected patch is attached. If you have any other ideas for how it could
be improved, please let me know!

[1]
https://www.postgresql.org/message-id/56EFF347.20500%40anastigmatix.net

-- 
Best regards,
Aleksander Alekseev
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 01c7ef7..fb5c026 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -337,6 +337,7 @@ gistXLogSplit(bool page_is_leaf,
 	for (ptr = dist; ptr; ptr = ptr->next)
 		npage++;
 
+	memset(, 0, sizeof(xlrec));
 	xlrec.origrlink = origrlink;
 	xlrec.orignsn = orignsn;
 	xlrec.origleaf = page_is_leaf;
diff --git a/src/backend/access/spgist/spgdoinsert.c b/src/backend/access/spgist/spgdoinsert.c
index f090ca5..1275b6c 100644
--- a/src/backend/access/spgist/spgdoinsert.c
+++ b/src/backend/access/spgist/spgdoinsert.c
@@ -204,6 +204,7 @@ addLeafTuple(Relation index, SpGistState *state, SpGistLeafTuple leafTuple,
 {
 	spgxlogAddLeaf xlrec;
 
+	memset(, 0, sizeof(xlrec));
 	xlrec.newPage = isNew;
 	xlrec.storesNulls = isNulls;
 
@@ -448,6 +449,8 @@ moveLeafs(Relation index, SpGistState *state,
 		i = it->nextOffset;
 	}
 
+	memset(, 0, sizeof(xlrec));
+
 	/* Find a leaf page that will hold them */
 	nbuf = SpGistGetBuffer(index, GBUF_LEAF | (isNulls ? GBUF_NULLS : 0),
 		   size, );
@@ -723,6 +726,7 @@ doPickSplit(Relation index, SpGistState *state,
 	newLeafs = (SpGistLeafTuple *) palloc(sizeof(SpGistLeafTuple) * n);
 	leafPageSelect = (uint8 *) palloc(sizeof(uint8) * n);
 
+	memset(, 0, sizeof(xlrec));
 	STORE_STATE(state, xlrec.stateSrc);
 
 	/*
@@ -1501,6 +1505,7 @@ spgAddNodeAction(Relation index, SpGistState *state,
 	newInnerTuple = addNode(state, innerTuple, nodeLabel, nodeN);
 
 	/* Prepare WAL record */
+	memset(, 0, sizeof(xlrec));
 	STORE_STATE(state, xlrec.stateSrc);
 	xlrec.offnum = current->offnum;
 
@@ -1741,6 +1746,7 @@ spgSplitNodeAction(Relation index, SpGistState *state,
 	postfixTuple->allTheSame = innerTuple->allTheSame;
 
 	/* prep data for WAL record */
+	memset(, 0, sizeof(xlrec));
 	xlrec.newPage = false;
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f13f9c1..360e8d2 100644
--- 

Re: [HACKERS] LWLocks in DSM memory

2016-08-19 Thread Amit Kapila
On Fri, Aug 19, 2016 at 4:29 PM, Mithun Cy 
wrote:

> On Wed, Aug 17, 2016 at 9:12 PM, Andres Freund  wrote:
>
>> >Yes. I want a long wait list, modified in bulk - which should be the
>> >case with the above.
>>
>
> I ran some pgbench. And, I do not see much difference in performance,
> small variance in perf can be attributed to variance in probability of
> drawing the particular built-in script.
>
>
Can you specify the m/c details as Andres wants tests to be conducted on
some high socket m/c?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] Push down more full joins in postgres_fdw

2016-08-19 Thread Etsuro Fujita

Hi,

The postgres_fdw join pushdown in 9.6 is great, but it can't handle full
joins on relations with restrictions.  The reason for that is that
postgres_fdw can't deparse subqueries when creating a remote join query.
So, by adding that deparsing logic, I removed the limitation.  Attached is a
patch for that, which is based on the patch I posted before [1].


Also, by the deparsing logic, I improved the handling of PHVs so that  
PHVs are evaluated remotely if it's safe, as discussed in [2].  Here is  
an example from the regression test, which pushes down multiple levels  
of PHVs to the remote:


EXPLAIN (VERBOSE, COSTS OFF)
SELECT ss.*, ft2.c1 FROM ft2 LEFT JOIN (SELECT 13, q.a, ft2.c1 FROM  
(SELECT 13 FROM ft1 WHERE c1 = 13) q(a) RIGHT JOIN ft2 ON (q.a = ft2.c1)  
WHERE ft2.\
c1 BETWEEN 10 AND 15) ss(f1, f2, f3) ON (ft2.c1 = ss.f1) WHERE ft2.c1  
BETWEEN 10 AND 15;

  \
   QUERY PLAN \

---\
---\
---
  Foreign Scan
Output: (13), (13), ft2_1.c1, ft2.c1
Relations: (public.ft2) LEFT JOIN ((public.ft2) LEFT JOIN (public.ft1))
Remote SQL: SELECT r1."C 1", ss2.c1, ss2.c2, ss2.c3 FROM ("S 1"."T  
1" r1 LEFT JOIN (SELECT r5."C 1", 13, ss1.c1 FROM ("S 1"."T 1" r5 LEFT  
JOIN (SELE\
CT 13 FROM "S 1"."T 1" WHERE (("C 1" = 13))) ss1(c1) ON (((13 = r5."C  
1" WHERE ((r5."C 1" >= 10)) AND ((r5."C 1" <= 15))) ss2(c1, c2, c3)  
ON (((r1.\

"C 1" = 13 WHERE ((r1."C 1" >= 10)) AND ((r1."C 1" <= 15))
(4 rows)

Also, in the same way as the PHV handling, I improved the handling of
whole-row references (and system columns other than ctid), as proposed in
[3].  We don't need the "CASE WHEN ... IS NOT NULL THEN ROW(...) END"
conditions any more, as shown in the example below:


EXPLAIN (VERBOSE, COSTS OFF)
SELECT t1.ctid, t1, t2, t1.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1)  
ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10;

  \
 QUERY PLAN \

---\
---\
---
  Limit
Output: t1.ctid, t1.*, t2.*, t1.c1, t1.c3
->  Foreign Scan
  Output: t1.ctid, t1.*, t2.*, t1.c1, t1.c3
  Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
  Remote SQL: SELECT ss1.c1, ss1.c2, ss1.c3, ss1.c4, ss2.c1  
FROM ((SELECT ctid, ROW("C 1", c2, c3, c4, c5, c6, c7, c8), "C 1", c3  
FROM "S 1"."T \
1") ss1(c1, c2, c3, c4) INNER JOIN (SELECT ROW("C 1", c2, c3, c4, c5,  
c6, c7, c8), "C 1" FROM "S 1"."T 1") ss2(c1, c2) ON (TRUE)) WHERE  
((ss1.c3 = ss2.\

c2)) ORDER BY ss1.c4 ASC NULLS LAST, ss1.c3 ASC NULLS LAST
(6 rows)

I'll add this to the next CF.  Comments are welcome!

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/5710D7E2.7010302%40lab.ntt.co.jp
[2]  
https://www.postgresql.org/message-id/b4549406-909f-7d15-dc34-499835a8f0b3%40lab.ntt.co.jp
[3]  
https://www.postgresql.org/message-id/a1fa1c4c-bf96-8ea5-cff5-85b927298e73%40lab.ntt.co.jp


postgres-fdw-more-full-join-pushdown.patch
Description: binary/octet-stream



Re: [HACKERS] LWLocks in DSM memory

2016-08-19 Thread Mithun Cy
On Wed, Aug 17, 2016 at 9:12 PM, Andres Freund  wrote:

> >Yes. I want a long wait list, modified in bulk - which should be the
> >case with the above.
>

I ran some pgbench tests, and I do not see much difference in performance;
the small variance can be attributed to variance in the probability of
drawing a particular built-in script.

Server configuration:
./postgres -c shared_buffers=8GB -N 200 -c
min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

pgbench configuration: initialized with scale_factor 300
./pgbench -c $threads -j $threads -T 1800 -M prepared -b simple-update@1
-b  select-only@20 postgres


Simple-update : select-only = 5:95

clients                         8               64              128
Current Code         38279.663784    258196.067342    283150.495921
After patch revert    37316.09022    268285.506338    280077.913954
% diff              -2.5171944284     3.9076656356    -1.0851409449


Simple-update : select-only = 20:80

clients                         8               64              128
Current Code         23169.021469    134791.996882    154057.101004
After patch revert    23892.91539    135091.645551     150402.80543
% diff               3.1244043775     0.2223044958    -2.3720396854

And this was done on our 8-socket Intel machine.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WIP: About CMake v2

2016-08-19 Thread Yury Zhuravlev

Stefan Kaltenbrunner wrote:

well we have for example a NetBSD 5.1 boxe (coypu) on the buildfarm that
have a software stack that is basically 2008/2009ish...
So 2.8.0-2.8.3 seems like a realistic target to me still



You can install a fresh CMake on NetBSD from source without big problems.


--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Should we cacheline align PGXACT?

2016-08-19 Thread Alexander Korotkov
On Fri, Aug 19, 2016 at 4:46 PM, Tom Lane  wrote:

> Alexander Korotkov  writes:
> > originally this idea was proposed by Andres Freund while experimenting
> with
> > lockfree Pin/UnpinBuffer [1].
> > The patch is attached as well as results of pgbench -S on 72-cores
> > machine.  As before it shows huge benefit in this case.
>
> That's one mighty ugly patch.  Can't you do it without needing to
> introduce the additional layer of struct nesting?
>

That's worrying me too.
We could use an anonymous struct, but that seems to be prohibited in C89,
which we stick to.
Another idea that comes to mind is to manually calculate the size of the
padding and insert it directly into the PGXACT struct.  But that seems rather
ugly too.  However, it would be an ugly definition, not ugly usage...
Do you have better ideas?
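
Just to illustrate the second idea, a rough standalone sketch (made-up field
sizes and names, not the real PGXACT definition), assuming a 64-byte cache
line:

#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64

typedef struct MyXact
{
	uint32_t	xid;
	uint32_t	xmin;
	uint8_t		vacuumFlags;
	bool		overflowed;
	bool		delayChkpt;
	uint8_t		nxids;
	/* hand-computed from the 12 bytes above; fragile if fields change */
	char		pad[CACHE_LINE_SIZE - 12];
} MyXact;

/* compile-time check to keep the hand-computed padding honest */
typedef char my_xact_size_check[(sizeof(MyXact) == CACHE_LINE_SIZE) ? 1 : -1];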

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Should we cacheline align PGXACT?

2016-08-19 Thread Alexander Korotkov
On Fri, Aug 19, 2016 at 3:19 PM, Amit Kapila 
wrote:

> On Fri, Aug 19, 2016 at 11:42 AM, Alexander Korotkov
>  wrote:
> > Hackers,
> >
> > originally this idea was proposed by Andres Freund while experimenting
> with
> > lockfree Pin/UnpinBuffer [1].
> > The patch is attached as well as results of pgbench -S on 72-cores
> machine.
> > As before it shows huge benefit in this case.
> > For sure, we should validate that it doesn't cause performance
> regression in
> > other cases.  At least we should test read-write and smaller machines.
> > Any other ideas?
> >
>
> may be test on Power m/c as well.
>

Good idea.  I don't have any such machine at hand now.  Do you have one?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] Make better use of existing enums in plpgsql

2016-08-19 Thread Peter Eisentraut
plpgsql.h defines a number of enums, but most of the code passes them
around as ints.  The attached patch updates structs and function
prototypes to take enum types instead.  This clarifies the struct
definitions in plpgsql.h in particular.

I didn't deal with the PLPGSQL_RC_* symbols, since they are only used in
pl_exec.c (could be moved there?), and it would bloat this patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 59f9df74fdf1da8ef5f426fabf0815d4e4423f12 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 18 Aug 2016 12:00:00 -0400
Subject: [PATCH] Make better use of existing enums in plpgsql

plpgsql.h defines a number of enums, but most of the code passes them
around as ints.  Update structs and function prototypes to take enum
types instead.  This clarifies the struct definitions in plpgsql.h in
particular.
---
 src/pl/plpgsql/src/pl_comp.c  |   6 +--
 src/pl/plpgsql/src/pl_exec.c  |   2 +-
 src/pl/plpgsql/src/pl_funcs.c |  12 ++---
 src/pl/plpgsql/src/plpgsql.h  | 114 +-
 4 files changed, 67 insertions(+), 67 deletions(-)

diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index b628c28..ef3f264 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -93,7 +93,7 @@ static PLpgSQL_function *do_compile(FunctionCallInfo fcinfo,
 		   PLpgSQL_func_hashkey *hashkey,
 		   bool forValidator);
 static void plpgsql_compile_error_callback(void *arg);
-static void add_parameter_name(int itemtype, int itemno, const char *name);
+static void add_parameter_name(PLpgSQL_nsitem_type itemtype, int itemno, const char *name);
 static void add_dummy_return(PLpgSQL_function *function);
 static Node *plpgsql_pre_column_ref(ParseState *pstate, ColumnRef *cref);
 static Node *plpgsql_post_column_ref(ParseState *pstate, ColumnRef *cref, Node *var);
@@ -412,7 +412,7 @@ do_compile(FunctionCallInfo fcinfo,
 char		argmode = argmodes ? argmodes[i] : PROARGMODE_IN;
 PLpgSQL_type *argdtype;
 PLpgSQL_variable *argvariable;
-int			argitemtype;
+PLpgSQL_nsitem_type argitemtype;
 
 /* Create $n name for variable */
 snprintf(buf, sizeof(buf), "$%d", i + 1);
@@ -950,7 +950,7 @@ plpgsql_compile_error_callback(void *arg)
  * Add a name for a function parameter to the function's namespace
  */
 static void
-add_parameter_name(int itemtype, int itemno, const char *name)
+add_parameter_name(PLpgSQL_nsitem_type itemtype, int itemno, const char *name)
 {
 	/*
 	 * Before adding the name, check for duplicates.  We need this even though
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index f9b3b22..2a0617d 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -1559,7 +1559,7 @@ exec_stmt(PLpgSQL_execstate *estate, PLpgSQL_stmt *stmt)
 
 	CHECK_FOR_INTERRUPTS();
 
-	switch ((enum PLpgSQL_stmt_types) stmt->cmd_type)
+	switch (stmt->cmd_type)
 	{
 		case PLPGSQL_STMT_BLOCK:
 			rc = exec_stmt_block(estate, (PLpgSQL_stmt_block *) stmt);
diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c
index 27ebebc..e3cd9c0 100644
--- a/src/pl/plpgsql/src/pl_funcs.c
+++ b/src/pl/plpgsql/src/pl_funcs.c
@@ -51,7 +51,7 @@ plpgsql_ns_init(void)
  * --
  */
 void
-plpgsql_ns_push(const char *label, enum PLpgSQL_label_types label_type)
+plpgsql_ns_push(const char *label, PLpgSQL_label_type label_type)
 {
 	if (label == NULL)
 		label = "";
@@ -89,7 +89,7 @@ plpgsql_ns_top(void)
  * --
  */
 void
-plpgsql_ns_additem(int itemtype, int itemno, const char *name)
+plpgsql_ns_additem(PLpgSQL_nsitem_type itemtype, int itemno, const char *name)
 {
 	PLpgSQL_nsitem *nse;
 
@@ -231,7 +231,7 @@ plpgsql_ns_find_nearest_loop(PLpgSQL_nsitem *ns_cur)
 const char *
 plpgsql_stmt_typename(PLpgSQL_stmt *stmt)
 {
-	switch ((enum PLpgSQL_stmt_types) stmt->cmd_type)
+	switch (stmt->cmd_type)
 	{
 		case PLPGSQL_STMT_BLOCK:
 			return _("statement block");
@@ -291,7 +291,7 @@ plpgsql_stmt_typename(PLpgSQL_stmt *stmt)
  * GET DIAGNOSTICS item name as a string, for use in error messages etc.
  */
 const char *
-plpgsql_getdiag_kindname(int kind)
+plpgsql_getdiag_kindname(PLpgSQL_getdiag_kind kind)
 {
 	switch (kind)
 	{
@@ -367,7 +367,7 @@ static void free_expr(PLpgSQL_expr *expr);
 static void
 free_stmt(PLpgSQL_stmt *stmt)
 {
-	switch ((enum PLpgSQL_stmt_types) stmt->cmd_type)
+	switch (stmt->cmd_type)
 	{
 		case PLPGSQL_STMT_BLOCK:
 			free_block((PLpgSQL_stmt_block *) stmt);
@@ -791,7 +791,7 @@ static void
 dump_stmt(PLpgSQL_stmt *stmt)
 {
 	printf("%3d:", stmt->lineno);
-	switch ((enum PLpgSQL_stmt_types) stmt->cmd_type)
+	switch (stmt->cmd_type)
 	{
 		case PLPGSQL_STMT_BLOCK:
 			dump_block((PLpgSQL_stmt_block *) stmt);
diff --git a/src/pl/plpgsql/src/plpgsql.h b/src/pl/plpgsql/src/plpgsql.h
index b416e50..c84a97b 100644
--- a/src/pl/plpgsql/src/plpgsql.h
+++ 

Re: [HACKERS] [PATCH] add option to pg_dumpall to exclude tables from the dump

2016-08-19 Thread Jim Nasby

On 8/18/16 5:01 PM, Tom Lane wrote:

I agree, but I think mandating a database name (which I suppose could be
> *) with the specifiers would solve that issue.

Hmm, something like "-T dbname1:pattern1 -T dbname2:pattern2" ?


Bingo. Hopefully there'd be some way to consolidate the code between the 
two as well...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] anyelement -> anyrange

2016-08-19 Thread Jim Nasby

On 8/18/16 6:02 PM, Corey Huinker wrote:

I'd be happy to roll your code into the extension, and make it marked
more stable.


Yeah, I've been meaning to look at submitting a pull request; hopefully 
will get to it today.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Curing plpgsql's memory leaks for statement-lifespan values

2016-08-19 Thread Tom Lane
Jim Nasby  writes:
> On 7/25/16 1:50 PM, Tom Lane wrote:
>> There's a glibc-dependent hack in aset.c that reports any
>> plpgsql-driven palloc or pfree against a context named "SPI Proc", as
>> well as changes in pl_comp.c so that transient junk created during initial
>> parsing of a plpgsql function body doesn't end up in the SPI Proc context.
>> (I did that just to cut the amount of noise I had to chase down.  I suppose
>> in large functions it might be worth adopting such a change so that that
>> junk can be released immediately after parsing; but I suspect for most
>> functions it'd just be an extra context without much gain.)

> Some folks do create very large plpgsql functions, so if there's a handy 
> way to estimate the size of the function (pg_proc.prosrc's varlena size 
> perhaps) then it might be worth doing that for quite large functions.

I poked at that a bit and realized that during a normal plpgsql function
call, there isn't anything in the SPI Proc context when do_compile is
entered, which means that we could flush all the transient cruft by just
resetting the context afterwards.  The sanest place to put that seems to
be in plpgsql_call_handler, as per attached.  We could put it in
plpgsql_compile so it's not hit in the main line code path, but that would
require plpgsql_compile to know that it's called in an empty context,
which seems a bit fragile (and indeed probably false in the forValidator
case).

There isn't any equivalently easy way to clean up in the case of a DO
block, but I'm not sure that we care as much in that case.

I'm not sure that this is really a net win.  MemoryContextReset is pretty
cheap (at least in non-cassert builds) but it's not zero cost, and in
most scenarios we're not going to care that much about the extra space.
It appeared to me after collecting some stats about the functions present
in the regression tests that for larger functions, the extra space eaten
is just some small multiple (like 2x-3x) of the function body string
length.  So it's not *that* much data, even for big functions, and it's
only going to be eaten on the first usage of a function within a session.

regards, tom lane

diff --git a/src/pl/plpgsql/src/pl_handler.c b/src/pl/plpgsql/src/pl_handler.c
index 36868fb..2a36913 100644
*** a/src/pl/plpgsql/src/pl_handler.c
--- b/src/pl/plpgsql/src/pl_handler.c
***
*** 23,28 
--- 23,29 
  #include "utils/builtins.h"
  #include "utils/guc.h"
  #include "utils/lsyscache.h"
+ #include "utils/memutils.h"
  #include "utils/syscache.h"
  
  
*** plpgsql_call_handler(PG_FUNCTION_ARGS)
*** 230,235 
--- 231,239 
  	/* Find or compile the function */
  	func = plpgsql_compile(fcinfo, false);
  
+ 	/* Flush any transient junk allocated during compile */
+ 	MemoryContextReset(CurrentMemoryContext);
+ 
  	/* Must save and restore prior value of cur_estate */
  	save_cur_estate = func->cur_estate;
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Should we cacheline align PGXACT?

2016-08-19 Thread Tom Lane
Alexander Korotkov  writes:
> On Fri, Aug 19, 2016 at 4:46 PM, Tom Lane  wrote:
>> That's one mighty ugly patch.  Can't you do it without needing to
>> introduce the additional layer of struct nesting?

> That's worrying me too.
> We could use anonymous struct, but it seems to be prohibited in C89 which
> we stick to.
> Another idea, which comes to my mind, is to manually calculate size of
> padding and insert it directly to PGXACT struct.  But that seems rather
> ugly too.  However, it would be ugly definition not ugly usage...
> Do you have better ideas?

No, that was the best one that had occurred to me, too.  You could
probably introduce a StaticAssert that sizeof(PGXACT) is a power of 2
as a means of checking that the manual padding calculation hadn't
gotten broken.
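
For what it's worth, a minimal sketch of the manual-padding-plus-assert idea;
the member list and the 16-byte target below are illustrative assumptions,
not the actual PGXACT definition or an agreed-on padding:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Sketch only: members and target size are placeholders. */
typedef struct PGXACT
{
	TransactionId xid;			/* 4 bytes */
	TransactionId xmin;			/* 4 bytes */
	uint8_t		vacuumFlags;
	bool		overflowed;
	bool		delayChkpt;
	uint8_t		nxids;
	char		pad[4];			/* manual padding up to 16 bytes */
} PGXACT;

/*
 * In core this would use the StaticAssert macros from c.h; plain C11 is
 * shown here.  (x & (x - 1)) == 0 is the usual power-of-2 test, so any
 * member change that breaks the manual padding fails at compile time.
 */
_Static_assert((sizeof(PGXACT) & (sizeof(PGXACT) - 1)) == 0,
			   "sizeof(PGXACT) must be a power of 2");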

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-08-19 Thread Andres Freund


On August 19, 2016 2:50:30 AM PDT, Aleksander Alekseev 
 wrote:
>Heikki, Peter, thanks a lot for code review!
>
>> What's going on here? Surely pg_atomic_init_u64() should initialize
>> the value?
>
>It's because of how pg_atomic_exchange_u64_impl is implemented:
>
>```
>while (true)
>{   
>old = ptr->value; /* <-- reading of uninitialized value! */
>if (pg_atomic_compare_exchange_u64_impl(ptr, &old, xchg_))
>break;
>}
>```
>
>Currently pg_atomic_init_u64 works like this:
>
>pg_atomic_init_u64
>`- pg_atomic_init_u64_impl
>   `- pg_atomic_write_u64_impl
>  `- pg_atomic_exchange_u64_impl
>
>I suspect there is actually no need to make an atomic exchange during
>initialization of an atomic variable. Regular `mov` should be enough
>(IIRC there is no need to do `lock mov` since `mov` is already atomic).
>Anyway I don't feel brave enough right now to mess with atomic
>operations since it involves all sort of portability issues. So I
>removed this change for now.

There's platforms with atomic 8 byte compare exchange, without atomic 8 byte 
regular stores.
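
As a side note, a sketch of what a dedicated init path could look like if one
wanted to avoid the read-before-init (this assumes the generic fallback's
pg_atomic_uint64 exposes a plain "value" member, and is not a committed
change):

static inline void
pg_atomic_init_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val_)
{
	/*
	 * At init time no other process can be touching the variable yet, so a
	 * plain store never reads the uninitialized value.  The caveat about
	 * platforms lacking atomic 8-byte plain stores applies to stores made
	 * after initialization, when readers may run concurrently.
	 */
	ptr->value = val_;
}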
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical decoding restart problems

2016-08-19 Thread Petr Jelinek

On 19/08/16 09:34, konstantin knizhnik wrote:


We are using logical decoding in multimaster and we are faced with the
problem that inconsistent transactions are sent to replica.
Briefly, multimaster is using logical decoding in this way:
1. Each multimaster node is connected with each other using logical
decoding channel and so each pair of nodes
has its own replication slot.
2. In normal scenario each replication channel is used to replicate only
those transactions which were originated at the source node.
We are using origin mechanism to skip "foreign" transactions.
When offline cluster node is returned back to the multimaster we need
to recover this node to the current cluster state.
Recovery is performed from one of the cluster's node. So we are using
only one replication channel to receive all (self and foreign) transactions.
Only in this case we can guarantee consistent order of applying
transactions at recovered node.
After the end of recovery we need to recreate replication slots with all
other cluster nodes (because we have already replied transactions from
this nodes).
To restart logical decoding we first drop existed slot, then create new
one and then start logical replication from the WAL position 0/0
(invalid LSN).
In this case recovery should be started from the last consistent point.



I don't think this will work correctly, there will be gap between when 
the new slot starts to decode and the drop of the old one as the new 
slot first needs to make snapshot.


Do I understand correctly that you are not using replication origins?


The problem is that for some reasons consistent point is not so
consistent and we get partly decoded transactions.
I.e. transaction body consists of two UPDATE but reorder_buffer extracts
only the one (last) update and sent this truncated transaction to
destination causing consistency violation at replica.  I started
investigation of logical decoding code and found several things which I
do not understand.


Never seen this happen. Do you have more details about what exactly is 
happening?




Assume that we have transactions T1={start_lsn=100, end_lsn=400} and
T2={start_lsn=200, end_lsn=300}.
Transaction T2 is sent to the replica and replica confirms that
flush_lsn=300.
If now we want to restart logical decoding, we can not start with
position less than 300, because CreateDecodingContext doesn't allow it:

 * start_lsn
 *The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
 *from the slot's confirmed_flush; otherwise, start from the specified
 *location (but move it forwards to confirmed_flush if it's older than
 *that, see below).
 *
else if (start_lsn < slot->data.confirmed_flush)
{
/*
* It might seem like we should error out in this case, but it's
* pretty common for a client to acknowledge a LSN it doesn't have to
* do anything for, and thus didn't store persistently, because the
* xlog records didn't result in anything relevant for logical
* decoding. Clients have to be able to do that to support synchronous
* replication.
*/

So it means that we have no chance to restore T1?
What is worse, if there are valid T2 transaction records with lsn >=
300, then we can partly decode T1 and send this T1' to the replica.
I missed something here?


The decoding starts from restart_lsn of the slot, start_lsn is used for 
skipping the transactions.



Are there any alternative way to "seek" slot to the proper position
without  actual fetching data from it or recreation of the slot?


You can seek forward just fine, just specify the start position in 
START_REPLICATION command.



Is there any mechanism in xlog which can enforce consistent decoding of
transaction (so that no transaction records are missed)?
May be I missed something but I didn't find any "record_number" or
something else which can identify first record of transaction.


As I mentioned above, what you probably want to do is use replication 
origins. When you use those you get origin info when decoding the 
transaction which you can then send to downstream and it can update it's 
idea of where it is for that origin. This is especially useful for the 
transaction forwarding you are doing (See BDR and/or pglogical code for 
example of that).
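
Roughly, the apply side of that looks like the sketch below (function names
are from replication/origin.h; the origin name, the receiver loop and all
error handling are omitted, so treat it as an outline rather than working
code):

#include "postgres.h"
#include "replication/origin.h"
#include "utils/timestamp.h"

static void
apply_session_init(char *origin_name)
{
	RepOriginId origin;

	origin = replorigin_by_name(origin_name, true);
	if (origin == InvalidRepOriginId)
		origin = replorigin_create(origin_name);

	/* Commits in this session are now attributed to that origin. */
	replorigin_session_setup(origin);
	replorigin_session_origin = origin;
}

static void
apply_one_remote_xact(XLogRecPtr remote_commit_lsn, TimestampTz remote_commit_ts)
{
	/* ... apply the decoded changes of one remote transaction here ... */

	/*
	 * Before committing, record how far we got for this origin; the position
	 * is crash-safe and is what gets reported back upstream.
	 */
	replorigin_session_origin_lsn = remote_commit_lsn;
	replorigin_session_origin_timestamp = remote_commit_ts;

	/* CommitTransactionCommand() would follow in a real apply worker. */
}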


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sslmode=require fallback

2016-08-19 Thread Jeff Janes
On Sat, Jul 30, 2016 at 11:18 AM, Bruce Momjian  wrote:

> On Fri, Jul 29, 2016 at 11:27:06AM -0400, Peter Eisentraut wrote:
> > On 7/29/16 11:13 AM, Bruce Momjian wrote:
> > > Yes, I am thinking of a case where Postgres is down but a malevolent
> > > user starts a Postgres server on 5432 to gather passwords.  Verifying
> > > against an SSL certificate would avoid this problem, so there is some
> > > value in using SSL on localhost.  (There is no such security available
> > > for Unix-domain socket connections.)
> >
> > Sure, there is the requirepeer connection option for that.
>
> Oh, nice, I had not seen that.
>


Hi Bruce,

There is typo in the doc patch you just committed

"On way to prevent spoofing of"

s/On/One/

Cheers,

Jeff


Re: [HACKERS] Logical decoding restart problems

2016-08-19 Thread Konstantin Knizhnik



On 19.08.2016 19:06, Petr Jelinek wrote:


I don't think this will work correctly, there will be gap between when 
the new slot starts to decode and the drop of the old one as the new 
slot first needs to make snapshot.


Do I understand correctly that you are not using replication origins?


No, we are using replication origins to prevent recursive replication.
The gap is not a biggest problem because in multimaster we commit only 
those transaction which are acknowledged by all live nodes.
So if some transaction falls in the gap... it is unlucky and will be 
aborted.





The problem is that for some reasons consistent point is not so
consistent and we get partly decoded transactions.
I.e. transaction body consists of two UPDATE but reorder_buffer extracts
only the one (last) update and sent this truncated transaction to
destination causing consistency violation at replica.  I started
investigation of logical decoding code and found several things which I
do not understand.


Never seen this happen. Do you have more details about what exactly is 
happening?




This is transaction at primary node:

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec 
-ti node1 bash
postgres@9a04a0c9f246:/pg$  pg_xlogdump 
data/pg_xlog/00010001  | fgrep "tx: 6899"
rmgr: Heaplen (rec/tot):  7/53, tx:   6899, lsn: 
0/019EFBD8, prev 0/019EFB80, desc: LOCK off 12: xid 6899 LOCK_ONLY 
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heaplen (rec/tot): 14/74, tx:   6899, lsn: 
0/019EFC10, prev 0/019EFBD8, desc: HOT_UPDATE off 12 xmax 6899 ; new off 
224 xmax 6899, blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heaplen (rec/tot):  7/53, tx:   6899, lsn: 
0/019EFC60, prev 0/019EFC10, desc: LOCK off 153: xid 6899 LOCK_ONLY 
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 49
rmgr: Heaplen (rec/tot): 14/82, tx:   6899, lsn: 
0/019EFC98, prev 0/019EFC60, desc: UPDATE off 153 xmax 6899 ; new off 55 
xmax 6899, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel 
1663/12407/16421 blk 49
rmgr: Btree   len (rec/tot):  2/64, tx:   6899, lsn: 
0/019EFCF0, prev 0/019EFC98, desc: INSERT_LEAF off 294, blkref #0: rel 
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):236/   265, tx:   6899, lsn: 
0/019EFD30, prev 0/019EFCF0, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1AC0E70: invalid record 
length at 0/1AC0EA8: wanted 24, got 0



This is the replicated transaction at other node (it is not clear from 
the trace, but believe me, it is the same transaction):



root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec 
-ti node2 bash
postgres@e5b16d82ce06:/pg$ pg_xlogdump 
data/pg_xlog/00010001  | fgrep "tx:   6882"
rmgr: Heaplen (rec/tot): 14/74, tx:   6882, lsn: 
0/019F3240, prev 0/019F31F0, desc: HOT_UPDATE off 113 xmax 6882 ; new 
off 219 xmax 0, blkref #0: rel 1663/12407/16421 blk 53
rmgr: Heaplen (rec/tot): 14/82, tx:   6882, lsn: 
0/019F5CB8, prev 0/019F5C60, desc: UPDATE off 163 xmax 6882 ; new off 4 
xmax 0, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel 
1663/12407/16421 blk 51
rmgr: Btree   len (rec/tot):  2/64, tx:   6882, lsn: 
0/019F5D10, prev 0/019F5CB8, desc: INSERT_LEAF off 284, blkref #0: rel 
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):248/   274, tx:   6882, lsn: 
0/019F61F8, prev 0/019F60E8, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1AD3AD8: invalid record 
length at 0/1AD3B10: wanted 24, got 0



And "shorten" version of the same transaction at the third (recovered) node:

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec 
-ti node3 bash
postgres@b4066d4211bc:/pg$  pg_xlogdump 
data/pg_xlog/00010001  | fgrep "tx: 6893"
rmgr: Heaplen (rec/tot):  7/53, tx:   6893, lsn: 
0/01A29828, prev 0/01A297E0, desc: LOCK off 172: xid 6893 LOCK_ONLY 
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 50
rmgr: Heaplen (rec/tot): 14/82, tx:   6893, lsn: 
0/01A29860, prev 0/01A29828, desc: UPDATE+INIT off 172 xmax 6893 ; new 
off 1 xmax 6893, blkref #0: rel 1663/12407/16421 blk 64, blkref #1: rel 
1663/12407/16421 blk 50
rmgr: Btree   len (rec/tot):  2/64, tx:   6893, lsn: 
0/01A298B8, prev 0/01A29860, desc: INSERT_LEAF off 314, blkref #0: rel 
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):236/   265, tx:   6893, lsn: 
0/01A298F8, prev 0/01A298B8, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1ACBBF8: invalid record 
length at 0/1ACBC30: wanted 24, got 0


You can see one update instead of two.

Sorry, I have not saved trace with output of logical decoder. But it 
really decodes just one update!

What I have done:

DROP_REPLICATION_SLOT "mtm_slot_1";
CREATE_REPLICATION_SLOT "mtm_slot_1" LOGICAL "multimaster";

Re: [HACKERS] anyelement -> anyrange

2016-08-19 Thread Corey Huinker
On Fri, Aug 19, 2016 at 11:40 AM, Jim Nasby 
wrote:

> On 8/18/16 6:02 PM, Corey Huinker wrote:
>
>> I'd be happy to roll your code into the extension, and make it marked
>> more stable.
>>
>
> Yeah, I've been meaning to look at submitting a pull request; hopefully
> will get to it today.
>
>
No rush, I'm on vacation. Though I really do appreciate other eyes on the
code and other people using it.


Re: [HACKERS] Re: PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-08-19 Thread Andres Freund
On 2016-08-19 17:55:25 -0400, Tom Lane wrote:
> It'd be useful also to figure out why our existing valgrind testing has
> not caught this already.  The example you give looks like it surely
> ought to be replicated well enough in the standard regression tests.

The valgrind suppression file explicitly hides a bunch of padding
issues.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-08-19 Thread Tom Lane
Andres Freund  writes:
> On 2016-08-19 17:55:25 -0400, Tom Lane wrote:
>> It'd be useful also to figure out why our existing valgrind testing has
>> not caught this already.  The example you give looks like it surely
>> ought to be replicated well enough in the standard regression tests.

> The valgrind suppression file explicitly hides a bunch of padding
> issues.

So maybe we ought to work towards taking those out?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-19 Thread Claudio Freire
On Thu, Aug 18, 2016 at 6:26 PM, Claudio Freire  wrote:
> On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan  wrote:
>> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire  
>> wrote:
>>> In fact, that's why non-leaf index tuples need a different format,
>>> because while leaf index tuples contain the heap pointer already,
>>> non-leaf ones contain only the downlink, not the pointer into the
>>> heap. To be able to do comparisons and pick the right downlink, the
>>> original heap pointer in the leaf index tuple is copied into the
>>> downlink index tuple when splitting pages into an additional
>>> IndexTupleData header that is prepended only to non-leaf index tuples.
>>
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> I see that. I could try to measure average depth to measure the impact
> this had on fan-in.
>
> While it should cut it in half for narrow indexes, half of very high
> is still high. Wide indexes, which are are the ones that would suffer
> from poor fan-in, would feel this far less, since the overhead is
> constant.
>
> Even if it does have an impact, I don't see an alternative, without
> also implementing suffix truncation. Perhaps I could try to avoid
> adding the leaf tid header if it isn't necessary, though I would have
> to use up the last available flag bit in t_info for that.

So, I did some tests.

TL;DR version is, as expected, there is a big impact on narrow
indexes, but I don't believe it's something to be worried about.

**Begin LONG version** (skip to end long version if not interested)

Regardless of the fanout (or fanin, if you look at it that way), what
matters is tree depth, so I used pageinspect to check how depth of
indexes changed after patching.

This is all on pristine recently built indexes. I will try to measure
the same on bloated indexes later on, but the tests are painfully
slow.

I used the following query to inspect all user indexes, since
pageinspect failed to work for toast indexes for some reason (it
probably needed qualified relnames or something like that). Anyway
user indexes should be enough.

select level, count(*), pg_size_pretty(max(relpages) * 8192.0) as biggest
from pg_class, bt_metap(relname)
where relkind = 'i' and relnamespace = 2200 and relam = 403
group by level
order by level desc;

I tested it on both patched and unpatched versions of both my test
database (roughly 23GB), pgbench scale 1000 (15GB) and pgbench scale
1 (146GB).

I believe pgbench is a worst case example, because it only has very
narrow PK indexes, which are the ones that will suffer the most from
this patch.

The numbers:

mat, patched

level;count;biggest
3;18;"1369 MB"
2;60;"322 MB"
1;26;"808 kB"
0;50;"16 kB"

mat, unpatched

level;count;biggest
3;15;"1367 MB"
2;63;"334 MB"
1;26;"800 kB"
0;49;"16 kB"

pgbench, scale 1000, patched

level;count;biggest
3;1;"2145 MB"
1;2;"456 kB"

pgbench, scale 1000, unpatched

level;count;biggest
3;1;"2142 MB"
1;2;"456 kB"

pgbench, scale 1, patched

level;count;biggest
3;1;"21 GB"
2;1;"2224 kB"
1;1;"240 kB"

pgbench, scale 1, unpatched

level;count;biggest
3;1;"21 GB"
1;2;"2208 kB"

So, clearly some indexes are pushed to the next level, but it isn't a
widespread effect - it looks more like the threshold index size for
needing a root split is slightly pushed downwards.

It totally agrees with the math. The index tuple header is 8 bytes,
and the datums need to be maxaligned so they will also be 8 bytes at
least. That means inner tuple size grows by 50% at the worst, cutting the
B in B-tree to 2/3, but only for inner tuples, and so you get that

if a * log_(B)(N/B) = log_(B/(3/2))(N/B)

then for big enough N

a < (3/2)*log_(B)(N) - 2

Which means the depth of huge B-trees is increased by 50%, but the
effect only begins to be seen at level 2, which is what we have here,
and level 2 for such a narrow index seems to be about 2GB. So we're
talking about very big indexes already. If level 2 can hold 2GB of
index, level 3 can hold around 2G x 8k / 16 = 1TB of index, and I have
yet to see (even in very big databases) a 1TB index on a PK.
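
As a back-of-the-envelope check of that (the page and tuple sizes below are
rough assumptions, not measurements from the patch):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		usable_page = 8000.0;	/* assumed usable bytes per inner page */
	double		pivot_old = 16.0;		/* assumed inner tuple size today */
	double		pivot_new = 24.0;		/* with the extra 8-byte heap-pointer header */
	double		leaf_tuples = 1e9;		/* a billion leaf entries, arbitrary */

	double		fanout_old = usable_page / pivot_old;	/* ~500 */
	double		fanout_new = usable_page / pivot_new;	/* ~333 */

	printf("levels above the leaves: %.2f -> %.2f\n",
		   log(leaf_tuples) / log(fanout_old),
		   log(leaf_tuples) / log(fanout_new));
	/* prints roughly 3.33 -> 3.57: the extra level only appears for huge indexes */
	return 0;
}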

If anyone has such an index, yes, this patch will hurt performance by
25% (4 page reads instead of 3).

Bad, but since it only affects very extreme cases, doesn't seem so bad.

**End LONG version**

So, that said, I started mocking up a version of the patch that used a
"has aux data" flag to signal another set of flags from where variable
size attributes can be read, which would enable implementation of
suffix truncation and prefix compression in the future with minimal
data format changes, and was surprised to find that it seems not only
more compact, but cleaner. So I will probably go that way, and 

Re: [HACKERS] Re: PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-08-19 Thread Tom Lane
Piotr Stefaniak  writes:
> On 2016-08-18 20:02, Heikki Linnakangas wrote:
>>> -block = (AllocBlock) malloc(blksize);
>>> +block = (AllocBlock) calloc(1, blksize);

>> I think this goes too far. You're zeroing all palloc'd memory, even if
>> it's going to be passed to palloc0(), and zeroed there. It might even
>> silence legitimate warnings, if there's code somewhere that does
>> palloc(), and accesses some of it before initializing. Plus it's a
>> performance hit.

> I just did a test where I [ this n that ]

I think you are failing to understand Heikki's point.  There is no way
we are committing the change depicted above, because (1) it will mask more
bugs than it fixes; (2) it's an enormously expensive way to fix anything;
and (3) it will effectively disable valgrind testing for missed
initializations.

The right thing to do is find out what downstream bit of code is missing
initializing a block of memory it's working on, and fix that localized
oversight.

It'd be useful also to figure out why our existing valgrind testing has
not caught this already.  The example you give looks like it surely
ought to be replicated well enough in the standard regression tests.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Bug in abbreviated keys abort handling (found with amcheck)

2016-08-19 Thread Peter Geoghegan
I found another bug as a result of using amcheck on Heroku customer
databases. This time, the bug is in core Postgres. It's one of mine.

There was a thinko in tuplesort's abbreviation abort logic, causing
certain SortTuples to be spuriously marked NULL (and so, subsequently
sorted as a NULL tuple, despite not actually changing anything about
the representation of caller tuples). The attached patch fixes this
bug.

I noticed this following a complaint by amcheck about a tuple in the
wrong order on a leaf page in some random text index. The leaf page
was entirely full of NULL values, aside from this one tuple at some
seemingly random position. All non-NULL index tuples were of the kind
that you'd expect to trigger abbreviation to abort (many distinct
values, but with little entropy at the beginning).

I believe that this particular problem has been observed on a tiny
fraction of all databases tested, so I don't think it's very common in
the wild.

I'd be surprised if amcheck does not bring more bugs like this to my
attention before too long. We should work on improving it, so that we
have greater visibility into problems that occur in the field.
-- 
Peter Geoghegan
From d570b9bbfc00cbd861291f47425aa1da1bc32db5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Fri, 19 Aug 2016 14:34:34 -0700
Subject: [PATCH] Fix thinko in abbreviated key abort code

Due to an error in the abbreviated key abort logic, the most recently
processed SortTuple could be incorrectly marked NULL, resulting in an
incorrect final sort order.

This bug may have resulted in corrupt B-Tree indexes in cases where
abbreviation is aborted.  In general, abbreviation is aborted due to a
lack of entropy in the abbreviated keys that were generated so far.
---
 src/backend/utils/sort/tuplesort.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 510565c..ae384a8 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1443,7 +1443,7 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 			mtup->datum1 = index_getattr(tuple,
 		 1,
 		 RelationGetDescr(state->indexRel),
-		 &stup.isnull1);
+		 &mtup->isnull1);
 		}
 	}
 
@@ -4271,7 +4271,7 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 			mtup->datum1 = heap_getattr(tuple,
 	  state->indexInfo->ii_KeyAttrNumbers[0],
 		state->tupDesc,
-		&stup->isnull1);
+		&mtup->isnull1);
 		}
 	}
 }
@@ -4588,7 +4588,7 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 			mtup->datum1 = index_getattr(tuple,
 		 1,
 		 RelationGetDescr(state->indexRel),
-		 &stup->isnull1);
+		 &mtup->isnull1);
 		}
 	}
 }
-- 
2.7.4


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-08-19 Thread Piotr Stefaniak
On 2016-08-19 23:55, Tom Lane wrote:
> I think you are failing to understand Heikki's point.  There is no way
> we are committing the change depicted above, because (1) it will mask more
> bugs than it fixes; (2) it's an enormously expensive way to fix anything;
> and (3) it will effectively disable valgrind testing for missed
> initializations.

I wasn't disagreeing with Heikki, I was trying to show that while 
Aleksander's patch may be useless and perhaps harmful if committed, it 
is not useless in a larger perspective as it has made people look into 
an issue.

And I did that simply because I have more changes of that kind that may 
end up being deemed as useless for committing, but I want to share them 
with -hackers anyway.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] EXLCUDE constraints and Hash indexes

2016-08-19 Thread Jim Nasby

On 8/17/16 8:12 AM, Andrew Gierth wrote:

I also recently found a case where using btree exclusion constraints was
useful: a unique index on an expression can't be marked deferrable, but
the equivalent exclusion constraint can be.


That seems well worth documenting...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - minor doc improvements

2016-08-19 Thread Tom Lane
Fabien COELHO  writes:
> Minor pgbench documentation improvements so that the description is more
> precise:
>   - a pgbench script may not contain SQL commands, it only needs not to be
> empty.
>   - point out explicitely variable setting meta commands.
>   - the formula is short enough to fit on a line.

I looked at this, but I don't really find any of these changes to be
improvements.  They just make it harder to read IMO.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: initdb: "'" for QUOTE_PATH (non-windows)

2016-08-19 Thread Ryan Murphy
Here is another version of my initdb shell quoting patch.  I have removed
the unnecessary {} block.  I also ran pgindent on the code prior to
creating the patch.

On Thu, Aug 18, 2016 at 3:50 PM, Ryan Murphy  wrote:

>
>
> On Thu, Aug 18, 2016 at 3:44 PM, Tom Lane  wrote:
>
>> Ryan Murphy  writes:
>> > On Thu, Aug 18, 2016 at 3:22 PM, Robert Haas 
>> wrote:
>>  +{ /* pg_ctl command w path, properly quoted */
>>  +PQExpBuffer pg_ctl_path = createPQExpBuffer();
>>  +printfPQExpBuffer(pg_ctl_path, "%s%spg_ctl",
>>
>> >> I think it's worth reducing the scope of variables when that's as
>> >> simple as putting them inside a block that you have to create anyway,
>> >> but I'm skeptical about the idea that one would create a block just to
>> >> reduce the scope of the variables.  I don't think that's our usual
>> >> practice, and I would expect the compiler to detect where the variable
>> >> is referenced first and last anyway.
>>
>> > I enjoy adding the blocks for explicit variable scoping and for quick
>> > navigation in vim (the % key navigates between matching {}'s).  But I
>> want
>> > to fit in with the style conventions of the project.
>>
>> Another point here is that code like this will look quite a bit different
>> after pgindent gets done with it --- that comment will not stay where
>> you put it, for example.  Some of our worst formatting messes come from
>> code wherein somebody adhered to their own favorite layout style without
>> any thought for how it would play with pgindent.
>>
>> regards, tom lane
>>
>
> Ahh, I didn't know about pgindent, but now I see it in /src/tools.  I can
> run that on my code before submitting.
>
> I found these links about the style convention and will make sure my patch
> fits the conventions before submitting it.
>
>


0001-initdb-quote-shell-args-in-final-pg_ctl-command.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: PROPOSAL: make PostgreSQL sanitizers-friendly (and prevent information disclosure)

2016-08-19 Thread Piotr Stefaniak
On 2016-08-18 20:02, Heikki Linnakangas wrote:
> On 03/22/2016 03:27 PM, Aleksander Alekseev wrote:
>> diff --git a/src/backend/utils/mmgr/aset.c
>> b/src/backend/utils/mmgr/aset.c
>> index d26991e..46ab8a2 100644
>> --- a/src/backend/utils/mmgr/aset.c
>> +++ b/src/backend/utils/mmgr/aset.c
>> @@ -850,7 +850,7 @@ AllocSetAlloc(MemoryContext context, Size size)
>>  blksize <<= 1;
>>
>>  /* Try to allocate it */
>> -block = (AllocBlock) malloc(blksize);
>> +block = (AllocBlock) calloc(1, blksize);
>>
>>  /*
>>   * We could be asking for pretty big blocks here, so cope if
>> malloc

>
> I think this goes too far. You're zeroing all palloc'd memory, even if
> it's going to be passed to palloc0(), and zeroed there. It might even
> silence legitimate warnings, if there's code somewhere that does
> palloc(), and accesses some of it before initializing. Plus it's a
> performance hit.

I just did a test where I
1. memset() that block to 0xAC (aset.c:853)
2. compile and run bin/initdb, then bin/postgres
3. run an SQL file, shut down bin/postgres
4. make a copy of the transaction log file
5. change the memset() to 0x0C, repeat steps 2-4
6. compare the two transaction log files with a combination of
hexdump(1), cut(1), and diff(1).

At the end of the output I can see:

-0f34 0010  f5ff ac02 000a  
+0f34 0010  f5ff 0c02 000a  

So it looks like the MSan complaint might be a true positive.

The SQL file is just this snippet from bit.sql:
CREATE TABLE varbit_table (a BIT VARYING(16), b BIT VARYING(16));
COPY varbit_table FROM stdin;
X0F X10
X1F X11
X2F X12
X3F X13
X8F X04
X000F   X0010
X0123   X
X2468   X2468
XFA50   X05AF
X1234   XFFF5
\.

I realize given information might be a little bit scarce, but I didn't 
know what else might be interesting to you that you wouldn't be able to 
reproduce.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - allow empty queries

2016-08-19 Thread Tom Lane
Fabien COELHO  writes:
> I wanted to test overheads on an empty query, but pgbench does not allow 
> it. I do not see why not.

I'm inclined to think this was probably a good thing before 9.6, as a
guard against accidentally writing an empty query into a script and
thus executing more queries than you meant to.  However, with the new
end-of-command definition I think it's impossible to write an empty
query without an explicit ";", so it seems fine now.  Pushed.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] standalone backend PANICs during recovery

2016-08-19 Thread Tom Lane
[ got around to looking at this finally ]

Alvaro Herrera  writes:
> Bernd Helmle wrote:
>> While investigating a problem on a streaming hot standby instance at a
>> customer site, i got the following when using a standalone backend:
>> 
>> PANIC:  btree_xlog_delete_get_latestRemovedXid: cannot operate with
>> inconsistent data
>> CONTEXT:  xlog redo delete: index 1663/65588/65625; iblk 11, heap
>> 1663/65588/65613;
>> 
>> The responsible PANIC is triggered here:
>> src/backend/access/nbtree/nbtxlog.c:555

> This PANIC was introduced in the commit below.  AFAICS your proposed
> patch and rationale are correct.

I'm not very convinced by this.  What seems to me to be the core issue is
how it can make sense for InHotStandby to read as TRUE in a standalone
backend.  The best-case situation if that's true is that the standalone
backend will do a lot of unnecessary work on the baseless assumption that
there are other backends it has to keep things valid for.  The worst-case
situation is that you trip over some bug and are unable to complete
recovery, even though you would have if the hot-standby support hadn't
been activated.

In short, I don't think control should have been here at all.  My proposal
for a fix is to force EnableHotStandby to remain false in a standalone
backend.
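
A minimal sketch of that idea (the placement and exact condition are
assumptions, not a tested patch):

/*
 * Somewhere early in recovery setup, after recovery.conf has been read:
 * a standalone backend has no other sessions to keep consistent, so
 * never enter hot standby mode even if the configuration asks for it.
 */
if (!IsUnderPostmaster)
	EnableHotStandby = false;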

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use %u to print user mapping's umid and userid

2016-08-19 Thread Tom Lane
Etsuro Fujita  writes:
> On 2016/05/13 3:53, Tom Lane wrote:
>> Robert Haas  writes:
>>> Regardless of what approach we take, I disagree that this needs to be
>>> fixed in 9.6.

>> Agreed.  This is only a cosmetic issue, and it's only going to be visible
>> to a very small group of people, so we can leave it alone until 9.7.

> Agreed.

This thread is shown as "Needs review" in the 2016-09 commitfest, but so
far as I can find, no new patch has been posted since Robert proposed that
we ought to get rid of the current List format for the extra info in favor
of using ExtensibleNode.  I'm going to mark the CF entry as Returned With
Feedback.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Show dropped users' backends in pg_stat_activity

2016-08-19 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> I concur.  Let's put the left join(s) into those views and call it
>> good.

> I'd suggest we also add some notes to the documentation that the correct
> approach to dropping users is to disallow access first, then kill any
> existing backends, and then drop the user.  That, plus the left joins,
> seems like it's good enough.

I've pushed Oskari's original patch; a look around in system_views.sql
shows that it's justified on the basis of consistency with other views
even aside from the functional problem he ran into.

The documentation question seems like material for a separate patch.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical decoding restart problems

2016-08-19 Thread Petr Jelinek

On 19/08/16 18:36, Konstantin Knizhnik wrote:



On 19.08.2016 19:06, Petr Jelinek wrote:


I don't think this will work correctly, there will be gap between when
the new slot starts to decode and the drop of the old one as the new
slot first needs to make snapshot.

Do I understand correctly that you are not using replication origins?


No, we are using replication origins to prevent recursive replication.
The gap is not a biggest problem because in multimaster we commit only
those transaction which are acknowledged by all live nodes.
So if some transaction falls in the gap... it is unlucky and will be
aborted.



Ah okay, forgot you have sync MM.




The problem is that for some reasons consistent point is not so
consistent and we get partly decoded transactions.
I.e. transaction body consists of two UPDATE but reorder_buffer extracts
only the one (last) update and sent this truncated transaction to
destination causing consistency violation at replica.  I started
investigation of logical decoding code and found several things which I
do not understand.


Never seen this happen. Do you have more details about what exactly is
happening?



This is transaction at primary node:

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec
-ti node1 bash
postgres@9a04a0c9f246:/pg$  pg_xlogdump
data/pg_xlog/00010001  | fgrep "tx: 6899"
rmgr: Heaplen (rec/tot):  7/53, tx:   6899, lsn:
0/019EFBD8, prev 0/019EFB80, desc: LOCK off 12: xid 6899 LOCK_ONLY
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heaplen (rec/tot): 14/74, tx:   6899, lsn:
0/019EFC10, prev 0/019EFBD8, desc: HOT_UPDATE off 12 xmax 6899 ; new off
224 xmax 6899, blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heaplen (rec/tot):  7/53, tx:   6899, lsn:
0/019EFC60, prev 0/019EFC10, desc: LOCK off 153: xid 6899 LOCK_ONLY
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 49
rmgr: Heaplen (rec/tot): 14/82, tx:   6899, lsn:
0/019EFC98, prev 0/019EFC60, desc: UPDATE off 153 xmax 6899 ; new off 55
xmax 6899, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel
1663/12407/16421 blk 49
rmgr: Btree   len (rec/tot):  2/64, tx:   6899, lsn:
0/019EFCF0, prev 0/019EFC98, desc: INSERT_LEAF off 294, blkref #0: rel
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):236/   265, tx:   6899, lsn:
0/019EFD30, prev 0/019EFCF0, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1AC0E70: invalid record
length at 0/1AC0EA8: wanted 24, got 0


This is the replicated transaction at other node (it is not clear from
the trace, but believe me, it is the same transaction):


root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec
-ti node2 bash
postgres@e5b16d82ce06:/pg$ pg_xlogdump
data/pg_xlog/00010001  | fgrep "tx:   6882"
rmgr: Heaplen (rec/tot): 14/74, tx:   6882, lsn:
0/019F3240, prev 0/019F31F0, desc: HOT_UPDATE off 113 xmax 6882 ; new
off 219 xmax 0, blkref #0: rel 1663/12407/16421 blk 53
rmgr: Heaplen (rec/tot): 14/82, tx:   6882, lsn:
0/019F5CB8, prev 0/019F5C60, desc: UPDATE off 163 xmax 6882 ; new off 4
xmax 0, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel
1663/12407/16421 blk 51
rmgr: Btree   len (rec/tot):  2/64, tx:   6882, lsn:
0/019F5D10, prev 0/019F5CB8, desc: INSERT_LEAF off 284, blkref #0: rel
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):248/   274, tx:   6882, lsn:
0/019F61F8, prev 0/019F60E8, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1AD3AD8: invalid record
length at 0/1AD3B10: wanted 24, got 0


And "shorten" version of the same transaction at the third (recovered)
node:

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec
-ti node3 bash
postgres@b4066d4211bc:/pg$  pg_xlogdump
data/pg_xlog/00010001  | fgrep "tx: 6893"
rmgr: Heaplen (rec/tot):  7/53, tx:   6893, lsn:
0/01A29828, prev 0/01A297E0, desc: LOCK off 172: xid 6893 LOCK_ONLY
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 50
rmgr: Heaplen (rec/tot): 14/82, tx:   6893, lsn:
0/01A29860, prev 0/01A29828, desc: UPDATE+INIT off 172 xmax 6893 ; new
off 1 xmax 6893, blkref #0: rel 1663/12407/16421 blk 64, blkref #1: rel
1663/12407/16421 blk 50
rmgr: Btree   len (rec/tot):  2/64, tx:   6893, lsn:
0/01A298B8, prev 0/01A29860, desc: INSERT_LEAF off 314, blkref #0: rel
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):236/   265, tx:   6893, lsn:
0/01A298F8, prev 0/01A298B8, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1ACBBF8: invalid record
length at 0/1ACBC30: wanted 24, got 0

You can see one update instead of two.

Sorry, I have not saved trace with output of logical decoder. But it
really decodes just one update!
What I have done:

DROP_REPLICATION_SLOT "mtm_slot_1";
CREATE_REPLICATION_SLOT 

[HACKERS] CREATE POLICY bug ?

2016-08-19 Thread Andrea Adami
Hello,
i'm testing the new row level security functionality in postgresql 9.5.
To do that i run this script:

---cut here --

CREATE TABLE public.policy_tab
(
  id bigint NOT NULL,
  description character varying(160) NOT NULL,
  usr name NOT NULL,
  CONSTRAINT policy_tab_pk PRIMARY KEY (id)
);

ALTER TABLE public.policy_tab OWNER TO postgres;

GRANT ALL ON TABLE public.policy_tab TO public;

CREATE OR REPLACE VIEW public.policy_view AS
 SELECT id,
description,
usr
   FROM public.policy_tab;

ALTER TABLE public.policy_view
  OWNER TO postgres;
GRANT ALL ON TABLE public.policy_view TO public;

ALTER TABLE public.policy_tab ENABLE ROW LEVEL SECURITY;

CREATE POLICY standard ON public.policy_tab
FOR ALL
TO PUBLIC
USING (usr = current_user);

INSERT INTO public.policy_tab (id, description, usr) VALUES (1,'uno','
mana...@scuola247.it');
INSERT INTO public.policy_tab (id, description, usr) VALUES (2,'due','
mana...@scuola247.it');
INSERT INTO public.policy_tab (id, description, usr) VALUES (3,'tre','
mana...@scuola247.it');
INSERT INTO public.policy_tab (id, description, usr) VALUES (4,'quattro','
teac...@scuola247.it');
INSERT INTO public.policy_tab (id, description, usr) VALUES (5,'cinque','
teac...@scuola247.it');

---cut here --

after that i run the query: "select * from public.policy_tab"

and the output was what i expected:

rows 1,2,3 for user: mana...@scuola247.it
rows 4,5 for user: teac...@scuola247.it
rows 1,2,3,4,5  for user:  postgres (the policy doesn't work for him)

but when i run the query: "select * from public.policy_view"

the output is the same (all rows) for all users

i'm doing some mistakes or this is a bug ?

thank you in advance for the time you would like dedicate to me.

Andrea Adami


Re: [HACKERS] LSN as a recovery target

2016-08-19 Thread Michael Paquier
On Fri, Aug 19, 2016 at 10:47 PM, Adrien Nayrat
 wrote:
> I reviewed this patch rebased to deal with
> f6ced51f9188ad5806219471a0b40a91dde923aa, and minor adjustment (see below)

Thanks!

> It does the job. However if you use an incorrect recovery_target_lsn you
> get this message :
> "FATAL:  invalid input syntax for type pg_lsn: "0RT5/4""
>
> I added a PG_TRY/PG_CATCH section in order to get a more explicit message :
> FATAL:  wrong recovery_target_lsn (must be pg_lsn type)
>
> I am not sure if it is the best solution?

Using a PG_TRY/CATCH block the way you do to show to user a different
error message while the original one is actually correct does not
sound like a good idea to me. It complicates the code and the original
pattern matches what is already done for timestamps, where in case of
error you'd get that:
FATAL:  invalid input syntax for type timestamp with time zone: "aa"
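
For context, a sketch of how the patch turns the string into an LSN (not the
exact patch text); pg_lsn_in() raises the "invalid input syntax for type
pg_lsn" error, which the startup process reports as FATAL while reading
recovery.conf:

#include "postgres.h"
#include "utils/pg_lsn.h"

/* In the patch this logic lives in readRecoveryCommandFile(); sketch only. */
static XLogRecPtr
parse_recovery_target_lsn(const char *value)
{
	return DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
										   CStringGetDatum(value)));
}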

I think that a better solution would be to make clearer in the docs
that pg_lsn is used here. First, in recovery.conf.sample, let's use
pg_lsn and not LSN. Then, in the sgml docs, let's add a reference to
the page of pg_lsn as well:
https://www.postgresql.org/docs/devel/static/datatype-pg-lsn.html

Now we have that:
+   
+This parameter specifies the LSN of the write-ahead log stream up to
+which recovery will proceed. The precise stopping point is also
+influenced by .
+   
Perhaps we could just add an additional sentence: "This parameter
value is parsed using the system datatype 

Re: [HACKERS] Should we cacheline align PGXACT?

2016-08-19 Thread Amit Kapila
On Fri, Aug 19, 2016 at 8:24 PM, Alexander Korotkov
 wrote:
> On Fri, Aug 19, 2016 at 3:19 PM, Amit Kapila 
> wrote:
>>
>> On Fri, Aug 19, 2016 at 11:42 AM, Alexander Korotkov
>>  wrote:
>> > Hackers,
>> >
>> > originally this idea was proposed by Andres Freund while experimenting
>> > with
>> > lockfree Pin/UnpinBuffer [1].
>> > The patch is attached as well as results of pgbench -S on 72-cores
>> > machine.
>> > As before it shows huge benefit in this case.
>> > For sure, we should validate that it doesn't cause performance
>> > regression in
>> > other cases.  At least we should test read-write and smaller machines.
>> > Any other ideas?
>> >
>>
>> may be test on Power m/c as well.
>
>
> Good idea.  I don't have any such machine at hand now.  Do you have one?
>

Yes, I can help in testing this patch during CF.  Feel free to ping,
if I missed or forgot to do same.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LSN as a recovery target

2016-08-19 Thread Petr Jelinek

On 20/08/16 02:13, Michael Paquier wrote:

On Fri, Aug 19, 2016 at 10:47 PM, Adrien Nayrat
 wrote:

I reviewed this patch rebased to deal with
f6ced51f9188ad5806219471a0b40a91dde923aa, and minor adjustment (see below)


Thanks!


It does the job. However if you use an incorrect recovery_target_lsn you
get this message :
"FATAL:  invalid input syntax for type pg_lsn: "0RT5/4""

I added a PG_TRY/PG_CATCH section in order to get a more explicit message :
FATAL:  wrong recovery_target_lsn (must be pg_lsn type)

I am not sure if it is the best solution?


Using a PG_TRY/CATCH block the way you do to show to user a different
error message while the original one is actually correct does not
sound like a good idea to me. It complicates the code and the original
pattern matches what is already done for timestamps, where in case of
error you'd get that:
FATAL:  invalid input syntax for type timestamp with time zone: "aa"

I think that a better solution would be to make clearer in the docs
that pg_lsn is used here. First, in recovery.conf.sample, let's use
pg_lsn and not LSN. Then, in the sgml docs, let's add a reference to
the page of pg_lsn as well:
https://www.postgresql.org/docs/devel/static/datatype-pg-lsn.html

Now we have that:
+   
+This parameter specifies the LSN of the write-ahead log stream up to
+which recovery will proceed. The precise stopping point is also
+influenced by .
+   
Perhaps we could just add an additional sentence: "This parameter
value is parsed using the system datatype 

[HACKERS] TODO list updated for PG 10

2016-08-19 Thread Bruce Momjian
I have done my yearly TODO list cleanup, and I did more extensive item
removal this time.  There were a number of entries I could not figure out
so if people want to review what is left and remove items that are
undesired or done, please do that.  Thanks.

https://wiki.postgresql.org/wiki/Todo

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication slots replicated to standbys?

2016-08-19 Thread Michael Paquier
On Sat, Aug 20, 2016 at 1:39 PM, Bruce Momjian  wrote:
> Someone reported that a replication slot that existed at the time a base
> backup was done on the master was copied to the standby.  Because they
> didn't realize it, their WAL was not being recycled on the standby.
>
> Is that possible?  Is it a known behavior?  I don't see it documented.

From backup.sgml:
   
It is often a good idea to also omit from the backup the files
within the cluster's pg_replslot/ directory, so that
replication slots that exist on the master do not become part of the
backup.  Otherwise, the subsequent use of the backup to create a standby
may result in indefinite retention of WAL files on the standby, and
possibly bloat on the master if hot standby feedback is enabled, because
the clients that are using those replication slots will still be connecting
to and updating the slots on the master, not the standby.  Even if the
backup is only intended for use in creating a new master, copying the
replication slots isn't expected to be particularly useful, since the
contents of those slots will likely be badly out of date by the time
the new master comes on line.
   

Note as well that pg_basebackup omits its content and creates an empty
directory.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-19 Thread Claudio Freire
On Sat, Aug 20, 2016 at 1:53 AM, Claudio Freire  wrote:
> All uniqueness checks will try to read-lock nbuf
> unless the uniqueness check for that insertion is done.


That should read "all uniqueness checks will try to read-lock the buf
unless the uniqueness check for that insertion is done."

(nbuf -> buf)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-19 Thread Amit Kapila
On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire  wrote:
>
> A couple of points make me uneasy about this patch, yet I can think of
> no better alternative, so I seek feedback:
>
>  - introducing a different format for inner index tuples makes for an
> invasive patch and quite difficult-to-reason code (it's easy to forget
> whether a page is leaf or inner and that now causes assertion failures
> or segfaults)
>  - the ascent-descent to check for uniqueness when there are large
> dead tuple runs could have locking subtleties that escape me. Perhaps
> there's better ways to solve this.

I have tried to study this part of your patch and it seems somewhat
non-trivial and risky part of proposal.

+ } else {
+ /*
+ * We found the actual first item on another block, so we have to perform
+ * a two-step search - first half, until our write-locked buffer, then another
+ * starting from our write-locked buffer.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BT_WRITE);
+
+ buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
+ true, stack, BT_WRITE, NULL);
+
+ first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, NULL);
+
+ xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf,
first_offset, itup_scankey,
+ checkUnique, _unique, );
+
+ _bt_relbuf(rel, nbuf);
+ }

The idea for uniqueness check is that we hold the write lock on
buffer/page on which we are trying to operate (we do take read locks
on the consecutive pages during this check).  Here, in your changed
proposal, you have two buffers (one to which the search led you and
one buffer previous in the chain) and before checking uniqueness, on
one of the buffers you seem to have write lock and on other you have
read lock.  This seems to change the way uniqueness check works till
now, can you explain how this works (can we safely assume that all
searches for uniqueness check will lead to the same buffer first).

With this new mechanism, do we have two type of search interfaces
where one would work for keys (as now) and other for key-ctid or it
will be just a single interface which works both ways?  I think either
way there are pros and cons.

Also, in the thread below you are talking about using the last bit in
t_info, I want to bring it to your notice that there is a patch of
mine [1] in which I have used it to avoid changing on-disk
compatibility of hash indexes.  I am not saying that we should not
plan it for some other use, but just to keep in mind that there are
other proposals to use it.  We can decide what is best way to proceed,
if we are aware of all the proposals that want to use it.


[1] - 
https://www.postgresql.org/message-id/caa4ek1lkq_udism-z2dq6cuvjh3db5fnfnnezzbpsrjw0ha...@mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-19 Thread Claudio Freire
On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila  wrote:
> On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire  
> wrote:
>>
>> A couple of points make me uneasy about this patch, yet I can think of
>> no better alternative, so I seek feedback:
>>
>>  - introducing a different format for inner index tuples makes for an
>> invasive patch and quite difficult-to-reason code (it's easy to forget
>> whether a page is leaf or inner and that now causes assertion failures
>> or segfaults)
>>  - the ascent-descent to check for uniqueness when there are large
>> dead tuple runs could have locking subtleties that escape me. Perhaps
>> there's better ways to solve this.
>
> I have tried to study this part of your patch and it seems somewhat
> non-trivial and risky part of proposal.
>
> + } else {
> + /*
> + * We found the actual first item on another block, so we have to perform
> + * a two-step search - first half, until our write-locked buffer, then 
> another
> + * starting from our write-locked buffer.
> + */
> + LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> + LockBuffer(buf, BT_WRITE);
> +
> + buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
> + true, stack, BT_WRITE, NULL);
> +
> + first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, 
> NULL);
> +
> + xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf,
> first_offset, itup_scankey,
> + checkUnique, _unique, );
> +
> + _bt_relbuf(rel, nbuf);
> + }
>
> The idea behind the uniqueness check is that we hold the write lock on
> the buffer/page on which we are trying to operate (we do take read
> locks on the consecutive pages during this check).  Here, in your
> changed proposal, you have two buffers (the one to which the search led
> you and the previous buffer in the chain), and before checking
> uniqueness you seem to hold a write lock on one of the buffers and a
> read lock on the other.  This seems to change the way the uniqueness
> check has worked until now; can you explain how this works (can we
> safely assume that all searches for the uniqueness check will lead to
> the same buffer first)?

All uniqueness checks will find the same "nbuf" (read-locked), which
is the first one that matches the key.

They could have different "buf" buffers (the write-locked one). But
since a read lock cannot be granted while another backend holds the
write lock, I believe it doesn't matter. All uniqueness checks will try
to read-lock nbuf unless the uniqueness check for that insertion is
already done. When they do, they will block until the other insertion
is finished, and once it is, either the insert happened or it returned
an xid to wait for and retry. If it succeeded, the check attempting the
read lock will see the inserted tuple, return the xid of the earlier
transaction, and wait on it. If it failed, no insert happened because
there is a visible tuple there, so the read lock will succeed, find the
visible tuple, and this insert will fail too.

In essence, I believe it is still the case that only one insert can
succeed at a time; all the others will retry. The difference with the
patch is that the contention is between a read lock and a write lock
rather than two write locks, so there is a little less contention (some
more work can be done in parallel).
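
To restate that as code: the fragment below only illustrates the
contention pattern described above, it is not code from the patch. The
names first_blkno and check_unique() are hypothetical stand-ins; the
buffer and lock-wait primitives are the existing backend ones.

    /*
     * Illustrative fragment only: "first_blkno" is the block of the first
     * leaf page whose key range matches the new key, and check_unique() is
     * a stand-in for the patched _bt_check_unique() call.
     */
    for (;;)
    {
        Buffer          nbuf;
        TransactionId   xwait;

        /*
         * Every inserter has to read-lock the same nbuf before it can
         * decide anything, so it blocks here while a concurrent inserter
         * still holds its write lock on that page.
         */
        nbuf = _bt_getbuf(rel, first_blkno, BT_READ);

        xwait = check_unique(rel, itup, nbuf);  /* hypothetical helper */

        if (TransactionIdIsValid(xwait))
        {
            /* an earlier inserter got there first: wait for it and retry */
            _bt_relbuf(rel, nbuf);
            XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);
            continue;
        }

        /*
         * Otherwise either there is no conflicting entry (this insert can
         * proceed) or a visible conflicting tuple was found and a
         * uniqueness violation will be reported; either way only one
         * inserter wins.
         */
        break;
    }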

> With this new mechanism, do we have two types of search interface,
> where one works on keys (as now) and the other on key-ctid, or will it
> just be a single interface that works both ways?  I think either way
> there are pros and cons.

The patch doesn't add another interface yet, but the idea is to add
one. The implementation will probably fall on a common function that
can do both, but some index AMs can't implement the key-ctid operation
(hash indexes), so there will have to be separate interfaces to
properly represent the separate capabilities (besides, other
implementations may well use completely different code paths for the
two operations).
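
To make that concrete, here is a purely hypothetical sketch of what the
two entry points could look like; the names and signatures below are
illustrative only and are not part of the patch:

    /* existing style: position on the first index tuple matching the key */
    extern bool index_search_key(Relation irel, ScanKey keys, int nkeys);

    /*
     * proposed addition: position on an exact (key, ctid) pair; AMs that
     * cannot support it (e.g. hash) would simply not provide this entry
     * point.
     */
    extern bool index_search_key_ctid(Relation irel, ScanKey keys, int nkeys,
                                      ItemPointer ctid);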

> Also, in the thread below you are talking about using the last bit in
> t_info.  I want to bring to your notice that there is a patch of mine
> [1] in which I have used it to avoid breaking on-disk compatibility of
> hash indexes.  I am not saying that we should not plan it for some
> other use, just that we should keep in mind that there are other
> proposals to use it.  We can decide the best way to proceed once we are
> aware of all the proposals that want to use that bit.
>
>
> [1] - 
> https://www.postgresql.org/message-id/caa4ek1lkq_udism-z2dq6cuvjh3db5fnfnnezzbpsrjw0ha...@mail.gmail.com

I wasn't aware of that particular thread, but there's another one (the
WARM thread) with yet another proposed use of that bit.

IMHO, any use of the bit that doesn't leave room for further extension
of the bit space is a bad idea, because it closes the door on (easily)
adding further flags. And the fact that there are already several
proposals to use that bit for different purposes only reinforces that
idea.
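
For reference, the index tuple header as it stands (roughly, from
src/include/access/itup.h) has exactly one unused bit in t_info, which
is why all of these proposals end up competing for it:

    typedef struct IndexTupleData
    {
        ItemPointerData t_tid;      /* reference TID to heap tuple */

        /*
         * t_info layout:
         *   15th (high) bit: has nulls
         *   14th bit:        has var-width attributes
         *   13th bit:        unused  <-- the bit being discussed here
         *   bits 12-0:       size of tuple
         */
        unsigned short  t_info;
    } IndexTupleData;

    #define INDEX_SIZE_MASK 0x1FFF
    /* bit 0x2000 is not used at present */
    #define INDEX_VAR_MASK  0x4000
    #define INDEX_NULL_MASK 0x8000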

The idea I'm working on allows further 

[HACKERS] replication slots replicated to standbys?

2016-08-19 Thread Bruce Momjian
Someone reported that a replication slot that existed at the time a base
backup was done on the master was copied to the standby.  Because they
didn't realize it, their WAL was not being recycled on the standby.

Is that possible?  Is it a known behavior?  I don't see it documented.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLUSTER, reform_and_rewrite_tuple(), and parallelism

2016-08-19 Thread Amit Kapila
On Thu, Aug 18, 2016 at 4:42 AM, Peter Geoghegan  wrote:
>
> Does anyone have any ideas on how to:
>
> 1). Directly address the reform_and_rewrite_tuple() bottleneck.
>
> and/or:
>
> 2). Push down some or all of the reform_and_rewrite_tuple() work till
> before tuples are passed to the tuplesort.
>
> "2" would probably make it straightforward to have
> reform_and_rewrite_tuple() work occur in parallel workers instead,
> which buys us a lot.
>

I am not aware of what exactly you have in mind to parallelize the
CLUSTER command, but anything that is CPU-intensive is probably a good
bet to push down to workers (provided it is safe).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: About CMake v2

2016-08-19 Thread Christian Convey
Hi Yury,


On Fri, Aug 19, 2016 at 9:46 AM, Yury Zhuravlev wrote:
> Christian Convey wrote:
>>
>> I'm interested in helping with your CMake effort.  I don't have any
>> experience contributing to PG, but I do have some free time at the
>> moment.  Please let me know if I can help.
>
> I'm glad to hear it. I suppose you can just try to build postgres and send
> all problems to the github tracker.
> https://github.com/stalkerg/postgres_cmake/issues

Thanks, I'll be happy to do that.  There's been a lot of discussion on
this thread regarding the minimum required CMake version.  If you'd
like me to test with a particular version (or versions) of CMake,
please let me know.

- Christian


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers