Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Nico Williams
On Thu, Jun 21, 2018 at 10:05:41AM +0900, Masahiko Sawada wrote:
> On Thu, Jun 21, 2018 at 6:57 AM, Nico Williams  wrote:
> > On Wed, Jun 20, 2018 at 05:16:46PM -0400, Bruce Momjian wrote:
> >> On Mon, Jun 18, 2018 at 12:29:57PM -0500, Nico Williams wrote:
> >> > Note that unless the pg_catalog is protected against manipulation by
> >> > remote storage, then TDE for user tables might be possible to
> >> > compromise.  Like so: the attacker manipulates the pg_catalog to
> >> > escalate privilege in order to obtain the TDE keys.  This argues for
> >> > full database encryption, not just specific tables or columns.  But
> >> > again, this is for the threat model where the storage is the threat.
> >>
> >> Yes, one big problem with per-column encryption is that administrators
> >> can silently delete data, though they can't add or modify it.
> >
> > They can also re-add ("replay") deleted values; this can only be
> > defeated by also binding TX IDs or alike in the ciphertext.  And if you
> > don't bind the encrypted values to the PKs then they can add any value
> > they've seen to different rows.
> 
> I think we could avoid that with the implementation. If we implement
> per-column encryption by putting all encrypted columns out into another
> table, like a TOAST table, and encrypting that whole external table,
> then we can do per-column encryption without such concerns. Also, that
> way we can encrypt data at disk I/O time even when we use per-column
> encryption, which would give better performance. A downside of this idea
> is the extra overhead of accessing an encrypted column, but it would be
> predictable since we have TOAST.

The case we were discussing was one where the threat model is that the
DBAs are the threat.  It is only in that case that the replay,
cut-n-paste, and silent deletion attacks are relevant.  Encrypting a
table, or the whole DB, on the server side, does nothing to protect
against that threat.
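The binding Nico describes can be sketched with a MAC that covers the row's primary key and the writing transaction's ID. This is illustrative Python, not PostgreSQL code; all names are invented, the "ciphertext" is just random bytes, and a real system would use an AEAD cipher with (pk, xid) as associated data rather than a bare HMAC:

```python
# Sketch: defeating replay and cut-and-paste by binding each encrypted
# value to its primary key and transaction ID with a MAC.
import hmac, hashlib, secrets

MAC_KEY = secrets.token_bytes(32)

def seal(pk: int, xid: int, ciphertext: bytes) -> bytes:
    # The tag covers the row's PK and the writing TX ID, so the same
    # ciphertext cannot verify under a different row or transaction.
    msg = pk.to_bytes(8, "big") + xid.to_bytes(8, "big") + ciphertext
    return hmac.new(MAC_KEY, msg, hashlib.sha256).digest()

def verify(pk: int, xid: int, ciphertext: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(seal(pk, xid, ciphertext), tag)

ct = secrets.token_bytes(16)          # stands in for an encrypted value
tag = seal(pk=1, xid=1000, ciphertext=ct)
assert verify(1, 1000, ct, tag)       # legitimate row verifies
assert not verify(2, 1000, ct, tag)   # cut-and-paste to another row fails
assert not verify(1, 1001, ct, tag)   # replay under a different TX fails
```

Note this only lets the client *detect* tampering; it does nothing against an attacker who can read the MAC key, which is exactly why the threat model matters.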

Never lose track of the threat model.

Nico
-- 



Re: Speedup of relation deletes during recovery

2018-06-20 Thread Michael Paquier
On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote:
> On 2018-06-18 11:13:47 -0700, Andres Freund wrote:
>> We could do that - but add_to_unowned_list() is actually a bottleneck in
>> other places during recovery too. We pretty much never (outside of
>> dropping relations / databases) close opened relations during recovery -
>> which is obviously problematic since nearly all of the entries are
>> unowned.  I've come to the conclusion that we should have a global
>> variable that just disables adding anything to the global lists.
> 
> On second thought: I think we should use your approach in the back branches,
> and something like what I'm suggesting in master once it's open.

+1.  Let's also make sure that the removal of smgrdounlink and
smgrdounlinkfork happens only on master.
--
Michael




Re: PATCH: backtraces for error messages

2018-06-20 Thread Michael Paquier
On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> I wrote it because I got sick of Assert(false) debugging, and I was chasing
> down some "ERROR:  08P01: insufficient data left in message" errors. Then I
> got kind of caught up in it... you know how it is.

Yes, I know that feeling!  I have also used the Assert(false) and
upgrade-to-PANIC tricks a couple of times, so being able to get
backtraces more easily would be really nice.

> It also goes to show there are plenty of places you might want to get a
> stack where you don't have an internal errcode or panic. I don't think that
> idea will fly.

Yep.  Error message uniqueness can help, but getting the call stack
leads to a better understanding of the problem.
--
Michael




Re: PATCH: backtraces for error messages

2018-06-20 Thread Craig Ringer
This is what the stacks look like, btw:


[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] DEBUG:  0:
find_in_dynamic_libpath: trying
"/home/craig/pg/10/lib/postgresql/pglogical.so"
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] LOCATION:
find_in_dynamic_libpath, dfmgr.c:639
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] STACK:
FRAME    0: find_in_dynamic_libpath +628
FRAME    1: expand_dynamic_library_name +205
FRAME    2: load_external_function +38
FRAME    3: LookupBackgroundWorkerFunction +197
FRAME    4: StartBackgroundWorker +549
FRAME    5: do_start_bgworker +466
FRAME    6: maybe_start_bgworkers +382
FRAME    7: reaper +895
FRAME    8: funlockfile +80
FRAME    9: __select +23
FRAME   10: ServerLoop +394
FRAME   11: PostmasterMain +4499

I wrote it because I got sick of Assert(false) debugging, and I was chasing
down some "ERROR:  08P01: insufficient data left in message" errors. Then I
got kind of caught up in it... you know how it is.

It also goes to show there are plenty of places you might want to get a
stack where you don't have an internal errcode or panic. I don't think that
idea will fly.


Re: Speedup of relation deletes during recovery

2018-06-20 Thread Andres Freund
On 2018-06-18 11:13:47 -0700, Andres Freund wrote:
> On 2018-06-19 03:06:54 +0900, Fujii Masao wrote:
> > On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund  wrote:
> > > Hi,
> > >
> > > On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
> > >> > +
> > >> > +   srels = palloc(sizeof(SMgrRelation) * ndelrels);
> > >> > for (i = 0; i < ndelrels; i++)
> > >> > -   {
> > >> > -   SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
> > >> > +   srels[i] = smgropen(delrels[i], InvalidBackendId);
> > >> >
> > >> > -   smgrdounlink(srel, false);
> > >> > -   smgrclose(srel);
> > >> > -   }
> > >> > +   smgrdounlinkall(srels, ndelrels, false);
> > >> > +
> > >> > +   for (i = 0; i < ndelrels; i++)
> > >> > +   smgrclose(srels[i]);
> > >> > +   pfree(srels);
> > >
> > > Hm. This will probably cause another complexity issue. If we just
> > > smgropen() the relation will be unowned. Which means smgrclose() will
> > > call remove_from_unowned_list(), which is O(open-relations). Which means
> > > this change afaict creates a new O(ndrels^2) behaviour?
> > >
> > > It seems like the easiest fix for that would be to have a local
> > > SMgrRelationData in that loop, that we assign the relations to?  That's
> > > a bit hacky, but afaict should work?
> > 
> > The easier (but tricky) fix for that is to call smgrclose() for
> > each SMgrRelation in the reverse order. That is, we should do
> > 
> >for (i = ndelrels - 1; i >= 0; i--)
> >smgrclose(srels[i]);
> > 
> > instead of
> > 
> >for (i = 0; i < ndelrels; i++)
> >smgrclose(srels[i]);
> 
> We could do that - but add_to_unowned_list() is actually a bottleneck in
> other places during recovery too. We pretty much never (outside of
> dropping relations / databases) close opened relations during recovery -
> which is obviously problematic since nearly all of the entries are
> unowned.  I've come to the conclusion that we should have a global
> variable that just disables adding anything to the global lists.

On second thought: I think we should use your approach in the back branches,
and something like what I'm suggesting in master once it's open.
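Fujii-san's reverse-order trick quoted above works because (assuming, as in the smgr code of that era) a newly opened unowned relation is pushed onto the head of a singly linked list, so closing in reverse open order always removes the current head. A toy model with invented names, counting cells visited on removal:

```python
# Toy model of the unowned-relation list: add() pushes at the head
# (like add_to_unowned_list()); remove() scans from the head (like
# remove_from_unowned_list()).
class UnownedList:
    def __init__(self):
        self.cells = []        # index 0 is the list head
        self.scan_steps = 0    # total cells visited by remove()

    def add(self, rel):
        self.cells.insert(0, rel)

    def remove(self, rel):
        for i, r in enumerate(self.cells):
            self.scan_steps += 1
            if r == rel:
                del self.cells[i]
                return
        raise KeyError(rel)

n = 1000
fwd = UnownedList()
for i in range(n):
    fwd.add(i)
for i in range(n):             # close in open order: scan to the tail each time
    fwd.remove(i)

rev = UnownedList()
for i in range(n):
    rev.add(i)
for i in reversed(range(n)):   # close in reverse order: target is always the head
    rev.remove(i)

assert rev.scan_steps == n                   # O(n) total work
assert fwd.scan_steps == n * (n + 1) // 2    # O(n^2) total work
```

Which also shows why a global "don't track unowned relations during recovery" switch sidesteps the problem entirely.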

Greetings,

Andres Freund



Re: PATCH: backtraces for error messages

2018-06-20 Thread Craig Ringer
On 21 June 2018 at 01:15, Andres Freund  wrote:

> On 2018-06-20 13:10:57 -0400, Robert Haas wrote:
> > On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund 
> wrote:
> > > If we instead had a backtrace enabled for all PANICs and some FATALs by
> > > default (and perhaps a SIGSEGV handler too), we'd be in a better
> > > place. That'd obviously only work when compiled with support for
> > > libraries, on platforms where we added support for that. But I think
> > > that'd be quite the improvement already.
> >
> > I think doing it on PANIC would be phenomenal.  SIGSEGV would be great
> > if we can make it safe enough, which I'm not sure about, but then I
> > suppose we're crashing anyway.
>
> Yea, I think that's pretty much why it'd be OK.


Yep. At worst we crash again while trying to generate a bt. We're not doing
anything particularly exciting, and it should be sensible enough.

> Instead of making the ERROR behavior conditional on
> > log_error_verbosity as Craig has it now, how about doing it whenever
> > the error code is ERRCODE_INTERNAL_ERROR?  That's pretty much the
> > cases that aren't supposed to happen, so if we see those happening a
> > lot, it's either a bug we need to fix or we should supply a better
> > error code.  Also, a lot of those messages are duplicated in many
> > places and/or occur inside fairly generic functions inside
> > lsyscache.c, so the actual error message text tends not to be enough
> > to know what happened.
>
> I don't think that's ok. It's perfectly possible to hit
> ERRCODE_INTERNAL_ERROR at a high frequency in some situations, and
> there's plenty cases that aren't ERRCODE_INTERNAL_ERROR where we'd want
> this.


Perhaps we should fix those, but it might be a game of whack-a-mole as the
code changes, and inevitably you'll want to generate stacks for some other
errcode while getting frustrated at all the ERRCODE_INTERNAL_ERROR. Not
sure it's worth it.

However, maybe a  GUC like

log_error_stacks_errcodes = 'XX000, 55000'

would work. It'd be somewhat expensive to evaluate, but we'd only be doing
it where we'd already decided to emit an error. And it'd fit in even if we
later added smarter selective logging.
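The proposed GUC could work roughly like this sketch: parse the list once at assign time, then attach a stack only when the error's SQLSTATE is in the set. The GUC name and semantics are the proposal above, not an existing PostgreSQL feature, and the Python-level stack here merely stands in for the C backtrace() output:

```python
# Sketch of log_error_stacks_errcodes: attach a backtrace to a log
# entry only when the error's SQLSTATE is in the configured set.
import traceback

log_error_stacks_errcodes = {"XX000", "55000"}   # parsed at GUC assign time

def emit_error(sqlstate: str, message: str) -> str:
    entry = f"ERROR:  {sqlstate}: {message}"
    if sqlstate in log_error_stacks_errcodes:
        # Only pay the backtrace cost once we know this error qualifies.
        entry += "\nSTACK:\n" + "".join(traceback.format_stack())
    return entry

assert "STACK:" in emit_error("XX000", "cache lookup failed")
assert "STACK:" not in emit_error("23505", "duplicate key value")
```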

BTW, it's worth noting that these backtraces are very limited. They don't
report arguments or locals. So it's still no substitute for suitable
errcontext callbacks, and sometimes it's still necessary to fall back to
gdb or messing around with perf userspace tracepoints.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Speedup of relation deletes during recovery

2018-06-20 Thread Thomas Munro
On Tue, Jun 19, 2018 at 6:13 AM, Fujii Masao  wrote:
> On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev  wrote:
>>> We just had a customer hit this issue. I kind of wonder whether this
>>> shouldn't be backpatched: Currently the execution on the primary is
>>> O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
>>> standby - with a lot higher constants to boot.  That means it's very
>>> easy to get into situations where the standy starts to lag behind very
>>> significantly.
>>
>> +1, we faced that too
>
> +1 to back-patch. As Horiguchi-san pointed out, this is basically
> the fix for oversight of commit 279628a0a7, not new feature.

+1

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Keeping temporary tables in shared buffers

2018-06-20 Thread Amit Kapila
On Wed, Jun 20, 2018 at 8:47 PM, Bruce Momjian  wrote:
> On Sat, Jun  2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
>> Hi Amit
>>
>> On Mon, May 28, 2018 at 4:25 AM, Amit Kapila  wrote:
>> >
>> > This is one way, but I think there are other choices as well.  We can
>> > identify and flush all the dirty (local) buffers for the relation
>> > being accessed parallelly.  Now, once the parallel operation is
>> > started, we won't allow performing any write operation on them.  It
>>
>> We talked about this in person in Ottawa and it was great meeting you!
>>  To summarize, the above proposal to continue using local buffers for
>> temp tables is a step forward, however, it enables only certain kinds
>> of queries to be parallelized for temp tables.  E.g. queries changing
>> a temp table in any way cannot be parallelized due to the restriction
>> of no writes during parallel operation.
>
> Should this be a TODO item?
>

+1.  I think we have not hammered out the design completely, but if
somebody is willing to put in the effort, it is not an unsolvable problem.
AFAIU, this thread is about parallelizing queries that refer to temp
tables; however, that is not clear from the title of this thread.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Masahiko Sawada
On Thu, Jun 21, 2018 at 6:57 AM, Nico Williams  wrote:
> On Wed, Jun 20, 2018 at 05:16:46PM -0400, Bruce Momjian wrote:
>> On Mon, Jun 18, 2018 at 12:29:57PM -0500, Nico Williams wrote:
>> > Note that unless the pg_catalog is protected against manipulation by
>> > remote storage, then TDE for user tables might be possible to
>> > compromise.  Like so: the attacker manipulates the pg_catalog to
> >> > escalate privilege in order to obtain the TDE keys.  This argues for
>> > full database encryption, not just specific tables or columns.  But
>> > again, this is for the threat model where the storage is the threat.
>>
>> Yes, one big problem with per-column encryption is that administrators
>> can silently delete data, though they can't add or modify it.
>
> They can also re-add ("replay") deleted values; this can only be
> defeated by also binding TX IDs or alike in the ciphertext.  And if you
> don't bind the encrypted values to the PKs then they can add any value
> they've seen to different rows.

I think we could avoid that with the implementation. If we implement
per-column encryption by putting all encrypted columns out into another
table, like a TOAST table, and encrypting that whole external table,
then we can do per-column encryption without such concerns. Also, that
way we can encrypt data at disk I/O time even when we use per-column
encryption, which would give better performance. A downside of this idea
is the extra overhead of accessing an encrypted column, but it would be
predictable since we have TOAST.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Fast default stuff versus pg_upgrade

2018-06-20 Thread Andres Freund
Hi,

On 2018-06-20 20:53:34 -0400, Andrew Dunstan wrote:
> This version adds a lock on the table owning the attribute.

Cool.
>  
>   /*
> +  * in binary upgrade mode, update the catalog with any missing 
> values
> +  * that might be present.
> +  */
> + if (dopt->binary_upgrade)
> + {
> + for (j = 0; j < tbinfo->numatts; j++)
> + {
> + if (tbinfo->atthasmissing[j])
> + {
> +				appendPQExpBufferStr(q, "\n-- set missing value.\n");
> +				appendPQExpBufferStr(q,
> +									 "SELECT pg_catalog.binary_upgrade_set_missing_value(");
> +				appendStringLiteralAH(q,qualrelname, fout);

missing space.  Probably couldn't hurt to run the changed files through
pgindent and fix the damage...

> +				appendPQExpBufferStr(q, "::pg_catalog.regclass,");
> +				appendStringLiteralAH(q, tbinfo->attnames[j], fout);
> +				appendPQExpBufferStr(q,",");

same.


> +				appendStringLiteralAH(q, tbinfo->attmissingval[j], fout);
> +				appendPQExpBufferStr(q,");\n\n");

same.


Looks reasonable to me, but I've not tested it.

Greetings,

Andres Freund



Re: Fast default stuff versus pg_upgrade

2018-06-20 Thread Andrew Dunstan



On 06/20/2018 01:46 PM, Andrew Dunstan wrote:




Attached deals with all those issues except locking.




This version adds a lock on the table owning the attribute.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/utils/adt/pg_upgrade_support.c b/src/backend/utils/adt/pg_upgrade_support.c
index 0c54b02..2480fa0 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -11,13 +11,20 @@
 
 #include "postgres.h"
 
+#include "access/heapam.h"
+#include "access/htup_details.h"
 #include "catalog/binary_upgrade.h"
+#include "catalog/indexing.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
 #include "commands/extension.h"
 #include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
 
 
 #define CHECK_IS_BINARY_UPGRADE	\
@@ -192,3 +199,63 @@ binary_upgrade_set_record_init_privs(PG_FUNCTION_ARGS)
 
 	PG_RETURN_VOID();
 }
+
+Datum
+binary_upgrade_set_missing_value(PG_FUNCTION_ARGS)
+{
+	Oid			table_id = PG_GETARG_OID(0);
+	text	   *attname = PG_GETARG_TEXT_P(1);
+	text	   *value = PG_GETARG_TEXT_P(2);
+	Datum		valuesAtt[Natts_pg_attribute];
+	bool		nullsAtt[Natts_pg_attribute];
+	bool		replacesAtt[Natts_pg_attribute];
+	Datum		missingval;
+	Form_pg_attribute attStruct;
+	Relation	attrrel, tablerel;
+	HeapTuple	atttup, newtup;
+	char	   *cattname = text_to_cstring(attname);
+	char	   *cvalue = text_to_cstring(value);
+
+	CHECK_IS_BINARY_UPGRADE;
+
+	/* lock the table the attribute belongs to */
+	tablerel = heap_open(table_id, AccessExclusiveLock);
+
+	/* Lock the attribute row and get the data */
+	attrrel = heap_open(AttributeRelationId, RowExclusiveLock);
+	atttup = SearchSysCacheAttName(table_id, cattname);
+	if (!HeapTupleIsValid(atttup))
+		elog(ERROR, "cache lookup failed for attribute %s of relation %u",
+			 cattname, table_id);
+	attStruct = (Form_pg_attribute) GETSTRUCT(atttup);
+
+	/* get an array value from the value string */
+
+	/* find the io info for an arbitrary array type to get array_in Oid */
+	missingval = OidFunctionCall3(
+		F_ARRAY_IN,
+		CStringGetDatum(cvalue),
+		ObjectIdGetDatum(attStruct->atttypid),
+		Int32GetDatum(attStruct->atttypmod));
+
+	/* update the tuple - set atthasmissing and attmissingval */
+	MemSet(valuesAtt, 0, sizeof(valuesAtt));
+	MemSet(nullsAtt, false, sizeof(nullsAtt));
+	MemSet(replacesAtt, false, sizeof(replacesAtt));
+
+	valuesAtt[Anum_pg_attribute_atthasmissing - 1] = BoolGetDatum(true);
+	replacesAtt[Anum_pg_attribute_atthasmissing - 1] = true;
+	valuesAtt[Anum_pg_attribute_attmissingval - 1] = missingval;
+	replacesAtt[Anum_pg_attribute_attmissingval - 1] = true;
+
+	newtup = heap_modify_tuple(atttup, RelationGetDescr(attrrel),
+			   valuesAtt, nullsAtt, replacesAtt);
+	CatalogTupleUpdate(attrrel, &newtup->t_self, newtup);
+
+	/* clean up */
+	ReleaseSysCache(atttup);
+	heap_close(attrrel, RowExclusiveLock);
+	heap_close(tablerel, AccessExclusiveLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ea2f022..22be3c2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8095,6 +8095,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 	int			i_typstorage;
 	int			i_attnotnull;
 	int			i_atthasdef;
+	int			i_atthasmissing;
 	int			i_attidentity;
 	int			i_attisdropped;
 	int			i_attlen;
@@ -8103,6 +8104,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 	int			i_attoptions;
 	int			i_attcollation;
 	int			i_attfdwoptions;
+	int			i_attmissingval;
 	PGresult   *res;
 	int			ntups;
 	bool		hasdefaults;
@@ -8132,7 +8134,33 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 
 		resetPQExpBuffer(q);
 
-		if (fout->remoteVersion >= 10)
+		if (fout->remoteVersion >= 11)
+		{
+			/* atthasmissing and attmissingval are new in 11 */
+			appendPQExpBuffer(q, "SELECT a.attnum, a.attname, a.atttypmod, "
+			  "a.attstattarget, a.attstorage, t.typstorage, "
+			  "a.attnotnull, a.atthasdef, a.attisdropped, "
+			  "a.attlen, a.attalign, a.attislocal, "
+			  "pg_catalog.format_type(t.oid,a.atttypmod) AS atttypname, "
+			  "array_to_string(a.attoptions, ', ') AS attoptions, "
+			  "CASE WHEN a.attcollation <> t.typcollation "
+			  "THEN a.attcollation ELSE 0 END AS attcollation, "
+			  "a.atthasmissing, a.attidentity, "
+			  "pg_catalog.array_to_string(ARRAY("
+			  "SELECT pg_catalog.quote_ident(option_name) || "
+			  "' ' || pg_catalog.quote_literal(option_value) "
+			  "FROM pg_catalog.pg_options_to_table(attfdwoptions) "
+			  "ORDER BY option_name"
+			  "), E',\n') AS attfdwoptions ,"
+			  "a.attmissingval "
+			  "FROM 

RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Tsunakawa, Takayuki
From: Bruce Momjian [mailto:br...@momjian.us]
> On Fri, May 25, 2018 at 08:41:46PM +0900, Moon, Insung wrote:
> > BTW, I want to support CBC mode encryption[3]. However, I'm not sure how
> to use the IV in CBC mode for this proposal.
> > I'd like to hear opinions by security engineer.
> 
> Well, CBC makes sense, and since AES uses a 16 byte block size, you
> would start with the initialization vector (IV) and run over the 8k page
> 512 times.  The IV can be any random value that is not repeated, and
> does not need to be secret.

XTS is faster and more secure.  XTS seems to be the standard now:

https://www.truecrypt71a.com/documentation/technical-details/encryption-scheme/
"c. Mode of operation: XTS, LRW (deprecated/legacy), CBC (deprecated/legacy)"

Microsoft Introduces AES-XTS to BitLocker in Windows 10 Version 1511
https://www.petri.com/microsoft-introduces-aes-xts-to-bitlocker-in-windows-10-version-1511


> However, using the same IV for the entire table would mean that people
> can detect if two pages in the same table contain the same data.  You
> might care about that, or you might not.  It would prevent detection of
> two _tables_ containing the same 8k page.  A more secure solution would
> be to use a different IV for each 8k page.
> 
> The cleanest idea would be for the per-table IV to be stored per table,
> but the IV used for each block to be a mixture of the table's IV and the
> page's offset in the table.

TrueCrypt uses the 8-byte sector number for the 16-byte tweak value for XTS 
when encrypting each sector.  Maybe we can just use the page number.
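The tweak derivation being suggested can be sketched in a few lines: the page number, as an 8-byte little-endian integer zero-padded to the 16-byte tweak block, mirroring how TrueCrypt derives its per-sector tweak. This shows only the tweak derivation; the XTS encryption itself would come from a crypto library, and the layout here is an assumption, not a settled design:

```python
# Sketch: per-page XTS tweak derived from the page number, TrueCrypt-style.
import struct

TWEAK_LEN = 16   # one AES block; XTS tweaks are one cipher block wide

def page_tweak(page_no: int) -> bytes:
    # 8-byte little-endian page number in the low half, zero high half.
    return struct.pack("<Q", page_no) + b"\x00" * (TWEAK_LEN - 8)

assert page_tweak(0) == b"\x00" * 16
assert page_tweak(1)[:8] == b"\x01" + b"\x00" * 7
assert len(page_tweak(2**32)) == 16
assert page_tweak(5) != page_tweak(6)   # distinct pages, distinct tweaks
```

Because the tweak is deterministic per page, identical plaintext in two different pages encrypts differently without storing any per-page IV.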


Regards
Takayuki Tsunakawa






RE: PATCH: backtraces for error messages

2018-06-20 Thread Tsunakawa, Takayuki
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> I have no idea how expensive backtrace() and libunwind are, but I doubt
> we want to pay the cost for every message before we know that error
> requires the backtrace to be collected.  Something like PGC_USERSET
>   server_min_backtraces=PANIC
> might be a possible interface.

+1

In addition, it would be nicer if we could have some filtering based on the 
error condition.  Sometimes I wanted to know where the "out of memory" or 
"invalid memory alloc request size ..." was thrown.

How about adding a GUC like the -e option of strace?  strace allows you to 
specify which system calls, or groups of them, to capture.  We could use the SQL 
state and class code/name to list the conditions to capture.  In the 
out-of-memory case, I want to specify 53200, but I don't want to get stack 
traces for other errors in class 53.

https://www.postgresql.org/docs/devel/static/errcodes-appendix.html
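The strace-style filtering asked for here can be sketched as follows: an entry in the capture list is either a full five-character SQLSTATE ("53200", exact match) or a two-character class ("53", matching every code in that class). The GUC itself is hypothetical; this only illustrates the matching rule:

```python
# Sketch: match a SQLSTATE against a capture list of exact codes and
# two-character class prefixes.
def parse_capture_list(spec: str):
    return [tok.strip() for tok in spec.split(",") if tok.strip()]

def should_capture(sqlstate: str, entries) -> bool:
    for entry in entries:
        if len(entry) == 2 and sqlstate.startswith(entry):
            return True          # class match, e.g. "53" -> all of 53xxx
        if entry == sqlstate:
            return True          # exact match, e.g. "53200"
    return False

entries = parse_capture_list("53200")
assert should_capture("53200", entries)      # out_of_memory captured
assert not should_capture("53100", entries)  # disk_full in class 53: not captured

entries = parse_capture_list("53")
assert should_capture("53100", entries)      # whole class captured
```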

Regards
Takayuki Tsunakawa







Re: changing xpath() and xpath_exists()

2018-06-20 Thread Andrew Dunstan




On 06/20/2018 07:35 PM, Alvaro Herrera wrote:

Hello

Per discussion at
https://postgr.es/m/0684a598-002c-42a2-ae12-f024a324e...@winand.at
I intend to change xpath() and xpath_exists() in a subtly backwards-
incompatible way, so that they match the behavior of SQL-standard-
specified XMLTABLE.  It seems sane to keep them in sync.  Anybody wants
to object so that we keep the historical functions alone and only change
XMLTABLE?

Previously, this query would return empty:
SELECT xpath('root', '<root/>');

  xpath
───
  {}
(1 fila)

After this patch, it will return the root node:

xpath
───
  {<root/>}
(1 fila)


Note that if an absolute path is used, the behavior is unchanged:

alvherre=# SELECT xpath('/root', '<root/>');
xpath
───
  {<root/>}
(1 fila)

Please speak up if you think the former behavior should be kept.




So the existing behaviour is looking for a child of the root node called 
'root'? I agree that seems broken.


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: partition -> partitioned

2018-06-20 Thread Amit Langote
On 2018/06/21 0:45, Alvaro Herrera wrote:
> On 2018-Jun-19, Amit Langote wrote:
> 
>> Noticed that the relevant code changed, so I rebased the patch.  Also,
>> made a minor update to a nearby comment.
> 
> Pushed, thanks.  I made a couple of comments one or two words shorter
> while (IMO) not losing clarity.

Thank you.

Regards,
Amit




Re: Avoiding Tablespace path collision for primary and standby

2018-06-20 Thread Craig Ringer
On Thu., 21 Jun. 2018, 04:30 Tom Lane,  wrote:

> Ashwin Agrawal  writes:
> > Okay just bouncing another approach, how about generating UUID for a
> > postgres instance during initdb and pg_basebackup ?
>
> There's no uuid generation code in core postgres, for excellent reasons
> (lack of portability and lack of failure modes are the main objections).
> This is not different in any meaningful way from the proposal to use
> timestamps, except for being more complicated.


A v4 UUID is just 128 random bits and some simple formatting. So I really
don't understand your concerns about UUID generation.
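Craig's point is easy to make concrete: a v4 UUID is 16 bytes of randomness with six fixed version/variant bits and some hyphenation. A sketch that builds one from raw random bytes and checks it against the stdlib's parser:

```python
# Sketch: a v4 UUID is 128 random bits plus fixed version/variant bits.
import os
import uuid

def gen_uuid4() -> str:
    b = bytearray(os.urandom(16))   # 128 random bits
    b[6] = (b[6] & 0x0F) | 0x40     # version nibble = 4
    b[8] = (b[8] & 0x3F) | 0x80     # variant bits = 10xx (RFC 4122)
    h = b.hex()
    return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"

u = gen_uuid4()
parsed = uuid.UUID(u)
assert parsed.version == 4
assert len(u) == 36
```

The only real dependency is a source of random bytes, which is presumably the portability concern being debated.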

That said, it can already be handled with tablespace maps in pg_basebackup.
And any new scheme would need to happen in pg_basebackup too, because it
must happen before the tablespaces are copied and the replica is first
started.

I don't see a big concern with some pg_basebackup --gen-unique-tablespaces
option or the like.

UUID would be better than timestamp due to the skew issues discussed
upthread. But personally I'd just take a label argument. pg_basebackup
--tablespace-prefix or the like.

For non-pg_basebackup uses you have to solve it yourself anyway. Pg doesn't
know if it's just been started as a copy, after all, and it's too late to
move tablespaces then even if we'd do such a thing.


changing xpath() and xpath_exists()

2018-06-20 Thread Alvaro Herrera
Hello

Per discussion at
https://postgr.es/m/0684a598-002c-42a2-ae12-f024a324e...@winand.at
I intend to change xpath() and xpath_exists() in a subtly backwards-
incompatible way, so that they match the behavior of SQL-standard-
specified XMLTABLE.  It seems sane to keep them in sync.  Anybody wants
to object so that we keep the historical functions alone and only change
XMLTABLE?

Previously, this query would return empty:
SELECT xpath('root', '<root/>');

 xpath 
───
 {}
(1 fila)

After this patch, it will return the root node:

   xpath   
───
 {<root/>}
(1 fila)


Note that if an absolute path is used, the behavior is unchanged:

alvherre=# SELECT xpath('/root', '<root/>');
   xpath   
───
 {<root/>}
(1 fila)

Please speak up if you think the former behavior should be kept.

-- 
Álvaro Herrera



Re: Excessive CPU usage in StandbyReleaseLocks()

2018-06-20 Thread Andres Freund
Hi,

On 2018-06-21 00:25:03 +1200, David Rowley wrote:
> On 19 June 2018 at 17:43, Thomas Munro  wrote:
> > The problem is that StandbyReleaseLocks() does a linear search of all
> > known AccessExclusiveLocks when a transaction ends.  Luckily, since
> > v10 (commit 9b013dc2) that is skipped for transactions that haven't
> > taken any AELs and aren't using 2PC, but that doesn't help all users.
> 
> Good to see this getting fixed.  My original patch [1] to fix this was
> more along the lines of yours

From that discussion I don't really understand why that wasn't pursued
further. The revision committed clearly was just continuing to use the
wrong datastructure, and had obvious issues with complexity, just in a
somewhat narrower situation?


> only I partitioned the List in an array indexed by the xid mod size of
> array.  I had done this as I thought it would be faster than a hash
> table and would likely see the locks spread evenly over the table.
> IIRC Andres complained and said I should use a hashtable, which I see
> you've done.

It's hard to believe the hashtable is a meaningful bottleneck here. The
primary also uses a hashtable, except it's partitioned & shared, and
thus protected by locks. Also much larger. So it's hard to believe that
we'd need a custom built datastructure to speedup replay...
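The data structure under discussion boils down to a hash table keyed by xid, each entry holding that transaction's AccessExclusiveLocks, so releasing one transaction's locks touches only its own entry instead of scanning every known lock. A sketch with invented names, not the actual recovery-lock code:

```python
# Sketch: per-xid hash of standby AccessExclusiveLocks. Release is one
# hash lookup, not a linear scan over all known locks.
from collections import defaultdict

recovery_locks = defaultdict(list)    # xid -> list of (dbOid, relOid)

def standby_acquire_lock(xid, db_oid, rel_oid):
    recovery_locks[xid].append((db_oid, rel_oid))

def standby_release_locks(xid):
    # pop() removes and returns this transaction's entry; other
    # transactions' locks are never visited.
    return recovery_locks.pop(xid, [])

for x in range(100):
    standby_acquire_lock(x, 1, 16384 + x)

released = standby_release_locks(42)
assert released == [(1, 16384 + 42)]
assert 42 not in recovery_locks
assert len(recovery_locks) == 99
```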

Greetings,

Andres Freund



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Nico Williams
On Wed, Jun 20, 2018 at 06:19:40PM -0400, Joe Conway wrote:
> On 06/20/2018 05:12 PM, Bruce Momjian wrote:
> > On Mon, Jun 18, 2018 at 11:06:20AM -0400, Joe Conway wrote:
> > Even if they are encrypted with the same key, they use different
> > initialization vectors that are stored inside the encrypted payload, so
> > you really can't identify much except the length, as Robert stated.

Definitely use different IVs, and don't reuse them (or use cipher modes
where IV reuse is not fatal).

> The more you encrypt with a single key, the more fuel you give to the
> person trying to solve for the key with cryptanalysis.

With modern 128-bit block ciphers in modern cipher modes you'd have to
encrypt enough data to make this not a problem.  On the other hand,
you'll still have other reasons to do key rotation.  Key rotation
ultimately means re-encrypting everything.  Getting all of this right is
very difficult.

So again, what's the threat model?  Because if it's sysadmins/DBAs
you're afraid of, there are better things to do.

Nico
-- 



Re: Excessive CPU usage in StandbyReleaseLocks()

2018-06-20 Thread Thomas Munro
Hi David,

On Thu, Jun 21, 2018 at 12:25 AM, David Rowley
 wrote:
> On 19 June 2018 at 17:43, Thomas Munro  wrote:
>> The problem is that StandbyReleaseLocks() does a linear search of all
>> known AccessExclusiveLocks when a transaction ends.  Luckily, since
>> v10 (commit 9b013dc2) that is skipped for transactions that haven't
>> taken any AELs and aren't using 2PC, but that doesn't help all users.
>
> Good to see this getting fixed.  My original patch [1] to fix this was
> more along the lines of yours, only I partitioned the List in an array
> indexed by the xid mod size of array.  I had done this as I thought it
> would be faster than a hash table and would likely see the locks
> spread evenly over the table.  IIRC Andres complained and said I
> should use a hashtable, which I see you've done.

Oh!  I hadn't seen that.  I studied 9b013dc2 but it didn't have a link
to your discussion so I didn't know that your original proposal was
completely different to the commit and essentially the same as what
I'm proposing.  I think this is a more fundamental fix to the problem
so I don't understand why it went into the weeds.  I checked if there
was anything I could salvage from your patch but I think HTAB is the
better way to go here.  I will update the commit message to credit you
for the original analysis and point back to your thread.

Looking at your thread, I see that you were also confused about this:

> Although I'm not all
> that sure in which case the function will actually be called with an
> invalid xid.

Anyone?  In my patch I keep that behaviour, but I think it'd be great
to get a comment there to explain why the case is needed.

>> Thoughts?
>
> I do know this is causing quite a bit of trouble out in the wilds. I'd
> be keen to not have to bear the brunt of any more of those troubles,
> but it'll ultimately depend on how risky a patch looks.

Cool.  I'm thinking of pushing this early next week after some more
testing, unless there are objections.  If you have any feedback or
would like to try to poke holes in it, I'd much appreciate it.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Joe Conway
On 06/20/2018 05:12 PM, Bruce Momjian wrote:
> On Mon, Jun 18, 2018 at 11:06:20AM -0400, Joe Conway wrote:
>>> At the same time, having to have a bunch of independently-decipherable
>>> short field values is not real secure either, especially if they're known
>>> to all be encrypted with the same key.  But what you know or can guess
>>> about the plaintext in such cases would be target-specific, rather than
>>> an attack that could be built once and used against any PG database.
>>
>> Again is dependent on the specific solution for encryption. In some
>> cases you might do something like generate a single use random key,
>> encrypt the payload with that, encrypt the single use key with the
>> "global" key, append the two results and store.
> 
> Even if they are encrypted with the same key, they use different
> initialization vectors that are stored inside the encrypted payload, so
> you really can't identify much except the length, as Robert stated.

The more you encrypt with a single key, the more fuel you give to the
person trying to solve for the key with cryptanalysis.

By encrypting only essentially random data (the single use keys,
generated with cryptographically strong random number generator) with
the "master key", and then encrypting the actual payloads (which are
presumably more predictable than the strong random single use keys), you
minimize the probability of someone cracking your master key and you
also minimize the damage caused by someone cracking one of the single
use keys.
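Joe's single-use-key (envelope encryption) scheme can be sketched in a few lines. This is a hypothetical illustration, not PG code: the `keystream_xor` toy cipher is a stand-in for a real cipher such as AES-GCM, used only to keep the sketch standard-library-only, and all names are invented for the example.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy XOR keystream built from SHA-256 -- NOT real crypto, just a
    stdlib-only stand-in for a proper cipher like AES-GCM."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def envelope_encrypt(master_key: bytes, payload: bytes) -> bytes:
    data_key = secrets.token_bytes(32)       # single-use random key
    key_nonce = secrets.token_bytes(16)
    payload_nonce = secrets.token_bytes(16)
    # encrypt the payload with the data key
    ciphertext = keystream_xor(data_key, payload_nonce, payload)
    # encrypt the single-use key with the "global" (master) key
    wrapped_key = keystream_xor(master_key, key_nonce, data_key)
    # append the two results and store
    return key_nonce + wrapped_key + payload_nonce + ciphertext

def envelope_decrypt(master_key: bytes, blob: bytes) -> bytes:
    key_nonce, wrapped_key = blob[:16], blob[16:48]
    payload_nonce, ciphertext = blob[48:64], blob[64:]
    data_key = keystream_xor(master_key, key_nonce, wrapped_key)
    return keystream_xor(data_key, payload_nonce, ciphertext)
```

The master key only ever encrypts strong random data keys, which is the cryptanalysis-limiting property Joe describes.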

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Joe Conway
On 06/20/2018 05:03 PM, Bruce Momjian wrote:
> On Wed, Jun 13, 2018 at 09:20:58AM -0400, Joe Conway wrote:
>> The idea has not been extensively fleshed out yet, but the thought was
>> that we create column level POLICY, which would transparently apply some
>> kind of transform on input and/or output. The transforms would
>> presumably be expressions, which in turn could use functions (extension
>> or builtin) to do their work. That would allow encryption/decryption,
>> DLP (data loss prevention) schemes (masking, redacting), etc. to be
>> applied based on the policies.
> 
> This is currently possible with stock Postgres as you can see from this
> and the following slides:
> 
>   http://momjian.us/main/writings/crypto_hw_use.pdf#page=77

That is definitely not the same thing. A column level POLICY would apply
an input and output transform expression over the column transparently
to the database user. That transform might produce, for example, a
different output depending on the logged-in user (certain users see the
entire field whereas others see a redacted or masked form, or certain
users get the decrypted result while others don't).

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Nico Williams
On Wed, Jun 20, 2018 at 06:06:56PM -0400, Joe Conway wrote:
> On 06/20/2018 05:09 PM, Bruce Momjian wrote:
> > On Mon, Jun 18, 2018 at 09:49:20AM -0400, Robert Haas wrote:
> >> know the ordering of the values under whatever ordering semantics
> >> apply to that index.  It's unclear to me how useful such information
> > 
> > I don't think an ordered index is possible, only indexing of encrypted
> > hashes, i.e. see this and the next slide:
> 
> It is possible with homomorphic encryption -- whether we want to support
> that in core is another matter.

It's also possible using DNSSEC NSEC3-style designs.

Nico
-- 



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Joe Conway
On 06/20/2018 05:05 PM, Bruce Momjian wrote:
> On Mon, Jun 18, 2018 at 08:29:32AM -0400, Joe Conway wrote:
>>>> Or
>>>> maybe key management is really tied into the separately discussed effort
>>>> to create SQL VARIABLEs somehow.
>>>
>>> Could you elaborate on how key management is tied into SQL VARIABLEs?
>>
>> Well, the key management probably is not, but the SQL VARIABLE might be
>> where the key is stored for use.
> 
> I disagree.  I would need to understand how an extension actually helps
> here, because it certainly limits flexibility compared to a shell
> command.

That flexibility the shell command gives you is also a huge hole from a
security standpoint.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Joe Conway
On 06/20/2018 05:09 PM, Bruce Momjian wrote:
> On Mon, Jun 18, 2018 at 09:49:20AM -0400, Robert Haas wrote:
>> know the ordering of the values under whatever ordering semantics
>> apply to that index.  It's unclear to me how useful such information
> 
> I don't think an ordered index is possible, only indexing of encrypted
> hashes, i.e. see this and the next slide:

It is possible with homomorphic encryption -- whether we want to support
that in core is another matter.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Nico Williams
On Wed, Jun 20, 2018 at 05:16:46PM -0400, Bruce Momjian wrote:
> On Mon, Jun 18, 2018 at 12:29:57PM -0500, Nico Williams wrote:
> > Note that unless the pg_catalog is protected against manipulation by
> > remote storage, then TDE for user tables might be possible to
> > compromise.  Like so: the attacker manipulates the pg_catalog to
> > escalate privilege in order to obtain the TDE keys.  This argues for
> > full database encryption, not just specific tables or columns.  But
> > again, this is for the threat model where the storage is the threat.
> 
> Yes, one big problem with per-column encryption is that administrators
> can silently delete data, though they can't add or modify it.

They can also re-add ("replay") deleted values; this can only be
defeated by also binding TX IDs or the like in the ciphertext.  And if you
don't bind the encrypted values to the PKs then they can add any value
they've seen to different rows.

One can protect to some degree against replay and reuse attacks, but
protecting against silent deletion is much harder.  Protecting against
the rows (or the entire DB) being restored to a past point in time is
even harder -- you quickly end up wanting Merkle hash/MAC trees and key
rotation, but this complicates everything and is performance-killing.
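The "bind the ciphertext to the PK and TX ID" idea can be illustrated with a keyed MAC. A minimal sketch with hypothetical names, assuming an HMAC over (primary key, transaction ID, ciphertext); a real design would more likely use an AEAD cipher with these values as associated data:

```python
import hashlib
import hmac

def seal(mac_key: bytes, pk: bytes, txid: int, ciphertext: bytes) -> bytes:
    """MAC tag binding the ciphertext to its row (pk) and transaction
    (txid), so a value cut from one row cannot be pasted into another,
    and an old value cannot be silently replayed under a new TX."""
    msg = pk + b"|" + txid.to_bytes(8, "big") + b"|" + ciphertext
    return hmac.new(mac_key, msg, hashlib.sha256).digest()

def verify(mac_key: bytes, pk: bytes, txid: int, ciphertext: bytes,
           tag: bytes) -> bool:
    return hmac.compare_digest(seal(mac_key, pk, txid, ciphertext), tag)
```

Note this detects cut-and-paste and replay, but, as said above, it cannot detect silent deletion of a whole row.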

> > I think any threat model where DBAs are not the threat is just not that
> > interesting to address with crypto within postgres itself...
> 
> Yes, but in my analysis the only solution there is client-side
> encryption:

For which threat model?

For threat models where the DBAs are not the threat there's no need for
client-side encryption: just encrypt the storage at the postgres
instance (with encrypting device drivers or -preferably- filesystems).

For threat models where the DBAs are the threat then yes, client-side
encryption works (or server-side encryption to public keys), but you
must still bind the encrypted values to the primary keys, and you must
provide integrity protection for as much data as possible -- see above.

Client-side crypto is hard to do well and still get decent performance.
So on the whole I think that crypto is a poor fit for the DBAs-are-the-
threat threat model.  It's better to reduce the number of DBAs/sysadmins
and audit all privileged (and, for good measure, unprivileged) access.

Client-side encryption, of course, wouldn't be a feature of PG..., as PG
is mostly a very smart server + very dumb clients.  The client could be
a lot smarter, for sure -- it could be a full-fledged RDBMS, it could
even be a postgres instance accessing the real server via FDW.

For example, libgda, the GNOME Data Access library, IIRC, is a smart client
that uses SQLite3 to access remote resources via virtual table
extensions that function a lot like PG's FDW.  This works well because
SQLite3 is embeddable and light-weight.  PG wouldn't fit that bill as
well, but one could start a PG instance to proxy a remote one via FDW,
with crypto done in the proxy.

>   http://momjian.us/main/writings/crypto_hw_use.pdf#page=97
> 
> You might want to look at the earlier slides too.

I will, thanks.

Nico
-- 



Postgres 10.4 crashing when using PLV8

2018-06-20 Thread Mukesh Chhatani
Hello Team,

I am trying to use PLV8 via a function, and when the function created
via PLV8 is used in a CREATE MATERIALIZED VIEW statement, Postgres
crashes; attached is the log file with DEBUG5 turned on.

The SQL that triggers the crash and the SQL function are attached.

Creating the materialized view mat_view_by_product is what causes the
crash.

I have tested the same in the environments below:

1. Test 1 - Passed
Mac , Postgres - 10.4, PLV8 - 2.1.0

2. Test 2 - Passed
AWS RDS , Postgres - 9.6.6 , PLV8 - 1.5.0

3. Test 3 - Failed; this test was passing earlier on 10.3 but fails on 10.4
AWS EC2 , Postgres - 10.4, PLV8 - 2.3.4

4. Test 4 - Failed
AWS RDS , Postgres - 10.3 , PLV8 - 2.1.0

Please let me know if any more information is required to assist in this
problem.

Thanks for your help.
2018-06-20 19:06:18 UTC [1955]: [235-1] user=,db=,app=,client= DEBUG:  forked 
new backend, pid=2861 socket=11
2018-06-20 19:06:18 UTC [2861]: [1-1] 
user=[unknown],db=[unknown],app=[unknown],client=localhost LOG:  connection 
received: host=localhost port=37222
2018-06-20 19:06:18 UTC [2861]: [2-1] 
user=[unknown],db=[unknown],app=[unknown],client=localhost DEBUG:  SSL 
connection from "(anonymous)"
2018-06-20 19:06:18 UTC [2861]: [3-1] 
user=postgres,db=procured,app=[unknown],client=localhost DEBUG:  postgres 
child[2861]: starting with (
2018-06-20 19:06:18 UTC [2861]: [4-1] 
user=postgres,db=procured,app=[unknown],client=localhost DEBUG:   postgres
2018-06-20 19:06:18 UTC [2861]: [5-1] 
user=postgres,db=procured,app=[unknown],client=localhost DEBUG:  )
2018-06-20 19:06:18 UTC [2861]: [6-1] 
user=postgres,db=procured,app=[unknown],client=localhost DEBUG:  InitPostgres
2018-06-20 19:06:18 UTC [2861]: [7-1] 
user=postgres,db=procured,app=[unknown],client=localhost DEBUG:  my backend ID 
is 3
2018-06-20 19:06:18 UTC [2861]: [8-1] 
user=postgres,db=procured,app=[unknown],client=localhost DEBUG:  
StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGR, 
xid/subid/cid: 0/1/0
2018-06-20 19:06:18 UTC [2861]: [9-1] 
user=postgres,db=procured,app=[unknown],client=localhost DEBUG:  received 
password packet
2018-06-20 19:06:18 UTC [2861]: [10-1] 
user=postgres,db=procured,app=[unknown],client=localhost LOG:  connection 
authorized: user=postgres database=procured SSL enabled (protocol=TLSv1.2, 
cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)
2018-06-20 19:06:18 UTC [2861]: [11-1] 
user=postgres,db=procured,app=psql,client=localhost DEBUG:  
CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, 
xid/subid/cid: 0/1/0
2018-06-20 19:06:18 UTC [2830]: [2-1] user=,db=,app=,client= DEBUG:  snapshot 
of 0+0 running transaction ids (lsn 0/72FC0F20 oldest xid 782 latest complete 
781 next xid 782)
2018-06-20 19:06:27 UTC [2861]: [12-1] 
user=postgres,db=procured,app=psql,client=localhost DEBUG:  StartTransaction(1) 
name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2018-06-20 19:06:27 UTC [2861]: [13-1] 
user=postgres,db=procured,app=psql,client=localhost DEBUG:  building index 
"pg_toast_49152_index" on table "pg_toast_49152"
2018-06-20 19:06:27 UTC [2861]: [14-1] 
user=postgres,db=procured,app=psql,client=localhost DEBUG:  no icu dir
2018-06-20 19:06:27 UTC [2861]: [15-1] 
user=postgres,db=procured,app=psql,client=localhost CONTEXT:  PL/pgSQL function 
histogram_merge(histogram,histogram) line 3 at RETURN
2018-06-20 19:06:27 UTC [1955]: [236-1] user=,db=,app=,client= DEBUG:  reaping 
dead processes
2018-06-20 19:06:27 UTC [1955]: [237-1] user=,db=,app=,client= DEBUG:  server 
process (PID 2861) was terminated by signal 11: Segmentation fault
2018-06-20 19:06:27 UTC [1955]: [238-1] user=,db=,app=,client= DETAIL:  Failed 
process was running: CREATE MATERIALIZED VIEW mat_view_by_product AS
SELECT
facility_alias_id,
group_type_id,
product_id,
po_date_month,histogram_agg(histogram) AS histogram,
sum(total_spend) AS total_spend,
min(min_price) AS min_price,
max(max_price) AS max_price,
min(min_po_date) AS min_po_date,
max(max_po_date) AS max_po_date,
(array_agg(most_recent_price ORDER BY max_po_date))[1] AS 
most_recent_price,
sum(total_eaches) AS total_eaches
FROM
mat_view_by_catalog
GROUP BY
facility_alias_id,
group_type_id,
product_id,
po_date_month;
2018-06-20 19:06:27 UTC [1955]: [239-1] user=,db=,app=,client= LOG:  server 
process (PID 2861) was terminated by signal 11: Segmentation fault
2018-06-20 19:06:27 UTC [1955]: [240-1] user=,db=,app=,client= DETAIL:  Failed 
process was running: CREATE MATERIALIZED VIEW mat_view_by_product AS
SELECT
facility_alias_id,
group_type_id,
product_id,
po_date_month,histogram_agg(histogram) AS histogram,
sum(total_spend) AS total_spend,

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Bruce Momjian
On Mon, Jun 18, 2018 at 12:29:57PM -0500, Nico Williams wrote:
> On Mon, Jun 11, 2018 at 06:22:22PM +0900, Masahiko Sawada wrote:
> > As per discussion at PGCon unconference, I think that firstly we need
> > to discuss what threats we want to defend database data against. If
> 
> We call that a threat model.  There can be many threat models, of
> course.
> 
> > user wants to defend against a threat where a malicious user who has
> > logged in to the OS or the database steals important data from the
> > database, this TDE design would not help, because such a user can
> > steal the data by getting a memory dump or via SQL. That of course
> > differs depending on system requirements or security compliance, but
> > what threats do you want to defend database data against? And why?
> 
> This design guards (somewhat) against the threat of storage theft
> (e.g., because the storage is remote).  It's a fine threat model to
> address, but it's also a lot easier to address in the filesystem or
> device drivers -- there's no need to do this in PostgreSQL itself except
> so as to support it on all platforms regardless of OS capabilities.
> 
> Note that unless the pg_catalog is protected against manipulation by
> remote storage, then TDE for user tables might be possible to
> compromise.  Like so: the attacker manipulates the pg_catalog to
> escalate privilege in order to obtain the TDE keys.  This argues for
> full database encryption, not just specific tables or columns.  But
> again, this is for the threat model where the storage is the threat.

Yes, one big problem with per-column encryption is that administrators
can silently delete data, though they can't add or modify it.

> Another similar threat model is dump management, where dumps are sent
> off-site where untrusted users might read them, or even edit them in the
> hopes that they will be used for restores and thus compromise the
> database.  This is most easily addressed by just encrypting the backups
> externally to PG.
> 
> Threat models where client users are the threat are easily handled by
> PG's permissions system.
> 
> I think any threat model where DBAs are not the threat is just not that
> interesting to address with crypto within postgres itself...


Yes, but in my analysis the only solution there is client-side
encryption:

http://momjian.us/main/writings/crypto_hw_use.pdf#page=97

You might want to look at the earlier slides too.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Bruce Momjian
On Mon, Jun 18, 2018 at 11:06:20AM -0400, Joe Conway wrote:
> > At the same time, having to have a bunch of independently-decipherable
> > short field values is not real secure either, especially if they're known
> > to all be encrypted with the same key.  But what you know or can guess
> > about the plaintext in such cases would be target-specific, rather than
> > an attack that could be built once and used against any PG database.
> 
> Again, this is dependent on the specific solution for encryption. In some
> cases you might do something like generate a single use random key,
> encrypt the payload with that, encrypt the single use key with the
> "global" key, append the two results and store.

Even if they are encrypted with the same key, they use different
initialization vectors that are stored inside the encrypted payload, so
you really can't identify much except the length, as Robert stated.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Bruce Momjian
On Mon, Jun 18, 2018 at 09:49:20AM -0400, Robert Haas wrote:
> figure stuff out.  You can infer something about the length of
> particular values.  Perhaps you can find cases where the same
> encrypted value appears multiple times.  If there's a btree index, you

Most encryption methods use a random initialization vector (IV) for each
encryption, e.g. pgp_sym_encrypt(), but length might allow this, as you
stated.

> know the ordering of the values under whatever ordering semantics
> apply to that index.  It's unclear to me how useful such information

I don't think an ordered index is possible, only indexing of encrypted
hashes, i.e. see this and the next slide:

https://momjian.us/main/writings/crypto_hw_use.pdf#page=86
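The "indexing of encrypted hashes" approach Bruce references can be sketched as a keyed-hash equality index: store HMAC(index_key, value) in place of the value, so equality lookups work while no ordering is revealed. This is a hypothetical illustration under those assumptions, not necessarily the slide's exact scheme:

```python
import hashlib
import hmac

def index_token(index_key: bytes, plaintext: bytes) -> str:
    # Keyed hash of the plaintext: supports equality lookups without
    # revealing ordering, and can't be brute-forced without index_key.
    return hmac.new(index_key, plaintext, hashlib.sha256).hexdigest()

class HashedEqualityIndex:
    """Equality-only 'index' over an encrypted column (a dict stands in
    for a real index structure)."""
    def __init__(self, index_key: bytes):
        self.index_key = index_key
        self.buckets = {}               # token -> list of row ids

    def insert(self, row_id: int, plaintext: bytes) -> None:
        token = index_token(self.index_key, plaintext)
        self.buckets.setdefault(token, []).append(row_id)

    def lookup(self, plaintext: bytes) -> list:
        return self.buckets.get(index_token(self.index_key, plaintext), [])
```

Range scans and ORDER BY cannot use such an index, which is exactly the limitation being discussed.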

> would be in practice or to what extent it might allow you to attack
> the underlying cryptography, but it seems like there might be cases
> where the information leakage is significant.  For example, suppose
> you're trying to determine which partially-encrypted record is that of
> Aaron Aardvark... or this guy:
> https://en.wikipedia.org/wiki/Hubert_Blaine_Wolfeschlegelsteinhausenbergerdorff,_Sr.
> 
> Recently, it was suggested to me that a use case for column-level
> encryption might be to prevent casual DBA snooping.  So, you'd want
> the data to appear in pg_dump output encrypted, because the DBA might
> otherwise look at it, but you wouldn't really be concerned about the
> threat of the DBA loading a hostile C module that would steal user
> keys and use them to decrypt all the data, because they don't care
> that much and would be fired if they were caught doing it.

Yes, that is a benefit that is not possible with page-level encryption. 
It also encrypts the WAL and backups automatically;  see:

http://momjian.us/main/writings/crypto_hw_use.pdf#page=97

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Push down Aggregates below joins

2018-06-20 Thread Tomas Vondra
On 06/20/2018 10:12 PM, Heikki Linnakangas wrote:
> Currently, the planner always first decides the scan/join order, and
> adds Group/Agg nodes on top of the joins. Sometimes it would be legal,
> and beneficial, to perform the aggregation below a join. I've been
> hacking on a patch to allow that.
> 

There was a patch [1] from Antonin Houska aiming to achieve something
similar. IIRC it aimed to push the aggregate down in more cases,
leveraging the partial aggregation stuff. I suppose your patch only aims
to do the pushdown when the two-phase aggregation is not needed?

[1] https://www.postgresql.org/message-id/flat/9666.1491295317@localhost

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Bruce Momjian
On Mon, Jun 18, 2018 at 08:29:32AM -0400, Joe Conway wrote:
> >> Or
> >> maybe key management is really tied into the separately discussed effort
> >> to create SQL VARIABLEs somehow.
> > 
> > Could you elaborate on how key management is tied into SQL VARIABLEs?
> 
> Well, the key management probably is not, but the SQL VARIABLE might be
> where the key is stored for use.

I disagree.  I would need to understand how an extension actually helps
here, because it certainly limits flexibility compared to a shell
command.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Bruce Momjian
On Wed, Jun 13, 2018 at 09:20:58AM -0400, Joe Conway wrote:
> On 06/11/2018 05:22 AM, Masahiko Sawada wrote:
> > As per discussion at PGCon unconference, I think that firstly we need
> > to discuss what threats we want to defend database data against.
> 
> Exactly. While certainly there is demand for encryption for the sake of
> "checking a box", different designs will defend against different
> threats, and we should be clear on which ones we are trying to protect
> against for any particular design.

Yep.  This slide covers the various encryption levels and the threats
they protect against:

http://momjian.us/main/writings/crypto_hw_use.pdf#page=97

I do not have page-level encryption listed since that is not currently
possible with Postgres.

> > Also, if I understand correctly, at unconference session there also
> > were two suggestions about the design other than the suggestion by
> > Alexander: implementing TDE at column level using POLICY, and
> > implementing TDE at table-space level. The former was suggested by Joe
> > but I'm not sure of the details of that suggestion. I'd love to hear
> > more about it.
> 
> The idea has not been extensively fleshed out yet, but the thought was
> that we create column level POLICY, which would transparently apply some
> kind of transform on input and/or output. The transforms would
> presumably be expressions, which in turn could use functions (extension
> or builtin) to do their work. That would allow encryption/decryption,
> DLP (data loss prevention) schemes (masking, redacting), etc. to be
> applied based on the policies.

This is currently possible with stock Postgres as you can see from this
and the following slides:

http://momjian.us/main/writings/crypto_hw_use.pdf#page=77

> This, in and of itself, would not address key management. There is
> probably a separate need for some kind of built in key management --
> perhaps a flexible way to integrate with external systems such as Vault
> for example, or maybe something self contained, or perhaps both. Or
> maybe key management is really tied into the separately discussed effort
> to create SQL VARIABLEs somehow.

I cover key management in this slide, and following:

http://momjian.us/main/writings/crypto_hw_use.pdf#page=53

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-06-20 Thread Bruce Momjian
On Fri, May 25, 2018 at 08:41:46PM +0900, Moon, Insung wrote:
> BTW, I want to support CBC mode encryption[3]. However, I'm not sure how to
> use the IV in CBC mode for this proposal.
> I'd like to hear opinions from security engineers.

Well, CBC makes sense, and since AES uses a 16 byte block size, you
would start with the initialization vector (IV) and run over the 8k page
512 times.  The IV can be any random value that is not repeated, and
does not need to be secret.

However, using the same IV for the entire table would mean that people
can detect if two pages in the same table contain the same data.  You
might care about that, or you might not.  It would prevent detection of
two _tables_ containing the same 8k page.  A more secure solution would
be to use a different IV for each 8k page.

The cleanest idea would be for the per-table IV to be stored per table,
but the IV used for each block to be a mixture of the table's IV and the
page's offset in the table.
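Bruce's per-page IV derivation (mix the per-table IV with the page's offset) might look like the following hypothetical sketch; the `page_iv` name and the hash-based mixing are assumptions for illustration, since any deterministic, non-repeating derivation would do:

```python
import hashlib

PAGE_SIZE = 8192  # bytes per heap page

def page_iv(table_iv: bytes, page_no: int) -> bytes:
    """Derive a distinct 16-byte IV for each 8k page by hashing the
    per-table IV together with the page's byte offset.  IVs need not
    be secret, only non-repeating, so a public hash is fine here."""
    offset = page_no * PAGE_SIZE
    return hashlib.sha256(table_iv + offset.to_bytes(8, "big")).digest()[:16]
```

Because every page gets a distinct IV, two identical 8k pages in the same table no longer produce identical ciphertext.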

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: libpq compression

2018-06-20 Thread Robbie Harwood
Konstantin Knizhnik  writes:

> On 20.06.2018 00:04, Robbie Harwood wrote:
>> Konstantin Knizhnik  writes:
>>> On 18.06.2018 23:34, Robbie Harwood wrote:
>>>
>>>> I also don't like that you've injected into the *startup* path -
>>>> before authentication takes place.  Fundamentally, authentication
>>>> (if it happens) consists of exchanging some combination of short
>>>> strings (e.g., usernames) and binary blobs (e.g., keys).  None of
>>>> this will compress well, so I don't see it as worth performing this
>>>> negotiation there - it can wait.  It's also another message in
>>>> every startup.  I'd leave it to connection parameters, personally,
>>>> but up to you.
>>>
>>>   From my point of view compression of libpq traffic is similar to
>>> SSL and should be toggled at the same stage.
>>
>> But that's not what you're doing.  This isn't where TLS gets toggled.
>>
>> TLS negotiation happens as the very first packet: after completing
>> the TCP handshake, the client will send a TLS negotiation request.
>> If it doesn't happen there, it doesn't happen at all.
>>
>> (You *could* configure it where TLS is toggled.  This is, I think,
>> not a good idea.  TLS encryption is a probe: the server can reject
>> it, at which point the client tears everything down and connects
>> without TLS.  So if you do the same with compression, that's another
>> point of tearing down and starting over.  The scaling on it isn't good
>> either: if we add another encryption method into the mix, you've
>> doubled the number of teardowns.)
>
> Yes, you are right. There is a special message for enabling the TLS
> protocol.  But I do not think that the same thing is needed for
> compression.  This is why I prefer to specify compression in
> connection options, so compression may be enabled straight after
> processing of the startup packet.  Frankly speaking I still do not see
> reasons to postpone enabling compression till some later moment.

I'm arguing for connection option only (with no additional negotiation
round-trip).  See below.

>>> Definitely authentication parameters are not so large as to be
>>> efficiently compressed, but compression (maybe in future password
>>> protected) can somehow protect this data.  In any case I do not
>>> think that compression of authentication data would have any influence
>>> on negotiation speed.  So I am not 100% sure that toggling
>>> compression just after receiving the startup packet is the only right
>>> solution.  But I am also not convinced that there is some better
>>> place where the compressor should be configured.  Do you have some
>>> concrete suggestions for it? In InitPostgres just after
>>> PerformAuthentication?
>>
>> Hmm, let me try to explain this differently.
>>
>> pq_configure() (as you've called it) shouldn't send a packet.  At its
>> callsite, we *already know* whether we want to use compression -
>> that's what the port->use_compression option says.  So there's no
>> point in having a negotiation there - it's already happened.
>
> My idea was the following: the client wants to use compression, but the
> server may reject this attempt (for any reason: it doesn't support it, has
> no proper compression library, does not want to spend CPU on
> decompression, ...).  Right now the compression algorithm is hardcoded, but
> in future the client and server may negotiate to choose a proper
> compression protocol.  This is why I prefer to perform negotiation between
> client and server to enable compression.

Well, for negotiation you could put the name of the algorithm you want
in the startup.  It doesn't have to be a boolean for compression, and
then you don't need an additional round-trip.

>>>> I don't like where you've put the entry points to the compression
>>>> logic: it's a layering violation.  A couple people have expressed
>>>> similar reservations I think, so let me see if I can explain using
>>>> `pqsecure_read()` as an example.  In pseudocode, `pqsecure_read()`
>>>> looks like this:
>>>>
>>>>   if conn->is_tls:
>>>>       n = tls_read(conn, ptr, len)
>>>>   else:
>>>>       n = pqsecure_raw_read(conn, ptr, len)
>>>>   return n
>>>>
>>>> I want to see this extended by your code to something like:
>>>>
>>>>   if conn->is_tls:
>>>>       n = tls_read(conn, ptr, len)
>>>>   else:
>>>>       n = pqsecure_raw_read(conn, ptr, len)
>>>>
>>>>   if conn->is_compressed:
>>>>       n = decompress(ptr, n)
>>>>
>>>>   return n
>>>>
>>>> In conjunction with the above change, this should also
>>>> significantly reduce the size of the patch (I think).
>>>
>>> Yes, it will simplify the patch, but it will make libpq compression
>>> completely useless (see my explanation above).  We need to use streaming
>>> compression, and to be able to use streaming compression I have to
>>> pass a function for fetching more data to the compression library.
>>
>> I don't think you need that, even with the streaming API.
>>
>> To make this very concrete, let's talk about ZSTD_decompressStream (I'm
>> pulling 
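Robbie's point that streaming decompression needs no callback into the transport can be sketched with zlib's streaming API: the decompressor is simply layered over the raw read function, exactly as in the `pqsecure_read()` pseudocode above. The names here are hypothetical and a `BytesIO` stands in for the socket:

```python
import io
import zlib

def make_decompressing_reader(raw_read):
    """Layer streaming decompression over a raw read(n) callable: read
    compressed bytes from the transport, then push them through zlib's
    streaming decompressor.  No callback from the compression library
    back into the transport is needed."""
    decomp = zlib.decompressobj()
    buf = bytearray()

    def read(n: int) -> bytes:
        while len(buf) < n:
            chunk = raw_read(4096)          # raw transport read
            if not chunk:                   # connection closed / stream done
                break
            buf.extend(decomp.decompress(chunk))
        out = bytes(buf[:n])
        del buf[:n]
        return out

    return read

# usage: wrap the read method of an in-memory compressed stream
stream = zlib.compress(b"SELECT 1;\n" * 500)
reader = make_decompressing_reader(io.BytesIO(stream).read)
```

The pull loop lives in the reader, not in the compression library, which is why the layering in the pseudocode suffices even for streaming compression.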

Re: Avoiding Tablespace path collision for primary and standby

2018-06-20 Thread Tom Lane
Ashwin Agrawal  writes:
> Okay, just bouncing another approach: how about generating a UUID for a
> postgres instance during initdb and pg_basebackup?

There's no uuid generation code in core postgres, for excellent reasons
(lack of portability and the existence of failure modes being the main
objections).  This is not different in any meaningful way from the
proposal to use timestamps, except for being more complicated.

regards, tom lane



Re: ERROR: ORDER/GROUP BY expression not found in targetlist

2018-06-20 Thread Tom Lane
I wrote:
> Thanks for the report.  I traced through this, and the problem seems to
> be that split_pathtarget_at_srfs() is neglecting to attach sortgroupref
> labeling to the extra PathTargets it constructs.  I don't remember
> whether that code is my fault or Andres', but I'll take a look at
> fixing it.

Here's a proposed patch for that.

regards, tom lane

diff --git a/src/backend/optimizer/util/tlist.c b/src/backend/optimizer/util/tlist.c
index 32160d5..5500f33 100644
*** a/src/backend/optimizer/util/tlist.c
--- b/src/backend/optimizer/util/tlist.c
***
*** 25,44 
  	((IsA(node, FuncExpr) && ((FuncExpr *) (node))->funcretset) || \
  	 (IsA(node, OpExpr) && ((OpExpr *) (node))->opretset))
  
! /* Workspace for split_pathtarget_walker */
  typedef struct
  {
  	List	   *input_target_exprs; /* exprs available from input */
! 	List	   *level_srfs;		/* list of lists of SRF exprs */
! 	List	   *level_input_vars;	/* vars needed by SRFs of each level */
! 	List	   *level_input_srfs;	/* SRFs needed by SRFs of each level */
  	List	   *current_input_vars; /* vars needed in current subexpr */
  	List	   *current_input_srfs; /* SRFs needed in current subexpr */
  	int			current_depth;	/* max SRF depth in current subexpr */
  } split_pathtarget_context;
  
  static bool split_pathtarget_walker(Node *node,
  		split_pathtarget_context *context);
  
  
  /*
--- 25,62 
  	((IsA(node, FuncExpr) && ((FuncExpr *) (node))->funcretset) || \
  	 (IsA(node, OpExpr) && ((OpExpr *) (node))->opretset))
  
! /*
!  * Data structures for split_pathtarget_at_srfs().  To preserve the identity
!  * of sortgroupref items even if they are textually equal(), what we track is
!  * not just bare expressions but expressions plus their sortgroupref indexes.
!  */
  typedef struct
  {
+ 	Node	   *expr;			/* some subexpression of a PathTarget */
+ 	Index		sortgroupref;	/* its sortgroupref, or 0 if none */
+ } split_pathtarget_item;
+ 
+ typedef struct
+ {
+ 	/* This is a List of bare expressions: */
  	List	   *input_target_exprs; /* exprs available from input */
! 	/* These are Lists of Lists of split_pathtarget_items: */
! 	List	   *level_srfs;		/* SRF exprs to evaluate at each level */
! 	List	   *level_input_vars;	/* input vars needed at each level */
! 	List	   *level_input_srfs;	/* input SRFs needed at each level */
! 	/* These are Lists of split_pathtarget_items: */
  	List	   *current_input_vars; /* vars needed in current subexpr */
  	List	   *current_input_srfs; /* SRFs needed in current subexpr */
+ 	/* Auxiliary data for current split_pathtarget_walker traversal: */
  	int			current_depth;	/* max SRF depth in current subexpr */
+ 	Index		current_sgref;	/* current subexpr's sortgroupref, or 0 */
  } split_pathtarget_context;
  
  static bool split_pathtarget_walker(Node *node,
  		split_pathtarget_context *context);
+ static void add_sp_item_to_pathtarget(PathTarget *target,
+ 		  split_pathtarget_item *item);
+ static void add_sp_items_to_pathtarget(PathTarget *target, List *items);
  
  
  /*
*** apply_pathtarget_labeling_to_tlist(List 
*** 822,827 
--- 840,848 
   * already meant as a reference to a lower subexpression).  So, don't expand
   * any tlist expressions that appear in input_target, if that's not NULL.
   *
+  * It's also important that we preserve any sortgroupref annotation appearing
+  * in the given target, especially on expressions matching input_target items.
+  *
   * The outputs of this function are two parallel lists, one a list of
   * PathTargets and the other an integer list of bool flags indicating
   * whether the corresponding PathTarget contains any evaluatable SRFs.
*** split_pathtarget_at_srfs(PlannerInfo *ro
*** 845,850 
--- 866,872 
  	int			max_depth;
  	bool		need_extra_projection;
  	List	   *prev_level_tlist;
+ 	int			lci;
  	ListCell   *lc,
  			   *lc1,
  			   *lc2,
*** split_pathtarget_at_srfs(PlannerInfo *ro
*** 884,893 
--- 906,920 
  	need_extra_projection = false;
  
  	/* Scan each expression in the PathTarget looking for SRFs */
+ 	lci = 0;
  	foreach(lc, target->exprs)
  	{
  		Node	   *node = (Node *) lfirst(lc);
  
+ 		/* Tell split_pathtarget_walker about this expr's sortgroupref */
+ 		context.current_sgref = get_pathtarget_sortgroupref(target, lci);
+ 		lci++;
+ 
  		/*
  		 * Find all SRFs and Vars (and Var-like nodes) in this expression, and
  		 * enter them into appropriate lists within the context struct.
*** split_pathtarget_at_srfs(PlannerInfo *ro
*** 981,996 
  			 * This target should actually evaluate any SRFs of the current
  			 * level, and it needs to propagate forward any Vars needed by
  			 * later levels, as well as SRFs computed earlier and needed by
! 			 * later levels.  We rely 

Re: Postgres 10.4 crashing when using PLV8

2018-06-20 Thread Mukesh Chhatani
Thanks David for the response; I have opened an issue with the PLV8 team.

Let me know whether I should also report this bug to Postgres; since I was
not sure, I sent the email here earlier.

Regards,
Mukesh

On Wed, Jun 20, 2018 at 3:02 PM, David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Wed, Jun 20, 2018 at 12:46 PM, Mukesh Chhatani <
> chhatani.muk...@gmail.com> wrote:
>
>> I am trying to use the PLV8 via function and while using the function
>> created via PLV8 in one of the create materialized view, postgres crashes,
>> attached is the log file with DEBUG5 turned on.
>>
>
> This is not the correct place to post possible bug reports for
> third-party extensions.  PLV8 has an active Issues listing on GitHub and
> you should post your observations there.  They can report upstream to us if
> there is indeed something specific in core causing their code to fail.
>
> Additionally, in the future if you find that your email is indeed a core
> PostgreSQL issue please just submit it to one of the available lists.  If
> you are unsure pgsql-general is a safe choice.  For this email enough
> information has been provided that our official pgsql-bugs email list (or
> form) would be acceptable and indeed preferred.
>
> David J.
>
>


Re: Postgres 10.4 crashing when using PLV8

2018-06-20 Thread David G. Johnston
On Wed, Jun 20, 2018 at 12:46 PM, Mukesh Chhatani  wrote:

> I am trying to use the PLV8 via function and while using the function
> created via PLV8 in one of the create materialized view, postgres crashes,
> attached is the log file with DEBUG5 turned on.
>

This is not the correct place to post possible bug reports for
third-party extensions.  PLV8 has an active Issues listing on GitHub and
you should post your observations there.  They can report upstream to us if
there is indeed something specific in core causing their code to fail.

Additionally, in the future if you find that your email is indeed a core
PostgreSQL issue please just submit it to one of the available lists.  If
you are unsure pgsql-general is a safe choice.  For this email enough
information has been provided that our official pgsql-bugs email list (or
form) would be acceptable and indeed preferred.

David J.


Re: [PATCH] Include application_name in "connection authorized" log message

2018-06-20 Thread Stephen Frost
Greetings,

* Don Seiler (d...@seiler.us) wrote:
> In trying to troubleshoot the source of a recent connection storm, I was
> frustrated to find that the app name was not included in the connection
> messages. It is there in the log_line_prefix on the disconnection messages
> but I would prefer it be directly visible with the connection itself. With
> some guidance from Stephen Frost I've put together this patch which does
> that.

Yeah, I tend to agree that it'd be extremely useful to have this
included in the 'connection authorized' message.

> It adds a new application_name field to the Port struct, stores the
> application name there when processing the startup packet, and then reads
> from there when logging the "connection authorized" message.

Right, to have this included in the 'connection authorized' message, we
need to capture it early on, prior to generic GUC handling, and store it
momentarily to be used by the 'connection authorized' message.
Using Port for that seems reasonable (we already store other things
there, like the user and database).

Taking a look at the patch itself...

diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 7698cd1f88..088ef346a8 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -135,6 +135,7 @@ typedef struct Port
 */
char   *database_name;
char   *user_name;
+   char   *application_name;
char   *cmdline_options;
List   *guc_options;
 
We should have some comments here about how this is the "startup packet
application_name" and that it's specifically used for the "connection
authorized" message and that it shouldn't really be used
post-connection-startup (the GUC should be used instead, as applications
can change it post-startup).

diff --git a/src/backend/postmaster/postmaster.c 
b/src/backend/postmaster/postmaster.c
index a4b53b33cd..8e75c80728 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2094,6 +2094,10 @@ retry1:
 				port->guc_options = lappend(port->guc_options,
 											pstrdup(nameptr));
 				port->guc_options = lappend(port->guc_options,
 											pstrdup(valptr));
+
+				/* Copy application_name to port as well */
+				if (strcmp(nameptr, "application_name") == 0)
+					port->application_name = pstrdup(valptr);
 			}
 			offset = valoffset + strlen(valptr) + 1;
 		}

That comment feels a bit lacking - this is a pretty special case which
should be explained.

diff --git a/src/backend/utils/init/postinit.c 
b/src/backend/utils/init/postinit.c
index 09e0df290d..86db7630d5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -266,8 +266,8 @@ PerformAuthentication(Port *port)
 #ifdef USE_SSL
 		if (port->ssl_in_use)
 			ereport(LOG,
-					(errmsg("connection authorized: user=%s database=%s SSL enabled (protocol=%s, cipher=%s, bits=%d, compression=%s)",
-							port->user_name, port->database_name,
+					(errmsg("connection authorized: user=%s database=%s application=%s SSL enabled (protocol=%s, cipher=%s, bits=%d, compression=%s)",
+							port->user_name, port->database_name, port->application_name,
 							be_tls_get_version(port),
 							be_tls_get_cipher(port),
 							be_tls_get_cipher_bits(port),
@@ -275,8 +275,8 @@ PerformAuthentication(Port *port)
 		else
 #endif
 			ereport(LOG,
-					(errmsg("connection authorized: user=%s database=%s",
-							port->user_name, port->database_name)));
+					(errmsg("connection authorized: user=%s database=%s application=%s",
+							port->user_name, port->database_name, port->application_name)));
 	}
 }

You definitely need to be handling the case where application_name is
*not* passed in more cleanly.  I don't think we should output anything
different from what we do today in those cases.

Also, these don't need to be / shouldn't be 3 separate patches/commits,
and there should be a sensible commit message which explains the change
in its entirety.

If you 

memory leak when serializing TRUNCATE in reorderbuffer

2018-06-20 Thread Tomas Vondra

Hi,

While rebasing the logical replication patches on top of PG11, I've 
noticed that ReorderBufferSerializeChange claims this:


case REORDER_BUFFER_CHANGE_TRUNCATE:
...
/* ReorderBufferChange contains everything important */

That is not quite correct, though - the OIDs of truncated relations are 
stored in a separately palloc'd array. So we only serialize the pointer 
to that array (which is part of ReorderBufferChange) and then read it 
back when restoring the change from disk.


Now, this can't cause crashes, because the 'relids' array won't go away 
(more on this later), and so the pointer remains valid. But it's a 
memory leak - a quite small and not very common one, because people 
don't do TRUNCATE very often, particularly not with many tables.


So I think we should fix this and serialize/restore the OID array, just 
like we do for tuples, snapshots etc. See the attached fix.


Another thing we should probably reconsider is where relids is 
allocated - the pointer remains valid only because we happen to allocate 
it in TopMemoryContext. It's not that bad because we don't free the 
other reorderbuffer contexts until the walsender exits anyway, but still.


So I propose to allocate it in rb->context just like the other bits of 
data (snapshots, ...). Replacing the palloc() in DecodeTruncate() with 
something like:


   MemoryContextAlloc(ctx->reorder->context,
  xlrec->nrelids * sizeof(Oid));

should do the trick. The other places in decode.c don't allocate memory 
directly but call ReorderBufferGetTupleBuf() instead - perhaps we should 
introduce a similar wrapper here too.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index e2f59bf580..582bedf1de 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -412,10 +412,14 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 			/* no data in addition to the struct itself */
+		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			if (change->data.truncate.relids != NULL)
+				pfree(change->data.truncate.relids);
+			change->data.truncate.relids = NULL;
+			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
-		case REORDER_BUFFER_CHANGE_TRUNCATE:
 			break;
 	}
 
@@ -2289,6 +2293,26 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 break;
 			}
 		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				Size		size;
+				char	   *data;
+
+				/* account for the OIDs of truncated relations */
+				size = sizeof(Oid) * change->data.truncate.nrelids;
+				sz += size;
+
+				/* make sure we have enough space */
+				ReorderBufferSerializeReserve(rb, sz);
+
+				data = ((char *) rb->outbuf) + sizeof(ReorderBufferDiskChange);
+				/* might have been reallocated above */
+				ondisk = (ReorderBufferDiskChange *) rb->outbuf;
+
+				memcpy(data, change->data.truncate.relids, size);
+				data += size;
+
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
@@ -2569,6 +2593,16 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 			/* the base struct contains all the data, easy peasy */
 		case REORDER_BUFFER_CHANGE_TRUNCATE:
+			{
+				Size		size;
+
+				size = change->data.truncate.nrelids * sizeof(Oid);
+
+				change->data.truncate.relids = MemoryContextAllocZero(rb->context, size);
+
+				memcpy(change->data.truncate.relids, data, size);
+				break;
+			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:


[PATCH] Include application_name in "connection authorized" log message

2018-06-20 Thread Don Seiler
Good afternoon, all.

In trying to troubleshoot the source of a recent connection storm, I was
frustrated to find that the app name was not included in the connection
messages. It is there in the log_line_prefix on the disconnection messages
but I would prefer it be directly visible with the connection itself. With
some guidance from Stephen Frost I've put together this patch which does
that.

It adds a new application_name field to the Port struct, stores the
application name there when processing the startup packet, and then reads
from there when logging the "connection authorized" message.

Thanks,
Don.

-- 
Don Seiler
www.seiler.us


appname_log.patch
Description: Binary data


Re: Avoiding Tablespace path collision for primary and standby

2018-06-20 Thread Ashwin Agrawal
On Wed, Jun 20, 2018 at 10:50 AM Andres Freund  wrote:

>
>
> On June 20, 2018 10:31:05 AM PDT, Ashwin Agrawal 
> wrote:
> >On Wed, Jun 20, 2018 at 9:39 AM Bruce Momjian  wrote:
> >
> >> On Fri, May 25, 2018 at 02:17:23PM -0700, Ashwin Agrawal wrote:
> >> >
> >> > On Fri, May 25, 2018 at 7:33 AM, Tom Lane 
> >wrote:
> >> >
> >> > Ashwin Agrawal  writes:
> >> > > Proposing to create directory with timestamp at time of
> >creating
> >> > tablespace
> >> > > and create symbolic link to it instead.
> >> >
> >> > I'm skeptical that this solves your problem.  What happens when
> >the
> >> CREATE
> >> > TABLESPACE command is replicated to the standby with sub-second
> >> delay?
> >> >
> >> >
> >> > I thought timestamps have micro-second precision. Are we expecting
> >> tablespace
> >> > to be created, wal logged, streamed, and replayed on mirror in
> >> micro-second ?
> >>
> >> I didn't see anyone answer your question above.  We don't expect
> >> micro-second replay, but clock skew, which Tom Lane mentioned, could
> >make
> >> it appear to be a micro-second replay.
> >>
> >
> >Thanks Bruce for answering. Though I still don't see why clock skew is
> >a
> >problem here. As I think clock skew only happens across machines. On
> >same
> >machine why would it be an issue. Problem is only with same machine,
> >different machines anyways paths don't collide so even if clock skew
> >happens is not a problem. (I understand there may be reservations for
> >putting timestamp in directory path, but clock skew argument is not
> >clear.)
>
> Clock skew happens within machines too. Both because of multi socket
> systems and virtualization systems. Also clock adjustments.
>

Thank you, that helps.

Okay, just bouncing another approach: how about generating a UUID for a
postgres instance during initdb and pg_basebackup? (Unlike
`system_identifier` used in pg_controldata, store it in a separate
independent file which is excluded from pg_basebackup and instead created
by pg_basebackup.) It would be read only once during startup and used in
the tablespace path. (I understand generating a UUID may be a little
heavy-lifting just for same-node tablespace path collisions, but having a
unique identifier for each postgres instance, primary or standby, may be
useful for other purposes in the long term as well.)


Re: Allow auto_explain to log to NOTICE

2018-06-20 Thread Daniel Gustafsson
> On 27 Apr 2018, at 04:24, Andres Freund  wrote:
> 
> On 2018-04-27 11:52:18 +0930, Tom Dunstan wrote:
>>> I'd argue this should contain the non-error cases. It's just as
>>> reasonable to want to add this as a debug level or such.
>> 
>> So all of warning, info, debug and debug1-5?
> 
> Yea. Likely nobody will ever use debug5, but I don't see a point making
> that judgement call here.

Did you have a chance to hack up a new version of the patch with Andres’ review
comments?  I’m marking this patch as Waiting on Author for now based on the
feedback so far in this thread.

cheers ./daniel


Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

2018-06-20 Thread Alexander Korotkov
Hi!

Thank you for your work on this issue.

On Wed, Feb 28, 2018 at 5:25 PM Ivan Kartyshov
 wrote:
> Thank you for your valuable comments. I've made a few adjustments.
>
> The main goal of my changes is to let long read-only transactions run on
> replica if hot_standby_feedback is turned on.
>
>
> Patch1 - hsfeedback_av_truncate.patch is made to stop
> ResolveRecoveryConflictWithLock occurs on replica, after autovacuum lazy
> truncates heap on master cutting some pages at the end. When
> hot_standby_feedback is on, we know that the autovacuum does not remove
> anything superfluous, which could be needed on standby, so there is no
> need to rise any ResolveRecoveryConflict*.
>
> 1) Add to xl_standby_locks and xl_smgr_truncate isautovacuum flag, which
> tells us that autovacuum generates them.
>
> 2) When autovacuum decides to trim the table (using lazy_truncate_heap),
> it takes AccessExclusiveLock and sends this lock to the replica, but
> replica should ignore  AccessExclusiveLock if hot_standby_feedback=on.
>
> 3) When autovacuum truncate wal message is replayed on a replica, it
> takes ExclusiveLock on a table, so as not to interfere with read-only
> requests.

I see that you use IsAutoVacuumWorkerProcess() function to determine
if this access exclusive lock is caused by vacuum.  And I see at least
two issues with that.

1) That wouldn't work for manually run vacuums.  I understand that
query cancellation caused by autovacuum is probably the most annoying
case, but an unnecessarily partial bug fix is undesirable.
2) Access exclusive locks are logged not only immediately by the
process that takes them; all such locks are also logged in
LogStandbySnapshot(), which is called by the checkpointer, bgwriter and
others.  So, IsAutoVacuumWorkerProcess() wouldn't help in those cases.
I understand that heap truncation is typically fast, and the
probability that some process will dump all the access exclusive locks
while truncation is in progress is low.  But nevertheless we need to
handle that properly.

Based on this, I think we should give up on using
IsAutoVacuumWorkerProcess() to determine locks caused by vacuum.

Thinking about that more, I found that adding a vacuum mark as an extra
argument to LockAcquireExtended is also wrong.  It would still be hard
to determine whether we should log the lock in LogStandbySnapshot().
We'd have to add that flag to the shared memory table of locks, and
that looks like a kludge.

Therefore, I'd like to propose another approach: introduce a new lock
level.  So, AccessExclusiveLock would be split in two:
AccessExclusiveLocalLock and AccessExclusiveLock.  Unlike
AccessExclusiveLock, AccessExclusiveLocalLock would not be logged to
the standby, and would be used for heap truncation.

I expect some resistance to my proposal, because fixing this "small
bug" doesn't seem to deserve a new lock level.  But the current
behavior of hot_standby_feedback is buggy.  From the user's
perspective, the hot_standby_feedback option simply doesn't work: it
causes the master to bloat without protecting standby queries from
cancellation.  We need to fix that, and I don't see any other proper
way to do it except adding a new lock level...

> Patch2 - hsfeedback_noninvalide_xmin.patch
> When walsender is initialized, its xmin in PROCARRAY is set to
> GetOldestXmin() in order to prevent autovacuum running on master from
> truncating relation and removing some pages that are required by
> replica. This might happen if master's autovacuum and replica's query
> started simultaneously. And the replica has not yet reported its xmin
> value.

I don't yet understand what the problem is here and why this patch
should solve it.  As I understand the idea of this patch,
GetOldestXmin() will initialize the xmin of the walsender with the
lowest xmin of the replication slots.  But the xmin of replication
slots will be taken into account anyway by GetSnapshotData() called by
vacuum.

> How to test:
> Make async replica, turn on feedback, reduce max_standby_streaming_delay
> and aggressive autovacuum.

You forgot to mention that one should set up the replication using a
replication slot.  Naturally, if the replication slot doesn't exist,
then the master shouldn't keep dead tuples for a disconnected standby,
because the master doesn't know whether the standby will reconnect or
is gone forever.

> autovacuum = on
> autovacuum_max_workers = 1
> autovacuum_naptime = 1s
> autovacuum_vacuum_threshold = 1
> autovacuum_vacuum_cost_delay = 0
>
> Test:
> Here we will start the replica and begin a repeatable read transaction on
> a table, then we stop the replica's postmaster to prevent starting the
> walreceiver worker (on master startup) and sending the master its
> transaction xmin over the hot_standby_feedback message.
>
> MASTER:  start
> MASTER:  CREATE TABLE test AS (SELECT id, 1 AS value FROM
>          generate_series(1,1000) id);
> MASTER:  stop
> REPLICA: start

I guess you meant start of standby session here, not postmaster.
Otherwise, I don't understand how table test will reach standby.

> REPLICA: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
> REPLICA: SELECT 

Re: PATCH: backtraces for error messages

2018-06-20 Thread Robert Haas
On Wed, Jun 20, 2018 at 1:15 PM, Andres Freund  wrote:
> I don't think that's ok. It's perfectly possible to hit
> ERRCODE_INTERNAL_ERROR at a high frequency in some situations,

Really?  How?

> and
> there's plenty cases that aren't ERRCODE_INTERNAL_ERROR where we'd want
> this. E.g. a lot of generic filesystem errors have
> errcode_for_file_access(), but are too generic messages to debug.  So I
> think we'll just need a separate GUC for things that aren't PANIC and
> haven't explicitly opt-ed in.

Mmph.  I don't like that much.  I mean I can hold my nose, but I hope
we can find a way to do better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Avoiding Tablespace path collision for primary and standby

2018-06-20 Thread Andres Freund



On June 20, 2018 10:31:05 AM PDT, Ashwin Agrawal  wrote:
>On Wed, Jun 20, 2018 at 9:39 AM Bruce Momjian  wrote:
>
>> On Fri, May 25, 2018 at 02:17:23PM -0700, Ashwin Agrawal wrote:
>> >
>> > On Fri, May 25, 2018 at 7:33 AM, Tom Lane 
>wrote:
>> >
>> > Ashwin Agrawal  writes:
>> > > Proposing to create directory with timestamp at time of
>creating
>> > tablespace
>> > > and create symbolic link to it instead.
>> >
>> > I'm skeptical that this solves your problem.  What happens when
>the
>> CREATE
>> > TABLESPACE command is replicated to the standby with sub-second
>> delay?
>> >
>> >
>> > I thought timestamps have micro-second precision. Are we expecting
>> tablespace
>> > to be created, wal logged, streamed, and replayed on mirror in
>> micro-second ?
>>
>> I didn't see anyone answer your question above.  We don't expect
>> micro-second replay, but clock skew, which Tom Lane mentioned, could
>make
>> it appear to be a micro-second replay.
>>
>
>Thanks Bruce for answering. Though I still don't see why clock skew is
>a
>problem here. As I think clock skew only happens across machines. On
>same
>machine why would it be an issue. Problem is only with same machine,
>different machines anyways paths don't collide so even if clock skew
>happens is not a problem. (I understand there may be reservations for
>putting timestamp in directory path, but clock skew argument is not
>clear.)

Clock skew happens within machines too. Both because of multi socket systems 
and virtualization systems. Also clock adjustments.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Fast default stuff versus pg_upgrade

2018-06-20 Thread Andrew Dunstan



On 06/20/2018 12:58 PM, Andres Freund wrote:

Hi,

Just a quick scan-through review:

On 2018-06-20 12:51:07 -0400, Andrew Dunstan wrote:

diff --git a/src/backend/utils/adt/pg_upgrade_support.c 
b/src/backend/utils/adt/pg_upgrade_support.c
index 0c54b02..2666ab2 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -11,14 +11,19 @@
  
  #include "postgres.h"
  
+#include "access/heapam.h"

+#include "access/htup_details.h"
  #include "catalog/binary_upgrade.h"
+#include "catalog/indexing.h"
  #include "catalog/namespace.h"
  #include "catalog/pg_type.h"
  #include "commands/extension.h"
  #include "miscadmin.h"
  #include "utils/array.h"
  #include "utils/builtins.h"
-
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"


Seems to delete a newline.



Will fix





  #define CHECK_IS_BINARY_UPGRADE   
\
  do {  
\
@@ -192,3 +197,66 @@ binary_upgrade_set_record_init_privs(PG_FUNCTION_ARGS)
  
  	PG_RETURN_VOID();

  }
+
+Datum
+binary_upgrade_set_missing_value(PG_FUNCTION_ARGS)
+{
+   Oid         table_id = PG_GETARG_OID(0);
+   text       *attname = PG_GETARG_TEXT_P(1);
+   text       *value = PG_GETARG_TEXT_P(2);
+   Datum       valuesAtt[Natts_pg_attribute];
+   bool        nullsAtt[Natts_pg_attribute];
+   bool        replacesAtt[Natts_pg_attribute];
+   Datum       missingval;
+   Form_pg_attribute attStruct;
+   Relation    attrrel;
+   HeapTuple   atttup, newtup;
+   Oid         inputfunc, inputparam;
+   char       *cattname = text_to_cstring(attname);
+   char       *cvalue = text_to_cstring(value);
+
+   CHECK_IS_BINARY_UPGRADE;
+
+   /* Lock the attribute row and get the data */
+   attrrel = heap_open(AttributeRelationId, RowExclusiveLock);

Maybe I'm being paranoid, but I feel like the relation should also be
locked here.



I wondered about that. Other opinions?






+   atttup = SearchSysCacheAttName(table_id,cattname);

Missing space before 'cattname'.



Will fix.




+   if (!HeapTupleIsValid(atttup))
+       elog(ERROR, "cache lookup failed for attribute %s of relation %u",
+            cattname, table_id);
+   attStruct = (Form_pg_attribute) GETSTRUCT(atttup);
+
+   /* get an array value from the value string */
+
+   /* find the io info for an arbitrary array type to get array_in Oid */
+   getTypeInputInfo(INT4ARRAYOID, &inputfunc, &inputparam);

Maybe I'm confused, but why INT4ARRAYOID? If you're just looking for the
oid of array_in, why not use F_ARRAY_IN?



Brain fade? Will fix.





+   missingval = OidFunctionCall3(
+   inputfunc, /* array_in */
+   CStringGetDatum(cvalue),
+   ObjectIdGetDatum(attStruct->atttypid),
+   Int32GetDatum(attStruct->atttypmod));
+
+   /* update the tuple - set atthasmissing and attmissingval */
+   MemSet(valuesAtt, 0, sizeof(valuesAtt));
+   MemSet(nullsAtt, false, sizeof(nullsAtt));
+   MemSet(replacesAtt, false, sizeof(replacesAtt));
+
+   valuesAtt[Anum_pg_attribute_atthasmissing - 1] = BoolGetDatum(true);
+   replacesAtt[Anum_pg_attribute_atthasmissing - 1] = true;
+   valuesAtt[Anum_pg_attribute_attmissingval - 1] = missingval;
+   replacesAtt[Anum_pg_attribute_attmissingval - 1] = true;
+
+   newtup = heap_modify_tuple(atttup, RelationGetDescr(attrrel),
+                              valuesAtt, nullsAtt, replacesAtt);
+   CatalogTupleUpdate(attrrel, &newtup->t_self, newtup);
+   /* clean up */
+   heap_freetuple(newtup);
+   ReleaseSysCache(atttup);
+   pfree(cattname);
+   pfree(cvalue);
+   pfree(DatumGetPointer(missingval));

Is this worth bothering with (except the ReleaseSysCache)? We'll clean
up via memory context soon, no?





Done from an abundance of caution. I'll remove them.


Thanks for the quick review.

Attached deals with all those issues except locking.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/utils/adt/pg_upgrade_support.c b/src/backend/utils/adt/pg_upgrade_support.c
index 0c54b02..aed85a3 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -11,13 +11,20 @@
 
 #include "postgres.h"
 
+#include "access/heapam.h"
+#include "access/htup_details.h"
 #include "catalog/binary_upgrade.h"
+#include "catalog/indexing.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
 #include "commands/extension.h"
 #include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include 

Re: PATCH: backtraces for error messages

2018-06-20 Thread Daniel Gustafsson
> On 20 Jun 2018, at 17:10, Craig Ringer  wrote:

> BTW, Álvaro posted a simpler patch at 
> https://www.postgresql.org/message-id/20180410213203.nl645o5vj5ap66sl@alvherre.pgsql.
>  It uses backtrace() from glibc, and requires each site you want to bt to be 
> annotated explicitly. I forgot about backtrace() completely when I wrote 
> mine, and I prefer the control libunwind gives me anyway, but the reduced 
> dependency would be nice. Especially since backtrace() is in FreeBSD too.

Just as a datapoint regarding this; backtrace() is not available in OpenBSD,
but there is a library in ports which implements it, libexecinfo.

cheers ./daniel


Re: Avoiding Tablespace path collision for primary and standby

2018-06-20 Thread Ashwin Agrawal
On Wed, Jun 20, 2018 at 9:39 AM Bruce Momjian  wrote:

> On Fri, May 25, 2018 at 02:17:23PM -0700, Ashwin Agrawal wrote:
> >
> > On Fri, May 25, 2018 at 7:33 AM, Tom Lane  wrote:
> >
> > Ashwin Agrawal  writes:
> > > Proposing to create directory with timestamp at time of creating
> > tablespace
> > > and create symbolic link to it instead.
> >
> > I'm skeptical that this solves your problem.  What happens when the
> CREATE
> > TABLESPACE command is replicated to the standby with sub-second
> delay?
> >
> >
> > I thought timestamps have micro-second precision. Are we expecting
> tablespace
> > to be created, wal logged, streamed, and replayed on mirror in
> micro-second ?
>
> I didn't see anyone answer your question above.  We don't expect
> micro-second replay, but clock skew, which Tom Lane mentioned, could make
> it appear to be a micro-second replay.
>

Thanks Bruce for answering. Though I still don't see why clock skew is a
problem here. As I understand it, clock skew only happens across machines;
on the same machine, why would it be an issue? The collision problem only
exists on the same machine anyway; on different machines the paths don't
collide, so clock skew there is not a problem. (I understand there may be
reservations about putting a timestamp in the directory path, but the clock
skew argument is not clear to me.)


Re: line numbers in error messages are off wrt debuggers

2018-06-20 Thread Tom Lane
Andres Freund  writes:
> Seems it can be hacked around: https://stackoverflow.com/a/13497879
> But yikes, that's ugly.

I bet it's not very portable either.  I'm not sure that all compilers
will think that HANDLEARGS should be expanded in that situation.

regards, tom lane



Re: line numbers in error messages are off wrt debuggers

2018-06-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-06-20 12:12:26 -0400, Tom Lane wrote:
>> Personally, my habit is to set the breakpoint at errfinish, which works
>> independently of just where the call is.

> I do that too, occasionally. Doesn't really help if there are a lot of
> error messages before the bug you're waiting for happens, though.

Yeah.  In principle you could make the breakpoint conditional, looking
at the saved line number in the error stack, which *would* match the
log data.  There'd be a small chance of false matches (same line # in
another file), but that'd filter it pretty well most times.

regards, tom lane



Re: PATCH: backtraces for error messages

2018-06-20 Thread Andres Freund
On 2018-06-20 13:10:57 -0400, Robert Haas wrote:
> On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund  wrote:
> > If we instead had a backtrace enabled for all PANICs and some FATALs by
> > default (and perhaps a SIGSEGV handler too), we'd be in a better
> > place. That'd obviously only work when compiled with support for
> > libraries, on platforms where we added support for that. But I think
> > that'd be quite the improvement already.
> 
> I think doing it on PANIC would be phenomenal.  SIGSEGV would be great
> if we can make it safe enough, which I'm not sure about, but then I
> suppose we're crashing anyway.

Yea, I think that's pretty much why it'd be ok.


> Instead of making the ERROR behavior conditional on
> log_error_verbosity as Craig has it now, how about doing it whenever
> the error code is ERRCODE_INTERNAL_ERROR?  That's pretty much the
> cases that aren't supposed to happen, so if we see those happening a
> lot, it's either a bug we need to fix or we should supply a better
> error code.  Also, a lot of those messages are duplicated in many
> places and/or occur inside fairly generic functions inside
> lsyscache.c, so the actual error message text tends not to be enough
> to know what happened.

I don't think that's ok. It's perfectly possible to hit
ERRCODE_INTERNAL_ERROR at a high frequency in some situations, and
there are plenty of cases that aren't ERRCODE_INTERNAL_ERROR where we'd
want this. E.g. a lot of generic filesystem errors have
errcode_for_file_access(), but their messages are too generic to debug.
So I think we'll just need a separate GUC for things that aren't PANIC
and haven't explicitly opted in.

Greetings,

Andres Freund



Re: PATCH: backtraces for error messages

2018-06-20 Thread Robert Haas
On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund  wrote:
> I think the problem is that this most frequently is an issue when
> investigating an issue for customers. Often enough it'll legally not be
> possible to gain direct access to the system, and remotely instructing
> people to install an extension and configure it isn't great.  So if we
> could, by default, include something better for PANICs etc, it'd be a
> great boon - I think that's even clear from conversations on the pg
> lists (which often will be the more knowledgeable people: how often did
> we try hard to get a backtrace from a crash?)

+1 to all of that.  This is a real nuisance, and making it less of a
nuisance would be great.

> If we instead had a backtrace enabled for all PANICs and some FATALs by
> default (and perhaps a SIGSEGV handler too), we'd be in a better
> place. That'd obviously only work when compiled with support for
> libraries, on platforms where we added support for that. But I think
> that'd be quite the improvement already.

I think doing it on PANIC would be phenomenal.  SIGSEGV would be great
if we can make it safe enough, which I'm not sure about, but then I
suppose we're crashing anyway.  Instead of making the ERROR behavior
conditional on log_error_verbosity as Craig has it now, how about
doing it whenever the error code is ERRCODE_INTERNAL_ERROR?  That's
pretty much the cases that aren't supposed to happen, so if we see
those happening a lot, it's either a bug we need to fix or we should
supply a better error code.  Also, a lot of those messages are
duplicated in many places and/or occur inside fairly generic functions
inside lsyscache.c, so the actual error message text tends not to be
enough to know what happened.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: XML/XPath issues: text/CDATA in XMLTABLE, XPath evaluated with wrong context

2018-06-20 Thread Alvaro Herrera
Hi

On 2018-Mar-27, Markus Winand wrote:

> I’ve found two issues with XML/XPath integration in PostgreSQL. Two
> patches are attached. I’ve just separated them because the second one
> is an incompatible change.
> 
> * Patch 0001: Accept TEXT and CDATA nodes in XMLTABLE’s column_expression.

I pushed 0001 to REL_10_STABLE and master -- since there is no
incompatible behavior change.  (Well, there is a behavior change, but
it's pretty hard to imagine someone would be relying on the bogus old
output.)

I'll see about 0002 now -- only to master, as discussed.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fast default stuff versus pg_upgrade

2018-06-20 Thread Andres Freund


Hi,

Just a quick scan-through review:

On 2018-06-20 12:51:07 -0400, Andrew Dunstan wrote:
> diff --git a/src/backend/utils/adt/pg_upgrade_support.c b/src/backend/utils/adt/pg_upgrade_support.c
> index 0c54b02..2666ab2 100644
> --- a/src/backend/utils/adt/pg_upgrade_support.c
> +++ b/src/backend/utils/adt/pg_upgrade_support.c
> @@ -11,14 +11,19 @@
>  
>  #include "postgres.h"
>  
> +#include "access/heapam.h"
> +#include "access/htup_details.h"
>  #include "catalog/binary_upgrade.h"
> +#include "catalog/indexing.h"
>  #include "catalog/namespace.h"
>  #include "catalog/pg_type.h"
>  #include "commands/extension.h"
>  #include "miscadmin.h"
>  #include "utils/array.h"
>  #include "utils/builtins.h"
> -
> +#include "utils/lsyscache.h"
> +#include "utils/rel.h"
> +#include "utils/syscache.h"
>

Seems to delete a newline.


>  #define CHECK_IS_BINARY_UPGRADE  
> \
>  do { 
> \
> @@ -192,3 +197,66 @@ binary_upgrade_set_record_init_privs(PG_FUNCTION_ARGS)
>  
>   PG_RETURN_VOID();
>  }
> +
> +Datum
> +binary_upgrade_set_missing_value(PG_FUNCTION_ARGS)
> +{
> +	Oid			table_id = PG_GETARG_OID(0);
> +	text	   *attname = PG_GETARG_TEXT_P(1);
> +	text	   *value = PG_GETARG_TEXT_P(2);
> +	Datum		valuesAtt[Natts_pg_attribute];
> +	bool		nullsAtt[Natts_pg_attribute];
> +	bool		replacesAtt[Natts_pg_attribute];
> +	Datum		missingval;
> +	Form_pg_attribute attStruct;
> +	Relation	attrrel;
> +	HeapTuple	atttup, newtup;
> +	Oid			inputfunc, inputparam;
> +	char	   *cattname = text_to_cstring(attname);
> +	char	   *cvalue = text_to_cstring(value);
> +
> + CHECK_IS_BINARY_UPGRADE;
> +
> + /* Lock the attribute row and get the data */
> + attrrel = heap_open(AttributeRelationId, RowExclusiveLock);

Maybe I'm being paranoid, but I feel like the relation should also be
locked here.


> + atttup = SearchSysCacheAttName(table_id,cattname);

Missing space before 'cattname'.


> + if (!HeapTupleIsValid(atttup))
> + 		elog(ERROR, "cache lookup failed for attribute %s of relation %u",
> + 			 cattname, table_id);
> + attStruct = (Form_pg_attribute) GETSTRUCT(atttup);
> +
> + /* get an array value from the value string */
> +
> + /* find the io info for an arbitrary array type to get array_in Oid */
> +	getTypeInputInfo(INT4ARRAYOID, &inputfunc, &inputparam);

Maybe I'm confused, but why INT4ARRAYOID? If you're just looking for the
oid of array_in, why not use F_ARRAY_IN?


> + missingval = OidFunctionCall3(
> + inputfunc, /* array_in */
> + CStringGetDatum(cvalue),
> + ObjectIdGetDatum(attStruct->atttypid),
> + Int32GetDatum(attStruct->atttypmod));
> +
> + /* update the tuple - set atthasmissing and attmissingval */
> + MemSet(valuesAtt, 0, sizeof(valuesAtt));
> + MemSet(nullsAtt, false, sizeof(nullsAtt));
> + MemSet(replacesAtt, false, sizeof(replacesAtt));
> +
> + valuesAtt[Anum_pg_attribute_atthasmissing - 1] = BoolGetDatum(true);
> + replacesAtt[Anum_pg_attribute_atthasmissing - 1] = true;
> + valuesAtt[Anum_pg_attribute_attmissingval - 1] = missingval;
> + replacesAtt[Anum_pg_attribute_attmissingval - 1] = true;
> +
> + 	newtup = heap_modify_tuple(atttup, RelationGetDescr(attrrel),
> + 							   valuesAtt, nullsAtt, replacesAtt);
> +	CatalogTupleUpdate(attrrel, &newtup->t_self, newtup);

> + /* clean up */
> + heap_freetuple(newtup); 
> + ReleaseSysCache(atttup);
> + pfree(cattname);
> + pfree(cvalue);
> + pfree(DatumGetPointer(missingval));

Is this worth bothering with (except the ReleaseSysCache)? We'll clean
up via memory context soon, no?

Greetings,

Andres Freund



Re: AtEOXact_ApplyLauncher() and subtransactions

2018-06-20 Thread Amit Khandekar
On 18 June 2018 at 15:02, Amit Khandekar  wrote:
> On 16 June 2018 at 00:03, Amit Khandekar  wrote:
>> The way I am implementing this can be seen in attached
>> apply_launcher_subtrans_WIP.patch. (check launcher.c changes). I
>> haven't started testing it yet though.
>
> Attached patch passes some basic testing I did. Will do some more
> testing, and some self-review and code organising, if required.

Done. Attached is v2 version of the patch. Comments welcome.

Changed GetSubscriptionRelations() to GetSubscriptionRelids(), which
now returns only the oids, not the subrel states. This was convenient
for storing the exact returned list into the committed subscription
rels. And anyways the subrel states were not used anywhere.

> I will also split the patch into two : one containing the main issue
> regarding subtransaction, and the other containing the other issue I
> mentioned earlier that shows up without subtransaction as well.

Did not split the patch. The changes for the other issue that shows up
without subtransaction are all in subscriptioncmds.c , and that file
contains mostly the changes for this issue. So kept it as a single
patch. But if it gets inconvenient for someone while reviewing, I will
be happy to split it.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


apply_launcher_subtrans_v2.patch
Description: Binary data


Re: Fast default stuff versus pg_upgrade

2018-06-20 Thread Andrew Dunstan



On 06/19/2018 10:46 PM, Andres Freund wrote:

> On 2018-06-19 22:41:26 -0400, Andrew Dunstan wrote:
>> This unfortunately crashes and burns if we use DirectFunctionCall3 to call
>> array_in, because it uses fn_extra. There is the CallerFInfoFunctionCall
>> stuff, but it only has 1 and 2 arg variants, and array_in takes 3. In
>> retrospect we should probably have added a 3 arg form - quite a few input
>> functions take 3 args. Anything else is likely to be rather uglier.
>>
>> Attaching the failing patch. I'll attack this again in the morning.
>
> Why don't you just use OidFunctionCall3? Or simply an explicit
> fmgr_info(), InitFunctionCallInfoData(), FunctionCallInvoke() combo?


Thanks for that. I should not code late at night.

Here's a version that works in my testing with Tom's patch (making sure
there's a missing value to migrate) applied. Thanks to Alvaro for some
useful criticism - any errors are of course my responsibility.



cheers

andrew

--
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/utils/adt/pg_upgrade_support.c b/src/backend/utils/adt/pg_upgrade_support.c
index 0c54b02..2666ab2 100644
--- a/src/backend/utils/adt/pg_upgrade_support.c
+++ b/src/backend/utils/adt/pg_upgrade_support.c
@@ -11,14 +11,19 @@
 
 #include "postgres.h"
 
+#include "access/heapam.h"
+#include "access/htup_details.h"
 #include "catalog/binary_upgrade.h"
+#include "catalog/indexing.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
 #include "commands/extension.h"
 #include "miscadmin.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
-
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
 
 #define CHECK_IS_BINARY_UPGRADE	\
 do {			\
@@ -192,3 +197,66 @@ binary_upgrade_set_record_init_privs(PG_FUNCTION_ARGS)
 
 	PG_RETURN_VOID();
 }
+
+Datum
+binary_upgrade_set_missing_value(PG_FUNCTION_ARGS)
+{
+	Oid		table_id = PG_GETARG_OID(0);
+	text	   *attname = PG_GETARG_TEXT_P(1);
+	text	   *value = PG_GETARG_TEXT_P(2);
+	Datum		valuesAtt[Natts_pg_attribute];
+	bool		nullsAtt[Natts_pg_attribute];
+	bool		replacesAtt[Natts_pg_attribute];
+	Datum		missingval;
+	Form_pg_attribute attStruct;
+	Relation	attrrel;
+	HeapTuple	atttup, newtup;
+	Oid			inputfunc, inputparam;
+	char	   *cattname = text_to_cstring(attname);
+	char	   *cvalue = text_to_cstring(value);
+
+	CHECK_IS_BINARY_UPGRADE;
+
+	/* Lock the attribute row and get the data */
+	attrrel = heap_open(AttributeRelationId, RowExclusiveLock);
+	atttup = SearchSysCacheAttName(table_id,cattname);
+	if (!HeapTupleIsValid(atttup))
+		elog(ERROR, "cache lookup failed for attribute %s of relation %u",
+			 cattname, table_id);
+	attStruct = (Form_pg_attribute) GETSTRUCT(atttup);
+
+	/* get an array value from the value string */
+
+	/* find the io info for an arbitrary array type to get array_in Oid */
+	getTypeInputInfo(INT4ARRAYOID, &inputfunc, &inputparam);
+	missingval = OidFunctionCall3(
+		inputfunc, /* array_in */
+		CStringGetDatum(cvalue),
+		ObjectIdGetDatum(attStruct->atttypid),
+		Int32GetDatum(attStruct->atttypmod));
+
+	/* update the tuple - set atthasmissing and attmissingval */
+	MemSet(valuesAtt, 0, sizeof(valuesAtt));
+	MemSet(nullsAtt, false, sizeof(nullsAtt));
+	MemSet(replacesAtt, false, sizeof(replacesAtt));
+
+	valuesAtt[Anum_pg_attribute_atthasmissing - 1] = BoolGetDatum(true);
+	replacesAtt[Anum_pg_attribute_atthasmissing - 1] = true;
+	valuesAtt[Anum_pg_attribute_attmissingval - 1] = missingval;
+	replacesAtt[Anum_pg_attribute_attmissingval - 1] = true;
+
+	newtup = heap_modify_tuple(atttup, RelationGetDescr(attrrel),
+			   valuesAtt, nullsAtt, replacesAtt);
+	CatalogTupleUpdate(attrrel, &newtup->t_self, newtup);
+
+	/* clean up */
+	heap_freetuple(newtup); 
+	ReleaseSysCache(atttup);
+	pfree(cattname);
+	pfree(cvalue);
+	pfree(DatumGetPointer(missingval));
+
+	heap_close(attrrel, RowExclusiveLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ea2f022..22be3c2 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8095,6 +8095,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 	int			i_typstorage;
 	int			i_attnotnull;
 	int			i_atthasdef;
+	int			i_atthasmissing;
 	int			i_attidentity;
 	int			i_attisdropped;
 	int			i_attlen;
@@ -8103,6 +8104,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 	int			i_attoptions;
 	int			i_attcollation;
 	int			i_attfdwoptions;
+	int			i_attmissingval;
 	PGresult   *res;
 	int			ntups;
 	bool		hasdefaults;
@@ -8132,7 +8134,33 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 
 		resetPQExpBuffer(q);
 
-		if (fout->remoteVersion >= 10)
+		if (fout->remoteVersion >= 11)
+		{
+			/* atthasmissing and attmissingval are new in 11 */
+			appendPQExpBuffer(q, "SELECT a.attnum, a.attname, a.atttypmod, "
+			  "a.attstattarget, 

Re: Avoiding Tablespace path collision for primary and standby

2018-06-20 Thread Andres Freund
Hi,


On 2018-05-26 10:08:57 -0400, Tom Lane wrote:
> Not sure about the relative-path idea.  Seems like that would create
> a huge temptation to put tablespaces inside the data directory, which
> would force us to deal with that can of worms.

It doesn't seem impossible to normalize the path, and then check for that.


> Also, to the extent that people use tablespaces for what they're
> actually meant to be used for (ie, putting some stuff into a different
> filesystem), I can't see a relative path being helpful.  Admins don't
> go mounting disks at random places in the filesystem tree.

I'm not convinced by that argument. It can certainly make sense to mount
several filesystems relative to a subdirectory. And then there's the
case we're talking about, where you have primary/standby on a single
system. It's not like we'd *force* relative tablespaces...

Greetings,

Andres Freund



Re: line numbers in error messages are off wrt debuggers

2018-06-20 Thread Andres Freund
On 2018-06-20 12:12:26 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I'm fairly frequently annoyed that when I see an error message in the
> > log, I can't just generally set a breakpoint on the included line
> > number. That's because the line number in the error message is from the
> > *end* of the message:
> 
> IME it varies depending on which compiler you use; some report the line
> where the ereport is.  So I'm doubtful that there's going to be much we
> can do about it.  Probably this behavior is bound up with macro expansion
> vs. just when __LINE__ is expanded.

Seems it can be hacked around: https://stackoverflow.com/a/13497879

But yikes, that's ugly.

There's apparenty some discussion in the standard committees trying to
unify the behaviour across compilers, but ...

Greetings,

Andres Freund



Re: Avoiding Tablespace path collision for primary and standby

2018-06-20 Thread Bruce Momjian
On Fri, May 25, 2018 at 02:17:23PM -0700, Ashwin Agrawal wrote:
> 
> On Fri, May 25, 2018 at 7:33 AM, Tom Lane  wrote:
> 
> Ashwin Agrawal  writes:
> > Proposing to create directory with timestamp at time of creating
> > tablespace and create symbolic link to it instead.
> 
> I'm skeptical that this solves your problem.  What happens when the CREATE
> TABLESPACE command is replicated to the standby with sub-second delay?
> 
> 
> I thought timestamps have micro-second precision. Are we expecting tablespace
> to be created, wal logged, streamed, and replayed on mirror in micro-seconds?

I didn't see anyone answer your question above.  We don't expect
micro-second replay, but clock skew, which Tom Lane mentioned, could make
it appear to be a micro-second replay.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: line numbers in error messages are off wrt debuggers

2018-06-20 Thread Andres Freund
Hi,

On 2018-06-20 12:12:26 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I'm fairly frequently annoyed that when I see an error message in the
> > log, I can't just generally set a breakpoint on the included line
> > number. That's because the line number in the error message is from the
> > *end* of the message:
> 
> IME it varies depending on which compiler you use; some report the line
> where the ereport is.  So I'm doubtful that there's going to be much we
> can do about it.  Probably this behavior is bound up with macro expansion
> vs. just when __LINE__ is expanded.
> 
> (No, I won't hold still for a coding requirement that all ereport calls
> be smashed onto a single line ;-))

Hah, me neither ;)


> Personally, my habit is to set the breakpoint at errfinish, which works
> independently of just where the call is.

I do that too, occasionally. Doesn't really help if there are a lot of
error messages before the bug you're waiting for happens, though.

Greetings,

Andres Freund



Re: pgsql: Store 2PC GID in commit/abort WAL recs for logical decoding

2018-06-20 Thread Alvaro Herrera
On 2018-Jun-15, Alvaro Herrera wrote:

> By the way, why do we need to strlen() the target buffer when strlcpy
> already reports the length?  (You could argue that there is a difference
> if the string is truncated ... but surely we don't care about that case)
> I propose the attached.

I decided not to push this after all.  Yes, one strlen is saved, but
there is some code clarity lost also, and this is certainly not a
contention point.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: line numbers in error messages are off wrt debuggers

2018-06-20 Thread Tom Lane
Andres Freund  writes:
> I'm fairly frequently annoyed that when I see an error message in the
> log, I can't just generally set a breakpoint on the included line
> number. That's because the line number in the error message is from the
> *end* of the message:

IME it varies depending on which compiler you use; some report the line
where the ereport is.  So I'm doubtful that there's going to be much we
can do about it.  Probably this behavior is bound up with macro expansion
vs. just when __LINE__ is expanded.

(No, I won't hold still for a coding requirement that all ereport calls
be smashed onto a single line ;-))

Personally, my habit is to set the breakpoint at errfinish, which works
independently of just where the call is.

regards, tom lane



Re: PATCH: backtraces for error messages

2018-06-20 Thread Andres Freund
On 2018-06-20 12:04:51 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
> >> I have no idea how expensive backtrace() and libunwind are, but I doubt
> >> we want to pay the cost for every message before we know that error
> >> requires the backtrace to be collected.  Something like PGC_USERSET
> >> server_min_backtraces=PANIC 
> >> might be a possible interface.
> 
> > Yes, most definitely. We can't do this everywhere. It's quite expensive
> > to collect / build them.  So I think we'd probably need another guc that
> > controls when the information is collected, perhaps defaulting to PANIC.
> 
> The cost is a problem, and the dependencies on various additional
> libraries are too.  I wonder whether anything could be gained by
> putting this stuff in an extension?  Then we could have different
> extensions for different backtrace libraries, and loading the extension
> or not would be one avenue to control whether you were paying the cost.
> (Though some control GUCs might be needed anyway.)

I think the problem is that this most frequently is an issue when
investigating an issue for customers. Often enough it'll legally not be
possible to gain direct access to the system, and remotely instructing
people to install an extension and configure it isn't great.  So if we
could, by default, include something better for PANICs etc, it'd be a
great boon - I think that's even clear from conversations on the pg
lists (which often will be the more knowledgeable people: how often did
we try hard to get a backtrace from a crash?)

If we instead had a backtrace enabled for all PANICs and some FATALs by
default (and perhaps a SIGSEGV handler too), we'd be in a better
place. That'd obviously only work when compiled with support for
libraries, on platforms where we added support for that. But I think
that'd be quite the improvement already.

Greetings,

Andres Freund



Re: PATCH: backtraces for error messages

2018-06-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
>> I have no idea how expensive backtrace() and libunwind are, but I doubt
>> we want to pay the cost for every message before we know that error
>> requires the backtrace to be collected.  Something like PGC_USERSET
>> server_min_backtraces=PANIC 
>> might be a possible interface.

> Yes, most definitely. We can't do this everywhere. It's quite expensive
> to collect / build them.  So I think we'd probably need another guc that
> controls when the information is collected, perhaps defaulting to PANIC.

The cost is a problem, and the dependencies on various additional
libraries are too.  I wonder whether anything could be gained by
putting this stuff in an extension?  Then we could have different
extensions for different backtrace libraries, and loading the extension
or not would be one avenue to control whether you were paying the cost.
(Though some control GUCs might be needed anyway.)

regards, tom lane



line numbers in error messages are off wrt debuggers

2018-06-20 Thread Andres Freund
Hi,

I'm fairly frequently annoyed that when I see an error message in the
log, I can't just generally set a breakpoint on the included line
number. That's because the line number in the error message is from the
*end* of the message:

2018-06-20 09:02:39.226 PDT [21145][3/2] LOG:  0: statement: SELECT 1;
2018-06-20 09:02:39.226 PDT [21145][3/2] LOCATION:  exec_simple_query, 
postgres.c:952

corresponds to

ereport(LOG,
(errmsg("statement: %s", query_string),
 errhidestmt(true),
 errdetail_execute(parsetree_list)));

with 952 being the line with the semicolon.

Are others bothered by this?

If so, does anybody have a handle how we could get a more useful line
number out of the preprocessor?

Greetings,

Andres Freund



Re: ERROR: ORDER/GROUP BY expression not found in targetlist

2018-06-20 Thread Tom Lane
Rajkumar Raghuwanshi  writes:
> Below test case is failing with ERROR:  ORDER/GROUP BY expression not found
> in targetlist. this issue is reproducible from PGv10.

Thanks for the report.  I traced through this, and the problem seems to
be that split_pathtarget_at_srfs() is neglecting to attach sortgroupref
labeling to the extra PathTargets it constructs.  I don't remember
whether that code is my fault or Andres', but I'll take a look at
fixing it.

regards, tom lane



Re: PATCH: backtraces for error messages

2018-06-20 Thread Andres Freund
Hi,

On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
> > I recently needed a way to get backtraces from errors in a convenient,
> > non-interactive and indiscriminate way. The attached patch is the result.
> > It teaches Pg to use libunwind to self-backtrace in elog/ereport.
> > 
> > Anyone think this is useful/interesting/worth pursuing?

Generally interesting, yes.


> I think we sorely need some mechanism to optionally get backtraces in
> error messages.  I think having backtraces in all messages is definitely
> not a good thing, but requiring an explicit marker (such as in my patch)
> means the code has to be recompiled, which is not a solution in
> production systems.  I kind of lean towards your approach, but it has to be
> something that's easily enabled/disabled at runtime.

> I have no idea how expensive backtrace() and libunwind are, but I doubt
> we want to pay the cost for every message before we know that error
> requires the backtrace to be collected.  Something like PGC_USERSET
>   server_min_backtraces=PANIC 
> might be a possible interface.

Yes, most definitely. We can't do this everywhere. It's quite expensive
to collect / build them.  So I think we'd probably need another guc that
controls when the information is collected, perhaps defaulting to PANIC.

Greetings,

Andres Freund



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

2018-06-20 Thread Andres Freund
Hi,

On 2018-06-20 15:05:59 +0300, Sergey Burladyan wrote:
> Andres Freund  writes:
> 
> > You should first make sure it's actually this problem - which tables are
> > holding back the xmin horizon?  After that, yes, deleting the
> > global/pg_internal.init file is the way to go, and I can't think of a
> > case where it's problematic, even without stopping the server.
> 
> Thanks for clarification! I also have this problem; BTW, autovacuum did
> not work at all:
> # select max(last_autovacuum) from pg_stat_user_tables;
>   max
> ---
>  2018-06-06 00:48:47.813841+03
> 
> all its workers stopped with these messages:
> ERROR: found xmin 982973690 from before relfrozenxid 2702858737
> CONTEXT: automatic vacuum of table "avito_delta.pg_catalog.pg_authid"
> ERROR: found xmin 982973690 from before relfrozenxid 2702858761
> CONTEXT: automatic vacuum of table "avito_delta.pg_catalog.pg_auth_members"
> 
> and it does not try to vacuum other tables.

Yea, that's this bug.  I'd remove global/pg_internal.init, reconnect,
and manually vacuum.

Greetings,

Andres Freund



Re: PATCH: backtraces for error messages

2018-06-20 Thread Alexander Kuzmenkov

On 06/20/2018 06:10 PM, Craig Ringer wrote:


> I recently needed a way to get backtraces from errors in a convenient,
> non-interactive and indiscriminate way. The attached patch is the
> result. It teaches Pg to use libunwind to self-backtrace in elog/ereport.
>
> Anyone think this is useful/interesting/worth pursuing?



This would be a great thing to have. I often need to add stack traces to 
the logs, and for now I just link with libunwind and add some ad-hoc 
function to grab the traces. Having libunwind support in core would make 
this simpler.



--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: partition -> partitioned

2018-06-20 Thread Alvaro Herrera
On 2018-Jun-19, Amit Langote wrote:

> Noticed that the relevant code changed, so I rebased the patch.  Also,
> made a minor update to a nearby comment.

Pushed, thanks.  I made a couple of comments one or two words shorter
while (IMO) not losing clarity.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: backtraces for error messages

2018-06-20 Thread Alvaro Herrera
On 2018-Jun-20, Craig Ringer wrote:

> Hi folks
> 
> I recently needed a way to get backtraces from errors in a convenient,
> non-interactive and indiscriminate way. The attached patch is the result.
> It teaches Pg to use libunwind to self-backtrace in elog/ereport.
> 
> Anyone think this is useful/interesting/worth pursuing?

I think we sorely need some mechanism to optionally get backtraces in
error messages.  I think having backtraces in all messages is definitely
not a good thing, but requiring an explicit marker (such as in my patch)
means the code has to be recompiled, which is not a solution in
production systems.  I kind of lean towards your approach, but it has to be
something that's easily enabled/disabled at runtime.

I have no idea how expensive backtrace() and libunwind are, but I doubt
we want to pay the cost for every message before we know that error
requires the backtrace to be collected.  Something like PGC_USERSET
  server_min_backtraces=PANIC 
might be a possible interface.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Keeping temporary tables in shared buffers

2018-06-20 Thread Bruce Momjian
On Sat, Jun  2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
> Hi Amit
> 
> On Mon, May 28, 2018 at 4:25 AM, Amit Kapila  wrote:
> >
> > This is one way, but I think there are other choices as well.  We can
> > identify and flush all the dirty (local) buffers for the relation
> > being accessed parallelly.  Now, once the parallel operation is
> > started, we won't allow performing any write operation on them.  It
> 
> We talked about this in person in Ottawa and it was great meeting you!
>  To summarize, the above proposal to continue using local buffers for
> temp tables is a step forward, however, it enables only certain kinds
> of queries to be parallelized for temp tables.  E.g. queries changing
> a temp table in any way cannot be parallelized due to the restriction
> of no writes during parallel operation.

Should this be a TODO item?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: add default parallel query to v10 release notes? (Re: [PERFORM] performance drop after upgrade (9.6 > 10))

2018-06-20 Thread Bruce Momjian
On Thu, May 24, 2018 at 08:00:25PM -0500, Justin Pryzby wrote:
> Moving to -hackers;
> 
> On Sun, Jan 28, 2018 at 06:53:10PM -0500, Bruce Momjian wrote:
> > On Thu, Oct 26, 2017 at 02:45:15PM -0500, Justin Pryzby wrote:
> > > Is it because max_parallel_workers_per_gather now defaults to 2 ?
> > > 
> > > BTW, I would tentatively expect a change in default to be documented
> > > in the release notes but can't see that it is.
> > > 77cd477c4ba885cfa1ba67beaa82e06f2e182b85
> > 
> > Oops, you are correct.  The PG 10 release notes, which I wrote, should
> > have mentioned this.  :-(
> 
> I just saw your January response to my October mail..
> 
> Maybe it's silly to update PG10 notes 9 months after release..
> ..but, any reason not to add to v10 release notes now (I don't know if the web
> docs would be updated until the next point release?)

So I did some research on this, particularly to find out how it was
missed in the PG 10 release notes.  It turns out that
max_parallel_workers_per_gather has always defaulted to 2 in head, and
this was changed to default to 0 in the 9.6 branch:

commit f85b1a84152f7bf019fd7a2c5eede97867dcddbb
Author: Robert Haas 
Date:   Tue Aug 16 08:09:15 2016 -0400

Disable parallel query by default.

Per discussion, set the default value of 
max_parallel_workers_per_gather
to 0 in 9.6 only.  We'll leave it enabled in master so that it gets
more testing and in the hope that it can be enabled by default in
v10.

Therefore, there was no commit to find in the PG 10 commit logs.  :-O 
Not sure how we can avoid this kind of problem in the future.

The attached patch adds a PG 10.0 release note item about this change. 
I put it at the bottom since it is newly added.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/doc/src/sgml/release-10.sgml b/doc/src/sgml/release-10.sgml
new file mode 100644
index ea86b5e..5862cd8
*** a/doc/src/sgml/release-10.sgml
--- b/doc/src/sgml/release-10.sgml
*** Branch: REL_10_STABLE [5159626af] 2017-1
*** 4065,4070 
--- 4065,4081 
 

  
+   <listitem>
+    <para>
+     Enable parallelism by default by changing the default setting
+     of <xref linkend="guc-max-parallel-workers-per-gather"> to
+     <literal>2</literal>.
+    </para>
+   </listitem>
+
   
  
  


PATCH: backtraces for error messages

2018-06-20 Thread Craig Ringer
Hi folks

I recently needed a way to get backtraces from errors in a convenient,
non-interactive and indiscriminate way. The attached patch is the result.
It teaches Pg to use libunwind to self-backtrace in elog/ereport.

Anyone think this is useful/interesting/worth pursuing?

(Patch is currently against pg10, so this is a PoC only).

As written it emits a backtrace when log_error_verbosity=verbose or,
unconditionally, on PANIC. A bt can be hidden by errhidestack() but is
otherwise shown. That's ridiculously, excessively spammy, so it's not
viable for core as-is. Before playing with it too much I thought I'd ask
for ideas on if anyone thinks it's useful, and if so, how it'd work best.

My goal is to allow capture of extra diagnostic info from key locations in
production without needing to attach gdb. It's imperfect since sometimes
there's no convenient message, and other times you can't afford to set up
logging with enough detail. So this would be most useful when combined with
one of the occasionally discussed patches to allow for selective logging
verbosity on a module- or file- level. But I think it's still handy without
that.

I briefly looked into Windows too. Roughly the same approach could be used
to plug in dbghelp.dll support for Windows self-backtracing, it's just
rather uglier; see
https://jpassing.com/2008/03/12/walking-the-stack-of-the-current-thread/ .

BTW, Álvaro posted a simpler patch at
https://www.postgresql.org/message-id/20180410213203.nl645o5vj5ap66sl@alvherre.pgsql.
It uses backtrace() from glibc, and requires each site you want to bt to be
annotated explicitly. I forgot about backtrace() completely when I wrote
mine, and I prefer the control libunwind gives me anyway, but the reduced
dependency would be nice. Especially since backtrace() is in FreeBSD too.
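For reference, the glibc backtrace() approach that the simpler patch builds on takes only a few lines. This is an illustrative sketch, not code from either patch; the function name is made up, and symbol quality in the output depends on linking with -rdynamic and on frame pointers being available:

```c
#include <assert.h>
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_FRAMES 32

/*
 * Capture the current call stack with glibc's backtrace() and write one
 * line per frame to stderr.  Returns the number of frames captured.
 */
static int
emit_backtrace(void)
{
	void	   *frames[MAX_FRAMES];
	int			nframes = backtrace(frames, MAX_FRAMES);
	char	  **symbols = backtrace_symbols(frames, nframes);

	if (symbols != NULL)
	{
		for (int i = 0; i < nframes; i++)
			fprintf(stderr, "\t%s\n", symbols[i]);
		free(symbols);			/* backtrace_symbols() malloc()s one block */
	}
	return nframes;
}
```

Calling emit_backtrace() from an error-reporting path would append the stack of the reporting backend to the log entry; libunwind offers the same capability with finer control over frame inspection.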

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From d9ff6bd8112b31d087d8442f25ffd430df794771 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Wed, 20 Jun 2018 10:57:31 +0800
Subject: [PATCH v2] Support generating backtraces in logs using libunwind

---
 configure  | 349 -
 configure.in   |  22 +++
 src/Makefile.global.in |   6 +-
 src/backend/Makefile   |   4 +-
 src/backend/utils/error/elog.c | 112 +
 src/include/pg_config.h.in |   6 +
 src/include/pg_config_manual.h |   2 +-
 src/include/utils/elog.h   |  12 ++
 8 files changed, 399 insertions(+), 114 deletions(-)

diff --git a/configure b/configure
index 2d5375d51c..87a7d60f3f 100755
--- a/configure
+++ b/configure
@@ -655,6 +655,8 @@ LIBOBJS
 UUID_LIBS
 LDAP_LIBS_BE
 LDAP_LIBS_FE
+LIBUNWIND_LIBS
+LIBUNWIND_CFLAGS
 PTHREAD_CFLAGS
 PTHREAD_LIBS
 PTHREAD_CC
@@ -716,12 +718,12 @@ with_perl
 with_tcl
 ICU_LIBS
 ICU_CFLAGS
-PKG_CONFIG_LIBDIR
-PKG_CONFIG_PATH
-PKG_CONFIG
 with_icu
 enable_thread_safety
 INCLUDES
+PKG_CONFIG_LIBDIR
+PKG_CONFIG_PATH
+PKG_CONFIG
 autodepend
 TAS
 GCC
@@ -846,6 +848,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libunwind
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -868,7 +871,9 @@ PKG_CONFIG_LIBDIR
 ICU_CFLAGS
 ICU_LIBS
 LDFLAGS_EX
-LDFLAGS_SL'
+LDFLAGS_SL
+LIBUNWIND_CFLAGS
+LIBUNWIND_LIBS'
 
 
 # Initialize some variables set by options.
@@ -1543,6 +1548,7 @@ Optional Packages:
   --with-ossp-uuidobsolete spelling of --with-uuid=ossp
   --with-libxml   build with XML support
   --with-libxslt  use XSLT support when building contrib/xml2
+  --with-libunwinduse libunwind for enhanced error context stacks
   --with-system-tzdata=DIR
   use system time zone data in DIR
   --without-zlib  do not use Zlib
@@ -1566,6 +1572,10 @@ Some influential environment variables:
   ICU_LIBSlinker flags for ICU, overriding pkg-config
   LDFLAGS_EX  extra linker flags for linking executables only
   LDFLAGS_SL  extra linker flags for linking shared libraries only
+  LIBUNWIND_CFLAGS
+  override cflags used when building with libunwind
+  LIBUNWIND_LIBS
+  override libraries linked when building with libunwind
 
 Use these variables to override the choices made by `configure' or to help
 it to find libraries and programs with nonstandard names/locations.
@@ -5359,112 +5369,8 @@ fi
 
 
 #
-# Include directories
+# Look for pkg-config
 #
-ac_save_IFS=$IFS
-IFS="${IFS}${PATH_SEPARATOR}"
-# SRCH_INC comes from the template file
-for dir in $with_includes $SRCH_INC; do
-  if test -d "$dir"; then
-INCLUDES="$INCLUDES -I$dir"
-  else
-{ $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: *** Include directory $dir does not exist." >&5
-$as_echo "$as_me: WARNING: *** Include directory $dir does not exist." >&2;}
-  fi
-done
-IFS=$ac_save_IFS
-
-
-
-#
-# Library directories
-#
-ac_save_IFS=$IFS
-IFS="${IFS}${PATH_SEPARATOR}"
-# LIBRARY_DIRS comes from command line, SRCH_LIB from 

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-06-20 Thread Euler Taveira
2018-06-20 4:30 GMT-03:00 Haribabu Kommi :
> Attached is a simple patch with implementation. Comments?
>
Why don't you extend the existing function pg_stat_statements_reset()?


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento



Re: Loaded footgun open_datasync on Windows

2018-06-20 Thread Laurenz Albe
Kuntal Ghosh wrote:
[pg_test_fsync doesn't use pgwin32_open and hence doesn't respect O_DSYNC]
> On Wed, Jun 6, 2018 at 2:39 PM, Amit Kapila  wrote:
> > On Wed, Jun 6, 2018 at 10:18 AM, Michael Paquier  
> > wrote:
> > > On Wed, Jun 06, 2018 at 09:58:34AM +0530, Amit Kapila wrote:
> > > > One another alternative could be that we
> > > > define open as pgwin32_open (for WIN32) wherever we need it.
> > > 
> > > Which is what basically happens on any *nix platform, are you foreseeing
> > > anything bad here?
> > 
> > Nothing apparent, but I think we should try to find out why at the first
> > place this has been made backend specific.
> 
> It seems the "#ifndef FRONTEND" restriction was added around
> pgwin32_open() for building libpq with Visual C++ or Borland C++.  The
> restriction was added in commit 422d4819 to build libpq with VC++[1].
> Later, in commit fd7c3f67e0bc4, the support for Borland C++ was also
> added.
> 
> So, I'm not sure whether removing that restriction will work for all
> front-end modules.

Thanks for the research, and sorry for my long silence.

If I remove the "#ifndef FRONTEND", building with MSVC fails for all
call sites that use the two-argument version of open(2).

If I use the three-argument version throughout, which should be no problem,
PostgreSQL builds just fine with MSVC.
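As a sketch of why the two-argument calls break: once the three-argument macro applies to frontend code too, a two-argument open() no longer matches the macro's arity, while the extra mode argument is harmless on POSIX because it is only consulted when O_CREAT is set. Here demo_open is a hypothetical stand-in for pgwin32_open, not the real port code:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for pgwin32_open(); the real one lives in
 * src/port/ and has the signature (const char *, int, ...).
 */
static int
demo_open(const char *path, int fileFlags, int fileMode)
{
	/* On POSIX, the mode argument is consulted only when O_CREAT is set. */
	return open(path, fileFlags, fileMode);
}

/*
 * With "#define open(a,b,c) demo_open(a,b,c)" visible to frontend code,
 * a two-argument call like open(path, O_RDONLY) fails to preprocess,
 * since the three-parameter macro cannot absorb two arguments.  Passing
 * a dummy 0 mode restores the match and changes no behavior:
 */
static int
open_readonly(const char *path)
{
	return demo_open(path, O_RDONLY, 0);	/* mode ignored without O_CREAT */
}
```

This is why the patch converts every two-argument open() call site to the three-argument form rather than making the macro variadic.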

How about checking what the buildfarm thinks about the attached?

Yours,
Laurenz Albe
From 5e10792cf720cedd6ebaa67cfc6163b9c5d5f305 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Wed, 20 Jun 2018 15:54:20 +0200
Subject: [PATCH] Use pgwin32_open in frontend code on Windows

This is particularly important for pg_test_fsync which does not handle
O_DSYNC correctly otherwise, leading to false claims that disks are
unsafe.

All call sites that use the two-argument form of open(2) were changed
to the three-argument form to make the Windows build succeed.
---
 src/bin/pg_basebackup/pg_receivewal.c | 2 +-
 src/bin/pg_verify_checksums/pg_verify_checksums.c | 2 +-
 src/common/file_utils.c   | 2 +-
 src/include/port.h| 3 ---
 4 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index 071b32d19d..b97fce3c2b 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -287,7 +287,7 @@ FindStreamingStart(uint32 *tli)
 
 			snprintf(fullpath, sizeof(fullpath), "%s/%s", basedir, dirent->d_name);
 
-			fd = open(fullpath, O_RDONLY | PG_BINARY);
+			fd = open(fullpath, O_RDONLY | PG_BINARY, 0);
 			if (fd < 0)
 			{
 fprintf(stderr, _("%s: could not open compressed file \"%s\": %s\n"),
diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 845d5aba27..efb32ffcb9 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -82,7 +82,7 @@ scan_file(char *fn, int segmentno)
 	int			f;
 	int			blockno;
 
-	f = open(fn, 0);
+	f = open(fn, 0, 0);
 	if (f < 0)
 	{
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 48876061c3..8a4cdb47ff 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -283,7 +283,7 @@ fsync_fname(const char *fname, bool isdir, const char *progname)
 	 * unsupported operations, e.g. opening a directory under Windows), and
 	 * logging others.
 	 */
-	fd = open(fname, flags);
+	fd = open(fname, flags, 0);
 	if (fd < 0)
 	{
 		if (errno == EACCES || (isdir && errno == EISDIR))
diff --git a/src/include/port.h b/src/include/port.h
index 74a9dc4d4d..2d40045ad5 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -249,11 +249,8 @@ extern bool rmtree(const char *path, bool rmtopdir);
 #define		O_DIRECT	0x8000
 extern int	pgwin32_open(const char *, int,...);
 extern FILE *pgwin32_fopen(const char *, const char *);
-
-#ifndef FRONTEND
 #define		open(a,b,c) pgwin32_open(a,b,c)
 #define		fopen(a,b) pgwin32_fopen(a,b)
-#endif
 
 /*
  * Mingw-w64 headers #define popen and pclose to _popen and _pclose.  We want
-- 
2.14.4



Re: documentation is now XML

2018-06-20 Thread Tatsuo Ishii
>> It's unfortunate that we'll have to deal with different formats in the
>> supported branches for several more years, but we at Postgres Professional
>> are ready to accept any decision you make on this matter for now.
> 
> I had not thought of translations, but wouldn't all translations also
> need to switch from SGML to XML for back branches?  That would be a lot
> of work.

Regarding Japanese translations, we only do translations for the
latest stable branch (for now 10.x). I don't know about other language
translations though.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: documentation is now XML

2018-06-20 Thread Bruce Momjian
On Wed, Jun 20, 2018 at 04:46:24PM +0300, Alexander Lakhin wrote:
> 20.06.2018 15:03, Bruce Momjian wrote:
> 
> On Wed, Jun 20, 2018 at 10:48:47AM +0300, Liudmila Mantrova wrote:
> 
> It's unfortunate that we'll have to deal with different formats in the
> supported branches for several more years, but we at Postgres Professional
> are ready to accept any decision you make on this matter for now.
> 
> I had not thought of translations, but wouldn't all translations also
> need to switch from SGML to XML for back branches?  That would be a lot
> of work.
> 
> Hello Bruce,
> 
> I think there are two approaches here.
> 
> First, if translation is performed directly in sgml (say, we have
> adminpack_fr.sgml), we'd have to convert the existing translation files to XML
> anyway - otherwise we'll have to start translating version 11 from scratch. So
> some tools/scripts are needed to perform the conversion. I don't think it 
> would
> be difficult to use these tools to convert existing translation for version 10
> (and 9.6, 9.5...) to XML and to continue working with it. And after the
> conversion is done we could maintain all the translations in a consistent way,
> using a single tool.
> 
> Second, if translation is performed indirectly, via the PO (as we do, e.g. see
> [1]), then the conversion is a matter of automatic producing .po from xml. We
> see no difficulties with it either.
> 
> So I see no extra work for translators as a consequence of backporting XML
> format except for reusing some tools, which are needed anyway.

Yes, if you can do the translation automated, you are right.  However,
in talking to Peter E, he seemed to say there was a lot of manual work,
and obviously doing that only once for a translation is preferable.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: documentation is now XML

2018-06-20 Thread Alexander Lakhin
20.06.2018 15:03, Bruce Momjian wrote:
> On Wed, Jun 20, 2018 at 10:48:47AM +0300, Liudmila Mantrova wrote:
>> It's unfortunate that we'll have to deal with different formats in the
>> supported branches for several more years, but we at Postgres Professional
>> are ready to accept any decision you make on this matter for now.
> I had not thought of translations, but wouldn't all translations also
> need to switch from SGML to XML for back branches?  That would be a lot
> of work.

Hello Bruce,

I think there are two approaches here.

First, if translation is performed directly in sgml (say, we have
adminpack_fr.sgml), we'd have to convert the existing translation files
to XML anyway - otherwise we'll have to start translating version 11
from scratch. So some tools/scripts are needed to perform the
conversion. I don't think it would be difficult to use these tools to
convert existing translation for version 10 (and 9.6, 9.5...) to XML and
to continue working with it. And after the conversion is done we could
maintain all the translations in a consistent way, using a single tool.

Second, if translation is performed indirectly, via the PO (as we do,
e.g. see [1]), then the conversion is a matter of automatic producing
.po from xml. We see no difficulties with it either.

So I see no extra work for translators as a consequence of backporting
XML format except for reusing some tools, which are needed anyway.

[1]
https://github.com/alexanderlaw/pg-doc-ru/blob/master/postgresql-doc/adminpack.po


Best regards,
--
Alexander Lakhin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-20 Thread Ashutosh Bapat
On Tue, Jun 19, 2018 at 2:13 PM, Jeevan Chalke
 wrote:
>
>
> In the reported testcase, parallel_workers is set to 0 for all partition
> (child) relations. Which means partial parallel paths are not possible for
> child rels. However, the parent can have partial parallel paths. Thus, when
> we have a full partitionwise aggregate possible (as the group by clause
> matches with the partition key), we end-up in a situation where we do create
> a partially_grouped_rel for the parent but there won't be any
> partially_grouped_live_children.
>
> The current code was calling add_paths_to_append_rel() without making sure
> of any live children present or not (sorry, it was my fault). This causes an
> Assertion failure in add_paths_to_append_rel() as this function assumes that
> it will have non-NIL live_childrels list.
>

Thanks for the detailed explanation.

> I have fixed partitionwise aggregate code which is calling
> add_paths_to_append_rel() by checking the live children list correctly. And
> for robustness, I have also added an Assert() in add_paths_to_append_rel().
>
> I have verified the callers of add_paths_to_append_rel() and except one, all
> are calling it by making sure that they have a non-NIL live children list.

I actually thought about moving if(live_childrel != NIL) test inside
this function, but then every caller is doing something different when
that condition occurs. So doesn't help much.

> The one which is calling add_paths_to_append_rel() directly is
> set_append_rel_pathlist(). And I think, at that place, we will never have an
> empty live children list,

Yes, I think so too. That's because set_append_rel_size() should have
marked such a parent as dummy and subsequent set_rel_pathlist() would
not create any paths for it.

Here are some review comments.

+/* We should end-up here only when we have few live child rels. */

I think the right wording is " ... we have at least one child." I was actually
thinking if we should just return from here when live_children == NIL. But then
we will lose an opportunity to catch a bug earlier than set_cheapest().

+ * However, if there are no live children, then we cannot create any append
+ * path.

I think "no live children" is kind of misleading here. We usually use that term
to mean at least one non-dummy child. But in this case there is no child at
all, so we might want to better describe this situation. May be also explain
when this condition can happen.

+if (patype == PARTITIONWISE_AGGREGATE_FULL && grouped_live_children != NIL)

I think for this to happen, the parent grouped relation and the underlying scan
itself should be dummy. So, would an Assert be better? I think we have
discussed this earlier, but I can not spot the mail.


+-- Test when partition tables has parallel_workers = 0 but not the parent

Better be worded as "Test when parent can produce parallel paths but not any of
its children". parallel_worker = 0 is just a means to test that. Although the
EXPLAIN output below doesn't really reflect that the parent can have parallel
plans. Is it possible to create a scenario to show that?

I see that you have posted some newer versions of this patch, but I
think those still need to address these comments.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-20 Thread Jeevan Chalke
On Tue, Jun 19, 2018 at 7:14 PM, Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
> In the reported testcase, parallel_workers is set to 0 for all partition
>> (child) relations. Which means partial parallel paths are not possible for
>> child rels. However, the parent can have partial parallel paths. Thus, when
>> we have a full partitionwise aggregate possible (as the group by clause
>> matches with the partition key), we end-up in a situation where we do
>> create a partially_grouped_rel for the parent but there won't be any
>> partially_grouped_live_children.
>>
>> The current code was calling add_paths_to_append_rel() without making
>> sure of any live children present or not (sorry, it was my fault). This
>> causes an Assertion failure in add_paths_to_append_rel() as this function
>> assumes that it will have non-NIL live_childrels list.
>>
>> I have fixed partitionwise aggregate code which is calling
>> add_paths_to_append_rel() by checking the live children list correctly. And
>> for robustness, I have also added an Assert() in add_paths_to_append_rel().
>>
>> I have verified the callers of add_paths_to_append_rel() and except one,
>> all are calling it by making sure that they have a non-NIL live children
>> list. The one which is calling add_paths_to_append_rel() directly is
>> set_append_rel_pathlist(). And I think, at that place, we will never have
>> an empty live children list, I may be wrong though. And if that's the case
>> then newly added Assert() in add_paths_to_append_rel() will fail anyway to
>> catch any programming error in that code path.
>>
>> Attached patch fixing the crash and also added a simple test-coverage for
>> that.
>>
>> Let me know if I missed any.
>>
>
> Rajkumar offlist reported another issue related to data loss. If some of
> the partitions have parallel_workers = 0, but not all, then the PWA plan
> ended up containing only the children with parallel_workers != 0. So the
> partitions with parallel_workers = 0 were not scanned.
>
> Fixed this in attached version of the patch.
>

There were few commits in this area due to which patch is not cleanly
applying.

Attached rebased patch.

Thanks


>
>>
>> --
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 30aa2d1c24ed5222784935b68dd15ecbc5dc39be Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Wed, 20 Jun 2018 18:47:12 +0530
Subject: [PATCH] Make sure that we have live children before we append them.

In create_partitionwise_grouping_paths(), we were calling
add_paths_to_append_rel() with an empty live children list, which
was causing an Assertion failure inside it. Thus, prevent calling
add_paths_to_append_rel() when there are no live children.

In passing, add an Assert() in add_paths_to_append_rel() to
check that it receives a valid live children list.

Also, it is very well possible that we could not have a
partially_grouped_rel for some of the children for which aggregation
in partial is not feasible. In such cases, we will end up having
fewer children in the partially_grouped_live_children list. Don't
create append rel in that case.

Jeevan Chalke
---
 src/backend/optimizer/path/allpaths.c |  3 +
 src/backend/optimizer/plan/planner.c  | 25 +++-
 src/test/regress/expected/partition_aggregate.out | 73 +++
 src/test/regress/sql/partition_aggregate.sql  | 20 +++
 4 files changed, 118 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 3ada379..6f5ec5a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1391,6 +1391,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	bool		build_partitioned_rels = false;
 	double		partial_rows = -1;
 
+	/* We should end-up here only when we have few live child rels. */
+	Assert(live_childrels != NIL);
+
 	/* If appropriate, consider parallel append */
 	pa_subpaths_valid = enable_parallel_append && rel->consider_parallel;
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 67a2c7a..ab27ad0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7011,12 +7011,14 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 	int			cnt_parts;
 	List	   *grouped_live_children = NIL;
 	List	   *partially_grouped_live_children = NIL;
+	int			dummy_children_cnt;
 	PathTarget *target = grouped_rel->reltarget;
 
 	Assert(patype != PARTITIONWISE_AGGREGATE_NONE);
 	Assert(patype != PARTITIONWISE_AGGREGATE_PARTIAL ||
 		   partially_grouped_rel != NULL);
 
+	dummy_children_cnt = 0;
 	/* Add paths for partitionwise aggregation/grouping. */
 	for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
 	{
@@ -7075,6 +7077,8 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 		if (IS_DUMMY_REL(child_input_rel))
 		{
 			mark_dummy_rel(child_grouped_rel);
+			/* Count 

Re: Concurrency bug in UPDATE of partition-key

2018-06-20 Thread Amit Kapila
On Tue, Jun 19, 2018 at 9:33 PM, Amit Khandekar  wrote:
>
> Could not add RAISE statement, because the isolation test does not
> seem to print those statements in the output. Instead, I have dumped
> some rows in a new table through the trigger function, and did a
> select from that table. Attached is v3 patch.
>

Comments
-
1.
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ return false;

There is a possibility of memory leak due to above change.  Basically,
GetTupleForTrigger returns a newly allocated tuple.  We should free
the triggertuple by calling heap_freetuple(trigtuple).

2.
ExecBRDeleteTriggers(..)
{
..
+ /* If requested, pass back the concurrently updated tuple, if any. */
+ if (epqslot != NULL)
+ *epqslot = newSlot;
+
  if (trigtuple == NULL)
  return false;
+
+ /*
+ * If the tuple was concurrently updated and the caller of this
+ * function requested for the updated tuple, skip the trigger
+ * execution.
+ */
+ if (newSlot != NULL && epqslot != NULL)
+ return false;
..
}

Can't we merge the first change into second, something like:

if (newSlot != NULL && epqslot != NULL)
{
*epqslot = newSlot;
return false;
}

3.
ExecBRDeleteTriggers(..)
{
- TupleTableSlot *newSlot;
  int i;

  Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
  if (fdw_trigtuple == NULL)
  {
+ TupleTableSlot *newSlot;
+
..
}

This change is okay on its own, but I think it has nothing to do with
this patch. If you don't mind, can we remove it?

4.
+/* --
+ * ExecBRDeleteTriggers()
+ *
+ * Called to execute BEFORE ROW DELETE triggers.
+ *
+ * Returns false in following scenarios :
+ * 1. Trigger function returned NULL.
+ * 2. The tuple was concurrently deleted OR it was concurrently updated and the
+ * new tuple didn't pass EvalPlanQual() test.
+ * 3. The tuple was concurrently updated and the new tuple passed the
+ * EvalPlanQual() test, BUT epqslot parameter is not NULL. In such case, this
+ * function skips the trigger function execution, because the caller has
+ * indicated that it wants to further process the updated tuple. The updated
+ * tuple slot is passed back through epqslot.
+ *
+ * In all other cases, returns true.
+ * --
+ */
 bool
 ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,

The above comments appear unrelated to this function, example, this
function is not at all aware of concurrent updates.  I think if you
want to add comments to this function, we can add them as a separate
patch or if you want to add them as part of this patch at least make
them succinct.

5.
+ /*
+ * If this is part of an UPDATE of partition-key, the
+ * epq tuple will contain the changes from this
+ * transaction over and above the updates done by the
+ * other transaction. The caller should now use this
+ * tuple as its NEW tuple, rather than the earlier NEW
+ * tuple.
+ */

I think we can simplify this comment as "If requested, pass back the
updated tuple if any.", something similar to what you have at some
other place in the patch.

6.
+# Session A is moving a row into another partition, but is waiting for
+# another session B that is updating the original row. The row that ends up
+# in the new partition should contain the changes made by session B.

You have named sessions as s1 and s2, but description uses A and B, it
will be better to use s1 and s2 respectively.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Adding tests for inheritance trees with temporary tables

2018-06-20 Thread Ashutosh Bapat
On Wed, Jun 20, 2018 at 11:39 AM, Michael Paquier  wrote:

>
> Good point, I have added those.  Except that you meant
> tableoid::regclass.

Thanks. I have checked that make check passes with this patch. I have
marked this entry as ready for committer.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Fix some error handling for read() and errno

2018-06-20 Thread Magnus Hagander
On Mon, Jun 18, 2018 at 6:42 AM, Michael Paquier 
wrote:

> On Thu, Jun 14, 2018 at 09:50:33AM -0400, Robert Haas wrote:
> > On Mon, Jun 11, 2018 at 6:11 PM, Alvaro Herrera
> >  wrote:
> >> I would go as far as suggesting to remove qualifiers that indicate what
> >> the file is for (such as "relation mapping file"); relying on the path
> >> as indicator of what's going wrong should be sufficient, since it's an
> >> error that affects internals anyway, not anything that users can do much
> >> about.  Keep variations to a minimum, to ease translator's work;
> >> sometimes it's hard enough to come up with good translations for things
> >> like "relation mapping file" in the first place, and they don't help the
> >> end user.
> >
> > +1.  I think those things are often hard to phrase even in English.
> > It makes the messages long, awkward, and almost invariably the style
> > differs from one message to the next depending on the author and how
> > easy it is to describe the type of file involved.
>
> Okay, so this makes two votes in favor of keeping the messages simple
> without context, shaped like "could not read file %s", with Robert and
> Alvaro, and two votes for keeping some context with Andres and I.
> Anybody else has an opinion to offer?
>

Count me in with Robert and Alvaro with a +0.5 :)


> Please note that I think that some messages should keep some context
> anyway, for example the WAL-related information is useful.  An example
> is this one where the offset is good to know about:
> +   ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
> +   (errmsg("could not read from log segment %s, offset %u: read
> %d bytes, expected %d",
> +   fname, readOff, r, XLOG_BLCKSZ)));
>

Yeah, I think you'd need to make a call for the individual message to see
how much it helps. In this one the context definitely does, in some others
it doesn't.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pg_verify_checksums review

2018-06-20 Thread Magnus Hagander
On Tue, Jun 19, 2018 at 10:25 AM, Daniel Gustafsson  wrote:

> In looking over pg_verify_checksums I found a few small things that I think
> would improve on it:
>
> * pg_verify_checksums was placed in the Client Utils section in the docs.
> Since it requries physical access to the cluster datafiles it seems to
> belong
> in the Server Utils section.
>

Makes sense.


> * The -D option and the supported environment variable weren't documented.

> * Only -D is supported for specifying the data directory, but most all
> other
> utilities also support --pgdata on top of -D.  To present a consistent user
> interface we should probably support --pgdata in pg_verify_checksums as
> well.


> The latter is I assume too invasive as we are past the freeze, but the
> first
> two docs patches would make sense in 11 IMO as they document whats in the
> tree.
>
> The attached patches fixes the above mentioned things (I don’t have a docs
> toolchain working right now so the docs patches are best effort).
>

I believe both those are fine for 11, so I've pushed that. I kept it as a
separate patch to make it easy enough to revert it if people prefer that :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Excessive CPU usage in StandbyReleaseLocks()

2018-06-20 Thread David Rowley
On 19 June 2018 at 17:43, Thomas Munro  wrote:
> The problem is that StandbyReleaseLocks() does a linear search of all
> known AccessExclusiveLocks when a transaction ends.  Luckily, since
> v10 (commit 9b013dc2) that is skipped for transactions that haven't
> taken any AELs and aren't using 2PC, but that doesn't help all users.

Good to see this getting fixed.  My original patch [1] to fix this was
more along the lines of yours, only I partitioned the List into an
array indexed by the xid mod the array size.  I had done this as I
thought it would be faster than a hash table and would likely see the
locks spread evenly over the table.  IIRC Andres complained and said I
should use a hash table, which I see you've done.

> +   97.88%  96.96%  postgres  postgres  [.] StandbyReleaseLocks
>
> The attached patch splits the AEL list into one list per xid and
> sticks them in a hash table.  That makes perf say something like:
>
> +    0.60%   0.00%  postgres  postgres  [.] StandbyReleaseLocks

Yeah, I also saw similar with my patch.

> This seems like something we'd want to back-patch because the problem
> affects all branches (the older releases more severely because they
> lack the above-mentioned optimisation).
>
> Thoughts?

I do know this is causing quite a bit of trouble out in the wilds. I'd
be keen to not have to bear the brunt of any more of those troubles,
but it'll ultimately depend on how risky a patch looks.

[1] 
https://www.postgresql.org/message-id/CAKJS1f9vJ841HY%3DwonnLVbfkTWGYWdPN72VMxnArcGCjF3SywA%40mail.gmail.com
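The per-xid lock-list layout that both variants (array indexed by xid mod
array size, and hash table) rely on can be sketched outside PostgreSQL like
this.  This is a minimal, illustrative chained hash table -- names such as
XidLockEntry, NBUCKETS, record_lock and release_locks are invented for the
sketch, not PostgreSQL's dynahash API -- showing why releasing one
transaction's locks becomes a single bucket lookup instead of a linear scan
of every known AccessExclusiveLock:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 64             /* illustrative; the thread discusses 64 vs 4096 */

typedef struct LockNode {
    int lockid;                 /* stand-in for an AccessExclusiveLock */
    struct LockNode *next;
} LockNode;

typedef struct XidLockEntry {
    uint32_t xid;               /* hash key */
    LockNode *locks;            /* per-xid list of held locks */
    struct XidLockEntry *next;  /* chain for bucket collisions */
} XidLockEntry;

static XidLockEntry *buckets[NBUCKETS];

/* Find the entry for an xid, optionally creating it on first use. */
static XidLockEntry *lookup(uint32_t xid, int create)
{
    XidLockEntry **slot = &buckets[xid % NBUCKETS];
    for (XidLockEntry *e = *slot; e; e = e->next)
        if (e->xid == xid)
            return e;
    if (!create)
        return NULL;
    XidLockEntry *e = calloc(1, sizeof *e);
    e->xid = xid;
    e->next = *slot;
    *slot = e;
    return e;
}

/* Remember that this xid took a lock. */
void record_lock(uint32_t xid, int lockid)
{
    XidLockEntry *e = lookup(xid, 1);
    LockNode *n = malloc(sizeof *n);
    n->lockid = lockid;
    n->next = e->locks;
    e->locks = n;
}

/* Release all locks of one xid: touches only that xid's bucket entry,
 * never the lists of unrelated transactions.  Returns the count freed. */
int release_locks(uint32_t xid)
{
    for (XidLockEntry **pp = &buckets[xid % NBUCKETS]; *pp; pp = &(*pp)->next)
    {
        if ((*pp)->xid == xid)
        {
            XidLockEntry *e = *pp;
            int released = 0;
            for (LockNode *n = e->locks; n;)
            {
                LockNode *next = n->next;
                free(n);
                n = next;
                released++;
            }
            *pp = e->next;      /* unlink the entry itself */
            free(e);
            return released;
        }
    }
    return 0;
}
```

Under this layout the cost of ending a transaction is proportional to its
own lock count plus its bucket chain, rather than to the total number of
AELs in the system -- the same property the patch gets from a real hash
table keyed by xid.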

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: documentation is now XML

2018-06-20 Thread Bruce Momjian
On Wed, Jun 20, 2018 at 10:48:47AM +0300, Liudmila Mantrova wrote:
> To complete the picture of possible issues with older branches in XML, we
> posted a question in packager lists some time ago and didn't receive any
> objections. Just to keep record of all related questions in one place, here's
> the link:
> 
> https://www.postgresql.org/message-id/flat/06efc906-16f8-0cde-5bee-e3d5abfc00ba%40postgrespro.ru#06efc906-16f8-0cde-5bee-e3d5abfc0...@postgrespro.ru
> 
> We totally understand the reluctance to volunteer personal time for manual
> testing of legacy branches - sad as it is that we may miss some benefits
> of the previous efforts. E.g. all styles and transforms have been
> stabilized for version 11, so they could be reused for older branches. As
> for testing the content conversion, our scripts can handle it in a fully
> automated way by comparing the original and the final outputs via the .txt
> format, so all possible differences will be caught.
> 
> It's unfortunate that we'll have to deal with different formats in the
> supported branches for several more years, but we at Postgres Professional
> are ready to accept any decision you make on this matter for now.

I had not thought of translations, but wouldn't all translations also
need to switch from SGML to XML for back branches?  That would be a lot
of work.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Excessive CPU usage in StandbyReleaseLocks()

2018-06-20 Thread Thomas Munro
Adding Simon and David R.

On Wed, Jun 20, 2018 at 5:44 AM, Tom Lane  wrote:
> Hence, I have a modest proposal: use elog(LOG) followed by Assert(false),
> so that debug builds report such problems loudly but production builds
> do not.  We've done that before, see e.g. load_relcache_init_file around
> relcache.c:5718.

Done.  (This elog(LOG) isn't new BTW, I merely moved it.)

On Tue, Jun 19, 2018 at 6:30 PM, Andres Freund  wrote:
> On 2018-06-19 17:43:42 +1200, Thomas Munro wrote:
>> + RecoveryLockLists = hash_create("RecoveryLockLists",
>> + 4096,
>> + 
>> _ctl,
>
> Isn't that initial size needlessly big?

Changed to 64.

>> - RecoveryLockList = lappend(RecoveryLockList, newlock);
>> + entry->locks = lappend(entry->locks, newlock);
>
> Gotta be careful with lappend et al - it's really easy to leak memory
> without explicit context usage. Have you checked that we don't here?

The list and contents are explicitly freed when xids end and the locks
are released (because xid committed/aborted, or older than known
oldest running transaction, or we release all on promotion).  The
existing code seems to assume that TopMemoryContext is current (or
some context that'll last all the way until our promotion), which
seems OK to me.  TopMemoryContext is inherited from PostgresMain() and
any redo handler that changes it without restoring it would be buggy,
and errors are FATAL here.

I had missed one small thing though: I should call hash_destroy() in
ShutdownRecoveryTransactionEnvironment() to free the hash table itself
on promotion.  Fixed.

Why does StandbyReleaseLocks() handle an invalid xid by removing all
locks (if (!TransactionIdIsValid(xid) || lock->xid == xid)) in master?

V2 attached.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Move-RecoveryLockList-into-a-hash-table-v2.patch
Description: Binary data


Re: [HACKERS] GUC for cleanup indexes threshold.

2018-06-20 Thread Alexander Korotkov
On Wed, Jun 20, 2018 at 8:32 AM Masahiko Sawada  wrote:
> On Wed, Jun 20, 2018 at 1:00 AM, Alexander Korotkov
> > Ok.  I've rephrased comment a bit.  Also, you created "index vacuum"
> > subsection in the "resource usage" section.  I think it's not
> > appropriate for this option to be in "resource usage".  Ideally it
> > should be grouped with other vacuum options, but we don't have single
> > section for that.  "Autovacuum" section is also not appropriate,
> > because this guc works not only for autovacuum.  So, most semantically
> > close options, which affects vacuum in general, are
> > vacuum_freeze_min_age, vacuum_freeze_table_age,
> > vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age,
> > which are located in "client connection defaults" section.  So, I
> > decided to move this GUC into this section.  I also change the section
> > in GUC definition (guc.c) from AUTOVACUUM to CLIENT_CONN_STATEMENT.
>
> Agreed. So should we move it to 19.11 Client Connection Defaults in the
> docs as well? And I think it's better if this option is placed next to
> the other vacuum options to make it easier to find. The attached patch
> changes them accordingly. Please review it.

Right, thank you.  Looks good to me.
I'm going to commit this if no objections.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-20 Thread Amit Khandekar
On 20 June 2018 at 14:24, Amit Kapila  wrote:
> On Mon, Jun 18, 2018 at 3:11 PM, Amit Khandekar  
> wrote:
>> On 16 June 2018 at 10:44, Amit Kapila  wrote:
>>> On Thu, Jun 14, 2018 at 10:05 PM, Tom Lane  wrote:

 It looks to me like traversal of the partial subpaths is the right
 thing here, in which case we should do

 -   foreach(l, subpaths)
 +   foreach(l, pathnode->subpaths)

 or perhaps better

 -   pathnode->subpaths = list_concat(subpaths, partial_subpaths);
 +   pathnode->subpaths = subpaths = list_concat(subpaths, 
 partial_subpaths);

 to make the behavior clear and consistent.

>>>
>>> I agree with your analysis and proposed change.  However, I think in
>>> practice, it might not lead to any bug as in the loop, we are
>>> computing parallel_safety and partial_subpaths should be
>>> parallel_safe.
>>
>> Will have a look at this soon.
>>
>
> Did you get a chance to look at it?

Not yet, but I have planned to do this by tomorrow.

> I have committed the patch which
> fixes the problem reported in this thread, so I am inclined to close
> the corresponding entry in Open Items list, but I am afraid that we
> will lose track of this suggestion if I close it.

Yes I agree.




-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-20 Thread Amit Kapila
On Mon, Jun 18, 2018 at 3:11 PM, Amit Khandekar  wrote:
> On 16 June 2018 at 10:44, Amit Kapila  wrote:
>> On Thu, Jun 14, 2018 at 10:05 PM, Tom Lane  wrote:
>>>
>>> It looks to me like traversal of the partial subpaths is the right
>>> thing here, in which case we should do
>>>
>>> -   foreach(l, subpaths)
>>> +   foreach(l, pathnode->subpaths)
>>>
>>> or perhaps better
>>>
>>> -   pathnode->subpaths = list_concat(subpaths, partial_subpaths);
>>> +   pathnode->subpaths = subpaths = list_concat(subpaths, 
>>> partial_subpaths);
>>>
>>> to make the behavior clear and consistent.
>>>
>>
>> I agree with your analysis and proposed change.  However, I think in
>> practice, it might not lead to any bug as in the loop, we are
>> computing parallel_safety and partial_subpaths should be
>> parallel_safe.
>
> Will have a look at this soon.
>

Did you get a chance to look at it?  I have committed the patch which
fixes the problem reported in this thread, so I am inclined to close
the corresponding entry in Open Items list, but I am afraid that we
will lose track of this suggestion if I close it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: server crashed with TRAP: FailedAssertion("!(!parallel_aware || pathnode->path.parallel_safe)"

2018-06-20 Thread Amit Kapila
On Tue, Jun 19, 2018 at 7:45 PM, Robert Haas  wrote:
> On Mon, Jun 18, 2018 at 11:36 PM, Amit Kapila  wrote:
>> Fixed in the attached patch.  I will wait for a day or two to see if
>> Tom or Robert wants to say something and then commit.
>
> The patch LGTM but the commit message could perhaps use a little
> word-smithing, e.g. "Commit ab72716778 allowed Parallel Append paths
> to be generated for a relation that is not parallel safe.  Prevent
> that from happening."
>

Changed as per your suggestion and pushed!

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: ToDo: show size of partitioned table

2018-06-20 Thread Pavel Stehule
2018-06-20 10:03 GMT+02:00 Amit Langote :

> On 2018/06/20 16:50, Pavel Stehule wrote:
> > 2018-06-20 9:44 GMT+02:00 Amit Langote :
> >> Do you mean \dt continues to show size 0 for partitioned tables, but
> >> with the new option (\dtP+) shows the actual size by aggregating across
> >> partitions?  +1 to such a feature, but we need to agree on an acceptable
> >> implementation for that.  How does the aggregation happen:
> >>
> >
> > yes - my proposal is no change for \dt for now. I think we will have to
> > change it when partitioning is more common and the number of partitions
> > is high. But that is not the case today.
> >
> > \dtP shows only partitioned tables (like \dtS shows only system tables),
> > and with "+" shows the sum of all related partitions.
>
> Ah, okay.  That makes sense.
>
> >> 1. In a new dedicated function in the backend (parallel to
> >> pg_table_size)?
> >>
> >> or
> >>
> >> 2. psql issues a separate query to compute the total size of a partition
> >>    tree
> >>
> >
> > At the moment we can simply do the sum on the client side, so it is
> > related to @2.
>
> I see, okay.
>
> >> For option 2, I had posted a patch that simplifies writing such a query
> >> and posted that here:
> >>
> >> https://www.postgresql.org/message-id/7a9c5328-5328-52a3-2a3d-bf1434b4dd1d%40lab.ntt.co.jp
> >>
> >> With that patch, the query to get the total size of a partition tree
> >> becomes as simple as:
> >>
> >> select sum(pg_table_size(p)) as size
> >> from   pg_get_inheritance_tables('partitioned_table_name') p
> >>
> >
> > Good to know, thank you. Do you think your patch should be included in
> > this feature, or will it be processed independently?
>
> It seems that it would be useful on its own, as people may want to do
> various things once we provide them pg_get_inheritance_tables.
>

ok

I'll prepare a patch and note its dependency on your patch.

Regards

Pavel


> Thanks,
> Amit
>
>


Re: ToDo: show size of partitioned table

2018-06-20 Thread Amit Langote
On 2018/06/20 16:50, Pavel Stehule wrote:
> 2018-06-20 9:44 GMT+02:00 Amit Langote :
>> Do you mean \dt continues to show size 0 for partitioned tables, but with
>> the new option (\dtP+) shows the actual size by aggregating across
>> partitions?  +1 to such a feature, but we need to agree on an acceptable
>> implementation for that.  How does the aggregation happen:
>>
> 
> yes - my proposal is no change for \dt for now. I think we will have to
> change it when partitioning is more common and the number of partitions
> is high. But that is not the case today.
> 
> \dtP shows only partitioned tables (like \dtS shows only system tables),
> and with "+" shows the sum of all related partitions.

Ah, okay.  That makes sense.

>> 1. In a new dedicated function in the backend (parallel to pg_table_size)?
>>
>> or
>>
>> 2. psql issues a separate query to compute the total size of a partition
>>tree
>>
> 
> At the moment we can simply do the sum on the client side, so it is related to @2.

I see, okay.

>> For option 2, I had posted a patch that simplifies writing such a query
>> and posted that here:
>>
>> https://www.postgresql.org/message-id/7a9c5328-5328-52a3-2a3d-bf1434b4dd1d%40lab.ntt.co.jp
>>
>> With that patch, the query to get the total size of a partition tree
>> becomes as simple as:
>>
>> select sum(pg_table_size(p)) as size
>> from   pg_get_inheritance_tables('partitioned_table_name') p
>>
> 
> Good to know, thank you. Do you think your patch should be included in
> this feature, or will it be processed independently?

It seems that it would be useful on its own, as people may want to do
various things once we provide them pg_get_inheritance_tables.

Thanks,
Amit




Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-06-20 Thread Sergei Kornilov
Hello

>   key.userid = GetUserId();
We can remove a query id only for the current user, right? Maybe it would be
better to provide an optional argument for userid? Typically the user with
pg_read_all_stats and the user running application queries are different users.
At least it should be documented.

regards, Sergei



Re: Possible bug in logical replication.

2018-06-20 Thread Arseny Sher

Michael Paquier  writes:

> On Mon, Jun 18, 2018 at 09:42:36PM +0900, Michael Paquier wrote:
>> On Fri, Jun 15, 2018 at 06:27:56PM +0300, Arseny Sher wrote:
>> It seems to me that we still want to have the slot forwarding finish in
>> this case even if this is interrupted.  Petr, isn't that the intention
>> here?
>
> I have been chewing a bit more on the proposed patch, finishing with the
> attached to close the loop.  Thoughts?

Sorry for being pedantic, but it seems to me worthwhile to mention *why*
we need decoding machinery at all -- like I wrote:

+ * While we could just do LogicalConfirmReceivedLocation updating
+ * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn
+ * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog tuples).

Also,

>   * The slot's restart_lsn is used as start point for reading records,

This is clearly seen from the code; I propose to remove it.

>   * while confirmed_lsn is used as base point for the decoding context.

And as I wrote, this doesn't matter as changes are not produced.

>   * The LSN position to move to is checked by doing a per-record scan and
>   * logical decoding which makes sure that confirmed_lsn is updated to a
>   * LSN which allows the future slot consumer to get consistent logical
> - * changes.
> + * changes.  As decoding is done with fast_forward mode, no changes are
> + * actually generated.

confirmed_lsn is always updated to `moveto` unless we run out of WAL
earlier (and unless we try to move slot backwards, which is obviously
forbidden) -- consistent changes are practically irrelevant
here. Moreover, we can directly set confirmed_lsn and still have
consistent changes further as restart_lsn and xmin of the slot are not
touched. What we actually do here is trying to advance *restart_lsn and
xmin* as far as we can but up to the point which ensures that decoding
can assemble a consistent snapshot allowing to fully decode all COMMITs
since updated `confirmed_flush_lsn`. All this happens in
SnapBuildProcessRunningXacts.

> It seems to me that we still want to have the slot forwarding finish in
> this case even if this is interrupted.  Petr, isn't that the intention
> here?

Probably, though I am not sure what the point of this is. OK, I have kept
this check in the updated patch (with your comments addressed) and am
CC'ing Petr.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61588d626f..76bafca41c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -335,11 +335,14 @@ CreateInitDecodingContext(char *plugin,
  *		The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
  *		from the slot's confirmed_flush; otherwise, start from the specified
  *		location (but move it forwards to confirmed_flush if it's older than
- *		that, see below).
+ *		that, see below). Doesn't matter in fast_forward mode.
  *
  * output_plugin_options
  *		contains options passed to the output plugin.
  *
+ * fast_forward
+ *		bypasses the generation of logical changes.
+ *
  * read_page, prepare_write, do_write, update_progress
  *		callbacks that have to be filled to perform the use-case dependent,
  *		actual work.
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 2806e1076c..0a4985ef8c 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -341,12 +341,10 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 
 /*
  * Helper function for advancing logical replication slot forward.
- * The slot's restart_lsn is used as start point for reading records,
- * while confirmed_lsn is used as base point for the decoding context.
- * The LSN position to move to is checked by doing a per-record scan and
- * logical decoding which makes sure that confirmed_lsn is updated to a
- * LSN which allows the future slot consumer to get consistent logical
- * changes.
+ * While we could just do LogicalConfirmReceivedLocation updating
+ * confirmed_flush_lsn, we'd better digest WAL to advance restart_lsn
+ * (allowing to recycle WAL) and xmin (allowing to vacuum old catalog tuples).
+ * We do it in special fast_forward mode without actual replay.
  */
 static XLogRecPtr
 pg_logical_replication_slot_advance(XLogRecPtr moveto)
@@ -358,7 +356,6 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 
 	PG_TRY();
 	{
-		/* restart at slot's confirmed_flush */
 		ctx = CreateDecodingContext(InvalidXLogRecPtr,
 	NIL,
 	true,
@@ -388,10 +385,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 			 */
 			startlsn = InvalidXLogRecPtr;
 
-			/*
-			 * The {begin_txn,change,commit_txn}_wrapper callbacks above will
-			 * store the description into our tuplestore.
-			 */
+			/* Changes are not actually produced in fast_forward mode. */
 			if (record != NULL)
 

ERROR: ORDER/GROUP BY expression not found in targetlist

2018-06-20 Thread Rajkumar Raghuwanshi
Hi,

The test case below fails with ERROR: ORDER/GROUP BY expression not found
in targetlist. This issue is reproducible from PostgreSQL 10 onwards.

postgres=# CREATE TABLE test(c1 int, c2 text, c3 text);
CREATE TABLE
postgres=# INSERT INTO test SELECT i % 10, i % 250, to_char(i % 4,
'FM000') FROM GENERATE_SERIES(1,10,1)i;
INSERT 0 10
postgres=# ANALYZE test;
ANALYZE
postgres=# EXPLAIN (verbose) SELECT c1, generate_series(1,1), count(*) FROM
test GROUP BY 1 HAVING count(*) > 1;
                              QUERY PLAN
------------------------------------------------------------------------
 ProjectSet  (cost=2291.00..2306.15 rows=3000 width=16)
   Output: c1, generate_series(1, 1), (count(*))
   ->  HashAggregate  (cost=2291.00..2291.12 rows=3 width=12)
 Output: c1, count(*)
 Group Key: test.c1
 Filter: (count(*) > 1)
         ->  Seq Scan on public.test  (cost=0.00..1541.00 rows=10 width=4)
   Output: c1, c2, c3
(8 rows)

postgres=# SET min_parallel_table_scan_size=0;
SET
postgres=# EXPLAIN (verbose) SELECT c1, generate_series(1,1), count(*) FROM
test GROUP BY 1 HAVING count(*) > 1;
ERROR:  ORDER/GROUP BY expression not found in targetlist

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

