Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2017-11-24 Thread Amit Kapila
On Thu, Nov 23, 2017 at 5:18 PM, amul sul  wrote:
> On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas  wrote:
>> On Wed, Sep 27, 2017 at 7:07 AM, amul sul  wrote:
>>> Attaching POC patch that throws an error in the case of a concurrent update
>>> to an already deleted tuple due to UPDATE of partition key[1].
>>>
>>> In a normal update new tuple is linked to the old one via ctid forming
>>> a chain of tuple versions but UPDATE of partition key[1] move tuple
>>> from one partition to an another partition table which breaks that chain.
>>
>> This patch needs a rebase.  It has one whitespace-only hunk that
>> should possibly be excluded.
>>
> Thanks for looking at the patch.
>
>> I think the basic idea of this is sound.  Either you or Amit need to
>> document the behavior in the user-facing documentation, and it needs
>> tests that hit every single one of the new error checks you've added
>> (obviously, the tests will only work in combination with Amit's
>> patch).  The isolation should be sufficient to write such tests.
>>
>> It needs some more extensive comments as well.  The fact that we're
>> assigning a meaning to ip_blkid -> InvalidBlockNumber is a big deal,
>> and should at least be documented in itemptr.h in the comments for the
>> ItemPointerData structure.
>>
>> I suspect ExecOnConflictUpdate needs an update for the
>> HeapTupleUpdated case similar to what you've done elsewhere.
>>
>
> UPDATE of partition key v25[1] has documented this concurrent scenario,
> I need to rework on that document paragraph to include this behaviour, will
> discuss with Amit.
>
> Attached 0001 patch includes error check for 8 functions, out of 8 I am able
> to build isolation test for 4 functions -- ExecUpdate,ExecDelete,
> GetTupleForTrigger & ExecLockRows.
>

Few comments:

1.
@@ -1480,6 +1493,10 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
  ereport(ERROR,
  (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
  errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another
partition due to concurrent update")));

Why do you think we need this check in the OnConflictUpdate path?  I
think we don't need it here because we are going to relinquish this
version of the tuple and will start again, and might fetch some other
row version.  Also, we don't support Insert .. On Conflict Update with
partitioned tables, see [1], which is also an indication that at the
very least we don't need it now.

2.
@@ -2709,6 +2709,10 @@ EvalPlanQualFetch(EState *estate, Relation
relation, int lockmode,
  ereport(ERROR,
  (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
  errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another
partition due to concurrent update")));

..
..
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
  ereport(ERROR,
  (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
  errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to an another partition
due to concurrent update")));
+

At some places after heap_lock_tuple the error message says "tuple to
be updated .." and other places it says "tuple to be locked ..".  Can
we use the same message consistently?  I think it would be better to
use the second one.
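
To illustrate the scenario these messages report, here is a rough sketch
(not taken from the patch; the table and partition names are made up) of
the interleaving that should hit one of these checks once both the
UPDATE-of-partition-key patch and this patch are applied:

-- a list-partitioned table with two partitions
CREATE TABLE tab (key int, val text) PARTITION BY LIST (key);
CREATE TABLE tab_p1 PARTITION OF tab FOR VALUES IN (1);
CREATE TABLE tab_p2 PARTITION OF tab FOR VALUES IN (2);
INSERT INTO tab VALUES (1, 'a');

-- session 1: move the row to the other partition, but do not commit yet
BEGIN;
UPDATE tab SET key = 2 WHERE key = 1;

-- session 2: blocks on the row lock held by session 1
UPDATE tab SET val = 'b' WHERE key = 1;

-- when session 1 commits, session 2 follows the update chain, finds the
-- row has left the partition, and (with this patch) should fail with
-- something like "tuple to be locked was already moved to another
-- partition due to concurrent update" instead of silently doing nothing
COMMIT;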

3.
}
+
  /* tuple already deleted; nothing to do */
  return NULL;

Spurious whitespace.

4.  There is no need to use *POC* in the name of the patch.  I think
this is no longer a POC patch.

[1] - 
https://www.postgresql.org/message-id/7ff1e8ec-dc39-96b1-7f47-ff5965dceeac%40lab.ntt.co.jp

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Custom compression methods

2017-11-24 Thread Tomas Vondra
Hi,

I ran into another issue - after inserting some data into a table with a
tsvector column (without any compression defined), I can no longer read
the data.

This is what I get in the console:

db=# select max(md5(body_tsvector::text)) from messages;
ERROR:  cache lookup failed for compression options 6432

and the stack trace looks like this:

Breakpoint 1, get_cached_compression_options (cmoptoid=6432) at
tuptoaster.c:2563
2563        elog(ERROR, "cache lookup failed for compression options %u", cmoptoid);
(gdb) bt
#0  get_cached_compression_options (cmoptoid=6432) at tuptoaster.c:2563
#1  0x004bf3da in toast_decompress_datum (attr=0x2b44148) at
tuptoaster.c:2390
#2  0x004c0c1e in heap_tuple_untoast_attr (attr=0x2b44148) at
tuptoaster.c:225
#3  0x0083f976 in pg_detoast_datum (datum=) at
fmgr.c:1829
#4  0x008072de in tsvectorout (fcinfo=0x2b41e00) at tsvector.c:315
#5  0x005fae00 in ExecInterpExpr (state=0x2b414b8,
econtext=0x2b25ab0, isnull=) at execExprInterp.c:1131
#6  0x0060bdf4 in ExecEvalExprSwitchContext
(isNull=0x7fe9bd37 "", econtext=0x2b25ab0, state=0x2b414b8) at
../../../src/include/executor/executor.h:299

It seems the VARATT_IS_CUSTOM_COMPRESSED incorrectly identifies the
value as custom-compressed for some reason.

Not sure why, but the tsvector column is populated by a trigger that
simply does

NEW.body_tsvector
:= to_tsvector('english', strip_replies(NEW.body_plain));

If needed, the complete tool is here:

https://bitbucket.org/tvondra/archie


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted

2017-11-24 Thread Peter Geoghegan
On Fri, Nov 24, 2017 at 6:09 PM, Jeff Janes  wrote:
>> We see this for the entry tree (no deletion is possible in the first
>> place), and we also see it for the posting tree (the dance with
>> inserters having a pin on the root, and so on). Not mentioning why
>> pending list recycling is safe in terms of locking protocols seems
>> like an omission that should be addressed. I'm no GIN expert, but I
>> would expect this because the pending list seems like a kind of
>> extension of the posting tree.
>
> The pending list follows the same rule as the right-links on the posting
> trees:
>
> "To prevent a backend from reaching a deleted page via a right-link, when
> following a right-link the lock on the previous page is not released until
> the lock on next page has been acquired."
>
> This lock-chaining rule is also followed when stepping from the meta page to
> the first page in the pending list.
>
> (Yes, neither of these are mentioned in the README).

But, as I said, we lock-chain with shared buffer locks in
ginInsertCleanup(). Not exclusive buffer locks. (We also take the HW
lock on the metapage so that ginInsertCleanup() callers block each
other out).

With B-Tree index structures that don't use Lehman & Yao's technique,
in general, we must do lock coupling. There is, in general, no point
in lock-chaining at all if there is no exclusive buffer locker
(writer) involved at some point. Writers must lock with exclusive
buffer locks from the beginning of the chain (e.g. root) for
traditional coupling to be correct. However, the only exclusive buffer
lock in ginInsertCleanup() is taken on the metapage at quite a late
point in cleanup, just before insertion of pending items is performed.
Crucially, this comes after we've decided what the head and tail of
the pending list are.

The metapage HW lock is sufficient for writers to block writers, but
AFAICT it is not sufficient for writers to block readers.

> Since pending lists don't have downlinks, the dance with the super-exclusive
> locks is not needed.

Isn't the pending list's head block number in the metapage very much
like a downlink? I think I would be convinced by what you say here if
we held an exclusive buffer lock on the metapage from the beginning
within ginInsertCleanup() (we'd then remove the LockPage() HW lock
calls entirely). We don't, though. I'm not proposing that we actually
change the code in that way. My point is that that looks like it would
make the code correct from the angle I'm looking at it, which may make
my thinking clearer to you.

> I think what is true for deleted pages should also be sufficient for
> deleted-and-freed-and-recycled pages.  Since you are only allowed to follow
> a link from a locked buffer and not from private memory, you can't follow a
> stale link as long as the links are updated before the page is deleted and
> then recycled.
>
> Is this convincing?

In general, yes, but the devil is in the details. I very much like the
"pending list is analogous to posting list" framing. I just doubt that
we get all the details right in practice.

> Also missing is an argument for the deadlock-freeness.  I think what is
> needed for that is that you never lock a page earlier in the pending list,
> nor the metapage, while holding a lock on a pending list page.  (Which as
> far as I can tell is true, but I don't know how to exhaustively verify it)

Defensively checking the page type (pending, posting, etc) within
scanPendingInsert() would be a good start. That's already something
that we do for posting trees. If we're following the same rules as
posting trees (which sounds like a good idea), then we should have the
same defenses. Same applies to using elogs within ginInsertCleanup()
-- we should promote those Assert()s to elog()s.

I also suggest that you add a random, artificial sleep within
ginInsertCleanup(), like this:

/*
 * Read and lock head of pending list
 */
blkno = metadata->head;
// Random sleep goes here
buffer = ReadBuffer(index, blkno);
LockBuffer(buffer, GIN_SHARE);
page = BufferGetPage(buffer);

If there is a race here, then this should make it more likely to hit.

I should remind you that deadlocks might be caused by recycling races.
While we should explain why pending list cleaning is deadlock-free, as
you suggest, reports of deadlocks might not be due to a flaw in
locking as such.

>> As I've said, it feels slightly off to me that a user backend that is
>> performing opportunistic cleanup during insertion can call
>> RecordFreeIndexPage() in GIN.
>
>
> I don't know how to address that, because I don't know why it seems off.  No
> one else does this because no one else generates large number of free pages
> during insertions.

My point is that it seems like having little distinction between
deletion and recycling is problematic. If you don't have very strong
locking to block out all possible index scans (locking at some central
choke point), then you risk concurrent recycling races. I am wearing
my nbtree hat in a

Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted

2017-11-24 Thread Jeff Janes
On Thu, Nov 16, 2017 at 2:43 PM, Peter Geoghegan  wrote:

> On Thu, Nov 16, 2017 at 2:16 PM, Jeff Janes  wrote:
> > The only reference to super-exclusive lock in
> src/backend/access/gin/README,
> > that I can find, is about posting trees, not pending lists.  Can you
> quote
> > or give line numbers of the section you are referring to?
>
> That's the section I was referring to. I apologize for being unclear.
>
> I would like to see a *general* explanation of why it's okay that the
> pending list can have its pages recycled early. How can we be sure
> that somebody looking through the pending list won't land on a
> just-recycled page and somehow get confused?
>
> We see this for the entry tree (no deletion is possible in the first
> place), and we also see it for the posting tree (the dance with
> inserters having a pin on the root, and so on). Not mentioning why
> pending list recycling is safe in terms of locking protocols seems
> like an omission that should be addressed. I'm no GIN expert, but I
> would expect this because the pending list seems like a kind of
> extension of the posting tree.
>

The pending list follows the same rule as the right-links on the posting
trees:

"To prevent a backend from reaching a deleted page via a right-link, when
following a right-link the lock on the previous page is not released until
the lock on next page has been acquired."

This lock-chaining rule is also followed when stepping from the meta page
to the first page in the pending list.

(Yes, neither of these are mentioned in the README).

Since pending lists don't have downlinks, the dance with the
super-exclusive locks is not needed.

I think what is true for deleted pages should also be sufficient for
deleted-and-freed-and-recycled pages.  Since you are only allowed to follow
a link from a locked buffer and not from private memory, you can't follow a
stale link as long as the links are updated before the page is deleted and
then recycled.

Is this convincing?

Also missing is an argument for the deadlock-freeness.  I think what is
needed for that is that you never lock a page earlier in the pending list,
nor the metapage, while holding a lock on a pending list page.  (Which as
far as I can tell is true, but I don't know how to exhaustively verify it)

> As I've said, it feels slightly off to me that a user backend that is
> performing opportunistic cleanup during insertion can call
> RecordFreeIndexPage() in GIN.
>

I don't know how to address that, because I don't know why it seems off.
No one else does this because no one else generates large numbers of free
pages during insertions.  If we ever implement something like fractal
indexes for BTrees, then we would need to do this from the BTree insertion
(or clean-up-from-insertion) code as well.  There are no warnings in the .h
or .c for RecordFreeIndexPage() about where one is allowed to call it
from.   I don't see any hazards of this beyond those associated with
deleting the page in the first place.  Any deleted page unpinned can be
recycled at any time by a vacuum that comes along while you are swapped
out, no?

Cheers,

Jeff


Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-11-24 Thread Amit Kapila
On Fri, Nov 24, 2017 at 4:51 PM, hubert depesz lubaczewski
 wrote:
> Hi,
>
> up to parallel executions, when we had node in explain analyze showing
> "loops=x" with x more than 1, it meant that the "actual time" had to be
> multiplied by loops to get real time spent in a node.
>
> For example, check step 13 in https://explain.depesz.com/s/gNBd
>
> It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
>
> But with parallel execution it seems to be no longer the case.
>
> For example:
> https://explain.depesz.com/s/LTMp
> or
> https://explain.depesz.com/s/QHRi
>
> It looks that the actual time is really actual time, and loops is
> "worker nodes + 1".
>
> Is that really the case?
>

I think so.

> Should I, for explain.depesz.com, when dealing
> with partial* and parallel* nodes, use "loops=1" for calculation of
> exclusive/inclusive time? always? some other nodes?
>

I am not sure what exactly inclusive or exclusive means here, but for
parallel nodes the stats are accumulated across the leader and workers,
so you are seeing loops as 'worker nodes + 1'.  Now, since the workers
presumably run in parallel, I think the actual time shown in the node is
already the real elapsed time, not actual time * nloops.  For example, a
parallel node showing an actual time of 1000 ms with loops = 5 (leader
plus four workers) spent roughly 1000 ms of wall-clock time, not 5000 ms.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] ginInsertCleanup called from vacuum could still miss tuples to be deleted

2017-11-24 Thread Jeff Janes
On Thu, Nov 16, 2017 at 12:29 PM, Robert Haas  wrote:

> On Thu, Nov 16, 2017 at 7:08 AM, Masahiko Sawada 
> wrote:
> > Agreed, that's better. Attached updated patch.
> > Also I've added this to the next CF so as not to forget.
>
> Committed and back-patched.  While I'm fairly sure this is a correct
> fix, post-commit review from someone who knows GIN better than I do
> would be a great idea.
>
> I am also not clear on what the consequences of this bug are.  It seems
> like it could harm insert performance by making us wait when we
> shouldn't, but can it cause corruption?  That I'm not sure about.  If
> there's already a cleanup of the pending list in progress, how do
> things go wrong?
>
>

It can cause corruption.  Attached is a test case to show it.  It is based
on my usual crash-recovery test, but to get corruption I had to turn off
the crash-injection (which means the test can be run on an unpatched
server) and increase gin_pending_list_limit.

On 8 CPU, it took anywhere from 15 minutes to 8 hours to see corruption,
which presents as something like this in the script output:

child abnormal exit update did not update 1 row: key 8117 updated 2 at
count.pl line 193.\n  at count.pl line 201.

Your commit has apparently fixed it, but I will run some more extensive
tests.

The source of the corruption is this:

A tuple is inserted into the pending list.  It is quickly updated again, or
deleted, and then passes the dead-to-all horizon, so is eligible to vacuum.

A user back end takes the lock to start cleaning the pending list.

A vacuum back end attempts the lock, but due to the bug, it does so
conditionally and then decides it is done when that fails, and starts the
index vacuum proper.

The user back end reaches the dead tuple in the pending list, and inserts
into the main part of the index, in a part that the vacuum has already
passed over.

The vacuum goes back to the table and marks the tid slot as re-usable, even
though there is still an index tuple pointing to it.

Cheers,

Jeff


count.pl
Description: Binary data


do.sh
Description: Bourne shell script


Re: [HACKERS] Custom compression methods

2017-11-24 Thread Tomas Vondra
Hi,

On 11/24/2017 10:38 AM, Ildus Kurbangaliev wrote:
> ...
> That means compressed datums now in the column with storage
> specified as external. I'm not sure that's a bug or a feature.
>

Interesting. Never realized it behaves like this. Not sure if it's
intentional or not (i.e. bug vs. feature).

> Lets insert them usual way:
> 
>   delete from t2;
>   insert into t2 select repeat(md5(i::text),300) from
> generate_series(1,10) s(i);
>   \d+
> 
>  List of relations
>Schema | Name | Type  | Owner |  Size   | Description 
>   +--+---+---+-+-
>public | t1   | table | ildus | 18 MB   | 
>public | t2   | table | ildus | 1011 MB | 
> 
> Maybe there should be more common solution like comparison of
> attribute properties?
> 

Maybe, not sure what the right solution is. I just know that if we allow
inserting data into arbitrary tables without recompression, we may end
up with data that can't be decompressed.

I agree that the behavior with extended storage is somewhat similar, but
the important distinction is that, while that is surprising, the data is
still accessible.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-24 Thread Craig Ringer
On 24 November 2017 at 22:11, Oliver Ford  wrote:

> Adds RANGE BETWEEN with a start and end value, as well as an
> exclusions clause, to the window functions. This partially resolves
> TODO list item "Implement full support for window framing clauses".
>

Yay!

I'll try to take a look at this.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] proposal - Default namespaces for XPath expressions (PostgreSQL 11)

2017-11-24 Thread Pavel Stehule
2017-11-24 18:13 GMT+01:00 Pavel Stehule :

>
>
> 2017-11-24 17:53 GMT+01:00 Pavel Stehule :
>
>> Hi
>>
>> 2017-11-22 22:49 GMT+01:00 Thomas Munro :
>>
>>> On Thu, Nov 9, 2017 at 10:11 PM, Pavel Stehule 
>>> wrote:
>>> > Attached new version.
>>>
>>> Hi Pavel,
>>>
>>> FYI my patch testing robot says[1]:
>>>
>>>  xml  ... FAILED
>>>
>>> regression.diffs says:
>>>
>>> + SELECT x.* FROM t1, xmltable(XMLNAMESPACES(DEFAULT 'http://x.y'),
>>> '/rows/row' PASSING t1.doc COLUMNS data int PATH
>>> 'child::a[1][attribute::hoge="haha"]') as x;
>>> + data
>>> + --
>>> + (0 rows)
>>> +
>>>
>>> Maybe you forgot to git-add the expected file?
>>>
>>> [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/305979133
>>
>>
>> unfortunately xml.out has 3 versions, and it is possible that a
>> different one is selected elsewhere than on my computer.
>>
>> could you please send me your resulting xml.out file?
>>
>
> looks like this case is without xml support, so I can fix it on my computer.
>
>
fixed regress test



>
>> Regards
>>
>> Pavel
>>
>>
>>>
>>> --
>>> Thomas Munro
>>> http://www.enterprisedb.com
>>>
>>
>>
>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4dd9d029e6..b871e82a73 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10489,7 +10489,8 @@ SELECT xml_is_well_formed_document('http://postgresql.org/stuf
  second the namespace URI. It is not required that aliases provided in
  this array be the same as those being used in the XML document itself (in
  other words, both in the XML document and in the xpath
- function context, aliases are local).
+ function context, aliases are local). A default namespace has an
+ empty name (an empty string), and only one may be specified.
 
 
 
@@ -10505,11 +10506,20 @@ SELECT xpath('/my:a/text()', 'http://example.com";>test',
 ]]>
 
 
+
+ Inside a predicate, the names and, or,
+ div and mod are always treated as keywords
+ (XPath operators), and the default namespace is not applied to them.
+ If you want to use these names as tag names, then the default namespace
+ should not be used, and these names should be explicitly
+ qualified.
+
+
 
  To deal with default (anonymous) namespaces, do something like this:
 

Re: [HACKERS] proposal - Default namespaces for XPath expressions (PostgreSQL 11)

2017-11-24 Thread Pavel Stehule
2017-11-24 17:53 GMT+01:00 Pavel Stehule :

> Hi
>
> 2017-11-22 22:49 GMT+01:00 Thomas Munro :
>
>> On Thu, Nov 9, 2017 at 10:11 PM, Pavel Stehule 
>> wrote:
>> > Attached new version.
>>
>> Hi Pavel,
>>
>> FYI my patch testing robot says[1]:
>>
>>  xml  ... FAILED
>>
>> regression.diffs says:
>>
>> + SELECT x.* FROM t1, xmltable(XMLNAMESPACES(DEFAULT 'http://x.y'),
>> '/rows/row' PASSING t1.doc COLUMNS data int PATH
>> 'child::a[1][attribute::hoge="haha"]') as x;
>> + data
>> + --
>> + (0 rows)
>> +
>>
>> Maybe you forgot to git-add the expected file?
>>
>> [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/305979133
>
>
> unfortunately xml.out has 3 versions, and it is possible that a different
> one is selected elsewhere than on my computer.
>
> could you please send me your resulting xml.out file?
>

looks like this case is without xml support, so I can fix it on my computer.



> Regards
>
> Pavel
>
>
>>
>> --
>> Thomas Munro
>> http://www.enterprisedb.com
>>
>
>


Re: [HACKERS] proposal - Default namespaces for XPath expressions (PostgreSQL 11)

2017-11-24 Thread Pavel Stehule
Hi

2017-11-22 22:49 GMT+01:00 Thomas Munro :

> On Thu, Nov 9, 2017 at 10:11 PM, Pavel Stehule 
> wrote:
> > Attached new version.
>
> Hi Pavel,
>
> FYI my patch testing robot says[1]:
>
>  xml  ... FAILED
>
> regression.diffs says:
>
> + SELECT x.* FROM t1, xmltable(XMLNAMESPACES(DEFAULT 'http://x.y'),
> '/rows/row' PASSING t1.doc COLUMNS data int PATH
> 'child::a[1][attribute::hoge="haha"]') as x;
> + data
> + --
> + (0 rows)
> +
>
> Maybe you forgot to git-add the expected file?
>
> [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/305979133


unfortunately xml.out has 3 versions, and it is possible that a different
one is selected elsewhere than on my computer.

could you please send me your resulting xml.out file?

Regards

Pavel


>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-24 Thread Oliver Ford
On Fri, Nov 24, 2017 at 3:08 PM, Erikjan Rijkers  wrote:
> (debian 8)
>
> make check fails:
>
>  foreign_data ... ok
>  window   ... FAILED
>  xmlmap   ... ok
>
> The diff is:
>
> $ ( cd  /var/data1/pg_stuff/pg_sandbox/pgsql.frame_range/src/test/regress &&
> cat regression.diffs )
> ***
> /var/data1/pg_stuff/pg_sandbox/pgsql.frame_range/src/test/regress/expected/window.out
> 2017-11-24 15:36:15.387573714 +0100
> ---
> /var/data1/pg_stuff/pg_sandbox/pgsql.frame_range/src/test/regress/results/window.out
> 2017-11-24 15:38:35.290553157 +0100
> ***
> *** 1034,1043 
>   (10 rows)
>
>   SELECT pg_get_viewdef('v_window');
> ! pg_get_viewdef
> ! --
> !   SELECT i.i,+
> !  sum(i.i) OVER (ORDER BY i.i) AS sum_rows+
>   FROM generate_series(1, 10) i(i);
>   (1 row)
>
> --- 1034,1043 
>   (10 rows)
>
>   SELECT pg_get_viewdef('v_window');
> ! pg_get_viewdef
> !
> ---
> !   SELECT i.i,
> +
> !  sum(i.i) OVER (ORDER BY i.i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
> AS sum_rows+
>   FROM generate_series(1, 10) i(i);
>   (1 row)
>
>
> This small hiccup didn't prevent building an instance, but obviously I
> haven't done any real tests yet.
>

I think something was committed recently that changed the spacing of
pg_get_viewdef on Windows. I had the same spacing as you until I
pulled this morning, so I updated my expected output but now it looks
like whatever's changed on Windows hasn't changed on Linux.

I'll try and find what caused this change.



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2017-11-24 Thread Alvaro Herrera
A typo in all the messages the patch adds:
"to an another" -> "to another"

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] More stats about skipped vacuums

2017-11-24 Thread Robert Haas
On Tue, Nov 21, 2017 at 2:09 AM, Kyotaro HORIGUCHI
 wrote:
> Yes, my concern here is how many column we can allow in a stats
> view. I think I'm a bit too warried about that.

I think that's a good thing to worry about.   In the past, Tom has
expressed reluctance to make stats tables that have a row per table
any wider at all, IIRC.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: default range partition and constraint exclusion

2017-11-24 Thread Robert Haas
On Wed, Nov 22, 2017 at 4:21 AM, Amit Langote
 wrote:
>>> If all predicate_refuted_by() receives is the expression tree (AND/OR)
>>> with individual nodes being strict clauses involving partition keys (and
>>> nothing about the nullness of the keys), the downstream code is just
>>> playing by the rules as explained in the header comment of
>>> predicate_refuted_by_recurse() in concluding that query's restriction
>>> clause a = 2 refutes it.
>>
>> Oh, wait a minute.  Actually, I think predicate_refuted_by() is doing
>> the right thing here.  Isn't the problem that mc2p2 shouldn't be
>> accepting a (2, null) tuple at all?
>
> It doesn't.  But, for a query, it does contain (2, ) tuples,
> where  would always be non-null.  So, it should be scanned in the
> plan for the query that has only a = 2 as restriction and no restriction
> on b.  That seems to work.

OK, so I am still confused about whether the constraint is wrong or
the constraint exclusion logic is wrong.  One of them, at least, has
to be wrong, and we have to fix whichever one is wrong.  Fixing broken
constraint exclusion logic by hacking up the constraint, or conversely
fixing a broken constraint by hacking up the constraint exclusion
logic, wouldn't be right.

I think my last email was confused: I thought that the (2, null) tuple
was ending up in mc2p2, but it's really ending up in mc2p_default,
whose constraint currently looks like this:

NOT (
  ((a < 1) OR ((a = 1) AND (b < 1)))
OR
  ((a > 1) OR ((a = 1) AND (b >= 1)))
)

Now where exactly is constraint exclusion going wrong here?  a = 2
refutes a < 1 and a = 1, which means that (a < 1) OR ((a = 1) AND (b <
1)) must be false and that (a = 1) AND (b >= 1) must also be false.
But (a > 1) could be either true or null, which means (a > 1) OR ((a =
1) AND (b >= 1)) can be true or null, which means the whole thing can
be false or null, which means that it is not refuted by a = 2.  It
should be possible to dig down in there step by step and figure out
where the wheels are coming off -- have you tried to do that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2017-11-24 Thread Robert Haas
On Thu, Nov 23, 2017 at 6:48 AM, amul sul  wrote:
> And remaining are EvalPlanQualFetch, ExecOnConflictUpdate,
> RelationFindReplTupleByIndex & RelationFindReplTupleSeq.  Note that check in
> RelationFindReplTupleByIndex & RelationFindReplTupleSeq will have LOG not an
> ERROR.

The first one is going to come up when you have, for example, two
concurrent updates targeting the same row, and the second one when you
have an ON CONFLICT UPDATE clause.  I guess the latter two are
probably related to logical replication, and maybe not easy to test
via an automated regression test.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-24 Thread Erikjan Rijkers

On 2017-11-24 15:11, Oliver Ford wrote:

Adds RANGE BETWEEN with a start and end value, as well as an
exclusions clause, to the window functions. This partially resolves
TODO list item "Implement full support for window framing clauses".


[0001-window-frame-v1.patch]

(debian 8)

make check fails:

 foreign_data ... ok
 window   ... FAILED
 xmlmap   ... ok

The diff is:

$ ( cd  
/var/data1/pg_stuff/pg_sandbox/pgsql.frame_range/src/test/regress && cat 
regression.diffs )
*** 
/var/data1/pg_stuff/pg_sandbox/pgsql.frame_range/src/test/regress/expected/window.out 
  2017-11-24 15:36:15.387573714 +0100
--- 
/var/data1/pg_stuff/pg_sandbox/pgsql.frame_range/src/test/regress/results/window.out 
   2017-11-24 15:38:35.290553157 +0100

***
*** 1034,1043 
  (10 rows)

  SELECT pg_get_viewdef('v_window');
! pg_get_viewdef
! --
!   SELECT i.i,+
!  sum(i.i) OVER (ORDER BY i.i) AS sum_rows+
  FROM generate_series(1, 10) i(i);
  (1 row)

--- 1034,1043 
  (10 rows)

  SELECT pg_get_viewdef('v_window');
! pg_get_viewdef
! 
---
!   SELECT i.i,  
   +
!  sum(i.i) OVER (ORDER BY i.i ROWS BETWEEN 1 PRECEDING AND 1 
FOLLOWING) AS sum_rows+

  FROM generate_series(1, 10) i(i);
  (1 row)


This small hiccup didn't prevent building an instance, but obviously I
haven't done any real tests yet.



thanks,


Erik Rijkers



Re: [HACKERS] Commits don't block for synchronous replication

2017-11-24 Thread Simon Riggs
On 23 November 2017 at 11:11, Michael Paquier  wrote:

> This is older than the bug report of this thread. All those
> indications point out that the patch has *not* been committed. So it
> seems to me that you perhaps committed it to your local repository,
> but forgot to push it to the remote. I am switching back the patch
> status to what looks correct to me "Ready for committer". Thanks.

Yes, that looks like it's my mistake. Thanks for rechecking.

Will commit and backpatch when I get home.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Add RANGE with values and exclusions clauses to the Window Functions

2017-11-24 Thread Oliver Ford
Adds RANGE BETWEEN with a start and end value, as well as an
exclusions clause, to the window functions. This partially resolves
TODO list item "Implement full support for window framing clauses".

== Specification ==

The window functions already allow a "ROWS BETWEEN start_value
PRECEDING/FOLLOWING AND end_value PRECEDING/FOLLOWING" to restrict the
number of rows within a partition that are piped into an aggregate
function based on their position before or after the current row. This
patch adds an equivalent for RANGE which restricts the rows based on
whether the _values_ of the ORDER BY column for all other rows in the
partition are within the start_value and end_value bounds. This brings
PostgreSQL to parity with Oracle, and implements a SQL:2011 standard
feature.

SQL:2011 also defines a window frame exclusion clause, which excludes
certain rows from the result. This clause doesn't seem to be
implemented in any mainstream RDBMS (MariaDb mentions that fact in its
documentation here:
https://mariadb.com/kb/en/library/window-functions-overview/ and has
it on its TODO list). This patch implements three EXCLUDE clauses
described in the standard:

EXCLUDE CURRENT ROW - excludes the current row from the result
EXCLUDE TIES - excludes identical rows from the result
EXCLUDE NO OTHERS - does nothing, is the default behavior; exists
purely to describe the intention not to exclude any other rows

The RANGE BETWEEN clause requires a single ORDER BY column which must
be either an integer or a date/time type. If the column is a date/time
type then start_value and end_value must both be an interval type. If
the column is an integer, then the values must both be integers.
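
As a concrete illustration of the proposed syntax, here is an untested
sketch against a hypothetical "sales" table (not taken from the patch's
regression tests):

SELECT sale_date,
       amount,
       sum(amount) OVER (ORDER BY sale_date
                         RANGE BETWEEN '3 days'::interval PRECEDING
                                   AND '3 days'::interval FOLLOWING
                         EXCLUDE CURRENT ROW) AS nearby_total
FROM sales;

Here each row's frame contains the rows whose sale_date falls within three
days of the current row's value, and EXCLUDE CURRENT ROW then removes the
current row itself from the aggregate.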

== Testing ==

Tested on Windows with MinGW. All existing regression tests pass. New
tests and updated documentation are included. The tests show both the new
RANGE with values and the exclusion clause working, in both RANGE and
ROWS mode.

== Future Work ==

The standard also defines, in addition to RANGE and ROWS, a GROUPS
option with a corresponding EXCLUDE GROUP option. This also doesn't
seem to be implemented anywhere else, and I plan to implement it next.

This patch also adds some new error messages which have not been
internationalized.


0001-window-frame-v1.patch
Description: Binary data


Re: [HACKERS] WIP: Separate log file for extension

2017-11-24 Thread Antonin Houska
Antonin Houska  wrote:

> Magnus Hagander  wrote:
> 
> > > On Fri, Aug 25, 2017 at 12:12 AM, Antonin Houska  wrote:
> 
> > 
> > I like this idea in general.
> > 
> >  Then it's supposed to change some of its attributes
> > 
> > >  adjust_log_stream_attr(&stream->filename, "my_extension.log");
> > 
> > This, however, seems to be wrong.
> > 
> > The logfile name does not belong in the extension, it belongs in the
> > configuration file. I think the extension should set it's "stream id" or
> > whatever you want to call it, and then it should be possible to control in
> > postgresql.conf where that log is sent.
> 
> Doesn't the last paragraph of
> 
> https://www.postgresql.org/message-id/11412.1503912190%40localhost
> 
> address your concerns?

Besides a new version of the patch, an example extension is attached that uses
the feature.

> > Also, what if this extension is loaded on demand in a session and not via
> > shared_preload_libraries? It looks like the syslogger only gets the list of
> > configured streams when it starts?
> 
> Yes, the syslogger gets the list of streams only when it starts, so the
> extension that wants to use this feature needs to provide the file information
> via shared_preload_libraries. I consider it sufficient because various
> existing logging-related GUCs also can't be changed on-the-fly.
> 
> > In short, I think the solution should be more generic, and not "just for 
> > extensions".

statement_log.diff demonstrates how the feature can be used by various
subsystems of PG core. Please consider it an example rather than part of the
"separate log patch". Even if there were no other design questions, there's
too much copy & paste in guc.c. I have no good idea right now how to improve
this part.

> o.k. Any idea about dividing the streams into categories? Should they for
> example correspond somehow to categories of GUC variables?
> 
> > I completely missed this thread when I did my quick-wip at
> > https://www.postgresql.org/message-id/flat/cabuevexztl0gorywm9s4tr_ft3fmjbraxqdxj+bqzjpvmru...@mail.gmail.com#cabuevexztl0gorywm9s4tr_ft3fmjbraxqdxj+bqzjpvmru...@mail.gmail.com
> > -- some of the changes made were close enough that I got the two confused :)
> > Based on the feedback of that one, have you done any performance checks?
> 
> I don't expect mere routing of messages into multiple files to bring any
> overhead. I'll run some tests, just out of curiosity.

After having read the thread on your patch I think that the reason you were
asked to evaluate performance was that your patch can possibly make syslogger
a bottleneck. In contrast, my patch does not prevent user from disabling the
syslogger if it (the syslogger) seems to cause performance issues.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
new file mode 100644
index a3d4917..e3a0f48
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*** typedef struct
*** 465,470 
--- 465,471 
  
  static pid_t backend_forkexec(Port *port);
  static pid_t internal_forkexec(int argc, char *argv[], Port *port);
+ static Size get_backend_params_size(void);
  
  /* Type for a socket that can be inherited to a client process */
  #ifdef WIN32
*** typedef int InheritableSocket;
*** 483,489 
--- 484,496 
   */
  typedef struct
  {
+ 	/*
+ 	 * read_backend_variables() relies on size to be the first field, followed
+ 	 * by port.
+ 	 */
+ 	Size		size;
  	Port		port;
+ 
  	InheritableSocket portsocket;
  	char		DataDir[MAXPGPATH];
  	pgsocket	ListenSocket[MAXLISTEN];
*** typedef struct
*** 529,534 
--- 536,543 
  	char		my_exec_path[MAXPGPATH];
  	char		pkglib_path[MAXPGPATH];
  	char		ExtraOptions[MAXPGPATH];
+ 	int			nlogstreams;
+ 	char		log_streams[FLEXIBLE_ARRAY_MEMBER];
  } BackendParameters;
  
  static void read_backend_variables(char *id, Port *port);
*** internal_forkexec(int argc, char *argv[]
*** 4476,4486 
  	static unsigned long tmpBackendFileNum = 0;
  	pid_t		pid;
  	char		tmpfilename[MAXPGPATH];
! 	BackendParameters param;
  	FILE	   *fp;
  
! 	if (!save_backend_variables(¶m, port))
  		return -1;/* log made by save_backend_variables */
  
  	/* Calculate name for temp file */
  	snprintf(tmpfilename, MAXPGPATH, "%s/%s.backend_var.%d.%lu",
--- 4485,4502 
  	static unsigned long tmpBackendFileNum = 0;
  	pid_t		pid;
  	char		tmpfilename[MAXPGPATH];
! 	Size		param_size;
! 	BackendParameters *param;
  	FILE	   *fp;
  
! 	param_size = get_backend_params_size();
! 	param = (BackendParameters *) palloc(param_size);
! 	if (!save_backend_variables(param, port))
! 	{
! 		pfree(param);
  		return -1;/* log made by save_backend_variables */
+ 	}
+ 	Assert(param->size == param_size);
  
  	/* Calculate name for temp file */
  	snprint

Re: scan-build plpython stuff

2017-11-24 Thread John Naylor
Peter,

I built plpython with scan-build using Python 2.7.12 and Clang 3.8. On
master, I got 13 warnings, and with your patches only one warning
(report attached).

Make installcheck passes.
Let me know if I can test anything else.

-John Naylor
Title: plpy_spi.c

Bug Summary

File: plpy_spi.c
Location: line 444, column 20
Description: Access to field 'tupdesc' results in a dereference of a null pointer (loaded from variable 'result')


Annotated Source Code

[scan-build's annotated listing of plpy_spi.c, showing the path to the
null-pointer dereference at line 444 -- garbled and truncated in the
plain-text archive, omitted here]

Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-11-24 Thread Antonin Houska
Masahiko Sawada  wrote:

> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
>  wrote:
> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada  
> > wrote:
> >>
> >> Because I don't want to break the current user semantics. that is,
> >> currently it's guaranteed that the subsequent reads can see the
> >> committed result of previous writes even if the previous transactions
> >> were distributed transactions. And it's ensured by writer side. If we
> >> can make the reader side ensure it, the backend process don't need to
> >> wait for the resolver process.
> >>
> >> The waiting backend process are released by resolver process after the
> >> resolver process tried to resolve foreign transactions. Even if
> >> resolver process failed to either connect to foreign server or to
> >> resolve foreign transaction the backend process will be released and
> >> the foreign transactions are leaved as dangling transaction in that
> >> case, which are processed later. Also if resolver process takes a long
> >> time to resolve foreign transactions for whatever reason the user can
> >> cancel it by Ctl-c anytime.
> >>
> >
> > So, there's no guarantee that the next command issued from the
> > connection *will* see the committed data, since the foreign
> > transaction might not have committed because of a network glitch
> > (say). If we go this route of making backends wait for resolver to
> > resolve the foreign transaction, we will have add complexity to make
> > sure that the waiting backends are woken up in problematic events like
> > crash of the resolver process OR if the resolver process hangs in a
> > connection to a foreign server etc. I am not sure that the complexity
> > is worth the half-guarantee.
> >
> 
> Hmm, maybe I was wrong. I now think that the waiting backends can be
> woken up only in following cases;
> - The resolver process succeeded to resolve all foreign transactions.
> - The user did the cancel (e.g. ctl-c).
> - The resolver process failed to resolve foreign transaction for a
> reason of there is no such prepared transaction on foreign server.
> 
> In other cases the resolver process should not release the waiters.

I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
effort is needed to ensure that the backend does not keep waiting if the
resolver can't finish its work in the foreseeable future. But this effort is
not worthwhile, because by waking the backend up you might prevent the next
transaction from seeing the changes the previous one tried to make."

On the other hand, your last comments indicate that you try to be even more
stringent in letting the backend wait. However even this stringent approach
does not guarantee that the next transaction will see the data changes made by
the previous one.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] Add support for tuple routing to foreign partitions

2017-11-24 Thread Etsuro Fujita

(2017/10/27 20:00), Etsuro Fujita wrote:

Please find attached an updated version of the patch.


Amit rebased this patch and sent me the rebased version off list. 
Thanks for that, Amit!


One thing I noticed I had overlooked concerns this change I added to 
make_modifytable to build a valid-looking plan node to pass to 
PlanForeignModify when planning the remote insert for each foreign partition:


+   /*
+* The column list of the child might have a different column
+* order and/or a different set of dropped columns than that
+* of its parent, so adjust the subplan's tlist as well.
+*/
+   tlist = preprocess_targetlist(root,
+ child_parse->targetList);

This would be needed because the FDW might reference the tlist.  Since 
preprocess_targetlist references root->parse, it's needed to replace 
that with the child query before calling that function, but I forgot to 
do that.  So I fixed that.  Attached is an updated version of the patch.


Best regards,
Etsuro Fujita
*** a/contrib/file_fdw/input/file_fdw.source
--- b/contrib/file_fdw/input/file_fdw.source
***
*** 176,182  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', 
delimiter ',');
--- 176,188 
  SELECT tableoid::regclass, * FROM pt;
  SELECT tableoid::regclass, * FROM p1;
  SELECT tableoid::regclass, * FROM p2;
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (1, 'xyzzy');
+ \t off
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (2, 'xyzzy');
+ \t off
  INSERT INTO pt VALUES (2, 'xyzzy');
  SELECT tableoid::regclass, * FROM pt;
  SELECT tableoid::regclass, * FROM p1;
*** a/contrib/file_fdw/output/file_fdw.source
--- b/contrib/file_fdw/output/file_fdw.source
***
*** 315,321  SELECT tableoid::regclass, * FROM p2;
  (0 rows)
  
  COPY pt FROM '@abs_srcdir@/data/list2.bad' with (format 'csv', delimiter 
','); -- ERROR
! ERROR:  cannot route inserted tuples to a foreign table
  CONTEXT:  COPY pt, line 2: "1,qux"
  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ',');
  SELECT tableoid::regclass, * FROM pt;
--- 315,321 
  (0 rows)
  
  COPY pt FROM '@abs_srcdir@/data/list2.bad' with (format 'csv', delimiter 
','); -- ERROR
! ERROR:  cannot route copied tuples to a foreign table
  CONTEXT:  COPY pt, line 2: "1,qux"
  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ',');
  SELECT tableoid::regclass, * FROM pt;
***
*** 341,348  SELECT tableoid::regclass, * FROM p2;
   p2   | 2 | qux
  (2 rows)
  
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
! ERROR:  cannot route inserted tuples to a foreign table
  INSERT INTO pt VALUES (2, 'xyzzy');
  SELECT tableoid::regclass, * FROM pt;
   tableoid | a |   b   
--- 341,366 
   p2   | 2 | qux
  (2 rows)
  
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (1, 'xyzzy');
+  Insert on public.pt
+Foreign Insert on public.p1
+Insert on public.p2
+->  Result
+  Output: 1, 'xyzzy'::text
+ 
+ \t off
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
! ERROR:  cannot route inserted tuples to foreign table "p1"
! \t on
! EXPLAIN (VERBOSE, COSTS FALSE) INSERT INTO pt VALUES (2, 'xyzzy');
!  Insert on public.pt
!Foreign Insert on public.p1
!Insert on public.p2
!->  Result
!  Output: 2, 'xyzzy'::text
! 
! \t off
  INSERT INTO pt VALUES (2, 'xyzzy');
  SELECT tableoid::regclass, * FROM pt;
   tableoid | a |   b   
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 7029,7034  NOTICE:  drop cascades to foreign table bar2
--- 7029,7236 
  drop table loct1;
  drop table loct2;
  -- ===
+ -- test tuple routing for foreign-table partitions
+ -- ===
+ create table pt (a int, b int) partition by list (a);
+ create table locp partition of pt for values in (1);
+ create table locfoo (a int check (a in (2)), b int);
+ create foreign table remp partition of pt for values in (2) server loopback 
options (table_name 'locfoo');
+ explain (verbose, costs off)
+ insert into pt values (1, 1), (2, 1);
+QUERY PLAN
+ -
+  Insert on public.pt
+Insert on public.locp
+Foreign Insert on public.remp
+  Remote SQL: INSERT INTO public.locfoo(a, b) VALUES ($1, $2)
+->  Values Scan on "*VALUES*"
+  Output: "*VALUES*".column1, "*VALUES*".column2
+ (6 rows)
+ 
+ insert into pt values (1, 1), (2, 1);
+ select tableoid::regclass, * FROM pt;
+  tableoid | a | b 
+ --+---+---
+  locp | 1 | 1
+  remp | 2 | 1
+ (2 rows)
+ 
+ select tableoi

Re: [HACKERS] Parallel Append implementation

2017-11-24 Thread Rajkumar Raghuwanshi
On Thu, Nov 23, 2017 at 2:22 PM, amul sul  wrote:
> Look like it is the same crash what v20 claim to be fixed, indeed I
> missed to add fix[1] in v20 patch, sorry about that. Attached updated
> patch includes aforementioned fix.

Hi,

I have applied the latest v21 patch, and it crashed with
partition-wise join enabled. The same query works fine both with and
without partition-wise join enabled on PG head.
Please take a look.

SET enable_partition_wise_join TO true;

CREATE TABLE pt1 (a int, b int, c text, d int) PARTITION BY LIST(c);
CREATE TABLE pt1_p1 PARTITION OF pt1 FOR VALUES IN ('', '0001',
'0002', '0003');
CREATE TABLE pt1_p2 PARTITION OF pt1 FOR VALUES IN ('0004', '0005',
'0006', '0007');
CREATE TABLE pt1_p3 PARTITION OF pt1 FOR VALUES IN ('0008', '0009',
'0010', '0011');
INSERT INTO pt1 SELECT i % 20, i % 30, to_char(i % 12, 'FM'), i %
30 FROM generate_series(0, 9) i;
ANALYZE pt1;

CREATE TABLE pt2 (a int, b int, c text, d int) PARTITION BY LIST(c);
CREATE TABLE pt2_p1 PARTITION OF pt2 FOR VALUES IN ('', '0001',
'0002', '0003');
CREATE TABLE pt2_p2 PARTITION OF pt2 FOR VALUES IN ('0004', '0005',
'0006', '0007');
CREATE TABLE pt2_p3 PARTITION OF pt2 FOR VALUES IN ('0008', '0009',
'0010', '0011');
INSERT INTO pt2 SELECT i % 20, i % 30, to_char(i % 12, 'FM'), i %
30 FROM generate_series(0, 9) i;
ANALYZE pt2;

EXPLAIN ANALYZE
SELECT t1.c, sum(t2.a), COUNT(*) FROM pt1 t1 FULL JOIN pt2 t2 ON t1.c
= t2.c GROUP BY t1.c ORDER BY 1, 2, 3;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

stack-trace is given below.

Core was generated by `postgres: parallel worker for PID 73935
 '.
Program terminated with signal 11, Segmentation fault.
#0  0x006dc4b3 in ExecProcNode (node=0x7f7f7f7f7f7f7f7e) at
../../../src/include/executor/executor.h:238
238if (node->chgParam != NULL) /* something changed? */
Missing separate debuginfos, use: debuginfo-install
keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
libcom_err-1.41.12-23.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x006dc4b3 in ExecProcNode (node=0x7f7f7f7f7f7f7f7e) at
../../../src/include/executor/executor.h:238
#1  0x006dc72e in ExecAppend (pstate=0x26cd6e0) at nodeAppend.c:207
#2  0x006d1e7c in ExecProcNodeInstr (node=0x26cd6e0) at
execProcnode.c:446
#3  0x006dcee5 in ExecProcNode (node=0x26cd6e0) at
../../../src/include/executor/executor.h:241
#4  0x006dd38c in fetch_input_tuple (aggstate=0x26cd7f8) at
nodeAgg.c:699
#5  0x006e02eb in agg_fill_hash_table (aggstate=0x26cd7f8) at
nodeAgg.c:2536
#6  0x006dfb2b in ExecAgg (pstate=0x26cd7f8) at nodeAgg.c:2148
#7  0x006d1e7c in ExecProcNodeInstr (node=0x26cd7f8) at
execProcnode.c:446
#8  0x006d1e4d in ExecProcNodeFirst (node=0x26cd7f8) at
execProcnode.c:430
#9  0x006c9439 in ExecProcNode (node=0x26cd7f8) at
../../../src/include/executor/executor.h:241
#10 0x006cbd73 in ExecutePlan (estate=0x26ccda0,
planstate=0x26cd7f8, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001', numberTuples=0,
direction=ForwardScanDirection, dest=0x26b2ce0, execute_once=1
'\001') at execMain.c:1718
#11 0x006c9a12 in standard_ExecutorRun (queryDesc=0x26d7fa0,
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at
execMain.c:361
#12 0x006c982e in ExecutorRun (queryDesc=0x26d7fa0,
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at
execMain.c:304
#13 0x006d096c in ParallelQueryMain (seg=0x26322a8,
toc=0x7fda24d46000) at execParallel.c:1271
#14 0x0053272d in ParallelWorkerMain (main_arg=1203628635) at
parallel.c:1149
#15 0x007e8c99 in StartBackgroundWorker () at bgworker.c:841
#16 0x007fc029 in do_start_bgworker (rw=0x2656d00) at postmaster.c:5741
#17 0x007fc36b in maybe_start_bgworkers () at postmaster.c:5945
#18 0x007fb3fa in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5134
#19 
#20 0x003dd26e1603 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:82
#21 0x007f6bee in ServerLoop () at postmaster.c:1721
#22 0x007f63dd in PostmasterMain (argc=3, argv=0x2630180) at
postmaster.c:1365
#23 0x0072cb40 in main (argc=3, argv=0x2630180) at main.c:228

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation



explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-11-24 Thread hubert depesz lubaczewski
Hi,

up to parallel executions, when we had node in explain analyze showing
"loops=x" with x more than 1, it meant that the "actual time" had to be
multiplied by loops to get real time spent in a node.

For example, check step 13 in https://explain.depesz.com/s/gNBd

It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.

But with parallel execution it seems to be no longer the case.

For example:
https://explain.depesz.com/s/LTMp
or
https://explain.depesz.com/s/QHRi

It looks that the actual time is really actual time, and loops is
"worker nodes + 1".

Is that really the case? Should I, for explain.depesz.com, use "loops=1"
when calculating exclusive/inclusive time for Partial* and Parallel* nodes?
Always? For some other nodes as well?

Or am I missing something here?
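
To make that concrete, here is a minimal sketch (plain C with made-up field
names, not the actual explain.depesz.com code) of the two interpretations
I'm choosing between:

#include <stdio.h>

/* Hypothetical, simplified view of one EXPLAIN ANALYZE node; placeholder
 * names only. */
typedef struct PlanNode
{
    double actual_time_ms;  /* upper "actual time" bound as printed */
    double loops;           /* "loops" as printed */
    int    is_parallel;     /* Partial or Parallel node under a Gather? */
} PlanNode;

/* Interpretation A: the pre-parallel rule -- always multiply by loops. */
static double
inclusive_time_a(const PlanNode *n)
{
    return n->actual_time_ms * n->loops;
}

/* Interpretation B: for parallel nodes, treat the printed time as the real
 * elapsed time, i.e. behave as if loops were 1. */
static double
inclusive_time_b(const PlanNode *n)
{
    return n->is_parallel ? n->actual_time_ms
                          : n->actual_time_ms * n->loops;
}

int
main(void)
{
    PlanNode step13   = { 3.0, 1873, 0 };   /* values from the first plan */
    PlanNode par_scan = { 1000.0, 3, 1 };   /* made-up: 2 workers + leader */

    printf("step 13:  A=%.0f ms  B=%.0f ms\n",
           inclusive_time_a(&step13), inclusive_time_b(&step13));
    printf("parallel: A=%.0f ms  B=%.0f ms\n",
           inclusive_time_a(&par_scan), inclusive_time_b(&par_scan));
    return 0;
}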

Best regards,

depesz



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-24 Thread Alexander Korotkov
Dear Amit,

Thank you for your attention to this patch.

On Thu, Nov 23, 2017 at 4:39 PM, Amit Kapila 
wrote:

> On Thu, Jun 22, 2017 at 9:06 PM, Alexander Korotkov
>  wrote:
> > Work on this patch took longer than I expected.  It is still in not so
> > good shape, but I decided to publish it anyway in order to not stop
> > progress in this area.
> > I also tried to split this patch into several.  But actually I managed to
> > separate only a few small pieces, while most of the changes remain in the
> > single big diff.
> > Long story short, the patchset is attached.
> >
> > 0001-64bit-guc-relopt-1.patch
> > This patch implements 64-bit GUCs and relation options which are used in
> > further patches.
> >
> > 0002-heap-page-special-1.patch
> > Putting xid and multixact bases into PageHeaderData would take an extra
> > 16 bytes on index pages too.  That would be a waste of space for indexes.
> > This is why I decided to put the bases into the special area of heap pages.
> > This patch adds a special area for heap pages containing the prune xid and
> > a magic number.  The magic number is different for regular heap pages and
> > sequence pages.
> >
>
> uint16 pd_pagesize_version;
> - TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
>   ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
>   } PageHeaderData;
>
> Why have you moved pd_prune_xid from page header?
>

pd_prune_xid makes sense only for heap pages.  Once we introduce a special
area for heap pages, we can move pd_prune_xid there and save some bytes in
index pages.  However, this is an optimization not directly related to
64-bit xids.  The idea is that if we are changing the page format anyway,
why not apply this optimization as well?  But if we have any doubts about
it, it can be removed with no problem.
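
Roughly, the heap page special area would look something like this (a
simplified sketch; the field names and layout here are approximate, not the
exact definitions from the patch):

#include <stdint.h>

/* Simplified sketch only -- names and layout are approximate. */
typedef uint64_t TransactionId64;      /* 64-bit xid (0003) */
typedef uint32_t ShortTransactionId;   /* 32-bit on-tuple xid (0003) */

typedef struct HeapPageSpecialData
{
    TransactionId64    pd_xid_base;    /* base added to t_xmin/t_xmax */
    TransactionId64    pd_multi_base;  /* base added to multixact ids */
    ShortTransactionId pd_prune_xid;   /* oldest prunable XID, moved out of
                                        * PageHeaderData so that index pages
                                        * don't pay for it */
    uint16_t           pd_magic;       /* regular heap page vs. sequence page */
} HeapPageSpecialData;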


> > 0003-64bit-xid-1.patch
> > It's the major patch.  It redefines TransactionId as a 64-bit integer and
> > defines a 32-bit ShortTransactionId which is used for t_xmin and t_xmax.
> > Transaction id comparison becomes straight instead of circular.  Base
> > values for xids and multixact ids are stored in the heap page special
> > area.  SLRUs also became 64-bit and non-circular.  To be able to calculate
> > xmin/xmax without accessing the heap page, the base values are copied into
> > HeapTuple.  Correspondingly, HeapTupleHeader(Get|Set)(Xmin|Xmax) becomes
> > just HeapTuple(Get|Set)(Xmin|Xmax), which requires a HeapTuple, not just a
> > HeapTupleHeader.  heap_page_prepare_for_xid() is used to ensure that a
> > given xid fits the particular page base.  If it doesn't fit, the base of
> > the page is shifted, which could require a single-page freeze.  The WAL
> > format is changed in order to prevent unaligned access to TransactionId.
> > *_age GUCs and relation options are changed to 64-bit.  Forced "autovacuum
> > to prevent wraparound" is removed, but there is still a freeze to truncate
> > SLRUs.
> >
>
> It seems there is no README or some detailed explanation of how all
> this works like how the value of pd_xid_base is maintained.  I don't
> think there are enough comments in the patch to explain the things.  I
> think it will be easier to understand and review the patch if you
> provide some more details either in email or in the patch.
>

OK.  I'm going to write a README and include it in the patch.
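
In the meantime, the gist of it as a simplified sketch (the actual code
differs in details and naming):

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t TransactionId64;
typedef uint32_t ShortTransactionId;

/*
 * A tuple stores only a 32-bit offset from the page's 64-bit base.
 * (In reality, special xids such as Invalid/Bootstrap/Frozen would be
 * handled separately and not base-adjusted.)
 */
static inline TransactionId64
tuple_get_xmin(TransactionId64 pd_xid_base, ShortTransactionId t_xmin)
{
    return pd_xid_base + t_xmin;
}

/*
 * Before stamping a new 64-bit xid onto a page, check whether it can be
 * represented as base + 32-bit offset.  If not, the page's base has to be
 * shifted first, which may require a single-page freeze -- roughly what
 * heap_page_prepare_for_xid() takes care of.
 */
static inline bool
xid_fits_page_base(TransactionId64 pd_xid_base, TransactionId64 xid)
{
    return xid >= pd_xid_base && xid - pd_xid_base <= UINT32_MAX;
}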


> > 0004-base-values-for-testing-1.patch
> > This patch is used for testing that calculations using 64-bit bases and
> > short 32-bit xid values are correct.  It provides initdb options for the
> > initial xid, multixact id and multixact offset values.  Regression tests
> > initialize the cluster with large (more than 2^32) values.
> >
> > There are a lot of open items, but I would like to point out some of them:
> >  * WAL becomes significantly larger due to storing 8-byte xids instead of
> > 4-byte xids.  Probably it's necessary to use the base approach in WAL too.
> >
>
> Yeah and I think it can impact performance as well.  By any chance
> have you run pgbench read-write to see the performance impact of this
> patch?
>

Sure, I'll run some benchmarks on both 32-bit and 64-bit machines.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: documentation is now XML

2017-11-24 Thread Michael Paquier
On Fri, Nov 24, 2017 at 5:39 AM, Tom Lane  wrote:
> Peter Eisentraut  writes:
>> The documentation sources are now DocBook XML, not SGML.  (The files are
>> still named *.sgml.  That's something to think about separately.)
>
> I think we should have a discussion about whether it'd be smart
> to convert the back branches' documentation to XML as well.
>
> The main reason that I want to consider this is that back-patching
> documentation fixes is going to be a huge problem if we don't.

Things like c29c578 and 1ff01b3 only found their way onto HEAD. There is
a bit of work needed here for the back-branches. In the end I would vote
for having consistent documentation on all active branches.

> Using the same doc-building toolchain across all branches seems like a win
> as well.  You could argue that switching build requirements in a minor
> release is unfriendly to packagers; but those who build their own docs
> have already had to adapt to the xsltproc/fop toolchain for v10, so
> standardizing on that for 9.3-9.6 as well doesn't seem like it complicates
> their lives.  (Possibly we should canvass opinions on pgsql-packagers to
> be sure of that.)

My own packaging is going to need some tweaks as well, but nothing
difficult. A discussion on -packagers is definitely warranted, since not
everyone has the same toolchain set.
-- 
Michael



Re: [HACKERS] Custom compression methods

2017-11-24 Thread Ildus Kurbangaliev
On Thu, 23 Nov 2017 21:54:32 +0100
Tomas Vondra  wrote:
 
> 
> Hmm, this seems to have fixed it, but only in one direction. Consider
> this:
> 
> create table t_pglz (v text);
> create table t_lz4 (v text compressed lz4);
> 
> insert into t_pglz select repeat(md5(i::text),300)
> from generate_series(1,10) s(i);
> 
> insert into t_lz4 select repeat(md5(i::text),300)
> from generate_series(1,10) s(i);
> 
> \d+
> 
>  Schema |  Name  | Type  | Owner | Size  | Description
> ++---+---+---+-
>  public | t_lz4  | table | user  | 12 MB |
>  public | t_pglz | table | user  | 18 MB |
> (2 rows)
> 
> truncate t_pglz;
> insert into t_pglz select * from t_lz4;
> 
> \d+
> 
>  Schema |  Name  | Type  | Owner | Size  | Description
> ++---+---+---+-
>  public | t_lz4  | table | user  | 12 MB |
>  public | t_pglz | table | user  | 18 MB |
> (2 rows)
> 
> which is fine. But in the other direction, this happens
> 
> truncate t_lz4;
> insert into t_lz4 select * from t_pglz;
> 
>  \d+
>List of relations
>  Schema |  Name  | Type  | Owner | Size  | Description
> ++---+---+---+-
>  public | t_lz4  | table | user  | 18 MB |
>  public | t_pglz | table | user  | 18 MB |
> (2 rows)
> 
> which means the data is still pglz-compressed. That's rather strange,
> I guess, and it should compress the data using the compression method
> set for the target table instead.

That's actually an interesting issue. It happens because if the tuple fits
in the page, postgres just moves it as is. I've just added recompression
for tuples that contain custom-compressed datums, to keep the dependencies
right. But look:

  create table t1(a text);
  create table t2(a text);
  alter table t2 alter column a set storage external;
  insert into t1 select repeat(md5(i::text),300) from
generate_series(1,10) s(i);
  \d+

  List of relations 
   Schema | Name | Type  | Owner |Size| Description 
  +--+---+---++-
   public | t1   | table | ildus | 18 MB  | 
   public | t2   | table | ildus | 8192 bytes | 
  (2 rows)

  insert into t2 select * from t1;

  \d+

List of relations
   Schema | Name | Type  | Owner | Size  | Description 
  +--+---+---+---+-
   public | t1   | table | ildus | 18 MB | 
   public | t2   | table | ildus | 18 MB | 
  (2 rows)

That means compressed datums are now in a column whose storage is specified
as external. I'm not sure whether that's a bug or a feature. Let's insert
them the usual way:

  delete from t2;
  insert into t2 select repeat(md5(i::text),300) from
generate_series(1,10) s(i);
  \d+

 List of relations
   Schema | Name | Type  | Owner |  Size   | Description 
  +--+---+---+-+-
   public | t1   | table | ildus | 18 MB   | 
   public | t2   | table | ildus | 1011 MB | 

Maybe there should be a more general solution, like comparing the attribute
properties?
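
Something along these lines (just a sketch of the idea with made-up names,
not actual patch code):

#include <stdbool.h>

/* Hypothetical, simplified per-attribute description -- these are not the
 * actual catalog structs, only an illustration of the check I mean. */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

typedef struct AttrProps
{
    char attstorage;      /* 'p', 'e', 'm' or 'x', as in pg_attribute */
    Oid  attcompression;  /* compression method, InvalidOid for the default */
} AttrProps;

/*
 * Decide whether a toasted/compressed datum coming from the source column
 * can be copied into the destination column as is, or whether it has to be
 * detoasted and re-stored according to the destination's properties.
 */
static bool
datum_can_be_moved_as_is(const AttrProps *src, const AttrProps *dst)
{
    if (src->attcompression != dst->attcompression)
        return false;     /* different compression method: recompress */
    if (src->attstorage != dst->attstorage)
        return false;     /* e.g. extended vs. external: re-toast */
    return true;
}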

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company