date:20160804

Re: [HACKERS] Detecting skipped data from logical slots (data silently skipped)

2016-08-04 Thread Andres Freund

On 2016-08-05 00:12:41 -0400, Robert Haas wrote:
> > The cause is an optimisation intended to allow the downstream to avoid
> > having to do local writes and flushes when the upstream's activity isn't of
> > interest to it and doesn't result in replicated rows. When the upstream does
> > a bunch of writes to another database or otherwise produces WAL not of
> > interest to the downstream we send the downstream keepalive messages that
> > include the upstream's current xlog position and the client replies to
> > acknowledge it's seen the new LSN. But, so that we can avoid disk flushes on
> > the downstream, we permit it to skip advancing its replication origin in
> > response to those keepalives. We continue to advance the confirmed_flush_lsn
> > and restart_lsn in the replication slot on the upstream so we can free WAL
> > that's not needed and move the catalog_xmin up. The replication origin on
> > the downstream falls behind the confirmed_flush_lsn on the upstream.
> 
> This seems entirely too clever.  The upstream could safely remember
> that if the downstream asks for WAL position X it's safe to begin
> streaming from WAL position Y because nothing in the middle is
> interesting, but it can hardly decide to unilaterally ignore the
> request position.
> 
> > The simplest fix would be to require downstreams to flush their replication
> > origin when they get a hot standby feedback message, before they send a
> > reply with confirmation. That could be somewhat painful for performance, but
> > can be alleviated somewhat by waiting for the downstream postgres to get
> > around to doing a flush anyway and only forcing it if we're getting close to
> > the walsender timeout. That's pretty much what BDR and pglogical do when
> > applying transactions to avoid having to do a disk flush for each and every
> > applied xact. Then we change START_REPLICATION ... LOGICAL so it ERRORs if
> > you ask for a too-old LSN rather than silently ignoring it.
> 
> That's basically just proposing to revert this broken optimization,
> IIUC, and instead just try not to flush too often on the standby.

The effect of the optimization is *massive* if you are replicating a
less active database, or a less active subset of a database, in a
cluster with lots of other activity. I don't think that can just be
disregard, to protect against something with plenty of other failure
scenarios.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Possible duplicate release of buffer lock.

2016-08-04 Thread Kyotaro HORIGUCHI

Hello,

At Thu, 4 Aug 2016 16:09:17 -0700, Peter Geoghegan  wrote in 

> On Wed, Aug 3, 2016 at 1:31 AM, Kyotaro HORIGUCHI
>  wrote:
> > The first line is emitted for simultaneous deletion of a index
> > page, which is impossible by design in a consistent state so the
> > complained situation should be the result of an index corruption
> > before upgading, specifically, inconsistent sibling pointers
> > around a deleted page.
> 
> > My point here is that if concurrent deletion can't be perfomed by
> > the current implement, this while loop could be removed and
> > immediately error out or log a message,
> 
> Perhaps I have misunderstood, but I must ask: Are you aware that 9.4
> has commit efada2b8, which fixes a bug with page deletion, that never
> made it into earlier branches?

_bt_unlink_halfdead_page in nbtpage.c of 9.4 that I visited this
time is after the commit. And it doesn't change the condition for
the "no left sibling" message. If I misunderstood you.

> It is worth noting that anything that can make VACUUM in particular
> raise an error is considered problematic. For example, we do not allow
> VACUUM to finish incomplete B-Tree page splits, even though we could,
> because there might be no disk space for even one extra page at the
> worst possible time (we VACUUM in part to free disk space, of course).
> We really want to let VACUUM finish, if at all possible, even if it
> sees something that indicates data corruption.

Yes, It seems to me that the "no left sibling" is ignorable since
skipping for the case doesn't more harm.

For this inquiry case, VACUUM stopped by a seemingly
double-releasing of a lock just after "no left sibling". I
suspect that it is caused by skipping deleted page just left of
the target page and it can happen for certain type of
corruption. (writing out pages in certain order then crash and
incomplete crash recovery after that would cause this..)

If concurrent deletion (marking half-dead, specifically) can't
happen, P_ISDELETED() won't be seen in the loop but we may
consider it as a skippable condition to continue VACUUM. But
still I think we shouldn't release the lock on the target page
there. But it doesn't harm except that VACUUM stops there so I
don't mind it will be as it is. So I'd like to have opinions of
others about this.

> I remember that during the review of the similar B-Tree page split
> patch (not efada2b8, but rather commit 40dae7ec, which was also in
> 9.4), I found bugs due to functions not being very clear about whether
> a buffer lock and/or pin were held upon returning from a function in
> all cases, including very rare cases or "can't happen" cases. So, this
> seems kind of familiar.

I think the point is that is a danger leads to wrong result or
crash of server, or not. On that criteria, this doesn't harm,
maybe.

> Code comments in the relevant function say this:
> 
>  * Returns 'false' if the page could not be unlinked (shouldn't happen).
>  * If the (new) right sibling of the page is empty, *rightsib_empty is set
>  * to true.
>  */
> static bool
> _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
> {
> 
> What we see here is that this "shouldn't happen" condition does, in
> fact, happen very rarely, including in cases where the target is not a
> leaf page.
> 
> I think there might well be a real bug here, because we are not
> careful enough in ensuring that the _bt_unlink_halfdead_page()
> "contract" holds when something that "can't happen" does, in fact,
> happen:
> 
> /*
>  * Release the target, if it was not the leaf block.  The leaf is always
>  * kept locked.
>  */
> if (target != leafblkno)
> _bt_relbuf(rel, buf);
> 
> return true;
> }
> 
> (Assuming you can call this code at the end of
> _bt_unlink_halfdead_page() its "contract".)
> 
> It's not good at all if the code is stuck in a way that prevents
> VACUUM from finishing, as I said. That seems to be the case here,
> judging from Horiguchi-san's log output, which shows an ERROR from
> "lock main 13879 is not held". The problem is not so much that a
> "can't happen" thing happens (data corruption will sometimes make "the
> impossible" possible, after all -- we know this, and write code
> defensively because of this). The bug is that there is any ERROR for
> VACUUM that isn't absolutely unavoidable (the damage has been done
> anyway -- the index is already corrupt).

Agreed. My thought is almost the as this, it seems to be better
to save such 'not absolutely ununvoidable' case unless it's too
complicated.

> I'll need to think about this some more, when I have more time.
> Perhaps tomorrow.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Possible duplicate release of buffer lock.

2016-08-04 Thread Kyotaro HORIGUCHI

At Wed, 03 Aug 2016 12:00:33 -0400, Tom Lane  wrote in 
<26373.1470240...@sss.pgh.pa.us>
> Kyotaro HORIGUCHI  writes:
> > My point here is that if concurrent deletion can't be perfomed by
> > the current implement, this while loop could be removed and
> > immediately error out or log a message,
> 
> >> if (P_ISDELETED(opaque) || opaque->btpo_next != target)
> >> {
> >> elog(ERROR, "no left sibling of page %d (concurrent deletion?) in 
> >> \"%s\"",..
> 
> That would certainly break things: there are valid cases for the
> loop to need to iterate, specifically where the left sibling got
> split before we could acquire lock on it.

Sorry for the garbage. It indeed is just a breakage.

> > or, the while loop at least should stop before overshooting the
> > target.
> 
> >> while (P_ISDELETED(opaque) || opaque->btpo_next != target)
> >> {
> >> /* step right one page */
> >> leftsib = opaque->btpo_next;
> >> _bt_relbuf(rel, lbuf);
> >> if (leftsib == target || leftsib == P_NONE)
> >> {
> >> elog(ERROR, "no left sibling of page %d (concurrent deletion?) in 
> >> \"%s\"",..
> 
> Huh?  Surely that added test condition could never be true because
> of the second part of the while() test.

It can be masked by P_ISDELETED(opaque), this story is for the
case that the "deleted" page has the right link to the
target. More precisely, the case where a deleted page is seen
during traversing sibling-link even though starting from a live
page (or target, specifically).

As Peter's reply, this is in the case of curruption, which could
be saved just in order to continue performing VACUUM. Or a
violation on a "contract" about page locks. If it's useless, the
P_ISDELETED is also useless unless any corruption, or concurrent
deletion.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Pavan Deolasee

On Fri, Aug 5, 2016 at 8:23 AM, Claudio Freire 
wrote:

> On Thu, Aug 4, 2016 at 11:15 PM, Bruce Momjian  wrote:
>
> >
> > OK, that's a lot of text, and I am confused.  Please tell me the
> > advantage of having an index point to a string of HOT chains, rather
> > than a single one?  Is the advantage that the index points into the
> > middle of the HOT chain rather than at the beginning?  I can see some
> > value to that, but having more ctid HOT headers that can't be removed
> > except by VACUUM seems bad, plus the need for index rechecks as you
> > cross to the next HOT chain seems bad.
> >
> > The value of WARM is to avoid index bloat --- right now we traverse the
> > HOT chain on a single page just fine with no complaints about speed so I
> > don't see a value to optimizing that traversal, and I think the key
> > rechecks and ctid bloat will make it all much slower.
> >
> > It also seems much more complicated.
>
> The point is avoiding duplicate rows in the output of index scans.
>
> I don't think you can avoid it simply by applying index condition
> rechecks as the original proposal implies, in this case:
>
> CREATE TABLE t (id integer not null primary key, someid integer, dat
> integer);
> CREATE INDEX i1 ON t (someid);
>
> INSERT INTO t (id, someid, dat) VALUES (1, 2, 100);
> UPDATE t SET dat = dat + 1 where id = 1;
> UPDATE t SET dat = dat + 1, id = 2 where id = 1;
> UPDATE t SET dat = dat + 1, id = 3, someid = 3 where id = 2;
> UPDATE t SET dat = dat + 1, id = 1, someid = 2 where id = 3;
> SELECT * FROM t WHERE someid = 2;
>
> That, I believe, will cause the problematic chains where the condition
> recheck passes both times the last visible tuple is visited during the
> scan. It will return more than one tuple when in reality there is only
> one.
>

Hmm. That seems like a real problem. The only way to address that is to
ensure that for a given WARM chain and given index key, there must exists
only a single index entry. There can and will be multiple entries if the
index key changes, but if the index key changes to the original value (or
any other value already in the WARM chain) again, we must not add another
index entry. Now that does not seem like a very common scenario and
skipping a duplicate index entry does have its own benefit, so may be its
not a terrible idea to try that. You're right, it may be expensive to
search for an existing matching index key for the same root line pointer.
But if we could somehow short-circuit the more common case, it might still
be worth trying.

The other idea you mentioned is to index intermediate tuples but somehow
restrict WARM chain following knowing that the subsequent tuples are
reachable via different index tuple. I see two major problems with that
approach:

1. When HOT or WARM chains are pruned and dead tuples removed, we may lose
the knowledge about key changes between intermediate updates. Without that
its seems difficult to know when to stop while following chain starting
from the old index tuple.

2. The existence of index pointers to intermediate tuples will lead to
index bloat. This is the same problem that we found in Simon's original
proposal (discussed internally). None of the intermediate line pointers can
be vacuumed until the entire chain becomes DEAD. Event if the a duplicate
index key is inserted for every index, knowing that and reclaiming to the
index pointers to the original root line pointer, will be difficult.

>
> Not to mention the cost of scanning the chain of N tuples N times,
> making it quadratic - bound by the number of tuples that can fit a
> page, but still high.
>
> Solving that, I believe, will remove all the simplicity of pointing to
> root tuples.
>
>
You're right. But having all indexes point to the root line pointer makes
things simpler to manage and less buggy. So if we can somehow solve the
duplicate tuples problem, even at additional cost at update time, it might
still be worth. I would suggest that we should think more and we can find
some solution to the problem.

>
> I don't really see how you'd do that on yours. You seem to assume
> finding a particular key-item pointer pair in an index would be
> efficient, but IMHO that's not the case.
>

That's true. But we could somehow short-circuit the more common pattern,
that might still be worth. For corner cases, we can fall back to non-HOT
update and keep things simple. It will still be a huge improvement over
what we have currently.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] multivariate statistics (v19)

2016-08-04 Thread Michael Paquier

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
 wrote:
> Attached is v19 of the "multivariate stats" patch series - essentially v18
> rebased on top of current master. Aside from a few bug fixes, the main
> improvement is addition of SGML docs demonstrating the statistics in a way
> similar to the current "Row Estimation Examples" (and the docs are actually
> in the same section). I've tried to keep the right amount of technical
> detail (and pointing to the right README for additional details), but this
> may need improvements. I have not written docs explaining how statistics may
> be combined yet (more about this later).

What we have here is quite something:
$ git diff master --stat | tail -n1
 77 files changed, 12809 insertions(+), 65 deletions(-)
I will try to get familiar on the topic and added myself as a reviewer
of this patch. Hopefully I'll get feedback soon.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] pg_ctl promote wait

2016-08-04 Thread Michael Paquier

On Wed, Aug 3, 2016 at 9:35 PM, Peter Eisentraut
 wrote:
> Updated patch set for pg_ctl promote -w, incorporating AFAICT all of the
> previous reviews, in particular Michael Paquier's changes to the
> StartupXLOG() ordering, and rebasing on top of
> src/common/controldata_utils.c.

Glad to see a new set of patches here:

1) From 0001
-   ok($result, "@$cmd exit code 0");
-   is($stderr, '', "@$cmd no stderr");
+   ok($result, "$test_name: exit code 0");
+   is($stderr, '', "$test_name: no stderr");
Okay with that, there is anyway a log mentioning the command anyway.
Perhaps you'd want to backpatch that?

2) From 0002:
+command_fails_like([ 'pg_ctl', 'promote', '-D', "$tempdir/nonexistent" ],
Any code using src/port/getopt_long.c (Windows!) is going to fail on
that. The argument that is not preceded by an option name needs to be
put at the end of the command. So for example this command needs to be
changed to that:
[ 'pg_ctl', '-D', "$tempdir/nonexistent", 'promote' ]

+$node_standby->poll_query_until('postgres', 'SELECT NOT pg_is_in_recovery()');
+
+is($node_standby->safe_psql('postgres', 'SELECT pg_is_in_recovery()'),
+   'f', 'promoted standby is not in recovery');
We could just check that poll_query_until returns 1 here. This would
save running one query.

3) From 0003
In do_stop, this patches makes the wait happen for a maximum of
wait_seconds * 2, once when getting the control file information, and
once when waiting for the server to shut down. This is not a good
idea, and the idea of putting a wait argument in get_controlfile does
not seem a good interface to me. I'd rather see get_controlfile be
extended with a flag saying no_error_on_failure and keep the wait
logic within pg_ctl. The flag would do the following when enabled:
- for frontends, do not issue a WARNING message and return NULL instead.
- for backends, do not issue an ERROR and return NULL instead.

4) Patch 0004, no real comments :) I am glad you picked up this idea.

5) Patch 0005:
Looks good to me. I just noticed that similar to 0002 regarding the
ordering of those arguments:
+command_ok([ 'pg_ctl', 'promote', '-w', '-D', $node_standby->data_dir ],
+  'pg_ctl promote -w of standby runs');

Thanks,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Detecting skipped data from logical slots (data silently skipped)

2016-08-04 Thread Robert Haas

On Wed, Aug 3, 2016 at 9:39 AM, Craig Ringer  wrote:
> I think we have a bit of a problem with the behaviour specified for logical
> slots, one that makes it hard to prevent a outdated snapshot or backup of a
> logical-slot-using downstream from knowing it's missing a chunk of data
> that's been consumed from a slot. That's not great since slots are supposed
> to ensure a continuous, gapless data stream.
>
> If the downstream requests that logical decoding restarts at an LSN older
> than the slot's confirmed_flush_lsn, we silently ignore the client's request
> and start replay at the confirmed_flush_lsn. That's by design and fine
> normally, since we know the gap LSNs contained no transactions of interest
> to the downstream.

Wow, that sucks.

> The cause is an optimisation intended to allow the downstream to avoid
> having to do local writes and flushes when the upstream's activity isn't of
> interest to it and doesn't result in replicated rows. When the upstream does
> a bunch of writes to another database or otherwise produces WAL not of
> interest to the downstream we send the downstream keepalive messages that
> include the upstream's current xlog position and the client replies to
> acknowledge it's seen the new LSN. But, so that we can avoid disk flushes on
> the downstream, we permit it to skip advancing its replication origin in
> response to those keepalives. We continue to advance the confirmed_flush_lsn
> and restart_lsn in the replication slot on the upstream so we can free WAL
> that's not needed and move the catalog_xmin up. The replication origin on
> the downstream falls behind the confirmed_flush_lsn on the upstream.

This seems entirely too clever.  The upstream could safely remember
that if the downstream asks for WAL position X it's safe to begin
streaming from WAL position Y because nothing in the middle is
interesting, but it can hardly decide to unilaterally ignore the
request position.

> The simplest fix would be to require downstreams to flush their replication
> origin when they get a hot standby feedback message, before they send a
> reply with confirmation. That could be somewhat painful for performance, but
> can be alleviated somewhat by waiting for the downstream postgres to get
> around to doing a flush anyway and only forcing it if we're getting close to
> the walsender timeout. That's pretty much what BDR and pglogical do when
> applying transactions to avoid having to do a disk flush for each and every
> applied xact. Then we change START_REPLICATION ... LOGICAL so it ERRORs if
> you ask for a too-old LSN rather than silently ignoring it.

That's basically just proposing to revert this broken optimization,
IIUC, and instead just try not to flush too often on the standby.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire

On Fri, Aug 5, 2016 at 12:44 AM, Bruce Momjian  wrote:
>> UPDATE t SET dat = dat + 1, id = 3, someid = 3 where id = 2;
>
> This is ends the WARM chain, and creates new index entries because all
> indexes are changed.
>
>> UPDATE t SET dat = dat + 1, id = 1, someid = 2 where id = 3;
>
> This does the same thing.
>
>> SELECT * FROM t WHERE someid = 2;
>
> This uses the 'someid' index.  The index contains three entries:
>
> 1. {someid=2} pointing to first WARM chain
> 2. {someid=3} pointing to single tuple (no HOT chain)
> 3. {someid=2} pointing to single tuple (no HOT chain)
>
> The scan of #1 returns no visible rows.  #2 doesn't match the value in
> the WHERE clause, so we don't even check the heap.  The scan of #3
> returns one row.
>
> Remember, we don't scan past the end of the HOT chain, which is what we
> do now.

Ok, I botched the example.

I wanted the other updates to all be WARM updates, so imagine there's
another index that is unchanged (say, a someid2 not updated ever).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [COMMITTERS] pgsql: Prevent "snapshot too old" from trying to return pruned TOAST tu

2016-08-04 Thread Robert Haas

On Thu, Aug 4, 2016 at 10:58 PM, Robert Haas  wrote:
> On Thu, Aug 4, 2016 at 7:23 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> Prevent "snapshot too old" from trying to return pruned TOAST tuples.
>>
>> Looks like this patch broke the build on castoroides.  Recommend
>> changing InitToastSnapshot into a macro.  (There's a reason why
>> InitDirtySnapshot is a macro.)
>
> What is the reason?  We refuse to separate frontend and backend
> headers in any sort of principled way?

That was poorly phrased.  I'll try again: I can't see any reason for
using a macro here except that it allows frontend code to compile this
without breaking.  But that doesn't seem like a very good way of
solving that problem.  There's surely no way for a casual reader of
the code to realize that macros can be used here and inline functions
cannot, especially because this works apparently works fine on most
machines, including mine.

Here's a patch.  Is this what you want?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index 55b6b41..31b0132 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -2316,5 +2316,5 @@ init_toast_snapshot(Snapshot toast_snapshot)
 	if (snapshot == NULL)
 		elog(ERROR, "no known snapshots");
 
-	InitToastSnapshot(toast_snapshot, snapshot->lsn, snapshot->whenTaken);
+	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
 }
diff --git a/src/include/utils/tqual.h b/src/include/utils/tqual.h
index 8041e7b..fc7328c 100644
--- a/src/include/utils/tqual.h
+++ b/src/include/utils/tqual.h
@@ -104,12 +104,9 @@ extern bool ResolveCminCmaxDuringDecoding(struct HTAB *tuplecid_data,
  * Similarly, some initialization is required for SnapshotToast.  We need
  * to set lsn and whenTaken correctly to support snapshot_too_old.
  */
-static inline void
-InitToastSnapshot(Snapshot snapshot, XLogRecPtr lsn, int64 whenTaken)
-{
-	snapshot->satisfies = HeapTupleSatisfiesToast;
-	snapshot->lsn = lsn;
-	snapshot->whenTaken = whenTaken;
-}
+#define InitToastSnapshot(snapshotdata, l, w)  \
+	((snapshotdata).satisfies = HeapTupleSatisfiesDirty, \
+	 (snapshotdata).lsn = (l),	\
+	 (snapshotdata).whenTaken = (w))
 
 #endif   /* TQUAL_H */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] PostgreSQL 10 Roadmaps

2016-08-04 Thread Etsuro Fujita


On 2016/08/02 23:54, Simon Riggs wrote:

https://wiki.postgresql.org/wiki/PostgreSQL10_Roadmap


Thank you for creating the wiki page!

I added a link to the NTT roadmap page on the links page.  We hope that 
we can collaborate on the projects with people working at other 
companies or with individual contributors in the same development areas.


Best regards,
Etsuro Fujita




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 11:53:05PM -0300, Claudio Freire wrote:
> The point is avoiding duplicate rows in the output of index scans.
> 
> I don't think you can avoid it simply by applying index condition
> rechecks as the original proposal implies, in this case:
> 
> CREATE TABLE t (id integer not null primary key, someid integer, dat integer);
> CREATE INDEX i1 ON t (someid);
> 
> INSERT INTO t (id, someid, dat) VALUES (1, 2, 100);

OK, let me run through this and you can tell me where I am wrong.

At this point there are two indexes, one on 'id' and one on 'someid'.

> UPDATE t SET dat = dat + 1 where id = 1;

This is a HOT update because no indexes were changed.

> UPDATE t SET dat = dat + 1, id = 2 where id = 1;

This changes the HOT chain to a WARM chain because one index is changed.
That means that lookups on both indexes recheck the single visible
tuple, if there is one.

> UPDATE t SET dat = dat + 1, id = 3, someid = 3 where id = 2;

This is ends the WARM chain, and creates new index entries because all
indexes are changed.

> UPDATE t SET dat = dat + 1, id = 1, someid = 2 where id = 3;

This does the same thing.

> SELECT * FROM t WHERE someid = 2;

This uses the 'someid' index.  The index contains three entries:

1. {someid=2} pointing to first WARM chain
2. {someid=3} pointing to single tuple (no HOT chain)
3. {someid=2} pointing to single tuple (no HOT chain)

The scan of #1 returns no visible rows.  #2 doesn't match the value in
the WHERE clause, so we don't even check the heap.  The scan of #3
returns one row.

Remember, we don't scan past the end of the HOT chain, which is what we
do now.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

2016-08-04 Thread Michael Paquier

On Thu, Aug 4, 2016 at 4:38 PM, Michael Paquier
 wrote:
> Andreas, with the patch attached is the assertion still triggered?
>
> Thoughts from others are welcome as well.

Self review:
  * of streaming base backups currently in progress. forcePageWrites is set
  * to true when either of these is non-zero. lastBackupStart is the latest
  * checkpoint redo location used as a starting point for an online backup.
+ * exclusiveBackupStarted is true while pg_start_backup() is being called
+ * during an exclusive backup.
  */
 boolexclusiveBackup;
 intnonExclusiveBackups;
 XLogRecPtrlastBackupStart;
+boolexclusiveBackupStarted;
It would be better to just use one variable that uses an enum to track
the following states of an exclusive backup: none, starting and
in-progress.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire

On Thu, Aug 4, 2016 at 11:15 PM, Bruce Momjian  wrote:
> On Thu, Aug  4, 2016 at 09:53:31PM -0300, Claudio Freire wrote:
>> It's the other way around. A WARM chain has to span whole HOT chains.
>> The WARM chain always begins in the head of a HOT chain (or non-HOT
>> tuple of course), and cannot end before the HOT chain ends.
>>
>> That is, all within a single page:
>>
>>   [  v1 .. v2 .. v3  .. v4 .. v5 .. v6  v7 ]
>>   [ hot chain 1   ][ hot chain 2 ] [NH]
>>   [ WARM CHAIN  ] [NW]
>>
>> (NH = non-HOT)
>> (NW = non-WARM)
>>
>> The flag could be useful for the upgrade path, but if you ignore
>> having to live with old-format indexes, it's not necessary.
>>
>> A WARM chain starts at any index item pointer. It ends when there's a
>> key change or a page change (ie: the WARM chain cannot span multiple
>> pages). That cannot happen within a HOT chain, so that point will also
>> be the end of a HOT chain.
>>
>> You can think of a WARM chain as a chain of HOT chains.
>
> OK, that's a lot of text, and I am confused.  Please tell me the
> advantage of having an index point to a string of HOT chains, rather
> than a single one?  Is the advantage that the index points into the
> middle of the HOT chain rather than at the beginning?  I can see some
> value to that, but having more ctid HOT headers that can't be removed
> except by VACUUM seems bad, plus the need for index rechecks as you
> cross to the next HOT chain seems bad.
>
> The value of WARM is to avoid index bloat --- right now we traverse the
> HOT chain on a single page just fine with no complaints about speed so I
> don't see a value to optimizing that traversal, and I think the key
> rechecks and ctid bloat will make it all much slower.
>
> It also seems much more complicated.

The point is avoiding duplicate rows in the output of index scans.

I don't think you can avoid it simply by applying index condition
rechecks as the original proposal implies, in this case:

CREATE TABLE t (id integer not null primary key, someid integer, dat integer);
CREATE INDEX i1 ON t (someid);

INSERT INTO t (id, someid, dat) VALUES (1, 2, 100);
UPDATE t SET dat = dat + 1 where id = 1;
UPDATE t SET dat = dat + 1, id = 2 where id = 1;
UPDATE t SET dat = dat + 1, id = 3, someid = 3 where id = 2;
UPDATE t SET dat = dat + 1, id = 1, someid = 2 where id = 3;
SELECT * FROM t WHERE someid = 2;

That, I believe, will cause the problematic chains where the condition
recheck passes both times the last visible tuple is visited during the
scan. It will return more than one tuple when in reality there is only
one.

Not to mention the cost of scanning the chain of N tuples N times,
making it quadratic - bound by the number of tuples that can fit a
page, but still high.

Solving that, I believe, will remove all the simplicity of pointing to
root tuples.

This is done in my proposals by adding boundaries to WARM chains such
that potential matches are visited only once (first variant) or
guaranteeing that visible tuples with a particular key will be
reachable from only one index entry (second variant). Pointing to the
middle of the chain also allows skipping the recheck for the first HOT
chain, which could improve scan performance a lot, especially in the
second variant that has no condition rechecks on invisible tuples.

I don't really see how you'd do that on yours. You seem to assume
finding a particular key-item pointer pair in an index would be
efficient, but IMHO that's not the case.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 09:53:31PM -0300, Claudio Freire wrote:
> It's the other way around. A WARM chain has to span whole HOT chains.
> The WARM chain always begins in the head of a HOT chain (or non-HOT
> tuple of course), and cannot end before the HOT chain ends.
> 
> That is, all within a single page:
> 
>   [  v1 .. v2 .. v3  .. v4 .. v5 .. v6  v7 ]
>   [ hot chain 1   ][ hot chain 2 ] [NH]
>   [ WARM CHAIN  ] [NW]
> 
> (NH = non-HOT)
> (NW = non-WARM)
> 
> The flag could be useful for the upgrade path, but if you ignore
> having to live with old-format indexes, it's not necessary.
> 
> A WARM chain starts at any index item pointer. It ends when there's a
> key change or a page change (ie: the WARM chain cannot span multiple
> pages). That cannot happen within a HOT chain, so that point will also
> be the end of a HOT chain.
> 
> You can think of a WARM chain as a chain of HOT chains.

OK, that's a lot of text, and I am confused.  Please tell me the
advantage of having an index point to a string of HOT chains, rather
than a single one?  Is the advantage that the index points into the
middle of the HOT chain rather than at the beginning?  I can see some
value to that, but having more ctid HOT headers that can't be removed
except by VACUUM seems bad, plus the need for index rechecks as you
cross to the next HOT chain seems bad.

The value of WARM is to avoid index bloat --- right now we traverse the
HOT chain on a single page just fine with no complaints about speed so I
don't see a value to optimizing that traversal, and I think the key
rechecks and ctid bloat will make it all much slower.

It also seems much more complicated.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] max_parallel_degree > 0 for 9.6 beta

2016-08-04 Thread Michael Paquier

On Thu, Aug 4, 2016 at 10:52 PM, David G. Johnston
 wrote:
> My initial reaction was +1 but now I'm leaning toward enabled by default.
>
> Those who would upgrade to 9.6 within a year of its release are most likely,
> process and personality wise, to be those for whom the risks and rewards of
> new features is part of their everyday routine.
>
> If indeed we release 10.0 with it enabled by default we should be confident
> in its stability at that point in the 9.6.x branch.

Being cautious pays more in the long term, so seeing the number of
bugs that showed up I'd rather vote for having it disabled by default
in 9.6 stable, and enabled on master to aim at enabling it in 10.0.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire

On Thu, Aug 4, 2016 at 7:59 AM, Pavan Deolasee  wrote:
> So marking the index would require us to mark both old and new index tuples
> as requiring re-check. That requires an additional index scan to locate the
> old row and then an additional write to force it to re-check, which is
> algorithmically O(NlogN) on table size.


So, basically, I'm saying this isn't really O(NlogN), it's O(N),
manifested in low-cardinality indexes.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

>On Thu, Aug  4, 2016 at 06:21:48PM -0400, Bruce Momjian wrote:
> Another approach would be to keep updates on the same page in a single
> WARM chain, even if all indexes have changed columns.  We will already
> have code to walk the HOT chain looking for matches, and at most one
> tuple in the update chain is visible.  The downside of this is
> unnecessary checking if the tuple matches the index.

Forget this idea I had.  If all the indexed keys change, we don't save
anything in fewer index tuples, and we extend the length of the HOT
chain because we always have to point to the root of the HOT chain in
the index.

The only advantage would be allowing reuse of the old index tuple if the
indexed value changes back to and old indexed value in the same chain,
which seems rare.

Let me say I am excited at the progress that has been made on this item
in the past 36 hours and I think we are triangulating on a solution.

I know we are restricted by the existing page format and pg_upgrade in
adding this features, but frankly that limitation seems minor --- even
if designing a new storage system, figuring out how to do WARM chains
efficiently would be hard.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire

On Thu, Aug 4, 2016 at 7:21 PM, Bruce Momjian  wrote:
> On Thu, Aug  4, 2016 at 03:23:10PM -0300, Claudio Freire wrote:
>> INSERT INTO TABLE t (id,val,dat) VALUES (32,'a','blabla');
>> UPDATE t SET dat='blabla2' where id = 32;
>> UPDATE t SET dat='blabla3', id=33 where id = 32;
>> UPDATE t SET val='b' where id = 33;
>> UPDATE t SET dat='blabla4' where id = 33;
>> UPDATE t SET val='a' where id = 33;
>>
>> This produces a case similar to the one I described. Assuming all
>> updates happen on the same page, the first will be HOT, so no key
>> check is needed. The second cannot be HOT, because it updates the id.
>> But it can be WARM, if it's done on the same page, so it can avoid
>> updating the index on val. The third will need a new index item
>> pointer, because it has a new key. The t_ctid chain will walk all the
>> versions of the row, but the one with 'blabla2' will not have the
>> HOT-updated bit set. So regular HOT chain walking will stop there, I
>> propose to not stop there, so the index on val can follow the chain
>> into blabla3 as if it were a HOT update. Correctness is achieved only
>> if there is no index item pointer on that index pointing to that index
>> version, and since there is no efficient way of checking, the
>> invariants I propose aim to guarantee it so it can be assumed. Since
>> WARM updates don't create new index item pointers, what needs to be
>> checked is whether updating from version 2 to version 3 would be a
>> WARM update on this index or not, and that equals to checking the
>> index keys for equality.
>
> OK, it took me a while to understand this, so let me see if I can
> restate it.  You have an update chain in a single page which is made up
> of one ore more HOT chains, some of which might be WARM.  A flag
> indicates a WARM chain (see below).  A WARM chain cannot go outside of a
> HOT chain.

It's the other way around. A WARM chain has to span whole HOT chains.
The WARM chain always begins in the head of a HOT chain (or non-HOT
tuple of course), and cannot end before the HOT chain ends.

That is, all within a single page:

  [  v1 .. v2 .. v3  .. v4 .. v5 .. v6  v7 ]
  [ hot chain 1   ][ hot chain 2 ] [NH]
  [ WARM CHAIN  ] [NW]

(NH = non-HOT)
(NW = non-WARM)

The flag could be useful for the upgrade path, but if you ignore
having to live with old-format indexes, it's not necessary.

A WARM chain starts at any index item pointer. It ends when there's a
key change or a page change (ie: the WARM chain cannot span multiple
pages). That cannot happen within a HOT chain, so that point will also
be the end of a HOT chain.

You can think of a WARM chain as a chain of HOT chains.

> The HOT chains have identical keys for all indexes, and the WARM chains
> have identical keys for some indexes.

Lets highlight that a WARM chain concept only exists in the context of
a specific index. WARM chains (the way I describe them) apply to only
one index at a time.

So, different indexes will have different chaining of HOT chains into
WARM chains. While index A may chain HOT chain 1 and HOT chain 2
together, index B may not. In fact, the way WARM works garantees that
at least one index won't chain them.

So WARM chains have identical keys for a specific index. Different
indexes have different WARM chains.

>  If an update chain is all on the
> same page, it can have multiple HOT chains.  I think this is possible if
> an update changes _all_ indexed columns in the middle of an update
> chain, turning off HOT and WARM.

Yes

> I think we agreed that WARM index tuples will point to the head of the
> HOT update chain on the same page.

The head of the WARM chain, which may not be the HOT update chain.

In the above example, the WARM index tuple would point to the head of
the hot chain 1, not hot chain 2.

> We will store the
> HEAP_RECHECK_REQUIRED flag on the update chain head tuple, or on the
> LP_REDIRECT ctid if we removed the tuple during HOT cleanup.

I wouldn't put it on the update chain head, I'd put it on the
just-inserted heap tuple, and I'd call it HEAP_WARM_TUPLE. I wouldn't
use a corresponding HEAP_WARM_UPDATED, there's no need to update the
old tuple's flags.

The purpose of the flag is to be able to discern new-style index
pointers from old-style ones, since they require different ways of
following the update chain.

In the example above, old indexes will have item pointers to the head
of hot chain 1 and 2. A scan that finds the item pointer for hot chain
1 has to scan the HOT chain only, and not step into hot chain 2,
because there's an item pointer for hot chain 2 already. New indexes
need to follow through the end of HOT chain 1 into HOT chain 2,
because there won't be an item pointer pointing to the head of HOT
chain 2. The flag being set in the tuple would signal that this is a
new-style update, so it needs to follow the WARM chain, whereas if the
flag is 0, it needs not. It allows for a painless upgrade, but that's
it.

The flag alone doesn't remove the need

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Craig Ringer

On 5 August 2016 at 04:31, Tom Lane  wrote:

>  In short: if we'd done it that way all along, it would've been nice.

But encouraging people to change horses now isn't really going to
> make things better.  What is far more likely to happen, if we were
> to do that, is that people will make clients that gratuitously fail
> on old server versions because they didn't bother to support or test
> for the case that the server doesn't send server_version_num.
>

Yup, if someone's writing a completely new client or bypassing their
database driver's version detection.

But the same argument would've applied when the server_version_num GUC was
added in the first place. The world didn't end then, and it won't now.

Yes, it'll add a small cost to the startup packet, but that's it. It
should've been done when server_version_num was added in the first place
and it should be done now. Small variation in one-off packet size is pretty
unimportant anyway TBH, since it's latency and roundtrip costs and that
hurt us.

You don't really want to encourage multiple ways of identifying which
> software version you're dealing with.  That way madness lies.
>
>
Yet we have a server_version_num GUC. Because it's useful and people
benefited from it.

It's a pain in the butt having 9.4beta, etc and having to deal with those.

I like to think we think in the long term here. The same arguments you make
against adding server_version_num as GUC_REPORT apply to pretty much every
improvement that exposes any kind of UI, including new SQL. Users won't be
able to use it for ages, it'll break apps that connect to older servers if
they use the new feature, etc etc.

Lets just add it. It should've been there in the first place, it was an
oversight not to add it when initially adding server_version_num, and it
should be fixed.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Fwd: [BUGS] BUG #14247: COMMENT is restored on wrong database

2016-08-04 Thread Craig Ringer

On 5 August 2016 at 05:03, David G. Johnston 
wrote:

> 
> I'm all for an elegant solution here though at some point having a working
> solution now beats waiting for someone to willingly dive more deeply into
> pg_dump.  I too seem to recall previous proposals for COMMON ON CURRENT
> DATABASE yet here we are...
>
>
SECURITY LABEL ... ON CURRENT DATABASEis needed too, and has caused
real-world pain.

I tend to agree that adding and using ... ON CURRENT DATABASE is worthwhile
now. It's guaranteed to be at least as correct as what pg_dump emits now.

We do have a horrible mess with how pg_dump handles database level
properties, but I'd rather not try to deal with that at the same time, get
into a long discussion and land up fixing nothing.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 10:58:33PM +0530, Pavan Deolasee wrote:
> 2. I also think that most often the tuple that will be updated will have its
> t_ctid pointing to itself. This sounds more invasive, but can we somehow

Agreed!  I think it has to.  Great idea.

> utilise the t_ctid field to store the OffsetNumber of the root line pointer?
> The root line pointer only exists when the tuple under consideration has
> HEAP_ONLY_TUPLE flag set and we know for such tuples, BlockNumber in t_ctid
> must be same as current block (unless HEAP_UPDATED is also set and the 
> updating
> transaction eventually aborted).

Oh, aborts, so the tuple we are looking at is the head of the chain and
points to a tuple that was aborted.  So we would still need a page-scan
to handle cases where the t_ctid was over-written by an aborted tuple,
but as you said it would be rare.

So, if the tuple has HEAP_ONLY_TUPLE set, we know the block number
ip_blkid equals our block number, so any ip_blkid value that doesn't
equal our block number is a special flag that indicates that ip_posid
points to the head of our HOT chain.  I guess we could just set it to
(our block number + 1), with rollover possible and desired.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Possible duplicate release of buffer lock.

2016-08-04 Thread Peter Geoghegan

On Wed, Aug 3, 2016 at 1:31 AM, Kyotaro HORIGUCHI
 wrote:
> The first line is emitted for simultaneous deletion of a index
> page, which is impossible by design in a consistent state so the
> complained situation should be the result of an index corruption
> before upgading, specifically, inconsistent sibling pointers
> around a deleted page.

> My point here is that if concurrent deletion can't be perfomed by
> the current implement, this while loop could be removed and
> immediately error out or log a message,

Perhaps I have misunderstood, but I must ask: Are you aware that 9.4
has commit efada2b8, which fixes a bug with page deletion, that never
made it into earlier branches?

It is worth noting that anything that can make VACUUM in particular
raise an error is considered problematic. For example, we do not allow
VACUUM to finish incomplete B-Tree page splits, even though we could,
because there might be no disk space for even one extra page at the
worst possible time (we VACUUM in part to free disk space, of course).
We really want to let VACUUM finish, if at all possible, even if it
sees something that indicates data corruption.

I remember that during the review of the similar B-Tree page split
patch (not efada2b8, but rather commit 40dae7ec, which was also in
9.4), I found bugs due to functions not being very clear about whether
a buffer lock and/or pin were held upon returning from a function in
all cases, including very rare cases or "can't happen" cases. So, this
seems kind of familiar.

Code comments in the relevant function say this:

 * Returns 'false' if the page could not be unlinked (shouldn't happen).
 * If the (new) right sibling of the page is empty, *rightsib_empty is set
 * to true.
 */
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
{

What we see here is that this "shouldn't happen" condition does, in
fact, happen very rarely, including in cases where the target is not a
leaf page.

I think there might well be a real bug here, because we are not
careful enough in ensuring that the _bt_unlink_halfdead_page()
"contract" holds when something that "can't happen" does, in fact,
happen:

/*
 * Release the target, if it was not the leaf block.  The leaf is always
 * kept locked.
 */
if (target != leafblkno)
_bt_relbuf(rel, buf);

return true;
}

(Assuming you can call this code at the end of
_bt_unlink_halfdead_page() its "contract".)

It's not good at all if the code is stuck in a way that prevents
VACUUM from finishing, as I said. That seems to be the case here,
judging from Horiguchi-san's log output, which shows an ERROR from
"lock main 13879 is not held". The problem is not so much that a
"can't happen" thing happens (data corruption will sometimes make "the
impossible" possible, after all -- we know this, and write code
defensively because of this). The bug is that there is any ERROR for
VACUUM that isn't absolutely unavoidable (the damage has been done
anyway -- the index is already corrupt).

I'll need to think about this some more, when I have more time.
Perhaps tomorrow.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-04 Thread Tom Lane

I wrote:
> Looking around, there are a couple of places outside commands/analyze.c
> that are making the same mistake, so this patch isn't complete, but it
> illustrates what needs to be done.

Here's a more complete patch.

regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
   nmultiple == track_cnt)
--- 2049,2059 
  
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
   nmultiple == track_cnt)
*** compute_scalar_stats(VacAttrStatsP stats
*** 2426,2433 
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
--- 2429,2439 
  
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
*** compute_scalar_stats(VacAttrStatsP stats
*** 2753,2759 
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0;
  	}
  	else if (null_cnt > 0)
  	{
--- 2759,2765 
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  	}
  	else if (null_cnt > 0)
  	{
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 0f851ea..817453c 100644
*** a/src/backend/tsearch/ts_typanalyze.c
--- b/src/backend/tsearch/ts_typanalyze.c
*** compute_tsvector_stats(VacAttrStats *sta
*** 295,301 
  		stats->stawidth = total_width / (double) nonnull_cnt;
  
  		/* Assume it's a unique column (see notes above) */
! 		stats->stadistinct = -1.0;
  
  		/*
  		 * Construct an array of the interesting hashtable items, that is,
--- 295,301 
  		stats->stawidth = total_width / (double) nonnull_cnt;
  
  		/* Assume it's a unique column (see notes above) */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  
  		/*
  		 * Construct an array of the interesting hashtable items, that is,
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index fcb71d3..56504fc 100644
*** a/src/backend/utils/adt/rangetypes_typanalyze.c
--- b/src/backend/utils/adt/rangetypes_typanalyze.c
*** compute_range_stats(VacAttrStats *stats,
*** 203,209 
  		/* Do the simple null-frac and width stats */
  		stats->stanullfrac = (double) null_cnt / (double) samplerows;
  		stats->stawidth = total_width / (double) non_null_cnt;
! 		stats->stadistinct = -1.0;
  
  		/* Must copy the target values into anl_context */
  		old_cxt = MemoryContextSwitchTo(stats->anl_context);
--- 203,211 
  		/* Do the simple null-frac and width stats */
  		stats->stanullfrac = (double) null_cnt / (double) samplerows;
  		stats->stawidth = total_width / (double) non_null_cnt;
! 
! 		/* Estimate that non-null values are unique */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  
  		/* Must copy the target values into anl_context */
  		old_cxt = MemoryContextSwitchTo(stats->anl_context);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index cc2a9a1..56943f2 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*** double
*** 4738,4743 
--- 4738,4744 
  get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
  {
  	double		stadistinct;
+ 	double		stanullfrac = 0.0;
  	double		ntuples;
  
  	*isdefault = false;
*** get_variable_numdistinct(VariableStatDat
*** 4745,4751 
  	/*
  	 * Determine the stadistinct value to use.  There are cases where we can
  	 * get an estimate even without a pg_statistic entry, or can get a better
! 	 * value than is in pg_statistic.
  	 */
  	if (HeapTupleIsValid(vardata->statsTuple))
  	{
--- 4746,4753 
  	/*
  	 * Determine the stadistinct value to use.  There are cases where we can
  	 * get an estimate even without a pg_statistic entry,

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 03:33:16PM -0300, Claudio Freire wrote:
> On Thu, Aug 4, 2016 at 3:23 PM, Claudio Freire  wrote:
> >> That is a new HOT chain given our current code, but
> >> would the new tuple addition realize it needs to create a new index
> >> tuple?  The new tuple code needs to check the index's root HOT tid for a
> >> non-match.
> >
> > I'm not proposing to change how HOT works, but adding another very
> > similar mechanism on top of it called WARM.
> >
> > So HOT will keep working like that, HOT pruning will happen as usual,
> > but we'll have the concept (immaterialized in the physical
> > representation of the data, just implicit) of WARM chains. WARM chains
> > would span one or several HOT chains, so they're "bigger".
> 
> 
> Answering your question, there's no need to find the root page on index 
> updates.
> 
> When creating the new tuple in nodeModifyTable.c, the code currently
> skips updating indexes if it's using HOT.
> 
> We would add a check for WARM too. It will use WARM for index X iff both:
> 
>  * ItemPointerGetBlockNumber(oldtuplectid) ==
> ItemPointerGetBlockNumber(newtuplectid)
>  * satisfies HOT for only this index X

OK, I see why you like walking the entire update chain instead of just
walking the HOT chain --- you think it avoids us having to find the HOT
chain root.  However, how do you check the index to find out of the
update chain root is already in the index?  I don't think you can just
look at the current tuple to know that since a previous tuple in the
update chain might have put it there even if the current old tuple
wouldn't have.

My point is there can be multiple update chains on the same page for
different rows with identical indexed values, so I don't see how
checking just for the page number helps here.  Are we going to check the
entire page?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-04 Thread Tom Lane

I looked into the problem described at
https://www.postgresql.org/message-id/flat/VisenaEmail.26.df42f82acae38a58.156463942b8%40tc7-visena
and I believe I've reproduced it: the requirement is that the inner join
column for the antijoin must contain a lot of NULL values, and what isn't
NULL must be unique or nearly so.  If ANALYZE doesn't come across any
duplicated values, it will set the column's stadistinct value to "-1",
which causes the planner to believe that each row of the inner table
produces a unique value, resulting in a bogus answer from
get_variable_numdistinct() and thence a bogus join size estimate.

Here's an example in the regression database, making use of the existing
unique column tenk1.unique1:

regression=# create table manynulls as select case when random() < 0.1 then 
unique1 else null end as unique1 from tenk1;
SELECT 1
regression=# analyze manynulls; 
ANALYZE
regression=# explain analyze select * from tenk1 t where not exists(select 1 
from manynulls m where m.unique1 = t.unique1);
QUERY PLAN  
   
---
 Hash Anti Join  (cost=261.00..756.50 rows=1 width=244) (actual 
time=4.632..14.729 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=1 width=244) (actual 
time=0.015..2.683 rows=1 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=1 width=4) (actual time=4.553..4.553 
rows=1027 loops=1)
 Buckets: 16384  Batches: 1  Memory Usage: 165kB
 ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=1 width=4) 
(actual time=0.019..2.668 rows=1 loops=1)
 Planning time: 0.808 ms
 Execution time: 15.670 ms
(8 rows)

So the antijoin size estimate is way off, but it's hardly the planner's
fault because the stats are insane:

regression=# select attname,null_frac,n_distinct from pg_stats where tablename 
= 'manynulls';
 attname | null_frac | n_distinct 
-+---+
 unique1 |0.8973 | -1
(1 row)

With the patch attached below, ANALYZE produces

regression=# analyze manynulls;
ANALYZE
regression=# select attname,null_frac,n_distinct from pg_stats where tablename 
= 'manynulls';
 attname | null_frac | n_distinct 
-+---+
 unique1 |0.8973 |-0.1027
(1 row)

and now the join size estimate is dead on:

regression=# explain analyze select * from tenk1 t where not exists(select 1 
from manynulls m where m.unique1 = t.unique1);
QUERY PLAN  
   
---
 Hash Anti Join  (cost=261.00..847.69 rows=8973 width=244) (actual 
time=4.501..13.888 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=1 width=244) (actual 
time=0.031..4.959 rows=1 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=1 width=4) (actual time=4.404..4.404 
rows=1027 loops=1)
 Buckets: 16384  Batches: 1  Memory Usage: 165kB
 ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=1 width=4) 
(actual time=0.034..2.576 rows=1 loops=1)
 Planning time: 1.388 ms
 Execution time: 14.542 ms
(8 rows)

What I did in the patch is to scale the formerly fixed "-1.0" stadistinct
estimate to discount the fraction of nulls we found.  An alternative
possibility might be to decree that a fractional stadistinct considers
only non-nulls, but then all the other paths through ANALYZE would be
wrong.  The spec for it in pg_statistic.h doesn't suggest any such
interpretation, either.

Looking around, there are a couple of places outside commands/analyze.c
that are making the same mistake, so this patch isn't complete, but it
illustrates what needs to be done.

This is a bit reminiscent of the nulls-accounting problem we fixed in
commit be4b4dc75, though that tended to result in underestimates not
overestimates of the number of distinct values.  We didn't back-patch
that fix, so probably we shouldn't back-patch this either.  On the other
hand, it is far more open-and-shut that this is wrong.  Thoughts?

regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
   nmultiple == track_cnt)
--- 2049,2059 
  
  		if (nmultiple == 0)
  		{
!

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Andrew Dunstan




On 08/01/2016 04:25 PM, Tom Lane wrote:

Andrew Dunstan  writes:

Somewhat related is how we name the git branches. It would help me from
a buildfarm POV if we kept lexically them sortable, which could be done
at least for the next 90 major releases :-) by adding an underscore
after the REL piece, thus: REL_10_STABLE. I realise that's a way off,
but it's worth bringing up while we're discussing the topic.

Hmm, sounds a bit C-locale-centric, but I have no objection to inserting
an underscore there if it seems helpful.



Good. FTR I tested this with every locale on my system and the results 
were identical .





What I thought would be worth discussing is whether to continue using the
"_STABLE" suffix.  It seems rather like a noise word for our purposes.
OTOH, dropping it might be a headache for scripts that deal with branch
names --- any thoughts?





Just from the buildfarm POV it poses no difficulties.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 03:23:10PM -0300, Claudio Freire wrote:
> INSERT INTO TABLE t (id,val,dat) VALUES (32,'a','blabla');
> UPDATE t SET dat='blabla2' where id = 32;
> UPDATE t SET dat='blabla3', id=33 where id = 32;
> UPDATE t SET val='b' where id = 33;
> UPDATE t SET dat='blabla4' where id = 33;
> UPDATE t SET val='a' where id = 33;
> 
> This produces a case similar to the one I described. Assuming all
> updates happen on the same page, the first will be HOT, so no key
> check is needed. The second cannot be HOT, because it updates the id.
> But it can be WARM, if it's done on the same page, so it can avoid
> updating the index on val. The third will need a new index item
> pointer, because it has a new key. The t_ctid chain will walk all the
> versions of the row, but the one with 'blabla2' will not have the
> HOT-updated bit set. So regular HOT chain walking will stop there, I
> propose to not stop there, so the index on val can follow the chain
> into blabla3 as if it were a HOT update. Correctness is achieved only
> if there is no index item pointer on that index pointing to that index
> version, and since there is no efficient way of checking, the
> invariants I propose aim to guarantee it so it can be assumed. Since
> WARM updates don't create new index item pointers, what needs to be
> checked is whether updating from version 2 to version 3 would be a
> WARM update on this index or not, and that equals to checking the
> index keys for equality.

OK, it took me a while to understand this, so let me see if I can
restate it.  You have an update chain in a single page which is made up
of one ore more HOT chains, some of which might be WARM.  A flag
indicates a WARM chain (see below).  A WARM chain cannot go outside of a
HOT chain.

The HOT chains have identical keys for all indexes, and the WARM chains
have identical keys for some indexes.  If an update chain is all on the
same page, it can have multiple HOT chains.  I think this is possible if
an update changes _all_ indexed columns in the middle of an update
chain, turning off HOT and WARM.

I think we agreed that WARM index tuples will point to the head of the
HOT update chain on the same page.  We will store the
HEAP_RECHECK_REQUIRED flag on the update chain head tuple, or on the
LP_REDIRECT ctid if we removed the tuple during HOT cleanup.

So, I think the case you are discussing is walking WARM beyond the end
of the HOT chain so you don't have to create a second index entry for
the second HOT chain in the same update chain for matching values.  

For example, you have a HOT chain, and you set the indexed col1=2, then
later, in the same HOT chain, change it to col1=4, then later, in the
same HOT chain, back to col1=2.  I assume the last update would see
there was already an index entry pointing to the head of the same HOT
chain, and would skip adding to the index.

However, you might be saying that the last update of col1=2 makes a
non-HOT entry in the update chain, and you want to find that without
adding an index entry.

That seems overly complex and not worth it.

Another approach would be to keep updates on the same page in a single
WARM chain, even if all indexes have changed columns.  We will already
have code to walk the HOT chain looking for matches, and at most one
tuple in the update chain is visible.  The downside of this is
unnecessary checking if the tuple matches the index.

> >  I don't see a value in (expensive)
> > checking the key of an invisible tuple in hopes of stoping the HOT chain
> > traversal.
> 
> Not HOT chain, the point is to guarantee correctness by properly
> identifying the end of the WARM (not HOT) chain, which is left
> implicit.

Why should WARM chains span beyond HOT chains?

> > Also, what happens when a tuple chain goes off-page, then returns to the
> > original page?
> 
> The WARM chain ends on the page hop, and when it returns it's a new
> WARM chain. So traversal would not follow t_ctid pointers that hop
> pages, which makes it efficient, and also guarantees correctness
> (since if the is a page hop, we know the index will contain an item
> pointer to the version in that other page, so we'll visit it when we
> get there).
> 
> > That is a new HOT chain given our current code, but
> > would the new tuple addition realize it needs to create a new index
> > tuple?  The new tuple code needs to check the index's root HOT tid for a
> > non-match.
> 
> I'm not proposing to change how HOT works, but adding another very
> similar mechanism on top of it called WARM.
> 
> So HOT will keep working like that, HOT pruning will happen as usual,
> but we'll have the concept (immaterialized in the physical
> representation of the data, just implicit) of WARM chains. WARM chains
> would span one or several HOT chains, so they're "bigger".

Yes, I think I got it, but what is the benefit?  Fewer index entries?

As I see it, the only way with WARM you could have an update chain on
the same page that has multiple HOT chains is

Re: [HACKERS] Oddity with NOT IN

2016-08-04 Thread Marko Tiikkaja


On 2016-08-04 11:23 PM, Jim Nasby wrote:

I've got a customer that discovered something odd...

SELECT f1 FROM v1 WHERE f2 not in (SELECT bad FROM v2 WHERE f3 = 1);

does not error, even though bad doesn't exist, but


I'm guessing there's a v1.bad?

This is a common mistake, and also why I recommend always table 
qualifying column references when there's more than one table in scope.



.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Oddity with NOT IN

2016-08-04 Thread Jim Nasby


I've got a customer that discovered something odd...

SELECT f1 FROM v1 WHERE f2 not in (SELECT bad FROM v2 WHERE f3 = 1);

does not error, even though bad doesn't exist, but

SELECT bad FROM v2 WHERE f3 = 1;
gives

ERROR:  column "bad" does not exist

Is that expected?

This is on 9.4.8, and both v1 and v2 are views. The only "odd" thing 
that I see is that v1 is a UNION ALL and v2 is a UNION. I don't think 
there's any tables in common between the two views.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Pgbench performance tuning?

2016-08-04 Thread Jeff Janes

On Thu, Aug 4, 2016 at 10:48 AM, Greg Stark  wrote:
> I'm trying to run pgbench on a moderately beefy machine (4-core 3.4GHz
> with 32G of ram and md mirrored spinning rust drives) at scale 300
> with 32 clients with duration of 15min. I'm getting TPS numbers
> between 60-150 which seems surprisingly low to me and also to several
> people on IRC.

I am assuming "md mirrored" means you have two drives, so there is no
striping.  What is their RPM?

Your IO system is inadequate and I'm only slightly surprised at the low TPS.

> Now pg_test_fsync does seem to indicate that's not an unreasonable
> commit rate if there was very little commit grouping going on:
>
> Compare file sync methods using one 8kB write:
> open_datasync   100.781 ops/sec9922 usecs/op
> fdatasync71.088 ops/sec   14067 usecs/op
>
> Compare file sync methods using two 8kB writes:
> open_datasync50.286 ops/sec   19886 usecs/op
> fdatasync80.349 ops/sec   12446 usecs/op
>
> And iostat does seem to indicate the drives are ~ 80% utilized with
> high write await times So maybe this is just what the system is
> capable of with synchronous_commit?

As an experiment, turn off synchrounous_commit and see what happens.

Mostly likely you are getting adequate commit grouping behavior, but
that doesn't apply to the table data.  Each transaction dirties some
random page in the 3.9GB pgbench_accounts table, and there is no
effective grouping of those writes. That data isn't written
synchronously, but in the steady state it doesn't really matter
because at some point the write rate has to equilibrate with the
dirtying rate.  If you can't make the disks faster then the TPS has to
drop to meet them. The most likely mechanism for this to happen is
that the disks are so harried trying to keep up with the dirty data
eviction, that they can't service the sync calls from the commits in a
timely matter.  But if you took the sync calls out, the bottleneck
would likely just move somewhere else with only modest overall
improvement.

The way to tune it would be to make shared_buffers large enough that
all of pgbench_accounts fits in it, and increase checkpoint_segments
and checkpoint_timeout as much as you can afford, and increase
checkpoint_completion_target.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Fwd: [BUGS] BUG #14247: COMMENT is restored on wrong database

2016-08-04 Thread David G. Johnston

On Thu, Aug 4, 2016 at 4:50 PM, Tom Lane  wrote:

> Robert Haas  writes:
> > On Tue, Aug 2, 2016 at 5:42 PM, David G. Johnston
> >  wrote:
> > The fact that pg_dump is emitting COMMENT ON DATABASE at all is
> > fundamentally wrong given the existing division-of-labor decisions,
> > namely that pg_dump is responsible for objects within a database
> > not for database-level properties.
>
> > I think a while back somebody had the idea of making COMMENT ON
> > CURRENT_DATABASE or COMMENT ON CURRENT DATABASE work, which seems like
> > an elegant solution to me.  Of course, I just work here.
>
> I'm fairly annoyed at David for having selectively quoted from private
> email in a public forum, but that was one of the points I touched on
> in material that he cut.

> The point I tried to make to him is that
> possibly COMMENT ON CURRENT DATABASE is a portion of a holistic solution,
> but it's only a portion.

I should have asked first and I'll take the heat for choosing to re-post
publicly but I kept this aspect of your recommendation precisely because
that was indeed your position.

TL>> It's entirely possible that some feature like COMMENT ON CURRENT
DATABASE
TL>> would be a necessary component of a holistic solution, [ various
other ON CURRENT commands elidded ]

I'm all for an elegant solution here though at some point having a working
solution now beats waiting for someone to willingly dive more deeply into
pg_dump.  I too seem to recall previous proposals for COMMON ON CURRENT
DATABASE yet here we are...

David J.

Re: [HACKERS] Fwd: [BUGS] BUG #14247: COMMENT is restored on wrong database

2016-08-04 Thread Tom Lane

Robert Haas  writes:
> On Tue, Aug 2, 2016 at 5:42 PM, David G. Johnston
>  wrote:
> The fact that pg_dump is emitting COMMENT ON DATABASE at all is
> fundamentally wrong given the existing division-of-labor decisions,
> namely that pg_dump is responsible for objects within a database
> not for database-level properties.

> I think a while back somebody had the idea of making COMMENT ON
> CURRENT_DATABASE or COMMENT ON CURRENT DATABASE work, which seems like
> an elegant solution to me.  Of course, I just work here.

I'm fairly annoyed at David for having selectively quoted from private
email in a public forum, but that was one of the points I touched on
in material that he cut.  The point I tried to make to him is that
possibly COMMENT ON CURRENT DATABASE is a portion of a holistic solution,
but it's only a portion.  We need to rethink exactly what pg_dump is
supposed to do with database-level properties.  And if it does need
COMMENT ON CURRENT DATABASE, it likely also needs GRANT ... ON CURRENT
DATABASE, ALTER CURRENT DATABASE OWNER TO, ALTER CURRENT DATABASE SET,
and maybe some other things.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [Patch] RBTree iteration interface improvement

2016-08-04 Thread Robert Haas

On Thu, Jul 28, 2016 at 1:57 PM, Tom Lane  wrote:
> Aleksander Alekseev  writes:
>>> Can you explain use case where you need it?
>
>> ... Or maybe you have different objects, e.g. IndexScanDesc's, that should
>> iterate over some tree's independently somewhere in indexam.c
>> procedures. Exact order may depend on user's query so you don't even
>> control it.
>
> It seems clear to me that the existing arrangement is hazardous for any
> RBTree that hasn't got exactly one consumer.  I think Aleksander's plan to
> decouple the iteration state is probably a good one (NB: I've not read the
> patch, so this is not an endorsement of details).  I'd go so far as to say
> that we should remove the old API as being dangerous, rather than preserve
> it on backwards-compatibility grounds.  We make bigger changes than that
> in internal APIs all the time.
>
> Having said that, though: if the iteration state is not part of the
> object, it's not very clear whether we can behave sanely if someone
> changes the tree while an iteration is open.  It will need careful
> thought as to what sort of guarantees we can make about that.  If it's
> too weak, then a separated-state version would have enough hazards of
> its own that it's not necessarily any safer.

+1 to all of that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Tom Lane

Peter Eisentraut  writes:
> On 8/3/16 1:20 PM, Tom Lane wrote:
>> I was just wondering whether it would be worth starting to use "git tag -a".

> One should always use annotated tags for public releases.  That way, the
> tag is its own Git object that cannot be later moved around or easily
> faked (modulo GPG signatures -- a whole different discussion).

AFAIK, you can't really move around a plain tag either.  Certainly that's
why we go out of our way not to tag releases until we're certain we don't
need a re-wrap.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Fwd: [BUGS] BUG #14247: COMMENT is restored on wrong database

2016-08-04 Thread Robert Haas

On Tue, Aug 2, 2016 at 5:42 PM, David G. Johnston
 wrote:
> Moving to -hackers since this is getting into details
>
> Bug Report # 14247
>
> On Tue, Aug 2, 2016 at 4:58 PM, Tom Lane  wrote:
>>
>> "David G. Johnston"  writes:
>> > Do you have an opinion on this following?
>>
>> I think the real problem in this area is that the division of labor
>> we have between pg_dump and pg_dumpall isn't very good for real-world
>> use cases.
>
>
> I won't disagree here.
>
>>
>>   I'm not at all sure what a better answer would look like.
>> But I'd rather see us take two steps back and consider the issue
>> holistically instead of trying to band-aid individual misbehaviors.
>>
>> The fact that pg_dump is emitting COMMENT ON DATABASE at all is
>> fundamentally wrong given the existing division-of-labor decisions,
>> namely that pg_dump is responsible for objects within a database
>> not for database-level properties.

I think a while back somebody had the idea of making COMMENT ON
CURRENT_DATABASE or COMMENT ON CURRENT DATABASE work, which seems like
an elegant solution to me.  Of course, I just work here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Tom Lane

Robert Haas  writes:
> On Thu, Aug 4, 2016 at 9:57 AM, Tom Lane  wrote:
>> [ shrug... ]  What do you claim is not machine-readable about
>> server_version?

> Surely you can't have missed the connection between the issue at hand
> and what Craig is talking about.  If libpq were using the
> machine-readable version rather than PARSING A STRING, switching to a
> two-part numbering scheme wouldn't force a compatibility break.

Well, yeah, this specific case would not have broken, because we
intentionally chose to maintain backwards compatibility of the
PG_VERSION_NUM data format.  But I think there's nothing much except
wishful thinking backing up the idea that starting to send
server_version_num now will prevent a bug in future.  If we ever do
change our version numbering scheme again, it would quite possibly
be in a way that breaks PG_VERSION_NUM as well.  A fairly obvious
potential reason to need to change something there is overflow of
the two-digit fields: clients relying on PG_VERSION_NUM would be
*more* at risk, not less so, than clients looking at a string.

In short: if we'd done it that way all along, it would've been nice.
But encouraging people to change horses now isn't really going to
make things better.  What is far more likely to happen, if we were
to do that, is that people will make clients that gratuitously fail
on old server versions because they didn't bother to support or test
for the case that the server doesn't send server_version_num.

You don't really want to encourage multiple ways of identifying which
software version you're dealing with.  That way madness lies.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in WaitForBackgroundWorkerShutdown() [REL9_5_STABLE]

2016-08-04 Thread Tom Lane

Yury Zhuravlev  writes:
> Dmitry Ivanov wrote:
>>> Recently I've encountered a strange crash while calling elog(ERROR, "...")
>>> after the WaitForBackgroundWorkerShutdown() function. It turns out that
>>> _returns_ inside the PG_TRY()..PG_CATCH() block are *highly* undesirable,
>>> since they leave PG_exception_stack pointing to a local struct in a dead
>>> frame, which is an obvious UB. I've attached a patch which fixes this
>>> behavior in the aforementioned function, but there might be more errors
>>> like that elsewhere.

> Good catch. But in 9.6 it has been fixed by 
> db0f6cad4884bd4c835156d3a720d9a79dbd63a9 commit. 

Only accidentally, I'd say.  I chose to push this into HEAD as well,
so that WaitForBackgroundWorkerShutdown would look more like
WaitForBackgroundWorkerStartup, and would not have a risk of future
breakage if someone adds code after the for-loop.

I was pretty worried about the "more errors like that" point, because
we recently fixed another similar mistake in plpython, cf 1d2fe56e4.
However, a search using the attached quick-hack script to look for
"return" etc between PG_TRY and PG_CATCH did not find any other
instances.

It'd be nice to have some automatic way of catching future mistakes
like this, but I'm not sure about a reliable way to do that.
One idea is to add an Assert to PG_CATCH:

 #define PG_TRY()  \
 do { \
 sigjmp_buf *save_exception_stack = PG_exception_stack; \
 ErrorContextCallback *save_context_stack = error_context_stack; \
 sigjmp_buf local_sigjmp_buf; \
 if (sigsetjmp(local_sigjmp_buf, 0) == 0) \
 { \
 PG_exception_stack = &local_sigjmp_buf

 #define PG_CATCH()\
+Assert(PG_exception_stack == &local_sigjmp_buf); \
 } \
 else \
 { \
 PG_exception_stack = save_exception_stack; \
 error_context_stack = save_context_stack

but there are a couple of disadvantages to this.  One is that you don't
find out about a problem unless the bad code path is actually exercised.
I'm afraid that the most tempting places to do something like this would
be low-probability error cases, and thus that the Assert would do little
to catch them.  (Case in point: the postmaster-death path fixed in this
patch would surely never have been caught by testing.)  The other and
really worse problem is that when the Assert does fire, it's nowhere near
the scene of the crime, so while it may tell you there's a problem it
does not give much help in locating the bug.

Anyone have a better idea?

regards, tom lane

#!/bin/sh

# search DIR for PG_TRY blocks containing WORD
# return, break, continue, goto are good things to look for

dir="$1"
word="$2"

for f in `find "$dir" -type f -name '*.c'`
do
  sed -n -e '/PG_TRY()/,/PG_CATCH()/ { /'"$word"'/ =; /'"$word"'/ p }' $f | \
  sed -e '/^[0-9]\+$/ { s|^[0-9]\+$|'"$f"': &: |; N }'
done

exit 0

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Keeping CURRENT_DATE and similar constructs in original format

2016-08-04 Thread Peter Eisentraut

On 5/12/16 6:14 PM, Tom Lane wrote:
> So what I've wanted to do for some time is invent a new expression node
> type that represents any one of these functions and can be reverse-listed
> in the same format that the input had.  The attached proposed patch does
> that.

I was experimenting with this as well when I found your patch, and I
think this is the right solution.  Your patch works fine for me.

Because of the refactoring in 2f153ddfdd318b211590dd5585f65f357d85c2de,
you will need to update your patch a bit.

> (I'm not particularly in love with the node type name
> ValueFunction; anybody got a better idea?)

I think this is fine.  The only other idea I have would be
SQLValueFunction, to emphasize that this is about SQL-mandated special
cases.

> By the by, a scan through gram.y reveals other stuff we aren't trying
> to reverse-list in original form:
> 
>   a_expr AT TIME ZONE a_expr
>   LIKE, ILIKE, SIMILAR TO
>   OVERLAPS
>   BETWEEN
>   COLLATION FOR '(' a_expr ')'
>   EXTRACT '(' extract_list ')'
>   OVERLAY '(' overlay_list ')'
>   POSITION '(' position_list ')'
>   SUBSTRING '(' substr_list ')'
>   TREAT '(' a_expr AS Typename ')'
>   TRIM '(' BOTH trim_list ')'
>   TRIM '(' LEADING trim_list ')'
>   TRIM '(' TRAILING trim_list ')'
>   TRIM '(' trim_list ')'
> 
> Each of these gets converted to some PG-specific function or operator
> name, and then will get reverse-listed using that name and ordinary
> function or operator syntax, rather than using the SQL-approved special
> syntax.

I think those could be addressed by having ruleutils.c *always* convert
matching function calls back to the special syntax.  Alternatively, tag
the function call node in the grammar with "this is special syntax" and
then look for that in ruleutils.c.  This is sort of what I was playing
with, except that the several levels of casting for the datetime
functions make that a mess.  If it's only one function call, it should
be easier.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Robert Haas

On Thu, Aug 4, 2016 at 9:57 AM, Tom Lane  wrote:
> Vladimir Sitnikov  writes:
>>> Sorry, but I don't buy that.  I think sending both server_version and
>>> server_version_num would be silly, and we're certainly not going to stop
>>> sending server_version.
>
>> What is wrong with sending machine-readable value?
>
> [ shrug... ]  What do you claim is not machine-readable about
> server_version?

Surely you can't have missed the connection between the issue at hand
and what Craig is talking about.  If libpq were using the
machine-readable version rather than PARSING A STRING, switching to a
two-part numbering scheme wouldn't force a compatibility break.  Every
driver that his independently implemented the PostgreSQL wire protocol
is going to have to be updated for this, if they're doing something
similar to libpq, and you're still asking why sending
server_version_num is potentially beneficial?

I think it's entirely reasonable to ask whether it's worth burdening
connection startup with a few extra bytes of essentially duplicative
information is a good idea on performance grounds, and I don't know
the answer to that question.  But pretending like it wouldn't help
anything when it would fix *the exact problem we are currently talking
about* is just sticking your head in the sand.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Peter Eisentraut

On 8/3/16 1:20 PM, Tom Lane wrote:
> I was just wondering whether it would be worth starting to use "git tag -a".

One should always use annotated tags for public releases.  That way, the
tag is its own Git object that cannot be later moved around or easily
faked (modulo GPG signatures -- a whole different discussion).

Unannotated tags are more intended for private labels.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Pgbench performance tuning?

2016-08-04 Thread Greg Stark

On Thu, Aug 4, 2016 at 6:53 PM, Andres Freund  wrote:
> What's the config? Version? What activity does pidstat -d -l indicate?
> How much WAL was generated?

I know the specifics matter but I was also trying to avoid dumping too
much into the email.

The shared buffers is set to 16384 (128MB). Otherwise it's a default
config (For 9.4 and before I set checkpoint_segments to 32 as well).

Never seen pidstat before but pidstat -d looks like it prints very
similar output to the iostat output I was gathering already. There's
nothing else running on the machine.

I would have to rerun the benchmarks to measure the WAL output. Since
there's no replication and the database shut down cleanly it looks
like it recycled all the log files proactively and the last checkpoint
is in the earliest numbered xlog file.

I'll add checking the xlog position before and after pg_bench to the
script. I also wanted to use run the database under a binary calling
getrusage to report RUSAGE_CHILDREN for the database. It looks like
there's no stock program to do this but it shouldn't be hard to hack
one up.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire

On Thu, Aug 4, 2016 at 3:23 PM, Claudio Freire  wrote:
>> That is a new HOT chain given our current code, but
>> would the new tuple addition realize it needs to create a new index
>> tuple?  The new tuple code needs to check the index's root HOT tid for a
>> non-match.
>
> I'm not proposing to change how HOT works, but adding another very
> similar mechanism on top of it called WARM.
>
> So HOT will keep working like that, HOT pruning will happen as usual,
> but we'll have the concept (immaterialized in the physical
> representation of the data, just implicit) of WARM chains. WARM chains
> would span one or several HOT chains, so they're "bigger".


Answering your question, there's no need to find the root page on index updates.

When creating the new tuple in nodeModifyTable.c, the code currently
skips updating indexes if it's using HOT.

We would add a check for WARM too. It will use WARM for index X iff both:

 * ItemPointerGetBlockNumber(oldtuplectid) ==
ItemPointerGetBlockNumber(newtuplectid)
 * satisfies HOT for only this index X


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] handling unconvertible error messages

2016-08-04 Thread Tom Lane

Victor Wagner  writes:
> On Thu, 04 Aug 2016 09:42:10 -0400
> Tom Lane  wrote:
>> Yeah.  I'm inclined to think that we should reset the message locale
>> to C as soon as we've forked away from the postmaster, and leave it
>> that way until we've absorbed settings from the startup packet.

> Really, if this response is sent after backend has been forked, problem
> probably can be easily fixed better way - StartupMessage contain
> information about desired client encoding, so this information just
> should be processed earlier than any other information from this
> message, which can cause errors (such as database name).

I think that's wishful thinking.  There will *always* be errors that
come out before we can examine the contents of the startup message.
Moreover, until we've done authentication, we should be very wary of
applying client-specified settings at all: they might be malicious.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire

On Thu, Aug 4, 2016 at 3:07 PM, Bruce Momjian  wrote:
> On Thu, Aug  4, 2016 at 02:35:32PM -0300, Claudio Freire wrote:
>> > Vacuum will need to be taught not to break the invariant, but I
>> > believe this is viable and efficient.
>>
>>
>> And I didn't mention the invariant.
>>
>> The invariant would be that all WARM chains:
>>
>>  * contain entire HOT chains (ie: no item pointers pointing to the
>> middle of a HOT chain)
>
> You mean no _index_ item pointers, right?

Yes

>>  * start at the item pointer, and end when the t_ctid of the heap
>> tuple either points to another page, or a tuple with a different index
>> key
>
> Uh, there is only one visible tuple in a HOT chain, so I don't see the
> different index key as being a significant check.

For a HOT chain, sure. That's why I mentioned it as an optimization.

But t_ctid, I checked, unless I read the code wrong, is set for
non-HOT updates too.

So following t_ctid regardless of the HOT-update flags should yield
update chains that can walk several buffers, and that's in fact the
mechanism I'm proposing, only just stop following the chain when it
jumps buffers.

>  That is, I think we
> should check the visibility first, as we do now, and then, for the
> only-visible tuple, check the key.

No, because, as I explained in the example, doing that will result in
duplicates being returned.

The whole update chain can span multiple WARM chains, and each WARM
chain can span multiple HOT chains.

Consider, assuming there's an index on id and another on val:

INSERT INTO TABLE t (id,val,dat) VALUES (32,'a','blabla');
UPDATE t SET dat='blabla2' where id = 32;
UPDATE t SET dat='blabla3', id=33 where id = 32;
UPDATE t SET val='b' where id = 33;
UPDATE t SET dat='blabla4' where id = 33;
UPDATE t SET val='a' where id = 33;

This produces a case similar to the one I described. Assuming all
updates happen on the same page, the first will be HOT, so no key
check is needed. The second cannot be HOT, because it updates the id.
But it can be WARM, if it's done on the same page, so it can avoid
updating the index on val. The third will need a new index item
pointer, because it has a new key. The t_ctid chain will walk all the
versions of the row, but the one with 'blabla2' will not have the
HOT-updated bit set. So regular HOT chain walking will stop there, I
propose to not stop there, so the index on val can follow the chain
into blabla3 as if it were a HOT update. Correctness is achieved only
if there is no index item pointer on that index pointing to that index
version, and since there is no efficient way of checking, the
invariants I propose aim to guarantee it so it can be assumed. Since
WARM updates don't create new index item pointers, what needs to be
checked is whether updating from version 2 to version 3 would be a
WARM update on this index or not, and that equals to checking the
index keys for equality.

>  I don't see a value in (expensive)
> checking the key of an invisible tuple in hopes of stoping the HOT chain
> traversal.

Not HOT chain, the point is to guarantee correctness by properly
identifying the end of the WARM (not HOT) chain, which is left
implicit.

> Also, what happens when a tuple chain goes off-page, then returns to the
> original page?

The WARM chain ends on the page hop, and when it returns it's a new
WARM chain. So traversal would not follow t_ctid pointers that hop
pages, which makes it efficient, and also guarantees correctness
(since if the is a page hop, we know the index will contain an item
pointer to the version in that other page, so we'll visit it when we
get there).

> That is a new HOT chain given our current code, but
> would the new tuple addition realize it needs to create a new index
> tuple?  The new tuple code needs to check the index's root HOT tid for a
> non-match.

I'm not proposing to change how HOT works, but adding another very
similar mechanism on top of it called WARM.

So HOT will keep working like that, HOT pruning will happen as usual,
but we'll have the concept (immaterialized in the physical
representation of the data, just implicit) of WARM chains. WARM chains
would span one or several HOT chains, so they're "bigger".

>> This is maintained by:
>>
>>  * No item pointer will be created for row versions whose immediately
>> previous version is already on the index with the same key
>
> You really only have the ctid HOT head stored in the index, not the
> immediate ctid.

I know. I just mentioned it just in case there was a transient time
during cleanup when it isn't true, but thinking it a little bit more,
if it weren't, HOT would also cause duplicates during index scans, so
it's probably not the case (or protected with a lock).

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] improved DefElem list processing

2016-08-04 Thread Tom Lane

I wrote:
> Peter Eisentraut  writes:
>> The other adds a location field to the DefElem node.

> +1 for sure, lots of places where that would be a good thing
> (the duplicate check itself, for starters).

Forgot to mention: seems like you should have added a location
argument to makeDefElem.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Pgbench performance tuning?

2016-08-04 Thread Andres Freund

On 2016-08-04 19:15:43 +0100, Greg Stark wrote:
> On Thu, Aug 4, 2016 at 6:53 PM, Andres Freund  wrote:
> > What's the config? Version? What activity does pidstat -d -l indicate?
> > How much WAL was generated?
> 
> I know the specifics matter but I was also trying to avoid dumping too
> much into the email.
> 
> The shared buffers is set to 16384 (128MB). Otherwise it's a default
> config (For 9.4 and before I set checkpoint_segments to 32 as well).

Well, with 128MB I don't find that a very surprising result. You're
going to push data out to disk constantly. Given the averaged random
access pattern of pgbench that's not really something that interesting.


> Never seen pidstat before but pidstat -d looks like it prints very
> similar output to the iostat output I was gathering already. There's
> nothing else running on the machine.

The question is which backends are doing the IO.


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Lossy Index Tuple Enhancement (LITE)

2016-08-04 Thread Simon Riggs

On 4 August 2016 at 18:27, Bruce Momjian  wrote:

>> > Also, why not use this bitmap for all indexes, not just update chains?
>>
>> I don't understand where you get this update chains thing from.
>>
>> The bitmap can apply to multiple tuples on one page, which is described.
>
> I am asking if we could use this idea for all rows on a page, not just
> HOT updated, e.g. we have five rows that have col1=4 in an index --- why
> not just use on index pointer for all of them, with a bitmap to point to
> the possible matches?

Yes. I described that...

"The same situation can occur if we have many INSERTs with same values
on the same block. This could lead to a reduction in size of the btree
for indexes with many duplicate values."

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Double invocation of InitPostmasterChild in bgworker with -DEXEC_BACKEND

2016-08-04 Thread Robert Haas

On Tue, Aug 2, 2016 at 6:40 PM, Tom Lane  wrote:
> Thomas Munro  writes:
>> I discovered that if you build with -DEXEC_BACKEND on a Unix system
>> and then try to start a background worker, it dies in
>> InitializeLatchSupport:
>
>> TRAP: FailedAssertion("!(selfpipe_readfd == -1)", File: "latch.c", Line: 161)
>
>> That's because InitPostmasterChild is called twice.  I can
>> successfully use regular parallel query workers and bgworkers created
>> by extensions if I apply the attached patch.
>
> Confirmed, fix pushed.  I wonder if we should have a buildfarm member
> running the Unix EXEC_BACKEND code ...

That would get my vote.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] improved DefElem list processing

2016-08-04 Thread Tom Lane

Peter Eisentraut  writes:
> Here are two WIP patches to improve the DefElem list processing that is
> used by many utility commands.

> One factors out the duplicate checks, which are currently taking up a
> lot of space with duplicate code.  I haven't applied this everywhere
> yet, but the patch shows how much boring code can be saved.

+1 on the general idea, but I wonder if it's a problem that you replaced
an O(N) check with an O(N^2) check.  I don't think there are any commands
that would be likely to have so many options that this would become a
serious issue, but ...

> The other adds a location field to the DefElem node.

+1 for sure, lots of places where that would be a good thing
(the duplicate check itself, for starters).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [RFC] Change the default of update_process_title to off

2016-08-04 Thread Robert Haas

On Thu, Aug 4, 2016 at 3:52 AM, Andres Freund  wrote:
> On 2016-08-04 16:48:11 +0900, Michael Paquier wrote:
>> Here is a different proposal: documenting instead that disabling that
>> parameter on Windows can improve performance, at the cost of losing
>> information verbosity for processes.
>
> The benefit on windows seems pretty marginal, given the way it has to be
> viewed. People installing processexplorer et. al. to view a handle that
> have to know about, will be able to change the config.

I agree.  I think it would be fine to disable this on Windows, but I
wouldn't want to do the same thing on other platforms.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 02:35:32PM -0300, Claudio Freire wrote:
> > Vacuum will need to be taught not to break the invariant, but I
> > believe this is viable and efficient.
> 
> 
> And I didn't mention the invariant.
> 
> The invariant would be that all WARM chains:
> 
>  * contain entire HOT chains (ie: no item pointers pointing to the
> middle of a HOT chain)

You mean no _index_ item pointers, right?

>  * start at the item pointer, and end when the t_ctid of the heap
> tuple either points to another page, or a tuple with a different index
> key

Uh, there is only one visible tuple in a HOT chain, so I don't see the
different index key as being a significant check.  That is, I think we
should check the visibility first, as we do now, and then, for the
only-visible tuple, check the key.  I don't see a value in (expensive)
checking the key of an invisible tuple in hopes of stoping the HOT chain
traversal.

Also, what happens when a tuple chain goes off-page, then returns to the
original page?  That is a new HOT chain given our current code, but
would the new tuple addition realize it needs to create a new index
tuple?  The new tuple code needs to check the index's root HOT tid for a
non-match.

> This is maintained by:
> 
>  * No item pointer will be created for row versions whose immediately
> previous version is already on the index with the same key

You really only have the ctid HOT head stored in the index, not the
immediate ctid.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 01:05:53PM -0400, Bruce Momjian wrote:
> > Approach 2 seems more reasonable and simple. 
> > 
> > There are only 2 bits for lp_flags and all combinations are already used. 
> > But
> > for LP_REDIRECT root line pointer, we could use the lp_len field to store 
> > this
> > special flag, which is not used for LP_REDIRECT line pointers. So we are 
> > able
> > to mark the root line pointer.
> 
> Uh, as I understand it, we only use LP_REDIRECT when we have _removed_
> the tuple that the ctid was pointing to, but it seems you would need to
> set HEAP_RECHECK_REQUIRED earlier than that.
> 
> Also, what is currently in the lp_len field for LP_REDIRECT?  Zeros, or
> random data?  I am asking for pg_upgrade purposes.

It seems LP_REDIRECT's lp_len is always zero:  :-)

/*
 * ItemIdSetRedirect
 *  Set the item identifier to be REDIRECT, with the specified link.
 *  Beware of multiple evaluations of itemId!
 */
#define ItemIdSetRedirect(itemId, link) \
( \
(itemId)->lp_flags = LP_REDIRECT, \
(itemId)->lp_off = (link), \
(itemId)->lp_len = 0 \
  ^^
)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Pgbench performance tuning?

2016-08-04 Thread Andres Freund

On 2016-08-04 18:48:31 +0100, Greg Stark wrote:
> I'm trying to run pgbench on a moderately beefy machine (4-core 3.4GHz
> with 32G of ram and md mirrored spinning rust drives) at scale 300
> with 32 clients with duration of 15min. I'm getting TPS numbers
> between 60-150 which seems surprisingly low to me and also to several
> people on IRC.

What's the config? Version? What activity does pidstat -d -l indicate?
How much WAL was generated?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Pgbench performance tuning?

2016-08-04 Thread Greg Stark

I'm trying to run pgbench on a moderately beefy machine (4-core 3.4GHz
with 32G of ram and md mirrored spinning rust drives) at scale 300
with 32 clients with duration of 15min. I'm getting TPS numbers
between 60-150 which seems surprisingly low to me and also to several
people on IRC.

Now pg_test_fsync does seem to indicate that's not an unreasonable
commit rate if there was very little commit grouping going on:

Compare file sync methods using one 8kB write:
open_datasync   100.781 ops/sec9922 usecs/op
fdatasync71.088 ops/sec   14067 usecs/op

Compare file sync methods using two 8kB writes:
open_datasync50.286 ops/sec   19886 usecs/op
fdatasync80.349 ops/sec   12446 usecs/op

And iostat does seem to indicate the drives are ~ 80% utilized with
high write await times So maybe this is just what the system is
capable of with synchronous_commit?

Is anyone really familiar with pg_bench on similar hardware? Are these
numbers reasonable? Any suggestion for how to run it to get the most
realistic measure of Postgres on this machine?

starting vacuum...end.
transaction type: 
scaling factor: 300
query mode: simple
number of clients: 32
number of threads: 4
duration: 900 s
number of transactions actually processed: 109424
latency average: 263.196 ms
tps = 121.464536 (including connections establishing)
tps = 121.464824 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
 0.002  \set aid random(1, 10 * :scale)
 0.001  \set bid random(1, 1 * :scale)
 0.001  \set tid random(1, 10 * :scale)
 0.001  \set delta random(-5000, 5000)
 0.229  BEGIN;
12.589  UPDATE pgbench_accounts SET abalance = abalance +
:delta WHERE aid = :aid;
 0.280  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
 1.442  UPDATE pgbench_tellers SET tbalance = tbalance +
:delta WHERE tid = :tid;
12.435  UPDATE pgbench_branches SET bbalance = bbalance +
:delta WHERE bid = :bid;
 0.222  INSERT INTO pgbench_history (tid, bid, aid, delta,
mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
   237.623  END;

And the iostat for the period the pg_bench is running:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   1,310,000,43   20,480,00   77,78

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb   0,00 2,47   14,31  181,13   175,96  2897,77
31,4580,85  413,63   67,55  440,97   4,16  81,35
sda   0,01 2,47   23,77  181,13   292,04  2897,77
31,1466,72  325,59   51,08  361,61   3,92  80,40
md0   0,00 0,000,050,00 0,20 0,00
8,00 0,000,000,000,00   0,00   0,00
md1   0,00 0,000,000,00 0,00 0,00
0,00 0,000,000,000,00   0,00   0,00
md2   0,00 0,00   38,04  182,79   467,79  2895,73
30,46 0,000,000,000,00   0,00   0,00
md3   0,00 0,000,000,01 0,00 0,04
8,00 0,000,000,000,00   0,00   0,00


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 11:04:49PM +0530, Pavan Deolasee wrote:
> If the root tuple still exists, we store the WARM flag (or
> HEAP_RECHECK_REQUIRED as used in the original post) in the tuple header 
> itself.
> When the root tuple becomes dead and HOT prune decides to replace it with a
> LP_REDIRECT line pointer, the information is moved to lp_len (which is
> currently set to 0 for LP_REDIRECT items). Does that answer your question?

Ah, brilliant!

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] psql: Missing option to print current buffer to current output channel (i.e., \qprint)

2016-08-04 Thread David G. Johnston

"\echo" has "\qecho" - I'm basically looking for a "\qprint"

"\write" doesn't apply

The following doesn't work either...

#bash (stock Ubuntu 14.04)
psql --set=query_in_var='SELECT version();' > /dev/null <<'SQL'
\o /tmp/psql-test.txt
\set ECHO queries
--still ends up on stdout, hence the redirect to /dev/null to avoid it
showing on screen along with the cat output below
:query_in_var
\o
SQL
cat /tmp/psql-test.txt

Can anyone offer a reason not to add "\qprint"?

David J.

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Pavan Deolasee

On Thu, Aug 4, 2016 at 10:49 PM, Bruce Momjian  wrote:

> On Thu, Aug  4, 2016 at 06:16:02PM +0100, Simon Riggs wrote:
> > On 4 August 2016 at 18:05, Bruce Momjian  wrote:
> >
> > >> Approach 2 seems more reasonable and simple.
> > >>
> > >> There are only 2 bits for lp_flags and all combinations are already
> used. But
> > >> for LP_REDIRECT root line pointer, we could use the lp_len field to
> store this
> > >> special flag, which is not used for LP_REDIRECT line pointers. So we
> are able
> > >> to mark the root line pointer.
> > >
> > > Uh, as I understand it, we only use LP_REDIRECT when we have _removed_
> > > the tuple that the ctid was pointing to, but it seems you would need to
> > > set HEAP_RECHECK_REQUIRED earlier than that.
> >
> > Hmm. Mostly there will be one, so this is just for the first update
> > after any VACUUM.
> >
> > Adding a new linepointer just to hold this seems kludgy and could mean
> > we run out of linepointers.
>
> Ah, so in cases where there isn't an existing LP_REDIRECT for the chain,
> you create one and use the lp_len to identify it as a WARM chain?  Hmm.
>
>
If the root tuple still exists, we store the WARM flag (or
HEAP_RECHECK_REQUIRED as used in the original post) in the tuple header
itself. When the root tuple becomes dead and HOT prune decides to replace
it with a LP_REDIRECT line pointer, the information is moved to lp_len
(which is currently set to 0 for LP_REDIRECT items). Does that answer your
question?


> You can't update the indexes pointing to the existing ctid, so what you
> would really have to do is to write over the existing ctid with
> LP_REDIRECT plus WARM marker, and move the old ctid to a new ctid slot?
>
>
Not really. I hope the above answers this, but please let me know if you
mean something else.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire

On Thu, Aug 4, 2016 at 2:14 PM, Claudio Freire  wrote:
> To fix this, what I was pondering, and I believe it would be viable,
> is to teach heap_hot_search_buffer about the WARM invariant. It will
> make correctness heavily depend on the invariant being true, which
> means either flagging item pointers as WARM chains, or requiring a
> reindex during upgrade to make sure all indexes verify the invariant -
> as old indices won't.
>
> When an item pointer is WARM (either implicit or explicit),
> heap_hot_search_buffer will only stop following the update chain if it
> reaches not the end of a HOT chain, but rather a key change. For this,
> it will need to recheck the key of all but the first TID in the chain,
> so singleton chains (ie: not-updated or frozen tuples) won't incurr
> this extra cost. When walking the chain, it will need to form an index
> tuple for both the old and new versions, and compare them, and follow
> the chain *only* if the keys are equal.
>
> As an optimization, if the tuple is marked HOT-updated, the key check
> may be skipped.
>
> Vacuum will need to be taught not to break the invariant, but I
> believe this is viable and efficient.


And I didn't mention the invariant.

The invariant would be that all WARM chains:

 * contain entire HOT chains (ie: no item pointers pointing to the
middle of a HOT chain)
 * start at the item pointer, and end when the t_ctid of the heap
tuple either points to another page, or a tuple with a different index
key
 * there is exactly one WARM chain containing any heap tuple,
including the singleton chain represented by a tuple whose WARM chain
contains only itself

This is maintained by:

 * No item pointer will be created for row versions whose immediately
previous version is already on the index with the same key

Cleanup can only proceed by applying cleanup on HOT chains, by moving
the head of a WARM chain forward, or by removing it entirely. It may
be that VACUUM already works like that, but I'm unsure.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Pavan Deolasee

On Thu, Aug 4, 2016 at 10:41 PM, Simon Riggs  wrote:

> On 4 August 2016 at 17:31, Andres Freund  wrote:
> > Hi,
> >
> > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> >> Indexes whose values do not change do not require new index pointers.
> Only
> >> the index whose key is being changed will need a new index entry. The
> new
> >> index entry will be set to the CTID of the root line pointer.
> >
> > That seems to require tracing all hot-chains in a page, to actually
> > figure out what the root line pointer of a warm-updated HOT tuple is,
> > provided it's HOT_UPDATED itself.  Or have you found a smart way to
> > figure that out?
>
> Hmm, sorry, I did think of that point and I thought I had added it to the
> doc.
>
> So, yes, I agree - re-locating the root is the biggest downside to
> this idea. Perhaps Pavan has other thoughts?
>
>
I clearly missed this problem, so what I say now is not fully thought
through. I see couple of options:

1. Most often the tuple that's being updated will be fetched by recent
index scan. So we can potentially cache last (few) mappings of CTID to
their root line pointers. May be we can store that in RelationData on a per
relation basis.
2. I also think that most often the tuple that will be updated will have
its t_ctid pointing to itself. This sounds more invasive, but can we
somehow utilise the t_ctid field to store the OffsetNumber of the root line
pointer? The root line pointer only exists when the tuple under
consideration has HEAP_ONLY_TUPLE flag set and we know for such tuples,
BlockNumber in t_ctid must be same as current block (unless HEAP_UPDATED is
also set and the updating transaction eventually aborted).

We may have to fall back to a full page scan to find the root line pointer,
but we can limit that for corner cases.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 06:18:47PM +0100, Simon Riggs wrote:
> > Have you considered what it'd take to allow index pointers to allow to
> > point to "intermediate" root tuples? I.e. have some indexes point into
> > the middle of an update-chain, whereas others still point to the
> > beginning? There's certainly some complications involved with that, but
> > it'd also have the advantage in reducing the amount of rechecking that
> > needs to be done.
> 
> "Intermediate root tuples" was my first attempt at this and it didn't
> work. I'll dig up the details, some problem in VACUUM, IIRC.

I think the problem is that HOT chain pruning becomes almost impossible
except via VACUUM.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Lossy Index Tuple Enhancement (LITE)

2016-08-04 Thread Claudio Freire

On Thu, Aug 4, 2016 at 1:58 PM, Simon Riggs  wrote:
> On 3 August 2016 at 20:37, Claudio Freire  wrote:
>> On Wed, Aug 3, 2016 at 4:20 AM, Simon Riggs  wrote:
>>> == IndexScan ==
>>>
>>> Note that the executor code for IndexScan appears identical between
>>> the two optimizations. The difference between duplicate and range LITE
>>> tuples is needed only at INSERT time (or UPDATE indexed column to a
>>> new value).
>>>
>>> When we do an IndexScan if we see a LITE tuple we do a scan of the
>>> linepointer ranges covered by this index tuple (which might eventually
>>> go to a full block scan). For example with bit1 set we would scan 16
>>> linepointers (on an 8192 byte block) and with 2 bits set we would scan
>>> 32 linepointers etc.., not necessarily consecutive ranges. So the
>>> IndexScan can return potentially many heap rows per index entry, which
>>> need to be re-checked and may also need to be sorted prior to
>>> returning them. If no rows are returned we can kill the index pointer,
>>> just as we do now for btrees, so they will be removed eventually from
>>> the index without the need for VACUUM. An BitmapIndexScan that sees a
>>> lossy pointer can also use the lossy TID concept, so we can have
>>> partially lossy bitmaps.
>>
>> Wouldn't this risk scanning rows more than once unless it's part of a
>> bitmap scan?
>
> It's always the job of the index insert logic to ensure a valid index exists.
>
> I'm not sure what you see that changes that here. Say more?

I just explained it further in the WARM thread, which is actually the
idea I was arriving to.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Lossy Index Tuple Enhancement (LITE)

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 06:03:34PM +0100, Simon Riggs wrote:
> On 4 August 2016 at 02:13, Bruce Momjian  wrote:
> 
> > How do plan to clear the bitmask so it, over time, doesn't end up being
> > all-set?
> 
> I don't have a plan, though thoughts welcome.
> 
> Similar situation that our current indexes don't recover from bloat, a
> situation made worse by non-HOT updates.
> 
> So, we have a choice of which disadvantage to accept.
> 
> 
> > Also, why not use this bitmap for all indexes, not just update chains?
> 
> I don't understand where you get this update chains thing from.
> 
> The bitmap can apply to multiple tuples on one page, which is described.

I am asking if we could use this idea for all rows on a page, not just
HOT updated, e.g. we have five rows that have col1=4 in an index --- why
not just use on index pointer for all of them, with a bitmap to point to
the possible matches?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Andres Freund

On 2016-08-04 18:18:47 +0100, Simon Riggs wrote:
> On 4 August 2016 at 18:17, Andres Freund  wrote:
> > On 2016-08-04 18:11:17 +0100, Simon Riggs wrote:
> > I'm less afraid of the fiddlyness of finding the root tuple, than the
> > cost. It's not cheap to walk through, potentially, all hot chains to
> > find the root ctid.
> >
> > Have you considered what it'd take to allow index pointers to allow to
> > point to "intermediate" root tuples? I.e. have some indexes point into
> > the middle of an update-chain, whereas others still point to the
> > beginning? There's certainly some complications involved with that, but
> > it'd also have the advantage in reducing the amount of rechecking that
> > needs to be done.
> 
> "Intermediate root tuples" was my first attempt at this and it didn't
> work. I'll dig up the details, some problem in VACUUM, IIRC.

I think it's doable, but the question is how to do cleanup. It's kind of
easy to end up with a page full of line pointers which are all reachable
from indexes, but otherwise useless.  But I'm not sure that's actually
that inevitable:

a) Given that we fully scan indexes anyway, we could apply redirections
   to the indexes themselves, during vacuum. Then redirections could be
   removed afterwards.
b) Start fixing up indexes, by refinding entries in the index from the
   heap tuple. That requires expressions to actually be immutable, but
   personally I think it's stupid to cater for expressions that are
   not. Then we can update indexes to point to the "final tuple" after
   pruning.  It'd also pave the way to avoid the full index scan at the
   end of vacuum (in some cases).

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 06:16:02PM +0100, Simon Riggs wrote:
> On 4 August 2016 at 18:05, Bruce Momjian  wrote:
> 
> >> Approach 2 seems more reasonable and simple.
> >>
> >> There are only 2 bits for lp_flags and all combinations are already used. 
> >> But
> >> for LP_REDIRECT root line pointer, we could use the lp_len field to store 
> >> this
> >> special flag, which is not used for LP_REDIRECT line pointers. So we are 
> >> able
> >> to mark the root line pointer.
> >
> > Uh, as I understand it, we only use LP_REDIRECT when we have _removed_
> > the tuple that the ctid was pointing to, but it seems you would need to
> > set HEAP_RECHECK_REQUIRED earlier than that.
> 
> Hmm. Mostly there will be one, so this is just for the first update
> after any VACUUM.
> 
> Adding a new linepointer just to hold this seems kludgy and could mean
> we run out of linepointers.

Ah, so in cases where there isn't an existing LP_REDIRECT for the chain,
you create one and use the lp_len to identify it as a WARM chain?  Hmm.

You can't update the indexes pointing to the existing ctid, so what you
would really have to do is to write over the existing ctid with
LP_REDIRECT plus WARM marker, and move the old ctid to a new ctid slot?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Simon Riggs

On 4 August 2016 at 18:17, Andres Freund  wrote:
> On 2016-08-04 18:11:17 +0100, Simon Riggs wrote:
>> On 4 August 2016 at 17:31, Andres Freund  wrote:
>> > Hi,
>> >
>> > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
>> >> Indexes whose values do not change do not require new index pointers. Only
>> >> the index whose key is being changed will need a new index entry. The new
>> >> index entry will be set to the CTID of the root line pointer.
>> >
>> > That seems to require tracing all hot-chains in a page, to actually
>> > figure out what the root line pointer of a warm-updated HOT tuple is,
>> > provided it's HOT_UPDATED itself.  Or have you found a smart way to
>> > figure that out?
>>
>> Hmm, sorry, I did think of that point and I thought I had added it to the 
>> doc.
>>
>> So, yes, I agree - re-locating the root is the biggest downside to
>> this idea. Perhaps Pavan has other thoughts?
>>
>> I think its doable, but it will be fiddly.
>
> I'm less afraid of the fiddlyness of finding the root tuple, than the
> cost. It's not cheap to walk through, potentially, all hot chains to
> find the root ctid.
>
> Have you considered what it'd take to allow index pointers to allow to
> point to "intermediate" root tuples? I.e. have some indexes point into
> the middle of an update-chain, whereas others still point to the
> beginning? There's certainly some complications involved with that, but
> it'd also have the advantage in reducing the amount of rechecking that
> needs to be done.

"Intermediate root tuples" was my first attempt at this and it didn't
work. I'll dig up the details, some problem in VACUUM, IIRC.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Andres Freund

On 2016-08-04 18:11:17 +0100, Simon Riggs wrote:
> On 4 August 2016 at 17:31, Andres Freund  wrote:
> > Hi,
> >
> > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> >> Indexes whose values do not change do not require new index pointers. Only
> >> the index whose key is being changed will need a new index entry. The new
> >> index entry will be set to the CTID of the root line pointer.
> >
> > That seems to require tracing all hot-chains in a page, to actually
> > figure out what the root line pointer of a warm-updated HOT tuple is,
> > provided it's HOT_UPDATED itself.  Or have you found a smart way to
> > figure that out?
> 
> Hmm, sorry, I did think of that point and I thought I had added it to the doc.
> 
> So, yes, I agree - re-locating the root is the biggest downside to
> this idea. Perhaps Pavan has other thoughts?
> 
> I think its doable, but it will be fiddly.

I'm less afraid of the fiddlyness of finding the root tuple, than the
cost. It's not cheap to walk through, potentially, all hot chains to
find the root ctid.

Have you considered what it'd take to allow index pointers to allow to
point to "intermediate" root tuples? I.e. have some indexes point into
the middle of an update-chain, whereas others still point to the
beginning? There's certainly some complications involved with that, but
it'd also have the advantage in reducing the amount of rechecking that
needs to be done.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 09:58:29AM -0700, Andres Freund wrote:
> On 2016-08-04 12:55:25 -0400, Bruce Momjian wrote:
> > On Thu, Aug  4, 2016 at 09:31:18AM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> > > > Indexes whose values do not change do not require new index pointers. 
> > > > Only
> > > > the index whose key is being changed will need a new index entry. The 
> > > > new
> > > > index entry will be set to the CTID of the root line pointer.
> > > 
> > > That seems to require tracing all hot-chains in a page, to actually
> > > figure out what the root line pointer of a warm-updated HOT tuple is,
> > > provided it's HOT_UPDATED itself.  Or have you found a smart way to
> > > figure that out?
> > 
> > The index points to the head of the HOT chain, and you just walk the
> > chain on the page --- on need to look at other chains on the page.
> 
> When doing an update, you'll need to re-find the root tuple, to insert
> the root ctid for the new index tuples.

Also, I was thinking we could just point into the middle of the HOT
chain to limit the number of rows we have to check, but the problem
there is that HOT pruning could then not remove that middle ctid, which
is bad.

I wonder if walking all HOT chains on a single page to find the root is
cheap enough.  The ctid tells you if it is part of a HOT chain, right?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Simon Riggs

On 4 August 2016 at 18:05, Bruce Momjian  wrote:

>> Approach 2 seems more reasonable and simple.
>>
>> There are only 2 bits for lp_flags and all combinations are already used. But
>> for LP_REDIRECT root line pointer, we could use the lp_len field to store 
>> this
>> special flag, which is not used for LP_REDIRECT line pointers. So we are able
>> to mark the root line pointer.
>
> Uh, as I understand it, we only use LP_REDIRECT when we have _removed_
> the tuple that the ctid was pointing to, but it seems you would need to
> set HEAP_RECHECK_REQUIRED earlier than that.

Hmm. Mostly there will be one, so this is just for the first update
after any VACUUM.

Adding a new linepointer just to hold this seems kludgy and could mean
we run out of linepointers.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire

On Thu, Aug 4, 2016 at 7:59 AM, Pavan Deolasee  wrote:
> We don’t want to force a recheck for every index fetch because that will
> slow everything down. So we would like a simple and efficient way of knowing
> about the existence of a WARM tuple in a chain and do a recheck in only
> those cases, ideally O(1). Having a HOT chain contain a WARM tuple is
> discussed below as being a “WARM chain”, implying it needs re-check.
>
> We could solve this by either 1) marking both old and new index tuples as
> "recheck-required" or 2) marking the HOT chain on the heap as
> "recheck-required" such that any tuple returned from the chain requires a
> recheck.
>
> The advantage of doing 1) is that the recheck is required only for the
> compromised index.
> Note that in presence of multiple index keys pointing to the same root line
> pointer, the chain needs re-check for both the old as well as the new index
> key. The old index key must not fetch the new tuple(s) and the new index key
> must not fetch the old tuple(s). So irrespective of whether an old or a new
> tuple is being returned, the caller must know about the WARM tuples in the
> chains. So marking the index would require us to mark both old and new index
> tuples as requiring re-check. That requires an additional index scan to
> locate the old row and then an additional write to force it to re-check,
> which is algorithmically O(NlogN) on table size.
>
> Marking the heap, (2) is O(1) per updated index, so seems better. But since
> we don't know with respect to which index or indexes the chain is broken,
> all index scans must do a recheck for a tuple returned from a WARM chain.
>
> How do we mark a chain as WARM? How do we clear the flag? There are a few
> possible approaches:
>
> 1. Mark the first WARM tuple which causes HOT chain to become WARM with a
> special HEAP_RECHECK_REQUIRED flag (this flag must then be carried over to
> any further additions to the chain. Then whenever a tuple if returned from a
> chain, we must look forward into the chain for WARM tuple and let the caller
> know about potential breakage. This seems simple but would require us to
> follow HOT chain every time to check if it’s HOT or WARM.
>
> 2. Mark the root line pointer (or the root tuple) with a special
> HEAP_RECHECK_REQUIRED flag to tell us about the presence of a WARM tuple in
> the chain. Since all indexes point to the root line pointer, it should be
> enough to just mark the root line pointer (or tuple) with this flag.
>
> 3. We could also adopt a hybrid approach and mark the new index tuple and
> new heap tuple with HEAP_RECHECK_REQUIRED flag and presence of either of the
> flag in either tuple forces recheck.

I've actually been thinking of another possibility in the context of
LITE, and what I was pondering looks eerily similar to WARM, so let me
summarize (I was going to write a lengthier POC but I thought I better
comment now since you're introducing the same concept):

Marks are troublesome becuase, as you mention, you would need one
"WARM" bit per index to be efficient. We also don't want a recheck on
every index access.

HOT is very close to WARM already, even the code. The only problem
with HOT is that it stops scanning the update chain when a tuple is
not HOT-updated. If heap_hot_search_buffer knew to keep on following
the update chain when there is a WARM update on the index in question,
this could work.

But it won't work if it's implemented naively, because if there was
another item pointer in another page of the index pointing to a heap
tuple that belongs to an already-visited chain, which is possible in
the below case, then the scan would produce duplicate rows:

N: Row versions:
->: update
* : causes a new index tuple on index I

1 -> 2 -> 3 -> 4 -> 5
* * *
a ba  <- key

If the scan qual is "key in ('a','b')", then the index scan will visit
both versions of the row, 1, 3 and 4. Suppose only 5 is visible, and
suppose the whole chain lies in the same page.

There is one update chain, 1..5, but three WARM chains: 1..2, 3, 4..5

When visiting item pointers, 1 will be found, the chain will be
followed and eventually version 5 will pass the visibility check and
the tuple will pass the scan quals, it will even match the key in the
index item, so no key check will fix this. 5 will be yielded when
visiting version 1.

Next, version 3 will be visited. There's no easy way to know that we
already visited 3, so we'll follow the chain all the way to 5, and,
again, it will be yielded. Here, the key won't match, so maybe this
case could be filtered out with a key check.

Finally, version 4 will be visited, the chain will be followed to 5,
and as in the first case, it will be yielded.

So the scan yielded the same heap tuple thrice, which isn't an ideal situation.

It would be totally acceptable for a bitmap index scan, so the
expensive logic I will be outlining below may be skipped when building
a bitmap (lossy or not).

To fix this, what I w

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 09:58:29AM -0700, Andres Freund wrote:
> On 2016-08-04 12:55:25 -0400, Bruce Momjian wrote:
> > On Thu, Aug  4, 2016 at 09:31:18AM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> > > > Indexes whose values do not change do not require new index pointers. 
> > > > Only
> > > > the index whose key is being changed will need a new index entry. The 
> > > > new
> > > > index entry will be set to the CTID of the root line pointer.
> > > 
> > > That seems to require tracing all hot-chains in a page, to actually
> > > figure out what the root line pointer of a warm-updated HOT tuple is,
> > > provided it's HOT_UPDATED itself.  Or have you found a smart way to
> > > figure that out?
> > 
> > The index points to the head of the HOT chain, and you just walk the
> > chain on the page --- on need to look at other chains on the page.
> 
> When doing an update, you'll need to re-find the root tuple, to insert
> the root ctid for the new index tuples.

Ah, I had not thought of that --- good point.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Simon Riggs

On 4 August 2016 at 17:31, Andres Freund  wrote:
> Hi,
>
> On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
>> Indexes whose values do not change do not require new index pointers. Only
>> the index whose key is being changed will need a new index entry. The new
>> index entry will be set to the CTID of the root line pointer.
>
> That seems to require tracing all hot-chains in a page, to actually
> figure out what the root line pointer of a warm-updated HOT tuple is,
> provided it's HOT_UPDATED itself.  Or have you found a smart way to
> figure that out?

Hmm, sorry, I did think of that point and I thought I had added it to the doc.

So, yes, I agree - re-locating the root is the biggest downside to
this idea. Perhaps Pavan has other thoughts?

I think its doable, but it will be fiddly.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 04:29:09PM +0530, Pavan Deolasee wrote:
> Write Amplification Reduction Method (WARM)
> 
> 
> A few years back, we developed HOT to address the problem associated with MVCC
> and frequent updates and it has served us very well. But in the light of 
> Uber's
> recent technical blog highlighting some of the problems that are still
> remaining, especially for workloads where HOT does not help to the full 
> extent,
> Simon, myself and others at 2ndQuadrant discussed several ideas and what I
> present below is an outcome of that. This is not to take away credit from
> anybody else. Others may have thought about similar ideas, but I haven’t seen
> any concrete proposal so far.

HOT was a huge win for Postgres and I am glad you are reviewing
improvements.

> This method succeeds in reducing the write amplification, but causes other
> issues which also need to be solved. WARM breaks the invariant that all tuples
> in a HOT chain have the same index values and so an IndexScan would need to
> re-check the index scan conditions against the visible tuple returned from
> heap_hot_search(). We must still check visibility, so we only need to re-check
> scan conditions on that one tuple version.
>   
> We don’t want to force a recheck for every index fetch because that will slow
> everything down. So we would like a simple and efficient way of knowing about
> the existence of a WARM tuple in a chain and do a recheck in only those cases,
> ideally O(1). Having a HOT chain contain a WARM tuple is discussed below as
> being a “WARM chain”, implying it needs re-check.

In summary, we are already doing visibility checks on the HOT chain, so
a recheck if the heap tuple matches the index value is only done at most
on the one visible tuple in the chain, not ever tuple in the chain.

> 2. Mark the root line pointer (or the root tuple) with a special
> HEAP_RECHECK_REQUIRED flag to tell us about the presence of a WARM tuple in 
> the
> chain. Since all indexes point to the root line pointer, it should be enough 
> to
> just mark the root line pointer (or tuple) with this flag. 

Yes, I think #2 is the easiest.  Also, if we modify the index page, we
would have to WAL the change and possibly WAL log the full page write
of the index page.  :-(

> Approach 2 seems more reasonable and simple. 
> 
> There are only 2 bits for lp_flags and all combinations are already used. But
> for LP_REDIRECT root line pointer, we could use the lp_len field to store this
> special flag, which is not used for LP_REDIRECT line pointers. So we are able
> to mark the root line pointer.

Uh, as I understand it, we only use LP_REDIRECT when we have _removed_
the tuple that the ctid was pointing to, but it seems you would need to
set HEAP_RECHECK_REQUIRED earlier than that.

Also, what is currently in the lp_len field for LP_REDIRECT?  Zeros, or
random data?  I am asking for pg_upgrade purposes.

> One idea is to somehow do this as part of the vacuum where we collect root 
> line
> pointers of  WARM chains during the first phase of heap scan, check the 
> indexes
> for all such tuples (may be combined with index vacuum) and then clear the 
> heap
> flags during the second pass, unless new tuples are added to the WARM chain. 
> We
> can detect that by checking that all tuples in the WARM chain still have XID
> less than the OldestXmin that VACUUM is using.

Yes, it seems natural to clear the ctid HEAP_RECHECK_REQUIRED flag where
you are adjusting the HOT chain anyway.

> It’s important to ensure that the flag is set when it is absolutely necessary,
> while having false positives is not a problem. We might do a little wasteful
> work if the flag is incorrectly set. Since flag will be set only during
> heap_update() and the operation is already WAL logged, this can be piggybacked
> with the heap_update WAL record. Similarly, when a root tuple is pruned to a
> redirect line pointer, the operation is already WAL logged and we can 
> piggyback
> setting of line pointer flag with that WAL record.
> 
> Flag clearing need not be WAL logged, unless we can piggyback that to some
> existing WAL logging.

Agreed, good point.

Very nice!

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Lossy Index Tuple Enhancement (LITE)

2016-08-04 Thread Simon Riggs

On 4 August 2016 at 02:13, Bruce Momjian  wrote:

> How do plan to clear the bitmask so it, over time, doesn't end up being
> all-set?

I don't have a plan, though thoughts welcome.

Similar situation that our current indexes don't recover from bloat, a
situation made worse by non-HOT updates.

So, we have a choice of which disadvantage to accept.

> Also, why not use this bitmap for all indexes, not just update chains?

I don't understand where you get this update chains thing from.

The bitmap can apply to multiple tuples on one page, which is described.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Andres Freund

On 2016-08-04 12:55:25 -0400, Bruce Momjian wrote:
> On Thu, Aug  4, 2016 at 09:31:18AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> > > Indexes whose values do not change do not require new index pointers. Only
> > > the index whose key is being changed will need a new index entry. The new
> > > index entry will be set to the CTID of the root line pointer.
> > 
> > That seems to require tracing all hot-chains in a page, to actually
> > figure out what the root line pointer of a warm-updated HOT tuple is,
> > provided it's HOT_UPDATED itself.  Or have you found a smart way to
> > figure that out?
> 
> The index points to the head of the HOT chain, and you just walk the
> chain on the page --- on need to look at other chains on the page.

When doing an update, you'll need to re-find the root tuple, to insert
the root ctid for the new index tuples.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Lossy Index Tuple Enhancement (LITE)

2016-08-04 Thread Simon Riggs

On 3 August 2016 at 20:37, Claudio Freire  wrote:
> On Wed, Aug 3, 2016 at 4:20 AM, Simon Riggs  wrote:
>> == IndexScan ==
>>
>> Note that the executor code for IndexScan appears identical between
>> the two optimizations. The difference between duplicate and range LITE
>> tuples is needed only at INSERT time (or UPDATE indexed column to a
>> new value).
>>
>> When we do an IndexScan if we see a LITE tuple we do a scan of the
>> linepointer ranges covered by this index tuple (which might eventually
>> go to a full block scan). For example with bit1 set we would scan 16
>> linepointers (on an 8192 byte block) and with 2 bits set we would scan
>> 32 linepointers etc.., not necessarily consecutive ranges. So the
>> IndexScan can return potentially many heap rows per index entry, which
>> need to be re-checked and may also need to be sorted prior to
>> returning them. If no rows are returned we can kill the index pointer,
>> just as we do now for btrees, so they will be removed eventually from
>> the index without the need for VACUUM. An BitmapIndexScan that sees a
>> lossy pointer can also use the lossy TID concept, so we can have
>> partially lossy bitmaps.
>
> Wouldn't this risk scanning rows more than once unless it's part of a
> bitmap scan?

It's always the job of the index insert logic to ensure a valid index exists.

I'm not sure what you see that changes that here. Say more?

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian

On Thu, Aug  4, 2016 at 09:31:18AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> > Indexes whose values do not change do not require new index pointers. Only
> > the index whose key is being changed will need a new index entry. The new
> > index entry will be set to the CTID of the root line pointer.
> 
> That seems to require tracing all hot-chains in a page, to actually
> figure out what the root line pointer of a warm-updated HOT tuple is,
> provided it's HOT_UPDATED itself.  Or have you found a smart way to
> figure that out?

The index points to the head of the HOT chain, and you just walk the
chain on the page --- on need to look at other chains on the page.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Why we lost Uber as a user

2016-08-04 Thread Alfred Perlstein




On 8/4/16 2:00 AM, Torsten Zuehlsdorff wrote:



On 03.08.2016 21:05, Robert Haas wrote:

On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane  wrote:

Robert Haas  writes:

I don't think they are saying that logical replication is more
reliable than physical replication, nor do I believe that to be true.
I think they are saying that if logical corruption happens, you can
fix it by typing SQL statements to UPDATE, INSERT, or DELETE the
affected rows, whereas if physical corruption happens, there's no
equally clear path to recovery.


Well, that's not an entirely unreasonable point, but I dispute the
implication that it makes recovery from corruption an easy thing to do.
How are you going to know what SQL statements to issue?  If the master
database is changing 24x7, how are you going to keep up with that?


I think in many cases people fix their data using business logic.  For
example, suppose your database goes down and you have to run
pg_resetxlog to get it back up.  You dump-and-restore, as one does,
and find that you can't rebuild one of your unique indexes because
there are now two records with that same PK.  Well, what you do is you
look at them and judge which one has the correct data, often the one
that looks more complete or the one with the newer timestamp. Or,
maybe you need to merge them somehow.  In my experience helping users
through problems of this type, once you explain the problem to the
user and tell them they have to square it on their end, the support
call ends.  The user may not always be entirely thrilled about having
to, say, validate a problematic record against external sources of
truth, but they usually know how to do it.  Database bugs aren't the
only way that databases become inaccurate.  If the database that they
use to keep track of land ownership in the jurisdiction where I live
says that two different people own the same piece of property,
somewhere there is a paper deed in a filing cabinet.  Fishing that out
to understand what happened may not be fun, but a DBA can explain that
problem to other people in the organization and those people can get
it fixed.  It's a problem, but it's fixable.

On the other hand, if a heap tuple contains invalid infomask bits that
cause an error every time you read the page (this actually happened to
an EnterpriseDB customer!), the DBA can't tell other people how to fix
it and can't fix it personally either.  Instead, the DBA calls me.


After reading this statement the ZFS filesystem pops into my mind. It 
has protection build in against various problems (data degradation, 
current spikes, phantom writes, etc).


For me this raises two questions:

1) would the usage of ZFS prevent such errors?

My feeling would say yes, but i have no idea about how a invalid 
infomask bit could occur.


2) would it be possible to add such prevention to PostgreSQL

I know this could add a massive overhead, but it its optional this 
could be a fine thing?
Postgresql is very "zfs-like" in its internals.  The problem was a bug 
in postgresql that caused it to just write data to the wrong place.


Some vendors use ZFS under databases to provide very cool services such 
as backup snapshots, test snapshots and other such uses.  I think Joyent 
is one such vendor but I'm not 100% sure.


-Alfred


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Andres Freund

Hi,

On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> Indexes whose values do not change do not require new index pointers. Only
> the index whose key is being changed will need a new index entry. The new
> index entry will be set to the CTID of the root line pointer.

That seems to require tracing all hot-chains in a page, to actually
figure out what the root line pointer of a warm-updated HOT tuple is,
provided it's HOT_UPDATED itself.  Or have you found a smart way to
figure that out?

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Design for In-Core Logical Replication

2016-08-04 Thread Simon Riggs

On 29 July 2016 at 16:53, Robert Haas  wrote:

> I think that to really understand exactly what you and Petr have in
> mind, we'd need a description of where publication and subscription
> data is stored within the server, and exactly what gets stored.
> Perhaps that will come in a later email.  I'm not bashing the design,
> exactly, I just can't quite see how all of the pieces fit together
> yet.

Sure no problem.

It's clear there are about 12 ways of putting this together and each
way has various terms needed.

We need input from many people to get this right, since this is much
more about UI than system architecture.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] improved DefElem list processing

2016-08-04 Thread Peter Eisentraut

Here are two WIP patches to improve the DefElem list processing that is
used by many utility commands.

One factors out the duplicate checks, which are currently taking up a
lot of space with duplicate code.  I haven't applied this everywhere
yet, but the patch shows how much boring code can be saved.

The other adds a location field to the DefElem node.  This allows
showing an error pointer if there is a problem in one of the options,
which can be useful for complex commands such as COPY or CREATE USER.

At the moment, these patches are independent and don't work together,
but the second one would become much easier if the first one is accepted.

If these ideas are acceptable, I'll produce a complete patch series.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 89af91f91944ba70c538d249f042eebf14fdd2cd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 4 Aug 2016 11:29:27 -0400
Subject: [PATCH] Add location field to DefElem

---
 contrib/file_fdw/file_fdw.c |  5 ++-
 src/backend/commands/copy.c | 93 +
 src/backend/commands/sequence.c | 36 +---
 src/backend/commands/user.c | 41 +++---
 src/backend/nodes/copyfuncs.c   |  1 +
 src/backend/nodes/equalfuncs.c  |  2 +
 src/backend/nodes/makefuncs.c   |  1 +
 src/backend/nodes/outfuncs.c|  1 +
 src/backend/nodes/readfuncs.c   |  1 +
 src/backend/parser/gram.y   | 44 +++
 src/backend/tcop/utility.c  | 36 +---
 src/include/commands/copy.h |  7 ++--
 src/include/commands/sequence.h |  5 ++-
 src/include/commands/user.h |  3 +-
 src/include/nodes/parsenodes.h  |  1 +
 15 files changed, 190 insertions(+), 87 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index c049131..85d9913 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -293,7 +293,7 @@ file_fdw_validator(PG_FUNCTION_ARGS)
 	/*
 	 * Now apply the core COPY code's validation logic for more checks.
 	 */
-	ProcessCopyOptions(NULL, true, other_options);
+	ProcessCopyOptions(NULL, NULL, true, other_options);
 
 	/*
 	 * Filename option is required for file_fdw foreign tables.
@@ -632,7 +632,8 @@ fileBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Create CopyState from FDW options.  We always acquire all columns, so
 	 * as to match the expected ScanTupleSlot signature.
 	 */
-	cstate = BeginCopyFrom(node->ss.ss_currentRelation,
+	cstate = BeginCopyFrom(NULL,
+		   node->ss.ss_currentRelation,
 		   filename,
 		   false,
 		   NIL,
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f45b330..1f85cda 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -279,12 +279,12 @@ static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
 
 
 /* non-export function prototypes */
-static CopyState BeginCopy(bool is_from, Relation rel, Node *raw_query,
-		  const char *queryString, const Oid queryRelId, List *attnamelist,
+static CopyState BeginCopy(ParseState *pstate, bool is_from, Relation rel, Node *raw_query,
+		   const Oid queryRelId, List *attnamelist,
 		  List *options);
 static void EndCopy(CopyState cstate);
 static void ClosePipeToProgram(CopyState cstate);
-static CopyState BeginCopyTo(Relation rel, Node *query, const char *queryString,
+static CopyState BeginCopyTo(ParseState *pstate, Relation rel, Node *query,
 			const Oid queryRelId, const char *filename, bool is_program,
 			List *attnamelist, List *options);
 static void EndCopyTo(CopyState cstate);
@@ -787,7 +787,7 @@ CopyLoadRawBuf(CopyState cstate)
  * the table or the specifically requested columns.
  */
 Oid
-DoCopy(const CopyStmt *stmt, const char *queryString, uint64 *processed)
+DoCopy(ParseState *pstate, const CopyStmt *stmt, uint64 *processed)
 {
 	CopyState	cstate;
 	bool		is_from = stmt->is_from;
@@ -936,7 +936,7 @@ DoCopy(const CopyStmt *stmt, const char *queryString, uint64 *processed)
 			PreventCommandIfReadOnly("COPY FROM");
 		PreventCommandIfParallelMode("COPY FROM");
 
-		cstate = BeginCopyFrom(rel, stmt->filename, stmt->is_program,
+		cstate = BeginCopyFrom(pstate, rel, stmt->filename, stmt->is_program,
 			   stmt->attlist, stmt->options);
 		cstate->range_table = range_table;
 		*processed = CopyFrom(cstate);	/* copy from file to database */
@@ -944,7 +944,7 @@ DoCopy(const CopyStmt *stmt, const char *queryString, uint64 *processed)
 	}
 	else
 	{
-		cstate = BeginCopyTo(rel, query, queryString, relid,
+		cstate = BeginCopyTo(pstate, rel, query, relid,
 			 stmt->filename, stmt->is_program,
 			 stmt->attlist, stmt->options);
 		*processed = DoCopyTo(cstate);	/* copy from database to file */
@@ -980,7 +980,8 @@ DoCopy(const CopyStmt *stmt, const char *queryString, uint64 *processed)
  * self-consistency of the options list.
  */
 void
-ProcessCopyOptions(CopyS

Re: [HACKERS] regression test for extended query protocol

2016-08-04 Thread Alvaro Herrera

Michael Paquier wrote:
> On Thu, Aug 4, 2016 at 12:14 AM, Dave Cramer  wrote:
> > We currently run tests every time a PR is created on github, but I don't
> > think there are any animals running the JDBC test suite
> >
> > We can add tests, what exactly do we want to test. Then setting up an animal
> > to run the tests would be fairly straight forward.
> 
> It may be interesting to implement that as a module in the buildfarm
> client I think that any animal could include at will. Just a thought.

Implementing things as buildfarm modules gets boring, though -- consider
for instance that src/test/recovery is only being run by hamster, and
only because you patched the buildfarm client; only Andrew can influence
buildfarm animals into running that test by integrating the module in
the client script, and even that is limited by having to request animal
caretakers to upgrade that script.

If somebody had some spare time to devote to this, I would suggest to
implement something in core that can be used to specify a list of
commands to run, and a list of files-to-be-saved-in-bf-log emitted by
each command.  We could have one such base file in the core repo that
specifies some tests to run (such as "make -C src/test/recovery check"),
and an additional file can be given by buildfarm animals to run custom
tests, without having to write BF modules for each thing.  With that,
pgsql committers could simply add a new test to be run by all buildfarm
animals by adding it to the list in core.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] handling unconvertible error messages

2016-08-04 Thread Victor Wagner

On Thu, 04 Aug 2016 09:42:10 -0400
Tom Lane  wrote:

> Victor Wagner  writes:
> > If error occurs during processing of StartMessage protocol message,
> > i.e. client request connection to unexisting database,
> > ErrorResponse would contain message in the server default locale,
> > despite of client encoding being specified in the StartMessage.  
> 
> Yeah.  I'm inclined to think that we should reset the message locale
> to C as soon as we've forked away from the postmaster, and leave it
> that way until we've absorbed settings from the startup packet.
> Sending messages of this sort in English isn't great, but it's better
> than sending completely-unreadable ones.  Or is that just my
> English-centricity showing?

From my russian point of view, english messages are definitely better
than transliteration of Russian  with latin letters (although it is
not completely unreadable), not to mention wrong encoding or lots of
question marks.

Really, if this response is sent after backend has been forked, problem
probably can be easily fixed better way - StartupMessage contain
information about desired client encoding, so this information just
should be processed earlier than any other information from this
message, which can cause errors (such as database name).

If this errors are sent from postmaster itself, things are worse,
because I don't think that locale subsystem is desined to be
reintitalized lots of times in the same process. 
But postmaster itself can use non-localized messaging. Its messages in
the logs are typically analyzed by more or less qualified DBA and
system admistrators, not by end user.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Vladimir Sitnikov

Tom Lane :

> [ shrug... ]  What do you claim is not machine-readable about
> server_version?
>

0) server_version needs a dance to parse.
For instance, recent "Stamp version 10devel" patch did touch
"server_version" parsing in fe-exec.c:
https://github.com/vlsi/postgres/pull/2/files#diff-2df5cad06efe4485ad362b0eb765cec0L986
Of course it might happen there was just a bug in fe-exec.c, however that
is a smell. Lots of clients might need to update server_version parsing
logic for no reason except "support 10devel kind of versions".
There are cases when the dance is locale-specific:
https://github.com/pgjdbc/pgjdbc/pull/190

1) By saying "just parse server_version" you are basically forcing every
postgresql client (except libpq) to implement, test, and support its own
parse logic. Do you care of having robust clients that will not break in
parts when tested against 10.whatever?

1) Several examples were provided by Craig:
https://www.postgresql.org/message-id/CAMsr%2BYFt1NcjseExt_Ov%2Bfrk0Jzb0-DKqYKt8ALzVEXHBM0jKg%40mail.gmail.com
Craig> This means that at least when connecting to newer servers clients no
longer have to do any stupid dances around parsing "9.5beta1",
"9.4.0mycustompatchedPg"

2) Official documentation suggests "see also server_version_num for a
machine-readable version."
https://www.postgresql.org/docs/9.5/static/functions-info.html

Even though "one can simply try to parse server_version value", I claim
that server_version_num is much more machine-readable than server_version
one.

Vladimir

Re: [HACKERS] Hash Indexes

2016-08-04 Thread Mithun Cy

I did some basic testing of same. In that I found one issue with cursor.

+BEGIN;

+SET enable_seqscan = OFF;

+SET enable_bitmapscan = OFF;

+CREATE FUNCTION declares_cursor(int)

+ RETURNS void

+ AS 'DECLARE c CURSOR FOR SELECT * from con_hash_index_table WHERE keycol
= $1;'

+LANGUAGE SQL;

+

+SELECT declares_cursor(1);

+MOVE FORWARD ALL FROM c;

+MOVE BACKWARD 1 FROM c;

+ CLOSE c;

+ WARNING: buffer refcount leak: [5835] (rel=base/16384/30537,
blockNum=327, flags=0x9380, refcount=1 1)

ROLLBACK;

closing the cursor comes with the warning which say we forgot to unpin the
buffer.

I have also added tests [1] for coverage improvements.

[1] Some tests to cover hash_index.



On Thu, Jul 14, 2016 at 4:33 PM, Amit Kapila 
wrote:

> On Wed, Jun 22, 2016 at 8:48 PM, Robert Haas 
> wrote:
> > On Wed, Jun 22, 2016 at 5:14 AM, Amit Kapila 
> wrote:
> >> We can do it in the way as you are suggesting, but there is another
> thing
> >> which we need to consider here.  As of now, the patch tries to finish
> the
> >> split if it finds split-in-progress flag in either old or new bucket.
> We
> >> need to lock both old and new buckets to finish the split, so it is
> quite
> >> possible that two different backends try to lock them in opposite order
> >> leading to a deadlock.  I think the correct way to handle is to always
> try
> >> to lock the old bucket first and then new bucket.  To achieve that, if
> the
> >> insertion on new bucket finds that split-in-progress flag is set on a
> >> bucket, it needs to release the lock and then acquire the lock first on
> old
> >> bucket, ensure pincount is 1 and then lock new bucket again and ensure
> that
> >> pincount is 1. I have already maintained the order of locks in scan (old
> >> bucket first and then new bucket; refer changes in _hash_first()).
> >> Alternatively, we can try to  finish the splits only when someone tries
> to
> >> insert in old bucket.
> >
> > Yes, I think locking buckets in increasing order is a good solution.
> > I also think it's fine to only try to finish the split when the insert
> > targets the old bucket.  Finishing the split enables us to remove
> > tuples from the old bucket, which lets us reuse space instead of
> > accelerating more.  So there is at least some potential benefit to the
> > backend inserting into the old bucket.  On the other hand, a process
> > inserting into the new bucket derives no direct benefit from finishing
> > the split.
> >
>
> Okay, following this suggestion, I have updated the patch so that only
> insertion into old-bucket can try to finish the splits.  Apart from
> that, I have fixed the issue reported by Mithun upthread.  I have
> updated the README to explain the locking used in patch.   Also, I
> have changed the locking around vacuum, so that it can work with
> concurrent scans when ever possible.  In previous patch version,
> vacuum used to take cleanup lock on a bucket to remove the dead
> tuples, moved-due-to-split tuples and squeeze operation, also it holds
> the lock on bucket till end of cleanup.  Now, also it takes cleanup
> lock on a bucket to out-wait scans, but it releases the lock as it
> proceeds to clean the overflow pages.  The idea is first we need to
> lock the next bucket page and  then release the lock on current bucket
> page.  This ensures that any concurrent scan started after we start
> cleaning the bucket will always be behind the cleanup.  Allowing scans
> to cross vacuum will allow it to remove tuples required for sanctity
> of scan.  Also for squeeze-phase we are just checking if the pincount
> of buffer is one (we already have Exclusive lock on buffer of bucket
> by that time), then only proceed, else will try to squeeze next time
> the cleanup is required for that bucket.
>
> Thoughts/Suggestions?
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-04 Thread Pavel Stehule

2016-08-04 15:37 GMT+02:00 Amit Kapila :

> On Wed, Aug 3, 2016 at 5:09 PM, Pavel Stehule 
> wrote:
> >
> >
> > 2016-08-03 12:16 GMT+02:00 Rahila Syed :
> >>
> >> Should changing the value from OFF to ON automatically either commit or
> >> rollback transaction in progress?
> >>
> >>
> >> FWIW, running  set autocommit through ecpg commits the ongoing
> transaction
> >> when autocommit is set to ON from OFF. Should such behaviour be
> implemented
> >> for \set AUTOCOMMIT ON as well?
> >
> >
> > I dislike automatic commit or rollback here.
> >
>
> What problem you see with it, if we do so and may be mention the same
> in docs as well.  Anyway, I think we should make the behaviour of both
> ecpg and psql same.
>

Implicit COMMIT can be dangerous - ROLLBACK can be unfriendly surprising.


> > What about raising warning if
> > some transaction is open?
> >
>
> Not sure what benefit we will get by raising warning.  I think it is
> better to choose one behaviour (automatic commit or leave the
> transaction open as is currently being done in psql) and make it
> consistent across all clients.
>

I am not sure about value of ecpg for this case. It is used by 0.0001%
users. Probably nobody in Czech Republic knows this client.

Warnings enforce the user do some decision - I don't think so we can do
this decision well.

Regards

Pavel



>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>

Re: [HACKERS] max_parallel_degree > 0 for 9.6 beta

2016-08-04 Thread David G. Johnston

On Thu, Aug 4, 2016 at 9:25 AM, Robert Haas  wrote:

> On Thu, Aug 4, 2016 at 12:56 AM, Tom Lane  wrote:
> > Noah Misch  writes:
> >> On Wed, Apr 20, 2016 at 02:28:15PM -0400, Tom Lane wrote:
> >>> +1, but let's put an entry on the 9.6 open-items page to remind us to
> >>> make that decision at the right time.
> >
> >> It's that time.  Do we restore the max_parallel_workers_per_gather=0
> default,
> >> or is enabling this by default the right thing after all?
> >
> > At this point I'd have to vote against enabling by default in 9.6.  The
> > fact that in the past week we've found bugs as bad as e1a93dd6a does not
> > give me a warm fuzzy feeling about the parallel-query code being ready
> > for prime time.
> >
> > Of course the question is how do we ever get to that point if we chicken
> > out with enabling it by default now.  Maybe we could keep it turned on
> > in HEAD.
>
> However, we have a long track
> record of being cautious about things like this, so I would be OK with
> the idea of disabling this in the 9.6 branch and leaving it turned on
> in master, with the hope of shipping it enabled-by-default in v10.


My initial reaction was +1 but now I'm leaning toward enabled by default.

Those who would upgrade to 9.6 within a year of its release are most
likely, process and personality wise, to be those for whom the risks and
rewards of new features is part of their everyday routine.

If indeed we release 10.0 with it enabled by default we should be confident
in its stability at that point in the 9.6.x branch.

David J.

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Tom Lane

Vladimir Sitnikov  writes:
>> Sorry, but I don't buy that.  I think sending both server_version and
>> server_version_num would be silly, and we're certainly not going to stop
>> sending server_version.

> What is wrong with sending machine-readable value?

[ shrug... ]  What do you claim is not machine-readable about
server_version?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] "Some tests to cover hash_index"

2016-08-04 Thread Mithun Cy

I am attaching the patch to improve some coverage of hash index code [1].
I have added some basic tests, which mainly covers overflow pages. It took
2 sec extra time in my machine in parallel schedule.




Hit Total Coverage
old tests Line Coverage 780 1478 52.7

Function Coverage 63 85 74.1
improvement after tests Line Coverage 1181 1478 79.9 %

Function Coverage 78 85 91.8 %


[1]Concurrent Hash Index. 

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


commit-hash_coverage_test
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Vladimir Sitnikov

>
> Sorry, but I don't buy that.  I think sending both server_version and
> server_version_num would be silly, and we're certainly not going to stop
> sending server_version.
>

What is wrong with sending machine-readable value?

Vladimir

Re: [HACKERS] [RFC] Change the default of update_process_title to off

2016-08-04 Thread Tom Lane

Andres Freund  writes:
> On 2016-08-04 16:48:11 +0900, Michael Paquier wrote:
>> Here is a different proposal: documenting instead that disabling that
>> parameter on Windows can improve performance, at the cost of losing
>> information verbosity for processes.

> The benefit on windows seems pretty marginal, given the way it has to be
> viewed. People installing processexplorer et. al. to view a handle that
> have to know about, will be able to change the config.

Yeah, I think I agree.  It would be bad to disable it by default on
Unix, because ps(1) is a very standard tool there, but the same argument
doesn't hold for Windows.

Another route to a solution would be to find a cheaper way to update
the process title on Windows ... has anyone looked for alternatives?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Tom Lane

Craig Ringer  writes:
> On 4 August 2016 at 12:45, Tom Lane  wrote:
>> Craig Ringer  writes:
>>> Well, this seems like a good time to make server_version_num GUC_REPORT
>>> as well...

>> To what end?  Existing versions of libpq wouldn't know about it, and new
>> versions of libpq couldn't rely on it to get reported by older servers,
>> so it'd still be the path of least resistance to examine server_version.

> Because it's really silly that we don't,

Sorry, but I don't buy that.  I think sending both server_version and
server_version_num would be silly, and we're certainly not going to stop
sending server_version.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] handling unconvertible error messages

2016-08-04 Thread Tom Lane

Victor Wagner  writes:
> If error occurs during processing of StartMessage protocol message,
> i.e. client request connection to unexisting database,
> ErrorResponse would contain message in the server default locale,
> despite of client encoding being specified in the StartMessage.

Yeah.  I'm inclined to think that we should reset the message locale
to C as soon as we've forked away from the postmaster, and leave it
that way until we've absorbed settings from the startup packet.
Sending messages of this sort in English isn't great, but it's better
than sending completely-unreadable ones.  Or is that just my
English-centricity showing?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-04 Thread Amit Kapila

On Wed, Aug 3, 2016 at 5:09 PM, Pavel Stehule  wrote:
>
>
> 2016-08-03 12:16 GMT+02:00 Rahila Syed :
>>
>> Should changing the value from OFF to ON automatically either commit or
>> rollback transaction in progress?
>>
>>
>> FWIW, running  set autocommit through ecpg commits the ongoing transaction
>> when autocommit is set to ON from OFF. Should such behaviour be implemented
>> for \set AUTOCOMMIT ON as well?
>
>
> I dislike automatic commit or rollback here.
>

What problem you see with it, if we do so and may be mention the same
in docs as well.  Anyway, I think we should make the behaviour of both
ecpg and psql same.

> What about raising warning if
> some transaction is open?
>

Not sure what benefit we will get by raising warning.  I think it is
better to choose one behaviour (automatic commit or leave the
transaction open as is currently being done in psql) and make it
consistent across all clients.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] max_parallel_degree > 0 for 9.6 beta

2016-08-04 Thread Robert Haas

On Thu, Aug 4, 2016 at 12:56 AM, Tom Lane  wrote:
> Noah Misch  writes:
>> On Wed, Apr 20, 2016 at 02:28:15PM -0400, Tom Lane wrote:
>>> +1, but let's put an entry on the 9.6 open-items page to remind us to
>>> make that decision at the right time.
>
>> It's that time.  Do we restore the max_parallel_workers_per_gather=0 default,
>> or is enabling this by default the right thing after all?
>
> At this point I'd have to vote against enabling by default in 9.6.  The
> fact that in the past week we've found bugs as bad as e1a93dd6a does not
> give me a warm fuzzy feeling about the parallel-query code being ready
> for prime time.
>
> Of course the question is how do we ever get to that point if we chicken
> out with enabling it by default now.  Maybe we could keep it turned on
> in HEAD.

What do other people think about this topic?

Personally, I've found the process of fixing the parallel query bugs
rather exhausting, and if we leave it turned on by default in 9.6,
we'll probably get a lot more of those bug reports a lot sooner than
we will otherwise.  But I'd kind of rather get it out of the way than
put it off: those bugs need to be fixed at some point, and we won't
find them if nobody runs the code.  However, we have a long track
record of being cautious about things like this, so I would be OK with
the idea of disabling this in the 9.6 branch and leaving it turned on
in master, with the hope of shipping it enabled-by-default in v10.  I
think it would not be good to leave disabled it by default in master
-- the code won't get much testing then, we won't find the bugs, and
we'll build more stuff on top of what we've already got without
finding the cracks in the foundation.  I think users will be more
inclined to forgive us for parallel query bugs in 2016 than in 2026;
better to debug it before the honeymoon wears off.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-04 Thread Geoff Winkless

On 30 July 2016 at 13:42, Tomas Vondra  wrote:
> I'd argue that if you mess with catalogs directly, you're on your own.

Interesting. What would you suggest people use instead?

Geoff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-04 Thread Aleksander Alekseev

Thanks everyone for your remarks and comments!

Here is an improved version of a patch.

Main changes:
* Patch passes `make installcheck`
* Code is fully commented, also no more TODO's

I wish I sent this version of a patch last time. Now I realize it was
really hard to read and understand. Hope I managed to correct this
flaw. If you believe that some parts of the code are still poorly
commented or could be improved in any other way please let me know.

And as usual, any other comments, remarks or questions are highly
appreciated!

-- 
Best regards,
Aleksander Alekseev
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 8f5332a..ac272e1 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1693,7 +1693,7 @@
   
   
p = permanent table, u = unlogged table,
-   t = temporary table
+   t = temporary table, f = fast temporary table
   
  
 
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index 1fa6de0..56de4dc 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/common
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = heaptuple.o indextuple.o printtup.o reloptions.o scankey.o \
+OBJS = fasttab.o heaptuple.o indextuple.o printtup.o reloptions.o scankey.o \
 	tupconvert.o tupdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/fasttab.c b/src/backend/access/common/fasttab.c
new file mode 100644
index 000..610790d
--- /dev/null
+++ b/src/backend/access/common/fasttab.c
@@ -0,0 +1,1884 @@
+/*-
+ *
+ * fasttab.c
+ *	  virtual catalog and fast temporary tables
+ *
+ * This file contents imlementation of special type of temporary tables ---
+ * fast temporary tables (FTT). From user perspective they work exactly as
+ * regular temporary tables. However there are no records about FTTs in
+ * pg_catalog. These records are stored in backend's memory instead and mixed
+ * with regular records during scans of catalog tables. We refer to
+ * corresponding tuples of catalog tables as "in-memory" or "virtual" tuples
+ * and to all these tuples together --- as "in-memory" or "virtual" catalog.
+ *
+ * Note that since temporary tables are visiable only in one session there is
+ * no need to use shared memory or locks for FTTs. Transactions support is
+ * very simple too. There is no need to track xmin/xmax, etc.
+ *
+ * FTTs are designed to to solve pg_catalog bloating problem. The are
+ * applications that create and delete a lot of temporary tables. It causes
+ * bloating of pg_catalog and running auto vacuum on it. It's quite an
+ * expensive operation that affects entire database performance.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/common/fasttab.c
+ *
+ *-
+ */
+
+#include "c.h"
+#include "postgres.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "access/amapi.h"
+#include "access/fasttab.h"
+#include "access/relscan.h"
+#include "access/valid.h"
+#include "access/sysattr.h"
+#include "access/htup_details.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_depend.h"
+#include "catalog/pg_inherits.h"
+#include "catalog/pg_statistic.h"
+#include "storage/bufmgr.h"
+#include "utils/rel.h"
+#include "utils/inval.h"
+#include "utils/memutils.h"
+
+/*
+		  TYPEDEFS, MACRO DECLARATIONS AND CONST STATIC VARIABLES
+ */
+
+/* #define FASTTAB_DEBUG 1 */
+
+#ifdef FASTTAB_DEBUG
+static int32 fasttab_scan_tuples_counter = -1;
+#endif
+
+/* List of in-memory tuples. */
+typedef struct
+{
+	dlist_node	node;
+	HeapTuple	tup;
+}	DListHeapTupleData;
+
+typedef DListHeapTupleData *DListHeapTuple;
+
+/* Like strcmp but for integer types --- int, uint32, Oid, etc. */
+#define FasttabCompareInts(x, y) ( (x) == (y) ? 0 : ( (x) > (y) ? 1 : -1 ))
+
+/* Forward declaration is required for relation_is_inmem_tuple_function */
+struct FasttabSnapshotData;
+typedef struct FasttabSnapshotData *FasttabSnapshot;
+
+/* Predicate that determines whether given tuple should be stored in-memory */
+typedef bool (*relation_is_inmem_tuple_function)
+			(Relation relation, HeapTuple tup, FasttabSnapshot fasttab_snapshot,
+		 int tableIdx);
+
+/* Capacity of FasttabRelationMethods->attrNumbers, see below */
+#define FasttabRelationMaxOidAttributes 2
+
+/* FasttabRelationMethodsTable entry */
+typedef const struct
+{
+	/* relation oid */
+	Oid			relationId;
+	/* predicate that determines whether tuple s

Re: [HACKERS] Bug in WaitForBackgroundWorkerShutdown() [REL9_5_STABLE]

2016-08-04 Thread Yury Zhuravlev


Dmitry Ivanov wrote:

Recently I've encountered a strange crash while calling elog(ERROR, "...")
after the WaitForBackgroundWorkerShutdown() function. It turns out that
_returns_ inside the PG_TRY()..PG_CATCH() block are *highly* undesirable,
since they leave PG_exception_stack pointing to a local struct in a dead
frame, which is an obvious UB. I've attached a patch which fixes this
behavior in the aforementioned function, but there might be more errors
like that elsewhere.


Forgot some curly braces, my bad. v2 attached.



Good catch. But in 9.6 it has been fixed by 
db0f6cad4884bd4c835156d3a720d9a79dbd63a9 commit. 



--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Bug in WaitForBackgroundWorkerShutdown() [REL9_5_STABLE]

2016-08-04 Thread Dmitry Ivanov

> Recently I've encountered a strange crash while calling elog(ERROR, "...")
> after the WaitForBackgroundWorkerShutdown() function. It turns out that
> _returns_ inside the PG_TRY()..PG_CATCH() block are *highly* undesirable,
> since they leave PG_exception_stack pointing to a local struct in a dead
> frame, which is an obvious UB. I've attached a patch which fixes this
> behavior in the aforementioned function, but there might be more errors
> like that elsewhere.

Forgot some curly braces, my bad. v2 attached.

-- 
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Companydiff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 76619e9..fd55262 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -1025,13 +1025,16 @@ WaitForBackgroundWorkerShutdown(BackgroundWorkerHandle *handle)
 
 			status = GetBackgroundWorkerPid(handle, &pid);
 			if (status == BGWH_STOPPED)
-return status;
+break;
 
 			rc = WaitLatch(&MyProc->procLatch,
 		   WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
 
 			if (rc & WL_POSTMASTER_DEATH)
-return BGWH_POSTMASTER_DIED;
+			{
+status = BGWH_POSTMASTER_DIED;
+break;
+			}
 
 			ResetLatch(&MyProc->procLatch);
 		}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Bug in WaitForBackgroundWorkerShutdown() [REL9_5_STABLE]

2016-08-04 Thread Dmitry Ivanov

Recently I've encountered a strange crash while calling elog(ERROR, "...") 
after the WaitForBackgroundWorkerShutdown() function. It turns out that 
_returns_ inside the PG_TRY()..PG_CATCH() block are *highly* undesirable, 
since they leave PG_exception_stack pointing to a local struct in a dead 
frame, which is an obvious UB. I've attached a patch which fixes this behavior 
in the aforementioned function, but there might be more errors like that 
elsewhere.

-- 
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Companydiff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 76619e9..5ecb55a 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -1025,13 +1025,14 @@ WaitForBackgroundWorkerShutdown(BackgroundWorkerHandle *handle)
 
 			status = GetBackgroundWorkerPid(handle, &pid);
 			if (status == BGWH_STOPPED)
-return status;
+break;
 
 			rc = WaitLatch(&MyProc->procLatch,
 		   WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
 
 			if (rc & WL_POSTMASTER_DEATH)
-return BGWH_POSTMASTER_DIED;
+status = BGWH_POSTMASTER_DIED;
+break;
 
 			ResetLatch(&MyProc->procLatch);
 		}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Unused argument in PinBuffer function

2016-08-04 Thread Ildar Musin


Hi all,

I was looking through the buffer manager's code and have noticed that 
function PinBuffer has an unused 'strategy' argument. It's seems that 
after refactoring made by Alexander Korotkov and Andres Freund 
(48354581a49c30f5757c203415aa8412d85b0f70) it became useless. The 
attached patch removes it. Probably someone more advanced could edit the 
function description to reflect changes?


Regards,

--
Ildar Musin
i.mu...@postgrespro.ru

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..1eaa1cb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -432,7 +432,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
   ForkNumber forkNum, BlockNumber blockNum,
   ReadBufferMode mode, BufferAccessStrategy strategy,
   bool *hit);
-static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
+static bool PinBuffer(BufferDesc *buf);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf, bool fixOwner);
 static void BufferSync(int flags);
@@ -1020,7 +1020,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 */
 		buf = GetBufferDescriptor(buf_id);
 
-		valid = PinBuffer(buf, strategy);
+		valid = PinBuffer(buf);
 
 		/* Can release the mapping lock as soon as we've pinned it */
 		LWLockRelease(newPartitionLock);
@@ -1232,7 +1232,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 			buf = GetBufferDescriptor(buf_id);
 
-			valid = PinBuffer(buf, strategy);
+			valid = PinBuffer(buf);
 
 			/* Can release the mapping lock as soon as we've pinned it */
 			LWLockRelease(newPartitionLock);
@@ -1563,7 +1563,7 @@ ReleaseAndReadBuffer(Buffer buffer,
  * some callers to avoid an extra spinlock cycle.
  */
 static bool
-PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
+PinBuffer(BufferDesc *buf)
 {
 	Buffer		b = BufferDescriptorGetBuffer(buf);
 	bool		result;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

1 2 >

1 - 100 of 113 matches

Mail list logo