Re: [HACKERS] A weird bit in pg_upgrade/exec.c

2017-11-09 Thread Alvaro Herrera
a.akent...@postgrespro.ru wrote:

> This function has two calls:
> get_bin_version(&old_cluster);
> get_bin_version(&new_cluster);
> 
> I'd like to substitute these last two lines with this:
> get_bin_version(cluster);

Odd indeed.  One would think that if a cluster variable is passed as
parameter, the global vars should not be used.  +1 for fixing it, and
your proposal sounds as good as any.

> Doing it would simplify the patch I'm writing, but I'm worried I might break
> something that's been there for a long time and has been working fine.

I think this odd coding was introduced recently because of the
pg_resetxlog -> pg_resetwal renaming.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-03 Thread Alvaro Herrera
Alvaro Herrera wrote:

> I think your argument is sensible for some uses (where people run manual
> VACUUM after loading data) but not others (where people just use manual
> VACUUM in place of autovacuuming -- because they don't trust autovac, or
> the schedule isn't convenient, or whatever other reason).  I've seen
> both things being done in production.

BTW I also noticed that creating an index does summarize the first range
(rather than leaving it empty, as this rationale would suggest).  I
think we changed this at the last minute before commit in 9.5 and never
revisited; I'm inclined to change it for pg11 (but not now, of course).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-03 Thread Alvaro Herrera
Peter Geoghegan wrote:
> Andres Freund  wrote:

> > Staring at the vacuumlazy hunk I think I might have found a related bug:
> > heap_update_tuple() just copies the old xmax to the new tuple's xmax if
> > a multixact and still running.  It does so without verifying liveliness
> > of members.  Isn't that buggy? Consider what happens if we have three
> > blocks: 1 has free space, two is being vacuumed and is locked, three is
> > full and has a tuple that's key share locked by a live tuple and is
> > updated by a dead xmax from before the xmin horizon. In that case afaict
> > the multi will be copied from the third page to the first one.  Which is
> > quite bad, because vacuum already processed it, and we'll set
> > relfrozenxid accordingly.  I hope I'm missing something here?
> 
> Can you be more specific about what you mean here? I think that I
> understand where you're going with this, but I'm not sure.

He means that the tuple that heap_update moves to page 1 (which will no
longer be processed by vacuum) will contain a multixact that's older
than relminmxid -- because it is copied unchanged by heap_update instead
of properly checking against age limit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] ucs_wcwidth vintage

2017-11-03 Thread Alvaro Herrera
Thomas Munro wrote:
> Hi hackers,
> 
> src/backend/utils/mb/wchar.c contains a ~16 year old wcwidth
> implementation that originally arrived in commit df4cba68, but the
> upstream code[1] apparently continued evolving and there have been
> more Unicode revisions since.  It probably doesn't matter much: the
> observation made by Zr40 in the #postgresql IRC channel that led me
> to guess that this code might be responsible is that emojis screw up
> psql's formatting, since current terminal emulators recognise them as
> double-width but PostgreSQL doesn't.  Still, it's interesting that we
> have artefacts deriving from various different frozen versions of the
> Unicode standard in the source tree, and that might affect some proper
> languages.
> 
> 樂

Ah, thanks for the test case:

alvherre=# select '樂', 'hello';
 ?column? │ ?column? 
──┼──
 樂│ hello
(1 fila)
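The lookup technique at issue can be sketched in isolation: ucs_wcwidth-style code binary-searches a sorted table of double-width codepoint intervals, so a table frozen before emoji existed simply lacks those ranges and psql under-pads the column. The sketch below is hypothetical: the table is a tiny illustrative excerpt, not the real one from wchar.c, and `toy_wcwidth` is not the actual function.

```c
#include <assert.h>

/* Hypothetical miniature of the interval-table technique: a sorted table
 * of double-width codepoint ranges, consulted by binary search.  These
 * entries are an illustrative excerpt only. */
struct mbinterval
{
	unsigned int first;
	unsigned int last;
};

static const struct mbinterval wide[] = {
	{0x1100, 0x115F},			/* Hangul Jamo */
	{0x2E80, 0x9FFF},			/* CJK radicals and ideographs */
	{0xAC00, 0xD7A3},			/* Hangul syllables */
	{0xF900, 0xFAFF},			/* CJK compatibility ideographs */
	{0x1F300, 0x1F64F},			/* emoji: absent from a frozen table */
};

/* return 2 if the codepoint falls in a double-width range, else 1 */
static int
toy_wcwidth(unsigned int ucs)
{
	int			lo = 0;
	int			hi = (int) (sizeof(wide) / sizeof(wide[0])) - 1;

	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;

		if (ucs < wide[mid].first)
			hi = mid - 1;
		else if (ucs > wide[mid].last)
			lo = mid + 1;
		else
			return 2;
	}
	return 1;
}
```

A table that predates emoji would lack the last entry, so `toy_wcwidth` would report width 1 for a codepoint the terminal renders two cells wide, producing exactly the misalignment shown above.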



-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] ucs_wcwidth vintage

2017-11-03 Thread Alvaro Herrera
Thomas Munro wrote:
> Hi hackers,
> 
> src/backend/utils/mb/wchar.c contains a ~16 year old wcwidth
> implementation that originally arrived in commit df4cba68, but the
> upstream code[1] apparently continued evolving and there have been
> more Unicode revisions since.

I think we should update it to current upstream source, then, just like
we (are supposed to) do for any other piece of code we adopt.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-11-03 Thread Alvaro Herrera
> Robert Haas  writes:

> > We could do that, but the motivation for the current system was to
> > avoid leaking memory in a long-lived context.

Yeah, my approach here is to use a CATCH block that deletes the memory
context just created, thus avoiding a long-lived leak.

Tom Lane wrote:

> Another key point is to avoid leaving a corrupted relcache entry behind
> if you fail partway through.

Sure ... in the code as I have it we only assign the local variable to
the relcache entry if everything is successful.  So no relcache
corruption should result.

> It might work to build the new key in a context that's initially a
> child of CurrentMemoryContext, then reparent it to be a child of
> CacheMemoryContext when done. 

That's another way (than the PG_TRY block), but I think it's more
complicated with no gain.
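The two cleanup disciplines being compared can be modeled outside the backend. The sketch below is entirely hypothetical (these are not PostgreSQL's MemoryContext APIs): a "context" is reduced to a parent pointer, and building in a temporary child lets failure delete everything at once while success merely reparents the child under the long-lived cache parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a memory context: just a parent pointer plus a
 * deleted flag.  Hypothetical; the real MemoryContext machinery is
 * far richer. */
typedef struct ToyContext
{
	struct ToyContext *parent;
	bool		deleted;
} ToyContext;

static ToyContext *
toy_create(ToyContext *parent)
{
	ToyContext *cxt = calloc(1, sizeof(ToyContext));

	cxt->parent = parent;
	return cxt;
}

/* Build a cache entry in a child of the current (short-lived) context.
 * On failure, delete the child so nothing long-lived leaks (the
 * PG_CATCH-style path); on success, reparent it under the long-lived
 * cache context (the reparenting idea). */
static ToyContext *
build_entry(ToyContext *current, ToyContext *cache, bool fail)
{
	ToyContext *child = toy_create(current);

	if (fail)
	{
		child->deleted = true;	/* whole context dropped in one step */
		return NULL;
	}
	child->parent = cache;		/* memory becomes long-lived only now */
	return child;
}
```

Either way, the relcache entry only ever sees a fully built, long-lived structure, which is the point both approaches share.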

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-03 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:
> > Tom Lane wrote:
> >> Do we still need the complication in brinsummarize to discriminate
> >> against the last partial range?  Now that the lock consideration
> >> is gone, I think that might be a wart.
> 
> > In the case of VACUUM, it's not desirable to create a summarization for
> > the last partial range, because if the table is still being filled, that
> > would slow down the insertion process.
> 
> Hm.  Okay, but you should change the comment then, because "we do not want
> to spend one RelationGetNumberOfBlocks call" is a pretty weak reason.

Changed.

> Also, I think I would accept that argument for autovacuum, but maybe
> not so much for a manual vacuum.  Maybe you should drive it off
> IsAutovacuumWorker rather than which operation is being done.

I think your argument is sensible for some uses (where people run manual
VACUUM after loading data) but not others (where people just use manual
VACUUM in place of autovacuuming -- because they don't trust autovac, or
the schedule isn't convenient, or whatever other reason).  I've seen
both things being done in production.  If we do as you suggest, there is
no way to do the manual vacuum without summarizing the partial also; the
approach of doing the partial only in brin_summarize_new_values lets the
user choose what to do.

Once upon a time I thought about adding a reloption to let the user
choose what to do, but I never got around to writing a patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-03 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:
> 
> > Yeah, I think this approach results in better code.  The attached patch
> > implements that, and it passes the test for me (incl. calling
> > brin_summarize_new_values concurrently with vacuum, when running the
> > insert; the former does include the final page range whereas the latter
> > does not.)
> 
> Hm, so IIUC the point is that once the placeholder tuple is in, we can
> rely on concurrent inserters to update it for insertions into pages that
> are added after we determine our scan stop point.  But if the scan stop
> point is chosen before inserting the placeholder, then we have a race
> condition.

Exactly.  We don't need to scan those pages once the placeholder tuple
is in.

> The given code seems a brick or so short of a load, though: if the table
> has been extended sufficiently, it could compute scanNumBlks as larger
> than bs_pagesPerRange, no?  You need to clamp the computation result.

Oops, right.

> Also, shouldn't the passed-in heapBlk always be a multiple of
> pagesPerRange already?

Yeah, I guess I can turn that into an assert.

> Do we still need the complication in brinsummarize to discriminate
> against the last partial range?  Now that the lock consideration
> is gone, I think that might be a wart.

You mean this code?

		/*
		 * Unless requested to summarize even a partial range, go away now if
		 * we think the next range is partial.
		 *
		 * Maybe the table already grew to cover this range completely, but we
		 * don't want to spend a whole RelationGetNumberOfBlocks to find out,
		 * so just leave it for next time.
		 */
		if (!include_partial &&
			(startBlk + pagesPerRange > heapNumBlocks))
			break;

In the case of VACUUM, it's not desirable to create a summarization for
the last partial range, because if the table is still being filled, that
would slow down the insertion process.  So we pass include_partial=false
there.  In brin_summarize_new_values, the theory is that the user called
that function because they're done loading (at least temporarily) so
it's better to process the partial range.
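As a standalone sketch, the range selection described here (including Tom's clamping fix and the include_partial cutoff) amounts to the following. The helper name and signature are hypothetical; the real logic lives inside brinsummarize.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int BlockNumber;

/* Hypothetical sketch: pick the first block of the range containing
 * heapBlk, skip a trailing partial range unless include_partial is set,
 * and clamp the end of the scan to the table size.  Returns false when
 * there is nothing to do. */
static bool
summarization_range(BlockNumber heapBlk, BlockNumber heapNumBlocks,
					BlockNumber pagesPerRange, bool include_partial,
					BlockNumber *startBlk, BlockNumber *endBlk)
{
	*startBlk = (heapBlk / pagesPerRange) * pagesPerRange;
	if (*startBlk >= heapNumBlocks)
		return false;			/* start point beyond end of table */
	if (!include_partial && *startBlk + pagesPerRange > heapNumBlocks)
		return false;			/* trailing partial range: leave for later */
	*endBlk = *startBlk + pagesPerRange;
	if (*endBlk > heapNumBlocks)
		*endBlk = heapNumBlocks;	/* clamp, per the review comment */
	return true;
}
```

With include_partial=false (the VACUUM case) the trailing partial range is simply skipped; with include_partial=true (the brin_summarize_new_values case) it is scanned up to the clamped end.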

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 6d31088c39d455637179853a3c72a7d9e20fe053 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Thu, 2 Nov 2017 18:35:10 +0100
Subject: [PATCH v3] Fix summarization concurrent with relation extension

---
 src/backend/access/brin/brin.c | 91 --
 1 file changed, 62 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e6909d7aea..72ac8929bd 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -67,7 +67,7 @@ static BrinBuildState *initialize_brin_buildstate(Relation idxRel,
 					  BrinRevmap *revmap, BlockNumber pagesPerRange);
 static void terminate_brin_buildstate(BrinBuildState *state);
 static void brinsummarize(Relation index, Relation heapRel, BlockNumber pageRange,
-			  double *numSummarized, double *numExisting);
+			  bool include_partial, double *numSummarized, double *numExisting);
 static void form_and_insert_tuple(BrinBuildState *state);
 static void union_tuples(BrinDesc *bdesc, BrinMemTuple *a,
 			 BrinTuple *b);
@@ -791,7 +791,7 @@ brinvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 	brin_vacuum_scan(info->index, info->strategy);
 
-	brinsummarize(info->index, heapRel, BRIN_ALL_BLOCKRANGES,
+	brinsummarize(info->index, heapRel, BRIN_ALL_BLOCKRANGES, false,
 				  &stats->num_index_tuples, &stats->num_index_tuples);
 
 	heap_close(heapRel, AccessShareLock);
@@ -909,7 +909,7 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 						RelationGetRelationName(indexRel))));
 
 	/* OK, do it */
-	brinsummarize(indexRel, heapRel, heapBlk, &numSummarized, NULL);
+	brinsummarize(indexRel, heapRel, heapBlk, true, &numSummarized, NULL);
 
 	relation_close(indexRel, ShareUpdateExclusiveLock);
 	relation_close(heapRel, ShareUpdateExclusiveLock);
@@ -1129,7 +1129,8 @@ terminate_brin_buildstate(BrinBuildState *state)
 }
 
 /*
- * Summarize the given page range of the given index.
+ * On the given BRIN index, summarize the heap page range that corresponds
+ * to the heap block number given.
  *
  * This routine can run in parallel with insertions into the heap.  To avoid
 * missing those values from the summary tuple, we first inse

Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-03 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Maybe a solution is to call RelationGetNumberOfBlocks() after the
> placeholder tuple has been inserted, for the case where we would be
> scanning past end of relation; passing InvalidBlockNumber as stop point
> would indicate to do things that way.  I'll try with that approach now.

Yeah, I think this approach results in better code.  The attached patch
implements that, and it passes the test for me (incl. calling
brin_summarize_new_values concurrently with vacuum, when running the
insert; the former does include the final page range whereas the latter
does not.)

Tomas Vondra wrote:

> FWIW this patch fixes the issue for me - I can no longer reproduce the
> bitmapscan vs. seqscan result discrepancies (even with the extra UPDATE
> phase).

Thanks for testing!  This confirms that the issue was correctly
identified.  Would you try the current patch, which is better than the
other one?  It's different enough that I think it invalidates prior
testing.

Thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From b3568a5475526fbd7768bd4d74c6ad681ede920f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Thu, 2 Nov 2017 18:35:10 +0100
Subject: [PATCH v2] Fix summarization concurrent with relation extension

---
 src/backend/access/brin/brin.c | 78 ++
 1 file changed, 49 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e6909d7aea..ad512c2c5f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -67,7 +67,7 @@ static BrinBuildState *initialize_brin_buildstate(Relation idxRel,
 					  BrinRevmap *revmap, BlockNumber pagesPerRange);
 static void terminate_brin_buildstate(BrinBuildState *state);
 static void brinsummarize(Relation index, Relation heapRel, BlockNumber pageRange,
-			  double *numSummarized, double *numExisting);
+			  bool include_partial, double *numSummarized, double *numExisting);
 static void form_and_insert_tuple(BrinBuildState *state);
 static void union_tuples(BrinDesc *bdesc, BrinMemTuple *a,
 			 BrinTuple *b);
@@ -791,7 +791,7 @@ brinvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 	brin_vacuum_scan(info->index, info->strategy);
 
-	brinsummarize(info->index, heapRel, BRIN_ALL_BLOCKRANGES,
+	brinsummarize(info->index, heapRel, BRIN_ALL_BLOCKRANGES, false,
 				  &stats->num_index_tuples, &stats->num_index_tuples);
 
 	heap_close(heapRel, AccessShareLock);
@@ -909,7 +909,7 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 						RelationGetRelationName(indexRel))));
 
 	/* OK, do it */
-	brinsummarize(indexRel, heapRel, heapBlk, &numSummarized, NULL);
+	brinsummarize(indexRel, heapRel, heapBlk, true, &numSummarized, NULL);
 
 	relation_close(indexRel, ShareUpdateExclusiveLock);
 	relation_close(heapRel, ShareUpdateExclusiveLock);
@@ -1129,7 +1129,8 @@ terminate_brin_buildstate(BrinBuildState *state)
 }
 
 /*
- * Summarize the given page range of the given index.
+ * On the given BRIN index, summarize the heap page range that corresponds
+ * to the heap block number given.
 *
 * This routine can run in parallel with insertions into the heap.  To avoid
 * missing those values from the summary tuple, we first insert a placeholder
@@ -1139,6 +1140,12 @@ terminate_brin_buildstate(BrinBuildState *state)
 * update of the index value happens in a loop, so that if somebody updates
 * the placeholder tuple after we read it, we detect the case and try again.
 * This ensures that the concurrently inserted tuples are not lost.
+ *
+ * A further corner case is this routine being asked to summarize the partial
+ * range at the end of the table.  heapNumBlocks is the (possibly outdated)
+ * table size; if we notice that the requested range lies beyond that size,
+ * we re-compute the table size after inserting the placeholder tuple, to
+ * avoid missing pages that were appended recently.
 */
 static void
 summarize_range(IndexInfo *indexInfo, BrinBuildState *state, Relation heapRel,
@@ -1160,6 +1167,20 @@ summarize_range(IndexInfo *indexInfo, BrinBuildState *state, Relation heapRel,
 				   heapBlk, phtup, phsz);
 
 	/*
+	 * Whenever we're scanning what we believe to be the final range on the
+	 * table (i.e. one that might be partial) we need to recompute our idea of
+	 * what the latest page is after inserting the placeholder tuple.  This
+	 * ensures that we don't miss pages just because they were extended by
+	 * concurrent backends that didn't have the chance to see our placehold

Re: [HACKERS] dropping partitioned tables without CASCADE

2017-11-03 Thread Alvaro Herrera
Ashutosh Bapat wrote:
> On Fri, Nov 3, 2017 at 1:42 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:

> > I think adding "is partitioned" at end of line isn't good; looks like a
> > phrase but isn't translatable.  Maybe add keyword PARTITIONED instead?
> 
> In that case may be we should separate bounds and "PARTITIONED" with a
> ",". "part_default DEFAULT, PARTITIONED" would read better than
> "part_default DEFAULT PARTITIONED"?

Hmm, I vote +0.5 for the comma.

> > Having the DEFAULT partition show up in the middle of the list is weird.
> 
> Agreed. But that's true even without this patch.

Yes.

> > Is it possible to put it at either start or end of the list?
> 
> Right now, we could do that if we order the list by bound expression;
> lexically DEFAULT would come before FOR VALUES ... . But that's not
> future-safe; we may have a bound expression starting with A, B or C.
> Beyond that it really gets tricky to order the partitions by bounds.

I was just thinking in changing the query to be "order by
is_the_default_partition, partition_name" instead of just "order by
partition_name".  Sorting by bounds rather than name (a feature whose
worth should definitely be discussed separately IMV) sounds a lot more
complicated.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-03 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:

> > Rather than remove the capability, I'd be inclined to make
> > brin_summarize_new_values summarize the final partial range, and have
> > VACUUM not do it.  Would that be too inconsistent?
> 
> That doesn't really get you out of the problem that this is an abuse of
> the relation extension lock, and is likely to cause issues when people
> try to optimize that locking mechanism.

Right.

> Why is it that the regular technique doesn't work, ie create a placeholder
> tuple and let it get added to by any insertions that happen?

The problem is that we determine relation size (and scan stop point)
before inserting the placeholder tuple, so any relation extension that
occurs after we read the size is not covered by the scan.  The reason we
do this is to avoid calling RelationGetNumberOfBlocks once for each
range, since it's known to be very expensive.

Maybe a solution is to call RelationGetNumberOfBlocks() after the
placeholder tuple has been inserted, for the case where we would be
scanning past end of relation; passing InvalidBlockNumber as stop point
would indicate to do things that way.  I'll try with that approach now.
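The race, and why the recomputation has to happen after the placeholder insertion, can be modeled with a toy (names and structure hypothetical; the real code deals in buffers and revmap tuples): pages appended after the placeholder is in are folded into it by the inserters themselves, so only pages appended between the size snapshot and the placeholder can be missed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a table is just a page count.  Concurrent extension happens
 * in the window between the size snapshot and the placeholder insertion. */
typedef struct ToyTable
{
	unsigned	nblocks;
	bool		placeholder_present;
} ToyTable;

/* Returns the number of pages the summarization scan will cover.  If the
 * stop point is captured before the placeholder goes in, pages added in
 * the window are silently skipped; recomputing the size afterwards (the
 * proposed fix) closes the window. */
static unsigned
scan_covers(ToyTable *t, bool recompute_after_placeholder)
{
	unsigned	stop = t->nblocks;	/* size snapshot; may go stale */

	t->nblocks += 5;			/* concurrent backends extend the relation */
	t->placeholder_present = true;	/* later inserters now update the
									 * placeholder themselves */
	if (recompute_after_placeholder)
		stop = t->nblocks;		/* fresh size lookup, as in the fix */
	return stop;
}
```

In the toy, the buggy ordering scans only the snapshot size while the fixed ordering reaches the true end of the relation, mirroring the InvalidBlockNumber "recompute later" idea.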

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] dropping partitioned tables without CASCADE

2017-11-03 Thread Alvaro Herrera
Amit Langote wrote:
> On 2017/09/06 19:14, Amit Langote wrote:
> > On 2017/09/06 18:46, Rushabh Lathia wrote:
> >> Okay, I have marked this as ready for committer.
> > 
> > Thanks Ashutosh and Rushabh for rebasing and improving the patch.  Looks
> > good to me too.
> 
> Patch needed to be rebased after the default partitions patch went in, so
> done.  Per build status on http://commitfest.cputube.org :)

I think adding "is partitioned" at end of line isn't good; looks like a
phrase but isn't translatable.  Maybe add keyword PARTITIONED instead?

Having the DEFAULT partition show up in the middle of the list is weird.
Is it possible to put it at either start or end of the list?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-02 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:
> > Tom Lane wrote:
> >> If VACUUM and brin_summarize_new_values both ignore the partial
> >> range, then what else would request this?  Can't we just decree
> >> that we don't summarize the partial range, period?
> 
> > brin_summarize_range() can do it.
> 
> So what would happen if we just don't summarize partial ranges?

Index scan would always have to read all the heap pages for that partial
range.  Maybe not a big issue, but when you finish loading a table, it'd
be good to have a mechanism to summarize that partial range ...

Rather than remove the capability, I'd be inclined to make
brin_summarize_new_values summarize the final partial range, and have
VACUUM not do it.  Would that be too inconsistent?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-02 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:

> > 2. when summarization is requested on the partial range at the end of a
> > table, we acquire extension lock on the rel, then compute relation size
> > and run summarization with the lock held.  This guarantees that we don't
> > miss any pages.  This is bad for concurrency though, so it's only done
> > in that specific scenario.
> 
> Hm, I wonder how this will play with the active proposals around
> reimplementing relation extension locks.  All that work seems to be
> assuming that the extension lock is only held for a short time and
> nothing much beyond physical extension is done while holding it.
> I'm afraid that you may be introducing a risk of e.g. deadlocks
> if you do this.

Ouch ... yeah, that could be a problem.

Another idea I had was to just insert the placeholder tuple while
holding the extension lock, then release the lock while the
summarization is done.  It would be a bit of a break of the current
separation of concerns, but I'm not convinced that the current setup is
perfect, so maybe that's okay.

> If VACUUM and brin_summarize_new_values both ignore the partial
> range, then what else would request this?  Can't we just decree
> that we don't summarize the partial range, period?

brin_summarize_range() can do it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-02 Thread Alvaro Herrera
Tomas Vondra wrote:

> Unfortunately, I think we still have a problem ... I've been wondering
> if we end up producing correct indexes, so I've done a simple test.

Here's a proposed patch that should fix this problem (and does, in my
testing).  Would you please give it a try?

This patch changes two things:

1. in VACUUM or brin_summarize_new_values, we only process fully loaded
ranges, and ignore the partial range at end of table.

2. when summarization is requested on the partial range at the end of a
table, we acquire extension lock on the rel, then compute relation size
and run summarization with the lock held.  This guarantees that we don't
miss any pages.  This is bad for concurrency though, so it's only done
in that specific scenario.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From a760bfd44afeff8d1629599411ac7ce87acc7d26 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Thu, 2 Nov 2017 18:35:10 +0100
Subject: [PATCH] Fix summarization concurrent with relation extension

---
 src/backend/access/brin/brin.c | 91 --
 1 file changed, 70 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index b3aa6d1ced..7696c0700c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -29,6 +29,7 @@
 #include "postmaster/autovacuum.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
+#include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -68,6 +69,10 @@ static BrinBuildState *initialize_brin_buildstate(Relation idxRel,
 static void terminate_brin_buildstate(BrinBuildState *state);
 static void brinsummarize(Relation index, Relation heapRel, BlockNumber pageRange,
 			  double *numSummarized, double *numExisting);
+static void determine_summarization_range(BlockNumber pageRange,
+					  BlockNumber heapNumBlocks,
+					  BlockNumber pagesPerRange,
+					  BlockNumber *startBlk, BlockNumber *endBlk);
 static void form_and_insert_tuple(BrinBuildState *state);
 static void union_tuples(BrinDesc *bdesc, BrinMemTuple *a,
 			 BrinTuple *b);
@@ -1253,30 +1258,16 @@ brinsummarize(Relation index, Relation heapRel, BlockNumber pageRange,
 	BlockNumber startBlk;
 	BlockNumber endBlk;
 
-	/* determine range of pages to process; nothing to do for an empty table */
-	heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
-	if (heapNumBlocks == 0)
-		return;
-
 	revmap = brinRevmapInitialize(index, &pagesPerRange, NULL);
 
-	if (pageRange == BRIN_ALL_BLOCKRANGES)
+	/* determine range of pages to process */
+	heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+	determine_summarization_range(pageRange, heapNumBlocks, pagesPerRange,
+								  &startBlk, &endBlk);
+	if (startBlk > heapNumBlocks)
 	{
-		startBlk = 0;
-		endBlk = heapNumBlocks;
-	}
-	else
-	{
-		startBlk = (pageRange / pagesPerRange) * pagesPerRange;
-		/* Nothing to do if start point is beyond end of table */
-		if (startBlk > heapNumBlocks)
-		{
-			brinRevmapTerminate(revmap);
-			return;
-		}
-		endBlk = startBlk + pagesPerRange;
-		if (endBlk > heapNumBlocks)
-			endBlk = heapNumBlocks;
+		brinRevmapTerminate(revmap);
+		return;
 	}
 
 	/*
@@ -1287,9 +1278,31 @@ brinsummarize(Relation index, Relation heapRel, BlockNumber pageRange,
 	{
 		BrinTuple  *tup;
 		OffsetNumber off;
+		bool		ext_lock_held = false;
 
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * Whenever we're scanning a range that would go past what we know to
+		 * be end-of-relation, we need to ensure we scan to the true end of the
+		 * relation, or we risk missing tuples in recently added pages.  To do
+		 * this, we hold the relation extension lock from here till we're done
+		 * summarizing, and compute the relation size afresh now.  The logic in
+		 * determine_summarization_range ensures that this is not done in the
+		 * common cases of vacuum or brin_summarize_new_values(), but instead
+		 * only when we're specifically asked to summarize the last range i

Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-02 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:
> >> Hmm, I'm pretty sure we stress-tested brin in pretty much the same way.
> >> But I see this misbehavior too.  Looking ...
> 
> > Turns out that this is related to concurrent growth of the table while
> > the summarization process is scanning -- so new pages have appeared at
> > the end of the table after the end point has been determined.  It would
> > be a pain to determine number of blocks for each range, so I'm looking
> > for a simple way to fix it without imposing so much overhead.
> 
> Where are we on this --- do you want me to push the brin_doupdate
> fix I proposed, or were you intending to merge that into a
> larger patch?

Please push your fixes, I'll post my proposed patch for the other bug
afterwards; they are unrelated problems after all.

If you prefer me to push your fixes, I can do that -- let me know.

> If I'm to do it, is there a reason not to back-patch to all branches
> with BRIN?

No, all these fixes should go back to 9.5.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-02 Thread Alvaro Herrera
Stephen Frost wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > Andres Freund  writes:
> > > Do we care about people upgrading to unreleased versions? We could do
> > > nothing, document it in the release notes, or ???
> > 
> > Do nothing.
> 
> Agreed.  Not much we can do there.

Pushed the reverts.

I noticed while doing so that REL_10_STABLE contains the bogus commits.  
Does that change our opinion regarding what to do for people upgrading
to a version containing the broken commits?  I don't think so, because

  1) we hope that not many people will trust their data to 10.0
 immediately after release
  2) the bug is very low probability
  3) it doesn't look like we can do a lot about it anyway.

I'll experiment with Andres' proposed fix now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-02 Thread Alvaro Herrera
Andres Freund wrote:

> I spent some time discussing this with Robert today (with both of us
> alternating between feeling the other and ourselves as stupid), and the
> conclusion I think is that the problem is on the pruning, rather than
> the freezing side.

Thanks both for spending some more time on this.

> I think the problem is on the pruning, rather than the freezing side. We
> can't freeze a tuple if it has an alive predecessor - rather than
> weakening this, we should be fixing the pruning to not have the alive
> predecessor.

I gave a look at HTSV back then, but I didn't find what the right tweak
was, but then I only tried changing the return value to DEAD and
DELETE_IN_PROGRESS; the thought of selecting DEAD or RECENTLY_DEAD based
on OldestXmin didn't occur to me ... I was thinking that the fact that
there were live lockers meant that the tuple could not be removed,
obviously failing to notice that the subsequent versions of the tuple
would be good enough.

> If the update xmin is actually below the cutoff we can remove the tuple
> even if there's live lockers - the lockers will also be present in the
> newer version of the tuple.  I verified that for me that fixes the
> problem. Obviously that'd require some comment work and more careful
> diagnosis.

Sounds good.

I'll revert those commits then, keeping the test, and then you can
commit your change.  OK?



Re: [HACKERS] Mapping MERGE onto CTEs (Re: MERGE SQL Statement for PG11)

2017-11-01 Thread Alvaro Herrera
Nico Williams wrote:

> As an aside, I'd like to be able to control which CTEs are view-like and
> which are table-like.  In SQLite3, for example, they are all view-like,
> and the optimizer will act accordingly, whereas in PG they are all
> table-like, and thus optimizer barriers.

There was a short and easy to grasp (OK, maybe not) discussion on the
topic of CTEs acting differently.  I think the consensus is that CTEs
that are read-only and use only immutable functions may be considered
for inlining.
https://www.postgresql.org/message-id/5351711493487...@web53g.yandex.ru



Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-01 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Tomas Vondra wrote:
> 
> > FWIW I can reproduce this on 9.5, and I don't even need to run the
> > UPDATE part. That is, INSERT + VACUUM running concurrently is enough to
> > produce broken BRIN indexes :-(
> 
> Hmm, I'm pretty sure we stress-tested brin in pretty much the same way.
> But I see this misbehavior too.  Looking ...

Turns out that this is related to concurrent growth of the table while
the summarization process is scanning -- so new pages have appeared at
the end of the table after the end point has been determined.  It would
be a pain to determine the number of blocks for each range, so I'm
looking for a simple way to fix it without imposing too much overhead.



[HACKERS] strange relcache.c debug message

2017-11-01 Thread Alvaro Herrera
While messing with BRIN bugs, I noticed this debug message in the server
log:

2017-11-01 12:33:24.042 CET [361429] DEBUG:  rehashing catalog cache id 14 for pg_opclass; 17 tups, 8 buckets en carácter 194

notice that at the end it says "at character 194".  I suppose that's
because of some leftover errposition() value, but why is it being
reported in this message?



Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-31 Thread Alvaro Herrera
Tomas Vondra wrote:

> FWIW I can reproduce this on 9.5, and I don't even need to run the
> UPDATE part. That is, INSERT + VACUUM running concurrently is enough to
> produce broken BRIN indexes :-(

Hmm, I'm pretty sure we stress-tested brin in pretty much the same way.
But I see this misbehavior too.  Looking ...



Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-31 Thread Alvaro Herrera
Tom Lane wrote:

> I really don't understand how any of this "let's release the buffer
> lock and then take it back later" logic is supposed to work reliably.

So summarize_range first inserts the placeholder tuple, which is there
purposefully for other processes to update concurrently; then, without
blocking any other process, it scans the page range and updates the
placeholder tuple (this could take a long time, so it'd be a bad idea
to hold the buffer lock for that long).

I think what we should do is rethink the locking considerations in
brin_doupdate vs. brinGetTupleForHeapBlock, and how they are used in
summarize_range and brininsert.  In summarize_range, instead of hoping
that in some cases we will not need to re-obtain the placeholder tuple,
just do that in all cases keeping the buffer locked until the tuple is
updated.



Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-31 Thread Alvaro Herrera
Tom Lane wrote:
> So in a few more runs this morning using Alvaro's simplified test case,
> I have seen the following behaviors not previously reported:

> 1. Crashes in PageIndexTupleOverwrite, which has the same "invalid index
> offnum: %u" error report as PageIndexTupleDeleteNoCompact.  I note the
> same message appears in plain PageIndexTupleDelete as well.
> I think we've been too hasty to assume all instances of this came out of
> PageIndexTupleDeleteNoCompact.

Ah, I wasn't paying close attention to the originator routine of the
message, but you're right, I see this one too.

> 2. Crashes in the data-insertion process, not only the process running
> summarize_range:

Yeah, I saw these.  I was expecting it, since the two routines
(brininsert and summarize_range) pretty much share the insertion
protocol.

> I really don't understand how any of this "let's release the buffer
> lock and then take it back later" logic is supposed to work reliably.

Yeah, evidently that was way too optimistic and I'll need to figure out
a better mechanism to handle this.

The intention was to avoid deadlocks while locking the target page for
the insertion: by having both pages start unlocked we can simply lock
them in block number order.  If we keep the page containing the tuple
locked, I don't see how to reliably avoid a deadlock while acquiring a
buffer to insert the new tuple.

> BTW, while I'm bitching, it seems fairly insane from a concurrency
> standpoint that brin_getinsertbuffer is calling RecordPageWithFreeSpace
> while holding at least one and possibly two buffer locks.  Shouldn't
> that be done someplace else?

Hmm.  I spent a lot of effort (commit ccc4c074994d) to avoid leaving
pages uninitialized / unrecorded in FSM.  I left this on purpose on the
rationale that trying to fix it would make the callsites more convoluted
(the retry logic doesn't help).  But as I recall this was supposed to be
done only in the rare case where the buffer could not be returned to
caller ... but that's not what the current code does, so there is
something wrong there.



Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-30 Thread Alvaro Herrera
Tom Lane wrote:

> So: I put the blame on the fact that summarize_range() thinks that
> the tuple offset it has for the placeholder tuple is guaranteed to
> hold good, even across possibly-long intervals where it's holding
> no lock on the containing buffer.

Yeah, I think this is a pretty reasonable explanation for the problem.
I don't understand why it doesn't fail in 9.6.

> Fixing this without creating new problems is beyond my familiarity
> with the BRIN code.  It looks like it might be nontrivial :-(

Looking.



Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-30 Thread Alvaro Herrera
Thanks everyone for the analysis downthread.  Here's a test case that
provokes the crash faster.  Initialize with

create table brin_test (a text);
create index on brin_test using brin (a) with (pages_per_range = 1);

Then in one psql, run this:
select brin_summarize_new_values('brin_test_a_idx') \watch 0.1

and a pgbench running a script with this line (one client is enough):

insert into brin_test values (repeat(chr(ascii('A') + int4(random() * 61)), int4(random() * 200)));

It crashes in 10-20 seconds for me.



Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-30 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> 
> > Before pushing, I'll give a look to the regular autovacuum path to see
> > if it needs a similar fix.
> 
> Reading that one, my conclusion is that it doesn't have the same problem
> because the strings are allocated in AutovacMemCxt which is not reset
> by error recovery.  This gave me the idea to use that context instead of
> TopTransactionContext to store the strings in workitem processing also.
> The patch I propose now is attached.

I ended up backpatching 335f3d04e4c8.



Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Oct 24, 2017 at 7:25 PM, Andres Freund  wrote:
> > I think it does the contrary. The current mechanism is, in my opinion,
> > downright dangerous:
> > https://www.postgresql.org/message-id/20160201235854.go8...@awork2.anarazel.de
> 
> A sort of middle way would be to keep the secondary checkpoint around
> but never try to replay from that point, or only if a specific flag is
> provided.

Why do you want to keep the secondary checkpoint?  If there is no way to
automatically start a recovery from the prior checkpoint, is it really
possible to do the same manually?  I think the only advantage of keeping
it is that the WAL files are kept around for a little bit longer.  But
is that useful?  Surely for any environment where you really care, you
have a WAL archive somewhere, so it doesn't matter if files are removed
from the primary's pg_xlog dir.



Re: [HACKERS] pow support for pgbench

2017-10-30 Thread Alvaro Herrera
Michael Paquier wrote:

> Attaching patches directly to a thread is a better practice as if
> github goes away, any Postgres developers can still have an access to
> any code you publish using the public archives on postgresql.org.

Also, by posting to pgsql-hackers indicating intention to integrate to
Postgres you're implicitly making a statement about the license of your
contribution; see the second half of the last paragraph at
https://wiki.postgresql.org/wiki/Archives_Policy



Re: [HACKERS] pow support for pgbench

2017-10-30 Thread Alvaro Herrera
Michael Paquier wrote:

> Please add this patch to the upcoming commit fest if you would like to
> get some feedback:
> https://commitfest.postgresql.org/15/
> 
> I am adding as well Fabien in CC who worked in getting the internal
> function infrastructure in the shape it is now (waaay better) with
> commit 86c43f4.

I think Raúl would do well to review this patch by Fabien
https://www.postgresql.org/message-id/alpine.DEB.2.20.1710201835390.15170@lancre
which adds a few functions and operators.



[HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-30 Thread Alvaro Herrera
Tomas Vondra wrote:
> FWIW I can reproduce this on REL_10_STABLE, but not on REL9_6_STABLE. So
> it seems to be due to something that changed in the last release.

I've been trying to reproduce it for half an hour with no success (I
think my laptop is just too slow).  I bet this is related to the
addition of PageIndexTupleOverwrite (commit b1328d78f) though I find it
more likely that this was not *caused* by that commit but rather just
made it easier to hit.



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-27 Thread Alvaro Herrera
Alvaro Herrera wrote:

> I'm now working on the ability to build unique indexes (and unique
> constraints) on top of this.

So I think there's not a lot of additional code required to support
unique indexes with the restrictions mentioned; proof-of-concept (with
several holes still) attached.

As before, this is not finished, as there are a few things that are wrong
(such as generateClonedIndexStmt taking a RangeVar), and others that I
don't understand (such as why is rd_partkey NULL for partitioned
partitions when DefineIndex cascades to them and the columns are
checked).

I noticed that RelationBuildPartitionKey is generating a partition key
in a temp context, then creating a private context and copying the key
into that.  That seems leftover from some previous iteration of some
other patch; I think it's pretty reasonable to create the new context
right from the start and allocate the key there directly instead.  Then
there's no need for copy_partition_key at all.

Anyway, here is a preliminary submission before I close shop for the
week.

>From 37e683e352df5f3e6c110d94881abe888e36953e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Mon, 16 Oct 2017 12:01:12 +0200
Subject: [PATCH v2 1/7] Tweak index_create/index_constraint_create APIs

I was annoyed by index_create and index_constraint_create respective
APIs.  It's too easy to get confused when adding a new behavioral flag
given the large number of boolean flags they already have.  Turning them
into a flags bitmask makes that code easier to read, as in the attached
patch.

I split the flags in two -- one set for index_create directly and
another related to constraints.  index_create() itself receives both,
and then passes down the constraint one to index_constraint_create.  It
is shorter in LOC terms to create a single set mixing all the values,
but it seemed to make more sense this way.  Also, I chose not to turn
the booleans 'allow_system_table_mods' and 'is_internal' into flag bits
because of the different way they interact with this code.
---
 src/backend/catalog/index.c  | 101 +++
 src/backend/catalog/toasting.c   |   3 +-
 src/backend/commands/indexcmds.c |  33 +
 src/backend/commands/tablecmds.c |  13 +++--
 src/include/catalog/index.h  |  29 ++-
 5 files changed, 100 insertions(+), 79 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c7b2f031f0..17faeffada 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -680,19 +680,25 @@ UpdateIndexRelation(Oid indexoid,
  * classObjectId: array of index opclass OIDs, one per index column
  * coloptions: array of per-index-column indoption settings
  * reloptions: AM-specific options
- * isprimary: index is a PRIMARY KEY
- * isconstraint: index is owned by PRIMARY KEY, UNIQUE, or EXCLUSION constraint
- * deferrable: constraint is DEFERRABLE
- * initdeferred: constraint is INITIALLY DEFERRED
+ * flags: bitmask that can include any combination of these bits:
+ * INDEX_CREATE_IS_PRIMARY
+ * the index is a primary key
+ * INDEX_CREATE_ADD_CONSTRAINT:
+ * invoke index_constraint_create also
+ * INDEX_CREATE_SKIP_BUILD:
+ * skip the index_build() step for the moment; caller must do it
+ * later (typically via reindex_index())
+ * INDEX_CREATE_CONCURRENT:
+ * do not lock the table against writers.  The index will be
+ * marked "invalid" and the caller must take additional steps
+ * to fix it up.
+ * INDEX_CREATE_IF_NOT_EXISTS:
+ * do not throw an error if a relation with the same name
+ * already exists.
+ * constr_flags: flags passed to index_constraint_create
+ * (only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
- * skip_build: true to skip the index_build() step for the moment; caller
- * must do it later (typically via reindex_index())
- * concurrent: if true, do not lock the table against writers.  The index
- * will be marked "invalid" and the caller must take additional steps
- * to fix it up.
  * is_internal: if true, post creation hook for new index
- * if_not_exists: if true, do not throw an error if a relation with
- * the same name already exists.
  *
  * Returns the OID of the created index.
  */
@@ -709,15 +715,10 @@ index_create(Relation heapRelation,
 Oid *classObjectId,
 int16 *coloptions,
 Datum reloptions,
-bool i

Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Alvaro Herrera
Tomas Vondra wrote:

> Not sure "a number of in-core opclasses" is a good reason to (not) add
> new ones. Also, we already have two built-in BRIN opclasses (minmax and
> inclusion).
>
> In general, "BRIN bloom" can be packed as a contrib module (at least I
> believe so). That's not the case for the "BRIN multi-range" which also
> requires some changes to some code in brin.c (but the rest can be moved
> to contrib module, of course).

I was rather thinking that if we can make this very robust against the
index growing out of proportion, we should consider ditching the
original minmax and replace it with multirange minmax, which seems like
it'd have much better behavior.

I don't see any reason to put any of this in contrib.



Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Alvaro Herrera
Simon Riggs wrote:

> Earlier thoughts on how this could/could not be done were sometimes
> imprecise or inaccurate, so I have gone through the command per
> SQL:2011 spec and produced a definitive spec in the form of an SGML
> ref page. This is what I intend to deliver for PG11.

Nice work.  I didn't verify the SQL spec, just read your HTML page;
some very minor comments based on that:

* use "and" not "where" as initial words in "when_clause" and
  "merge_update" clause definitions

* missing word here: "the DELETE privilege on the if you specify"

* I think the word "match." is leftover from some editing in the phrase
  " that specifies which rows in the data_source match rows in the
  target_table_name. match."  In the same paragraph, it is not clear
  whether all columns must be matched or it can be a partial match.

* In the when_clause note, it is not clear whether you can have multiple
  WHEN MATCHED and WHEN NOT MATCHED clauses.  Obviously you can have one
  of each, but I think your doc says it is possible to have more than one of
  each, with different conditions (WHEN MATCHED AND foo THEN bar WHEN
  MATCHED AND baz THEN qux).  No example shows more than one.

  On the same point: Is there short-circuiting of such conditions, i.e.
  will the execution stop looking for further WHEN matches once some
  rule matches, or will it rather check all rules and raise an error if
  more than one WHEN rule matches a given row?

* Your last example uses ELSE but that appears nowhere in the synopsis.



Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled

2017-10-27 Thread Alvaro Herrera
Amit Kapila wrote:

> You might want to give a try with the hash index if you are planning
> to use PG10 and your queries involve equality operations.

So, btree indexes on monotonically increasing sequences don't write tons
of full page writes because typically the same page is touched many
times by insertions on each checkpoint cycle; so only one or very few
full page writes are generated for a limited number of index pages.

With UUID you lose locality of access: each insert goes to a different
btree page, so you generate tons of full page writes because the number
of modified index pages is very large.

With hash on monotonically increasing keys, my guess is that you get
behavior similar to btrees on UUID: the inserts are all over the place
in the index, so tons of full page writes.  Am I wrong?

With hash on UUID, the same thing should happen.  Am I wrong?



Re: [HACKERS] taking stdbool.h into use

2017-10-26 Thread Alvaro Herrera
Michael Paquier wrote:
> On Thu, Oct 26, 2017 at 10:51 AM, Alvaro Herrera
> <alvhe...@alvh.no-ip.org> wrote:
> > I gave this a quick run, to see if my compiler would complain for things
> > like this:
> >
> >    bool    isprimary = flags & INDEX_CREATE_IS_PRIMARY;
> >
> > (taken from the first patch at
> > https://postgr.es/m/20171023161503.ohkybquxrlech7d7@alvherre.pgsql )
> >
> > which is assigning a value other than 1/0 to a bool variable without an
> > explicit cast.  I thought it would provoke a warning, but it does not.
> > Is that expected?  Is my compiler too old/new?
> 
> It seems to me that this proves the point of the proposed patch. You
> had better use a zero-equality comparison for such bitwise operation,
> and so you ought to do that:
> bool    isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;

Right, exactly.  But my point is that with the whole patch series
applied I didn't get any warnings.



Re: [HACKERS] taking stdbool.h into use

2017-10-26 Thread Alvaro Herrera
Peter Eisentraut wrote:
> Here is an updated patch set.  This is just a rebase of the previous
> set, no substantial changes.  Based on the discussion so far, I'm
> proposing that 0001 through 0007 could be ready to commit after review,
> whereas the remaining two need more work at some later time.

I gave this a quick run, to see if my compiler would complain for things
like this:

   bool    isprimary = flags & INDEX_CREATE_IS_PRIMARY;

(taken from the first patch at
https://postgr.es/m/20171023161503.ohkybquxrlech7d7@alvherre.pgsql )

which is assigning a value other than 1/0 to a bool variable without an
explicit cast.  I thought it would provoke a warning, but it does not.
Is that expected?  Is my compiler too old/new?

config.log says
gcc version 6.3.0 20170516 (Debian 6.3.0-18) 



Re: [HACKERS] How to determine that a TransactionId is really aborted?

2017-10-26 Thread Alvaro Herrera
Eric Ridge wrote:

> > Again, you'll probably need to put this low level requirement into
> > context if you want sound advice from this list.
> 
> I'm just thinking out loud here, but the context is likely something
> along the lines of externally storing all transaction ids, and
> periodically asking Postgres if they're
> known-to-be-aborted-by-all-transactions -- one at a time.

I think if you just check the global xmin, then you're going to maintain
a very long list of transactions "potentially running" whenever there
are long-lived transactions (pg_dump, for example).  You could try to
add a TransactionIdIsInProgress() somewhere and discard the xact
downright (by DidCommit and DidAbort) if it returns false.  A single
long-running transaction can keep the global xmin down by hundreds of
millions of Xids.



Re: [HACKERS] Remove secondary checkpoint

2017-10-26 Thread Alvaro Herrera
Tsunakawa, Takayuki wrote:

> (Although unrelated to this, I've also been wondering why PostgreSQL
> flushes WAL to disk when writing a page in the shared buffer, because
> PostgreSQL doesn't use WAL for undo.)

The reason is that if the system crashes after writing the data page to
disk, but before writing the WAL, the data page would be inconsistent
with data in pages that weren't flushed, since there is no WAL to update
those other pages.  Also, if the system crashes after partially writing
the page (say it writes the first 4kB) then the page is downright
corrupted with no way to fix it.

So there has to be a barrier that ensures that the WAL is flushed up to
the last position that modified a page (i.e. that page's LSN) before
actually writing that page to disk.  And this is why we can't use mmap()
for shared buffers -- there is no mechanism to force the WAL down if the
operating system has the liberty to flush pages whenever it likes.



Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-24 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Before pushing, I'll give a look to the regular autovacuum path to see
> if it needs a similar fix.

Reading that one, my conclusion is that it doesn't have the same problem
> because the strings are allocated in AutovacMemCxt which is not reset
by error recovery.  This gave me the idea to use that context instead of
TopTransactionContext to store the strings in workitem processing also.
The patch I propose now is attached.

>From 013e54c9df54b7a324a30beba22138a86417f34f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Mon, 23 Oct 2017 18:55:12 +0200
Subject: [PATCH v2] Fix autovacuum work items

---
 src/backend/postmaster/autovacuum.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 776b1c0a9d..dc76728d71 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2521,9 +2521,10 @@ deleted:
{
AutoVacuumWorkItem *workitem = &AutoVacuumShmem->av_workItems[i];
 
-   if (!workitem->avw_used)
-   continue;
-   if (workitem->avw_active)
+   /* Item empty, already being processed, not this database? skip it */
+   if (!workitem->avw_used ||
+   workitem->avw_active ||
+   workitem->avw_database != MyDatabaseId)
continue;
 
/* claim this one, and release lock while performing it */
@@ -2531,6 +2532,7 @@ deleted:
LWLockRelease(AutovacuumLock);
 
perform_work_item(workitem);
+   /* we intentionally do not set did_vacuum here */
 
/*
 * Check for config changes before acquiring lock for further
@@ -2601,11 +2603,9 @@ perform_work_item(AutoVacuumWorkItem *workitem)
/*
	 * Save the relation name for a possible error message, to avoid a catalog
	 * lookup in case of an error.  If any of these return NULL, then the
-* relation has been dropped since last we checked; skip it. Note: they
-* must live in a long-lived memory context because we call vacuum and
-* analyze in different transactions.
+* relation has been dropped since last we checked; skip it.
 */
-
+   MemoryContextSwitchTo(AutovacMemCxt);
cur_relname = get_rel_name(workitem->avw_relation);
cur_nspname = get_namespace_name(get_rel_namespace(workitem->avw_relation));
cur_datname = get_database_name(MyDatabaseId);
@@ -2615,6 +2615,8 @@ perform_work_item(AutoVacuumWorkItem *workitem)
autovac_report_workitem(workitem, cur_nspname, cur_datname);
 
/*
+* Have at it.
+*
 * We will abort the current work item if something errors out, and
 * continue with the next one; in particular, this happens if we are
	 * interrupted with SIGINT.  Note that this means that the work item list
@@ -2622,9 +2624,6 @@ perform_work_item(AutoVacuumWorkItem *workitem)
 */
PG_TRY();
{
-   /* have at it */
-   MemoryContextSwitchTo(TopTransactionContext);
-
switch (workitem->avw_type)
{
case AVW_BRINSummarizeRange:
@@ -2668,10 +2667,8 @@ perform_work_item(AutoVacuumWorkItem *workitem)
}
PG_END_TRY();
 
-   /* We intentionally do not set did_vacuum here */
-
-   /* be tidy */
 deleted2:
+   /* be tidy */
if (cur_datname)
pfree(cur_datname);
if (cur_nspname)
-- 
2.11.0


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-23 Thread Alvaro Herrera
Tom Lane wrote:

> What I'm suspicious of as the actual bug cause is the comment in
> perform_work_item about how we need to be sure that we're allocating these
> strings in a long-lived context.  If, in fact, they were allocated in some
> context that could get reset during the PG_TRY (particularly in the
> error-cleanup path, which I bet we don't test), then the observed symptom
> that the pointers seem to be pointing at garbage could be explained.
> 
> So what I'm thinking is that you need an error during perform_work_item,
> and/or more than one work_item picked up in the calling loop, to make this
> bug manifest.  You would need to enter perform_work_item in a
> non-long-lived context, so the first time through the loop is probably
> safe anyway.

I created a reproducer for this bug, by 1) adding some sleeps to make
the summarization process take longer, 2) injecting errors randomly
during the summarization run, 3) inserting lots of rows using the
attached pgbench script running with 8 clients (creation script below).
Takes less than a second to crash.

What is happening is that we're freeing the strings (allocated in
TopTransactionContext) even in the case where the PG_CATCH block aborts
the transaction, which is obviously bogus.  I moved the frees to inside
the PG_TRY block and no crash occurs (I didn't like a 'goto' from
outside to inside the PG_TRY block, so I duplicated those lines
instead).  I propose the attached patch.

Before pushing, I'll give a look to the regular autovacuum path to see
if it needs a similar fix.

initialization:
CREATE TABLE t (a integer);
CREATE INDEX t_a_idx ON t USING brin (a) WITH (autosummarize='true', pages_per_range='16');

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 2543c9ee25245f63653bf342a0240eaa8a1dcd6f Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Mon, 23 Oct 2017 18:55:12 +0200
Subject: [PATCH] Fix autovacuum work items

---
 src/backend/postmaster/autovacuum.c | 39 +
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 776b1c0a9d..f5aa520d39 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2525,12 +2525,15 @@ deleted:
continue;
if (workitem->avw_active)
continue;
+   if (workitem->avw_database != MyDatabaseId)
+   continue;
 
/* claim this one, and release lock while performing it */
workitem->avw_active = true;
LWLockRelease(AutovacuumLock);
 
perform_work_item(workitem);
+		/* we intentionally do not set did_vacuum here */
 
/*
 * Check for config changes before acquiring lock for further
@@ -2601,16 +2604,22 @@ perform_work_item(AutoVacuumWorkItem *workitem)
/*
	 * Save the relation name for a possible error message, to avoid a catalog
 * lookup in case of an error.  If any of these return NULL, then the
-* relation has been dropped since last we checked; skip it. Note: they
-* must live in a long-lived memory context because we call vacuum and
-* analyze in different transactions.
+* relation has been dropped since last we checked; skip it.
 */
-
+   MemoryContextSwitchTo(TopTransactionContext);
cur_relname = get_rel_name(workitem->avw_relation);
	cur_nspname = get_namespace_name(get_rel_namespace(workitem->avw_relation));
cur_datname = get_database_name(MyDatabaseId);
if (!cur_relname || !cur_nspname || !cur_datname)
-   goto deleted2;
+   {
+   if (cur_datname)
+   pfree(cur_datname);
+   if (cur_nspname)
+   pfree(cur_nspname);
+   if (cur_relname)
+   pfree(cur_relname);
+   return;
+   }
 
autovac_report_workitem(workitem, cur_nspname, cur_datname);
 
@@ -2623,7 +2632,6 @@ perform_work_item(AutoVacuumWorkItem *workitem)
PG_TRY();
{
/* have at it */
-   MemoryContextSwitchTo(TopTransactionContext);
 
switch (workitem->avw_type)
{
@@ -2645,6 +2653,14 @@ perform_work_item(AutoVacuumWorkItem *workitem)
 * point.)
 */
QueryCancelPending = false;
+
+   /* be tidy */
+   if (cur_datname)
+   pfree(cur_datname);
+   if (cur_nspname)
+   pfree(cur_nspname);
+   if (cur_relname)
+   pfree(cur_relname);
}
PG_CATCH();
  

Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-23 Thread Alvaro Herrera
Robert Haas wrote:
> On Mon, Oct 23, 2017 at 11:12 AM, Alvaro Herrera
> <alvhe...@alvh.no-ip.org> wrote:
> > I started with Maksim's submitted code, and developed according to the
> > ideas discussed in this thread.  Attached is a very WIP patch series for
> > this feature.
> >
> > Many things remain to be done before this is committable:  pg_dump
> > support needs to be written.  ALTER INDEX ATTACH/DETACH not yet
> > implemented.  No REINDEX support yet.  Docs not updated (but see the
> > regression test as a guide for how this is supposed to work; see patch
> > 0005).  CREATE INDEX CONCURRENTLY not done yet.
> >
> > I'm now working on the ability to build unique indexes (and unique
> > constraints) on top of this.
> 
> Cool.  Are you planning to do that by (a) only allowing the special
> case where the partition key columns/expressions are included in the
> indexed columns/expressions, (b) trying to make every insert to any
> index check all of the indexes for uniqueness conflicts, or (c)
> implementing global indexes?  Because (b) sounds complex - think about
> attach operations, for example - and (c) sounds super-hard.  I'd
> suggest doing (a) first, just on the basis of complexity.

Yes, I think (a) is a valuable thing to have -- not planning on doing
(c) at all because I fear it'll be a huge time sink.  I'm not sure about
(b), but it's not currently on my plan.

> I hope that you don't get so involved in making this unique index
> stuff work that we don't get the cascading index feature, at least,
> committed to v11.  That's already a considerable step forward in terms
> of ease of use, and I'd really like to have it.

Absolutely -- I do plan to get this one finished regardless of unique
indexes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-23 Thread Alvaro Herrera
Hello

I started with Maksim's submitted code, and developed according to the
ideas discussed in this thread.  Attached is a very WIP patch series for
this feature.

Many things remain to be done before this is committable:  pg_dump
support needs to be written.  ALTER INDEX ATTACH/DETACH not yet
implemented.  No REINDEX support yet.  Docs not updated (but see the
regression test as a guide for how this is supposed to work; see patch
0005).  CREATE INDEX CONCURRENTLY not done yet.

I'm now working on the ability to build unique indexes (and unique
constraints) on top of this.

The docs have not been updated yet, but the new regression test file
illustrates how this works.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 8e8bf2f587188859eba1f61a44ee9ee08d960f51 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Mon, 16 Oct 2017 12:01:12 +0200
Subject: [PATCH v1 1/5] Tweak index_create/index_constraint_create APIs

I was annoyed by index_create and index_constraint_create respective
APIs.  It's too easy to get confused when adding a new behavioral flag
given the large number of boolean flags they already have.  Turning them
into a flags bitmask makes that code easier to read, as in the attached
patch.

I split the flags in two -- one set for index_create directly and
another related to constraints.  index_create() itself receives both,
and then passes down the constraint one to index_constraint_create.  It
is shorter in LOC terms to create a single set mixing all the values,
but it seemed to make more sense this way.  Also, I chose not to turn
the booleans 'allow_system_table_mods' and 'is_internal' into flag bits
because of the different way they interact with this code.
---
 src/backend/catalog/index.c  | 101 +++
 src/backend/catalog/toasting.c   |   3 +-
 src/backend/commands/indexcmds.c |  33 +
 src/backend/commands/tablecmds.c |  13 +++--
 src/include/catalog/index.h  |  29 ++-
 5 files changed, 100 insertions(+), 79 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c7b2f031f0..17faeffada 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -680,19 +680,25 @@ UpdateIndexRelation(Oid indexoid,
  * classObjectId: array of index opclass OIDs, one per index column
  * coloptions: array of per-index-column indoption settings
  * reloptions: AM-specific options
- * isprimary: index is a PRIMARY KEY
- * isconstraint: index is owned by PRIMARY KEY, UNIQUE, or EXCLUSION constraint
- * deferrable: constraint is DEFERRABLE
- * initdeferred: constraint is INITIALLY DEFERRED
+ * flags: bitmask that can include any combination of these bits:
+ * INDEX_CREATE_IS_PRIMARY
+ * the index is a primary key
+ * INDEX_CREATE_ADD_CONSTRAINT:
+ * invoke index_constraint_create also
+ * INDEX_CREATE_SKIP_BUILD:
+ *		skip the index_build() step for the moment; caller must do it
+ * later (typically via reindex_index())
+ * INDEX_CREATE_CONCURRENT:
+ *		do not lock the table against writers.  The index will be
+ *		marked "invalid" and the caller must take additional steps
+ * to fix it up.
+ * INDEX_CREATE_IF_NOT_EXISTS:
+ * do not throw an error if a relation with the same name
+ * already exists.
+ * constr_flags: flags passed to index_constraint_create
+ * (only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
- * skip_build: true to skip the index_build() step for the moment; caller
- * must do it later (typically via reindex_index())
- * concurrent: if true, do not lock the table against writers.  The index
- *		will be marked "invalid" and the caller must take additional steps
- * to fix it up.
  * is_internal: if true, post creation hook for new index
- * if_not_exists: if true, do not throw an error if a relation with
- * the same name already exists.
  *
  * Returns the OID of the created index.
  */
@@ -709,15 +715,10 @@ index_create(Relation heapRelation,
 Oid *classObjectId,
 int16 *coloptions,
 Datum reloptions,
-bool isprimary,
-bool isconstraint,
-bool deferrable,
-bool initdeferred,
+uint16 flags,
+uint16 constr_flags,
 bool allow_system_table_mods,
-bool skip_build,
-bool concurrent,
-   

Re: [HACKERS] [PATCH] Tests for reloptions

2017-10-19 Thread Alvaro Herrera
Oh, one more thing: be careful when editing parallel_schedule.  There
are constraints on the number of entries in each group; you had added a
20th entry after the comment that the group can only have 19.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] Tests for reloptions

2017-10-19 Thread Alvaro Herrera
Nikolay Shaplov wrote:
> В письме от 3 октября 2017 11:48:43 пользователь Michael Paquier написал:

> I've been thinking a lot, and rereading the patch. When I reread it I've been 
> thinking that I would like to add more tests to it now... ;-)
> 
> If the only purpose of tests is to get better coverage, then I would agree 
> with you. But I've been thinking that tests also should check that everything 
> behaves as expected (or as written in documentation).
> 
> I would say that option names are part of the SQL dialect that postgres
> uses, kind of part of the syntax. It is good to be sure that they are still
> supported. So I would add a test for every option heap supports.

Yeah, it would perhaps be good idea to ensure we don't break things that
are documented to work.  If the tests don't take too long, I'm not
opposed to testing every single option.  As you say, code coverage is
important but it's not the only goal.

I'm hesitant to hardcode things like the number of bits in bloom, as you
had in the original.  If I understand correctly, that number could
change with compile options (different blocksize?), so I removed that
part.  I also fixed a few spelling errors.

And pushed.  Let's see what the buildfarm says about this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] show "aggressive" or not in autovacuum logs

2017-10-19 Thread Alvaro Herrera
Kyotaro HORIGUCHI wrote:

> How about the followings?
> 
> "automatic [agressive ]vacuum of table \"%s..."
> "[aggressive ]vacuuming \"%s..."

That form of log message seems acceptable to me (first one is missing a 'g').

In any case, please do not construct the sentence with %s expanding the
word, because that is going to cause a problem for translations.  Use an
'if' test with two full sentences instead.  Yes, as Robert said, it's
going to add more strings, but that's okay.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-18 Thread Alvaro Herrera
Justin Pryzby wrote:
> On Wed, Oct 18, 2017 at 06:54:09PM +0200, Alvaro Herrera wrote:

> > And the previous code crashes in 45 minutes?  That's solid enough for
> > me; I'll clean up the patch and push in the next few days.  I think what
> > you have now should be sufficient for the time being for your production
> > system.
> 
> No - the crash happened 4 times since adding BRIN+autosummarize 6 days ago, 
> and
> in once instance occured twice within 3 hours (while I was trying to query 
> logs
> for the preceding crash).

Oh, okay.  Then we don't know enough yet, ISTM.

> [pryzbyj@database ~]$ sudo grep -hE 'in postgres|Saved core' 
> /var/log/messages*
> Oct 13 17:22:45 database kernel: postmaster[32127] general protection 
> ip:4bd467 sp:7ffd9b349990 error:0 in postgres[40+692000]
> Oct 13 17:22:47 database abrt[32387]: Saved core dump of pid 32127 
> (/usr/pgsql-10/bin/postgres) to 
> /var/spool/abrt/ccpp-2017-10-13-17:22:47-32127 (15040512 bytes)
> Oct 14 18:05:35 database kernel: postmaster[26500] general protection 
> ip:84a177 sp:7ffd9b349b88 error:0 in postgres[40+692000]
> Oct 14 18:05:35 database abrt[27564]: Saved core dump of pid 26500 
> (/usr/pgsql-10/bin/postgres) to 
> /var/spool/abrt/ccpp-2017-10-14-18:05:35-26500 (24137728 bytes)
> Oct 16 23:21:22 database kernel: postmaster[31543] general protection 
> ip:4bd467 sp:7ffe08a94890 error:0 in postgres[40+692000]
> Oct 16 23:21:22 database abrt[570]: Saved core dump of pid 31543 
> (/usr/pgsql-10/bin/postgres) to 
> /var/spool/abrt/ccpp-2017-10-16-23:21:22-31543 (25133056 bytes)
> Oct 17 01:58:36 database kernel: postmaster[8646]: segfault at 8 ip 
> 0084a177 sp 7ffe08a94a88 error 4 in postgres[40+692000]
> Oct 17 01:58:38 database abrt[9192]: Saved core dump of pid 8646 
> (/usr/pgsql-10/bin/postgres) to /var/spool/abrt/ccpp-2017-10-17-01:58:38-8646 
> (7692288 bytes)

Do you still have those core dumps?  If so, would you please verify the
database that autovacuum was running in?  Just open each with gdb (using
the original postgres binary, not the one you just installed) and do
"print MyDatabaseId".

> I'll continue runnning with the existing patch and come back if the issue
> recurs.

Thanks.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2017-10-18 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, Oct 18, 2017 at 11:27 AM, Alvaro Herrera
> <alvhe...@alvh.no-ip.org> wrote:
> > Maybe there are combinations of different persistence values that can be
> > allowed to differ (an unlogged partition is probably OK with a permanent
> > parent), but I don't think the current check is good enough.
> 
> This is also a sort of long-standing historical problem, I think.

Sure.

Actually, the code I'm calling attention to is ATExecAttachPartition()
which was specifically written for partitioning.  Looks like it was copied
verbatim from ATExecAddInherit, but there's no shared code there AFAICS.

I'm okay with prohibiting the case of different persistence values as
you suggest.  And I do suggest to back-patch that prohibition to pg10.

Let me add that I'm not looking to blame anyone for what I report here.
I'm very excited about the partitioning stuff and I'm happy of what was
done for pg10.  I'm now working on more partitioning-related changes
which means I review the existing code as I go along, so I just report
things that look wrong to me as I discover them, just with an interest
in seeing them fixed, or documented, or at least discussed and
explicitly agreed upon.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-18 Thread Alvaro Herrera
Justin Pryzby wrote:

> No crashes in ~28hr.  It occurs to me that it's a weaker test due to not
> preserving most compilation options.

And the previous code crashes in 45 minutes?  That's solid enough for
me; I'll clean up the patch and push in the next few days.  I think what
you have now should be sufficient for the time being for your production
system.

> If I understand, our crash isn't explained by the avw_database test
> anyway (?)

I don't see why you would think that -- I disagree.

> Should I make clean and recompile with all non-prefix options and a minimal
> patch (avw_database==MyDatabaseId || continue) ?
> 
> Or recompile with existing options but no patch to first verify crash occurs
> with locally compiled binary?

I don't think either of these actions is necessary.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2017-10-18 Thread Alvaro Herrera
This check is odd (tablecmds.c ATExecAttachPartition line 13861):

/* Temp parent cannot have a partition that is itself not a temp */
if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP &&
attachrel->rd_rel->relpersistence != RELPERSISTENCE_TEMP)
		ereport(ERROR,
				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
				 errmsg("cannot attach a permanent relation as partition of temporary relation \"%s\"",
						RelationGetRelationName(rel))));

This seems to work (i.e. it's allowed to create a temp partition on a
permanent parent and not vice-versa, which you'd think makes sense) but
it's illusory, because if two sessions try to create temp partitions for
overlapping values, the second session fails with a silly error message.
To be more precise, do this in one session:

CREATE TABLE p (a int, b int) PARTITION BY RANGE (a);
CREATE TEMP TABLE p1 PARTITION OF p FOR VALUES FROM (0) TO (10);

then without closing that one, in another session repeat the second
command:

alvherre=# CREATE TEMP TABLE p1 PARTITION OF p FOR VALUES FROM (0) TO (10);
ERROR:  partition "p1" would overlap partition "p1"

which is not what I would have expected.

Maybe there are combinations of different persistence values that can be
allowed to differ (an unlogged partition is probably OK with a permanent
parent), but I don't think the current check is good enough.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] alter table doc fix

2017-10-18 Thread Alvaro Herrera
Amit Langote wrote:
> Hi.
> 
> Noticed that a alter table sub-command's name in Description (where it's
> OWNER) differs from that in synopsis (where it's OWNER TO).  Attached
> patch to make them match, if the difference is unintentional.

I agree -- pushed.

This paragraph

   
The actions for identity columns (ADD
GENERATED, SET etc., DROP
IDENTITY), as well as the actions
TRIGGER, CLUSTER, 
OWNER,
and TABLESPACE never recurse to descendant tables;
that is, they always act as though ONLY were specified.
Adding a constraint recurses only for CHECK constraints
that are not marked NO INHERIT.
   

is a bit annoying, though I think it'd be worse if we "fix" it to be
completely strict about the subcommands it refers to.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] v10 telease note for pg_basebackup refers to old --xlog-method argument

2017-10-18 Thread Alvaro Herrera
Pushed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2017-10-18 Thread Alvaro Herrera
Amit Langote wrote:
> On 2017/10/18 1:52, Alvaro Herrera wrote:
> > Alvaro Herrera wrote:
> >> Robert Haas wrote:
> >>> Implement table partitioning.
> >>
> >> Is it intentional that you can use ALTER TABLE OWNER TO on the parent
> >> table, and that this does not recurse to modify the partitions' owners?
> >> This doesn't seem to be mentioned in comments nor documentation, so it
> >> seems an oversight to me.
> 
> Hmm, I would say of it that the new partitioning didn't modify the
> behavior that existed for inheritance.
> 
> That said, I'm not sure if the lack of recursive application of ownership
> change to descendant tables is unintentional.

My view is that the fact that partitioning uses inheritance is just an
implementation detail.  We shouldn't let historical behavior for
inheritance dictate behavior for partitioning.  Inheritance has many
undesirable warts.

> > The alter table docs say that ONLY must be specified if one does not
> > want to modify descendants, so I think this is a bug.
> 
> Just to clarify, if we do think of it as a bug, then it will apply to the
> inheritance case as well, right?

I'd leave it alone.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2017-10-17 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Robert Haas wrote:
> > Implement table partitioning.
> 
> Is it intentional that you can use ALTER TABLE OWNER TO on the parent
> table, and that this does not recurse to modify the partitions' owners?
> This doesn't seem to be mentioned in comments nor documentation, so it
> seems an oversight to me.

The alter table docs say that ONLY must be specified if one does not
want to modify descendants, so I think this is a bug.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2017-10-17 Thread Alvaro Herrera
Robert Haas wrote:
> Implement table partitioning.

Is it intentional that you can use ALTER TABLE OWNER TO on the parent
table, and that this does not recurse to modify the partitions' owners?
This doesn't seem to be mentioned in comments nor documentation, so it
seems an oversight to me.

Thoughts?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-17 Thread Alvaro Herrera
Justin Pryzby wrote:

> I'm happy to try the patch, but in case it makes any difference, we have few
> DBs/schemas:

I don't expect that it does.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-17 Thread Alvaro Herrera
Justin Pryzby wrote:

> #1  0x006a52e9 in perform_work_item (workitem=0x7f8ad1f94824) at 
> autovacuum.c:2676
> cur_datname = 0x298c740 "no 1 :vartype 1184 :vartypmod -1 :varcollid 
> 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 
> 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyval true :constisnull 
> fal"...
> cur_nspname = 0x298c728 "s ({VAR :varno 1 :varattno 1 :vartype 1184 
> :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 
> 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 
> :constbyv"...
> cur_relname = 0x298cd68 
> "cdrs_eric_msc_sms_2017_10_14_startofcharge_idx"
> __func__ = "perform_work_item"

cur_datname here seems corrupted -- it points halfway into cur_nspname,
which is also a corrupt value.  And I think that's because we're not
checking that the namespace OID is a valid value before calling
get_namespace_name on it.  And I'm betting that these values are all not
what we expect, because we're not checking that we're in the correct
database before trying to execute the work item.  I don't quite
understand how this results in an invalid string rather than just a
NULL, as I would have expected.

Anyway, can give this patch a try?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 0a53aaf589dfdbd2f25ae2ee36323d77c2910a60 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Tue, 17 Oct 2017 12:58:38 +0200
Subject: [PATCH] Fix autovacuum workitems

---
 src/backend/postmaster/autovacuum.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 776b1c0a9d..83366b862c 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2525,6 +2525,8 @@ deleted:
continue;
if (workitem->avw_active)
continue;
+   if (workitem->avw_database != MyDatabaseId)
+   continue;
 
/* claim this one, and release lock while performing it */
workitem->avw_active = true;
@@ -2592,6 +2594,7 @@ perform_work_item(AutoVacuumWorkItem *workitem)
char   *cur_datname = NULL;
char   *cur_nspname = NULL;
char   *cur_relname = NULL;
+   Oid cur_nspoid;
 
/*
 * Note we do not store table info in MyWorkerInfo, since this is not
@@ -2607,9 +2610,12 @@ perform_work_item(AutoVacuumWorkItem *workitem)
 */
 
cur_relname = get_rel_name(workitem->avw_relation);
-	cur_nspname = get_namespace_name(get_rel_namespace(workitem->avw_relation));
+   cur_nspoid = get_rel_namespace(workitem->avw_relation);
+   if (!cur_relname || !OidIsValid(cur_nspoid))
+   goto deleted2;
+   cur_nspname = get_namespace_name(cur_nspoid);
cur_datname = get_database_name(MyDatabaseId);
-   if (!cur_relname || !cur_nspname || !cur_datname)
+   if (!cur_nspname || !cur_datname)
goto deleted2;
 
autovac_report_workitem(workitem, cur_nspname, cur_datname);
-- 
2.11.0




Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains

2017-10-17 Thread Alvaro Herrera
Robert Haas wrote:

> I haven't really followed this thread in depth, but I'm very nervous
> about the idea that we should ever be examining the raw-xmin of a
> tuple that has been marked HEAP_XMIN_FROZEN for anything other than
> forensic purposes.

Yeah, me too.  If you see another way to fix the problem, let's discuss
it.

I think a possible way is to avoid considering that the relfrozenxid
value computed by the caller is final.  Something like this: if page
freezing sees that there is a HOT chain which would end up half-frozen
because of that freeze age, decrease the freeze xid enough that the
whole chain remains unfrozen; communicate the new value to caller so
that it is used as the true new relfrozenxid going forwards.  That
preserves the HOT chain intact so that it can be pruned and frozen
correctly in a later VACUUM call.  Need to be careful about HOT chains
containing members that are old enough to cause a long term problem
(i.e. a table where relfrozenxid doesn't move forward enough).

One thing I didn't quite investigate is why this bug only shows up with
multixacts so far.  Is it just because multixacts provide an easy way to
reproduce it, and that there are others, more difficult ways to cause
the same problem without involving multixacts?  If so, then the problem
is likely present in 9.2 as well.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-17 Thread Alvaro Herrera
Justin Pryzby wrote:
> On Sun, Oct 15, 2017 at 02:44:58PM +0200, Tomas Vondra wrote:
> > Thanks, but I'm not sure that'll help, at this point. We already know
> > what happened (corrupted memory), we don't know "how". And core files
> > are mostly just "snapshots" so are not very useful in answering that :-(
> 
> Is there anything I should be saving for these or hints how else to debug?

Sorry, I had missed this report.  I'll look into it today.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] ERROR: MultiXactId 3268957 has not been created yet -- apparent wraparound after missused pg_resetxlogs

2017-10-16 Thread Alvaro Herrera
RADIX Alain - externe wrote:

> I'm facing a problem with a PostgreSQL 9.6.2 reporting this error when 
> selecting a table
> ERROR:  MultiXactId 3268957 has not been created yet -- apparent wraparound
> 
> The root cause is not a Postgres bug but a buggy person that used 
> pg_resetxlogs obviously without reading or understanding the documentation.
> 
> I'm trying to extract data from the db to create a new one ;-)
> But pg_dump logically end with the same error.
> 
> I've seen the row with t_max = 3268957 using pageinspect; I might use it to
> extract the row from the page, but this will not be quite easy.

I'm not quite sure what it is that you want, other than to chastise the
person who deleted pg_wal.  Maybe pageinspect's heap_page_item_attrs()
function, new in 9.6, can help you extract data from tuples with
"corrupt" xmax.  Or perhaps it is easier to return the pg_control
multixact counters to the original values?

> Is there a way to change the t_max to a value that won't fire the error, with 
> this row.

You could try setting it to 0, and/or set/reset the xmax flags in
infomask.

> Hopefully, this is not a production database :-D but the user want to
> start qualification of the new application version today .. no luck
> for him.

Hurray.  The other good news is that one of your engineers is now an
experienced person.

> At least, there is something to learn, renaming pg_xlog to pg_wal won't stop 
> everybody :-/

Yay?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] relkind check in DefineIndex

2017-10-13 Thread Alvaro Herrera
The relkind check in DefineIndex has grown into an ugly rat's nest of
'if' statements.  I propose to change it into a switch, as per the
attached.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 1a7420321f7f48bbc87dfdd38ba10fa57c6513da Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Fri, 13 Oct 2017 17:56:44 +0200
Subject: [PATCH] reword kind check using switch

---
 src/backend/commands/indexcmds.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b61aaac284..3f615b6260 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -375,25 +375,24 @@ DefineIndex(Oid relationId,
relationId = RelationGetRelid(rel);
namespaceId = RelationGetNamespace(rel);
 
-	if (rel->rd_rel->relkind != RELKIND_RELATION &&
-		rel->rd_rel->relkind != RELKIND_MATVIEW)
+	/* Ensure that it makes sense to index this kind of relation */
+	switch (rel->rd_rel->relkind)
 	{
-		if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
-
-			/*
-			 * Custom error message for FOREIGN TABLE since the term is close
-			 * to a regular table and can confuse the user.
-			 */
+		case RELKIND_RELATION:
+		case RELKIND_MATVIEW:
+			/* OK */
+			break;
+		case RELKIND_FOREIGN_TABLE:
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 					 errmsg("cannot create index on foreign table \"%s\"",
 							RelationGetRelationName(rel))));
-		else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		case RELKIND_PARTITIONED_TABLE:
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 					 errmsg("cannot create index on partitioned table \"%s\"",
 							RelationGetRelationName(rel))));
-		else
+		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 					 errmsg("\"%s\" is not a table or materialized view",
-- 
2.11.0




Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.

2017-10-13 Thread Alvaro Herrera
Robert Haas wrote:
> Implement table partitioning.

> Currently, tables can be range-partitioned or list-partitioned.  List
> partitioning is limited to a single column, but range partitioning can
> involve multiple columns.  A partitioning "column" can be an
> expression.

I find the "partition strategy" thing a bit suspect code-wise.  I was a
bit bothered by all the "default:" clauses in switches that deal with
the possible values, and I was going to propose simply that we turn that
into an enum -- a trivial patch, I thought.  Not so: the way it's used
by the grammar is a bit odd.  Apparently, it starts life as a string
(either "list" or "range"), and then transformPartitionSpec() has to go
out of its way to return it as a char.

I propose we have gram.y itself resolve the strings to either 'list' or
'range' and that the node contains only those values, not any string.
Unless there is a reason why things are like this that I'm not seeing?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains

2017-10-12 Thread Alvaro Herrera
Peter Geoghegan wrote:
> On Fri, Oct 6, 2017 at 8:29 AM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:

> > /*
> >  * When a tuple is frozen, the original Xmin is lost, but we know it's a
> >  * committed transaction.  So unless the Xmax is InvalidXid, we don't 
> > know
> >  * for certain that there is a match, but there may be one; and we must
> >  * return true so that a HOT chain that is half-frozen can be walked
> >  * correctly.
> >  *
> >  * We no longer freeze tuples this way, but we must keep this in order 
> > to
> >  * interpret pre-pg_upgrade pages correctly.
> >  */
> > if (TransactionIdEquals(xmin, FrozenTransactionId) &&
> > TransactionIdIsValid(xmax))
> > return true;
> >
> > return false;
> > }
> 
> Wouldn't this last "if" test, to cover the pg_upgrade case, be better
> targeted by comparing *raw* xmin to FrozenTransactionId? You're using
> the potentially distinct xmin value returned by
> HeapTupleHeaderGetXmin() for the test here. I think we should be
> directly targeting tuples frozen on or before 9.4 (prior to
> pg_upgrade) instead.

Yes, agreed, I should change that.  Thanks for continuing to think about
this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Re: Extended statistics is not working on Vars hidden under a RelabelType

2017-10-12 Thread Alvaro Herrera
David Rowley wrote:

> The reason Tomas coded it the way it was coded is due to the fact that
> there's already code that works exactly the same way in
> clauselist_selectivity(). Personally, I don't particularly like that
> code, but I'd rather not invent a new way to do the same thing.

I pushed your original fix.  I still maintain that this should be
written differently, but I'm not going to try to change that in a hurry
and together with a bugfix.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Re: Extended statistics is not working on Vars hidden under a RelabelType

2017-10-12 Thread Alvaro Herrera
Tomas Vondra wrote:
> On 10/10/2017 05:03 AM, David Rowley wrote:
> > Basically, $subject is causing us not to properly find matching
> > extended stats in this case.
> > 
> > The attached patch fixes it.
> > 
> > The following test cases is an example of the misbehaviour. Note
> > rows=1 vs rows=98 in the Gather node.
> 
> Thanks for noticing this. The patch seems fine to me.

I propose this slightly larger change.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 2e7c0ad6ba..0a51639d9d 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -736,12 +736,9 @@ pg_dependencies_send(PG_FUNCTION_ARGS)
  * dependency_is_compatible_clause
  * Determines if the clause is compatible with functional 
dependencies
  *
- * Only OpExprs with two arguments using an equality operator are supported.
- * When returning True attnum is set to the attribute number of the Var within
- * the supported clause.
- *
- * Currently we only support Var = Const, or Const = Var. It may be possible
- * to expand on this later.
+ * Returns true, and sets *attnum to the attribute number of the Var in the
+ * clause, if the clause is compatible with functional dependencies on the
+ * given relation.  Otherwise, return false.
  */
 static bool
 dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
@@ -762,39 +759,47 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
if (is_opclause(rinfo->clause))
{
OpExpr *expr = (OpExpr *) rinfo->clause;
+   Node   *left;
+   Node   *right;
Var*var;
-   boolvaronleft = true;
-   boolok;
 
-		/* Only expressions with two arguments are considered compatible. */
+		/* Only expressions with two arguments are considered compatible ... */
if (list_length(expr->args) != 2)
return false;
 
-   /* see if it actually has the right */
-		ok = (NumRelids((Node *) expr) == 1) &&
-			(is_pseudo_constant_clause(lsecond(expr->args)) ||
-			 (varonleft = false,
-			  is_pseudo_constant_clause(linitial(expr->args))));
-
-   /* unsupported structure (two variables or so) */
-   if (!ok)
+   /* ... and only if they operate on a single relation */
+   if (NumRelids((Node *) expr) != 1)
return false;
 
/*
-* If it's not "=" operator, just ignore the clause, as it's not
-* compatible with functional dependencies.
-*
-	 * This uses the function for estimating selectivity, not the operator
-	 * directly (a bit awkward, but well ...).
+	 * See which one is the pseudo-constant one, if any. If neither is, the
+	 * clause is not compatible.
 */
-   if (get_oprrest(expr->opno) != F_EQSEL)
+   left = (Node *) linitial(expr->args);
+   right = (Node *) lsecond(expr->args);
+
+   if (IsA(left, Var) || IsA(left, RelabelType))
+   {
+   if (!is_pseudo_constant_clause(right))
+   return false;
+   var = (Var *) left;
+   }
+   else if (IsA(right, Var) || IsA(right, RelabelType))
+   {
+   if (!is_pseudo_constant_clause(left))
+   return false;
+   var = (Var *) right;
+   }
+   else
return false;
 
-   var = (varonleft) ? linitial(expr->args) : lsecond(expr->args);
-
-   /* We only support plain Vars for now */
-   if (!IsA(var, Var))
-   return false;
+   /*
+	 * We may ignore any RelabelType node above the operand.  (There won't
+	 * be more than one, since eval_const_expressions() has been applied
+	 * already.)
+*/
+   if (IsA(var, RelabelType))
+   var = (Var *) ((RelabelType *) var)->arg;
 
/* Ensure var is from the correct relation */
if (var->varno != relid)
@@ -808,6 +813,16 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
if (!AttrNumberIsForUserDefinedAttr(var->varattno))
return false;
 
+   /*
+* If it's not "=" operator, just ignore the clause, as it's not
+

Re: [HACKERS] Discussion on missing optimizations

2017-10-12 Thread Alvaro Herrera
Andres Freund wrote:
> On 2017-10-08 11:28:09 -0400, Tom Lane wrote:
> > Adam Brusselback  writes:
> > > On another note:
> > >> turning ORs into UNIONs
> > 
> > > This is another one which would be incredibly useful for me.  I've had
> > > to do this manually for performance reasons far too often.
> > 
> > Well, maybe you could sign up to help review the open patch for that then:
> > https://commitfest.postgresql.org/15/1001/
> > 
> > The reason that's not in v10 is we haven't been able to convince
> > ourselves whether it's 100% correct.
> 
> Unfortunately it won't help in this specific case (no support for UNION,
> just UNION ALL), but I thought it might be interesting to reference
> https://medium.com/@uwdb/introducing-cosette-527898504bd6
> here.

Interesting.  I thought about a completely different approach -- use a
fuzzer, which runs each generated query on two servers, one patched one
not, and compare the results.  Would it be possible to tweak sqlsmith to
do this?
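For illustration, the comparison loop of such a differential fuzzer could look like the sketch below.  The two runner callables are hypothetical stand-ins for connections to the patched and unpatched servers (with sqlsmith they would be psql/libpq sessions); nothing like this exists in sqlsmith today.

```python
from collections import Counter

def differential_fuzz(queries, run_patched, run_unpatched):
    """Run each generated query on two servers (one patched, one not)
    and collect the queries whose results differ.

    Results are compared as multisets of rows, since row order is not
    defined without ORDER BY.
    """
    suspicious = []
    for query in queries:
        if Counter(run_patched(query)) != Counter(run_unpatched(query)):
            suspicious.append(query)
    return suspicious
```

Any query collected this way is a candidate counterexample worth minimizing by hand; queries that error out on both servers would need extra handling not shown here.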

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] replace GrantObjectType with ObjectType

2017-10-12 Thread Alvaro Herrera
Michael Paquier wrote:
> On Thu, Oct 12, 2017 at 7:55 AM, Peter Eisentraut
>  wrote:
> > It seems to me that having ACL_OBJECT_* symbols alongside OBJECT_*
> > symbols is not useful and leads to duplication.  Digging around in the
> > past suggests that we used to have a lot of these command-specific
> > symbols but got rid of them over time, except that the ACL stuff was
> > never touched.  The attached patch accomplishes that.

+1 for this.

> -bool
> -EventTriggerSupportsGrantObjectType(GrantObjectType objtype)
> -{
> -   switch (objtype)
> -   {
> -   case ACL_OBJECT_DATABASE:
> -   case ACL_OBJECT_TABLESPACE:
> -   /* no support for global objects */
> -   return false;
> By removing that, if any GRANT/REVOKE support is added for another
> object type, then EventTriggerSupportsObjectType would return true by
> default instead of getting a compilation failure.

Yeah, we've found it useful to remove default: clauses in some switch
blocks so that we get compile warnings when we forget to add a new type
(c.f. commit e84c0195980f).  Let's not add any more of those.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with a lot of columns

2017-10-11 Thread Alvaro Herrera
Andres Freund wrote:
> On 2017-10-11 10:53:56 +0200, Alvaro Herrera wrote:

> > I wonder if it'd be a good idea to nag external users about pq_sendint
> > usage (is a #warning possible?).
> 
> Think we'd need some separate infrastructure, i.e. for gcc ending up
> with __attribute__((deprecated)). I don't quite see this being worth
> adding it, but ...

Probably not.

> > OTOH, do we really need to keep it
> > around?  Maybe we should ditch it, since obviously the compat shim can
> > be installed locally in each extension that really needs it (thinking
> > that most external code can simply be adjusted to the new functions).
> 
> That seems like causing unnecessary pain - we're talking about a few
> lines in a header here, right? It's not like they'll be trivially
> converting to pq_sendint$width anytime soon, unless we backpatch this.

Well, my concern is to ensure that extension authors take advantage of
the optimized implementation.  If we never let them know that we've
rewritten things, most are not going to realize they can make their
extensions faster with a very simple code change.  On the other hand, I
suppose that for the vast majority of extensions, doing the change is
unlikely to have a noticeable effect on performance, so perhaps we should just
keep the shim and move along.

If we do nothing, it's unlikely we'd ever get rid of the compat function.
Maybe add a comment suggesting to remove once pg10 is out of support?

> > I'm scared about the not-null-terminated stringinfo stuff.  Is it
> > possible to create bugs by polluting a stringinfo with it, then having
> > the stringinfo be used by unsuspecting code?  Admittedly, you can break
> > things already with the binary appends, so probably not an issue.
> 
> All of the converted sites already add integers into the StringInfo -
> and most of the those integers consist out of a couple bytes of 0,
> because they're lengths. So I don't think there's a huge danger here.

Right, agreed on that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with a lot of columns

2017-10-11 Thread Alvaro Herrera
Andres Freund wrote:
> Hi,
> 
> On 2017-10-03 13:58:37 -0400, Robert Haas wrote:
> > On Tue, Oct 3, 2017 at 12:23 PM, Andres Freund  wrote:
> > > Makes sense?
> > 
> > Yes.
> 
> Here's an updated version of this patchset.

Maybe it'd be a good idea to push 0001 with some user of restrict ahead
of the rest, just to see how older msvc reacts.

I wonder if it'd be a good idea to nag external users about pq_sendint
usage (is a #warning possible?).  OTOH, do we really need to keep it
around?  Maybe we should ditch it, since obviously the compat shim can
be installed locally in each extension that really needs it (thinking
that most external code can simply be adjusted to the new functions).

I'm scared about the not-null-terminated stringinfo stuff.  Is it
possible to create bugs by polluting a stringinfo with it, then having
the stringinfo be used by unsuspecting code?  Admittedly, you can break
things already with the binary appends, so probably not an issue.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_regress help output

2017-10-11 Thread Alvaro Herrera
Tom Lane wrote:
> Joe Conway  writes:
> > I have been annoyed at least twice now by the lack of pg_regress command
> > line help output for the "--bindir=" option. In passing I noted
> > that there was no output for "--help" or "--version" options either.
> 
> > Any objections to the attached?
> 
> +1 for documenting it, but the phrasing seems a bit awkward:
> 
> ! printf(_("  --bindir=BINPATH  use BINPATH for programs that 
> are run;\n"));
> ! printf(_("if BINPATH empty, use PATH 
> from the environment\n"));
> 
> Maybe just "if empty, use PATH ..." ?
> 
> Also, why is the patch apparently changing whitespace in all the help
> lines?  Seems like that will create a lot of make-work for translators.

I think we don't have translations for pg_regress.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] On markers of changed data

2017-10-10 Thread Alvaro Herrera
Greg Stark wrote:

> The general shape of what I would like to see is some log which lists
> where each checkpoint starts and ends and what blocks are modified
> since the previous checkpoint. Then to generate an incremental backup
> from any point in time to the current you union all the block lists
> between them and fetch those blocks. There are other ways of using
> this aside from incremental backups on disk too -- you could imagine a
> replica that has fallen behind requesting the block lists and then
> fetching just those blocks instead of needing to receive and apply all
> the wal.

Hmm, this sounds pretty clever.  And we already have the blocks touched
by each record thanks to the work for pg_rewind (so we don't have to do
any nasty tricks like the stuff Suzuki-san did for pg_lesslog, where
each WAL record had to be processed individually to know what blocks it
referenced), so it shouldn't be *too* difficult ...
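The union-of-block-lists idea is simple enough to sketch (names and the checkpoint-log shape below are invented for illustration, not any existing tool's format):

```python
def blocks_for_incremental_backup(checkpoint_log, base, target):
    """Union the per-checkpoint modified-block lists.

    checkpoint_log maps a checkpoint number to the set of
    (relfilenode, blkno) pairs modified since the previous checkpoint.
    An incremental backup taking a base backup at checkpoint 'base' up
    to checkpoint 'target' needs exactly the union of the lists of
    every checkpoint in between.
    """
    blocks = set()
    for ckpt, modified in checkpoint_log.items():
        if base < ckpt <= target:
            blocks |= modified
    return blocks
```

The same union, computed on a lagging replica's behalf, would give the set of blocks it could fetch instead of replaying all the intervening WAL.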

> It would also be useful for going in the reverse direction: look up
> all the records (or just the last record) that modified a given block.

Well, a LSN map is what I was suggesting.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-10 Thread Alvaro Herrera
Robert Haas wrote:
> On Fri, Oct 6, 2017 at 12:37 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:
> > 2. create one index for each existing partition.  These would be
> >identical to what would happen if you created the index directly on
> >each partition, except that there is an additional dependency to the
> >parent's abstract index.
> 
> One thing I'm a bit worried about is how to name these subordinate
> indexes.  They have to have names because that's how pg_class works,
> and those names can't all be the same, again because that's how
> pg_class works.  There's no problem right away when you first create
> the partitioned index, because you can just pick names out of a hat
> using whatever name-generating algorithm seems best.  However, when
> you dump-and-restore (including but not limited to the pg_upgrade
> case) you've got to preserve those names.  If you just generate a new
> name that may or may not be the same as the old one, then it may
> collide with a user-specified name that only occurs later in the dump.
> Also, you'll have trouble if the user has applied a COMMENT or a
> SECURITY LABEL to the index because that command works by name, or if
> the user has a reference to the index name inside a function or
> whatever.
> 
> These are pretty annoying corner-case bugs because they're not likely
> to come up very often.  Most people won't notice or care if the index
> name changes.  But I don't think it's acceptable to just ignore the
> problem.

I agree it's a problem that needs to be addressed directly.

> An idea I had was to treat the abstract index - to use your
> term - sort of the way we treat an extension.  Normally, when you
> create an index on a partitioned table, it cascades, but for dump and
> restore purpose, we tag on some syntax that says "well, don't actually
> create the subordinate indexes, i'll tell you about those later".
> Then for each subordinate index we issue a separate CREATE INDEX
> command followed by ALTER INDEX abstract_index ATTACH PARTITION
> concrete_index or something of that sort.  That means you can't
> absolutely count on the parent index to have all of the children it's
> supposed to have but maybe that's OK.

Hmm ... yeah, ATTACH and DETACH sound acceptable to me.  On DETACH, the
abstract index should be marked indisvalid=false unless a substitute
index already exists; and on ATTACH when indisvalid=false we verify that
all local indexes exist, and if so we can flip indisvalid.  That way, we
can continue to rely on the parent index always having all its children
when the flag is set.
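As a toy model of those validity rules (invented names, nothing like actual catalog code):

```python
class AbstractIndex:
    """Model of the proposed indisvalid handling for a partitioned
    ("abstract") index.  'partitions' is the set of partitions the
    parent table has; 'children' the concrete indexes attached so far.
    """
    def __init__(self, partitions):
        self.partitions = set(partitions)
        self.children = set()
        self.indisvalid = False

    def attach(self, partition):
        self.children.add(partition)
        # Flip to valid only once every partition has a concrete index,
        # so a valid parent always has all its children.
        if self.children >= self.partitions:
            self.indisvalid = True

    def detach(self, partition):
        self.children.discard(partition)
        # Simplified: a real implementation would keep indisvalid=true
        # if a substitute index for that partition already exists.
        self.indisvalid = False
```

The invariant this preserves is the one described above: planner code may rely on the parent index covering every partition whenever the flag is set.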

I'm not clear on a syntax that creates the main index and hopes to later
have the sub-indexes created.  Another approach is to do it the other
way around, i.e. create the children first, then once they're all in
place create the main one normally, which merely verifies that all the
requisite children exist.  This is related to what I proposed to occur
when a regular table is joined as a partition of the partitioned table:
we run a verification that an index matching the parent's abstract
indexes exists, and if not we raise an error.  (Alternatively we could
allow the case, and mark the abstract index as indisvalid=false, but
that seems to violate POLA).

> Another thing that would let you do is CREATE INDEX CONCURRENTLY
> replacement_concrete_index; ALTER INDEX abstract_index DETACH
> PARTITION old_concrete_index, ATTACH PARTITION
> replacement_concrete_index; DROP INDEX CONCURRENTLY
> old_concrete_index, which seems like a thing someone might want to do.

Yeah, this is a point I explicitly mentioned, and this proposal seems
like a good way.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-10 Thread Alvaro Herrera
Wood, Dan wrote:
> I’m unclear on what is being repro’d in 9.6.  Are you getting the
> duplicate rows problem or just the reindex problem?  Are you testing
> with asserts enabled(I’m not)?

I was seeing just the reindex problem.  I don't see any more dups.

But I've tried to reproduce it afresh now, and let it run for a long
time and nothing happened.  Maybe I made a mistake last week and
ran an unfixed version.  I don't see any more problems now.

> If you are getting the dup rows consider the code in the block in
> heapam.c that starts with the comment “replace multi by update xid”.
>
> When I repro this I find that MultiXactIdGetUpdateXid() returns 0.
> There is an updater in the multixact array however the status is
> MultiXactStatusForNoKeyUpdate and not MultiXactStatusNoKeyUpdate.  I
> assume this is a preliminary status before the following row in the
> hot chain has its multixact set to NoKeyUpdate.

Yes, the "For" version is the locker version rather than the actual
update.  That lock is acquired by EvalPlanQual locking the row just
before doing the update.  I think GetUpdateXid has no reason to return
such an Xid, since it's not an update.

> Since a 0 is returned this does precede cutoff_xid and
> TransactionIdDidCommit(0) will return false.  This ends up aborting
> the multixact on the row even though the real xid is committed.  This
> sets XMAX to 0 and that row becomes visible as one of the dups.
> Interestingly the real xid of the updater is 122944 and the cutoff_xid
> is 122945.

I haven't seen this effect.  Please keep us updated if you're able to
verify corruption this way.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-07 Thread Alvaro Herrera
Peter Geoghegan wrote:
> On Sat, Oct 7, 2017 at 1:31 AM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:
> >> As you must have seen, Alvaro said he has a variant of Dan's original
> >> script that demonstrates that a problem remains, at least on 9.6+,
> >> even with today's fix. I think it's the stress-test that plays with
> >> fillfactor, many clients, etc [1].
> >
> > I just execute setup.sql once and then run this shell command,
> >
> > while :; do
> > psql -e -P pager=off -f ./repro.sql
> > for i in `seq 1 5`; do
> > psql -P pager=off -e --no-psqlrc -f ./lock.sql &
> > done
> > wait && psql -P pager=off -e --no-psqlrc -f ./reindex.sql
> > psql -P pager=off -e --no-psqlrc -f ./report.sql
> > echo "done"
> > done
> 
> I cannot reproduce the problem on my personal machine using this
> script/stress-test. I tried to do so on the master branch git tip.
> This reinforces the theory that there is some timing sensitivity,
> because the remaining race condition is very narrow.

Hmm, I think I added a random sleep (max. 100ms) right after the
HeapTupleSatisfiesVacuum call in vacuumlazy.c (lazy_scan_heap), and that
makes the race easier to hit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] On markers of changed data

2017-10-07 Thread Alvaro Herrera
Michael Paquier wrote:

> That’s actually what pg_rman is doing for what it calls incremental
> backups (perhaps that would be differential backup in PG
> terminology?), and the performance is bad as you can imagine. We could
> have a dedicated LSN map to do such things with 4 bytes per page. I am
> still not convinced that this much facility and the potential bug
> risks are worth it though, Postgres already knows about differential
> backups if you shape it as a delta of WAL segments. I think that, in
> order to find a LSN map more convincing, we should find first other
> use cases where it could become useful. Some use cases may pop up with
> VACUUM, but I have not studied the question hard enough...

The case I've discussed with barman developers is a large database
(couple dozen of TBs should be enough) where a large fraction (say 95%)
is read-only but there are many changes to the active part of the data,
so that the WAL volume is larger than the size of the active data.
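To illustrate how such an LSN map would drive a differential backup (a sketch with invented names, not pg_rman's or barman's actual code):

```python
def pages_to_back_up(lsn_map, base_backup_lsn):
    """Select pages for a differential backup using a per-page LSN map.

    lsn_map[blkno] holds the LSN of the last WAL record that touched
    that page (the "4 bytes per page" map discussed above, widened
    here to a plain integer).  A differential backup needs only the
    pages whose LSN is newer than the base backup's start LSN --
    cheap when 95% of a large database is read-only, and much cheaper
    than replaying or shipping the full WAL for the busy 5%.
    """
    return [blkno for blkno, lsn in enumerate(lsn_map)
            if lsn > base_backup_lsn]
```

The same lookup answers the reverse question only partially: the map records the last LSN to touch each page, not the full history.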

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-07 Thread Alvaro Herrera
Peter Geoghegan wrote:
> On Fri, Oct 6, 2017 at 2:09 PM, Wong, Yi Wen  wrote:
> > Yesterday, I've been spending time with pg_visibility on the pages when I 
> > reproduce the issue in 9.6.
> > None of the all-frozen or all-visible bits are necessarily set in 
> > problematic pages.
> 
> Since this happened yesterday, I assume it was with an unfixed version?
> 
> As you must have seen, Alvaro said he has a variant of Dan's original
> script that demonstrates that a problem remains, at least on 9.6+,
> even with today's fix. I think it's the stress-test that plays with
> fillfactor, many clients, etc [1].

I just execute setup.sql once and then run this shell command,

while :; do
psql -e -P pager=off -f ./repro.sql
for i in `seq 1 5`; do
psql -P pager=off -e --no-psqlrc -f ./lock.sql &
done
wait && psql -P pager=off -e --no-psqlrc -f ./reindex.sql 
psql -P pager=off -e --no-psqlrc -f ./report.sql
echo "done"
done

Note that you need to use pg10's psql because of the \if lines in
lock.sql.  For some runs I change the values to compare random() to, and
originally the commented out section in lock.sql was not commented out,
but I'm fairly sure the failures I saw were with this version.  Also, I
sometimes change the 5 in the `seq` command to higher values (180, 250).

I didn't find the filler column to have any effect, so I took that out.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
begin;

values ('for key share') /*, ('') */ order by random() limit 1 \gset

select pg_sleep(random() * 0.03);
select id from t where id=3 :column1 ;

select random() > .0 as update \gset
\if :update
select pg_sleep(random() * 0.03);
update t set x=x+1 where id=3;
\endif

/*
select random() > .0 as update2 \gset
\if :update2
savepoint foo;
update t set x=x+1 where id=3;
\endif
select random() > .5 as rollback_to \gset
\if :rollback_to
rollback to foo;
\endif
*/

select random() > .0 as commit \gset

\if :commit
   commit;
\else
   rollback;
\endif

select pg_sleep(random() * 0.03);
vacuum freeze t;

reindex index t_pkey;


with
pages (blkno) as (select generate_series::int from generate_series(0, 
pg_relation_size('t')/current_setting('block_size')::int - 1)),
rawpages (blkno, pg) as (select blkno, get_raw_page from pages, 
get_raw_page('t', blkno)),
heapitems as (select blkno, heap_page_items.* from rawpages, 
heap_page_items(pg))
select 
blkno, lp, lp_flags, lp_off, t_xmin, t_xmax, t_ctid,
infomask(t_infomask, 1) as infomask, infomask(t_infomask2, 2) as infomask2
from heapitems where lp_off <> 0;
drop table t;
create table t (id int primary key, name char(3), x integer)
-- with (fillfactor = 10)
;

insert into t values (1, '111', 0);
insert into t values (3, '333', 0);
create type infomask_bit_desc as (mask varbit, symbol text);
create or replace function infomask(msk int, which int) returns text
language plpgsql as $$
declare
r infomask_bit_desc;
str text = '';
append_bar bool = false;
begin
for r in select * from infomask_bits(which) loop
if (msk::bit(16) & r.mask)::int <> 0 then
if append_bar then
str = str || '|';
end if;
append_bar = true;
str = str || r.symbol;
end if;
end loop;
return str;
end;
$$ ;
create or replace function infomask_bits(which int)
returns setof infomask_bit_desc
language plpgsql as $$
begin
if which = 1 then
return query values
(x'8000'::varbit, 'MOVED_IN'),
(x'4000', 'MOVED_OFF'),
(x'2000', 'UPDATED'),
(x'1000', 'XMAX_IS_MULTI'),
(x'0800', 'XMAX_INVALID'),
(x'0400', 'XMAX_COMMITTED'),
(x'0200', 'XMIN_INVALID'),
(x'0100', 'XMIN_COMMITTED'),
(x'0080', 'XMAX_LOCK_ONLY'),
(x'0040', 'EXCL_LOCK'),
(x'0020', 'COMBOCID'),
(x'0010', 'XMAX_KEYSHR_LOCK'),
(x'0008', 'HASOID'),
(x'0004', 'HASEXTERNAL'),
(x'0002', 'HASVARWIDTH'),
(x'0001', 'HASNULL');
elsif which = 2 then
return query values
(x'2000'::varbit, 'UPDATE_KEY_REVOKED'),
(x'4000', 'HOT_UPDATED'),
(x'8000', 'HEAP_ONLY_TUPLE');
end if;
end;
$$;

create extension if not exists pageinspect;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-06 Thread Alvaro Herrera
Peter Geoghegan wrote:
> On Fri, Oct 6, 2017 at 6:18 AM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:
> > Wood, Dan wrote:
> >> Yes, I’ve been testing 9.6.  I’ll try Alvaro’s patch today.
> >>
> >> I would prefer to focus on either latest 9X or 11dev.
> >
> > I tested my patch on 9.4 and 9.5 today and it seems to close the problem
> > (with the patch, I waited 10x as many iterations as it took for the
> > problem to occur ~10 times without the patch), but I can reproduce a
> > problem in 9.6 with my patch installed.  There must be something new in
> > 9.6 that is causing the problem to reappear.
> 
> What problem persists? The original one (or, at least, the original
> symptom of pruning HOT chains incorrectly)? If that's what you mean, I
> wouldn't be so quick to assume that it's the freeze map.

I can tell that, in 9.6, REINDEX still reports the error we saw in
earlier releases, after some of the runs of my reproducer scripts.  I'm
unable to reproduce it anymore in 9.3 to 9.5.  I can't see the one Dan
originally reported anywhere, either.

I don't know if it's really the freeze map at fault or something else.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-06 Thread Alvaro Herrera
Hello

I've been thinking about this issue too.  I think your patch is not
ambitious enough.  Here's my brain dump on the issue, to revive
discussion.

As you propose, IMO this new feature would use the standard index
creation syntax:
   CREATE [UNIQUE] INDEX someindex ON parted_table (a, b);

This command currently throws an error.  We'd have it do two things:

1. create catalog rows in pg_class and pg_index for the main table,
   indicating the existance of this partitioned index.  These would not
   point to an actual index (since the main table itself is empty), but
   instead to an "abstract" index.  This abstract index can be used by
   various operations; see below.

2. create one index for each existing partition.  These would be
   identical to what would happen if you created the index directly on
   each partition, except that there is an additional dependency to the
   parent's abstract index.

If partitions are themselves partitioned, we would recursively apply the
indexes to the sub-partitions by doing (1) above for the partitioned
partition.

Once the index has been created for all existing partitions, the
hierarchy-wide index becomes valid and can be used normally by the
planner/executor.  I think we could use the pg_index.indisvalid property
for the abstract index for this.

I propose that an index declared UNIQUE throws an error for now.  We can
implement uniqueness (for the case where the indexed columns match the
partitioning key) later, once we sort out all the issues here first.  I
think unique indexes would be very useful even with that limitation, but
let's have it as a separate project.

I think using pg_depend as the mechanism to link partition indexes to
parent is a bad idea.  How about pg_inherits instead?  Seems more
appropriate.


Creating hierarchy-wide indexes for existing partitioned tables is likely
to take a long time, so we must include CONCURRENTLY as an option.  This
will need some transaction machinery support in order to ensure that
each partition gets its index created at some point in a long chain of
transactions, and that the whole thing is marked valid only at the end.
Also, if the creation is interrupted (because of a crash or a regular
shutdown), it'll be useful to be able to continue rather than being
forced to start from scratch.

When a new partition is added, indexes satisfying the partitioned
table's abstract indexes are created automatically.

During ALTER TABLE ... ATTACH PARTITION, we check that an index
satisfying the abstract index exists (and we create a pg_inherits link).
If not, the command is aborted.

REINDEX is easily supported (just reindex each partition's index
individually), but that command is blocking, which is not good.  For
non-partitioned tables, the typical pattern for avoiding the blocking is
to create an identical index with CREATE INDEX CONCURRENTLY and then
drop the original one.  That's not going to work with these partitioned
indexes, which is going to be a problem.  I don't have any great ideas
about this part yet.

We need to come up with some way to generate names for each partition
index.


I am going to work on this now.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-06 Thread Alvaro Herrera
By the way, I still wonder if there's any way for a new tuple to get
inserted in the place where a HOT redirect would be pointing to, and
have it be marked as Frozen, where the old redirect contains a
non-invalid Xmax.  I tried to think of a way for that to happen, but
couldn't think of anything.

What I imagine is a sequence like this:

1. insert a tuple
2. HOT-update a tuple
3. prune the page, making lp 1 be a redirect (lp 2 is the new tuple)
4. start transaction
5. HOT-update the tuple again, creating HOT in lp 3
6. abort transaction (leaving aborted update in lp 3)
7. somehow remove tuple from lp 3, make slot available
8. new transaction comes along, inserts tuple in lp 3
9. somebody freezes tuple in lp3 (???)

Then we follow the HOT chain, see that Xmin=2 in lp3 and conclude that
the tuple is part of the chain because of an xid "match".

Basically from step 7 onwards I don't think this is possible, but maybe
I'm just blind.


Maybe we can forestall the problem by checking whether the Xmax
TransactionIdIsCurrentTransaction || TransactionIdDidCommit (or some
variant thereof).  This would be very slow but safer; and in 9.4 and up
we'd only need to do it if the xmin value is actually FrozenXid which
should be rare (only in pages upgraded from 9.3).

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] On markers of changed data

2017-10-06 Thread Alvaro Herrera
Michael Paquier wrote:

> The only sane method for Postgres is really to scan the
> page header LSNs, and of course you already know that.

I hope the idea is not to have to scan every single page in the
database, because that would be too slow.  It should be possible to
build this so that a single summary LSN is kept for a largish group of
pages, allowing large portions of the database to be skipped from even
being read if they are known not to contain any page newer than the
previous backup.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-06 Thread Alvaro Herrera
Michael Paquier wrote:
> On Fri, Oct 6, 2017 at 10:18 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:
> > Wood, Dan wrote:
> >> Yes, I’ve been testing 9.6.  I’ll try Alvaro’s patch today.
> >>
> >> I would prefer to focus on either latest 9X or 11dev.
> >
> > I tested my patch on 9.4 and 9.5 today and it seems to close the problem
> > (with the patch, I waited 10x as many iterations as it took for the
> > problem to occur ~10 times without the patch), but I can reproduce a
> > problem in 9.6 with my patch installed.  There must be something new in
> > 9.6 that is causing the problem to reappear.
> 
> The freeze visibility map has been introduced in 9.6... There could be
> interactions on this side.

Ah, thanks for the tip.  I hope the authors of that can do the gruntwork
of researching this problem there, then.  I'll go commit what I have
now.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-06 Thread Alvaro Herrera
Wood, Dan wrote:
> Yes, I’ve been testing 9.6.  I’ll try Alvaro’s patch today.
> 
> I would prefer to focus on either latest 9X or 11dev.  

I tested my patch on 9.4 and 9.5 today and it seems to close the problem
(with the patch, I waited 10x as many iterations as it took for the
problem to occur ~10 times without the patch), but I can reproduce a
problem in 9.6 with my patch installed.  There must be something new in
9.6 that is causing the problem to reappear.

> Does Alvaro’s patch presume any of the other patch to set COMMITTED in
> the freeze code?

I don't know what you mean.  Here is the 9.6 version of my patch.  Note
that HEAP_XMIN_FROZEN (which uses the XMIN_COMMITTED bit as I recall)
was introduced in 9.4.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 46ca12d56402d33a78ea0e13367d1e0e25a474dd Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Fri, 6 Oct 2017 14:11:34 +0200
Subject: [PATCH] Fix bugs

---
 src/backend/access/heap/heapam.c| 52 +
 src/backend/access/heap/pruneheap.c |  4 +--
 src/backend/executor/execMain.c |  6 ++---
 src/include/access/heapam.h |  3 +++
 4 files changed, 54 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b41f2a2fdd..10916b140e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2060,8 +2060,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation 
relation, Buffer buffer,
 * broken.
 */
if (TransactionIdIsValid(prev_xmax) &&
-   !TransactionIdEquals(prev_xmax,
-
HeapTupleHeaderGetXmin(heapTuple->t_data)))
+   !HeapTupleUpdateXmaxMatchesXmin(prev_xmax, 
heapTuple->t_data))
break;
 
/*
@@ -2244,7 +2243,7 @@ heap_get_latest_tid(Relation relation,
 * tuple.  Check for XMIN match.
 */
if (TransactionIdIsValid(priorXmax) &&
-   !TransactionIdEquals(priorXmax, 
HeapTupleHeaderGetXmin(tp.t_data)))
+   !HeapTupleUpdateXmaxMatchesXmin(priorXmax, tp.t_data))
{
UnlockReleaseBuffer(buffer);
break;
@@ -2276,6 +2275,50 @@ heap_get_latest_tid(Relation relation,
}   /* end of loop 
*/
 }
 
+/*
+ * HeapTupleUpdateXmaxMatchesXmin - verify update chain xmax/xmin lineage
+ *
+ * Given the new version of a tuple after some update, verify whether the
+ * given Xmax (corresponding to the previous version) matches the tuple's
+ * Xmin, taking into account that the Xmin might have been frozen after the
+ * update.
+ */
+bool
+HeapTupleUpdateXmaxMatchesXmin(TransactionId xmax, HeapTupleHeader htup)
+{
+   TransactionId   xmin = HeapTupleHeaderGetXmin(htup);
+
+   /*
+* If the xmax of the old tuple is identical to the xmin of the new one,
+* it's a match.
+*/
+   if (TransactionIdEquals(xmax, xmin))
+   return true;
+
+   /*
+* If the Xmin that was in effect prior to a freeze matches the Xmax,
+* it's good too.
+*/
+   if (HeapTupleHeaderXminFrozen(htup) &&
+   TransactionIdEquals(HeapTupleHeaderGetRawXmin(htup), xmax))
+   return true;
+
+   /*
+* When a tuple is frozen, the original Xmin is lost, but we know it's a
+* committed transaction.  So unless the Xmax is InvalidXid, we don't 
know
+* for certain that there is a match, but there may be one; and we must
+* return true so that a HOT chain that is half-frozen can be walked
+* correctly.
+*
+* We no longer freeze tuples this way, but we must keep this in order 
to
+* interpret pre-pg_upgrade pages correctly.
+*/
+   if (TransactionIdEquals(xmin, FrozenTransactionId) &&
+   TransactionIdIsValid(xmax))
+   return true;
+
+   return false;
+}
 
 /*
  * UpdateXmaxHintBits - update tuple hint bits after xmax transaction ends
@@ -5693,8 +5736,7 @@ l4:
 * end of the chain, we're done, so return success.
 */
if (TransactionIdIsValid(priorXmax) &&
-   
!TransactionIdEquals(HeapTupleHeaderGetXmin(mytup.t_data),
-priorXmax))
+   !HeapTupleUpdateXmaxMatchesXmin(priorXmax, 
mytup.t_data))
{
result = HeapTupleMayBeUpdated;
goto out_locked;
diff --git a/src/backend/access/heap/pruneheap.c 

Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-06 Thread Alvaro Herrera
Michael Paquier wrote:
> On Fri, Oct 6, 2017 at 1:24 AM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:

> +   /*
> +* If the xmax of the old tuple is identical to the xmin of the new one,
> +* it's a match.
> +*/
> +   if (xmax == xmin)
> +   return true;
> I would use TransactionIdEquals() here, to remember once you switch
> that to a macro.

I've had second thoughts about the macro thing -- for now I'm keeping it
a function, actually.

> +/*
> + * Given a tuple, verify whether the given Xmax matches the tuple's Xmin,
> + * taking into account that the Xmin might have been frozen.
> + */
> [...]
> +   /*
> +* We actually don't know if there's a match, but if the previous tuple
> +* was frozen, we cannot really rely on a perfect match.
> +*/

I don't know what you had in mind here, but I tweaked the 9.3 version so
that it now looks like this:

/*
 * HeapTupleUpdateXmaxMatchesXmin - verify update chain xmax/xmin lineage
 *
 * Given the new version of a tuple after some update, verify whether the
 * given Xmax (corresponding to the previous version) matches the tuple's
 * Xmin, taking into account that the Xmin might have been frozen after the
 * update.
 */
bool
HeapTupleUpdateXmaxMatchesXmin(TransactionId xmax, HeapTupleHeader htup)
{
TransactionId   xmin = HeapTupleHeaderGetXmin(htup);

/*
 * If the xmax of the old tuple is identical to the xmin of the new one,
 * it's a match.
 */
if (TransactionIdEquals(xmax, xmin))
return true;

/*
 * When a tuple is frozen, the original Xmin is lost, but we know it's a
 * committed transaction.  So unless the Xmax is InvalidXid, we don't
 * know for certain that there is a match, but there may be one; and we
 * must return true so that a HOT chain that is half-frozen can be 
walked
 * correctly.
 */
if (TransactionIdEquals(xmin, FrozenTransactionId) &&
TransactionIdIsValid(xmax))
return true;

return false;
}


-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()

2017-10-06 Thread Alvaro Herrera
Michael Paquier wrote:
> On Tue, Oct 3, 2017 at 12:54 AM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:

> > I'd argue about this in the same direction I argued about
> > BufferGetPage() needing an LSN check that's applied separately: if it's
> > too easy for a developer to do the wrong thing (i.e. fail to apply the
> > separate LSN check after reading the page) then the routine should be
> > patched to do the check itself, and another routine should be offered
> > for the cases that explicitly can do without the check.
> >
> > I was eventually outvoted in the BufferGetPage() saga, though, so I'm
> > not sure that that discussion sets precedent.
> 
> Oh... I don't recall this discussion. A quick lookup at the archives
> does not show me a specific thread either.

Just search for "Æsop" in the archives:
https://www.postgresql.org/message-id/CACjxUsPPCbov6DDPnuGpR=fmxhsjsn_mri3rjygvbrmcrff...@mail.gmail.com

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-05 Thread Alvaro Herrera
I think this is the patch for 9.3.  I ran the test a few hundred times
(with some additional changes such as randomly having an update inside a
savepoint that's randomly aborted, randomly aborting the transaction,
randomly skipping the for key share lock, randomly sleeping at a few
points; and also adding filler columns, reducing fillfactor and using
5, 50, 180, 250, 500 sessions after verifying that it causes the tuples
to stay in the same page or migrate to later pages).  The final REINDEX
has never complained again about failing to find the root tuple.  I hope
it's good now.

The attached patch needs a few small tweaks, such as improving
commentary in the new function, maybe turn it into a macro (otherwise I
think it could be bad for performance; I'd like a static func but not
sure those are readily available in 9.3), change the XID comparison to
use the appropriate macro rather than ==, and such.

Regarding changes of xmin/xmax comparison, I also checked manually the
spots I thought should be modified and later double-checked against the
list that Michael posted.  It's a match, except for rewriteheap.c, which
I cannot make heads or tails of.  (I think it's rather unfortunate
that it sticks a tuple's Xmax into a field that's called Xmin, but let's
put that aside).  Maybe there's a problem here, maybe there isn't.

I'm now going to forward-port this to 9.4.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e00dc6c1ca..87bce0e7ea 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1718,8 +1718,7 @@ heap_hot_search_buffer(ItemPointer tid, Relation 
relation, Buffer buffer,
 * broken.
 */
if (TransactionIdIsValid(prev_xmax) &&
-   !TransactionIdEquals(prev_xmax,
-
HeapTupleHeaderGetXmin(heapTuple->t_data)))
+   !HeapTupleUpdateXmaxMatchesXmin(prev_xmax, 
heapTuple->t_data))
break;
 
/*
@@ -1888,7 +1887,7 @@ heap_get_latest_tid(Relation relation,
 * tuple.  Check for XMIN match.
 */
if (TransactionIdIsValid(priorXmax) &&
- !TransactionIdEquals(priorXmax, 
HeapTupleHeaderGetXmin(tp.t_data)))
+   !HeapTupleUpdateXmaxMatchesXmin(priorXmax, tp.t_data))
{
UnlockReleaseBuffer(buffer);
break;
@@ -1920,6 +1919,36 @@ heap_get_latest_tid(Relation relation,
}   /* end of loop 
*/
 }
 
+/*
+ * Given a tuple, verify whether the given Xmax matches the tuple's Xmin,
+ * taking into account that the Xmin might have been frozen.
+ */
+bool
+HeapTupleUpdateXmaxMatchesXmin(TransactionId xmax, HeapTupleHeader htup)
+{
+   TransactionId   xmin = HeapTupleHeaderGetXmin(htup);
+
+   /* xmax must not be invalid or frozen */
+   Assert(TransactionIdIsNormal(xmax));
+
+   /*
+* If the xmax of the old tuple is identical to the xmin of the new one,
+* it's a match.
+*/
+   if (xmax == xmin)
+   return true;
+
+   /* FIXME Here we need to check the HEAP_XMIN_FROZEN in 9.4 and up */
+
+   /*
+* We actually don't know if there's a match, but if the previous tuple
+* was frozen, we cannot really rely on a perfect match.
+*/
+   if (xmin == FrozenTransactionId)
+   return true;
+
+   return false;
+}
 
 /*
  * UpdateXmaxHintBits - update tuple hint bits after xmax transaction ends
@@ -5045,8 +5074,7 @@ l4:
 * the end of the chain, we're done, so return success.
 */
if (TransactionIdIsValid(priorXmax) &&
-   
!TransactionIdEquals(HeapTupleHeaderGetXmin(mytup.t_data),
-priorXmax))
+   !HeapTupleUpdateXmaxMatchesXmin(priorXmax, 
mytup.t_data))
{
UnlockReleaseBuffer(buf);
return HeapTupleMayBeUpdated;
@@ -5500,7 +5528,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (TransactionIdPrecedes(xid, cutoff_xid))
{
if (TransactionIdDidCommit(xid))
+   {
+   xid = FrozenTransactionId;
*flags = FRM_MARK_COMMITTED | 
FRM_RETURN_IS_XID;
+   }
else
{
*flags |= FRM_INVALIDATE_XMAX;
diff --git 

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-05 Thread Alvaro Herrera
Robert Haas wrote:

> Regarding nomenclature and my previous griping about wisdom, I was
> wondering about just calling this a "partition join" like you have in
> the regression test.  So the GUC would be enable_partition_join, you'd
> have generate_partition_join_paths(), etc.  Basically just delete
> "wise" throughout.

If I understand correctly, what's being used here is the "-wise" suffix,
unrelated to wisdom, which Merriam Webster lists as "adverb combining
form" here https://www.merriam-webster.com/dictionary/wise (though you
have to scroll down a lot), which is defined as

1 a :in the manner of  * crabwise * fanwise
  b :in the position or direction of  * slantwise * clockwise
2 :with regard to :in respect of * dollarwise

According to that, the right way to write this is "partitionwise join"
(no dash), which means "join in respect of partitions", "join with
regard to partitions".

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Logging idle checkpoints

2017-10-05 Thread Alvaro Herrera
Kyotaro HORIGUCHI wrote:

> # This reminded me of a concern. I'd like to count vacuums that
> # are required but skipped by lock-failure, or killed by other
> # backend.

We clearly need to improve the stats and logs related to vacuuming work
executed, both by autovacuum and manually invoked.  One other item I
have in my head is to report numbers related to the truncation phase of
a vacuum run, since in some cases it causes horrible and hard to
diagnose problems.  (Also, add a reloption to stop vacuum from doing
the truncation phase at all -- for some usage patterns that is a serious
problem.)

However, please do open a new thread about it.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-05 Thread Alvaro Herrera
Peter Geoghegan wrote:

> As you know, on version 9.4+, as of commit 37484ad2a, we decided that
> we are "largely ignoring the value to which it [xmin] is set". The
> expectation became that raw xmin is available after freezing, but
> mostly for forensic purposes. I think Alvaro should now memorialize
> the idea that its value is actually critical in some place
> (htup_details.h?).

I'd love to be certain that we preserve the original value in all
cases.

> On Wed, Oct 4, 2017 at 6:46 AM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:

> > I independently arrived at the same conclusion.  Since I was trying with
> > 9.3, the patch differs -- in the old version we must explicitely test
> > for the FrozenTransactionId value, instead of using GetRawXmin.
> 
> Obviously you're going to have to be prepared for a raw xmin of
> FrozenTransactionId, even on 9.4+, due to pg_upgrade.

Hmm.  It would be good to be able to get rid of that case, actually,
because I now think it's less safe (see below).

At any rate, I was thinking in a new routine to encapsulate the logic,

/*
 * Check the tuple XMIN against prior XMAX, if any
 */
if (!HeapTupleUpdateXmaxMatchesXmin(priorXmax, 
HeapTupleHeaderGetXmin(htup)))
break;

where ..XmaxMatchesXmin would know about checking for possibly frozen
tuples.

> I can see why it
> would be safe (or at least no more dangerous) to rely on
> HeapTupleHeaderGetRawXmin() in the way mentioned here, at least on
> installations that initdb'd on a version after commit 37484ad2a
> (version 9.4+). However, I'm not sure why what you propose here would
> be safe when even raw xmin happens to be FrozenTransactionId. Are you
> sure that that's truly race-free? If it's really true that we only
> need to check for FrozenTransactionId on 9.3, why not just do that on
> all versions, and never bother with HeapTupleHeaderGetRawXmin()?
> ("Sheer paranoia" is a valid answer; I just want us to be clear on the
> reasoning.)

I think the RawXmin is the better mechanism.  I'm not absolutely certain
that the window is completely closed in 9.3; as I understand things, it
is possible for transaction A to prune an aborted heap-only tuple, then
transaction B to insert a frozen tuple in the same location, and then
transaction C to follow a link to the HOT tuple that was pruned.  I think
we'd end up considering that the new frozen tuple is part of the chain,
which is wrong ...  In 9.4 we can compare the real xmin.

I hope I am proved wrong about this, because if not, I think we're
looking at an essentially unsolvable problem in 9.3.

> Obviously any race would have a ridiculously tiny window, but it's not
> obvious why this protocol would be completely race-free (in the event
> of a FrozenTransactionId raw xmin).

As far as I understand, this problem only emerges if one part of a HOT
chain reaches the min freeze age while another part of the same chain is
still visible to some running transaction.  It is particularly
noticeable in our current test case because we use a min freeze age of
0, with many concurrent sessions modifying the same page.  What this
says to me is that VACUUM FREEZE is mildly dangerous when there's lots
of concurrent HOT UPDATE activity.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-05 Thread Alvaro Herrera
Wood, Dan wrote:
> Whatever you do make sure to also test 250 clients running lock.sql.  Even 
> with the community's fix plus YiWen’s fix I can still get duplicate rows.
> What works for “in-block” hot chains may not work when spanning blocks.

Good idea.  You can achieve a similar effect by adding a filler column,
and reducing fillfactor.

> Once nearly all 250 clients have done their updates and everybody is
> waiting to vacuum which one by one will take a while I usually just
> “pkill -9 psql”.  After that I have many duplicate “id=3” rows.

Odd ...

> On top of that I think we might have a lock leak.  After the pkill I
> tried to rerun setup.sql to drop/create the table and it hangs.  I see
> an autovacuum process starting and exiting every couple of seconds.
> Only by killing and restarting PG can I drop the table.

Please do try to figure this one out.  It'd be a separate problem,
worthy of its own thread.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-04 Thread Alvaro Herrera
Wong, Yi Wen wrote:
> My interpretation of README.HOT is the check is just to ensure the chain is 
> continuous; in which case the condition should be:
> 
> > if (TransactionIdIsValid(priorXmax) &&
> > !TransactionIdEquals(priorXmax, 
> > HeapTupleHeaderGetRawXmin(htup)))
> > break;
> 
> So the difference is GetRawXmin vs GetXmin, because otherwise we get the 
> FreezeId instead of the Xmin when the transaction happened

I independently arrived at the same conclusion.  Since I was trying with
9.3, the patch differs -- in the old version we must explicitly test
for the FrozenTransactionId value, instead of using GetRawXmin.
Attached is the patch I'm using, and my own oneliner test (pretty much
the same I posted earlier) seems to survive dozens of iterations without
showing any problem in REINDEX.

This patch is incomplete, since I think there are other places that need
to be patched in the same way (EvalPlanQualFetch? heap_get_latest_tid?).
Of course, for 9.4 and onwards we need to patch like you described.


This bit in EvalPlanQualFetch caught my attention ... why is it saying
xmin never changes?  It does change with freezing.

/*
 * If xmin isn't what we're expecting, the slot must 
have been
 * recycled and reused for an unrelated tuple.  This 
implies that
 * the latest version of the row was deleted, so we 
need do
 * nothing.  (Should be safe to examine xmin without 
getting
 * buffer's content lock, since xmin never changes in 
an existing
 * tuple.)
 */
if 
(!TransactionIdEquals(HeapTupleHeaderGetXmin(tuple.t_data),
 
priorXmax))


-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e00dc6c1ca..e68746fc3b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5500,7 +5500,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		if (TransactionIdPrecedes(xid, cutoff_xid))
 		{
 			if (TransactionIdDidCommit(xid))
+			{
+				xid = FrozenTransactionId;
 				*flags = FRM_MARK_COMMITTED | FRM_RETURN_IS_XID;
+			}
 			else
 			{
 				*flags |= FRM_INVALIDATE_XMAX;
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 9a8db74cb9..561acd144a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -435,7 +435,8 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 	 * Check the tuple XMIN against prior XMAX, if any
 	 */
 	if (TransactionIdIsValid(priorXmax) &&
-		!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
+		!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax) &&
+		HeapTupleHeaderGetXmin(htup) != FrozenTransactionId)
 		break;
 
 	/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-04 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Peter Geoghegan wrote:
> 
> > I thought that we no longer store FrozenTransactionId (xid 2) as our
> > "raw" xmin while freezing, and yet that's what we see here.
> 
> I'm doing this in 9.3.  I can't tell if the new tuple freezing stuff
> broke things more thoroughly, but it is already broken in earlier
> releases.

In fact, I think in 9.3 we should include this patch, to set the Xmin to
FrozenXid.  9.4 and onwards have commit 37484ad2a "Change the way we
mark tuples as frozen" which uses a combination of infomask bits, but in
9.3 I think leaving the unfrozen value in the xmax field is a bad idea
even if we set the HEAP_XMAX_COMMITTED bit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e00dc6c1ca..e68746fc3b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5500,7 +5500,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		if (TransactionIdPrecedes(xid, cutoff_xid))
 		{
 			if (TransactionIdDidCommit(xid))
+			{
+				xid = FrozenTransactionId;
 				*flags = FRM_MARK_COMMITTED | FRM_RETURN_IS_XID;
+			}
 			else
 			{
 				*flags |= FRM_INVALIDATE_XMAX;



Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-04 Thread Alvaro Herrera
Wood, Dan wrote:
> One minor side note…   Is it weird for xmin/xmax to go backwards in a hot row 
> chain?
> 
>  lp | t_ctid | lp_off | t_infomask | t_infomask2 | t_xmin | t_xmax 
> ----+--------+--------+------------+-------------+--------+--------
>   1 | (0,1)  |   8152 |       2818 |           3 |  36957 |      0
>   2 |        |      5 |            |             |        |       
>   3 |        |      0 |            |             |        |       
>   4 |        |      0 |            |             |        |       
>   5 | (0,6)  |   8112 |       9986 |       49155 |  36962 |  36963
>   6 | (0,7)  |   8072 |       9986 |       49155 |  36963 |  36961
>   7 | (0,7)  |   8032 |      11010 |       32771 |  36961 |      0
> (7 rows)

No, it just means transaction A got its XID before transaction B, but B
executed the update first and A updated the tuple second.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-04 Thread Alvaro Herrera
Wood, Dan wrote:

> There is a tangled web of issues here.  With the community fix we get a
> corrupted page (invalid redirect ptr from indexed item).  The cause of that is:
> pruneheap.c:
> 
> 	/*
> 	 * Check the tuple XMIN against prior XMAX, if any
> 	 */
> 	if (TransactionIdIsValid(priorXmax) &&
> 		!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
> 		break;
> 
> 	chainitems[nchain++] = offnum;
> The priorXmax is a multixact key share lock,

Uhh, what?  That certainly shouldn't happen -- the priorXmax comes from

priorXmax = HeapTupleHeaderGetUpdateXid(htup);

so only the XID of the update itself should be reported, not a multixact
and certainly not just a tuple lock XID.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-04 Thread Alvaro Herrera
Peter Geoghegan wrote:

> I thought that we no longer store FrozenTransactionId (xid 2) as our
> "raw" xmin while freezing, and yet that's what we see here.

I'm doing this in 9.3.  I can't tell if the new tuple freezing stuff
broke things more thoroughly, but it is already broken in earlier
releases.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] datetime.h defines like PM conflict with external libraries

2017-10-04 Thread Alvaro Herrera
Andrew Dunstan wrote:

> On 10/03/2017 04:43 PM, Tom Lane wrote:

> > I like the new-header-file idea because it will result in minimal code
> > churn and thus minimal back-patching hazards.
> >
> > I do *not* like "PG_PM".  For our own purposes that adds no uniqueness
> > at all.  If we're to touch these symbols then I'd go for names like
> > "DATETIME_PM".  Or maybe "DT_PM" ... there's a little bit of precedent
> > for the DT_ prefix already.
> 
> Yeah. If we use a prefix +1 for DT_. If we do that then I think they
> should *all* be prefixed, not just the ones we know of conflicts for.

Maybe it'd be good idea to unify some of that stuff so that ecpg can use
it, too?  Having a second copy of the same stuff in
src/interfaces/ecpg/pgtypeslib/dt.h is pretty terrible.  Even if not,
let's make sure they don't diverge.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-03 Thread Alvaro Herrera
So here's my attempt at an explanation for what is going on.  At one
point, we have this:

select lp, lp_flags, t_xmin, t_xmax, t_ctid, to_hex(t_infomask) as infomask,
to_hex(t_infomask2) as infomask2
from heap_page_items(get_raw_page('t', 0));
 lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2 
----+----------+--------+--------+--------+----------+-----------
  1 |        1 |      2 |      0 | (0,1)  | 902      |         3
  2 |        0 |        |        |        |          |          
  3 |        1 |      2 |  19928 | (0,4)  | 3142     |      c003
  4 |        1 |  14662 |  19929 | (0,5)  | 3142     |      c003
  5 |        1 |  14663 |  19931 | (0,6)  | 3142     |      c003
  6 |        1 |  14664 |  19933 | (0,7)  | 3142     |      c003
  7 |        1 |  14665 |      0 | (0,7)  | 2902     |      8003
(7 rows)

which shows a HOT-update chain, where the t_xmax are multixacts.  Then a
vacuum freeze comes, and because the multixacts are below the freeze
horizon for multixacts, we get this:

select lp, lp_flags, t_xmin, t_xmax, t_ctid, to_hex(t_infomask) as infomask,
to_hex(t_infomask2) as infomask2
from heap_page_items(get_raw_page('t', 0));
 lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2 
----+----------+--------+--------+--------+----------+-----------
  1 |        1 |      2 |      0 | (0,1)  | 902      |         3
  2 |        0 |        |        |        |          |          
  3 |        1 |      2 |  14662 | (0,4)  | 2502     |      c003
  4 |        1 |      2 |  14663 | (0,5)  | 2502     |      c003
  5 |        1 |      2 |  14664 | (0,6)  | 2502     |      c003
  6 |        1 |      2 |  14665 | (0,7)  | 2502     |      c003
  7 |        1 |      2 |      0 | (0,7)  | 2902     |      8003
(7 rows)

where the xmin values have all been frozen, and the xmax values are now
regular Xids.  I think the HOT code that walks the chain fails to detect
these as chains, because the xmin values no longer match the xmax
values.  I modified the multixact freeze code, so that whenever the
update Xid is below the cutoff Xid, it's set to FrozenTransactionId,
since keeping the other value is invalid anyway (even though we have set
the HEAP_XMAX_COMMITTED flag).  But that still doesn't fix the problem;
as far as I can see, vacuum removes the root of the chain, not yet sure
why, and then things are just as corrupted as before.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] list of credits for release notes

2017-10-03 Thread Alvaro Herrera
Michael Paquier wrote:

> Looking at this list, the first name is followed by the family name,
> so there are inconsistencies with some Japanese names:
> - Fujii Masao should be Masao Fujii.
> - KaiGai Kohei should be Kohei Kaigai.

We've already been here:
https://www.postgresql.org/message-id/20150613231826.gy133...@postgresql.org

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:

> > Tom Lane wrote:
> >> I don't especially like the Asserts inside spinlocks, either.
> 
> > I didn't change these.  It doesn't look to me that these asserts are
> > worth very much as production code.
> 
> OK.  If we ever see these hit in the buildfarm I might argue for
> reconsidering, but without some evidence of that sort it's not
> worth much concern.

Sure.  I would be very surprised if buildfarm ever exercises this code.

> > I think the latch is only used locally.  Seems that it was only put in
> > shmem to avoid a separate variable ...
> 
> Hm, I'm strongly tempted to move it to a separate static variable then.
> That's not a bug fix, so maybe it only belongs in HEAD, but is there
> value in keeping the branches in sync in this code?  It sounded from
> your commit message like they were pretty different already :-(

Well, there were conflicts in almost every branch, but they were pretty
minor.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Alvaro Herrera
Fixed the pstrdup problem by replacing with strlcpy() to stack-allocated
variables (rather than palloc + memcpy as proposed in Michael's patch).

About the other problems:

Tom Lane wrote:

> I took a quick look through walreceiver.c, and there are a couple of
> obvious problems of the same ilk in WalReceiverMain, which is doing this:
> 
> 	walrcv->lastMsgSendTime = walrcv->lastMsgReceiptTime =
> 		walrcv->latestWalEndTime = GetCurrentTimestamp();
> 
> (ie, a potential kernel call) inside a spinlock.  But there seems no
> real problem with just collecting the timestamp before we enter that
> critical section.

Done that way.

> I also don't especially like the fact that just above there it reaches
> elog(PANIC) with the lock still held, though at least that's a case that
> should never happen.

Fixed by releasing spinlock just before elog.

> Further down, it's doing a pfree() inside the spinlock, apparently
> for no other reason than to save one "if (tmp_conninfo)".

Fixed.

> I don't especially like the Asserts inside spinlocks, either.  Personally,
> I think if those conditions are worth testing then they're worth testing
> for real (in production).  Variables that are manipulated by multiple
> processes are way more likely to assume unexpected states than local
> variables.

I didn't change these.  It doesn't look to me that these asserts are
worth very much as production code.

> I'm also rather befuddled by the fact that this code sets and clears
> walrcv->latch outside the critical sections.  If that field is used
> by any other process, surely that's completely unsafe.  If it isn't,
> why is it being kept in shared memory?

I think the latch is only used locally.  Seems that it was only put in
shmem to avoid a separate variable ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM

2017-10-03 Thread Alvaro Herrera
Andreas Seltenreich wrote:
> Michael Paquier writes:
> 
> > I am attaching a patch that addresses the bugs for the spin lock sections.
> > [2. text/x-diff; walreceiver-spin-calls.patch]
> 
> I haven't seen a spinlock PANIC since testing with the patch applied.
> They occured five times with the same amount of testing done earlier.

Thanks for testing.  I'm about to push something.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



