Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-16 Thread John Naylor
On Wed, Apr 17, 2019 at 2:04 AM Andres Freund  wrote:
>
> Hi,
>
> I'm somewhat unhappy in how much the no-fsm-for-small-rels exposed
> complexity that looks like it should be purely in freespacemap.c to
> callers.
>
>
>  extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
> -extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
> +extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
> +bool check_fsm_only);
>
> So now freespace.c has an argument that says we should only check the
> fsm. That's confusing. And it's not explained to callers what that
> argument means, and when it should be set.

When first looking for free space, it's "false": Within
GetPageWithFreeSpace(), we call RelationGetNumberOfBlocks() if the FSM
returns invalid.

If we have to extend, after acquiring the lock to extend the relation,
we call GetPageWithFreeSpace() again to see if another backend already
extended while waiting on the lock. If there's no FSM, the thinking
is, it's not worth it to get the number of blocks again.
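
To spell that out, the intended calling pattern is roughly the following (a
toy standalone model, not the real hio.c/freespace.c code -- the stub
functions and constants are only stand-ins):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef unsigned int BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    /* Stub stand-ins for the real FSM search, local map, and smgr calls. */
    static BlockNumber fsm_search(size_t spaceNeeded) { (void) spaceNeeded; return InvalidBlockNumber; }
    static BlockNumber RelationGetNumberOfBlocks(void) { return 3; }
    static BlockNumber try_local_map(BlockNumber nblocks) { return nblocks - 1; }

    /* Toy model of GetPageWithFreeSpace() with the check_fsm_only flag. */
    static BlockNumber
    GetPageWithFreeSpace(size_t spaceNeeded, bool check_fsm_only)
    {
        BlockNumber target = fsm_search(spaceNeeded);

        if (target == InvalidBlockNumber && !check_fsm_only)
        {
            /*
             * No FSM yet (small relation): build the local map from the
             * current relation size and pick a block from it.
             */
            target = try_local_map(RelationGetNumberOfBlocks());
        }
        return target;
    }

    int
    main(void)
    {
        /* First attempt: consult the FSM, falling back to the local map. */
        BlockNumber blk = GetPageWithFreeSpace(100, false);

        if (blk == InvalidBlockNumber)
        {
            /* ...acquire the relation extension lock here... */

            /*
             * Recheck in case another backend extended the relation while we
             * waited, but skip recomputing the number of blocks.
             */
            blk = GetPageWithFreeSpace(100, true);
        }
        printf("target block: %u\n", blk);
        return 0;
    }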

> @@ -176,20 +269,44 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber 
> oldPage,
>   * Note that if the new spaceAvail value is higher than the old value stored
>   * in the FSM, the space might not become visible to searchers until the next
>   * FreeSpaceMapVacuum call, which updates the upper level pages.
> + *
> + * Callers have no need for a local map.
>   */
>  void
> -RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
> +RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk,
> +   Size spaceAvail, BlockNumber nblocks)
>
> There's no explanation as to what that "nblocks" argument is. One
> basically has to search other callers to figure it out. It's not even
> clear to which fork it relates to. Nor that one can set it to
> InvalidBlockNumber if one doesn't have the relation size conveniently
> reachable.  But it's not exposed to RecordAndGetPageWithFreeSpace(), for
> a basically unexplained reason.  There's a comment above
> fsm_allow_writes() - but that's a file-local function that external
> callers basically have no need to know about.

Okay.

> I can't figure out what "Callers have no need for a local map." is
> supposed to mean.

It was meant to contrast with [RecordAnd]GetPageWithFreeSpace(), but I
see how it's confusing.

> +/*
> + * Clear the local map.  We must call this when we have found a block with
> + * enough free space, when we extend the relation, or on transaction abort.
> + */
> +void
> +FSMClearLocalMap(void)
> +{
> +   if (FSM_LOCAL_MAP_EXISTS)
> +   {
> +   fsm_local_map.nblocks = 0;
> +   memset(&fsm_local_map.map, FSM_LOCAL_NOT_AVAIL,
> +  sizeof(fsm_local_map.map));
> +   }
> +}
> +
>
> So now there's a new function one needs to call after successfully using
> the block returned by [RecordAnd]GetPageWithFreeSpace().  But it's not
> referenced from those functions, so basically one has to just know that.

Right.

> +/* Only create the FSM if the heap has greater than this many blocks */
> +#define HEAP_FSM_CREATION_THRESHOLD 4
>
> Hm, this seems to be tying freespace.c closer to heap than I think is
> great - think of new AMs like zheap, that also want to use it.

Amit and I kept zheap in mind when working on the patch. You'd have to
work around the metapage, but everything else should work the same.

> I think this is mostly fallout about the prime issue I'm unhappy
> about. There's now some global variable in freespacemap.c that code
> using freespace.c has to know about and maintain.
>
>
> +static void
> +fsm_local_set(Relation rel, BlockNumber cur_nblocks)
> +{
> +   BlockNumber blkno,
> +   cached_target_block;
> +
> +   /* The local map must not be set already. */
> +   Assert(!FSM_LOCAL_MAP_EXISTS);
> +
> +   /*
> +* Starting at the current last block in the relation and working
> +* backwards, mark alternating blocks as available.
> +*/
> +   blkno = cur_nblocks - 1;
>
> That comment explains very little about why this is done, and why it's a
> good idea.

Short answer: performance -- it's too expensive to try every block.
The explanation is in storage/freespace/README -- maybe that should be
referenced here?
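
For the archives, the scheme described there boils down to something like
this simplified standalone sketch (the constants mirror the patch, the rest
is illustrative; the real fsm_local_set() additionally marks the relation's
cached target block as not-available, since it has presumably just been
tried):

    #include <stdio.h>
    #include <string.h>

    typedef unsigned int BlockNumber;

    #define HEAP_FSM_CREATION_THRESHOLD 4
    #define FSM_LOCAL_NOT_AVAIL 0x00
    #define FSM_LOCAL_AVAIL     0x01

    static unsigned char local_map[HEAP_FSM_CREATION_THRESHOLD];

    /*
     * Starting at the last block of the relation and working backwards, mark
     * every other block as available to try, so that we don't end up probing
     * every single page of the relation.
     */
    static void
    local_map_set(BlockNumber cur_nblocks)
    {
        BlockNumber blkno = cur_nblocks - 1;

        memset(local_map, FSM_LOCAL_NOT_AVAIL, sizeof(local_map));
        for (;;)
        {
            local_map[blkno] = FSM_LOCAL_AVAIL;
            if (blkno < 2)
                break;
            blkno -= 2;
        }
    }

    int
    main(void)
    {
        local_map_set(HEAP_FSM_CREATION_THRESHOLD);
        for (int i = 0; i < HEAP_FSM_CREATION_THRESHOLD; i++)
            printf("block %d: %s\n", i,
                   local_map[i] == FSM_LOCAL_AVAIL ? "try" : "skip");
        return 0;
    }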

> +/* Status codes for the local map. */
> +
> +/* Either already tried, or beyond the end of the relation */
> +#define FSM_LOCAL_NOT_AVAIL 0x00
> +
> +/* Available to try */
> +#define FSM_LOCAL_AVAIL 0x01
>
> +/* Local map of block numbers for small heaps with no FSM. */
> +typedef struct
> +{
> +   BlockNumber nblocks;
> +   uint8   map[HEAP_FSM_CREATION_THRESHOLD];
> +}  FSMLocalMap;
> +
>
> Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
> really only need one bit per block, it seems like map should really
> be just a uint32 with one bit per page.

I fail to see the advantage of that.
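
(For concreteness, I read the suggestion as something like the sketch below
-- one bit per block in a single word instead of a byte array. The space
saving is only a few bytes per backend.)

    #include <stdint.h>
    #include <stdio.h>

    typedef unsigned int BlockNumber;

    /* One bit per block; 32 bits easily covers any realistic threshold. */
    static uint32_t local_map;

    static void map_clear(void)               { local_map = 0; }
    static void map_mark(BlockNumber blk)     { local_map |= (uint32_t) 1 << blk; }
    static void map_unmark(BlockNumber blk)   { local_map &= ~((uint32_t) 1 << blk); }
    static int  map_is_avail(BlockNumber blk) { return (local_map >> blk) & 1; }

    int
    main(void)
    {
        map_clear();
        map_mark(3);
        map_mark(1);
        map_unmark(3);              /* block 3 turned out to have no room */

        for (BlockNumber blk = 0; blk < 4; blk++)
            printf("block %u: %s\n", blk, map_is_avail(blk) ? "try" : "skip");
        return 0;
    }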

> +static bool
> +fsm_allow_writes(Relation rel, BlockNumber 

RE: Copy data to DSA area

2019-04-16 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Sent: Wednesday, December 5, 2018 2:42 PM
>Subject: RE: Copy data to DSA area

Hi
It's been a long while since we discussed this topic. 
Let me recap first and I'll give some thoughts.

It seems the things we have reached consensus on are:

- Want to palloc/pfree transparently in DSA
- Use Postgres-initialized shared memory as DSA 
- Don’t leak memory in shared memory 

Things under discussion:
- How to prevent memory leaks
- How to prevent dangling pointers after cleaning up about-to-leak objects

Regarding the memory leak, I think Robert's idea of allocating objects under
a temporary context while building them, and re-parenting it to a permanent
one at some point, is promising.
While objects are being built they live under a temporary DSA memory context,
which is a child of TopTransactionContext (if we are inside a transaction),
and they are all freed at once if an error happens.
To delete all the chunks allocated under the temporary DSA context, we need
to search for, or remember, the location of every chunk under the context.
Unlike AllocSet we don't have block information that would let us delete them
all together.

So I'm thinking of managing dsa_allocated chunks with a singly linked list to
keep track of them and delete them: the context holds the head of the list,
and every chunk has a pointer to the next allocated chunk.
But this adds space overhead to every dsa_allocated chunk, and we may want to
avoid that because shared memory is limited. In that case, we could free
these pointer areas at some point, once we are sure the allocation has
succeeded.
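
To illustrate the bookkeeping I have in mind, here is a minimal standalone
sketch of the linked-list idea (plain malloc/free stand in for
dsa_allocate/dsa_free, and all the names here are made up for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    /* Per-chunk header linking all chunks of a temporary context together. */
    typedef struct ChunkHeader
    {
        struct ChunkHeader *next;
    } ChunkHeader;

    /* A temporary "context": just the head of the chunk list. */
    typedef struct TempContext
    {
        ChunkHeader *head;
    } TempContext;

    /* Allocate a chunk under the context, remembering it in the list. */
    static void *
    context_alloc(TempContext *cxt, size_t size)
    {
        ChunkHeader *chunk = malloc(sizeof(ChunkHeader) + size);

        if (chunk == NULL)
            return NULL;
        chunk->next = cxt->head;
        cxt->head = chunk;
        return (void *) (chunk + 1);    /* usable memory follows the header */
    }

    /* Free everything allocated under the context, e.g. on transaction abort. */
    static void
    context_reset(TempContext *cxt)
    {
        while (cxt->head != NULL)
        {
            ChunkHeader *next = cxt->head->next;

            free(cxt->head);
            cxt->head = next;
        }
    }

    int
    main(void)
    {
        TempContext cxt = {NULL};
        char       *a = context_alloc(&cxt, 16);
        char       *b = context_alloc(&cxt, 32);

        printf("allocated %p and %p\n", (void *) a, (void *) b);
        context_reset(&cxt);        /* nothing leaks even if we "abort" here */
        return 0;
    }

In the real thing the header would live in DSA and context_reset() would be
driven from transaction abort; re-parenting to a permanent context would
simply hand over the list head.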

Another question is when we should consider an allocation successful (that
is, when we are sure the object won't leak).
If the allocation is done inside a transaction, we can consider it safe once
the transaction commits.
Alternatively, I assume this DSA memory context would be used for caches such
as the relcache, catcache, plancache and so on. In that case a cache entry
won't leak once it has been added to a hash table or list, because I assume
some eviction mechanism like LRU will be implemented and it will erase
useless cache entries some time later.

What do you think about these ideas?

Regarding dangling pointers, I think they are also a problem.
After cleaning up objects to prevent a memory leak, we have no mechanism to
reset the dangling pointers.
I gave this some thought a while ago, though begin_allocate/end_allocate
don't seem like good names. More self-explanatory names would be something
like start_pointing_to_dsa_object_under_construction() and
end_pointing_to_dsa_object_under_construction().
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F1F259F%40G01JPEXMBKW04

If we can make sure such dangling pointers never happen, we don't need this
interface at all.
As Thomas mentioned before, where these interfaces should live needs review,
but I couldn't come up with another solution right now.

Do you have some thoughts?

best regards,
Ideriha, Takeshi





Re: Runtime pruning problem

2019-04-16 Thread Tom Lane
David Rowley  writes:
> On Wed, 17 Apr 2019 at 15:54, Tom Lane  wrote:
>> What I'm more worried about is whether this breaks any internal behavior
>> of explain.c, as the comment David quoted upthread seems to think.
>> If we need to have a tlist to reference, can we make that code look
>> to the pre-pruning plan tree, rather than the planstate tree?

> I think most of the complexity is in what to do in
> set_deparse_planstate() given that there might be no outer plan to
> choose from for Append and MergeAppend. This controls what's done in
> resolve_special_varno() as this descends the plan tree down the outer
> side until it gets to the node that the outer var came from.

> We wouldn't need to do this if we just didn't show the targetlist in
> EXPLAIN VERBOSE, but there's also MergeAppend sort keys to worry about
> too.  Should we just skip on those as well?

No, the larger issue is that *any* plan node above the Append might
be recursing down to/through the Append to find out what to print for
a Var reference.  We have to be able to support that.

regards, tom lane




Re: Runtime pruning problem

2019-04-16 Thread Amit Langote
On 2019/04/17 12:58, David Rowley wrote:
> On Wed, 17 Apr 2019 at 15:54, Tom Lane  wrote:
>>
>> Amit Langote  writes:
>>> On 2019/04/17 11:29, David Rowley wrote:
 Where do you think the output list for EXPLAIN VERBOSE should put the
 output column list in this case? On the Append node, or just not show
 them?
>>
>>> Maybe, not show them?
>>
>> Yeah, I think that seems like a reasonable idea.  If we show the tlist
>> for Append in this case, when we never do otherwise, that will be
>> confusing, and it could easily break plan-reading apps like depesz.com.
>>
>> What I'm more worried about is whether this breaks any internal behavior
>> of explain.c, as the comment David quoted upthread seems to think.
>> If we need to have a tlist to reference, can we make that code look
>> to the pre-pruning plan tree, rather than the planstate tree?
> 
> I think most of the complexity is in what to do in
> set_deparse_planstate() given that there might be no outer plan to
> choose from for Append and MergeAppend. This controls what's done in
> resolve_special_varno() as this descends the plan tree down the outer
> side until it gets to the node that the outer var came from.
> 
> We wouldn't need to do this if we just didn't show the targetlist in
> EXPLAIN VERBOSE, but there's also MergeAppend sort keys to worry about
> too.  Should we just skip on those as well?

I guess so, if only to be consistent with Append.

Thanks,
Amit





Re: log_planner_stats and prepared statements

2019-04-16 Thread Tom Lane
Bruce Momjian  writes:
> I have found that log_planner_stats only outputs stats until the generic
> plan is chosen.  For example, if you run the following commands:

Uh, well, the planner doesn't get run after that point ...

regards, tom lane




Re: Runtime pruning problem

2019-04-16 Thread David Rowley
On Wed, 17 Apr 2019 at 15:54, Tom Lane  wrote:
>
> Amit Langote  writes:
> > On 2019/04/17 11:29, David Rowley wrote:
> >> Where do you think the output list for EXPLAIN VERBOSE should put the
> >> output column list in this case? On the Append node, or just not show
> >> them?
>
> > Maybe, not show them?
>
> Yeah, I think that seems like a reasonable idea.  If we show the tlist
> for Append in this case, when we never do otherwise, that will be
> confusing, and it could easily break plan-reading apps like depesz.com.
>
> What I'm more worried about is whether this breaks any internal behavior
> of explain.c, as the comment David quoted upthread seems to think.
> If we need to have a tlist to reference, can we make that code look
> to the pre-pruning plan tree, rather than the planstate tree?

I think most of the complexity is in what to do in
set_deparse_planstate() given that there might be no outer plan to
choose from for Append and MergeAppend. This controls what's done in
resolve_special_varno() as this descends the plan tree down the outer
side until it gets to the node that the outer var came from.

We wouldn't need to do this if we just didn't show the targetlist in
EXPLAIN VERBOSE, but there's also MergeAppend sort keys to worry about
too.  Should we just skip on those as well?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Runtime pruning problem

2019-04-16 Thread Tom Lane
Amit Langote  writes:
> On 2019/04/17 11:29, David Rowley wrote:
>> Where do you think the output list for EXPLAIN VERBOSE should put the
>> output column list in this case? On the Append node, or just not show
>> them?

> Maybe, not show them?

Yeah, I think that seems like a reasonable idea.  If we show the tlist
for Append in this case, when we never do otherwise, that will be
confusing, and it could easily break plan-reading apps like depesz.com.

What I'm more worried about is whether this breaks any internal behavior
of explain.c, as the comment David quoted upthread seems to think.
If we need to have a tlist to reference, can we make that code look
to the pre-pruning plan tree, rather than the planstate tree?

regards, tom lane




log_planner_stats and prepared statements

2019-04-16 Thread Bruce Momjian
I have found that log_planner_stats only outputs stats until the generic
plan is chosen.  For example, if you run the following commands:

SET client_min_messages = 'log';
SET log_planner_stats = TRUE;

PREPARE e AS SELECT relkind FROM pg_class WHERE relname = $1 ORDER BY 1;

EXPLAIN ANALYZE VERBOSE EXECUTE e ('pg_class');
EXPLAIN ANALYZE VERBOSE EXECUTE e ('pg_class');
EXPLAIN ANALYZE VERBOSE EXECUTE e ('pg_class');
EXPLAIN ANALYZE VERBOSE EXECUTE e ('pg_class');
EXPLAIN ANALYZE VERBOSE EXECUTE e ('pg_class');
EXPLAIN ANALYZE VERBOSE EXECUTE e ('pg_class');
--> EXPLAIN ANALYZE VERBOSE EXECUTE e ('pg_class');

The last explain will _not_ show any log_planner_stats duration, though
it does show an EXPLAIN planning time:

 Planning Time: 0.012 ms

Is this expected behavior?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Runtime pruning problem

2019-04-16 Thread Amit Langote
On 2019/04/17 11:29, David Rowley wrote:
> On Wed, 17 Apr 2019 at 13:13, Amit Langote
>  wrote:
>> When you see this:
>>
>> explain select * from t1 where dt = current_date + 400;
>>  QUERY PLAN
>> 
>>  Append  (cost=0.00..198.42 rows=44 width=8)
>>Subplans Removed: 3
>>->  Seq Scan on t1_1  (cost=0.00..49.55 rows=11 width=8)
>>  Filter: (dt = (CURRENT_DATE + 400))
>> (4 rows)
>>
>> Doesn't this give an impression that t1_1 *matches* the WHERE condition
>> where it clearly doesn't?  IMO, contorting explain.c to show an empty
>> Append like what Hosoya-san suggests doesn't sound too bad given that the
>> first reaction to seeing the above result is to think it's a bug of
>> partition pruning.
> 
> Where do you think the output list for EXPLAIN VERBOSE should put the
> output column list in this case? On the Append node, or just not show
> them?

Maybe, not show them?  That may be a bit inconsistent, because the point
of VERBOSE is to show the targetlist among other things, but maybe users
wouldn't mind not seeing it on such empty Append nodes.  OTOH, they are
more likely to regard seeing a subplan that's clearly prunable as a bug in
the pruning logic.

Thanks,
Amit





Re: Runtime pruning problem

2019-04-16 Thread David Rowley
On Wed, 17 Apr 2019 at 13:13, Amit Langote
 wrote:
> When you see this:
>
> explain select * from t1 where dt = current_date + 400;
>  QUERY PLAN
> 
>  Append  (cost=0.00..198.42 rows=44 width=8)
>Subplans Removed: 3
>->  Seq Scan on t1_1  (cost=0.00..49.55 rows=11 width=8)
>  Filter: (dt = (CURRENT_DATE + 400))
> (4 rows)
>
> Doesn't this give an impression that t1_1 *matches* the WHERE condition
> where it clearly doesn't?  IMO, contorting explain.c to show an empty
> Append like what Hosoya-san suggests doesn't sound too bad given that the
> first reaction to seeing the above result is to think it's a bug of
> partition pruning.

Where do you think the output list for EXPLAIN VERBOSE should put the
output column list in this case? On the Append node, or just not show
them?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Runtime pruning problem

2019-04-16 Thread Amit Langote
Hi,

On 2019/04/16 21:09, David Rowley wrote:
> On Tue, 16 Apr 2019 at 23:55, Yuzuko Hosoya  
> wrote:
>> postgres=# explain analyze select * from t1 where dt = current_date + 400;
>>   QUERY PLAN
>> ---
>>  Append  (cost=0.00..198.42 rows=44 width=8) (actual time=0.000..0.001 
>> rows=0 loops=1)
>>Subplans Removed: 3
>>->  Seq Scan on t1_1  (cost=0.00..49.55 rows=11 width=8) (never executed)
>>  Filter: (dt = (CURRENT_DATE + 400))
>>  Planning Time: 0.400 ms
>>  Execution Time: 0.070 ms
>> (6 rows)
>> 
>>
>> I realized t1_1 was not scanned actually since "never executed"
>> was displayed in the plan using EXPLAIN ANALYZE.  But I think
>> "One-Time Filter: false" and "Subplans Removed: ALL" or something
>> like that should be displayed instead.
>>
>> What do you think?
> 
> This is intended behaviour explained by the following comment in nodeAppend.c
> 
> /*
> * The case where no subplans survive pruning must be handled
> * specially.  The problem here is that code in explain.c requires
> * an Append to have at least one subplan in order for it to
> * properly determine the Vars in that subplan's targetlist.  We
> * sidestep this issue by just initializing the first subplan and
> * setting as_whichplan to NO_MATCHING_SUBPLANS to indicate that
> * we don't really need to scan any subnodes.
> */
> 
> It's true that there is a small overhead in this case of having to
> initialise a useless subplan, but the code never tries to pull any
> tuples from it, so it should be fairly minimal.  I expected that using
> a value that matches no partitions would be unusual enough not to go
> contorting explain.c into working for this case.

When I saw this, I didn't think as much of the overhead of initializing a
subplan as I was surprised to see that result at all.

When you see this:

explain select * from t1 where dt = current_date + 400;
 QUERY PLAN

 Append  (cost=0.00..198.42 rows=44 width=8)
   Subplans Removed: 3
   ->  Seq Scan on t1_1  (cost=0.00..49.55 rows=11 width=8)
 Filter: (dt = (CURRENT_DATE + 400))
(4 rows)

Doesn't this give an impression that t1_1 *matches* the WHERE condition
where it clearly doesn't?  IMO, contorting explain.c to show an empty
Append like what Hosoya-san suggests doesn't sound too bad given that the
first reaction to seeing the above result is to think it's a bug of
partition pruning.

Thanks,
Amit





Re: PANIC: could not flush dirty data: Operation not permitted power8, Redhat Centos

2019-04-16 Thread Thomas Munro
On Mon, Apr 15, 2019 at 7:57 PM  wrote:
> I forgot to mention that this is happening in a docker container.

Huh, so there may be some configuration of Linux container that can
fail here with EPERM, even though that error does not appear in
the man page, and doesn't make much intuitive sense.  Would be good to
figure out how that happens.

If we could somehow confirm* that sync_file_range() with the
non-waiting flags we are using is non-destructive of error state, as
Andres speculated (that is, it cannot eat the only error report we're
ever going to get to tell us that buffered dirty data may have been
dropped), then I suppose we could just remove the data_sync_elevel()
promotion here.  As with the WSL case (before the PANIC commit and the
subsequent don't-repeat-the-warning-forever patch), a user of this
posited EPERM-generating container configuration would then get
repeated warnings in the log forever (as they presumably did before).
Repeated WARNING messages are probably OK here, I think... I mean, if,
say, someone complains that FlubOS's Linux emulation fails here with
EIEIO, I'd say they should put up with the warnings and complain over
on the flub-hackers list, or whatever, and I'd say the same for
containers that generate EPERM: either the man page or the container
technology needs work.

But... I still think we should try to avoid making decisions based on
knowledge of kernel implementation details, if it can be avoided.  I'd
probably rather treat EPERM explicitly differently (and eventually
EIEIO too, if a report comes in) than drop the current paranoid coding
completely.

*I'm not looking at it myself.  A sync_file_range() implementation is
on my list of potential FreeBSD projects for a rainy day, so I don't
want to study anything but the man page, even if it's wrong.

-- 
Thomas Munro
https://enterprisedb.com




Re: Calling pgstat_report_wait_end() before ereport(ERROR)

2019-04-16 Thread Michael Paquier
On Tue, Apr 16, 2019 at 08:03:22PM +0900, Masahiko Sawada wrote:
> Agreed. There is also some code which raises an ERROR after closing a
> transient file, but I think it's a good idea not to include those places,
> for safety. It looks to me like the patch you proposed cleans up as many
> places as we can.

Thanks for the lookup, committed.
--
Michael




Re: REINDEX CONCURRENTLY 2.0

2019-04-16 Thread Michael Paquier
On Tue, Apr 16, 2019 at 08:50:31AM +0200, Peter Eisentraut wrote:
> Looks good to me.

Thanks, committed.  If there are additional discussions on various
points of the feature, let's move to a new thread please.  This one
has been already extensively used ;)
--
Michael




Re: Race conditions with checkpointer and shutdown

2019-04-16 Thread Andres Freund
Hi,

On 2019-04-16 17:05:36 -0700, Andres Freund wrote:
> On 2019-04-16 18:59:37 -0400, Robert Haas wrote:
> > On Tue, Apr 16, 2019 at 6:45 PM Tom Lane  wrote:
> > > Do we need to think harder about establishing rules for multiplexed
> > > use of the process latch?  I'm imagining some rule like "if you are
> > > not the outermost event loop of a process, you do not get to
> > > summarily clear MyLatch.  Make sure to leave it set after waiting,
> > > if there was any possibility that it was set by something other than
> > > the specific event you're concerned with".
> >
> > Hmm, yeah.  If the latch is left set, then the outer loop will just go
> > through an extra and unnecessary iteration, which seems fine.  If the
> > latch is left clear, then the outer loop might miss a wakeup intended
> > for it and hang forever.
> 
> Arguably that's a sign that the latch using code in the outer loop(s) isn't
> written correctly? If you do it as:
> 
> while (true)
> {
> CHECK_FOR_INTERRUPTS();
> 
> ResetLatch(MyLatch);
> 
> if (work_needed)
> {
> Plenty();
> Code();
> Using(MyLatch);
> }
> else
> {
> WaitLatch(MyLatch);
> }
> }
> 
> I think that's not a danger? I think the problem really is that we
> suggest doing that WaitLatch() unconditionally:
> 
>  * The correct pattern to wait for event(s) is:
>  *
>  * for (;;)
>  * {
>  *   ResetLatch();
>  *   if (work to do)
>  *   Do Stuff();
>  *   WaitLatch();
>  * }
>  *
>  * It's important to reset the latch *before* checking if there's work to
>  * do. Otherwise, if someone sets the latch between the check and the
>  * ResetLatch call, you will miss it and Wait will incorrectly block.
>  *
>  * Another valid coding pattern looks like:
>  *
>  * for (;;)
>  * {
>  *   if (work to do)
>  *   Do Stuff(); // in particular, exit loop if some condition 
> satisfied
>  *   WaitLatch();
>  *   ResetLatch();
>  * }
> 
> Obviously there's the issue that a lot of latch-using code isn't written
> that way - but I also don't think there's that much latch-using code
> that then also uses the latch. Seems like we could fix that.  While it
> obviously has the danger of not being followed, so does the
> 'always-set-latch-unless-outermost-loop' approach.
> 
> I'm not sure I like the idea of incurring another unnecessary SetLatch()
> call for most latch using places.
> 
> I guess there's a bit bigger danger of taking longer to notice
> postmaster-death. But I'm not sure I can quite see that being
> problematic - seems like all we should incur is another cycle through
> the loop, as the latch shouldn't be set anymore.

I think we should thus change our latch documentation to something like:

diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index fc995819d35..dc46dd94c5b 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -44,22 +44,31 @@
  * {
  *ResetLatch();
  *if (work to do)
- *Do Stuff();
- *WaitLatch();
+ *DoStuff();
+ *else
+ *WaitLatch();
  * }
  *
  * It's important to reset the latch *before* checking if there's work to
  * do. Otherwise, if someone sets the latch between the check and the
  * ResetLatch call, you will miss it and Wait will incorrectly block.
  *
+ * The reason to only wait on the latch in case there is nothing to do is that
+ * code inside DoStuff() might use the same latch, and leave it reset, even
+ * though a SetLatch() aimed for the outer loop arrived. Which again could
+ * lead to incorrectly blocking in Wait.
+ *
  * Another valid coding pattern looks like:
  *
  * for (;;)
  * {
  *if (work to do)
- *Do Stuff(); // in particular, exit loop if some condition satisfied
- *WaitLatch();
- *ResetLatch();
+ *DoStuff(); // in particular, exit loop if some condition satisfied
+ * else
+ * {
+ *WaitLatch();
+ *ResetLatch();
+ * }
  * }
  *
  * This is useful to reduce latch traffic if it's expected that the loop's

and adapt code to match (at least in the outer loops).

Greetings,

Andres Freund




Re: Race conditions with checkpointer and shutdown

2019-04-16 Thread Andres Freund
Hi,

On 2019-04-16 18:59:37 -0400, Robert Haas wrote:
> On Tue, Apr 16, 2019 at 6:45 PM Tom Lane  wrote:
> > Do we need to think harder about establishing rules for multiplexed
> > use of the process latch?  I'm imagining some rule like "if you are
> > not the outermost event loop of a process, you do not get to
> > summarily clear MyLatch.  Make sure to leave it set after waiting,
> > if there was any possibility that it was set by something other than
> > the specific event you're concerned with".
>
> Hmm, yeah.  If the latch is left set, then the outer loop will just go
> through an extra and unnecessary iteration, which seems fine.  If the
> latch is left clear, then the outer loop might miss a wakeup intended
> for it and hang forever.

Arguably that's a sign that the latch using code in the outer loop(s) isn't
written correctly? If you do it as:

while (true)
{
CHECK_FOR_INTERRUPTS();

ResetLatch(MyLatch);

if (work_needed)
{
Plenty();
Code();
Using(MyLatch);
}
else
{
WaitLatch(MyLatch);
}
}

I think that's not a danger? I think the problem really is that we
suggest doing that WaitLatch() unconditionally:

 * The correct pattern to wait for event(s) is:
 *
 * for (;;)
 * {
 * ResetLatch();
 * if (work to do)
 * Do Stuff();
 * WaitLatch();
 * }
 *
 * It's important to reset the latch *before* checking if there's work to
 * do. Otherwise, if someone sets the latch between the check and the
 * ResetLatch call, you will miss it and Wait will incorrectly block.
 *
 * Another valid coding pattern looks like:
 *
 * for (;;)
 * {
 * if (work to do)
 * Do Stuff(); // in particular, exit loop if some condition 
satisfied
 * WaitLatch();
 * ResetLatch();
 * }

Obviously there's the issue that a lot of latch-using code isn't written
that way - but I also don't think there's that much latch-using code
that then also uses the latch. Seems like we could fix that.  While it
obviously has the danger of not being followed, so does the
'always-set-latch-unless-outermost-loop' approach.

I'm not sure I like the idea of incurring another unnecessary SetLatch()
call for most latch using places.

I guess there's a bit bigger danger of taking longer to notice
postmaster-death. But I'm not sure I can quite see that being
problematic - seems like all we should incur is another cycle through
the loop, as the latch shouldn't be set anymore.

Greetings,

Andres Freund




Re: Race conditions with checkpointer and shutdown

2019-04-16 Thread Robert Haas
On Tue, Apr 16, 2019 at 6:45 PM Tom Lane  wrote:
> Do we need to think harder about establishing rules for multiplexed
> use of the process latch?  I'm imagining some rule like "if you are
> not the outermost event loop of a process, you do not get to
> summarily clear MyLatch.  Make sure to leave it set after waiting,
> if there was any possibility that it was set by something other than
> the specific event you're concerned with".

Hmm, yeah.  If the latch is left set, then the outer loop will just go
through an extra and unnecessary iteration, which seems fine.  If the
latch is left clear, then the outer loop might miss a wakeup intended
for it and hang forever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: hyrax vs. RelationBuildPartitionDesc

2019-04-16 Thread Robert Haas
On Sun, Apr 14, 2019 at 3:29 PM Tom Lane  wrote:
> What I get for test cases like [1] is
>
> single-partition SELECT, hash partitioning:
>
> N   tps, HEAD   tps, patch
> 2      11426.243754    11448.615193
> 8      11254.833267    11374.278861
> 32     11288.329114    11371.942425
> 128    11222.329256    11185.845258
> 512    11001.177137    10572.917288
> 1024   10612.456470     9834.172965
> 4096    8819.110195     7021.864625
> 8192    7372.611355     5276.130161
>
> single-partition SELECT, range partitioning:
>
> N   tps, HEAD   tps, patch
> 2      11037.855338    11153.595860
> 8      11085.218022    11019.132341
> 32     10994.348207    10935.719951
> 128    10884.417324    10532.685237
> 512    10635.583411     9578.108915
> 1024   10407.286414     8689.585136
> 4096    8361.463829     5139.084405
> 8192    7075.880701     3442.542768

I have difficulty interpreting these results in any way other than as
an endorsement of the approach that I took.  It seems like you're
proposing to throw away what is really a pretty substantial amount of
performance basically so that the code will look more like you think
it should look.  But I dispute the idea that the current code is so
bad that we need to do this.  I don't think that's the case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Race conditions with checkpointer and shutdown

2019-04-16 Thread Tom Lane
Michael Paquier  writes:
> The buildfarm has reported two similar failures when shutting down a
> node:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01

> In both cases, the instance cannot shut down because it times out,
> waiting for the shutdown checkpoint to finish but I suspect that this
> checkpoint actually never happens.

Hmm, I don't think that that is actually where the problem is.  In
piculet's failure, the test script times out waiting for a "fast"
shutdown of the standby server, and what we see in the standby's log is

2019-03-23 22:44:12.181 UTC [9731] LOG:  received fast shutdown request
2019-03-23 22:44:12.181 UTC [9731] LOG:  aborting any active transactions
2019-03-23 22:44:12.181 UTC [9960] FATAL:  terminating walreceiver process due 
to administrator command
2019-03-23 22:50:13.088 UTC [9731] LOG:  received immediate shutdown request

where the last line indicates that the test script lost patience and
issued an immediate shutdown.  However, in a successful run of the
test, the log looks like

2019-03-24 03:33:25.592 UTC [23816] LOG:  received fast shutdown request
2019-03-24 03:33:25.592 UTC [23816] LOG:  aborting any active transactions
2019-03-24 03:33:25.592 UTC [23895] FATAL:  terminating walreceiver process due 
to administrator command
2019-03-24 03:33:25.595 UTC [23819] LOG:  shutting down
2019-03-24 03:33:25.600 UTC [23816] LOG:  database system is shut down
2019-03-24 03:33:25.696 UTC [23903] LOG:  starting PostgreSQL 12devel on 
x86_64-pc-linux-gnu, compiled by gcc (Debian 8.2.0-12) 8.2.0, 64-bit

where the last line reflects restarting the server for the next test step.
So in the failure case we don't see the "shutting down" message, which
means we never got to ShutdownXLOG, so no checkpoint request was made.
Even if we had got to ShutdownXLOG, the process is just executing the
operation directly, it's not sending a signal asking some other process
to do the checkpoint; so it's hard to see how either of the commits
you mention could be involved.

I think what we need to look for is reasons why (1) the postmaster
never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
main loop doesn't get to noticing shutdown_requested.

A rather scary point for (2) is that said main loop seems to be
assuming that MyLatch a/k/a MyProc->procLatch is not used for any
other purposes in the checkpointer process.  If there were something,
like say a condition variable wait, that would reset MyLatch at any
time during a checkpoint, then we could very easily go to sleep at the
bottom of the loop and not notice that there's a pending shutdown request.

Now, c6c9474aa did not break this, because the latch resets that
it added happen in other processes not the checkpointer.  But I'm
feeling suspicious that some other change we made recently might've
borked it.  And in general, it seems like we've managed to load a
lot of potentially conflicting roles onto process latches.

Do we need to think harder about establishing rules for multiplexed
use of the process latch?  I'm imagining some rule like "if you are
not the outermost event loop of a process, you do not get to
summarily clear MyLatch.  Make sure to leave it set after waiting,
if there was any possibility that it was set by something other than
the specific event you're concerned with".

regards, tom lane




Re: block-level incremental backup

2019-04-16 Thread Robert Haas
On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost  wrote:
> > > I love the general idea of having additional facilities in core to
> > > support block-level incremental backups.  I've long been unhappy that
> > > any such approach ends up being limited to a subset of the files which
> > > need to be included in the backup, meaning the rest of the files have to
> > > be backed up in their entirety.  I don't think we have to solve for that
> > > as part of this, but I'd like to see a discussion for how to deal with
> > > the other files which are being backed up to avoid needing to just
> > > wholesale copy them.
> >
> > I assume you are talking about non-heap/index files.  Which of those are
> > large enough to benefit from incremental backup?
>
> Based on discussions I had with Andrey, specifically the visibility map
> is an issue for them with WAL-G.  I haven't spent a lot of time thinking
> about it, but I can understand how that could be an issue.

If I understand correctly, the VM contains 1 byte per 4 heap pages and
the FSM contains 1 byte per heap page (plus some overhead for higher
levels of the tree).  Since the FSM is not WAL-logged, I'm not sure
there's a whole lot we can do to avoid having to back it up, although
maybe there's some clever idea I'm not quite seeing.  The VM is
WAL-logged, albeit with some strange warts that I have the honor of
inventing, so there's more possibilities there.

Before worrying about it too much, it would be useful to hear more
about the concerns related to these forks, so that we make sure we're
solving the right problem.  It seems difficult for a single relation
to be big enough for these to be much of an issue.  For example, on a
1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bytes of VM fork
= 32MB.  Not nothing, but 32MB of useless overhead every time you back
up a 1TB database probably isn't going to break the bank.  It might be
more of a concern for users with many small tables.  For example, if
somebody has got a million tables with 1 page in each one, they'll
have a million data pages, a million VM pages, and 3 million FSM pages
(unless the new don't-create-the-FSM-for-small-tables stuff in v12
kicks in).  I don't know if it's worth going to a lot of trouble to
optimize that case.  Creating a million tables with 100 tuples (or
whatever) in each one sounds like terrible database design to me.
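
For reference, that arithmetic works out roughly like this, assuming 8kB
pages, 1 VM byte per 4 heap pages, and 1 FSM byte per heap page (leaf level
only; the little program below is just a back-of-the-envelope check, not
anything from the tree):

    #include <stdio.h>

    int
    main(void)
    {
        const double page_size = 8192.0;    /* 8kB heap pages */
        const double heap_bytes = 1024.0 * 1024 * 1024 * 1024; /* 1TB relation */
        const double heap_pages = heap_bytes / page_size;      /* 2^27 pages */

        /* VM: 1 byte per 4 heap pages; FSM leaf level: 1 byte per heap page. */
        const double vm_bytes = heap_pages / 4.0;
        const double fsm_bytes = heap_pages;

        printf("heap pages: %.0f\n", heap_pages);
        printf("VM size:    %.0f MB\n", vm_bytes / (1024.0 * 1024.0));
        printf("FSM size:   %.0f MB\n", fsm_bytes / (1024.0 * 1024.0));
        return 0;
    }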

> > > I'm quite concerned that trying to graft this on to pg_basebackup
> > > (which, as you note later, is missing an awful lot of what users expect
> > > from a real backup solution already- retention handling, parallel
> > > capabilities, WAL archive management, and many more... but also is just
> > > not nearly as developed a tool as the external solutions) is going to
> > make things unnecessarily difficult when what we really want here is
> > > better support from core for block-level incremental backup for the
> > > existing external tools to leverage.
> >
> > I think there is some interesting complexity brought up in this thread.
> > Which options are going to minimize storage I/O, network I/O, have only
> > background overhead, allow parallel operation, integrate with
> > pg_basebackup.  Eventually we will need to evaluate the incremental
> > backup options against these criteria.
>
> This presumes that we're going to have multiple competing incremental
> backup options presented, doesn't it?  Are you aware of another effort
> going on which aims for inclusion in core?  There's been past attempts
> made, but I don't believe there's anyone else currently planning to or
> working on something for inclusion in core.

Yeah, I really hope we don't end up with dueling patches.  I want to
come up with an approach that can be widely-endorsed and then have
everybody rowing in the same direction.  On the other hand, I do think
that we may support multiple options in certain places which may have
the kinds of trade-offs that Bruce mentions.  For instance,
identifying changed blocks by scanning the whole cluster and checking
the LSN of each block has an advantage in that it requires no prior
setup or extra configuration.  Like a sequential scan, it always
works, and that is an advantage.  Of course, for many people, the
competing advantage of a WAL-scanning approach that can save a lot of
I/O will appear compelling, but maybe not for everyone.  I think
there's room for two or three approaches there -- not in the sense of
competing patches, but in the sense of giving users a choice based on
their needs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-16 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Mon, Apr 15, 2019 at 09:01:11AM -0400, Stephen Frost wrote:
> > * Robert Haas (robertmh...@gmail.com) wrote:
> > > Several companies, including EnterpriseDB, NTT, and Postgres Pro, have
> > > developed technology that permits a block-level incremental backup to
> > > be taken from a PostgreSQL server.  I believe the idea in all of those
> > > cases is that non-relation files should be backed up in their
> > > entirety, but for relation files, only those blocks that have been
> > > changed need to be backed up.
> > 
> > I love the general idea of having additional facilities in core to
> > support block-level incremental backups.  I've long been unhappy that
> > any such approach ends up being limited to a subset of the files which
> > need to be included in the backup, meaning the rest of the files have to
> > be backed up in their entirety.  I don't think we have to solve for that
> > as part of this, but I'd like to see a discussion for how to deal with
> > the other files which are being backed up to avoid needing to just
> > wholesale copy them.
> 
> I assume you are talking about non-heap/index files.  Which of those are
> large enough to benefit from incremental backup?

Based on discussions I had with Andrey, specifically the visibility map
is an issue for them with WAL-G.  I haven't spent a lot of time thinking
about it, but I can understand how that could be an issue.

> > I'm quite concerned that trying to graft this on to pg_basebackup
> > (which, as you note later, is missing an awful lot of what users expect
> > from a real backup solution already- retention handling, parallel
> > capabilities, WAL archive management, and many more... but also is just
> > not nearly as developed a tool as the external solutions) is going to
> > make things unnecessarily difficult when what we really want here is
> > better support from core for block-level incremental backup for the
> > existing external tools to leverage.
> 
> I think there is some interesting complexity brought up in this thread. 
> Which options are going to minimize storage I/O, network I/O, have only
> background overhead, allow parallel operation, integrate with
> pg_basebackup.  Eventually we will need to evaluate the incremental
> backup options against these criteria.

This presumes that we're going to have multiple competing incremental
backup options presented, doesn't it?  Are you aware of another effort
going on which aims for inclusion in core?  There's been past attempts
made, but I don't believe there's anyone else currently planning to or
working on something for inclusion in core.

Just to be clear- we're not currently working on one, but I'd really
like to see core provide good support for incremental block-level backup
so that we can leverage when it is there.

Thanks!

Stephen




Re: New vacuum option to do only freezing

2019-04-16 Thread Tom Lane
I wrote:
> I'm thinking that we really need to upgrade vacuum's reporting totals
> so that it accounts in some more-honest way for pre-existing dead
> line pointers.  The patch as it stands has made the reporting even more
> confusing, rather than less so.

Here's a couple of ideas about that:

1. Ignore heap_page_prune's activity altogether, on the grounds that
it's just random chance that any cleanup done there was done during
VACUUM and not some preceding DML operation.  Make tups_vacuumed
be the count of dead line pointers removed.  The advantage of this
way is that tups_vacuumed would become independent of previous
non-VACUUM pruning activity, as well as preceding index-cleanup-disabled
VACUUMs.  But maybe it's hiding too much information.

2. Have heap_page_prune count, and add to tups_vacuumed, only HOT
tuples that it deleted entirely.  The action of replacing a DEAD
root tuple with a dead line pointer doesn't count for anything.
Then also add the count of dead line pointers removed to tups_vacuumed.

Approach #2 is closer to the way we've defined tups_vacuumed up to
now, but I think it'd be more realistic in cases where previous
pruning or index-cleanup-disabled vacuums have left lots of dead
line pointers.

I'm not especially wedded to either of these --- any other ideas?

regards, tom lane




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-16 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
>> This can only work at all if an inaccurate map is very fail-soft,
>> which I'm not convinced it is

> I think it better needs to be fail-soft independent of the no-fsm
> patch. Because the fsm is not WAL logged etc, it's pretty easy to get a
> pretty corrupted version. And we better deal with that.

Yes, FSM has to be fail-soft from a *correctness* viewpoint; but it's
not fail-soft from a *performance* viewpoint.  It can take awhile for
us to self-heal a busted map.  And this fake map spends almost all its
time busted and in need of (expensive) corrections.  I think this may
actually be the same performance complaint you're making, in different
words.

regards, tom lane




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-16 Thread Andres Freund
Hi,

On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I'm kinda thinking that this is the wrong architecture.
> 
> The bits of that patch that I've looked at seemed like a mess
> to me too.  AFAICT, it's trying to use a single global "map"
> for all relations (strike 1) without any clear tracking of
> which relation the map currently describes (strike 2).

Well, strike 2 basically is not a problem right now, because the map is
cleared whenever a search for a target buffer succeeded. But that has
pretty obvious efficiency issues...


> This can only work at all if an inaccurate map is very fail-soft,
> which I'm not convinced it is

I think it better needs to be fail-soft independent of the no-fsm
patch. Because the fsm is not WAL logged etc, it's pretty easy to get a
pretty corrupted version. And we better deal with that.


> and in any case it seems pretty inefficient for workloads that insert
> into multiple tables.

As is, it's inefficient for insertions into a *single* relation. The
RelationGetTargetBlock() makes it not crazily expensive, but it's still
plenty expensive.


> I'd have expected any such map to be per-table and be stored in
> the relcache.

Same.

Greetings,

Andres Freund




Re: Improve search for missing parent downlinks in amcheck

2019-04-16 Thread Peter Geoghegan
On Tue, Apr 16, 2019 at 12:00 PM Peter Geoghegan  wrote:
> Can you be more specific? What was the cause of the corruption? I'm
> always very interested in hearing about cases that amcheck could have
> detected, but didn't.

FWIW, v4 indexes in Postgres 12 will support the new "rootdescend"
verification option, which isn't lossy, and would certainly have
detected your customer issue in practice. Admittedly the new check is
quite expensive, even compared to the other bt_index_parent_check()
checks, but it is nice that we now have a verification option that is
*extremely* thorough, and uses _bt_search() directly.

-- 
Peter Geoghegan




Re: Improve search for missing parent downlinks in amcheck

2019-04-16 Thread Peter Geoghegan
On Mon, Apr 15, 2019 at 7:30 PM Alexander Korotkov
 wrote:
> Currently amcheck supports lossy checking for missing parent
> downlinks.  It collects a bitmap of downlink hashes and uses it to check
> the subsequent tree level.  We've experienced some large corrupted indexes
> which pass this check due to its looseness.

Can you be more specific? What was the cause of the corruption? I'm
always very interested in hearing about cases that amcheck could have
detected, but didn't.

Was the issue that the Bloom filter was simply undersized/ineffective?

> However, it seems to me we can implement this check in non-lossy
> manner without making it significantly slower.  We anyway traverse
> downlinks from parent to children in order to verify that hikeys are
> corresponding to downlink keys.

Actually, we don't check the high keys in children against the parent
(all other items are checked, though). We probably *should* do
something with the high key when verifying consistency across levels,
but currently we don't. (We only use the high key for the same-page
high key check -- more on this below.)

> We can also traverse from one
> downlink to subsequent using rightlinks.  So, if there are some
> intermediate pages between them, they are candidates to have missing
> parent downlinks.  The patch is attached.
>
> With this patch amcheck could successfully detect corruption for our
> customer, which unpatched amcheck couldn't find.

Maybe we can be a lot less conservative about sizing the Bloom filter
instead? That would be an easier fix IMV -- we can probably change
that logic to be a lot more aggressive without anybody noticing, since
the Bloom filter is already usually tiny. We are already not very
careful about saving work within bt_index_parent_check(), but with
this patch we follow each downlink twice instead of once. That seems
wasteful.
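
To put numbers on "a lot less conservative": a Bloom filter for n elements at
false-positive rate p needs about m = -n * ln(p) / (ln 2)^2 bits, with
k = (m/n) * ln 2 hash functions. A quick back-of-the-envelope sketch (not the
actual amcheck sizing code, just the standard formula):

    #include <math.h>
    #include <stdio.h>

    /* Bits required to hold n elements at false-positive rate p. */
    static double
    bloom_bits(double n, double p)
    {
        return -n * log(p) / (log(2) * log(2));
    }

    int
    main(void)
    {
        const double n = 10e6;      /* say, ten million downlinks */
        const double rates[] = {0.01, 0.001, 0.0001};

        for (int i = 0; i < 3; i++)
        {
            double m = bloom_bits(n, rates[i]);
            double k = (m / n) * log(2);

            printf("p = %.4f: %.1f MB, %.1f hash functions\n",
                   rates[i], m / 8.0 / (1024.0 * 1024.0), k);
        }
        return 0;
    }

Even at fairly low false-positive rates the filter stays small next to the
index itself, which is the argument for sizing it more aggressively.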

The reason I used a Bloom filter here is because I would eventually
like the missing downlink check to fingerprint entire tuples, not just
block numbers. In L&Y terms, the check could in the future fingerprint
the separator key and the downlink at the same time, not just the
downlink. That way, we could probe the Bloom filter on the next level
down for its high key (with the right sibling pointer set to be
consistent with the parent) iff we don't see that the page split was
interrupted (i.e. iff P_INCOMPLETE_SPLIT() bit is not set). Obviously
this would be a more effective form of verification, since we would
notice high key values that don't agree with the parent's values for
the same sibling/cousin/child block.

I didn't do it that way for v11 because of "minus infinity" items on
internal pages, which don't store the original key (the key remains
the high key of the left sibling page, but we truncate the original to
0 attributes in _bt_pgaddtup()). I think that we should eventually
stop truncating minus infinity items, and actually store a "low key"
at P_FIRSTDATAKEY() within internal pages instead. That would be
useful for other things anyway (e.g. prefix compression).

--
Peter Geoghegan




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-16 Thread Tom Lane
Andres Freund  writes:
> I'm kinda thinking that this is the wrong architecture.

The bits of that patch that I've looked at seemed like a mess
to me too.  AFAICT, it's trying to use a single global "map"
for all relations (strike 1) without any clear tracking of
which relation the map currently describes (strike 2).
This can only work at all if an inaccurate map is very fail-soft,
which I'm not convinced it is, and in any case it seems pretty
inefficient for workloads that insert into multiple tables.

I'd have expected any such map to be per-table and be stored in
the relcache.

regards, tom lane




Re: hyrax vs. RelationBuildPartitionDesc

2019-04-16 Thread Tom Lane
Amit Langote  writes:
>> I get that we want to get rid of the keep_* kludge in the long term, but
>> is it wrong to think, for example, that having keep_partdesc today allows
>> us today to keep the pointer to rd_partdesc as long as we're holding the
>> relation open or refcnt on the whole relation such as with
>> PartitionDirectory mechanism?

Well, it's safe from the caller's standpoint as long as a suitable lock is
being held, which is neither well-defined nor enforced in any way :-(

> Ah, we're also trying to fix the memory leak caused by the current
> design of PartitionDirectory.  AIUI, the design assumes that the leak
> would occur in fairly rare cases, but maybe not so?  If partitions are
> frequently attached/detached concurrently (maybe won't be too uncommon
> if reduced lock levels encourages users) causing the PartitionDesc of
> a given relation changing all the time, then a planning session that's
> holding the PartitionDirectory containing that relation would leak as
> many PartitionDescs as there were concurrent changes, I guess.

We should get a relcache inval after a partdesc change, but the problem
with the current code is that that will only result in freeing the old
partdesc if the inval event is processed while the relcache entry has
refcount zero.  Otherwise the old rd_pdcxt is just shoved onto the
context chain, where it could survive indefinitely.

I'm not sure that this is really a huge problem in practice.  The example
I gave upthread shows that a partdesc-changing transaction's own internal
invals do arrive during CommandCounterIncrement calls that occur while the
relcache pin is held; but it seems a bit artificial to assume that one
transaction would do a huge number of such changes.  (Although, hm, maybe
a single-transaction pg_restore run could have an issue.)  Once out of
the transaction, it's okay because we'll again invalidate the entry
at the start of the next transaction, and then the refcount will be zero
and we'll clean up.  For other sessions it'd only happen if they saw the
inval while already holding a pin on the partitioned table, which probably
requires some unlucky timing; and that'd have to happen repeatedly to have
a leak that amounts to anything.

Still, though, I'm unhappy with the code as it stands.  It's risky to
assume that it has no unpleasant behaviors that we haven't spotted yet
but will manifest after v12 is in the field.  And I do not think that
it represents a solid base to build on.  (As an example, if we made
any effort to get rid of the redundant extra inval events that occur
post-transaction, we'd suddenly have a much worse problem here.)
I'd rather go over to the copy-based solution for now, which *is*
semantically sound, and accept that we still have more performance
work to do.  It's not like v12 isn't going to be light-years ahead of
v11 in this area anyway.

regards, tom lane




Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-16 Thread Andres Freund
Hi,

I'm somewhat unhappy in how much the no-fsm-for-small-rels exposed
complexity that looks like it should be purely in freespacemap.c to
callers.


 extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
-extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
+extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
+bool check_fsm_only);

So now freespace.c has an argument that says we should only check the
fsm. That's confusing. And it's not explained to callers what that
argument means, and when it should be set.


@@ -176,20 +269,44 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber 
oldPage,
  * Note that if the new spaceAvail value is higher than the old value stored
  * in the FSM, the space might not become visible to searchers until the next
  * FreeSpaceMapVacuum call, which updates the upper level pages.
+ *
+ * Callers have no need for a local map.
  */
 void
-RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
+RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk,
+   Size spaceAvail, BlockNumber nblocks)

There's no explanation as to what that "nblocks" argument is. One
basically has to search other callers to figure it out. It's not even
clear to which fork it relates to. Nor that one can set it to
InvalidBlockNumber if one doesn't have the relation size conveniently
reachable.  But it's not exposed to RecordAndGetPageWithFreeSpace(), for
a basically unexplained reason.  There's a comment above
fsm_allow_writes() - but that's a file-local function that external
callers basically have no need to know about.

I can't figure out what "Callers have no need for a local map." is
supposed to mean.


+/*
+ * Clear the local map.  We must call this when we have found a block with
+ * enough free space, when we extend the relation, or on transaction abort.
+ */
+void
+FSMClearLocalMap(void)
+{
+   if (FSM_LOCAL_MAP_EXISTS)
+   {
+   fsm_local_map.nblocks = 0;
+   memset(&fsm_local_map.map, FSM_LOCAL_NOT_AVAIL,
+  sizeof(fsm_local_map.map));
+   }
+}
+

So now there's a new function one needs to call after successfully using
the block returned by [RecordAnd]GetPageWithFreeSpace().  But it's not
referenced from those functions, so basically one has to just know that.


+/* Only create the FSM if the heap has greater than this many blocks */
+#define HEAP_FSM_CREATION_THRESHOLD 4

Hm, this seems to be tying freespace.c closer to heap than I think is
great - think of new AMs like zheap, that also want to use it.


I think this is mostly fallout about the prime issue I'm unhappy
about. There's now some global variable in freespacemap.c that code
using freespace.c has to know about and maintain.


+static void
+fsm_local_set(Relation rel, BlockNumber cur_nblocks)
+{
+   BlockNumber blkno,
+   cached_target_block;
+
+   /* The local map must not be set already. */
+   Assert(!FSM_LOCAL_MAP_EXISTS);
+
+   /*
+* Starting at the current last block in the relation and working
+* backwards, mark alternating blocks as available.
+*/
+   blkno = cur_nblocks - 1;

That comment explains very little about why this is done, and why it's a
good idea.

+/* Status codes for the local map. */
+
+/* Either already tried, or beyond the end of the relation */
+#define FSM_LOCAL_NOT_AVAIL 0x00
+
+/* Available to try */
+#define FSM_LOCAL_AVAIL 0x01

+/* Local map of block numbers for small heaps with no FSM. */
+typedef struct
+{
+   BlockNumber nblocks;
+   uint8   map[HEAP_FSM_CREATION_THRESHOLD];
+}  FSMLocalMap;
+

Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
really only need one bit per block, it seems like map should really
be just a uint32 with one bit per page.


+static bool
+fsm_allow_writes(Relation rel, BlockNumber heapblk,
+BlockNumber nblocks, BlockNumber *get_nblocks)

+   if (rel->rd_rel->relpages != InvalidBlockNumber &&
+   rel->rd_rel->relpages > HEAP_FSM_CREATION_THRESHOLD)
+   return true;
+   else
+   skip_get_nblocks = false;
+   }

This badly needs a comment explaining that these values can be basically
arbitrarily out of date, and explaining why it's correct to rely on them
anyway (presumably because creating an fsm unnecessarily is ok, it just
avoids using this optimization).


+static bool
+fsm_allow_writes(Relation rel, BlockNumber heapblk,
+BlockNumber nblocks, BlockNumber *get_nblocks)

+   RelationOpenSmgr(rel);
+   if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
+   return true;

Isn't this like really expensive? mdexists() closes the relation and
reopens it from scratch. Shouldn't we at the very least check
smgr_fsm_nblocks beforehand, so this is only done once?
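A rough sketch of the short-circuit being suggested; this assumes the
cached smgr_fsm_nblocks field can be taken to mean "the FSM fork has
already been opened and measured", and it is not code from the patch:

    /* Sketch: consult the smgr cache before the expensive smgrexists() probe. */
    RelationOpenSmgr(rel);
    if (rel->rd_smgr->smgr_fsm_nblocks != InvalidBlockNumber)
        return true;            /* FSM fork size already cached, so it exists */
    if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
        return true;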


I'm kinda thinking that this is the wrong architecture.

1) Unless I miss something, this will trigger a
   RelationGetNumberOfBlocks(), which in turn ends up doing an 

Re: hyrax vs. RelationBuildPartitionDesc

2019-04-16 Thread Tom Lane
Amit Langote  writes:
> On 2019/04/15 2:38, Tom Lane wrote:
>> To my mind there are only two trustworthy solutions to the problem of
>> wanting time-extended usage of a relcache subsidiary data structure: one
>> is to copy it, and the other is to reference-count it.  I think that going
>> over to a reference-count-based approach for many of these structures
>> might well be something we should do in future, maybe even the very near
>> future.  In the mean time, though, I'm not really satisfied with inserting
>> half-baked kluges, especially not ones that are different from our other
>> half-baked kluges for similar purposes.  I think that's a path to creating
>> hard-to-reproduce bugs.

> +1 to reference-count-based approach.

> I've occasionally wondered why there is both keep_tupdesc kludge and
> refcounting for table TupleDescs.  I guess it's because *only* the
> TupleTableSlot mechanism in the executor uses TupleDesc pinning (that is,
> refcounting) and the rest of the sites depend on the shaky guarantee
> provided by keep_tupdesc.

The reason for that is simply that at the time we added TupleDesc
refcounts, we didn't want to do the extra work of making all uses
of relcache entries' tupdescs deal with refcounting; keep_tupdesc
is certainly a kluge, but it works for an awful lot of callers.
We'd have to go back and deal with that more honestly if we go down
this path.

I'm inclined to think we could still allow many call sites to not
do an incr/decr-refcount dance as long as they're just fetching
scalar values out of the relcache's tupdesc, and not keeping any
pointer into it.  However, it's hard to see how to enforce such
a rule mechanically, so it might be impractically error-prone
to allow that.
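For anyone not familiar with the existing machinery, the incr/decr-refcount
dance in question looks roughly like the fragment below, using the existing
PinTupleDesc/ReleaseTupleDesc macros; the surrounding code is purely
illustrative (rel is assumed to be an already-open Relation):

    /* Illustrative only: pinning a relcache tupdesc across arbitrary work. */
    TupleDesc   tupdesc = RelationGetDescr(rel);
    int         natts;

    PinTupleDesc(tupdesc);      /* bump refcount so a relcache inval can't free it */

    /* ... arbitrary work that might accept invalidations ... */
    natts = tupdesc->natts;

    ReleaseTupleDesc(tupdesc);  /* drop the reference when done */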

regards, tom lane




Re: Ltree syntax improvement

2019-04-16 Thread Dmitry Belyavsky
Dear Nikolay,

Many thanks for your efforts!

On Sat, Apr 6, 2019 at 2:29 PM Nikolay Shaplov  wrote:

> In a message of Sunday, 24 February 2019 14:31:55 MSK, Dmitry
> Belyavsky wrote:
>
> Hi! I'm back here again.
>
> I've been thinking about this patch for a while... I've come to some
> conclusions and written some examples...
>
> First, I came to the idea that the best way to simplify the code is to
> change the state machine from if to switch/case, because in your code a
> lot of repetition is done just because you can't say "That's all, let's go
> to the next symbol" at any place in the code. Now it has to follow all the
> branches of the if-else tree that is inside the state-machine "node" to
> get to the next symbol.
>
> To show how much simpler things would be, I changed the state machine
> processing in lquery_in from if to switch/case, and changed the code for
> LQPRS_WAITLEVEL state processing, removing duplicate code and using
> "break" wherever possible.
>
> (The indentation in this example is left non-standard to make the diff
> clearer.)
>
> So, from this much code:
> =
> if (state == LQPRS_WAITLEVEL)
> {
> if (t_isspace(ptr))
> {
> ptr += charlen;
> pos++;
> continue;
> }
>
> escaped_count = 0;
> real_levels++;
> if (charlen == 1)
> {
> if (t_iseq(ptr, '!'))
> {
> GETVAR(curqlevel) = lptr =
> (nodeitem *)
> palloc0(sizeof(nodeitem) * numOR);
> lptr->start = ptr + 1;
> state = LQPRS_WAITDELIM;
> curqlevel->numvar = 1;
> curqlevel->flag |= LQL_NOT;
> hasnot = true;
> }
> else if (t_iseq(ptr, '*'))
> state = LQPRS_WAITOPEN;
> else if (t_iseq(ptr, '\\'))
> {
> GETVAR(curqlevel) = lptr =
> (nodeitem *)
> palloc0(sizeof(nodeitem) * numOR);
> lptr->start = ptr;
> curqlevel->numvar = 1;
> state = LQPRS_WAITESCAPED;
> }
> else if (strchr(".|@%{}", *ptr))
> {
> UNCHAR;
> }
> else
> {
> GETVAR(curqlevel) = lptr =
> (nodeitem *)
> palloc0(sizeof(nodeitem) * numOR);
> lptr->start = ptr;
> state = LQPRS_WAITDELIM;
> curqlevel->numvar = 1;
> if (t_iseq(ptr, '"'))
> {
> lptr->flag |=
> LVAR_QUOTEDPART;
> }
> }
> }
> else
> {
> GETVAR(curqlevel) = lptr = (nodeitem *)
> palloc0(sizeof(nodeitem) *
> numOR);
> lptr->start = ptr;
> state = LQPRS_WAITDELIM;
> curqlevel->numvar = 1;
> }
> }
> =
> I came to this
> =
>  case LQPRS_WAITLEVEL:
> if (t_isspace(ptr))
> break; /* Just go to next symbol */
> escaped_count = 0;
> real_levels++;
>
> if (charlen == 1)
> {
> if (strchr(".|@%{}", *ptr))
> UNCHAR;
> if (t_iseq(ptr, '*'))
> {
> state = LQPRS_WAITOPEN;
> break;
> }
> }
> GETVAR(curqlevel) = lptr = (nodeitem *)
> palloc0(sizeof(nodeitem) *
> numOR);
> lptr->start = ptr;
> curqlevel->numvar = 1;
> state = LQPRS_WAITDELIM;

Re: New vacuum option to do only freezing

2019-04-16 Thread Tom Lane
So after thinking about this a bit more ...

ISTM that what we have here is a race condition (ie, tuple changed state
since heap_page_prune), and that ideally we want the code to resolve it
as if no race had happened.  That is, either of these behaviors would
be acceptable:

1. Delete the tuple, just as heap_page_prune would've done if it had seen
it DEAD.  (Possibly we could implement that by jumping back and doing
heap_page_prune again, but that seems pretty messy and bug-prone.
In any case, if we're not doing index cleanup then this must reduce to
"replace tuple by a dead line pointer", not remove it altogether.)

2. Act as if the tuple were still live, just as would've happened if the
state didn't change till just after we looked instead of just before.

Option #2 is a lot simpler and safer, and can be implemented as I
suggested earlier, assuming we're all good with the assumption that
heap_prepare_freeze_tuple isn't going to do anything bad.

However ... it strikes me that there's yet another assumption in here
that this patch has broken.  Namely, notice that the reason we normally
don't get here is that what heap_page_prune does with an already-DEAD
tuple is reduce it to a dead line pointer and then count it in its
return value, which gets added to tups_vacuumed.  But then what
lazy_scan_heap's per-tuple loop does is

/*
 * DEAD item pointers are to be vacuumed normally; but we don't
 * count them in tups_vacuumed, else we'd be double-counting (at
 * least in the common case where heap_page_prune() just freed up
 * a non-HOT tuple).
 */
if (ItemIdIsDead(itemid))
{
lazy_record_dead_tuple(vacrelstats, &(tuple.t_self));
all_visible = false;
continue;
}

When this patch is active, it will *greatly* increase the odds that
we report a misleading tups_vacuumed total, for two different reasons:

* DEAD tuples reduced to dead line pointers during heap_page_prune will be
counted as tups_vacuumed, even though we don't take the further step of
removing the dead line pointer, as always happened before.

* When, after some vacuum cycles with index_cleanup disabled, we finally
do one with index_cleanup enabled, there are going to be a heck of a lot
of old dead line pointers to clean out, which the existing logic won't
count at all.  That was only barely tolerable before, and it seems like
this has pushed it over the bounds of silliness.  People are going to
be wondering why vacuum reports that it removed zillions of index
entries and no tuples.

I'm thinking that we really need to upgrade vacuum's reporting totals
so that it accounts in some more-honest way for pre-existing dead
line pointers.  The patch as it stands has made the reporting even more
confusing, rather than less so.

BTW, the fact that dead line pointers will accumulate without limit
makes me even more dubious of the proposition that this "feature"
will be safe to enable as a reloption in production.  I really think
that we ought to restrict it to be a manual VACUUM option, to be
used only when you're desperate to freeze old tuples.

regards, tom lane




Re: Zedstore - compressed in-core columnar storage

2019-04-16 Thread Tomas Vondra

On Mon, Apr 15, 2019 at 10:45:51PM -0700, Ashwin Agrawal wrote:

On Mon, Apr 15, 2019 at 12:50 PM Peter Geoghegan  wrote:


On Mon, Apr 15, 2019 at 9:16 AM Ashwin Agrawal 
wrote:
> Would like to know more specifics on this Peter. We may be having
different context on hybrid row/column design.

I'm confused about how close your idea of a TID is to the traditional
definition from heapam (and even zheap). If it's a purely logical
identifier, then why would it have two components like a TID? Is that
just a short-term convenience or something?



TID is a purely logical identifier. Hence, as stated in the initial email,
for a Zedstore TID the block number and offset split carries no meaning at
all. It's purely a 48-bit integer entity assigned to the datum of the first
column during insertion, based on where in the BTree it gets inserted. The
rest of the column datums are inserted using this assigned TID value. It's
only the rest-of-system restrictions discussed by Heikki and Andres on the
table AM thread that limit the values it can carry currently; otherwise,
from the zedstore design perspective, it is just an integer.



I'm not sure it's that clear cut, actually. Sure, it's not the usual
(block,item) pair so it's not possible to jump to the exact location, so
it's not the raw physical identifier as regular TID. But the data are
organized in a btree, with the TID as a key, so it does actually provide
some information about the location.

I've asked about BRIN indexes elsewhere in this thread, which I think is
related to this question, because that index type relies on TID providing
sufficient information about location. And I think BRIN indexes are going
to be rather important for colstores (and formats like ORC have something
very similar built-in).

But maybe all we'll have to do is define the ranges differently - instead
of "number of pages" we may define them as "number of rows" and it might
be working.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Speedup of relation deletes during recovery

2019-04-16 Thread Fujii Masao
On Tue, Apr 16, 2019 at 10:48 AM Jamison, Kirk  wrote:
>
> Hello Fujii-san,
>
> On April 18, 2018, Fujii Masao wrote:
>
> > On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki 
> >  wrote:
> >> Furthermore, TRUNCATE has a similar and worse issue.  While DROP TABLE
> >> scans the shared buffers once for each table, TRUNCATE does that for
> >> each fork, resulting in three scans per table.  How about changing the
> >> following functions
> >>
> >> smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber
> >> nblocks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber 
> >> forkNum,
> >>BlockNumber firstDelBlock)
> >>
> >> to
> >>
> >> smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber
> >> *nblocks, int nforks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, 
> >> ForkNumber *forkNum,
> >>BlockNumber *firstDelBlock,
> >> int nforks)
> >>
> >> to perform the scan only once per table?  If there doesn't seem to be a 
> >> problem,
> >> I think I'll submit the patch next month.  I'd welcome if anyone could do 
> >> that.
> >
> > Yeah, it's worth working on this problem. To decrease the number of scans of
> > shared_buffers, you would need to change the order of truncations of files 
> > and WAL
> > logging. In RelationTruncate(), currently WAL is logged after FSM and VM 
> > are truncated.
> > IOW, with the patch, FSM and VM would need to be truncated after WAL 
> > logging. You would
> > need to check whether this reordering is valid.
>
> I know it's been almost a year since the previous message, but if you
> could still recall your suggestion above, I'd like to ask a question.
> Could you explain your idea a bit further on how the reordering of WAL
> writing and file truncation would possibly reduce the number of
> shared_buffers scans?

Sorry, I forgot the details of that comment of mine. But anyway, you want to
modify smgr_redo(info = XLOG_SMGR_TRUNCATE) so that the number of
scans of shared_buffers is decreased to one. Right?

IMO it's worth thinking about changing smgrtruncate(MAIN_FORK),
FreeSpaceMapTruncateRel() and visibilitymap_truncate() so that
they just mark the relation and blocks as to-be-deleted, and then
scan shared_buffers once at the end of smgr_redo() to invalidate
the blocks.
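To make the batching idea concrete, a caller of the proposed array-taking
smgrtruncate() might look roughly like the sketch below; the signature is
only the one proposed upthread (it does not exist yet), and reln and the
new_*_nblocks variables are made up and assumed to be computed elsewhere:

    /* Hypothetical usage of the proposed batched truncation API. */
    ForkNumber  forks[MAX_FORKNUM + 1];
    BlockNumber blocks[MAX_FORKNUM + 1];
    int         nforks = 0;

    forks[nforks] = MAIN_FORKNUM;
    blocks[nforks++] = new_main_nblocks;
    forks[nforks] = FSM_FORKNUM;
    blocks[nforks++] = new_fsm_nblocks;
    forks[nforks] = VISIBILITYMAP_FORKNUM;
    blocks[nforks++] = new_vm_nblocks;

    /* One pass over shared_buffers for all forks, instead of one per fork. */
    smgrtruncate(reln, forks, blocks, nforks);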

Regards,

-- 
Fujii Masao




Re: New vacuum option to do only freezing

2019-04-16 Thread Andres Freund
Hi,

On 2019-04-16 12:01:36 -0400, Tom Lane wrote:
> (BTW, I don't understand why that code will throw "found xmin %u from
> before relfrozenxid %u" if HeapTupleHeaderXminFrozen is true?  Shouldn't
> the whole if-branch at lines 6113ff be skipped if xmin_frozen?)

I *think* that just looks odd, but isn't actively wrong. That's because
TransactionIdIsNormal() won't trigger, as:

#define HeapTupleHeaderGetXmin(tup) \
( \
HeapTupleHeaderXminFrozen(tup) ? \
FrozenTransactionId : HeapTupleHeaderGetRawXmin(tup) \
)

which afaict makes
xmin_frozen = ((xid == FrozenTransactionId) ||
   HeapTupleHeaderXminFrozen(tuple));
redundant.

Looks like that was introduced relatively recently, in

commit d2599ecfcc74fea9fad1720a70210a740c716730
Author: Alvaro Herrera 
Date:   2018-05-04 15:24:44 -0300

Don't mark pages all-visible spuriously


@@ -6814,6 +6815,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
/* Process xmin */
xid = HeapTupleHeaderGetXmin(tuple);
+   xmin_frozen = ((xid == FrozenTransactionId) ||
+  HeapTupleHeaderXminFrozen(tuple));
if (TransactionIdIsNormal(xid))
{
if (TransactionIdPrecedes(xid, relfrozenxid))
@@ -6832,9 +6835,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
+   xmin_frozen = true;
}
-   else
-   totally_frozen = false;
}

Greetings,

Andres Freund




Re: New vacuum option to do only freezing

2019-04-16 Thread Tom Lane
Robert Haas  writes:
> On Mon, Apr 15, 2019 at 9:07 PM Tom Lane  wrote:
>> If we're failing to remove it, and it's below the desired freeze
>> horizon, then we'd darn well better freeze it instead, no?

> I don't know that that's safe.  IIRC, the freeze code doesn't cope
> nicely with being given a tuple that actually ought to have been
> deleted.  It'll just freeze it anyway, which is obviously bad.

Looking at heap_prepare_freeze_tuple, it looks to me like it'd notice
the problem and throw an error.  The two possible reasons for a tuple
to be dead are xmin aborted and xmax committed, right?  There are
tests in there that will complain if either of those is true and
the xid is below the freeze horizon.

Given that we don't get here except when the tuple has just become dead,
it probably is all right to assume that it can't possibly get selected
for freezing, and let those tests backstop the assumption.

(BTW, I don't understand why that code will throw "found xmin %u from
before relfrozenxid %u" if HeapTupleHeaderXminFrozen is true?  Shouldn't
the whole if-branch at lines 6113ff be skipped if xmin_frozen?)

regards, tom lane

PS: I see that mandrill just replicated the topminnow failure that
started this discussion.




Re: New vacuum option to do only freezing

2019-04-16 Thread Andres Freund
Hi,

On 2019-04-16 11:38:01 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > On 2019-Apr-16, Robert Haas wrote:
> >> On Mon, Apr 15, 2019 at 9:07 PM Tom Lane  wrote:
> >>> If we're failing to remove it, and it's below the desired freeze
> >>> horizon, then we'd darn well better freeze it instead, no?
> 
> >> I don't know that that's safe.  IIRC, the freeze code doesn't cope
> >> nicely with being given a tuple that actually ought to have been
> >> deleted.  It'll just freeze it anyway, which is obviously bad.
> 
> > Umm, but if we fail to freeze it, we'll leave a tuple around that's
> > below the relfrozenxid for the table, causing later pg_commit to be
> > truncated and error messages saying that the tuple cannot be read, no?
> 
> Yeah.  If you think that it's unsafe to freeze the tuple, then this
> entire patch is ill-conceived and needs to be reverted.  I don't
> know how much more plainly I can put it: index_cleanup cannot be a
> license to ignore the freeze horizon.  (Indeed, I do not quite see
> what the point of the feature is otherwise.  Why would you run a
> vacuum with this option at all, if not to increase the table's
> relfrozenxid?  But you can *not* advance relfrozenxid if you left
> old XIDs behind.)

As I just wrote - I don't think this codepath can ever deal with tuples
that old.

Greetings,

Andres Freund




Re: New vacuum option to do only freezing

2019-04-16 Thread Andres Freund
Hi,

On 2019-04-16 10:54:34 -0400, Alvaro Herrera wrote:
> On 2019-Apr-16, Robert Haas wrote:
> > On Mon, Apr 15, 2019 at 9:07 PM Tom Lane  wrote:
> > > > I'm not sure that's correct.  If you do that, it'll end up in the
> > > > non-tupgone case, which might try to freeze a tuple that should've
> > > > been removed.  Or am I confused?
> > >
> > > If we're failing to remove it, and it's below the desired freeze
> > > horizon, then we'd darn well better freeze it instead, no?
> > 
> > I don't know that that's safe.  IIRC, the freeze code doesn't cope
> > nicely with being given a tuple that actually ought to have been
> > deleted.  It'll just freeze it anyway, which is obviously bad.
> 
> Umm, but if we fail to freeze it, we'll leave a tuple around that's
> below the relfrozenxid for the table, causing later pg_commit to be
> truncated and error messages saying that the tuple cannot be read, no?

Is the below-relfrozenxid case actually reachable? Isn't the theory of
that whole codeblock that we ought to only get there if a transaction
concurrently commits?

 * Ordinarily, DEAD tuples would have been removed by
 * heap_page_prune(), but it's possible that the tuple
 * state changed since heap_page_prune() looked.  In
 * particular an INSERT_IN_PROGRESS tuple could have
 * changed to DEAD if the inserter aborted.  So this
 * cannot be considered an error condition.

And in case there was a concurrent transaction at the time of the
heap_page_prune(), it has got to be above the OldestXmin passed to
HeapTupleSatisfiesVacuum() too - as it's the same OldestXmin value.  And
as FreezeLimit should always be older than OldestXmin, we should
never get into a situation where heap_page_prune() couldn't prune
something that we would have been forced to remove?


> > I don't know that that's safe.  IIRC, the freeze code doesn't cope
> > nicely with being given a tuple that actually ought to have been
> > deleted.  It'll just freeze it anyway, which is obviously bad.
> >
> > Unless this has been changed since I last looked at it.
> 
> I don't think so.

I think it has changed a bit - these days heap_prepare_freeze_tuple()
will detect that case, and error out:

		/*
		 * If we freeze xmax, make absolutely sure that it's not an XID
		 * that is important.  (Note, a lock-only xmax can be removed
		 * independent of committedness, since a committed lock holder has
		 * released the lock).
		 */
		if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
			TransactionIdDidCommit(xid))
			ereport(ERROR,
					(errcode(ERRCODE_DATA_CORRUPTED),
					 errmsg_internal("cannot freeze committed xmax %u",
									 xid)));
and the equivalent multixact case:

		if (TransactionIdDidCommit(xid))
			ereport(ERROR,
					(errcode(ERRCODE_DATA_CORRUPTED),
					 errmsg_internal("cannot freeze committed update xid %u", xid)));

We even complain if xmin is uncommitted and would need to be frozen:

		if (TransactionIdPrecedes(xid, cutoff_xid))
		{
			if (!TransactionIdDidCommit(xid))
				ereport(ERROR,
						(errcode(ERRCODE_DATA_CORRUPTED),
						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
										 xid, cutoff_xid)));


IIRC I added that after one of the multixact issues led to precisely
that: heap_prepare_freeze_tuple() leading to a valid xmax just being
emptied out, resurfacing dead tuples (and HOT corruption and such).

These messages are obviously intended to be a backstop against
continuing to corrupt further, rather than something a user should
ever see in a working system.

Greetings,

Andres Freund




Re: Checksum errors in pg_stat_database

2019-04-16 Thread Robert Treat
On Mon, Apr 15, 2019 at 3:32 PM Julien Rouhaud  wrote:
>
> Sorry for late reply,
>
> On Sun, Apr 14, 2019 at 7:12 PM Magnus Hagander  wrote:
> >
> > On Sat, Apr 13, 2019 at 8:46 PM Robert Treat  wrote:
> >>
> >> On Fri, Apr 12, 2019 at 8:18 AM Magnus Hagander  
> >> wrote:
> >> ISTM the argument here is go with zero since you have zero connections
> >> vs go with null since you can't actually connect, so it doesn't make
> >> sense. (There is a third argument about making it -1 since you can't
> >> connect, but that breaks sum(numbackends) so it's easily dismissed.) I
> >> think I would have gone for 0 personally, but what ended up surprising
> >> me was that a bunch of other stuff like xact_commit show zero when
> >> AFAICT the above reasoning would apply the same to those columns.
> >> (unless there is a way to commit a transaction in the global objects
> >> that I don't know about).
> >
> >
> > That's a good point. I mean, you can commit a transaction that involves 
> > changes of global objects, but it counts in the database that you were 
> > conneced to.
> >
> > We should probably at least make it consistent and make it NULL in all or 0 
> > in all.
> >
> > I'm -1 for using -1 (!), for the very reason that you mention. But either 
> > changing the numbackends to 0, or the others to NULL would work for 
> > consistency. I'm leaning towards the 0 as well.
>
> +1 for 0 :)  Especially since it's less code in the view.
>

+1 for 0

> >> What originally got me looking at this was the idea of returning -1
> >> (or maybe null) for checksum failures for cases when checksums are not
> >> enabled. This seems a little more complicated to set up, but seems
> >> like it might ward off people thinking they are safe due to no
> >> checksum error reports when they actually aren't.
> >
> >
> > NULL seems like the reasonable thing to return there. I'm not sure what 
> > you're referring to with a little more complicated to set up, though? Do 
> > you mean somehow for the end user?
> >
> > Code-wise it seems it should be simple -- just do an "if checksums disabled 
> > then return null"  in the two functions.
>
> That's indeed a good point!  Lack of checksum error is distinct from
> checksums not activated and we should make it obvious.
>
> I don't know if that counts as an open item, but I attach a patch for
> all points discussed here.

ISTM we should mention shared objects in both places in the docs, and
want "NULL if data checksums" rather than "NULL is data checksums".
Attaching slightly modified patch with those changes, but otherwise
LGTM.
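(For concreteness, the "return NULL when checksums are disabled" behavior
being discussed could look roughly like the following in the per-database
stats function; this is only a sketch from memory, not the attached patch.)

Datum
pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
{
	Oid			dbid = PG_GETARG_OID(0);
	PgStat_StatDBEntry *dbentry;

	if (!DataChecksumsEnabled())
		PG_RETURN_NULL();		/* distinguish "disabled" from "zero failures" */

	if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
		PG_RETURN_INT64(0);

	PG_RETURN_INT64(dbentry->n_checksum_failures);
}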

Robert Treat
https://xzilla.net


checksums_reporting_fix_v2.diff
Description: Binary data


Re: New vacuum option to do only freezing

2019-04-16 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Apr-16, Robert Haas wrote:
>> On Mon, Apr 15, 2019 at 9:07 PM Tom Lane  wrote:
>>> If we're failing to remove it, and it's below the desired freeze
>>> horizon, then we'd darn well better freeze it instead, no?

>> I don't know that that's safe.  IIRC, the freeze code doesn't cope
>> nicely with being given a tuple that actually ought to have been
>> deleted.  It'll just freeze it anyway, which is obviously bad.

> Umm, but if we fail to freeze it, we'll leave a tuple around that's
> below the relfrozenxid for the table, causing later pg_commit to be
> truncated and error messages saying that the tuple cannot be read, no?

Yeah.  If you think that it's unsafe to freeze the tuple, then this
entire patch is ill-conceived and needs to be reverted.  I don't
know how much more plainly I can put it: index_cleanup cannot be a
license to ignore the freeze horizon.  (Indeed, I do not quite see
what the point of the feature is otherwise.  Why would you run a
vacuum with this option at all, if not to increase the table's
relfrozenxid?  But you can *not* advance relfrozenxid if you left
old XIDs behind.)

regards, tom lane




Re: New vacuum option to do only freezing

2019-04-16 Thread Masahiko Sawada
On Tue, Apr 16, 2019 at 11:26 PM Robert Haas  wrote:
>
> On Mon, Apr 15, 2019 at 9:07 PM Tom Lane  wrote:
> > > I'm not sure that's correct.  If you do that, it'll end up in the
> > > non-tupgone case, which might try to freeze a tuple that should've
> > > been removed.  Or am I confused?
> >
> > If we're failing to remove it, and it's below the desired freeze
> > horizon, then we'd darn well better freeze it instead, no?
>
> I don't know that that's safe.  IIRC, the freeze code doesn't cope
> nicely with being given a tuple that actually ought to have been
> deleted.  It'll just freeze it anyway, which is obviously bad.

Hmm, I think that we already choose to leave HEAPTUPLE_DEAD tuples, and
might freeze them, if HeapTupleIsHotUpdated() || HeapTupleIsHeapOnly()
is true (L1083 at vacuumlazy.c), even though they actually have to be
deleted. What is the difference between these tuples and the tuples
that we intentionally leave when index cleanup is disabled?
Maybe I'm missing something and am confused.




Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: Autovacuum-induced regression test instability

2019-04-16 Thread Tom Lane
Michael Paquier  writes:
> Aren't extra ORDER BY clauses the usual response to tuple ordering?  I
> really think that we should be more aggressive with that.

I'm not excited about that.  The traditional argument against it
is that if we start testing ORDER BY queries exclusively (and it
would have to be pretty nearly exclusively, if we were to take
this seriously) then we'll lack test coverage for queries without
ORDER BY.  Also, regardless of whether you think that regression
test results can be kicked around at will, we are certainly going
to hear complaints from users if traditional behaviors like
"inserting N rows into a new table, then selecting them, gives
those rows back in the same order" go away.  Recall that we had
to provide a way to disable the syncscan optimization because
some users complained about the loss of row-ordering consistency.

regards, tom lane




Re: New vacuum option to do only freezing

2019-04-16 Thread Alvaro Herrera
On 2019-Apr-16, Robert Haas wrote:

> On Mon, Apr 15, 2019 at 9:07 PM Tom Lane  wrote:
> > > I'm not sure that's correct.  If you do that, it'll end up in the
> > > non-tupgone case, which might try to freeze a tuple that should've
> > > been removed.  Or am I confused?
> >
> > If we're failing to remove it, and it's below the desired freeze
> > horizon, then we'd darn well better freeze it instead, no?
> 
> I don't know that that's safe.  IIRC, the freeze code doesn't cope
> nicely with being given a tuple that actually ought to have been
> deleted.  It'll just freeze it anyway, which is obviously bad.

Umm, but if we fail to freeze it, we'll leave a tuple around that's
below the relfrozenxid for the table, causing later pg_commit to be
truncated and error messages saying that the tuple cannot be read, no?

I remember that for a similar case in multixact-land, what we do is
generate a fresh multixact that carries the members that are still alive
(ie. those that cause the multixact to be kept rather than remove it),
and relabel the tuple with that one.  So the old multixact can be
removed safely.  Obviously we cannot do that for XIDs, but I do wonder
what can possibly cause a tuple to be unfreezable yet the XID to be
below the freeze horizon.  Surely if the transaction is that old, we
should have complained about it, and generated a freeze horizon that was
even older?

> Unless this has been changed since I last looked at it.

I don't think so.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Calling pgstat_report_wait_end() before ereport(ERROR)

2019-04-16 Thread Tom Lane
Michael Paquier  writes:
> In short, I tend to think that the attached is an acceptable cleanup.
> Thoughts?

WFM.

regards, tom lane




Re: New vacuum option to do only freezing

2019-04-16 Thread Robert Haas
On Mon, Apr 15, 2019 at 9:07 PM Tom Lane  wrote:
> > I'm not sure that's correct.  If you do that, it'll end up in the
> > non-tupgone case, which might try to freeze a tuple that should've
> > been removed.  Or am I confused?
>
> If we're failing to remove it, and it's below the desired freeze
> horizon, then we'd darn well better freeze it instead, no?

I don't know that that's safe.  IIRC, the freeze code doesn't cope
nicely with being given a tuple that actually ought to have been
deleted.  It'll just freeze it anyway, which is obviously bad.

Unless this has been changed since I last looked at it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Commit message / hash in commitfest page.

2019-04-16 Thread Tom Lane
Magnus Hagander  writes:
> On Tue, Apr 16, 2019 at 8:55 AM Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
>> On 2019-04-16 08:47, Magnus Hagander wrote:
>>> Unless we want to go all the way and have said bot actualy close the CF
>>> entry. But the question is, do we?

>> I don't think so.  There are too many special cases that would make this
>> unreliable, like one commit fest thread consisting of multiple patches.

> I definitely don't think we should close just because they show up.

Agreed.

> ... Which means we'd have the async/out-of-order issue.

I don't see that as much of a problem.  The use-case for these links,
as I understand it, is for retrospective examination of CF data anyway.
The mere fact of closing the CF entry is enough for real-time status.

regards, tom lane




Re: Compressed TOAST Slicing

2019-04-16 Thread Andrey Borodin



> On 9 Apr 2019, at 22:30, Tom Lane wrote:
> 
> The proposal is kind of cute, but I'll bet it's a net loss for
> small copy lengths --- likely we'd want some cutoff below which
> we do it with the dumb byte-at-a-time loop.

True.
I've made a simple extension to compare decompression times on pgbench-generated
WAL [0].

Using the smart memcpy unless the match length is smaller than 16 (a sane,
arbitrarily chosen value) gives about a 20% speedup in decompression time.
Using memcpy alone gives a smaller effect.

We will dig into this further.
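For context, the kind of cutoff we are experimenting with looks roughly like
the sketch below; the threshold name and helper function are illustrative
only, not the benchmarked code. memcpy() is used only when the source and
destination regions cannot overlap:

#include <string.h>

#define PGLZ_MEMCPY_THRESHOLD	16	/* cutoff under evaluation */

/* Illustrative sketch of copying one back-reference match during decompression. */
static inline unsigned char *
copy_match(unsigned char *dp, int off, int len)
{
	if (len < PGLZ_MEMCPY_THRESHOLD || off < len)
	{
		/* short or overlapping match: the classic byte-at-a-time loop */
		while (len-- > 0)
		{
			*dp = dp[-off];
			dp++;
		}
	}
	else
	{
		/* non-overlapping and long enough: copy in one go */
		memcpy(dp, dp - off, len);
		dp += len;
	}
	return dp;
}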

Best regards, Andrey Borodin.


[0] https://github.com/x4m/test_pglz



Re: pg_dump is broken for partition tablespaces

2019-04-16 Thread David Rowley
On Mon, 15 Apr 2019 at 15:26, Alvaro Herrera  wrote:
>
> On 2019-Apr-15, David Rowley wrote:
>
> > To be honest, if I'd done a better job of thinking through the
> > implications of this tablespace inheritance in ca4103025d, then I'd
> > probably have not bothered submitting a patch for it.  We could easily
> > revert that, but we'd still be left with the same behaviour in
> > partitioned indexes, which is in PG11.
>
> Well, I suppose if we do decide to revert it for tables, we should do it
> for both tables and indexes.  But as I said, I'm not yet convinced that
> this is the best way forward.

Ok.  Any ideas or suggestions on how we move on from here?  It seems
like a bit of a stalemate.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Runtime pruning problem

2019-04-16 Thread David Rowley
On Tue, 16 Apr 2019 at 23:55, Yuzuko Hosoya  wrote:
> postgres=# explain analyze select * from t1 where dt = current_date + 400;
>   QUERY PLAN
> ---
>  Append  (cost=0.00..198.42 rows=44 width=8) (actual time=0.000..0.001 rows=0 
> loops=1)
>Subplans Removed: 3
>->  Seq Scan on t1_1  (cost=0.00..49.55 rows=11 width=8) (never executed)
>  Filter: (dt = (CURRENT_DATE + 400))
>  Planning Time: 0.400 ms
>  Execution Time: 0.070 ms
> (6 rows)
> 
>
> I realized t1_1 was not scanned actually since "never executed"
> was displayed in the plan using EXPLAIN ANALYZE.  But I think
> "One-Time Filter: false" and "Subplans Removed: ALL" or something
> like that should be displayed instead.
>
> What do you think?

This is intended behaviour explained by the following comment in nodeAppend.c

/*
* The case where no subplans survive pruning must be handled
* specially.  The problem here is that code in explain.c requires
* an Append to have at least one subplan in order for it to
* properly determine the Vars in that subplan's targetlist.  We
* sidestep this issue by just initializing the first subplan and
* setting as_whichplan to NO_MATCHING_SUBPLANS to indicate that
* we don't really need to scan any subnodes.
*/

It's true that there is a small overhead in this case of having to
initialise a useless subplan, but the code never tries to pull any
tuples from it, so it should be fairly minimal.  I expected that using
a value that matches no partitions would be unusual enough not to go
contorting explain.c into working for this case.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Runtime pruning problem

2019-04-16 Thread Yuzuko Hosoya
Hi all,

I found a runtime pruning test case which may be a problem as follows:


create table t1 (id int, dt date) partition by range(dt);
create table t1_1 partition of t1 for values from ('2019-01-01') to 
('2019-04-01');
create table t1_2 partition of t1 for values from ('2019-04-01') to 
('2019-07-01');
create table t1_3 partition of t1 for values from ('2019-07-01') to 
('2019-10-01');
create table t1_4 partition of t1 for values from ('2019-10-01') to 
('2020-01-01');

In this example, current_date is 2019-04-16.

postgres=# explain select * from t1 where dt = current_date + 400;
 QUERY PLAN 

 Append  (cost=0.00..198.42 rows=44 width=8)
   Subplans Removed: 3
   ->  Seq Scan on t1_1  (cost=0.00..49.55 rows=11 width=8)
 Filter: (dt = (CURRENT_DATE + 400))
(4 rows)

postgres=# explain analyze select * from t1 where dt = current_date + 400;
  QUERY PLAN
   
---
 Append  (cost=0.00..198.42 rows=44 width=8) (actual time=0.000..0.001 rows=0 
loops=1)
   Subplans Removed: 3
   ->  Seq Scan on t1_1  (cost=0.00..49.55 rows=11 width=8) (never executed)
 Filter: (dt = (CURRENT_DATE + 400))
 Planning Time: 0.400 ms
 Execution Time: 0.070 ms
(6 rows)


I realized t1_1 was not scanned actually since "never executed" 
was displayed in the plan using EXPLAIN ANALYZE.  But I think 
"One-Time Filter: false" and "Subplans Removed: ALL" or something
like that should be displayed instead.

What do you think?


Best regards,
Yuzuko Hosoya
NTT Open Source Software Center






Re: Caveats from reloption toast_tuple_target

2019-04-16 Thread David Rowley
On Fri, 5 Apr 2019 at 17:31, Pavan Deolasee  wrote:
> IMV it makes sense to simply cap the lower limit of toast_tuple_target to the 
> compile time default and update docs to reflect that. Otherwise, we need to 
> deal with the possibility of dynamically creating the toast table if the 
> relation option is lowered after creating the table. Your proposed patch 
> handles the case when the toast_tuple_target is specified at CREATE TABLE, 
> but we would still have problem with ALTER TABLE, no? But there might be side 
> effects of changing the lower limit for pg_dump/pg_restore. So we would need 
> to think about that too.

I've attached a patch which increases the lower limit up to
TOAST_TUPLE_TARGET.  Unfortunately, reloptions don't have an
assign_hook like GUCs do. Unless we add those we've no way to still
accept lower values without an error.  Does anyone think that's worth
adding for this?  Without it, it's possible that pg_restore /
pg_upgrade could fail if someone happened to have toast_tuple_target
set lower than 2032 bytes.

I didn't bother capping RelationGetToastTupleTarget() to ensure it
never returns less than TOAST_TUPLE_TARGET since the code that
currently uses it can't trigger if it's lower than that value.
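For anyone not familiar with reloptions.c, the change amounts to raising the
minimum in the toast_tuple_target intRelOpts entry, roughly like the sketch
below (paraphrased from memory, not the attached patch):

	/* Sketch: raise the reloption's minimum from 128 to TOAST_TUPLE_TARGET. */
	{
		{
			"toast_tuple_target",
			"Sets the target tuple length at which external columns will be toasted",
			RELOPT_KIND_HEAP,
			ShareUpdateExclusiveLock
		},
		TOAST_TUPLE_TARGET,		/* default */
		TOAST_TUPLE_TARGET,		/* minimum, previously 128 */
		TOAST_TUPLE_MAX			/* maximum */
	},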

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


toast_tuple_target_fix_david.patch
Description: Binary data


Re: Calling pgstat_report_wait_end() before ereport(ERROR)

2019-04-16 Thread Masahiko Sawada
On Tue, Apr 16, 2019 at 2:45 PM Michael Paquier  wrote:
>
> On Fri, Apr 12, 2019 at 10:06:41PM +0900, Masahiko Sawada wrote:
> > But I think that's not right, I've checked the code. If the startup
> > process failed in that function it raises a FATAL and recovery fails,
> > and if checkpointer process does then it calls
> > pgstat_report_wait_end() in CheckpointerMain().
>
> Well, the point is that the code raises an ERROR, then a FATAL because
> it gets upgraded by recovery.  The take, at least it seems to me, is
> that if any new caller of the function misses to clean up the event
> then the routine gets cleared.  So it seems to me that the current
> coding is aimed to be more defensive than anything.  I agree that
> there is perhaps little point in doing so.  In my experience a backend
> switches very quickly back to ClientRead, cleaning up the previous
> event.  Looking around, we have also some code paths in slot.c and
> origin.c which close a transient file, clear the event flag...  And
> then PANIC, which makes even less sense.
>
> In short, I tend to think that the attached is an acceptable cleanup.
> Thoughts?

Agreed. There is also some code which raises an ERROR after closing a
transient file, but I think it's a good idea not to include it, for
safety. It looks to me that the patch you proposed cleans up as many
places as we can.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: Calling pgstat_report_wait_end() before ereport(ERROR)

2019-04-16 Thread Masahiko Sawada
On Fri, Apr 12, 2019 at 11:05 PM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > There are something like the following code in many places in PostgreSQL 
> > code.
> > ...
> > Since we eventually call
> > pgstat_report_wait_end() in AbortTransaction(). I think that we don't
> > need to call pgstat_report_wait_end() if we're going to raise an error
> > just after that. Is that right?
>
> Yes ... and those CloseTransientFile calls are unnecessary as well.
>
> To a first approximation, *any* cleanup-type call occurring just before
> an ereport(ERROR) is probably unnecessary, or if it is necessary then
> the code is broken in other ways.  One should not assume that there is
> no other way for an error to be thrown while the resource is held, and
> therefore it's generally better design to have enough infrastructure
> so that the error cleanup mechanisms can handle whatever cleanup is
> needed.  We certainly have such infrastructure for OpenTransientFile/
> CloseTransientFile, and according to what you say above (I didn't
> check it) pgstat wait reporting is handled similarly.  So these
> call sites could all be simplified substantially.
>
> There are exceptions to this rule of thumb.  In some places, for
> instance, it's worth releasing a lock before ereport simply to shorten
> the length of time that the lock might stay held.  And there are places
> where a very low-level resource (such as a spinlock) is only held in
> straight-line code so there's not really need for error cleanup
> infrastructure for it.  Perhaps there's an argument to be made that
> pgstat wait reporting could be put in this second category, but
> I doubt it.
>

Thank you for the explanation! That's really helpful for me.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: New vacuum option to do only freezing

2019-04-16 Thread Masahiko Sawada
On Tue, Apr 16, 2019 at 4:47 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Mon, Apr 15, 2019 at 1:13 PM Tom Lane  wrote:
> >> I have a very strong feeling that this patch was not fully baked.
>
> > I think you're right, but I don't understand the comment in the
> > preceding paragraph.  How does this problem prevent tupgone from
> > getting set?
>
> My point is that I suspect that tupgone *shouldn't* get set.
> It's not (going to be) gone.
>
> > It looks to me like nleft_dead_tuples should be ripped out.
>
> That was pretty much what I was thinking too.

tups_vacuumed counts not only (1) dead-but-not-yet-removable tuples but
also HOT-pruned tuples. These HOT-pruned tuples include both (2) tuples
for which we removed both the itemid and the tuple storage, and (3)
tuples for which we removed only the tuple storage and marked the itemid
as dead. So we cannot add tups_vacuumed to nkeeps, as it includes
completely removed tuples like tuples-(2). I added nleft_dead_itemids to
count only tuples-(3) and nleft_dead_tuples to count only tuples-(1) for
reporting. Tuples-(2) are removed even when index cleanup is disabled.

> It makes more sense
> just to treat this case identically to dead-but-not-yet-removable.
> I have substantial doubts about nleft_dead_itemids being worth
> anything, as well.

I think that adding tuples-(3) to nkeeps would be a good idea. If
we do that, nleft_dead_tuples is no longer necessary. On the other
hand, I think we need nleft_dead_itemids to report how many itemids we
left when index cleanup is disabled.
Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: extensions are hitting the ceiling

2019-04-16 Thread Eric Hanson
On Tue, Apr 16, 2019 at 4:47 AM Eric Hanson  wrote:

> We would probably be wise to learn from what has gone (so I hear) terribly
> wrong with the Node / NPM packaging system (and I'm sure many before it),
> namely versioning.  What happens when two extensions require different
> versions of the same extension?  At a glance it almost seems unsolvable,
> given the constraint that an extension can only be installed once, and only
> at a single version.  I don't understand why that constraint exists though.
>

How about this:

1. Extension can be installed once *per-version*.
2. Each version of an extension that is installed is assigned by the system
a unique, hidden schema (similar to temp table schemas) whose name doesn't
matter because the extension user will never need to know it.
3. There exists a dynamic variable, like you proposed above, but it
includes version number as well.  @DEPNAME_VERSION_schema@ perhaps.  This
variable would resolve to the system-assigned schema name of the extension
specified, at the version specified.
4. Since sprinkling one's code with version numbers is awful, there exists a
way (which I haven't thought of) to set a kind of search_path-type setting
which sets in the current scope the version number of the extension that
should be dereferenced, so developers can still use @DEPNAME_schema@.

This would allow multiple versions of extensions to coexist, and it would
solve the problem with two extensions wanting the same dependency in
different places.

It's radical, but extensions are radically broken.  A critique of the above
would be that extensions still have a single global namespace, so
personally I don't think it even goes far enough.

Cheers,
Eric


Re: extensions are hitting the ceiling

2019-04-16 Thread Eric Hanson
On Tue, Apr 16, 2019 at 4:24 AM Eric Hanson  wrote:

>
>
> On Tue, Apr 16, 2019 at 12:47 AM Noah Misch  wrote:
>
>> On Mon, Mar 18, 2019 at 09:38:19PM -0500, Eric Hanson wrote:
>> > I have heard talk of a way to write extensions so that they dynamically
>> > reference the schema of their dependencies, but sure don't know how that
>> > would work if it's possible.  The @extschema@ variable references the
>> > *current* extension's schema, but not there is no dynamic variable to
>> > reference the schema of a dependency.
>>
>> If desperate, you can do it like this:
>>
>>   DO $$ BEGIN EXECUTE format('SELECT %I.earth()',
>> (SELECT nspname FROM pg_namespace n
>>  JOIN pg_extension ON n.oid = extnamespace
>>  WHERE extname = 'earthdistance' )); END $$;
>>
>> Needless to say, that's too ugly.  Though probably unimportant in
>> practice, it
>> also has a race condition vs. ALTER EXTENSION SET SCHEMA.
>>
>> > Also it is possible in theory to dynamically set search_path to contain
>> > every schema of every dependency in play and then just not specify a
>> schema
>> > when you use something in a dependency.  But this ANDs together all the
>> > scopes of all the dependencies of an extension, introducing potential
>> for
>> > collisions, and is generally kind of clunky.
>>
>> That's how it works today, and it has the problems you describe.  I
>> discussed
>> some solution candidates here:
>>
>> https://www.postgresql.org/message-id/20180710014308.ga805...@rfd.leadboat.com
>>
>> The @DEPNAME_schema@ thing was trivial to implement, but I shelved it.
>> I'm
>> attaching the proof of concept, for your information.
>>
>
> Interesting.
>
> Why shelved?  I like it.  You said you lean toward 2b in the link above,
> but there is no 2b :-) but 1b was this option, which maybe you meant?
>
> The other approach would be to have each extension be in its own schema,
> whose name is fixed for life.  Then there are no collisions and no
> ambiguity about their location.   I don't use NPM but was just reading
> about how they converted their package namespace from a single global
> namespace with I think it was 30k packages in it,
> to @organization/packagename.  I don't know how folks would feel about a
> central namespace registry, I don't love the idea if we can find a way
> around it, but would settle for it if there's no better solution.  Either
> that or use a UUID as the schema name.  Truly hideous.  But it seems like
> your approach above with just dynamically looking up the extension's schema
> as a variable would solve everything.
>
> There is the problem of sequencing, where extension A installs dependency
> extension B in its own schema.  Then extension C also wants to use
> dependency B, but extension A is uninstalled and extension B is now still
> hanging around in A's old schema.  Not ideal but at least everything would
> still function.
>
> I'll keep thinking about it...
>

We would probably be wise to learn from what has gone (so I hear) terribly
wrong with the Node / NPM packaging system (and I'm sure many before it),
namely versioning.  What happens when two extensions require different
versions of the same extension?  At a glance it almost seems unsolvable,
given the constraint that an extension can only be installed once, and only
at a single version.  I don't understand why that constraint exists though.

Eric


Re: extensions are hitting the ceiling

2019-04-16 Thread Eric Hanson
On Tue, Apr 16, 2019 at 12:47 AM Noah Misch  wrote:

> On Mon, Mar 18, 2019 at 09:38:19PM -0500, Eric Hanson wrote:
> > I have heard talk of a way to write extensions so that they dynamically
> > reference the schema of their dependencies, but I sure don't know how that
> > would work if it's possible.  The @extschema@ variable references the
> > *current* extension's schema, but there is no dynamic variable to
> > reference the schema of a dependency.
>
> If desperate, you can do it like this:
>
>   DO $$ BEGIN EXECUTE format('SELECT %I.earth()',
> (SELECT nspname FROM pg_namespace n
>  JOIN pg_extension ON n.oid = extnamespace
>  WHERE extname = 'earthdistance' )); END $$;
>
> Needless to say, that's too ugly.  Though probably unimportant in
> practice, it
> also has a race condition vs. ALTER EXTENSION SET SCHEMA.
>
> > Also it is possible in theory to dynamically set search_path to contain
> > every schema of every dependency in play and then just not specify a
> schema
> > when you use something in a dependency.  But this ANDs together all the
> > scopes of all the dependencies of an extension, introducing potential for
> > collisions, and is generally kind of clunky.
>
> That's how it works today, and it has the problems you describe.  I
> discussed
> some solution candidates here:
>
> https://www.postgresql.org/message-id/20180710014308.ga805...@rfd.leadboat.com
>
> The @DEPNAME_schema@ thing was trivial to implement, but I shelved it.
> I'm
> attaching the proof of concept, for your information.
>

Interesting.

Why shelved?  I like it.  You said you lean toward 2b in the link above,
but there is no 2b :-) but 1b was this option, which maybe you meant?

The other approach would be to have each extension be in its own schema,
whose name is fixed for life.  Then there are no collisions and no
ambiguity about their location.   I don't use NPM but was just reading
about how they converted their package namespace from a single global
namespace with I think it was 30k packages in it,
to @organization/packagename.  I don't know how folks would feel about a
central namespace registry, I don't love the idea if we can find a way
around it, but would settle for it if there's no better solution.  Either
that or use a UUID as the schema name.  Truly hideous.  But it seems like
your approach above with just dynamically looking up the extension's schema
as a variable would solve everything.

There is the problem of sequencing, where extension A installs dependency
extension B in its own schema.  Then extension C also wants to use
dependency B, but extension A is uninstalled and extension B is now still
hanging around in A's old schema.  Not ideal but at least everything would
still function.

I'll keep thinking about it...


> > #2:  Data in Extensions
> >
> > Extensions that are just a collection of functions and types seem to be
> the
> > norm.  Extensions can contain what the docs call "configuration" data,
> but
> > rows are really second class citizens:  They aren't tracked with
> > pg_catalog.pg_depend, they aren't deleted when the extension is dropped,
> > etc.
> >
> > Sometimes it would make sense for an extension to contain *only* data, or
> > insert some rows in a table that the extension doesn't "own", but has as
> a
> > dependency.  For example, a "webserver" extension might contain a
> > "resource" table that serves up the content of resources in the table at
> a
> > specified path. But then, another extension, say an app, might want to
> just
> > list the webserver extension as a dependency, and insert a few resource
> > rows into it.  This is really from what I can tell beyond the scope of
> what
> > extensions are capable of.
>
> I never thought of this use case.  Interesting.
>

It's a *really* powerful pattern.  I am sure of this because I've been
exploring it while developing a row packaging system modeled after git [1],
and using it in conjunction with EXTENSIONs with extreme joy.  But one does
rows, and the other does DDL, and this is not ideal.

Cheers,
Eric

[1]
https://github.com/aquametalabs/aquameta/tree/master/src/pg-extension/bundle


Compile with 64-bit kerberos on Windows

2019-04-16 Thread Peifeng Qiu
Hi, hackers.

I'm trying to build 64-bit Windows binaries with Kerberos support.
I downloaded the latest Kerberos source package from here:
https://kerberos.org/dist/index.html
I followed the instructions in src\windows\README, and executed the
following script in a 64-bit Visual Studio Command Prompt to build and
install it.

set NO_LEASH=1
set PATH=%PATH%;"%WindowsSdkVerBinPath%"\x86
set KRB_INSTALL_DIR=C:\krb5
cd src
nmake -f Makefile.in prep-windows
nmake NODEBUG=1
nmake install NODEBUG=1

To compile postgres with kerberos support, we need to configure the install
location in src/tools/msvc/config.pl
our $config = {gss => 'C:/krb5'};

If I run build.pl, the compiler will complain that gssapi.h is not found.
At src/tools/msvc/Solution.pm line 633, we can see the include directory is
set to '\inc\krb5'. This is no longer the case for the 64-bit Kerberos package.

The correct include directory is '\include'. The library paths also need to
be fixed to use the 64-bit versions.

Here's a patch that fixes these paths; with it we can build 64-bit
binaries with Kerberos support successfully.

Best regards,
Peifeng Qiu


compile-with-64-bit-kerberos-on-windows-v1.patch
Description: Binary data


RE: libpq debug log

2019-04-16 Thread Iwata, Aya
Hi Horiguchi-san,

Thank you for your review.
I updated the patch. Please see the attached patch.

> +/* protocol message name */
> +static char *command_text_b[] = {
> 
> Couldn't the name be more descriptive? The comment just above doesn't seem
> consistent with the variable.  The tables are very sparse. I think the
> definition could be in more compact form.
Thank you. I made the description clearer.

> 
> + /* y */ 0,
> + /* z */ 0
> +};
> +#define COMMAND_BF_MAX (sizeof(command_text_bf) /
> +sizeof(*command_text_bf))
> 
> It seems that at least the trailing 0-elements are not required.
Sure. I removed them.

> + * message_get_command_text:
> + *   Get Protocol message text from byte1 identifier
> + */
> +static char *
> +message_get_command_text(unsigned char c, CommunicationDirection
> +direction)
> ..
> +message_nchar(PGconn *conn, const char *v, int length,
> +CommunicationDirection direction)
> 
> Also the function names are not very descriptive.
Thank you. I fixed function names and added descriptions.

> 
> +message_Int(PGconn *conn, int v, int length, CommunicationDirection
> +direct)
> 
> We are not using names mixing CamelCase and underscored there.
> 
> 
> + if (c >= 0 && c < COMMAND_BF_MAX)
> + {
> + text = command_text_bf[c];
> + if (text)
> + return text;
> + }
> +
> + if (direction == FROM_BACKEND && c >= 0 && c < COMMAND_B_MAX)
> + {
> + text = command_text_b[c];
> + if (text)
> ..
> + if (direction == FROM_FRONTEND && c >= 0 && c < COMMAND_F_MAX)
> 
> 
> This code is assuming that elements of command_text_bf is mutually exclusive
> with command_text_b or _bf. That is, the first has an element for 'C', others
> don't have an element in the same position. But _bf[C] = "CommandComplete"
> and _f[C] = "Close". Is it working correctly?
The elements sent from both the backend and the frontend are 'c' and 'd'.
There are no elements common to protocol_message_type_b and _bf.
The same applies to protocol_message_type_f and _bf, too. So I think it is
working correctly.


> +typedef enum CommunicationDirection
> 
> The type CommunicationDirection is two-member enum which is equivalent to
> just a boolean. Is there a reason you define that?
> 
> +typedef enum State
> +typedef enum Type
> 
> The name is too generic.
> +typedef struct _LoggingMsg
> ...
> +} LoggingMsg;
> 
> Why the tag name is prefixed with an underscore?
> 
> +typedef struct _Frontend_Entry
> 
> The name doesn't give an idea of its characteristics.
Thank you. I fixed.

Regards,
Aya Iwata


v3-libpq-PQtrace-output-one-line.patch
Description: v3-libpq-PQtrace-output-one-line.patch


Re: Commit message / hash in commitfest page.

2019-04-16 Thread Magnus Hagander
On Tue, Apr 16, 2019 at 8:55 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-04-16 08:47, Magnus Hagander wrote:
> > Unless we want to go all the way and have said bot actualy close the CF
> > entry. But the question is, do we?
>
> I don't think so.  There are too many special cases that would make this
> unreliable, like one commit fest thread consisting of multiple patches.
>

I definitely don't think we should close just because they show up. It
would also require a keyword somewhere to indicate that it should be
closed. Of course, it can still lead to weird results when the same thread
is attached to multiple CF entries etc. So I agree, I don't think we'd want
that. Which means we'd have the async/out-of-order issue.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Race conditions with checkpointer and shutdown

2019-04-16 Thread Michael Paquier
Hi all,

This is a continuation of the following thread, but I prefer spawning
a new thread for clarity:
https://www.postgresql.org/message-id/20190416064512.gj2...@paquier.xyz

The buildfarm has reported two similar failures when shutting down a
node:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01

In both cases, the instance cannot shut down because it times out,
waiting for the shutdown checkpoint to finish but I suspect that this
checkpoint actually never happens.

The first case involves piculet, which builds with --disable-atomics and
gcc 6, in the recovery test 016_min_consistency, where we trigger a
checkpoint and then issue a fast shutdown on a standby.  At this point the
test waits forever.

The second case involves dragonet, which builds with clang and has JIT
enabled.  The failure is in test 009_twophase.pl, after the test prepares
transaction xact_009_11 and a *standby* gets restarted.  Again, the test
waits forever for the instance to shut down.

The most recent commits touching checkpoints are 0dfe3d0e and c6c9474a,
which map roughly to the point where the failures began to happen, so
something related to clean standby shutdowns may have broken since then.

Thanks,
--
Michael




Re: [PATCH v20] GSSAPI encryption support

2019-04-16 Thread Peter Eisentraut
On 2019-04-16 06:36, Michael Paquier wrote:
> +$node->append_conf('pg_hba.conf',
> +   qq{hostgssenc all all $hostaddr/32 gss map=mymap});
> +$node->restart;
> A reload should be enough, but it is not race-condition free, which is why
> a set of restarts is done in this test, right?  (I have noticed that it has
> been done this way since the beginning.)

We got rid of all the reloads in these kinds of tests because, if the
configuration contains an error, a reload just ignores it, whereas a restart
makes the error visible.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Minimal logical decoding on standbys

2019-04-16 Thread Amit Khandekar
On Sat, 13 Apr 2019 at 00:57, Andres Freund  wrote:
>
> Hi,
>
> On 2019-04-12 23:34:02 +0530, Amit Khandekar wrote:
> > I tried to see if I can quickly understand what's going on.
> >
> > Here, master wal_level is hot_standby, not logical, though slave
> > wal_level is logical.
>
> Oh, that's well diagnosed.  Cool.  Also nicely tested - this'd be ugly
> in production.

Tushar had made me aware of the fact that this reproduces only when
master wal_level is hot_standby.

>
> I assume the problem isn't present if you set the primary to wal_level =
> logical?

Right.

>
>
> > Not sure why this is happening. On the slave, wal_level is logical, so
> > logical records should have tuple data. Not sure what that has to do
> > with the wal_level of the master. Everything should be there on the slave
> > after it replays the inserts; and the slave's wal_level is also logical.
>
> The standby doesn't write its own WAL, only primaries do. I thought we
> forbade running with wal_level=logical on a standby, when the primary is
> only set to replica.  But that's not what we do, see
> CheckRequiredParameterValues().
>
> I've not yet thought this through, but I think we'll have to somehow
> error out in this case.  I guess we could just check at the start of
> decoding what ControlFile->wal_level is set to,

By "start of decoding", I didn't get where exactly. Do you mean
CheckLogicalDecodingRequirements() ?

> and then raise an error
> in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
> wal_level to something lower?

I didn't get exactly where we should error out. We don't handle
XLOG_PARAMETER_CHANGE in decode.c, so you obviously meant something else,
which I didn't understand.

What I am thinking is: in CheckLogicalDecodingRequirements(), besides
checking wal_level, also check ControlFile->wal_level when InHotStandby.
That is, when we are InHotStandby, both wal_level and ControlFile->wal_level
should be >= WAL_LEVEL_LOGICAL. This would allow us to error out when a
logical slot is used while the master has an incompatible wal_level.

ControlFile is not accessible outside xlog.c, so we would need an API to
extract this field.
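
For illustration, a rough sketch of that check (not committed code):
GetControlFileWalLevel() is a hypothetical accessor that xlog.c would need to
export, RecoveryInProgress() stands in for the InHotStandby test mentioned
above, and the existing checks in the function are elided.

void
CheckLogicalDecodingRequirements(void)
{
    /* ... existing checks on wal_level, replication slots, database ... */

    /*
     * On a standby, the WAL we decode was generated by the primary, so the
     * wal_level recorded in the control file must also allow logical
     * decoding, independently of our own wal_level setting.
     */
    if (RecoveryInProgress() &&
        GetControlFileWalLevel() < WAL_LEVEL_LOGICAL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("logical decoding on a standby requires wal_level >= logical on the primary")));
}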


>
> Could you try to implement that?
>
> Greetings,
>
> Andres Freund


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Commit message / hash in commitfest page.

2019-04-16 Thread Peter Eisentraut
On 2019-04-16 08:47, Magnus Hagander wrote:
> Unless we want to go all the way and have said bot actually close the CF
> entry. But the question is, do we?

I don't think so.  There are too many special cases that would make this
unreliable, like one commit fest thread consisting of multiple patches.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: REINDEX CONCURRENTLY 2.0

2019-04-16 Thread Peter Eisentraut
On 2019-04-16 08:19, Michael Paquier wrote:
> On Fri, Apr 12, 2019 at 12:11:12PM +0100, Dagfinn Ilmari Mannsåker wrote:
>> I don't have any comments on the code (but the test looks sensible, it's
>> the same trick I used to discover the issue in the first place).
> 
> After thinking some more on it, this behavior looks rather sensible to
> me.  Are there any objections?

Looks good to me.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Commit message / hash in commitfest page.

2019-04-16 Thread Magnus Hagander
On Sat, Apr 13, 2019 at 10:28 PM Tom Lane  wrote:

> Tomas Vondra  writes:
> > On Thu, Apr 11, 2019 at 02:55:10PM +0500, Ibrar Ahmed wrote:
> >> On Thu, Apr 11, 2019 at 2:44 PM Erikjan Rijkers  wrote:
>  Is it possible to have commit-message or at least git hash in
>  commitfest. It will be very easy to track commit against commitfest
>  item.
>
> >>> Commitfest items always point to discussion threads. These threads often
> >>> end with a message that says that the patch is pushed.  IMHO, that
> >>> message would be the place to include the commithash.  It would also be
> >>> easily findable via the commitfest application.
>
> > I think it might be useful to actually have that directly in the CF app,
> > not just in the thread. There would need to be a way to enter multiple
> > hashes, because patches often have multiple pieces.
>
> > But maybe that'd be too much unnecessary burden. I don't remember when I
> > last needed this information. And I'd probably try searching in git log
> > first anyway.
>
> Yeah, I can't see committers bothering to do this.  Including the
> discussion thread link in the commit message is already pretty
> significant hassle, and something not everybody remembers/bothers with.
>
> But ... maybe it could be automated?  A bot looking at the commit log
> could probably suck out the thread links and try to match them up
> to CF entries.  Likely you could get about 90% right even without that,
> just by matching the committer's name and the time of commit vs time
> of CF entry closure.
>

Would you even need to match that? It would be easy enough to scan all git
commit messages for links to the archives and populate any CF entry that is
attached to that same thread.

Of course, that would be async, so you'd end up closing the CF entry and
then have it populated with the git information a bit later (in the simple
case where there is just one commit and then it's done).

Unless we want to go all the way and have said bot actually close the CF
entry. But the question is, do we?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint

2019-04-16 Thread Michael Paquier
On Sun, Mar 24, 2019 at 09:47:58PM +0900, Michael Paquier wrote:
> The failure is a bit weird, as I would expect all those three actions
> to be sequential.  piculet is the only failure happening on the
> buildfarm and it uses --disable-atomics, so I am wondering if that is
> related and if 0dfe3d0 is part of that.  With a primary/standby set,
> it could be possible to test that scenario pretty easily.  I'll give
> it a shot.

The buildfarm has just reported a similar failure, this time in another
test, 009_twophase.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01

I think that we have race conditions with checkpointing and shutdown
on HEAD.
--
Michael




Re: REINDEX CONCURRENTLY 2.0

2019-04-16 Thread Michael Paquier
On Fri, Apr 12, 2019 at 12:11:12PM +0100, Dagfinn Ilmari Mannsåker wrote:
> I don't have any comments on the code (but the test looks sensible, it's
> the same trick I used to discover the issue in the first place).

After thinking some more on it, this behavior looks rather sensible to
me.  Are there any objections?

> However, the doc patch lost the trailing paren:

Fixed on my branch, thanks.
--
Michael




Re: Accidental setting of XLogReaderState.private_data ?

2019-04-16 Thread Michael Paquier
On Mon, Apr 15, 2019 at 11:06:18AM -0400, Tom Lane wrote:
> Hmm.  The second, duplicate assignment is surely pointless, but I think
> that setting the ctx as the private_data is a good idea.  It hardly seems
> out of the question that it might be needed in future.

Agreed that we should keep the assignment done with
XLogReaderAllocate().  I have committed the patch which removes the
useless assignment though.
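
For context, the spot in question in logical.c's StartupDecodingContext()
looks roughly like this (paraphrased, not an exact copy of the source): ctx
is already handed to XLogReaderAllocate() as private_data, which is what made
the later explicit assignment redundant.

    ctx->reader = XLogReaderAllocate(wal_segment_size, read_page, ctx);
    if (!ctx->reader)
        ereport(ERROR,
                (errcode(ERRCODE_OUT_OF_MEMORY),
                 errmsg("out of memory")));

    /*
     * The removed line: ctx->reader->private_data = ctx;
     * XLogReaderAllocate() above has already stored ctx there.
     */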
--
Michael

