Re: Oddity with parallel safety test for scan/join target in grouping_planner

2019-03-10 Thread Etsuro Fujita

(2019/03/11 14:14), Tom Lane wrote:
> Etsuro Fujita  writes:
>> (2019/03/11 13:06), Tom Lane wrote:
>>> Is that the only possible outcome?  Per Robert's summary quoted above,
>>> it seems like it might be possible for the code to decide that the final
>>> scan/join target is parallel safe when it is not, leading to outright
>>> wrong answers or query failures.

>> Maybe I'm missing something, but I think that if the final scan/join
>> target is not parallel-safe, then the grouping target would not be
>> parallel-safe either, by the construction of the two targets, so I don't
>> think that we have such a correctness issue.

> Seems to me it's the other way around: the final target would include
> all functions invoked in the grouping target plus maybe some more.
> So a non-parallel-safe grouping target implies a non-parallel-safe
> final target, but not vice versa.

I mean the final *scan/join* target, not the final target.

Best regards,
Etsuro Fujita




Re: Oddity with parallel safety test for scan/join target in grouping_planner

2019-03-10 Thread Tom Lane
Etsuro Fujita  writes:
> (2019/03/11 13:06), Tom Lane wrote:
>> Is that the only possible outcome?  Per Robert's summary quoted above,
>> it seems like it might be possible for the code to decide that the final
scan/join target is parallel safe when it is not, leading to outright
>> wrong answers or query failures.

> Maybe I'm missing something, but I think that if the final scan/join 
> target is not parallel-safe, then the grouping target would not be 
> parallel-safe either, by the construction of the two targets, so I don't 
> think that we have such a correctness issue.

Seems to me it's the other way around: the final target would include
all functions invoked in the grouping target plus maybe some more.
So a non-parallel-safe grouping target implies a non-parallel-safe
final target, but not vice versa.

Possibly the summary had it backwards and the actual code bug is
inferring things in the safe direction, but I'm too tired to double
check that right now ...

regards, tom lane



Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-03-10 Thread amul sul
On Mon, Mar 11, 2019 at 8:29 AM Ashutosh Bapat 
wrote:

> On Thu, Mar 7, 2019 at 8:20 PM amul sul  wrote:
>
>>
>>
>> On Thu, Mar 7, 2019 at 1:02 PM amul sul  wrote:
>>
>>> Thanks Rajkumar,
>>>
>>> I am looking into this.
>>>
>>>
>> The crash happens when none of the if-else branches of
>> handle_missing_partition() is taken, so it returns with merged_index
>> unassigned.
>>
>> Let me explain: in Rajkumar's test case, the join type is JOIN_INNER.  When
>> only the outer rel has a null partition, merge_null_partitions() calls
>> handle_missing_partition() with missing_side_inner = false and
>> missing_side_outer = false
>>
>
> Both missing_side_ variables being false when the NULL partition is
> missing on the inner side looks suspicious. I guess from the variable names
> that the missing_side_inner should be true in this case.
>
>

All the places from which handle_missing_partition() gets called
have the following code to decide the values for missing_side_outer/_inner,
which I have yet to fully understand.  Do you think this has some flaw?

/*
 * For a FULL join, inner relation acts as both OUTER and INNER
 * relation.  For LEFT and ANTI join the inner relation acts as
 * INNER relation. For INNER and SEMI join OUTER and INNER
 * differentiation is immaterial.
 */
missing_side_inner = (jointype == JOIN_FULL ||
  jointype == JOIN_LEFT ||
  jointype == JOIN_ANTI);
missing_side_outer = (jointype == JOIN_FULL);



>> argument value which fails to set merged_index.
>>
>> In the attached patch, I tried to fix this case by setting merged_index
>> explicitly which fixes the reported crash.
>>
>
> I expect handle_missing_partition() to always set the merged_index. In
> your patches, I don't see that function setting it
> explicitly. If we are setting merged_index explicitly somewhere else, other
> places may miss that explicit assignment. So it's better to move it inside
> this function.
>
>

Ok, that can be fixed.

Similarly, I think merge_null_partitions() should set null_index instead of
asserting when the null partition is missing from both sides; does that make sense?

Regards,
Amul


Re: Oddity with parallel safety test for scan/join target in grouping_planner

2019-03-10 Thread Etsuro Fujita

(2019/03/11 13:06), Tom Lane wrote:
> Etsuro Fujita  writes:
>>>> The parallel safety of the final scan/join target is determined from the
>>>> grouping target, not that target, which [ is wrong ]

>> This would only affect plan quality a little bit, so I don't think we
>> need a regression test for this, either, but the fix might destabilize
>> existing plan choices, so we should apply it to HEAD only?

> Is that the only possible outcome?  Per Robert's summary quoted above,
> it seems like it might be possible for the code to decide that the final
> scan/join target is parallel safe when it is not, leading to outright
> wrong answers or query failures.

Maybe I'm missing something, but I think that if the final scan/join
target is not parallel-safe, then the grouping target would not be
parallel-safe either, by the construction of the two targets, so I don't
think that we have such a correctness issue.


Best regards,
Etsuro Fujita




Re: Offline enabling/disabling of data checksums

2019-03-10 Thread Michael Paquier
On Wed, Feb 27, 2019 at 07:59:31AM +0100, Fabien COELHO wrote:
> Hallo Michael,

Okay, let's move on with these patches!

> The renaming implies quite a few changes (eg in the documentation,
> makefiles, tests) which warrants a review, so it should be a patch. Also,
> ISTM that the renaming only make sense when adding the enable/disable
> feature, so I'd say that it belongs to this patch. Opinions?

I have worked on the last v4 sent by Michael B, finishing with the
attached after review, and addressed the last points raised by Fabien.
The thing is that I have been rather unhappy with a couple of things
in what was proposed, so I have ended up modifying quite a few
areas.  The patch set is now split as I think is suited for
commit, with the refactoring and renaming being separated from the
actual feature:
- 0001 is a patch to refactor the routine for the control file
update.  I have made it backend-aware, and we ought to be careful with
error handling, use of fds and such, something that v4 was not very
careful about.
- 0002 renames pg_verify_checksums to pg_checksums with a
straightforward switch.  Docs as well as all references to
pg_verify_checksums are updated.
- 0003 adds the new options --check, --enable and --disable, with
--check being the default as discussed.
- 0004 adds a -N/--no-sync which I think is nice for consistency with
other tools.  That's also useful for the tests, and was not discussed
until now on this thread.

> Patch applies cleanly, compiles, global & local "make check" are ok.
> 
> Doc: "enable, disable or verifies" -> "checks, enables or disables"?
> Spurious space: "> ." -> ">.".
> 
> As checksum are now checked, the doc could use "check" instead of "verify",
> especially if there is a rename and the "verify" word disappears.

That makes sense.  I have fixed these, and simplified the docs a bit
to make the page simpler.

> I'd be less terse when documenting actions, eg: "Disable checksums" ->
> "Disable checksums on cluster."

The former is fine in my opinion.

> Doc should state that checking is the default action, eg "Check checksums on
> cluster. This is the default action."

Check.

> Help string could say that -c is the default action. There is a spurious
> help line remaining from the previous "--action" implementation.

Yeah, we should.  Added.

> open: I'm positively unsure about ?: priority over |, and probably not the
> only one, so I'd add parentheses around the former.

Yes, I agree that the current code is hard to decipher.  So I reworked it
with a separate variable in scan_file, and added extra parentheses in
the part which updates the control file.

> I'm at odds with the "postmaster.pid" check, which would not prevent an
> issue if a cluster is started with "postmaster". I still think that the
> enabling-in-progress should be stored in the cluster state.
>
> ISTM that the cluster read/update cycle should lock somehow the control file
> being modified. However other commands do not seem to do something about it.

I am still not on board with adding more complexity in this area, at
least not for this stuff and for the purpose of this thread, because
this can happen to various degrees for various configurations, has been
possible for ages, and is not limited to checksums.  Also, I think that
we have yet to see users complain about that.  Here are some scenarios
where this can happen:
- A base backup partially written.  pg_basebackup limits this risk, but
it is still possible to end up with a partially-written data
folder.  And base backups have been around for many years.
- pg_rewind; the tool has been in the tree since 9.5, and was actually
available on GitHub since 9.3.

> I do not think that enabling if already enabled or disabling or already
> disable should exit(1), I think it is a no-op and should simply exit(0).

We already issue exit(1) when attempting to verify checksums on a
cluster where they are disabled.  So I agree with Michael B's point of
issuing an error in such cases.  I also think that this makes handling
for operators easier.

> About tests: I'd run a check on a disabled cluster to check that the command
> fails because disabled.

Makes sense.  Added.  We also need a test for the case of successive
runs with --enable.

Here are also some notes from my side.
- There was no need to complicate the synopsis of the docs.
- usage() still included references to --action, and the indentation at
the top was a bit too long.
- There were no tests for disabling checksums, so I have added some.
- We should check that the combination of --enable and -r fails.
- Tests use only long options; that's better for readability.
- Improved comments in tests.
- Better to check for "data_checksum_version > 0" if --enable is
used.  That's more portable long-term if more checksum versions are
added.
- The check on postmaster.pid is actually not necessary as we already
know that the cluster has been shut down cleanly per the state of the
control file.  I agree that there could be a small 

Re: Should we add GUCs to allow partition pruning to be disabled?

2019-03-10 Thread Amit Langote
On 2019/03/11 13:22, Justin Pryzby wrote:
> On Mon, Mar 11, 2019 at 01:06:08PM +0900, Amit Langote wrote:
>> On 2019/03/11 11:13, David Rowley wrote:
>>> On Mon, 11 Mar 2019 at 15:00, David Rowley  
>>> wrote:
 On Mon, 11 Mar 2019 at 14:33, Amit Langote  
 wrote:
> PG 11 moved the needle a bit for SELECT queries:
>
> Excluding unnecessary partitions is slow for UPDATE and DELETE queries,

 With those words I expect the user might be surprised that it's still
 slow after doing SET enable_partition_pruning = off;
>>>
>>> I had in mind in 10, 11 and master add a note to mention:
>>
>> Thanks for putting this together.
>>
>>> Currently, it is not recommended to have partition hierarchies more
>>> than a few hundred partitions.  Larger partition hierarchies can
>>> suffer from slow planning times with SELECT
>>> queries.  Planning times for UPDATE and
>>> DELETE commands may also suffer slow planning
>>> times, but in addition, memory consumption may also become an issue
>>> due to how the planner currently plans the query once per partition.
>>> These limitations are likely to be resolved in a future version of
>>> PostgreSQL.
> 
> Can I offer the following variation:
> 
> | Currently, it is not recommended to have partition hierarchies with more 
> than
> | a few hundred partitions.  Larger partition hierarchies may incur long
> | planning time.
> | In addition, UPDATE and DELETE
> | commands on larger hierarchies may cause excessive memory consumption.
> | These deficiencies are likely to be fixed in a future release of
> | PostgreSQL.

Says essentially the same thing but with fewer words, so +1.

Now the question is where to put this text?  Currently, we have:

5.10. Table Partitioning
  5.10.1. Overview
  5.10.2. Declarative Partitioning
  5.10.3. Implementation Using Inheritance
  5.10.4. Partition Pruning
  5.10.5. Partitioning and Constraint Exclusion

Should we add 5.10.6 Notes for the above "note", or should it be stuffed
under one of the existing sub-headings?

Thanks,
Amit




Re: Should we add GUCs to allow partition pruning to be disabled?

2019-03-10 Thread Justin Pryzby
On Mon, Mar 11, 2019 at 01:06:08PM +0900, Amit Langote wrote:
> On 2019/03/11 11:13, David Rowley wrote:
> > On Mon, 11 Mar 2019 at 15:00, David Rowley  
> > wrote:
> >> On Mon, 11 Mar 2019 at 14:33, Amit Langote  
> >> wrote:
> >>> PG 11 moved the needle a bit for SELECT queries:
> >>>
> >>> Excluding unnecessary partitions is slow for UPDATE and DELETE queries,
> >>
> >> With those words I expect the user might be surprised that it's still
> >> slow after doing SET enable_partition_pruning = off;
> > 
> > I had in mind in 10, 11 and master add a note to mention:
> 
> Thanks for putting this together.
> 
> > Currently, it is not recommended to have partition hierarchies more
> > than a few hundred partitions.  Larger partition hierarchies can
> > suffer from slow planning times with SELECT
> > queries.  Planning times for UPDATE and
> > DELETE commands may also suffer slow planning
> > times, but in addition, memory consumption may also become an issue
> > due to how the planner currently plans the query once per partition.
> > These limitations are likely to be resolved in a future version of
> > PostgreSQL.

Can I offer the following variation:

| Currently, it is not recommended to have partition hierarchies with more than
| a few hundred partitions.  Larger partition hierarchies may incur long
| planning time.
| In addition, UPDATE and DELETE
| commands on larger hierarchies may cause excessive memory consumption.
| These deficiencies are likely to be fixed in a future release of
| PostgreSQL.




Re: Should we add GUCs to allow partition pruning to be disabled?

2019-03-10 Thread Amit Langote
On 2019/03/11 11:13, David Rowley wrote:
> On Mon, 11 Mar 2019 at 15:00, David Rowley  
> wrote:
>>
>> On Mon, 11 Mar 2019 at 14:33, Amit Langote
>>  wrote:
>>> PG 11 moved the needle a bit for SELECT queries:
>>>
>>> Excluding unnecessary partitions is slow for UPDATE and DELETE queries,
>>
>> With those words I expect the user might be surprised that it's still
>> slow after doing SET enable_partition_pruning = off;
> 
> I had in mind in 10, 11 and master add a note to mention:

Thanks for putting this together.

> Currently, it is not recommended to have partition hierarchies more
> than a few hundred partitions.  Larger partition hierarchies can
> suffer from slow planning times with SELECT
> queries.  Planning times for UPDATE and
> DELETE commands may also suffer slow planning
> times, but in addition, memory consumption may also become an issue
> due to how the planner currently plans the query once per partition.
> These limitations are likely to be resolved in a future version of
> PostgreSQL.

How about slightly rewriting the sentence toward the end as:

memory consumption may also become an issue, because the planner currently
plans the query once for every partition.

> I've not really thought too much on the fact that the issue also
> exists with inheritance tables in earlier version too.

That's fine maybe.

Thanks,
Amit




Re: Oddity with parallel safety test for scan/join target in grouping_planner

2019-03-10 Thread Tom Lane
Etsuro Fujita  writes:
>>> The parallel safety of the final scan/join target is determined from the
>>> grouping target, not that target, which [ is wrong ]

> This would only affect plan quality a little bit, so I don't think we 
> need a regression test for this, either, but the fix might destabilize 
> existing plan choices, so we should apply it to HEAD only?

Is that the only possible outcome?  Per Robert's summary quoted above,
it seems like it might be possible for the code to decide that the final
scan/join target is parallel safe when it is not, leading to outright
wrong answers or query failures.  If the wrong answer can only be wrong
in the safe direction, it's not very clear why.

regards, tom lane



Re: Oddity with parallel safety test for scan/join target in grouping_planner

2019-03-10 Thread Etsuro Fujita

(2019/03/09 5:36), Tom Lane wrote:
> Etsuro Fujita  writes:
>> (2019/02/28 0:52), Robert Haas wrote:
>>> On Tue, Feb 26, 2019 at 7:26 AM Etsuro Fujita
>>>   wrote:
>>>> The parallel safety of the final scan/join target is determined from the
>>>> grouping target, not that target, which [ is wrong ]

>>> Your patch looks right to me.

>> I think it would be better for you to take this one.  Could you?

> I concur with Robert that this is a brown-paper-bag-grade bug in
> 960df2a97.  Please feel free to push (and don't forget to back-patch).

OK, will do.

> The only really interesting question is whether it's worth adding
> a regression test for.  I doubt it; the issue seems much too narrow.
> Usually the point of a regression test is to prevent re-introduction
> of the same/similar bug, but what class of bugs would you argue
> we'd be protecting against?


Plan degradation; without the fix, we would have this on data created by 
"pgbench -i -s 10 postgres", as shown in [1]:


postgres=# set parallel_setup_cost = 10;
postgres=# set parallel_tuple_cost = 0.001;

postgres=# explain verbose select aid+bid, sum(abalance), random() from 
pgbench_accounts group by aid+bid;

 QUERY PLAN


 GroupAggregate  (cost=137403.01..159903.01 rows=100 width=20)
   Output: ((aid + bid)), sum(abalance), random()
   Group Key: ((pgbench_accounts.aid + pgbench_accounts.bid))
   ->  Sort  (cost=137403.01..139903.01 rows=100 width=8)
 Output: ((aid + bid)), abalance
 Sort Key: ((pgbench_accounts.aid + pgbench_accounts.bid))
 ->  Gather  (cost=10.00..24070.67 rows=100 width=8)
-->Output: (aid + bid), abalance
   Workers Planned: 2
   ->  Parallel Seq Scan on public.pgbench_accounts 
(cost=0.00..20560.67 rows=416667 width=12)

 Output: aid, bid, abalance
(11 rows)

The final scan/join target {(aid + bid), abalance} would be parallel
safe, but in the plan the target is not parallelized across workers.
The reason is that the parallel safety of the target is
assessed incorrectly using the grouping target {((aid + bid)),
sum(abalance), random()}, which would not be parallel safe.  With the fix
we would have this:


postgres=# explain verbose select aid+bid, sum(abalance), random() from 
pgbench_accounts group by aid+bid;

QUERY PLAN

---
 GroupAggregate  (cost=135944.68..158444.68 rows=100 width=20)
   Output: ((aid + bid)), sum(abalance), random()
   Group Key: ((pgbench_accounts.aid + pgbench_accounts.bid))
   ->  Sort  (cost=135944.68..138444.68 rows=100 width=8)
 Output: ((aid + bid)), abalance
 Sort Key: ((pgbench_accounts.aid + pgbench_accounts.bid))
 ->  Gather  (cost=10.00..22612.33 rows=100 width=8)
   Output: ((aid + bid)), abalance
   Workers Planned: 2
   ->  Parallel Seq Scan on public.pgbench_accounts 
(cost=0.00..21602.33 rows=416667 width=8)

-->  Output: (aid + bid), abalance
(11 rows)

Note that the final scan/join target is parallelized in the plan.

This would only affect plan quality a little bit, so I don't think we 
need a regression test for this, either, but the fix might destabilize 
existing plan choices, so we should apply it to HEAD only?


Best regards,
Etsuro Fujita




Re: Portability of strtod (was Re: pgsql: Include GUC's unit, if it has one, in out-of-range error message)

2019-03-10 Thread Amit Kapila
On Mon, Mar 11, 2019 at 8:45 AM Tom Lane  wrote:
>
> Michael Paquier  writes:
> > On Sun, Mar 10, 2019 at 07:18:20PM +, Tom Lane wrote:
> I think what's going on here is what's mentioned in the comments in
> float8in_internal:
>
>  * C99 requires that strtod() accept NaN, [+-]Infinity, and [+-]Inf,
>  * but not all platforms support all of these (and some accept them
>  * but set ERANGE anyway...)
>
> Specifically, these symptoms would be explained if these platforms'
> strtod() sets ERANGE for infinity.
>
> I can think of three plausible responses.  In decreasing order of
> amount of work:
>
> 1. Decide that we'd better wrap strtod() with something that ensures
> platform-independent behavior for all our uses of strtod (and strtof?)
> rather than only float8in_internal.
>

This sounds like a good approach, but won't it carry the risk of a change
in behaviour?

> 2. Put in a hack in guc.c to make it ignore ERANGE as long as the result
> satisfies isinf().  This would ensure GUC cases would go through the
> value-out-of-range path rather than the syntax-error path.  We've got
> a bunch of other strtod() calls that are potentially subject to similar
> platform dependencies though ...
>

Yeah, this won't completely fix the symptom.

> 3. Decide this isn't worth avoiding platform dependencies for, and just
> take out the new regression test case.  I'd only put in that test on
> the spur of the moment anyway, so it's hard to argue that it's worth
> much.
>

For the time being, option 3 sounds like a reasonable approach to fix the
buildfarm failures; later, if we want to do some bigger surgery
based on option 1 or some other option, we can always do it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: using index or check in ALTER TABLE SET NOT NULL

2019-03-10 Thread David Rowley
On Mon, 11 Mar 2019 at 00:13, Sergei Kornilov  wrote:
>
> > Providing I'm imagining it correctly, I do think the patch is slightly
> > cleaner with that change.
>
> Ok, sounds reasonable. I changed patch this way.

Looks good to me.  Good idea to keep the controversial setting of
client_min_messages to debug1 in the tests in a separate patch.

Apart from a few lines that need to be wrapped at 80 chars and some
comment lines that seem to have been wrapped early, I'm happy for it
to be marked as ready for committer, but I'll defer to Ildar in case
he wants to have another look.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Portability of strtod (was Re: pgsql: Include GUC's unit, if it has one, in out-of-range error message)

2019-03-10 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Mar 10, 2019 at 07:18:20PM +, Tom Lane wrote:
>> Include GUC's unit, if it has one, in out-of-range error messages.

> It does not seem to have cooled down all animals yet, whelk is
> still complaining:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2019-03-11%2000%3A41%3A13

Yeah, also the HPUX animals (gaur isn't booted up at the moment, but
I bet it'd fail too).

I think what's going on here is what's mentioned in the comments in
float8in_internal:

 * C99 requires that strtod() accept NaN, [+-]Infinity, and [+-]Inf,
 * but not all platforms support all of these (and some accept them
 * but set ERANGE anyway...)

Specifically, these symptoms would be explained if these platforms'
strtod() sets ERANGE for infinity.

I can think of three plausible responses.  In decreasing order of
amount of work:

1. Decide that we'd better wrap strtod() with something that ensures
platform-independent behavior for all our uses of strtod (and strtof?)
rather than only float8in_internal.

2. Put in a hack in guc.c to make it ignore ERANGE as long as the result
satisfies isinf().  This would ensure GUC cases would go through the
value-out-of-range path rather than the syntax-error path.  We've got
a bunch of other strtod() calls that are potentially subject to similar
platform dependencies though ...

3. Decide this isn't worth avoiding platform dependencies for, and just
take out the new regression test case.  I'd only put in that test on
the spur of the moment anyway, so it's hard to argue that it's worth
much.
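
For concreteness, a minimal sketch of what option 2 could look like, assuming
a hypothetical helper function rather than the actual guc.c code:

#include <errno.h>
#include <math.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Illustrative only: parse a double, but treat ERANGE as benign when
 * strtod() nevertheless produced an infinity, so such inputs reach the
 * value-out-of-range path instead of the syntax-error path.
 */
static bool
parse_double_lenient(const char *value, double *result)
{
    char   *endptr;
    double  val;

    errno = 0;
    val = strtod(value, &endptr);

    /* no digits, or trailing junk: a genuine syntax error */
    if (endptr == value || *endptr != '\0')
        return false;

    /*
     * Some platforms set ERANGE even while accepting "Infinity"/"Inf".
     * Ignore ERANGE in that case; keep rejecting finite over/underflow.
     */
    if (errno == ERANGE && !isinf(val))
        return false;

    *result = val;
    return true;
}

The caller could then test isinf() on the parsed value itself and report it
as out of range for the particular GUC, rather than as a syntax error.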

Thoughts?

regards, tom lane



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2019-03-10 Thread Tomas Vondra



On 3/10/19 11:27 PM, David Rowley wrote:
> On Mon, 11 Mar 2019 at 06:36, Tomas Vondra  
> wrote:
>>
>> On 3/9/19 7:33 PM, Dean Rasheed wrote:
>>> I wonder if it's possible to write smaller, more targeted tests.
>>> Currently "stats_ext" is by far the slowest test in its group, and I'm
>>> not sure that some of those tests add much. It ought to be possible to
>>> write a function that calls EXPLAIN and returns a query's row
>>> estimate, and then you could write tests to confirm the effect of the
>>> new stats by verifying the row estimates change as expected.
>>
>> Sure, if we can write more targeted tests, that would be good. But it's
>> not quite clear to me how wrapping EXPLAIN in a function makes those
>> tests any faster?
> 
> I've not looked at the tests in question, but if they're executing an
> inferior plan that's used when no extended stats exist, then maybe that's
> why they're slow.
> 

I don't think the tests are executing any queries - the tests merely
generate execution plans, without executing them.

> I think Dean might mean to create a function similar to
> explain_parallel_append() in partition_prune.sql then write tests that
> check the row estimate with EXPLAIN (COSTS ON) but strip out the other
> costing stuff instead of validating that the poor plan was chosen.
> 

I'm not opposed to doing that, of course. I'm just not sure it's a way
to make the tests faster. Will investigate.

>> On 3/10/19 2:09 PM, Dean Rasheed wrote:
>>> 12). bms_member_index() should surely be in bitmapset.c. It could be
>>> more efficient by just traversing the bitmap words and making use of
>>> bmw_popcount(). Also, its second argument should be of type 'int' for
>>> consistency with other bms_* functions.
>>
>> Yes, moving to bitmapset.c definitely makes sense. I don't see how it
>> could use bms_popcount() though.
> 
> I think it could be done by first checking if the parameter is a
> member of the set, and then if so, count all the bits that come on and
> before that member. You can use bmw_popcount() for whole words before
> the specific member's word then just bitwise-and a bit mask of a
> bitmapword that has all bits set for all bits on and before your
> parameter's BITNUM(), and add the bmw_popcount of the final word
> bitwise-anding the mask. bms_add_range() has some masking code you
> could copy.
> 

Ah, right - that would work.
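
For reference, a rough, untested sketch of that approach (returning the
0-based index of the member), relying on bitmapset.c's internal WORDNUM/BITNUM
macros and bmw_popcount(); this is only an illustration of the idea, not the
code from the patch:

/*
 * bms_member_index - sketch only
 *      Return the 0-based position of 'x' among the members of 'a',
 *      or -1 if 'x' is not a member.
 */
int
bms_member_index(Bitmapset *a, int x)
{
    int         wordnum = WORDNUM(x);
    int         bitnum = BITNUM(x);
    bitmapword  mask;
    int         result = 0;
    int         i;

    if (!bms_is_member(x, a))
        return -1;

    /* count the set bits in all words before the word containing 'x' */
    for (i = 0; i < wordnum; i++)
        result += bmw_popcount(a->words[i]);

    /* then count only the bits below 'x' in its own word */
    mask = ((bitmapword) 1 << bitnum) - 1;
    result += bmw_popcount(a->words[wordnum] & mask);

    return result;
}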


cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WAL logging problem in 9.4.3?

2019-03-10 Thread Noah Misch
This has been waiting for a review since October, so I reviewed it.  The code
comment at PendingRelSync summarizes the design well, and I like that design.
I also liked the design in the https://postgr.es/m/559fa0ba.3080...@iki.fi
last paragraph, and I suspect it would have been no harder to back-patch.  I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that.  Let's keep pursuing your current design.

This moves a shared_buffers scan and smgrimmedsync() from commands like COPY
to COMMIT.  Users setting a timeout on COMMIT may need to adjust, and
log_min_duration_statement analysis will reflect the change.  I feel that's
fine.  (There already exist ways for COMMIT to be slow.)

On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> --- a/src/backend/access/nbtree/nbtsort.c
> +++ b/src/backend/access/nbtree/nbtsort.c
> @@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, 
> BlockNumber blkno)
>   /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
>   RelationOpenSmgr(wstate->index);
>  
> - /* XLOG stuff */
> - if (wstate->btws_use_wal)
> + /* XLOG stuff
> +  *
> +  * Even if minimal mode, WAL is required here if truncation happened 
> after
> +  * being created in the same transaction. It is not needed otherwise but
> +  * we don't bother identifying the case precisely.
> +  */
> + if (wstate->btws_use_wal ||
> + (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))

We initialized "btws_use_wal" like this:

#define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
#define RelationNeedsWAL(relation) \
((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);

Hence, this change causes us to emit WAL for the metapage of a
RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation.  We should never do
that.  If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
relfilenode.  I've attached a test case for this; it is a patch that applies
on top of your v7 patches.  The test checks for orphaned files after redo.

> +  * If no tuple was inserted, it's possible that we are truncating a
> +  * relation. We need to emit WAL for the metapage in the case. However 
> it
> +  * is not required elsewise,

Did you mean to write more words after that comma?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c

> + * NB: after WAL-logging has been skipped for a block, we must not WAL-log
> + * any subsequent actions on the same block either. Replaying the WAL record
> + * of the subsequent action might fail otherwise, as the "before" state of
> + * the block might not match, as the earlier actions were not WAL-logged.

Good point.  To participate in WAL redo properly, each "before" state must
have a distinct pd_lsn.  In CREATE INDEX USING btree, the initial index build
skips WAL, but an INSERT later in the same transaction writes WAL.  There,
however, each "before" state does have a distinct pd_lsn; the initial build
has pd_lsn==0, and each subsequent state has a pd_lsn driven by WAL position.
Hence, I think the CREATE INDEX USING btree behavior is fine, even though it
doesn't conform to this code comment.

I think this restriction applies only to full_page_writes=off.  Otherwise, the
first WAL-logged change will find pd_lsn==0 and emit a full-page image.  With
a full-page image in the record, the block's "before" state doesn't matter.
Also, one could make it safe to write WAL for a particular block by issuing
heap_sync() for the block's relation.

> +/*
> + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> + */
> +void
> +RelationRemovePendingSync(Relation rel)

What is the coding rule for deciding when to call this?  Currently, only
ATExecSetTableSpace() calls this.  CLUSTER doesn't call it, despite behaving
much like ALTER TABLE SET TABLESPACE behaves.

> +{
> + bool found;
> +
> + rel->pending_sync = NULL;
> + rel->no_pending_sync = true;
> + if (pendingSyncs)
> + {
> + elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
> + hash_search(pendingSyncs, (void *) >rd_node, HASH_REMOVE, 
> );
> + }
> +}

We'd need a mechanism to un-remove the sync at subtransaction abort.  My
attachment includes a test case demonstrating the consequences of that defect.
Please look for other areas that need to know about subtransactions; patch v7
had no code pertaining to subtransactions.

> + elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block 
> %u, because sync_above is %u",

As you mention upthread, you have many debugging elog()s.  These are too
detailed to include in every binary, but I do want them in the code.  See
CACHE_elog() for a good example of achieving that.

> +/*
> + * Sync to disk any relations that 

RE: Timeout parameters

2019-03-10 Thread Nagaura, Ryohei
Hi, Fabien-san.

> From: Fabien COELHO 
> As I understand the "client-side timeout" use-case, as distinct from
> network-issues-related timeouts proposed in the other patches, the point is 
> more
> or less to deal with an overloaded server.
The main purpose of this parameter is to keep the client from waiting for the
DB server indefinitely, not to reduce the server's burden.
As a result, the end user is not kept waiting, which is what matters most.
As Tsunakawa-san presented, there are similar parameters in JDBC and Oracle, so
I think that this idea is valid.

> The timeout is raised because getting the result takes time, which may be 
> simply
> because the query really takes time, and it does not imply that the server 
> would
> not be able to process a cancel request.
You mentioned the case where a SQL query is heavy, but I want to talk about
the case where the OS hangs.
If an OS hang occurs, the cost of processing a cancel request is very high.
This is the reason why the implementation only disconnects, without
canceling.

Best regards,
-
Ryohei Nagaura





Re: Pluggable Storage - Andres's take

2019-03-10 Thread Haribabu Kommi
On Sat, Mar 9, 2019 at 2:18 PM Andres Freund  wrote:

> Hi,
>
> On 2019-03-09 11:03:21 +1100, Haribabu Kommi wrote:
> > Here I attached the rebased patches that I shared earlier. I am adding
> the
> > comments to explain the API's in the code, will share those patches
> later.
>
> I've started to add those for the callbacks in the first commit. I'd
> appreciate a look!
>

Thanks for the updated patches.

+ /*
+  * Callbacks for hon-modifying operations on individual tuples
+  *

Typo in tableam.h file. hon->non



> I think I'll include the docs patches, sans the callback documentation,
> in the next version. I'll probably merge them into one commit, if that's
> OK with you?
>

OK.
For easy review, I will still maintain 3 or 4 patches instead of the
current patch
series.


> > I observed a crash with the latest patch series in the COPY command.
>
> Hm, which version was this? I'd at some point accidentally posted a
> 'tmp' commit that was just a performance hack
>

Yes, the version I checked did have that commit.
Maybe that is the reason for the failure.

> Btw, your patches always are attached out of order:
>
> https://www.postgresql.org/message-id/CAJrrPGd%2Brkz54wE-oXRojg4XwC3bcF6bjjRziD%2BXhFup9Q3n2w%40mail.gmail.com
> 10, 1, 3, 4, 2 ...
>

Sorry about that.
I always wondered why they were ordered that way when I attached the patch
files to the mail.
I thought it might be Gmail behavior, but by experimenting I found that, when
adding multiple patches, the last selected patch is given preference and is
listed as the first attachment.

I will take care that this problem doesn't happen again.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Should we add GUCs to allow partition pruning to be disabled?

2019-03-10 Thread David Rowley
On Mon, 11 Mar 2019 at 15:00, David Rowley  wrote:
>
> On Mon, 11 Mar 2019 at 14:33, Amit Langote
>  wrote:
> > PG 11 moved the needle a bit for SELECT queries:
> >
> > Excluding unnecessary partitions is slow for UPDATE and DELETE queries,
>
> With those words I expect the user might be surprised that it's still
> slow after doing SET enable_partition_pruning = off;

I had in mind in 10, 11 and master add a note to mention:

Currently, it is not recommended to have partition hierarchies more
than a few hundred partitions.  Larger partition hierarchies can
suffer from slow planning times with SELECT
queries.  Planning times for UPDATE and
DELETE commands may also suffer slow planning
times, but in addition, memory consumption may also become an issue
due to how the planner currently plans the query once per partition.
These limitations are likely to be resolved in a future version of
PostgreSQL.

I've not really thought too much on the fact that the issue also
exists with inheritance tables in earlier version too.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Should we add GUCs to allow partition pruning to be disabled?

2019-03-10 Thread Amit Langote
On 2019/03/11 11:00, David Rowley wrote:
> On Mon, 11 Mar 2019 at 14:33, Amit Langote
>  wrote:
>> PG 11 moved the needle a bit for SELECT queries:
>>
>> Excluding unnecessary partitions is slow for UPDATE and DELETE queries,
> 
> With those words I expect the user might be surprised that it's still
> slow after doing SET enable_partition_pruning = off;
> 
> I'm not really talking about constraint exclusion or partition
> pruning. The memory growth problem the user was experiencing was down
> to the fact that we plan once per partition and each of the
> PlannerInfos used for each planner run has a RangeTblEntry for all
> partitions. This means if you add one more partition and you get N
> partitions more RangeTblEntry items in memory. This is the quadratic
> memory growth that I mentioned in the -general post.

Yeah, I get it.  As I said in my email, all we have ever mentioned in the
documentation as the reason for queries on partitioned tables being slow
is that partition exclusion is slow and nothing else.  Can we put
quadratic memory growth during planning as the reason for performance
degradation into the documentation?  Maybe we could, but every time I
tried it, it didn't read like user-facing documentation to me.  Do you
have something in mind that we could add?

Thanks,
Amit




RE: ECPG regression with DECLARE STATEMENT support

2019-03-10 Thread Kuroda, Hayato
Dear Rushabh, Michael,

I attached a simple bug-fixing patch.
Could you review it?

The added logic is:

1. Send a close statement to the backend process regardless of the existence
of a cursor.

2. If the ecpg_do function returns false, raise a “cursor is invalid” error.

3. Remove the cursor from the application.

I already checked that this patch passes the regression tests and Rushabh’s test code.

Best Regards,
Hayato Kuroda
Fujitsu LIMITED



DeclareStmt_fix_close_bug.patch
Description: DeclareStmt_fix_close_bug.patch


Re: Should we add GUCs to allow partition pruning to be disabled?

2019-03-10 Thread David Rowley
On Mon, 11 Mar 2019 at 14:33, Amit Langote
 wrote:
> PG 11 moved the needle a bit for SELECT queries:
>
> Excluding unnecessary partitions is slow for UPDATE and DELETE queries,

With those words I expect the user might be surprised that it's still
slow after doing SET enable_partition_pruning = off;

I'm not really talking about constraint exclusion or partition
pruning. The memory growth problem the user was experiencing was down
to the fact that we plan once per partition and each of the
PlannerInfos used for each planner run has a RangeTblEntry for all
partitions. This means if you add one more partition to N partitions, you
get N more RangeTblEntry items in memory. This is the quadratic
memory growth that I mentioned in the -general post.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Should we add GUCs to allow partition pruning to be disabled?

2019-03-10 Thread Amit Langote
On 2019/03/11 0:25, Justin Pryzby wrote:
> On Sun, Mar 10, 2019 at 10:53:02PM +1300, David Rowley wrote:
>> On Fri, 11 May 2018 at 17:37, Amit Langote  
>> wrote:
>>> 5. The last sentence in caveats, that is,
>>>
>>> "Partitioning using these techniques will work well with up to perhaps a
>>> hundred partitions; don't try to use many thousands of partitions."
>>>
>>> should perhaps be reworded as:
>>>
>>> "So the legacy inheritance based partitioning will work well with up to
>>> perhaps a hundred partitions; don't try to use many thousands of 
>>> partitions."
> 
>> In the -general post, I was just about to point them at the part in
>> the documents that warn against these large partition hierarchies, but
>> it looks like the warning was removed in bebc46931a1, or at least
>> modified to say that constraint exclusion with inheritance tables is
>> slow. I really wonder if we shouldn't put something back in there to
>> warn against this sort of thing.
> 
> +1
> 
> I believe I was of the same mind when I wrote:
> https://www.postgresql.org/message-id/flat/20180525215002.GD14378%40telsasoft.com#c9de33b17fe63cecad4ac30fb1662531

I agree PG 11 didn't improve things enough to have removed such a warning
from the documentation even for partitioning.  Actually, we only ever had
a warning about constraint exclusion getting slower as more children are
added, but nothing about UPDATE/DELETE planning being slow in itself;
perhaps more importantly, much slower than SELECT.  It seems very hard to
put that in the documentation though.

In PG 10:

Excluding unnecessary partitions is slow, especially as the number of
partitions increases, because constraint exclusion needs to look at each
partition to determine whether it could be excluded.  Also, planning for
UPDATE and DELETE queries is significantly slower than for SELECT queries
for $REASONS.  Given that, it is wise to use up to a few hundred
partitions but not more.


PG 11 moved the needle a bit for SELECT queries:

Excluding unnecessary partitions is slow for UPDATE and DELETE queries,
especially as the number of partitions increases, because constraint
exclusion needs to look at each partition to determine whether it could be
excluded.  Also, planning for UPDATE and DELETE queries is significantly
slower than for SELECT queries for $REASONS.  Given that, it is wise to
use up to a few hundred partitions but not more.

Thanks,
Amit




Re: Copy function for logical replication slots

2019-03-10 Thread Masahiko Sawada
On Mon, Feb 25, 2019 at 4:50 PM Masahiko Sawada  wrote:
>
> On Wed, Feb 20, 2019 at 1:00 PM Masahiko Sawada  wrote:
> >
> > BTW, XLogRecPtrIsInvalid(copy_restart_lsn) ||  copy_restart_lsn <
> > src_restart_lsn is redundant, the former should be removed.
> >
>
> So attached the updated version patch.
>

Since the patch failed to build, I've attached an updated version of the patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


v11-0001-Add-copy-function-for-replication-slots.patch
Description: Binary data


Re: Binary upgrade from <12 to 12 creates toast table for partitioned tables

2019-03-10 Thread Andres Freund
On 2019-03-10 13:29:06 -0300, Alvaro Herrera wrote:
> Hello
> 
> On 2019-Mar-08, Andres Freund wrote:
> 
> > > diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
> > > index e962ae7e913..1de8da59361 100644
> > > --- a/src/bin/pg_dump/pg_dump.c
> > > +++ b/src/bin/pg_dump/pg_dump.c
> > > @@ -4359,9 +4359,9 @@ binary_upgrade_set_type_oids_by_rel_oid(Archive 
> > > *fout,
> > > "SELECT c.reltype AS crel, t.reltype 
> > > AS trel "
> > > "FROM pg_catalog.pg_class c "
> > > "LEFT JOIN pg_catalog.pg_class t ON "
> > > -   "  (c.reltoastrelid = t.oid) "
> > > +   "  (c.reltoastrelid = t.oid AND 
> > > c.relkind <> '%c') "
> > > "WHERE c.oid = '%u'::pg_catalog.oid;",
> > > -   pg_rel_oid);
> > > +   RELKIND_PARTITIONED_TABLE, 
> > > pg_rel_oid);
> > 
> > Hm, I know this code isn't generally well documented, but perhaps we
> > could add a comment as to why we're excluding partitioned tables?
> 
> I added a short comment nearby.  Hopefully that's sufficient.  Let's see
> what the buildfarm members have to say now.

Thanks!  Looks like crake's doing better.

Greetings,

Andres Freund



RE: speeding up planning with partitions

2019-03-10 Thread Imai, Yoshikazu
Amit-san,

On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote:
> On 2019/03/08 16:16, Imai, Yoshikazu wrote:
> > So I modified the code and did a test to confirm that the memory increase
> > doesn't happen. The test and results are below.
> >
> > [test]
> > * Create partitioned table with 1536 partitions.
> > * Execute "update rt set a = random();"
> >
> > [results]
> > A backend uses below amount of memory in update transaction:
> >
> > HEAD: 807MB
> > With v26-0001, 0002: 790MB
> > With v26-0001, 0002, 0003: 860MB
> > With v26-0003 modified: 790MB
> 
> Can you measure with v28, or better attached v29 (about which more below)?
> 
> > I attached the diff of modification for v26-0003 patch which also
> contains some refactoring.
> > Please see if it is ok.
> 
> I like the is_first_child variable which somewhat improves readability,
> so updated the patch to use it.
> 
> Maybe you know that range_table_mutator() spends quite a long time if
> there are many target children, but I realized there's no need for
> range_table_mutator() to copy/mutate child target RTEs.  First, there's
> nothing to translate in their case.  Second, copying them is not necessary
> either, because they're not modified across different planning cycles.  If
> we pass to adjust_appendrel_attrs only the RTEs in the original range
> table (that is, before any child target RTEs were added), then
> range_table_mutator() has to do significantly less work and allocates
> lot less memory than before.  I've broken this change into its own patch;
> see patch 0004.

Cool!
I tested with the v29 patches and confirmed that they save a lot of memory:

HEAD: 807MB
With v29-0001, 0002, 0003, 0004: 167MB

Maybe planning time in this case is also improved, but I didn't check it.


> but I realized there's no need for
> range_table_mutator() to copy/mutate child target RTEs.  First, there's
> nothing to translate in their case.  Second, copying them is not necessary
> either, because they're not modified across different planning cycles.

Yeah, although I couldn't check the code in detail, from the below
comments in inheritance_planner(), ISTM we need copies of subquery RTEs but
not copies of other RTEs in each planning cycle.

/*   
 * We generate a modified instance of the original Query for each target
 * relation, plan that, and put all the plans into a list that will be
 * controlled by a single ModifyTable node.  All the instances share the
 * same rangetable, but each instance must have its own set of subquery
 * RTEs within the finished rangetable because (1) they are likely to get
 * scribbled on during planning, and (2) it's not inconceivable that
 * subqueries could get planned differently in different cases.  We need
 * not create duplicate copies of other RTE kinds, in particular not the
 * target relations, because they don't have either of those issues.  Not
 * having to duplicate the target relations is important because doing so
 * (1) would result in a rangetable of length O(N^2) for N targets, with
 * at least O(N^3) work expended here; and (2) would greatly complicate
 * management of the rowMarks list.


Thanks 
--
Yoshikazu Imai


Re: what makes the PL cursor life-cycle must be in the same transaction?

2019-03-10 Thread Andy Fan
DECLARE cur CURSOR with hold FOR SELECT * FROM t;

the "with hold"  is designed for this purpose.  sorry for this
interruption.

On Sun, Mar 10, 2019 at 4:14 PM Andy Fan  wrote:

> for example:
> begin;
> declare cur cursor for select * from t;
> insert into t2 values(...);
> fetch next cur;
> commit;
>
> // after this,  I can't fetch cur any more.
>
> My questions are:
> 1.  Is this a must in principle, or is it just easier to implement this way in PG?
> 2.  Would anything bad happen if I kept the named portal (for the cursor)
> available even after the transaction commits, so that I can fetch from the
> cursor after the transaction is committed?
>
> Thanks
>


Re: pgsql: Removed unused variable, openLogOff.

2019-03-10 Thread Michael Paquier
On Fri, Mar 08, 2019 at 10:27:52AM +0900, Michael Paquier wrote:
> After sleeping on it, let's live with just switching to nleft in the
> message, without openLogOff as that's the second time folks complain
> about the previous code.  So I just propose the attached.  Robert,
> others, any objections?  Perhaps you would prefer fixing it yourself?

Okay, done.
--
Michael




Add missing operator <->(box, point)

2019-03-10 Thread Nikita Glukhov

Hi, hackers.

Attached patches add missing distance operator <->(box, point).

We already have the reverse operator <->(point, box), but it can't be used for kNN
search in GiST and SP-GiST.  GiST and SP-GiST now support kNN searches over more
complex polygons and circles, but do not support the simpler boxes, which seems
inconsistent.

Description of the patches:
1. Add function dist_pb(box, point) and operator <->.

2. Add <-> to GiST box_ops.
   Extracted gist_box_distance_helper() common for gist_box_distance() and
   gist_bbox_distance().

3. Add <-> to SP-GiST.
   Changed only catalog and tests.  Box case is already checked in
   spg_box_quad_leaf_consistent():
 out->recheckDistances = distfnoid == F_DIST_POLYP;


--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

>From c2655e83ae0bd7798e5cfa1a38ace01e38f7f43b Mon Sep 17 00:00:00 2001
From: Nikita Glukhov 
Date: Thu, 7 Mar 2019 20:22:49 +0300
Subject: [PATCH 1/3] Add operator <->(box, point)

---
 src/backend/utils/adt/geo_ops.c| 12 +
 src/include/catalog/pg_operator.dat|  5 +-
 src/include/catalog/pg_proc.dat|  3 ++
 src/test/regress/expected/geometry.out | 96 +-
 src/test/regress/sql/geometry.sql  |  2 +-
 5 files changed, 68 insertions(+), 50 deletions(-)

diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c
index 28e85e3..a8f9111 100644
--- a/src/backend/utils/adt/geo_ops.c
+++ b/src/backend/utils/adt/geo_ops.c
@@ -2419,6 +2419,18 @@ dist_pb(PG_FUNCTION_ARGS)
 }
 
 /*
+ * Distance from a box to a point
+ */
+Datum
+dist_bp(PG_FUNCTION_ARGS)
+{
+	BOX		   *box = PG_GETARG_BOX_P(0);
+	Point	   *pt = PG_GETARG_POINT_P(1);
+
+	PG_RETURN_FLOAT8(box_closept_point(NULL, box, pt));
+}
+
+/*
  * Distance from a lseg to a line
  */
 Datum
diff --git a/src/include/catalog/pg_operator.dat b/src/include/catalog/pg_operator.dat
index 06aec07..ebace5a 100644
--- a/src/include/catalog/pg_operator.dat
+++ b/src/include/catalog/pg_operator.dat
@@ -666,7 +666,10 @@
   oprresult => 'float8', oprcode => 'dist_ps' },
 { oid => '615', descr => 'distance between',
   oprname => '<->', oprleft => 'point', oprright => 'box',
-  oprresult => 'float8', oprcode => 'dist_pb' },
+  oprresult => 'float8', oprcom => '<->(box,point)', oprcode => 'dist_pb' },
+{ oid => '606', descr => 'distance between',
+  oprname => '<->', oprleft => 'box', oprright => 'point',
+  oprresult => 'float8', oprcom => '<->(point,box)', oprcode => 'dist_bp' },
 { oid => '616', descr => 'distance between',
   oprname => '<->', oprleft => 'lseg', oprright => 'line',
   oprresult => 'float8', oprcode => 'dist_sl' },
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 562c540..3800150 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1062,6 +1062,9 @@
 { oid => '364',
   proname => 'dist_pb', prorettype => 'float8', proargtypes => 'point box',
   prosrc => 'dist_pb' },
+{ oid => '357',
+  proname => 'dist_bp', prorettype => 'float8', proargtypes => 'box point',
+  prosrc => 'dist_bp' },
 { oid => '365',
   proname => 'dist_sb', prorettype => 'float8', proargtypes => 'lseg box',
   prosrc => 'dist_sb' },
diff --git a/src/test/regress/expected/geometry.out b/src/test/regress/expected/geometry.out
index 055d32c..5efe80a 100644
--- a/src/test/regress/expected/geometry.out
+++ b/src/test/regress/expected/geometry.out
@@ -602,54 +602,54 @@ SELECT p.f1, l.s, p.f1 <-> l.s FROM POINT_TBL p, LSEG_TBL l;
 (72 rows)
 
 -- Distance to box
-SELECT p.f1, b.f1, p.f1 <-> b.f1 FROM POINT_TBL p, BOX_TBL b;
-f1 | f1  |  ?column?  
+-+
- (0,0) | (2,2),(0,0) |  0
- (0,0) | (3,3),(1,1) |  1.41421356237
- (0,0) | (-2,2),(-8,-10) |  2
- (0,0) | (2.5,3.5),(2.5,2.5) |  3.53553390593
- (0,0) | (3,3),(3,3) |  4.24264068712
- (-10,0)   | (2,2),(0,0) | 10
- (-10,0)   | (3,3),(1,1) |  11.0453610172
- (-10,0)   | (-2,2),(-8,-10) |  2
- (-10,0)   | (2.5,3.5),(2.5,2.5) |   12.747548784
- (-10,0)   | (3,3),(3,3) |  13.3416640641
- (-3,4)| (2,2),(0,0) |  3.60555127546
- (-3,4)| (3,3),(1,1) |  4.12310562562
- (-3,4)| (-2,2),(-8,-10) |  2
- (-3,4)| (2.5,3.5),(2.5,2.5) |  5.52268050859
- (-3,4)| (3,3),(3,3) |   6.0827625303
- (5.1,34.5)| (2,2),(0,0) |  32.6475113906
- (5.1,34.5)| (3,3),(1,1) |  31.5699223946
- (5.1,34.5)| (-2,2),(-8,-10) |  33.2664996656
- (5.1,34.5)| (2.5,3.5),(2.5,2.5) |   31.108841187
- (5.1,34.5)  

Re: pgbench MAX_ARGS

2019-03-10 Thread Dagfinn Ilmari Mannsåker
David Rowley  writes:

> On Wed, 27 Feb 2019 at 01:57, Simon Riggs  wrote:
>> I've put it as 256 args now.
>
> I had a look at this and I see you've added some docs to mention the
> number of parameters that are allowed; good.
>
> +   pgbench supports up to 256 variables in one
> +   statement.
>
> However, the code does not allow 256 variables as the documents claim.
> Per >= in:
>
>   if (cmd->argc >= MAX_ARGS)
>   {
>   fprintf(stderr, "statement has too many arguments (maximum is %d): %s\n",
>
> For it to be 256 that would have to be > MAX_ARGS.

Which would overflow 'char *argv[MAX_ARGS];'.

> I also don't agree with this change:
>
> - MAX_ARGS - 1, cmd->lines.data);
> + MAX_ARGS, cmd->lines.data);
>
> The 0th element of the argv array was for the sql, per:
>
> cmd->argv[0] = sql;
>
> then the 9 others were for the variables, so the MAX_ARGS - 1 was
> correct originally.

The same goes for backslash commands, which use argv[] for each argument
word in the command, and argv[0] for the command word itself.

> I think some comments in the area to explain the 0th is for the sql
> would be a good idea too, that might stop any confusion in the
> future. I see that's documented in the struct header comment, but
> maybe worth a small note around that error message just to confirm the
> - 1 is not a mistake, and neither is the >= MAX_ARGS.

I have done this in the updated version of the patch, attached.

> Probably it's fine to define MAX_ARGS to 256 then put back the
> MAX_ARGS - 1 code so that we complain if we get more than 255
> unless 256 is really needed, of course, in which case MAX_ARGS will
> need to be 257.

I've kept it at 256, and adjusted the docs to say 255.

> The test also seems to test that 256 variables in a statement gives an
> error. That contradicts the documents that have been added, which say
> 256 is the maximum allowed.

I've adjusted the test (and the \shell test) to check for 256 variables
(arguments) exactly, and manually verified that it doesn't error on 255.

> Setting to WoA

Setting back to NR.

- ilmari
-- 
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
 the consequences of."  -- Skud's Meta-Law

>From 5d14c9c6ee9ba5c0a67ce7baf127704803d87a42 Mon Sep 17 00:00:00 2001
From: Dagfinn Ilmari Mannsåker 
Date: Sun, 10 Mar 2019 23:20:32 +
Subject: [PATCH] pgbench: increase the maximum number of variables/arguments

pgbench has a strange restriction to only allow 10 arguments, which is
too low for some real world uses.

Since MAX_ARGS is only used for an array of pointers and an array of
integers, increasing this should not be a problem, so increase the value to 256.

Add comments to clarify that MAX_ARGS includes the SQL statement or
backslash metacommand, respectively, since this caused some confusion as
to whether there was an off-by-one error in the error checking and
message.
---
 doc/src/sgml/ref/pgbench.sgml|  2 ++
 src/bin/pgbench/pgbench.c| 15 ++-
 src/bin/pgbench/t/001_pgbench_with_server.pl | 28 
 3 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 9d18524834..e1ab73e582 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -916,6 +916,8 @@ pgbench  options  d
value can be inserted into a SQL command by writing
:variablename.  When running more than
one client session, each session has its own set of variables.
+   pgbench supports up to 255 variables in one
+   statement.
   
 

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 5df54a8e57..4789ab92ee 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -476,7 +476,12 @@ typedef struct
  */
 #define SQL_COMMAND		1
 #define META_COMMAND	2
-#define MAX_ARGS		10
+
+/*
+ * max number of backslash command arguments or SQL variables,
+ * including the command or SQL statement itself
+ */
+#define MAX_ARGS		256
 
 typedef enum MetaCommand
 {
@@ -4124,6 +4129,10 @@ parseQuery(Command *cmd)
 			continue;
 		}
 
+		/*
+		 * cmd->argv[0] is the SQL statement itself, so the max number of
+		 * arguments is one less than MAX_ARGS
+		 */
 		if (cmd->argc >= MAX_ARGS)
 		{
 			fprintf(stderr, "statement has too many arguments (maximum is %d): %s\n",
@@ -4461,6 +4470,10 @@ process_backslash_command(PsqlScanState sstate, const char *source)
 	/* For all other commands, collect remaining words. */
 	while (expr_lex_one_word(sstate, _buf, _offset))
 	{
+		/*
+		 * my_command->argv[0] is the command itself, so the max number of
+		 * arguments is one less than MAX_ARGS
+		 */
 		if (j >= MAX_ARGS)
 			syntax_error(source, lineno, my_command->first_line, my_command->argv[0],
 		 "too many arguments", NULL, -1);
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl 

Re: what makes the PL cursor life-cycle must be in the same transaction?

2019-03-10 Thread Dagfinn Ilmari Mannsåker
Andy Fan  writes:

> for example:
> begin;
> declare cur cursor for select * from t;
> insert into t2 values(...);
> fetch next cur;
> commit;
>
> // after this,  I can't fetch cur any more.
>
> My question are:
> 1.  Is this must in principle?  or it is easy to implement as this in PG?

It is already implemented. If you declare the cursor WITH HOLD, you can
keep using it after the transaction commits.
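
For example (a minimal sketch, assuming a table t as in your example):

    BEGIN;
    DECLARE cur CURSOR WITH HOLD FOR SELECT * FROM t;
    FETCH NEXT FROM cur;
    COMMIT;

    -- the cursor survives the commit; its remaining rows were materialized
    FETCH NEXT FROM cur;
    CLOSE cur;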

> 2.  Any bad thing would happen if I keep the named portal (for the cursor)
> available even the transaction is commit, so that I can fetch the cursor
> after the transaction is committed?

According to the documentation
(https://www.postgresql.org/docs/current/sql-declare.html):

| In the current implementation, the rows represented by a held cursor
| are copied into a temporary file or memory area so that they remain
| available for subsequent transactions.  

> Thanks

- ilmari
-- 
"I use RMS as a guide in the same way that a boat captain would use
 a lighthouse.  It's good to know where it is, but you generally
 don't want to find yourself in the same spot." - Tollef Fog Heen



Re: Fwd: Add tablespace tap test to pg_rewind

2019-03-10 Thread Michael Paquier
On Sat, Mar 09, 2019 at 09:09:24AM +0900, Michael Paquier wrote:
> When working on the first version of pg_rewind for VMware with Heikki,
> tablespace support has been added only after, so what you propose is
> sensible I think.
> 
> +# Run the test in both modes.
> +run_test('local');
> Small nit: we should test for the remote mode here as well.

I got to think more about this one a bit more, and I think that it
would be good to also check the consistency of a tablespace created
before promotion.  If you copy the logic from 002_databases.pl, this
is not going to work because the primary and the standby would try to
create the tablespace in the same path, stepping on each other's
toes.  So what you need to do is to create the tablespace on the
primary before creating the standby.  This requires a bit more work
than what you propose here though as you basically need to extend
RewindTest::create_standby so as it is possible to pass extra
arguments to $node_master->backup.  And in this case the extra option
to use is --tablespace-mapping to make sure that the primary and the
standby have the same tablespace, but defined on different paths.
--
Michael




Re: Add missing CREATE TABLE IF NOT EXISTS table_name AS EXECUTE query;

2019-03-10 Thread Michael Paquier
On Sun, Mar 10, 2019 at 06:56:57PM +0530, Ramanarayana wrote:
> Will it be helpful if the comments section of ExecuteStmt in gram.y is
> updated to include the IF NOT EXISTS clause? Something like this:
> 
> CREATE TABLE  [IF NOT EXISTS] AS EXECUTE  [(params, ...)]

Not sure it helps much.  The other clauses don't include that much
details, and the grammar supported can be guessed easily from the
code.
--
Michael




Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2019-03-10 Thread David Rowley
On Mon, 11 Mar 2019 at 06:36, Tomas Vondra  wrote:
>
> On 3/9/19 7:33 PM, Dean Rasheed wrote:
> > I wonder if it's possible to write smaller, more targeted tests.
> > Currently "stats_ext" is by far the slowest test in its group, and I'm
> > not sure that some of those tests add much. It ought to be possible to
> > write a function that calls EXPLAIN and returns a query's row
> > estimate, and then you could write tests to confirm the effect of the
> > new stats by verifying the row estimates change as expected.
>
> Sure, if we can write more targeted tests, that would be good. But it's
> not quite clear to me how wrapping EXPLAIN in a function makes those
> tests any faster?

I've not looked at the tests in question, but if they're executing an
inferior plan when no extended stats exist, then maybe that's
why they're slow.

I think Dean might mean to create a function similar to
explain_parallel_append() in partition_prune.sql then write tests that
check the row estimate with EXPLAIN (COSTS ON) but strip out the other
costing stuff instead of validating that the poor plan was chosen.
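
Something along those lines could look like this (an untested sketch;
the function name and table name are made up, and regexp_match()
requires PG 10 or later):

    CREATE FUNCTION row_estimate(query text) RETURNS bigint
    LANGUAGE plpgsql AS
    $$
    DECLARE
        ln text;
        m  text[];
    BEGIN
        FOR ln IN EXECUTE 'EXPLAIN ' || query
        LOOP
            -- the first EXPLAIN line carries the top node's row estimate
            m := regexp_match(ln, 'rows=(\d+)');
            IF m IS NOT NULL THEN
                RETURN m[1]::bigint;
            END IF;
        END LOOP;
        RETURN NULL;
    END;
    $$;

The tests could then just compare

    SELECT row_estimate('SELECT * FROM some_table WHERE a = 1 AND b = 1');

before and after CREATE STATISTICS, without printing whole plans.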

> On 3/10/19 2:09 PM, Dean Rasheed wrote:
> > 12). bms_member_index() should surely be in bitmapset.c. It could be
> > more efficient by just traversing the bitmap words and making use of
> > bmw_popcount(). Also, its second argument should be of type 'int' for
> > consistency with other bms_* functions.
>
> Yes, moving to bitmapset.c definitely makes sense. I don't see how it
> could use bms_popcount() though.

I think it could be done by first checking if the parameter is a
member of the set, and then if so, count all the bits that come on and
before that member. You can use bmw_popcount() for whole words before
the specific member's word then just bitwise-and a bit mask of a
bitmapword that has all bits set for all bits on and before your
parameter's BITNUM(), and add the bmw_popcount of the final word
bitwise-anding the mask. bms_add_range() has some masking code you
could copy.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Segfault when restoring -Fd dump on current HEAD

2019-03-10 Thread Dmitry Dolgov
> On Mon, Mar 4, 2019 at 7:15 PM Andres Freund  wrote:
>
> The pluggable storage patchset contains exactly that... I've attached
> the precursor patch (CREATE ACCESS METHOD ... TYPE TABLE), and the patch
> for pg_dump support. They need a bit more cleanup, but it might be
> useful information for this thread.

Didn't expect this to happen so quickly, thanks!

> On 2019-03-04 13:25:40 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > >
> > > But it does basically require breaking archive compatibility.  I
> > > personally am OK with that, but I thought it might be worth discussing.
> >
> > I don't recall there being huge pushback when we did that in the past,
> > so I'm fine with it as long as there's an identifiable feature making
> > it necessary.
>
> Cool.

Then I guess we need to add the attached patch on top of the pg_dump support for
table am. It contains changes to use NULL as a default value for owner / defn /
dropStmt (exactly what we've changed back in 19455c9f56), and doesn't crash,
since K_VERS is different.


0001-ArchiveEntry-null-handling.patch
Description: Binary data


Re: Should we increase the default vacuum_cost_limit?

2019-03-10 Thread Tom Lane
Julien Rouhaud  writes:
> On Sat, Mar 9, 2019 at 10:04 PM Tom Lane  wrote:
>> 2. It's always bugged me that we don't allow fractional unit
>> specifications, say "0.1GB", even for GUCs that are integers underneath.
>> That would be a simple additional change on top of this, but I didn't
>> do it here.

> It annoyed me multiple times, so +1 for making that happen.

The first patch below does that, but I noticed that if we just do it
without any subtlety, you get results like this:

regression=# set work_mem = '30.1GB';
SET
regression=# show work_mem;
  work_mem  

 31562138kB
(1 row)

The second patch is a delta that rounds off to the next smaller unit
if there is one, producing a less noisy result:

regression=# set work_mem = '30.1GB';
SET
regression=# show work_mem;
 work_mem 
--
 30822MB
(1 row)

I'm not sure if that's a good idea or just overthinking the problem.
Thoughts?

regards, tom lane

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fe17357..3b59565 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -51,14 +51,21 @@
In general, enclose the value in single quotes, doubling any single
quotes within the value.  Quotes can usually be omitted if the value
is a simple number or identifier, however.
+   (Values that match a SQL keyword require quoting in some contexts.)
   
  
 
  
   
Numeric (integer and floating point):
-   A decimal point is permitted only for floating-point parameters.
-   Do not use thousands separators.  Quotes are not required.
+   Numeric parameters can be specified in the customary integer and
+   floating-point formats; fractional values are rounded to the nearest
+   integer if the parameter is of type integer.  Integer parameters
+   additionally accept hexadecimal input (beginning
+   with 0x) and octal input (beginning
+   with 0), but these formats cannot have a fraction.
+   Do not use thousands separators.
+   Quotes are not required, except for hexadecimal input.
   
  
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fe6c6f8..d374f53 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -6141,8 +6141,10 @@ get_config_unit_name(int flags)
 
 /*
  * Try to parse value as an integer.  The accepted formats are the
- * usual decimal, octal, or hexadecimal formats, optionally followed by
- * a unit name if "flags" indicates a unit is allowed.
+ * usual decimal, octal, or hexadecimal formats, as well as floating-point
+ * formats (which will be rounded to integer after any units conversion).
+ * Optionally, the value can be followed by a unit name if "flags" indicates
+ * a unit is allowed.
  *
  * If the string parses okay, return true, else false.
  * If okay and result is not NULL, return the value in *result.
@@ -6152,7 +6154,11 @@ get_config_unit_name(int flags)
 bool
 parse_int(const char *value, int *result, int flags, const char **hintmsg)
 {
-	int64		val;
+	/*
+	 * We assume here that double is wide enough to represent any integer
+	 * value with adequate precision.
+	 */
+	double		val;
 	char	   *endptr;
 
 	/* To suppress compiler warnings, always set output params */
@@ -6161,35 +6167,42 @@ parse_int(const char *value, int *result, int flags, const char **hintmsg)
 	if (hintmsg)
 		*hintmsg = NULL;
 
-	/* We assume here that int64 is at least as wide as long */
+	/*
+	 * Try to parse as an integer (allowing octal or hex input).  If the
+	 * conversion stops at a decimal point or 'e', or overflows, re-parse as
+	 * float.  This should work fine as long as we have no unit names starting
+	 * with 'e'.  If we ever do, the test could be extended to check for a
+	 * sign or digit after 'e', but for now that's unnecessary.
+	 */
 	errno = 0;
 	val = strtol(value, &endptr, 0);
-
-	if (endptr == value)
-		return false;			/* no HINT for integer syntax error */
-
-	if (errno == ERANGE || val != (int64) ((int32) val))
+	if (*endptr == '.' || *endptr == 'e' || *endptr == 'E' ||
+		errno == ERANGE)
 	{
-		if (hintmsg)
-			*hintmsg = gettext_noop("Value exceeds integer range.");
-		return false;
+		errno = 0;
+		val = strtod(value, &endptr);
 	}
 
-	/* allow whitespace between integer and unit */
+	if (endptr == value || errno == ERANGE)
+		return false;			/* no HINT for these cases */
+
+	/* reject NaN (infinities will fail range check below) */
+	if (isnan(val))
+		return false;			/* treat same as syntax error; no HINT */
+
+	/* allow whitespace between number and unit */
 	while (isspace((unsigned char) *endptr))
 		endptr++;
 
 	/* Handle possible unit */
 	if (*endptr != '\0')
 	{
-		double		cval;
-
 		if ((flags & GUC_UNIT) == 0)
 			return false;		/* this setting does not accept a unit */
 
-		if (!convert_to_base_unit((double) val,
+		if (!convert_to_base_unit(val,
   endptr, (flags & GUC_UNIT),
-								  &cval))
+								  &val))
 		{
 		

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Peter Geoghegan
On Sun, Mar 10, 2019 at 12:53 PM Heikki Linnakangas  wrote:
> Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but
> changed to "searching_for_leaf_page" at the last minute. My thinking was
> that in the page-deletion case, you're trying to re-locate a particular
> leaf page. Otherwise, you're searching for matching tuples, not a
> particular page. Although during insertion, I guess you are also
> searching for a particular page, the page to insert to.

I prefer something like "searching_for_pivot_tuple", because it's
unambiguous. Okay with that?

> It's a hot codepath, but I doubt it's *that* hot that it matters,
> performance-wise...

I'll figure that out. Although I am currently looking into a
regression in workloads that fit in shared_buffers, that my
micro-benchmarks didn't catch initially. Indexes are still much
smaller, but we get a ~2% regression all the same. OTOH, we get a
7.5%+ increase in throughput when the workload is I/O bound, and
latency is generally no worse and even better with any workload.

I suspect that the nice top-down approach to nbtsplitloc.c has its
costs...will let you know more when I know more.

> > The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
> > they too have a heap TID attribute. nbtsearch.c code is not allowed to
> > rely on its value, though, and must use
> > minusinfkey/searching_for_pivot_tuple semantics (relying on its value
> > being minus infinity is still relying on its value being something).
>
> Yeah. I find that's a complicated way to think about it. My mental model
> is that v4 indexes store heap TIDs, and every tuple is unique thanks to
> that. But on v3, we don't store heap TIDs, and duplicates are possible.

I'll try it that way, then.

-- 
Peter Geoghegan



Re: performance issue in remove_from_unowned_list()

2019-03-10 Thread Alvaro Herrera
On 2019-Feb-07, Tomas Vondra wrote:

> Attached is a WIP patch removing the optimization from DropRelationFiles
> and adding it to smgrDoPendingDeletes. This resolves the issue, at least
> in the cases I've been able to reproduce. But maybe we should deal with
> this issue earlier by ensuring the two lists are ordered the same way
> from the very beginning, somehow.

I noticed that this patch isn't in the commitfest.  Are you planning to
push it soon?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Heikki Linnakangas

On 10/03/2019 20:53, Peter Geoghegan wrote:
> On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas  wrote:
>> If we don't flip the meaning of the flag, then maybe calling it
>> something like "searching_for_leaf_page" would make sense:
>>
>> /*
>>   * When set, we're searching for the leaf page with the given high key,
>>   * rather than leaf tuples matching the search keys.
>>   *
>>   * Normally, when !searching_for_pivot_tuple, if a page's high key
>
> I guess you meant to say "searching_for_pivot_tuple" both times (not
> "searching_for_leaf_page"). After all, we always search for a leaf
> page. :-)

Ah, yeah. Not sure. I wrote it as "searching_for_pivot_tuple" first, but
changed to "searching_for_leaf_page" at the last minute. My thinking was
that in the page-deletion case, you're trying to re-locate a particular
leaf page. Otherwise, you're searching for matching tuples, not a
particular page. Although during insertion, I guess you are also
searching for a particular page, the page to insert to.

>> As the patch stands, you're also setting minusinfkey when dealing with
>> v3 indexes. I think it would be better to only set
>> searching_for_leaf_page in nbtpage.c.
>
> That would mean I would have to check both heapkeyspace and
> minusinfkey within _bt_compare().

Yeah.

> I would rather just keep the
> assertion that makes sure that !heapkeyspace callers are also
> minusinfkey callers, and the comments that explain why that is. It
> might even matter to performance -- having an extra condition in
> _bt_compare() is something we should avoid.

It's a hot codepath, but I doubt it's *that* hot that it matters,
performance-wise...

>> In general, I think BTScanInsert
>> should describe the search key, regardless of whether it's a V3 or V4
>> index. Properties of the index belong elsewhere. (We're violating that
>> by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
>> probably OK, it is pretty convenient to have it there. But in principle...)
>
> The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
> they too have a heap TID attribute. nbtsearch.c code is not allowed to
> rely on its value, though, and must use
> minusinfkey/searching_for_pivot_tuple semantics (relying on its value
> being minus infinity is still relying on its value being something).

Yeah. I find that's a complicated way to think about it. My mental model
is that v4 indexes store heap TIDs, and every tuple is unique thanks to
that. But on v3, we don't store heap TIDs, and duplicates are possible.


- Heikki



Re: Batch insert in CTAS/MatView code

2019-03-10 Thread David Fetter
On Wed, Mar 06, 2019 at 10:06:27PM +0800, Paul Guo wrote:
> Hello, Postgres hackers,
> 
> The copy code has used batch insert with function heap_multi_insert() to
> speed up. It seems that Create Table As or Materialized View could leverage
> that code also to boost the performance also. Attached is a patch to
> implement that.

This is great!

Is this optimization doable for multi-row INSERTs, either with tuples
spelled out in the body of the query or in constructs like INSERT ...
SELECT ...?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Re: proposal: make NOTIFY list de-duplication optional

2019-03-10 Thread Filip Rembiałkowski
Thank you.

Here is my latest attempt, with actual syntax error handling.

Also, the syntax is updated to what Tom Lane suggested in another
thread (with another variant of the same thing, from Julien Demoor):
NOTIFY [ ( option [, ...] ) ] channel [ , payload ]

Still no hash table fallback is implemented, so this is *not* a
performance improvement. Only a little more flexibility.
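
For reference, usage would look something like this (a sketch based on
the grammar documented in the patch below; channel and payload names
are made up):

    -- default behaviour: duplicate notifications may be collapsed
    NOTIFY my_channel, 'payload';

    -- keep every notification, even duplicates within the transaction
    NOTIFY (COLLAPSE off) my_channel, 'payload';

    -- explicitly request the default collapsing behaviour
    NOTIFY (COLLAPSE on) my_channel, 'payload';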

On Sat, Mar 9, 2019 at 3:31 AM Thomas Munro  wrote:
>
> On Fri, Mar 8, 2019 at 1:37 PM Filip Rembiałkowski
>  wrote:
> > See attached patch... I'm ready to work on so it can get merged in the next 
> > CF.
>
> Hi Filip,
>
> Seen on Travis:
>
>  async... FAILED  126 ms
>
> Looks like the new error isn't being raised for invalid send mode?
> (What kind of error message is "?" anyway? :-))
>
>  ERROR:  channel name too long
>  -- Should fail. Invalid 3rd parameter
>  NOTIFY notify_async2, 'test', 'invalid';
> -ERROR:  ?
>  NOTIFY notify_async2, 'test', true;
> -ERROR:  ?
>  --Should work. Valid NOTIFY/LISTEN/UNLISTEN commands
>  NOTIFY notify_async2;
>  NOTIFY notify_async2, '';
>
> --
> Thomas Munro
> https://enterprisedb.com
 contrib/tcn/tcn.c   |  2 +-
 doc/src/sgml/ref/notify.sgml| 64 +++--
 src/backend/commands/async.c| 93 +++--
 src/backend/nodes/copyfuncs.c   |  7 +--
 src/backend/nodes/equalfuncs.c  |  7 +--
 src/backend/nodes/outfuncs.c|  3 +-
 src/backend/nodes/readfuncs.c   |  3 +-
 src/backend/parser/gram.y   | 78 ++-
 src/backend/tcop/utility.c  |  8 ++--
 src/backend/utils/adt/ruleutils.c   |  2 +-
 src/include/catalog/pg_proc.dat |  4 ++
 src/include/commands/async.h|  4 +-
 src/include/nodes/parsenodes.h  | 12 +++--
 src/test/regress/expected/async.out | 21 +
 src/test/regress/sql/async.sql  | 10 
 15 files changed, 257 insertions(+), 61 deletions(-)

diff --git a/contrib/tcn/tcn.c b/contrib/tcn/tcn.c
index 5355a64c5e..b80337a5ce 100644
--- a/contrib/tcn/tcn.c
+++ b/contrib/tcn/tcn.c
@@ -161,7 +161,7 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 	strcpy_quoted(payload, SPI_getvalue(trigtuple, tupdesc, colno), '\'');
 }
 
-Async_Notify(channel, payload->data);
+Async_Notify(channel, payload->data, true);
 			}
 			ReleaseSysCache(indexTuple);
 			break;
diff --git a/doc/src/sgml/ref/notify.sgml b/doc/src/sgml/ref/notify.sgml
index e0e125a2a2..e06b00f1f3 100644
--- a/doc/src/sgml/ref/notify.sgml
+++ b/doc/src/sgml/ref/notify.sgml
@@ -21,7 +21,11 @@ PostgreSQL documentation
 
  
 
-NOTIFY channel [ , payload ]
+NOTIFY [ ( option [, ...] ) ] channel [ , payload ]
+
+where option can be one of:
+
+COLLAPSE [ boolean ]
 
  
 
@@ -47,10 +51,10 @@ NOTIFY channel [ , 
 
   
-   The information passed to the client for a notification event includes the
-   notification channel
-   name, the notifying session's server process PID, and the
-   payload string, which is an empty string if it has not been specified.
+   The information passed to the client for a notification event includes
+   the notification channel name, the notifying session's server process
+   PID, and the payload string, which is an empty string
+   if it has not been specified.
   
 
   
@@ -92,21 +96,13 @@ NOTIFY channel [ , NOTIFY for real-time signaling
should try to keep their transactions short.
   
-
   
-   If the same channel name is signaled multiple times from the same
-   transaction with identical payload strings, the
-   database server can decide to deliver a single notification only.
-   On the other hand, notifications with distinct payload strings will
-   always be delivered as distinct notifications. Similarly, notifications from
-   different transactions will never get folded into one notification.
Except for dropping later instances of duplicate notifications,
NOTIFY guarantees that notifications from the same
transaction get delivered in the order they were sent.  It is also
-   guaranteed that messages from different transactions are delivered in
-   the order in which the transactions committed.
+   guaranteed that messages from different transactions are delivered
+   in the order in which the transactions committed.
   
-
   
It is common for a client that executes NOTIFY
to be listening on the same notification channel itself.  In that case
@@ -147,6 +143,21 @@ NOTIFY channel [ , 
 

+   
+COLLAPSE
+
+ 
+  Controls collapsing of repeated messages.
+
+  When set to on, notification will be skipped if
+  the same channel was already signaled with identical payload in the same
+  transaction block (finished with COMMIT,
+  END or SAVEPOINT).
+  
+  When set to off, duplicates will not be collapsed.
+ 
+
+   
   
  
 
@@ -190,6 +201,13 @@ NOTIFY channel [ , NOTIFY command if you need to work with
 non-constant 

Re: dropdb --force

2019-03-10 Thread Filip Rembiałkowski
Thank you. Updated patch attached.

On Sat, Mar 9, 2019 at 2:53 AM Thomas Munro  wrote:
>
> On Wed, Mar 6, 2019 at 1:39 PM Filip Rembiałkowski
>  wrote:
> > Here is Pavel's patch rebased to master branch, added the dropdb
> > --force option, a test case & documentation.
>
> Hello,
>
> cfbot.cputube.org says this fails on Windows, due to a missing semicolon here:
>
> #ifdef HAVE_SETSID
> kill(-(proc->pid), SIGTERM);
> #else
> kill(proc->pid, SIGTERM)
> #endif
>
> The test case failed on Linux, I didn't check why exactly:
>
> Test Summary Report
> ---
> t/050_dropdb.pl (Wstat: 65280 Tests: 13 Failed: 2)
> Failed tests: 12-13
> Non-zero exit status: 255
> Parse errors: Bad plan. You planned 11 tests but ran 13.
>
> +/* Time to sleep after isuing SIGTERM to backends */
> +#define TERMINATE_SLEEP_TIME 1
>
> s/isuing/issuing/
>
> But, hmm, this macro doesn't actually seem to be used in the patch.
> Wait, is that because the retry loop forgot to actually include the
> sleep?
>
> +/* without "force" flag raise exception immediately, or after
> 5 minutes */
>
> Normally we call it an "error", not an "exception".
>
> --
> Thomas Munro
> https://enterprisedb.com
 doc/src/sgml/ref/drop_database.sgml | 28 +
 doc/src/sgml/ref/dropdb.sgml| 14 +
 src/backend/commands/dbcommands.c   | 42 -
 src/backend/nodes/copyfuncs.c   |  1 +
 src/backend/nodes/equalfuncs.c  |  1 +
 src/backend/parser/gram.y   | 18 
 src/backend/storage/ipc/procarray.c | 14 -
 src/backend/tcop/utility.c  |  2 +-
 src/bin/scripts/dropdb.c| 14 +
 src/bin/scripts/t/050_dropdb.pl |  6 ++
 src/include/commands/dbcommands.h   |  2 +-
 src/include/nodes/parsenodes.h  |  1 +
 src/include/storage/procarray.h |  3 ++-
 13 files changed, 124 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/ref/drop_database.sgml b/doc/src/sgml/ref/drop_database.sgml
index 3ac06c984a..84df485e11 100644
--- a/doc/src/sgml/ref/drop_database.sgml
+++ b/doc/src/sgml/ref/drop_database.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-DROP DATABASE [ IF EXISTS ] name
+DROP DATABASE [ IF EXISTS ] name [ FORCE ]
 
  
 
@@ -32,9 +32,11 @@ DROP DATABASE [ IF EXISTS ] name
DROP DATABASE drops a database. It removes the
catalog entries for the database and deletes the directory
containing the data.  It can only be executed by the database owner.
-   Also, it cannot be executed while you or anyone else are connected
-   to the target database.  (Connect to postgres or any
-   other database to issue this command.)
+   Also, it cannot be executed while you are connected to the target database.
+   (Connect to postgres or any other database to issue this
+   command.)
+   If anyone else is connected to the target database, this command will fail
+   - unless you use the FORCE option described below.
   
 
   
@@ -64,6 +66,24 @@ DROP DATABASE [ IF EXISTS ] name
  
 

+
+   
+FORCE
+
+ 
+  Attempt to terminate all existing connections to the target database.
+ 
+ 
+  This will fail, if current user has no permissions to terminate other
+  connections. Required permissions are the same as with 
+  pg_terminate_backend, described
+  in .
+
+  This will also fail, if the connections do not terminate in 60 seconds.
+ 
+
+   
+
   
  
 
diff --git a/doc/src/sgml/ref/dropdb.sgml b/doc/src/sgml/ref/dropdb.sgml
index 38f38f01ce..365ba317c1 100644
--- a/doc/src/sgml/ref/dropdb.sgml
+++ b/doc/src/sgml/ref/dropdb.sgml
@@ -86,6 +86,20 @@ PostgreSQL documentation
   
  
 
+ 
+  -f
+  --force
+  
+   
+Force termination of connected backends before removing the database.
+   
+   
+This will add the FORCE option to the DROP
+DATABASE command sent to the server.
+   
+  
+ 
+
  
-V
--version
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index d207cd899f..0e7a23cb26 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -487,7 +487,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	 * potential waiting; we may as well throw an error first if we're gonna
 	 * throw one.
 	 */
-	if (CountOtherDBBackends(src_dboid, &notherbackends, &npreparedxacts))
+	if (CountOtherDBBackends(src_dboid, &notherbackends, &npreparedxacts, false))
 		ereport(ERROR,
 (errcode(ERRCODE_OBJECT_IN_USE),
  errmsg("source database \"%s\" is being accessed by other users",
@@ -777,7 +777,7 @@ createdb_failure_callback(int code, Datum arg)
  * DROP DATABASE
  */
 void
-dropdb(const char *dbname, bool missing_ok)
+dropdb(const char *dbname, bool missing_ok, bool force)
 {
 	Oid			db_id;
 	bool		db_istemplate;
@@ -785,6 +785,7 @@ dropdb(const char *dbname, bool missing_ok)
 	HeapTuple	tup;
 	int			notherbackends;
 	int			npreparedxacts;
+	int			

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Peter Geoghegan
On Sun, Mar 10, 2019 at 7:09 AM Heikki Linnakangas  wrote:
> "descendrighttrunc" doesn't make much sense to me, either. I don't
> understand it. Maybe a comment would make it clear, though.

It's not an easily grasped concept. I don't think that any name will
easily convey the idea to the reader, though. I'm happy to go with
whatever name you prefer.

> I don't feel like this is an optimization. It's natural consequence of
> what the high key means. I guess you can think of it as an optimization,
> in the same way that not fully scanning the whole index for every search
> is an optimization, but that's not how I think of it :-).

I would agree with this in a green field situation, where we don't
have to consider the legacy of v3 indexes. But that's not the case
here.

> If we don't flip the meaning of the flag, then maybe calling it
> something like "searching_for_leaf_page" would make sense:
>
> /*
>   * When set, we're searching for the leaf page with the given high key,
>   * rather than leaf tuples matching the search keys.
>   *
>   * Normally, when !searching_for_pivot_tuple, if a page's high key

I guess you meant to say "searching_for_pivot_tuple" both times (not
"searching_for_leaf_page"). After all, we always search for a leaf
page. :-)

I'm fine with "searching_for_pivot_tuple", I think. The underscores
are not really stylistically consistent with other stuff in nbtree.h,
but I can use something very similar to your suggestion that is
consistent.

>   * has truncated columns, and the columns that are present are equal to
>   * the search key, the search will not descend to that page. For
>   * example, if an index has two columns, and a page's high key is
>   * ("foo", ), and the search key is also ("foo," ),
>   * the search will not descend to that page, but its right sibling. The
>   * omitted column in the high key means that all tuples on the page must
>   * be strictly < "foo", so we don't need to visit it. However, sometimes
>   * we perform a search to find the parent of a leaf page, using the leaf
>   * page's high key as the search key. In that case, when we search for
>   * ("foo", ), we do want to land on that page, not its right
>   * sibling.
>   */
> boolsearching_for_leaf_page;

That works for me (assuming you meant searching_for_pivot_tuple).

> As the patch stands, you're also setting minusinfkey when dealing with
> v3 indexes. I think it would be better to only set
> searching_for_leaf_page in nbtpage.c.

That would mean I would have to check both heapkeyspace and
minusinfkey within _bt_compare(). I would rather just keep the
assertion that makes sure that !heapkeyspace callers are also
minusinfkey callers, and the comments that explain why that is. It
might even matter to performance -- having an extra condition in
_bt_compare() is something we should avoid.

> In general, I think BTScanInsert
> should describe the search key, regardless of whether it's a V3 or V4
> index. Properties of the index belong elsewhere. (We're violating that
> by storing the 'heapkeyspace' flag in BTScanInsert. That wart is
> probably OK, it is pretty convenient to have it there. But in principle...)

The idea with pg_upgrade'd v3 indexes is, as I said a while back, that
they too have a heap TID attribute. nbtsearch.c code is not allowed to
rely on its value, though, and must use
minusinfkey/searching_for_pivot_tuple semantics (relying on its value
being minus infinity is still relying on its value being something).

Now, it's also true that there are a number of things that we have to
do within nbtinsert.c for !heapkeyspace that don't really concern the
key space as such. Even still, thinking about everything with
reference to the keyspace, and keeping that as similar as possible
between v3 and v4 is a good thing. It is up to high level code (such
as _bt_first()) to not allow a !heapkeyspace index scan to do
something that won't work for it. It is not up to low level code like
_bt_compare() to worry about these differences (beyond asserting that
caller got it right). If page deletion didn't need minusinfkey
semantics (if nobody but v3 indexes needed that), then I would just
make the "move right of separator" !minusinfkey code within
_bt_compare() test heapkeyspace. But we do have a general need for
minusinfkey semantics, so it seems simpler and more future-proof to
keep heapkeyspace out of low-level nbtsearch.c code.

-- 
Peter Geoghegan



Re: subscriptionCheck failures on nightjar

2019-03-10 Thread Andrew Dunstan


On 2/13/19 1:12 PM, Andres Freund wrote:
> Hi,
>
> On 2019-02-13 12:59:19 -0500, Tom Lane wrote:
>> Andres Freund  writes:
>>> On 2019-02-13 12:37:35 -0500, Tom Lane wrote:
 Bleah.  But in any case, the rename should not create a situation
 in which we need to fsync the file data again.
>>> Well, it's not super well defined which of either you need to make the
>>> rename durable, and it appears to differ between OSs. Any argument
>>> against fixing it up like I suggested, by using an fd from before the
>>> rename?
>> I'm unimpressed.  You're speculating about the filesystem having random
>> deviations from POSIX behavior, and using that weak argument to justify a
>> totally untested technique having its own obvious portability hazards.
> Uhm, we've reproduced failures due to the lack of such fsyncs at some
> point. And not some fringe OS, but ext4 (albeit with data=writeback).
>
> I don't think POSIX has yet figured out what they actually think is
> required:
> http://austingroupbugs.net/view.php?id=672
>
> I guess we could just ignore ENOENT in this case, that ought to be just
> as safe as using the old fd.
>
>
>> Also, I wondered why this is coming out as a PANIC.  I thought originally
>> that somebody must be causing this code to run in a critical section,
>> but it looks like the real issue is just that fsync_fname() uses
>> data_sync_elevel, which is
>>
>> int
>> data_sync_elevel(int elevel)
>> {
>>  return data_sync_retry ? elevel : PANIC;
>> }
>>
>> I really really don't want us doing questionably-necessary fsyncs with a
>> PANIC as the result.
> Well, given the 'failed fsync throws dirty data away' issue, I don't
> quite see what we can do otherwise. But:
>
>
>> Perhaps more to the point, the way this was coded, the PANIC applies
>> to open() failures in fsync_fname_ext() not just fsync() failures;
>> that's painting with too broad a brush isn't it?
> That indeed seems wrong. Thomas?  I'm not quite sure how to best fix
> this though - I guess we could rename fsync_fname_ext's eleval parameter
> to fsync_failure_elevel? It's not visible outside fd.c, so that'd not be
> to bad?
>

Thread seems to have gone quiet ...


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-03-10 Thread David Fetter
On Mon, Mar 04, 2019 at 05:46:07PM -0300, Alvaro Herrera wrote:
> Hi Rahila,
> 
> Thanks for looking.
> 
> On 2019-Mar-04, Rahila wrote:
> 
> > 1. Thank you for incorporating review comments.
> > Can you please rebase the latest 0001-Report-progress-of-
> > CREATE-INDEX-operations.patch on master? Currently it does not apply on
> > 754b90f657bd54b482524b73726dae4a9165031c
> 
> Hmm, rebased to current master.  Didn't conflict at all when rebasing,
> so it's strange that it didn't apply.
> 
> > >   15:56:44.694 | building index (3 of 8): initializing (1/5)| 
> > >   442478 |  442399 |0 |   0 |
> > > 0 |   0
> > >   15:56:44.705 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 
> > >   442478 |  442399 |1 |   0 |
> > > 0 |   0
> > >   15:56:44.716 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 
> > >   442478 |  442399 |1 |   0 |
> > > 0 |   0
> > >   15:56:44.727 | building index (3 of 8): final btree sort & load (5/5) | 
> > >   442478 |  442399 |1 |   79057 |
> > > 0 |   0
> > >   15:56:44.738 | building index (3 of 8): final btree sort & load (5/5) | 
> > >   442478 |  442399 |1 |  217018 |
> > > 0 |   0
> > >   15:56:44.75  | building index (3 of 8): final btree sort & load (5/5) | 
> > >   442478 |  442399 |1 |  353804 |
> > > 0 |   0
> > 2. In the above report, even though we are reporting progress in terms of
> > tuples_done for final btree sort & load phase we have not cleared
> > the blocks_done entry from previous phases. I think this can be confusing as
> > the blocks_done does not correspond to the tuples_done in the final btree
> > sort & load phase.
> 
> Good point.  Done in the attached version, wherein I also added comments
> to explain the IndexBuildHeapScan API change.  I didn't change the hash
> AM implementation here.

Would it be a very large lift to report progress for the rest of the
index types we support?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2019-03-10 Thread Tomas Vondra
Hi Dean,

Thanks for the review. I'll post a patch fixing most of the stuff soon,
but a few comments/questions regarding some of the issues:

On 3/9/19 7:33 PM, Dean Rasheed wrote:
> 5). It's not obvious what some of the new test cases in the
> "stats_ext" tests are intended to show. For example, the first test
> creates a table with 5000 rows and a couple of indexes, does a couple
> of queries, builds some MCV stats, and then repeats the queries, but
> the results seem to be the same with and without the stats.
>

Hmmm. I thought those tests are testing that we get the right plan, but
maybe I broke that somehow during the rebases. Will check.

> I wonder if it's possible to write smaller, more targeted tests.
> Currently "stats_ext" is by far the slowest test in its group, and I'm
> not sure that some of those tests add much. It ought to be possible to
> write a function that calls EXPLAIN and returns a query's row
> estimate, and then you could write tests to confirm the effect of the
> new stats by verifying the row estimates change as expected.

Sure, if we can write more targeted tests, that would be good. But it's
not quite clear to me how wrapping EXPLAIN in a function makes those
tests any faster?

On 3/10/19 2:09 PM, Dean Rasheed wrote:
> 12). bms_member_index() should surely be in bitmapset.c. It could be
> more efficient by just traversing the bitmap words and making use of
> bmw_popcount(). Also, its second argument should be of type 'int' for
> consistency with other bms_* functions.

Yes, moving to bitmapset.c definitely makes sense. I don't see how it
could use bms_popcount() though.

On 3/10/19 2:09 PM, Dean Rasheed wrote:
> 14). The attnums Bitmapset passed to
> statext_is_compatible_clause_internal() is an input/output argument
> that it updates. That should probably be documented. When it calls
> itself recursively for AND/OR/NOT clauses, it could just pass the
> original Bitmapset through to be updated (rather than creating a new
> one and merging it), as it does for other types of clause.
>

I don't think it's really possible, because the AND/OR/NOT clause is
considered compatible only when all the pieces are compatible. So we
can't tweak the original bitmapset directly in case the incompatible
clause is not the very first one.

> On the other hand, the outer function statext_is_compatible_clause()
> does need to return a new bitmap, which may or may not be used by its
> caller, so it would be cleaner to make that a strictly "out" parameter
> and initialise it to NULL in that function, rather than in its caller.


On 3/10/19 2:09 PM, Dean Rasheed wrote:
> 15). As I said yesterday, I don't think that there is a clean
> separator of concerns between the functions clauselist_selectivity(),
> statext_clauselist_selectivity(),
> dependencies_clauselist_selectivity() and
> mcv_clauselist_selectivity(), I think things could be re-arranged as
> follows:
>
> statext_clauselist_selectivity() - as the name suggests - should take
> care of *all* extended stats estimation, not just MCVs and histograms.
> So make it a fairly small function, calling
> mcv_clauselist_selectivity() and
> dependencies_clauselist_selectivity(), and histograms when that gets
> added.
>
> Most of the current code in statext_clauselist_selectivity() is really
> MCV-specific, so move that to mcv_clauselist_selectivity(). Amongst
> other things, that would move the call to choose_best_statistics() to
> mcv_clauselist_selectivity() (just as
> dependencies_clauselist_selectivity() calls choose_best_statistics()
> to get the best dependencies statistics). Then, when histograms are
> added later, you won't have the problem pointed out before where it
> can't apply both MCV and histogram stats if they're on different
> STATISTICS objects.

I agree clauselist_selectivity() shouldn't care about various types of
extended statistics (MCV vs. functional dependencies). But I'm not sure
the approach you suggested (moving stuff to mcv_clauselist_selectivity)
would work particularly well because most of it is not specific to MCV
lists. It'll also need to care about histograms, for example.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why don't we have a small reserved OID range for patch revisions?

2019-03-10 Thread John Naylor
On Sat, Mar 9, 2019 at 1:14 AM Tom Lane  wrote:
> I took a quick look at this.  I went ahead and pushed the parts that
> were just code cleanup in reformat_dat_file.pl, since that seemed
> pretty uncontroversial.  As far as the rest of it goes:

Okay, thanks.

> * I'm really not terribly happy with sticking this functionality into
> reformat_dat_file.pl.  First, there's an issue of discoverability:
> it's not obvious that a script named that way would have such an
> ability.  Second, it clutters the script in a way that seems to me
> to hinder its usefulness as a basis for one-off hacks.  So I'd really
> rather have a separate script named something like "renumber_oids.pl",
> even if there's a good deal of code duplication between it and
> reformat_dat_file.pl.

> * In my vision of what this might be good for, I think it's important
> that it be possible to specify a range of input OIDs to renumber, not
> just "everything above N".  I agree the output range only needs a
> starting OID.

Now it looks like:

perl  renumber_oids.pl  --first-mapped-oid 8000 --last-mapped-oid 8999
 --first-target-oid 2000  *.dat

To prevent a maintenance headache, I didn't copy any of the formatting
logic over. You'll also have to run reformat_dat_files.pl afterwards
to restore that. It seems to work, but I haven't tested thoroughly.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c6f5a1f6016feeb7f7ea94e504be6d23a264ca07 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Mon, 11 Mar 2019 01:05:15 +0800
Subject: [PATCH v2] Provide script to renumber OIDs.

Historically, it's been considered good practice to choose an OID that's
at the beginning of the range shown by the unused_oids script, but
nowadays that space is getting cramped, leading to merge conflicts.

Instead, allow developers to use high-numbered OIDs, and remap them
to a lower range at some future date.
---
 doc/src/sgml/bki.sgml|  16 +++
 src/include/catalog/renumber_oids.pl | 193 +++
 2 files changed, 209 insertions(+)
 create mode 100644 src/include/catalog/renumber_oids.pl

diff --git a/doc/src/sgml/bki.sgml b/doc/src/sgml/bki.sgml
index 3c5830bc8a..7e553dc30d 100644
--- a/doc/src/sgml/bki.sgml
+++ b/doc/src/sgml/bki.sgml
@@ -392,6 +392,22 @@
 at compile time.)

 
+   
+For a variety of reasons, newly assigned OIDs should occupy the lower
+end of the reserved range.  Because multiple patches considered for
+inclusion into PostgreSQL could be using
+the same new OIDs, there is the possibility of conflict.  To mitigate
+this, there is a convention that OIDs 8000 and over are reserved for
+development.  This reduces the effort in rebasing over the HEAD branch.
+To enable committers to move these OIDs to the lower range easily,
+renumber_oids.pl can map OIDs as in the
+following, which maps any OIDs between 8000 and 8999, inclusive, to
+unused OIDs starting at 2000:
+
+$ perl  renumber_oids.pl  --first-mapped-oid 8000 --last-mapped-oid 8999  --first-target-oid 2000  *.dat
+
+   
+

 The OID counter starts at 1 at the beginning of a bootstrap run.
 If a row from a source other than postgres.bki
diff --git a/src/include/catalog/renumber_oids.pl b/src/include/catalog/renumber_oids.pl
new file mode 100644
index 00..79537465bc
--- /dev/null
+++ b/src/include/catalog/renumber_oids.pl
@@ -0,0 +1,193 @@
+#!/usr/bin/perl
+#--
+#
+# renumber_oids.pl
+#Perl script that shifts a range of high-numbered OIDs in the given
+#catalog data file(s) to a lower range.
+#
+#Note: This does not format the output, so you will still need to
+#run reformat_dat_file.pl.
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/include/catalog/renumber_oids.pl
+#
+#--
+
+use strict;
+use warnings;
+
+use FindBin;
+use Getopt::Long;
+use Data::Dumper;
+$Data::Dumper::Terse = 1;
+
+# If you copy this script to somewhere other than src/include/catalog,
+# you'll need to modify this "use lib" or provide a suitable -I switch.
+use lib "$FindBin::RealBin/../../backend/catalog/";
+use Catalog;
+
+# Must run in src/include/catalog
+chdir $FindBin::RealBin or die "could not cd to $FindBin::RealBin: $!\n";
+
+my $FirstGenbkiObjectId =
+  Catalog::FindDefinedSymbol('access/transam.h', '..',
+	'FirstGenbkiObjectId');
+
+# Process command line switches.
+my $output_path = '';
+my $first_mapped_oid;
+my $last_mapped_oid = $FirstGenbkiObjectId - 1;
+my $first_target_oid = 1;
+
+GetOptions(
+	'output:s'   => \$output_path,
+	'first-mapped-oid:i' => \$first_mapped_oid,
+	'last-mapped-oid:i'  => \$last_mapped_oid,
+	'first-target-oid:i' => 

Re: Covering GiST indexes

2019-03-10 Thread Andrey Borodin
Hi!

> On 10 March 2019, at 13:39, Alexander Korotkov wrote:
> 
> Pushed with some more small changes in docs.

That's great, thank you!

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-10 Thread Andrey Borodin

> On 5 March 2019, at 18:21, Heikki Linnakangas wrote:
> 
> On 05/03/2019 02:26, Andrey Borodin wrote:
>> 
>> That's cool! I'll work on 2nd step of these patchset to make
>> blockset data structure prettier and less hacky.
> 
> Committed the first patch. Thanks for the patch!

That's cool! Thanks!

> I'll change the status of this patch to "Waiting on Author", to reflect the 
> state of the 2nd patch, since you're working on the radix tree blockset stuff.

Here's a new version of the patch for actual page deletion.
Changes:
1. Fixed a possible concurrency issue:
We were locking the child page while holding a lock on the internal page.
Notes in the GiST README recommend locking the child before the parent.
So now we unlock the internal page before locking children for deletion,
lock it again afterwards, check that everything is still in its place, and
only then delete.
One thing still bothers me. Let's assume that we have an internal page with
2 deletable leaves. We lock these leaves in the order of their items on the
internal page. Is it possible that the 2nd page has a follow-right link to
the 1st, and someone will lock the 2nd page, try to lock the 1st, and
deadlock with VACUUM?
2. Added radix-tree-based block set to lib, with tests.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM.patch
Description: Binary data


0001-Add-block-set-data-structure.patch
Description: Binary data


Re: proposal: plpgsql pragma statement

2019-03-10 Thread Pavel Stehule
On Sat, 9 Mar 2019 at 22:17, Pavel Stehule wrote:

>
>
> On Thu, 7 Mar 2019 at 18:45, Andrew Dunstan
> <andrew.duns...@2ndquadrant.com> wrote:
>
>>
>> On 3/7/19 12:41 PM, Pavel Stehule wrote:
>> >
>> >
>> > On Thu, 7 Mar 2019 at 18:35, Andrew Dunstan wrote:
>> >
>> >
>> >
>> >
>> > The other thing that bugs me a bit about the patch is that the only
>> > testing it does it to make sure that pragmas are ignored by the core
>> > plpgsql processor. Maybe that's enough, but mostly we tend to like
>> to
>> > have one actual use of a feature.
>> >
>> >
>> > Unfortunately plpgsql_check is not part of upstream
>> >
>> > https://github.com/okbob/plpgsql_check
>> >
>> > I can to write some simple extension - some print tracing, that can
>> > use this feature?
>> >
>> >
>>
>> Works for me. Another idea I had was some sort of crypto signature pragma.
>>
>
> Here is pragma patch with demo
>

fixed check-world

Regards

Pavel


> Regards
>
> Pavel
>
>>
>>
>> I still think making it block level only is unwarranted, though.
>>
>>
>> cheers
>>
>>
>> andrew
>>
>>
>> --
>> Andrew Dunstanhttps://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>>
diff --git a/contrib/Makefile b/contrib/Makefile
index 92184ed487..a5edd12e35 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -39,6 +39,7 @@ SUBDIRS = \
 		pgrowlocks	\
 		pgstattuple	\
 		pg_visibility	\
+		plpgsql_tracer	\
 		postgres_fdw	\
 		seg		\
 		spi		\
diff --git a/contrib/plpgsql_tracer/.gitignore b/contrib/plpgsql_tracer/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/contrib/plpgsql_tracer/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/plpgsql_tracer/Makefile b/contrib/plpgsql_tracer/Makefile
new file mode 100644
index 00..aedbd4f20f
--- /dev/null
+++ b/contrib/plpgsql_tracer/Makefile
@@ -0,0 +1,25 @@
+# contrib/plpgsql_tracer/Makefile
+
+MODULES = plpgsql_tracer
+
+EXTENSION = plpgsql_tracer
+PGFILEDESC = "plpgsql tracer - example of PRAGMA based extension"
+
+REGRESS = plpgsql_tracer
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/plpgsql_tracer
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+ifeq ($(PORTNAME), darwin)
+override CFLAGS += -undefined dynamic_lookup
+endif
+
+override CFLAGS += -I$(top_builddir)/src/pl/plpgsql/src -Wall
diff --git a/contrib/plpgsql_tracer/expected/plpgsql_tracer.out b/contrib/plpgsql_tracer/expected/plpgsql_tracer.out
new file mode 100644
index 00..d183fcebef
--- /dev/null
+++ b/contrib/plpgsql_tracer/expected/plpgsql_tracer.out
@@ -0,0 +1,38 @@
+--
+--  Test plpgsql_tracer extension
+--
+load 'plpgsql';
+load 'plpgsql_tracer';
+create or replace function fx()
+returns integer
+language plpgsql
+AS $function$
+declare
+  x int = 0;
+  pragma plpgsql_tracer(on);
+  pragma ignore_me;
+begin
+  x := x + 1;
+  x := x + 1;
+  declare
+pragma plpgsql_tracer(off);
+  begin
+-- hidden statemt
+declare pragma plpgsql_tracer; -- ignored, just warning
+begin
+  x := x + 1;
+end;
+  end;
+  return x;
+end;
+$function$;
+select fx();
+WARNING:  missing argument of PRAGMA plpgsql_tracer
+NOTICE:  plpgsql_tracer:7: assignment
+NOTICE:  plpgsql_tracer:8: assignment
+NOTICE:  plpgsql_tracer:   18: RETURN
+ fx 
+
+  3
+(1 row)
+
diff --git a/contrib/plpgsql_tracer/plpgsql_tracer.c b/contrib/plpgsql_tracer/plpgsql_tracer.c
new file mode 100644
index 00..03d4ed8a62
--- /dev/null
+++ b/contrib/plpgsql_tracer/plpgsql_tracer.c
@@ -0,0 +1,248 @@
+/*
+ * contrib/plpgsql_tracer/plpgsql_tracer.c
+ */
+#include "postgres.h"
+#include "plpgsql.h"
+#include "fmgr.h"
+
+#include "nodes/bitmapset.h"
+
+PG_MODULE_MAGIC;
+
+
+#define PLPGSQL_TRACER_MAGIC		73071522
+
+typedef struct plpgsql_tracer_data
+{
+	intmagic;
+	Bitmapset	   *stmtids;
+} plpgsql_tracer_data;
+
+
+static void collect_stmtid(PLpgSQL_stmt *stmt, plpgsql_tracer_data *data, bool trace);
+
+static void plpgsql_tracer_func_init(PLpgSQL_execstate *estate, PLpgSQL_function *func);
+static void plpgsql_tracer_stmt_beg(PLpgSQL_execstate *estate, PLpgSQL_stmt *stmt);
+
+static PLpgSQL_plugin plugin_funcs = { plpgsql_tracer_func_init,
+	   NULL,
+	   NULL,
+	   plpgsql_tracer_stmt_beg,
+	   NULL,
+	   NULL,
+	   NULL};
+
+/*
+ * Collect traced statement id from list of statements.
+ */
+static void
+collect_stmtid_list(List *stmts,
+	plpgsql_tracer_data *data,
+	bool trace)
+{
+	ListCell	   *lc;
+
+	foreach(lc, stmts)
+	{
+		collect_stmtid((PLpgSQL_stmt *) lfirst(lc), data, trace);
+	}
+}
+
+
+/*
+ * It is iterate over all plpgsql statements and collect stmtid of statements
+ * inside blocks marked by 

Re: Binary upgrade from <12 to 12 creates toast table for partitioned tables

2019-03-10 Thread Alvaro Herrera
Hello

On 2019-Mar-08, Andres Freund wrote:

> > diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
> > index e962ae7e913..1de8da59361 100644
> > --- a/src/bin/pg_dump/pg_dump.c
> > +++ b/src/bin/pg_dump/pg_dump.c
> > @@ -4359,9 +4359,9 @@ binary_upgrade_set_type_oids_by_rel_oid(Archive *fout,
> >   "SELECT c.reltype AS crel, t.reltype 
> > AS trel "
> >   "FROM pg_catalog.pg_class c "
> >   "LEFT JOIN pg_catalog.pg_class t ON "
> > - "  (c.reltoastrelid = t.oid) "
> > + "  (c.reltoastrelid = t.oid AND 
> > c.relkind <> '%c') "
> >   "WHERE c.oid = '%u'::pg_catalog.oid;",
> > - pg_rel_oid);
> > + RELKIND_PARTITIONED_TABLE, 
> > pg_rel_oid);
> 
> Hm, I know this code isn't generally well documented, but perhaps we
> could add a comment as to why we're excluding partitioned tables?

I added a short comment nearby.  Hopefully that's sufficient.  Let's see
what the buildfarm members have to say now.

(I wondered about putting the comment between two lines of the string
literal, but decided against it ...)

Thanks!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2019-03-10 Thread Dean Rasheed
On Sun, 10 Mar 2019 at 13:09, Dean Rasheed  wrote:
> Here are some more comments:
>

One more thing --- the comment for statext_clauselist_selectivity() says:

 * So (simple_selectivity - base_selectivity) may be seen as a correction for
 * the part not covered by the MCV list.

That's not quite right. It should really say that (simple_selectivity
- base_selectivity) is an estimate for the part not covered by the MCV
list, or that (mcv_selectivity - base_selectivity) is a correction for
the part covered by the MCV list. Those 2 statements are actually
equivalent, and different from what you wrote.

Perhaps the easiest way to see it is to work through a simple example:

Suppose you had the following clauses:

  a = 1 AND b >= 0 AND b <= 10

The per-column stats might be expected to give reasonable independent
estimates for the following 2 things:

  P(a = 1)

  P(b >= 0 AND b <= 10)  -- in general, greater than P(b >= 0) * P(b <= 10)

but the overall estimate produced by clauselist_selectivity_simple()
would then just be the product of those 2 things:

  simple_sel = P(a = 1) * P(b >= 0 AND b <= 10)

which might not be so good if the columns were correlated.

Now suppose you had MCV stats, which included MCV items for the
following specific values:

  (a=1,b=1), (a=1,b=2), (a=1,b=3)

but no other relevant MCV entries. (There might be lots of other MCV
items that don't match the original clauses, but they're irrelevant
for this discussion.) That would mean that we could get reasonable
estimates for the following 2 quantities:

  mcv_sel = P(a = 1 AND b IN (1,2,3))
  = P(a = 1 AND b = 1) + P(a = 1 AND b = 2) + P(a = 1 AND b = 3)

  mcv_basesel = base_freq(a = 1 AND b IN (1,2,3))
  = P(a = 1) * (P(b = 1) + P(b = 2) + P(b = 3))

So how is that useful? Well, returning to the quantity that we're
actually trying to compute, it can be split into MCV and non-MCV
parts, and since they're mutually exclusive possibilities, their
probabilities just add up. Thus we can write:

  P(a = 1 AND b >= 0 AND b <= 10)

= P(a = 1 AND b IN (1,2,3)) -- MCV part
+ P(a = 1 AND b >= 0 AND b <= 10 AND b NOT IN (1,2,3))  -- non-MCV part

= mcv_sel + other_sel

So the first term is easy -- it's just mcv_sel, from above. The second
term is trickier though, since we have no information about the
correlation between a and b in the non-MCV region. Just about the best
we can do is assume that they're independent, which gives:

  other_sel = P(a = 1 AND b >= 0 AND b <= 10 AND b NOT IN (1,2,3))
   ~= P(a = 1) * P(b >= 0 AND b <= 10 AND b NOT IN (1,2,3))

and that can now be written in terms of things that we know

  other_sel ~= P(a = 1) * P(b >= 0 AND b <= 10 AND b NOT IN (1,2,3))

 = P(a = 1) * P(b >= 0 AND b <= 10)
 - P(a = 1) * P(b IN (1,2,3))  -- mutually exclusive possibilities

 = simple_sel - mcv_basesel

So, as I said above, (simple_selectivity - base_selectivity) is an
estimate for the part not covered by the MCV list.


Another way to look at it is to split the original per-column estimate
up into MCV and non-MCV parts, and correct the MCV part using the MCV
stats:

  simple_sel = P(a = 1) * P(b >= 0 AND b <= 10)

 = P(a = 1) * P(b IN (1,2,3))
 + P(a = 1) * P(b >= 0 AND b <= 10 AND b NOT IN (1,2,3))

The first term is just mcv_basesel, so we can define other_sel to be
the other term, giving

  simple_sel = mcv_basesel  -- MCV part
 + other_sel-- non-MCV part

Clearly mcv_basesel isn't the best estimate for the MCV part, and it
should really be mcv_sel, so we can improve upon simple_sel by
applying a correction of (mcv_sel - basesel) to it:

  better estimate = simple_sel + (mcv_sel - mcv_basesel)
  = mcv_sel + other_sel

(where other_sel = simple_sel - mcv_basesel)

Of course, that's totally equivalent, but looking at it this way
(mcv_selectivity - base_selectivity) can be seen as a correction for
the part covered by the MCV list.


All of that generalises to arbitrary clauses, because the matching
items in the MCV list are independent possibilities that sum up, and
the MCV and non-MCV parts are mutually exclusive. That's also why the
basesel calculation in mcv_clauselist_selectivity() must only include
matching MCV items, and the following XXX comment is wrong:

+   for (i = 0; i < mcv->nitems; i++)
+   {
+   *totalsel += mcv->items[i]->frequency;
+
+   if (matches[i] != STATS_MATCH_NONE)
+   {
+   /* XXX Shouldn't the basesel be outside the if condition? */
+   *basesel += mcv->items[i]->base_frequency;
+   s += mcv->items[i]->frequency;
+   }
+   }

So I believe that that code is correct, as written.

Regards,
Dean



Re: Should we increase the default vacuum_cost_limit?

2019-03-10 Thread Julien Rouhaud
On Sun, Mar 10, 2019 at 4:47 PM Tom Lane  wrote:
>
> Julien Rouhaud  writes:
> > On Sat, Mar 9, 2019 at 10:04 PM Tom Lane  wrote:
> >> I tried this, and it seems to work pretty well.  The first of the two
> >> attached patches just teaches guc.c to support units for float values,
> >> incidentally allowing "us" as an input unit for time-based GUCs.
>
> > Why not allow third-party extensions to declare a GUC stored in us?
>
> I think that adding a new base unit type (GUC_UNIT_US) is possible but
> I'm disinclined to do it on the basis of zero evidence that it's needed.
> Only three of the five already-known time units are allowed to be base
> units (ms, s, min are but d and h aren't) so it's not like there's no
> precedent for excluding this one.  Anyway, such a patch would be mostly
> orthogonal to what I've done here, so it should be considered on its
> own merits.
> (BTW, if we're expecting to have GUCs that are meant to measure only
> very short time intervals, maybe it'd be more forward-looking for
> their base unit to be NS not US.)

That's fair.

> >> 2. It's always bugged me that we don't allow fractional unit
> >> specifications, say "0.1GB", even for GUCs that are integers underneath.
> >> That would be a simple additional change on top of this, but I didn't
> >> do it here.
>
> > It annoyed me multiple times, so +1 for making that happen.
>
> OK, will do.

Thanks!



Re: Should we increase the default vacuum_cost_limit?

2019-03-10 Thread Tom Lane
Julien Rouhaud  writes:
> On Sat, Mar 9, 2019 at 10:04 PM Tom Lane  wrote:
>> I tried this, and it seems to work pretty well.  The first of the two
>> attached patches just teaches guc.c to support units for float values,
>> incidentally allowing "us" as an input unit for time-based GUCs.

> Why not allow third-party extensions to declare a GUC stored in us?

I think that adding a new base unit type (GUC_UNIT_US) is possible but
I'm disinclined to do it on the basis of zero evidence that it's needed.
Only three of the five already-known time units are allowed to be base
units (ms, s, min are but d and h aren't) so it's not like there's no
precedent for excluding this one.  Anyway, such a patch would be mostly
orthogonal to what I've done here, so it should be considered on its
own merits.

(BTW, if we're expecting to have GUCs that are meant to measure only
very short time intervals, maybe it'd be more forward-looking for
their base unit to be NS not US.)

>> 2. It's always bugged me that we don't allow fractional unit
>> specifications, say "0.1GB", even for GUCs that are integers underneath.
>> That would be a simple additional change on top of this, but I didn't
>> do it here.

> It annoyed me multiple times, so +1 for making that happen.

OK, will do.
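
(As a purely hypothetical illustration of what that would allow once
both changes are in:

  SET work_mem = '0.1GB';           -- integer GUC, fractional unit spec
  SET vacuum_cost_delay = '0.5ms';  -- float GUC, sub-millisecond delay
)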

regards, tom lane



Re: Should we add GUCs to allow partition pruning to be disabled?

2019-03-10 Thread Justin Pryzby
On Sun, Mar 10, 2019 at 10:53:02PM +1300, David Rowley wrote:
> On Fri, 11 May 2018 at 17:37, Amit Langote  
> wrote:
> > 5. The last sentence in caveats, that is,
> >
> > "Partitioning using these techniques will work well with up to perhaps a
> > hundred partitions; don't try to use many thousands of partitions."
> >
> > should perhaps be reworded as:
> >
> > "So the legacy inheritance based partitioning will work well with up to
> > perhaps a hundred partitions; don't try to use many thousands of 
> > partitions."

> In the -general post, I was just about to point them at the part in
> the documents that warn against these large partition hierarchies, but
> it looks like the warning was removed in bebc46931a1, or at least
> modified to say that constraint exclusion with inheritance tables is
> slow. I really wonder if we shouldn't put something back in there to
> warn against this sort of thing.

+1

I believe I was of the same mind when I wrote:
https://www.postgresql.org/message-id/flat/20180525215002.GD14378%40telsasoft.com#c9de33b17fe63cecad4ac30fb1662531

Justin

PS. Sorry to dredge up another 10 month old thread..



Re: WIP: Avoid creation of the free space map for small tables

2019-03-10 Thread John Naylor
On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila  wrote:
> Few minor comments:
> 1.
> warning C4715: 'new_cluster_needs_fsm': not all control paths return a value
>
> Getting this new warning in the patch.

Hmm, I don't get that in a couple versions of gcc. Your compiler must
not know that pg_fatal() cannot return. I blindly added a fix.

> 2.
>
> This comment line doesn't seem to be related to this patch.  If so, I
> think we can avoid having any additional change which is not related
> to the functionality of this patch.  Feel free to submit it
> separately, if you think it is an improvement.

> +
> + /* Transfer any VM files if we can trust their contents. */
>   if (vm_crashsafe_match)

Well, I guess the current comment is still ok, so reverted. If I were
to do a separate cleanup patch, I would rather remove the
vm_must_add_frozenbit parameter -- there's no reason I can see for
calls that transfer the heap and FSM to know about this.

I also changed references to the 'first segment of the main fork'
where there will almost always only be one segment. This was a vestige
of the earlier algorithm I had.

> 3. Can we add a note about this in the Notes section of pg_upgrade
> documentation [1]?

Done.

> Have you done any performance testing of this patch?  I mean to say
> now that we added a new stat call for each table, we should see if
> that has any impact.  Ideally, that should be compensated by the fact
> that we are now not transferring *fsm files for small relations.  How
> about constructing a test where all relations are greater than 4 pages
> and then try to upgrade them.  We can check for a cluster with a
> different number of relations say 10K, 20K, 50K, 100K.

I have not, but I agree it should be done. I will try to do so soon.
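
(For the record, one quick way to generate such a cluster is something
along these lines -- the table count and sizes are arbitrary, each table
just needs to exceed the 4-page threshold so that none of the FSMs are
skipped:

DO $$
BEGIN
  FOR i IN 1..10000 LOOP
    EXECUTE format('CREATE TABLE fsm_test_%s (id int, pad text)', i);
    EXECUTE format(
      'INSERT INTO fsm_test_%s SELECT g, repeat(''x'', 300) FROM generate_series(1, 200) g', i);
  END LOOP;
END $$;
)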

-- 
John Naylor                 https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From b65084daf15d198c9fdd986992b0147290c8e690 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Sun, 10 Mar 2019 22:01:21 +0800
Subject: [PATCH v23] During pg_upgrade, conditionally skip transfer of FSMs.

If a heap on the old cluster has 4 pages or fewer, and the old cluster
was PG v11 or earlier, don't copy or link the FSM. This will shrink
space usage for installations with large numbers of small tables.
---
 doc/src/sgml/ref/pgupgrade.sgml  |  7 
 src/bin/pg_upgrade/info.c| 16 +++--
 src/bin/pg_upgrade/pg_upgrade.h  |  6 
 src/bin/pg_upgrade/relfilenode.c | 58 +++-
 4 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7e1afaf0a5..c896882dd1 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -792,6 +792,13 @@ psql --username=postgres --file=script.sql postgres
is down.
   
 
+  
+   In PostgreSQL 12 and later small tables by
+   default don't have a free space map, as a space optimization.  If you are
+   upgrading a pre-12 cluster, the free space maps of small tables will
+   likewise not be transferred to the new cluster.  
+  
+
  
 
  
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index 2f925f086c..902bfc647e 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -200,6 +200,8 @@ create_rel_filename_map(const char *old_data, const char *new_data,
 
 	map->old_db_oid = old_db->db_oid;
 	map->new_db_oid = new_db->db_oid;
+	map->relpages = old_rel->relpages;
+	map->relkind = old_rel->relkind;
 
 	/*
 	 * old_relfilenode might differ from pg_class.oid (and hence
@@ -418,6 +420,7 @@ get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo)
 	char	   *nspname = NULL;
 	char	   *relname = NULL;
 	char	   *tablespace = NULL;
+	char	   *relkind = NULL;
 	int			i_spclocation,
 i_nspname,
 i_relname,
@@ -425,7 +428,9 @@ get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo)
 i_indtable,
 i_toastheap,
 i_relfilenode,
-i_reltablespace;
+i_reltablespace,
+i_relpages,
+i_relkind;
 	char		query[QUERY_ALLOC];
 	char	   *last_namespace = NULL,
 			   *last_tablespace = NULL;
@@ -494,7 +499,7 @@ get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo)
 	 */
 	snprintf(query + strlen(query), sizeof(query) - strlen(query),
 			 "SELECT all_rels.*, n.nspname, c.relname, "
-			 "  c.relfilenode, c.reltablespace, %s "
+			 "  c.relfilenode, c.reltablespace, c.relpages, c.relkind, %s "
 			 "FROM (SELECT * FROM regular_heap "
 			 "  UNION ALL "
 			 "  SELECT * FROM toast_heap "
@@ -525,6 +530,8 @@ get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo)
 	i_relname = PQfnumber(res, "relname");
 	i_relfilenode = PQfnumber(res, "relfilenode");
 	i_reltablespace = PQfnumber(res, "reltablespace");
+	i_relpages = PQfnumber(res, "relpages");
+	i_relkind = PQfnumber(res, "relkind");
 	i_spclocation = PQfnumber(res, "spclocation");
 
 	for (relnum = 0; relnum < ntups; relnum++)
@@ -556,6 +563,11 @@ get_rel_infos(ClusterInfo *cluster, 

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-10 Thread Heikki Linnakangas

On 08/03/2019 23:21, Peter Geoghegan wrote:

On Fri, Mar 8, 2019 at 10:48 AM Peter Geoghegan  wrote:

All of that said, maybe it would be clearer if page deletion was not
the special case that has to opt in to minusinfkey semantics. Perhaps
it would make more sense for everyone else to opt out of minusinfkey
semantics, and to get the !minusinfkey optimization as a result of
that. I only did it the other way around because that meant that only
nbtpage.c had to acknowledge the special case.


This seems like a good idea -- we should reframe the !minusinfkey
optimization, without actually changing the behavior. Flip it around.

The minusinfkey field within the insertion scankey struct would be
called something like "descendrighttrunc" instead. Same idea, but with
the definition inverted. Most _bt_search() callers (all of those
outside of nbtpage.c and amcheck) would be required to opt in to that
optimization to get it.

Under this arrangement, nbtpage.c/page deletion would not ask for the
"descendrighttrunc" optimization, and would therefore continue to do
what it has always done: find the first leaf page that its insertion
scankey values could be on (we don't lie about searching for negative
infinity, or having a negative infinity sentinel value in scan key).
The only difference for page deletion between v3 indexes and v4
indexes is that with v4 indexes we'll relocate the same leaf page
reliably, since every separator key value is guaranteed to be unique
on its level (including the leaf level/leaf high keys). This is just a
detail, though, and not one that's even worth pointing out; we're not
*relying* on that being true on v4 indexes anyway (we check that the
block number is a match too, which is strictly necessary for v3
indexes and seems like a good idea for v4 indexes).

This is also good because it makes it clear that the unique index code
within _bt_doinsert() (that temporarily sets scantid to NULL) benefits
from the descendrighttrunc/!minusinfkey optimization -- it should be
"honest" and ask for it explicitly. We can make _bt_doinsert() opt in
to the optimization for unique indexes, but not for other indexes,
where scantid is set from the start. The
descendrighttrunc/!minusinfkey optimization cannot help when scantid
is set from the start, because we'll always have an attribute value in
insertion scankey that breaks the tie for us instead. We'll always
move right of a heap-TID-truncated separator key whose untruncated
attributes are all equal to a prefix of our insertion scankey values.

(This _bt_doinsert() descendrighttrunc/!minusinfkey optimization for
unique indexes matters more than you might think -- we do really badly
with things like Zipfian distributions currently, and reducing the
contention goes some way towards helping with that. Postgres pro
noticed this a couple of years back, and analyzed it in detail at that
time. It's really nice that we very rarely have to move right within
code like _bt_check_unique() and _bt_findsplitloc() with the patch.)

Does that make sense to you? Can you live with the name
"descendrighttrunc", or do you have a better one?


"descendrighttrunc" doesn't make much sense to me, either. I don't 
understand it. Maybe a comment would make it clear, though.


I don't feel like this is an optimization. It's a natural consequence of 
what the high key means. I guess you can think of it as an optimization, 
in the same way that not fully scanning the whole index for every search 
is an optimization, but that's not how I think of it :-).


If we don't flip the meaning of the flag, then maybe calling it 
something like "searching_for_leaf_page" would make sense:


/*
 * When set, we're searching for the leaf page with the given high key,
 * rather than leaf tuples matching the search keys.
 *
 * Normally, when !searching_for_pivot_tuple, if a page's high key
 * has truncated columns, and the columns that are present are equal to
 * the search key, the search will not descend to that page. For
 * example, if an index has two columns, and a page's high key is
 * ("foo", ), and the search key is also ("foo," ),
 * the search will not descend to that page, but its right sibling. The
 * omitted column in the high key means that all tuples on the page must
 * be strictly < "foo", so we don't need to visit it. However, sometimes
 * we perform a search to find the parent of a leaf page, using the leaf
 * page's high key as the search key. In that case, when we search for
 * ("foo", ), we do want to land on that page, not its right
 * sibling.
 */
bool		searching_for_leaf_page;


As the patch stands, you're also setting minusinfkey when dealing with 
v3 indexes. I think it would be better to only set 
searching_for_leaf_page in nbtpage.c. In general, I think BTScanInsert 
should describe the search key, regardless of whether it's a V3 or V4 
index. Properties of the index belong elsewhere. (We're violating that 
by storing the 'heapkeyspace' flag in BTScanInsert. That wart is 

Re: Batch insert in CTAS/MatView code

2019-03-10 Thread Paul Guo
Sorry for the late reply.

To Michael: thank you. I know this commitfest is ongoing and I'm not
targeting it with this patch.

On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas  wrote:

> On 06/03/2019 22:06, Paul Guo wrote:
> > The patch also modifies heap_multi_insert() a bit to do a bit further
> > code-level optimization by using static memory, instead of using memory
> > context and dynamic allocation.
>
> If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
> That is still leaked to the current memory context.
>
>
Thanks. I checked the code for that, but apparently I missed that one. I'll
see what the proper context for CTAS should be. For the COPY code, maybe
I'll just revert my change.
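
(For reference, a rough sketch of the short-lived-context pattern
suggested in the quoted text below -- the helper name and its exact
signature are made up, and the context would be created once with
AllocSetContextCreate(CurrentMemoryContext, "CTAS batch context",
ALLOCSET_DEFAULT_SIZES):

static void
ctas_flush_batch(Relation rel, HeapTuple *tuples, int ntuples,
				 CommandId cid, int options, BulkInsertState bistate,
				 MemoryContext batchcontext)
{
	/* let anything palloc'd while inserting/toasting go into batchcontext */
	MemoryContext oldcontext = MemoryContextSwitchTo(batchcontext);

	heap_multi_insert(rel, tuples, ntuples, cid, options, bistate);

	MemoryContextSwitchTo(oldcontext);

	/* discard per-batch allocations, including any toasted tuple copies */
	MemoryContextReset(batchcontext);
}
)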



> Leaking into the current memory context is not a bad thing, because
> resetting a memory context is faster than doing a lot of pfree() calls.
> The callers just need to be prepared for that, and use a short-lived
> memory context.
>
> > By the way, while looking at the code, I noticed that there are 9 local
> > arrays with large length in toast_insert_or_update() which seems to be a
> > risk of stack overflow. Maybe we should put it as static or global.
>
> Hmm. We currently reserve 512 kB between the kernel's limit, and the
> limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
> arrays add up to 52800 bytes on a 64-bit machine, if I did my math
> right. So there's still a lot of headroom. I agree that it nevertheless
> seems a bit excessive, though.
>

I was worried about it possibly being called recursively, but there is
probably no such concern for toast_insert_or_update().


> > With the patch,
> >
> > Time: 4728.142 ms (00:04.728)
> > Time: 14203.983 ms (00:14.204)
> > Time: 1008.669 ms (00:01.009)
> >
> > Baseline,
> > Time: 11096.146 ms (00:11.096)
> > Time: 13106.741 ms (00:13.107)
> > Time: 1100.174 ms (00:01.100)
>
> Nice speedup!
>
> > While for toast and large column size there is < 10% decrease but for
> > small column size the improvement is super good. Actually if I hardcode
> > the batch count as 4 all test cases are better but the improvement for
> > small column size is smaller than that with current patch. Pretty much
> > the number 4 is quite case specific so I can not hardcode that in the
> > patch. Of course we could further tune that but the current value seems
> > to be a good trade-off?
>
> Have you done any profiling, on why the multi-insert is slower with
> large tuples? In principle, I don't see why it should be slower.
>

Thanks for the suggestion. I'll explore this a bit more.


>
> - Heikki
>


Re: Add missing CREATE TABLE IF NOT EXISTS table_name AS EXECUTE query;

2019-03-10 Thread Ramanarayana
Hi,

Would it be helpful if the comment for ExecuteStmt in gram.y were
updated to include the IF NOT EXISTS clause? Something like this:

CREATE TABLE  [IF NOT EXISTS] AS EXECUTE  [(params, ...)]
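
i.e. making it clearer that, with the patch, something like the
following is accepted (the statement and table names below are just
examples):

PREPARE q AS SELECT 1 AS x;
CREATE TABLE IF NOT EXISTS t AS EXECUTE q;
CREATE TABLE IF NOT EXISTS t AS EXECUTE q;  -- second run: NOTICE, table left as-is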

Regards,
Ram.


Re: Should we increase the default vacuum_cost_limit?

2019-03-10 Thread Julien Rouhaud
On Sat, Mar 9, 2019 at 11:14 PM Tom Lane  wrote:
>
> BTW ... I noticed while fooling with this that GUC's out-of-range
> messages can be confusing:
>
> regression=# set vacuum_cost_delay = '1s';
> ERROR:  1000 is outside the valid range for parameter "vacuum_cost_delay" (0 
> .. 100)
>
> One's immediate reaction to that is "I put in 1, not 1000".  I think
> it'd be much clearer if we included the unit we'd converted to, thus:
>
> ERROR:  1000 ms is outside the valid range for parameter "vacuum_cost_delay" 
> (0 .. 100)
>
> (Notice that this also implicitly tells what units the range limits
> are being quoted in.

I like it!

> A small problem with this idea is that GUC_UNIT_[X]BLOCK variables don't
> really have a natural unit name.  If we follow the lead of pg_settings,
> such errors would look something like
>
> ERROR:  1000 8kB is outside the valid range for ...
>
> I can't think of a better idea, though, and it'd still be clearer than
> what happens now.
>
> Barring objections I'll go make this happen.

No objection here.



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2019-03-10 Thread Dean Rasheed
On Sat, 9 Mar 2019 at 18:33, Dean Rasheed  wrote:
>
> On Thu, 28 Feb 2019 at 19:56, Tomas Vondra  
> wrote:
> > Attached is an updated version of this patch series.
>
> Here are some random review comments. I'll add more later, but I'm out
> of energy for today.
>

Here are some more comments:

11). In dependency_degree():

-/* sort the items so that we can detect the groups */
-qsort_arg((void *) items, numrows, sizeof(SortItem),
-  multi_sort_compare, mss);
+/*
+ * build an array of SortItem(s) sorted using the multi-sort support
+ *
+ * XXX This relies on all stats entries pointing to the same tuple
+ * descriptor. Not sure if that might not be the case.
+ */
+items = build_sorted_items(numrows, rows, stats[0]->tupDesc,
+   mss, k, attnums_dep);

That XXX comment puzzled me for a while. Actually it's OK, unless/until
we try to support stats across multiple relations, which will require a
much larger refactoring of this code. For now, the stats entries all
point to the same tuple descriptor from the onerel passed to
BuildRelationExtStatistics(), so it's OK to just use the first tuple
descriptor in this way. The comment should be updated to explain that.

12). bms_member_index() should surely be in bitmapset.c. It could be
more efficient by just traversing the bitmap words and making use of
bmw_popcount(). Also, its second argument should be of type 'int' for
consistency with other bms_* functions.
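
For illustration, a sketch of roughly what that could look like in
bitmapset.c (using the existing WORDNUM/BITNUM macros; the details of
the eventual implementation may of course differ):

/*
 * bms_member_index
 *		Return the 0-based position of 'x' among the members of 'a'
 *		(i.e. the number of members smaller than 'x'), or -1 if 'x'
 *		is not a member.
 */
int
bms_member_index(Bitmapset *a, int x)
{
	int			wordnum = WORDNUM(x);
	int			bitnum = BITNUM(x);
	bitmapword	mask;
	int			result = 0;
	int			i;

	if (!bms_is_member(x, a))
		return -1;

	/* count the set bits in all words wholly before the one containing 'x' */
	for (i = 0; i < wordnum; i++)
		result += bmw_popcount(a->words[i]);

	/* ... plus the set bits below 'x' within its own word */
	mask = ((bitmapword) 1 << bitnum) - 1;
	result += bmw_popcount(a->words[wordnum] & mask);

	return result;
}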

13). estimate_ndistinct() has been moved from mvdistinct.c to
extended_stats.c and changed from static to extern, but it is only
called from mvdistinct.c, so that change is unnecessary (at least as
far as this patch is concerned).

14). The attnums Bitmapset passed to
statext_is_compatible_clause_internal() is an input/output argument
that it updates. That should probably be documented. When it calls
itself recursively for AND/OR/NOT clauses, it could just pass the
original Bitmapset through to be updated (rather than creating a new
one and merging it), as it does for other types of clause.

On the other hand, the outer function statext_is_compatible_clause()
does need to return a new bitmap, which may or may not be used by its
caller, so it would be cleaner to make that a strictly "out" parameter
and initialise it to NULL in that function, rather than in its caller.

15). As I said yesterday, I don't think that there is a clean
separation of concerns between the functions clauselist_selectivity(),
statext_clauselist_selectivity(),
dependencies_clauselist_selectivity() and
mcv_clauselist_selectivity().  I think things could be re-arranged as
follows:

statext_clauselist_selectivity() - as the name suggests - should take
care of *all* extended stats estimation, not just MCVs and histograms.
So make it a fairly small function, calling
mcv_clauselist_selectivity() and
dependencies_clauselist_selectivity(), and histograms when that gets
added.

Most of the current code in statext_clauselist_selectivity() is really
MCV-specific, so move that to mcv_clauselist_selectivity(). Amongst
other things, that would move the call to choose_best_statistics() to
mcv_clauselist_selectivity() (just as
dependencies_clauselist_selectivity() calls choose_best_statistics()
to get the best dependencies statistics). Then, when histograms are
added later, you won't have the problem pointed out before where it
can't apply both MCV and histogram stats if they're on different
STATISTICS objects.

Most of the comments for statext_clauselist_selectivity() are also
MCV-related. Those too would move to mcv_clauselist_selectivity().

Regards,
Dean



Re: Should we increase the default vacuum_cost_limit?

2019-03-10 Thread Julien Rouhaud
On Sat, Mar 9, 2019 at 10:04 PM Tom Lane  wrote:
>
> I tried this, and it seems to work pretty well.  The first of the two
> attached patches just teaches guc.c to support units for float values,
> incidentally allowing "us" as an input unit for time-based GUCs.

Why not allow third-party extensions to declare a GUC stored in us?
 We need a backward-compatible format for the vacuum settings, but I don't
see a good reason to force external extensions to do the same, and it
wouldn't require much extra work.

> The second converts [autovacuum_]cost_delay to float GUCs, and changes
> the default value for autovacuum_cost_delay from 20ms to 2ms.
> We'd want to revert the previous patch that changed the default value
> of vacuum_cost_limit, else this means a 100x not 10x change in the
> default autovac speed ... but I've not included that in this patch.

Otherwise everything looks good to me.  BTW the patches didn't apply
cleanly with git-apply, but patch -p1 didn't complain.

> 2. It's always bugged me that we don't allow fractional unit
> specifications, say "0.1GB", even for GUCs that are integers underneath.
> That would be a simple additional change on top of this, but I didn't
> do it here.

It annoyed me multiple times, so +1 for making that happen.



Re: Checksum errors in pg_stat_database

2019-03-10 Thread Julien Rouhaud
On Sat, Mar 9, 2019 at 7:58 PM Julien Rouhaud  wrote:
>
> On Sat, Mar 9, 2019 at 7:50 PM Magnus Hagander  wrote:
> >
> > On Sat, Mar 9, 2019 at 10:41 AM Julien Rouhaud  wrote:
> >>
> >> Sorry, I have again new comments after a little bit more thinking.
> >> I'm wondering if we can do something about shared objects while we're
> >> at it.  They don't belong to any database, so it's a little bit
> >> orthogonal to this proposal, but it seems quite important to track
> >> error on those too!
> >>
> >> What about adding a new field in PgStat_GlobalStats for that?  We can
> >> use the same lastDir to easily detect such objects and slightly adapt
> >> sendFile again, which seems quite straightforward.
>
> > Question is then what number that should show -- only the checksum counter 
> > in non-database-fields, or the total number across the cluster?
>
> I'd say only for non-database-fields errors, especially if we can
> reset each counters separately.  If necessary, we can add a new view
> to give a global overview of checksum errors for DBA convenience.

So, after reading the current implementation, I don't think that
PgStat_GlobalStats is the right place.  It's already enough of a mess,
and clearly pg_stat_reset_shared('bgwriter') would not make any sense
if it did reset the shared relations' checksum errors (though arguably
the fact that it's resetting checkpointer stats right now is hardly
better), and handling a different reset target for part of the
PgStat_GlobalStats counters would be an ugly kludge.

I'm considering adding a new PgStat_ChecksumStats for that purpose
instead, but I don't know whether it's acceptable to do so in the last
commitfest.  It seems worthwhile to add it eventually, since we'll
probably end up having more things to report to users related to
checksums.  Online enabling of checksums could be the most immediate
potential target.

If that's acceptable, I was thinking this new stat could have the
following fields in a first version:

- number of non-db-related checksum checks done
- number of non-db-related checksum checks failed
- last stats reset

(and adding a counter of the checks done for db-related blocks alongside
the current checksum errors counter).  Maybe also adding a
pg_checksum_stats view that would summarise all the counters in one
place.
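
Roughly, I'm picturing something like this (field names entirely
tentative):

typedef struct PgStat_ChecksumStats
{
	PgStat_Counter checksum_checks;		/* non-db-related blocks verified */
	PgStat_Counter checksum_failures;	/* non-db-related verification failures */
	TimestampTz	stat_reset_timestamp;	/* last stats reset */
} PgStat_ChecksumStats;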

It'll obviously add some traffic to the stats collector, but I'd hope
not too much, since BufferAlloc shouldn't be that frequent, and
BASE_BACKUP reports stats only once per file.



Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)

2019-03-10 Thread Fabien COELHO


Attached is a rebase.

--
Fabien.

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 5df54a8e57..1db6b75823 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6032,6 +6032,96 @@ main(int argc, char **argv)
 	return exit_code;
 }
 
+/* display the progress report */
+static void
+doProgressReport(TState *thread, StatsData *plast, int64 *plast_report,
+ int64 now, int64 thread_start)
+{
+	StatsData	cur;
+	int64		run = now - *plast_report,
+ntx;
+	double		tps,
+total_run,
+latency,
+sqlat,
+lag,
+stdev;
+	char		tbuf[64];
+	int		i;
+
+	/*
+	 * Add up the statistics of all threads.
+	 *
+	 * XXX: No locking. There is no guarantee that we get an
+	 * atomic snapshot of the transaction count and latencies, so
+	 * these figures can well be off by a small amount. The
+	 * progress report's purpose is to give a quick overview of
+	 * how the test is going, so that shouldn't matter too much.
+	 * (If a read from a 64-bit integer is not atomic, you might
+	 * get a "torn" read and completely bogus latencies though!)
+	 */
+	initStats(, 0);
+	for (i = 0; i < nthreads; i++)
+	{
+		mergeSimpleStats(, [i].stats.latency);
+		mergeSimpleStats(, [i].stats.lag);
+		cur.cnt += thread[i].stats.cnt;
+		cur.skipped += thread[i].stats.skipped;
+	}
+
+	/* we count only actually executed transactions */
+	ntx = (cur.cnt - cur.skipped) - (plast->cnt - plast->skipped);
+	total_run = (now - thread_start) / 100.0;
+	tps = 100.0 * ntx / run;
+	if (ntx > 0)
+	{
+		latency = 0.001 * (cur.latency.sum - plast->latency.sum) / ntx;
+		sqlat = 1.0 * (cur.latency.sum2 - plast->latency.sum2) / ntx;
+		stdev = 0.001 * sqrt(sqlat - 100.0 * latency * latency);
+		lag = 0.001 * (cur.lag.sum - plast->lag.sum) / ntx;
+	}
+	else
+	{
+		latency = sqlat = stdev = lag = 0;
+	}
+
+	if (progress_timestamp)
+	{
+		/*
+		 * On some platforms the current system timestamp is
+		 * available in now_time, but rather than get entangled
+		 * with that, we just eat the cost of an extra syscall in
+		 * all cases.
+		 */
+		struct timeval tv;
+
+		gettimeofday(, NULL);
+		snprintf(tbuf, sizeof(tbuf), "%ld.%03ld s",
+ (long) tv.tv_sec, (long) (tv.tv_usec / 1000));
+	}
+	else
+	{
+		/* round seconds are expected, but the thread may be late */
+		snprintf(tbuf, sizeof(tbuf), "%.1f s", total_run);
+	}
+
+	fprintf(stderr,
+			"progress: %s, %.1f tps, lat %.3f ms stddev %.3f",
+			tbuf, tps, latency, stdev);
+
+	if (throttle_delay)
+	{
+		fprintf(stderr, ", lag %.3f ms", lag);
+		if (latency_limit)
+			fprintf(stderr, ", " INT64_FORMAT " skipped",
+	cur.skipped - plast->skipped);
+	}
+	fprintf(stderr, "\n");
+
+	*plast = cur;
+	*plast_report = now;
+}
+
 static void *
 threadRun(void *arg)
 {
@@ -6042,6 +6132,7 @@ threadRun(void *arg)
 	int			nstate = thread->nstate;
 	int			remains = nstate;	/* number of remaining clients */
 	socket_set *sockets = alloc_socket_set(nstate);
+	int			aborted = 0;
 	int			i;
 
 	/* for reporting progress: */
@@ -6271,6 +6362,10 @@ threadRun(void *arg)
 			 */
 			if (st->state == CSTATE_FINISHED || st->state == CSTATE_ABORTED)
 remains--;
+
+			/* count aborted clients */
+			if (st->state == CSTATE_ABORTED)
+aborted++;
 		}
 
 		/* progress report is made by thread 0 for all threads */
@@ -6283,93 +6378,15 @@ threadRun(void *arg)
 			now = INSTR_TIME_GET_MICROSEC(now_time);
 			if (now >= next_report)
 			{
-/* generate and show report */
-StatsData	cur;
-int64		run = now - last_report,
-			ntx;
-double		tps,
-			total_run,
-			latency,
-			sqlat,
-			lag,
-			stdev;
-char		tbuf[315];
-
-/*
- * Add up the statistics of all threads.
- *
- * XXX: No locking. There is no guarantee that we get an
- * atomic snapshot of the transaction count and latencies, so
- * these figures can well be off by a small amount. The
- * progress report's purpose is to give a quick overview of
- * how the test is going, so that shouldn't matter too much.
- * (If a read from a 64-bit integer is not atomic, you might
- * get a "torn" read and completely bogus latencies though!)
- */
-initStats(, 0);
-for (i = 0; i < nthreads; i++)
-{
-	mergeSimpleStats(, [i].stats.latency);
-	mergeSimpleStats(, [i].stats.lag);
-	cur.cnt += thread[i].stats.cnt;
-	cur.skipped += thread[i].stats.skipped;
-}
-
-/* we count only actually executed transactions */
-ntx = (cur.cnt - cur.skipped) - (last.cnt - last.skipped);
-total_run = (now - thread_start) / 100.0;
-tps = 100.0 * ntx / run;
-if (ntx > 0)
-{
-	latency = 0.001 * (cur.latency.sum - last.latency.sum) / ntx;
-	sqlat = 1.0 * (cur.latency.sum2 - last.latency.sum2) / ntx;
-	stdev = 0.001 * sqrt(sqlat - 100.0 * latency * latency);
-	lag = 0.001 * (cur.lag.sum - last.lag.sum) / ntx;
-}
-else
-{
-	latency = sqlat = stdev = lag = 0;
-}
-
-

Re: using index or check in ALTER TABLE SET NOT NULL

2019-03-10 Thread Sergei Kornilov
Hi

> Providing I'm imagining it correctly, I do think the patch is slightly
> cleaner with that change.

Ok, sounds reasonable. I changed the patch this way.
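
(For context, the case the patch optimizes, using a throwaway example
table -- the final command should be able to skip the full-table scan
because the already-validated CHECK constraint proves no NULLs exist:

CREATE TABLE t (a int);
ALTER TABLE t ADD CONSTRAINT t_a_not_null CHECK (a IS NOT NULL);
ALTER TABLE t ALTER COLUMN a SET NOT NULL;
)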

regards, Sergei

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 890b23afd6..926b3361ea 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -214,8 +214,17 @@ WITH ( MODULUS numeric_literal, REM
 
  
   These forms change whether a column is marked to allow null
-  values or to reject null values.  You can only use SET
-  NOT NULL when the column contains no null values.
+  values or to reject null values.
+ 
+
+ 
+  SET NOT NULL may only be applied to a column
+  providing none of the records in the table contain a
+  NULL value for the column.  Ordinarily this is
+  checked during the ALTER TABLE by scanning the
+  entire table, however if a valid CHECK constraint is
+  found which proves no NULL can exist, then the
+  table scan is skipped.
  
 
  
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 59341e2a40..5200aac530 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -160,7 +160,7 @@ typedef struct AlteredTableInfo
 	/* Information saved by Phases 1/2 for Phase 3: */
 	List	   *constraints;	/* List of NewConstraint */
 	List	   *newvals;		/* List of NewColumnValue */
-	bool		new_notnull;	/* T if we added new NOT NULL constraints */
+	bool		verify_new_notnull; /* T if we should recheck NOT NULL */
 	int			rewrite;		/* Reason for forced rewrite, if any */
 	Oid			newTableSpace;	/* new tablespace; 0 means no change */
 	bool		chgPersistence; /* T if SET LOGGED/UNLOGGED is used */
@@ -372,6 +372,9 @@ static ObjectAddress ATExecDropNotNull(Relation rel, const char *colName, LOCKMO
 static void ATPrepSetNotNull(Relation rel, bool recurse, bool recursing);
 static ObjectAddress ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
  const char *colName, LOCKMODE lockmode);
+static bool NotNullImpliedByRelConstraints(Relation rel, Form_pg_attribute attr);
+static bool ConstraintImpliedByRelConstraint(Relation scanrel,
+	 List *partConstraint, List *existedConstraints);
 static ObjectAddress ATExecColumnDefault(Relation rel, const char *colName,
 	Node *newDefault, LOCKMODE lockmode);
 static ObjectAddress ATExecAddIdentity(Relation rel, const char *colName,
@@ -4550,10 +4553,11 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode)
 		else
 		{
 			/*
-			 * Test the current data within the table against new constraints
-			 * generated by ALTER TABLE commands, but don't rebuild data.
+			 * If required, test the current data within the table against new
+			 * constraints generated by ALTER TABLE commands, but don't rebuild
+			 * data.
 			 */
-			if (tab->constraints != NIL || tab->new_notnull ||
+			if (tab->constraints != NIL || tab->verify_new_notnull ||
 tab->partition_constraint != NULL)
 ATRewriteTable(tab, InvalidOid, lockmode);
 
@@ -4714,13 +4718,13 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 	}
 
 	notnull_attrs = NIL;
-	if (newrel || tab->new_notnull)
+	if (newrel || tab->verify_new_notnull)
 	{
 		/*
-		 * If we are rebuilding the tuples OR if we added any new NOT NULL
-		 * constraints, check all not-null constraints.  This is a bit of
-		 * overkill but it minimizes risk of bugs, and heap_attisnull is a
-		 * pretty cheap test anyway.
+		 * If we are rebuilding the tuples OR if we added any new but not
+		 * verified NOT NULL constraints, check all not-null constraints.
+		 * This is a bit of overkill but it minimizes risk of bugs, and
+		 * heap_attisnull is a pretty cheap test anyway.
 		 */
 		for (i = 0; i < newTupDesc->natts; i++)
 		{
@@ -5726,11 +5730,9 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel,
 		{
 			/*
 			 * If the new column is NOT NULL, and there is no missing value,
-			 * tell Phase 3 it needs to test that. (Note we don't do this for
-			 * an OID column.  OID will be marked not null, but since it's
-			 * filled specially, there's no need to test anything.)
+			 * tell Phase 3 it needs to check for NULLs.
 			 */
-			tab->new_notnull |= colDef->is_not_null;
+			tab->verify_new_notnull |= colDef->is_not_null;
 		}
 	}
 
@@ -6098,8 +6100,19 @@ ATExecSetNotNull(AlteredTableInfo *tab, Relation rel,
 
 		CatalogTupleUpdate(attr_rel, >t_self, tuple);
 
-		/* Tell Phase 3 it needs to test the constraint */
-		tab->new_notnull = true;
+		/*
+		 * Ordinarily phase 3 must ensure that no NULLs exist in columns that
+		 * are set NOT NULL, however, if we can find a constraint which proves
+		 * this then we can skip that.  However, we needn't bother looking if
+		 * we've already found that we must verify some other NOT NULL
+		 * constraint.
+		 */
+		if (!tab->verify_new_notnull &&
+			

Re: Performance issue in foreign-key-aware join estimation

2019-03-10 Thread David Rowley
On Sun, 10 Mar 2019 at 21:34, David Rowley  wrote:
> I started looking at this again and I came up with the attached
> eclass_indexes_v2.patch.

The CF bot didn't like v2. It warned about an uninitialized variable.
My compiler didn't.

Here's v3, hopefully with that fixed.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


eclass_indexes_v3.patch
Description: Binary data


Re: Should we add GUCs to allow partition pruning to be disabled?

2019-03-10 Thread David Rowley
On Fri, 11 May 2018 at 17:37, Amit Langote
 wrote:
> 5. The last sentence in caveats, that is,
>
> "Partitioning using these techniques will work well with up to perhaps a
> hundred partitions; don't try to use many thousands of partitions."
>
> should perhaps be reworded as:
>
> "So the legacy inheritance based partitioning will work well with up to
> perhaps a hundred partitions; don't try to use many thousands of partitions."

(digging up 10-month-old thread [1])

There was a report [2] on -general today where someone had a 4000-partition
partitioned table and was complaining about memory
consumption in the planner during DELETE.  They didn't mention the
exact version they were using, but mentioned that the problem exists
on 10, 11 and master.  Of course, we're well aware of this issue with
DELETE and UPDATE of large partition hierarchies, Amit has been
working hard with trying to solve it for PG12.

In the -general post, I was just about to point them at the part in
the documents that warn against these large partition hierarchies, but
it looks like the warning was removed in bebc46931a1, or at least
modified to say that constraint exclusion with inheritance tables is
slow. I really wonder if we shouldn't put something back in there to
warn against this sort of thing.  It might be a bit late for the
people who've read the docs and done it already, but a warning might
at least stop new people making the mistake.

Hopefully one day we can remove the warning again, but it won't be for PG12.

Thoughts?

[1] 
https://www.postgresql.org/message-id/6bc4e96a-0e30-e9b6-dcc7-791c7486a491%40lab.ntt.co.jp
[2] 
https://www.postgresql.org/message-id/739b7a5e-1192-1011-5aa2-41adad55682d%40perfexpert.ch

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Covering GiST indexes

2019-03-10 Thread Alexander Korotkov
On Fri, Mar 8, 2019 at 11:58 AM Alexander Korotkov
 wrote:
> I made some adjustments for this patch:
>  * Rename tupledesc and tructTupledesc to leafTupdesc and
> nonLeafTupdesc correspondingly.  I think this makes things more clear.
>  * Some improvements to docs and comments.
>  * Revise commit message.
>
> I'm going to push this on no objections.

Pushed with some more small changes in docs.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Performance issue in foreign-key-aware join estimation

2019-03-10 Thread David Rowley
On Sat, 9 Mar 2019 at 13:18, David Rowley  wrote:
> This is something I'd like to look into for v13.  I think it'll be
> much easier to do if we can get your List reimplementation patch in
> first. That's going to allow list_nth on PlannerInfo->eq_classes to
> work quickly and will remove the need for having to build an array to
> store the classes, and as you mention, RelOptInfo could store a
> Bitmapset to store which ECs have members for this rel.   I've
> modified the CF entry to target v13.

I started looking at this again and I came up with the attached
eclass_indexes_v2.patch.  This modifies the original patch
significantly so that we no longer build this big eclass index when we
think we might start to need it.  Instead, I've added a Bitmapset
field to RelOptInfo named "eclass_indexes".  During add_eq_member() we
just record the PlannerInfo->eq_classes index of the eclass we're adding
to in the eclass_indexes of each RelOptInfo involved in the new eclass
member.  This means we can use the Bitmapset lookup any time we like.
That allowed me to get rid of those special
cases I had added to make use of the index when it exists.  I've now
modified all the existing code to make use of the
RelOptInfo->eclass_indexes field.

As mentioned above, my thoughts on this patch were that this new
method would only be any good once we got Tom's list implementation
patch as that makes list_nth() O(1) instead of O(N).  On benchmarking
this I was quite surprised that it still improves performance quite a
bit even without the list reimplementation patch.

Using Tom's test case [1], I get the following:

* master (82a5649fb)
latency average = 6992.320 ms
latency average = 6990.297 ms
latency average = 7030.619 ms

* master + eclass_indexes_v2
latency average = 2537.536 ms
latency average = 2530.824 ms
latency average = 2543.770 ms

If I add in Tom's list reimplementation too, I get:

* master + Tom's list reimplementation v3 + eclass_indexes
latency average = 1225.910 ms
latency average = 1209.565 ms
latency average = 1207.326 ms

I really didn't expect this patch alone to speed this up, since it
calls list_nth() in a loop.  I can only imagine that with this workload
these list_nth() loops don't iterate very many times; otherwise, the
performance would die off quickly. I thought about how we could defend
against workloads where there's a large number of eclasses and most
match a given relation.  I ended up with a small function named
list_skip_forward() that simply advances the given ListCell forward by
N cells.  This does improve performance a little bit more, but, probably
more importantly, it should be a good defence against the worst case.
As is, the function would become completely redundant with Tom's list
reimplementation patch, so for now, I just snuck it in as a static
function in equivclass.c.
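
Roughly, with the current (pre-reimplementation) List API, the helper is
just the following -- the exact signature in the attached patch may
differ:

static ListCell *
list_skip_forward(ListCell *lc, int n)
{
	/* advance 'n' cells, stopping early if the list runs out */
	while (n-- > 0 && lc != NULL)
		lc = lnext(lc);

	return lc;
}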

I've attached this list_skip_forward.patch too.  This patch is a bit
rough around the edges as I only just started playing with it in the
past 30 mins or so.  Perhaps it shows that we might be able to fix
this in PG12 and not have to wait for the list reimplementation at
all.

The performance with both attached patches is:

* master + eclass_indexes + list_skip_forward_v0
latency average = 2375.383 ms
latency average = 2351.450 ms
latency average = 2339.259 ms

Still nowhere near as good as with the list reimplementation patch,
but way better than master.

[1] https://www.postgresql.org/message-id/6970.1545327...@sss.pgh.pa.us

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


eclass_indexes_v2.patch
Description: Binary data


list_skip_forward_v0.patch
Description: Binary data


what makes the PL cursor life-cycle must be in the same transaction?

2019-03-10 Thread Andy Fan
for example:
begin;
declare cur cursor for select * from t;
insert into t2 values(...);
fetch next cur;
commit;

// after this,  I can't fetch cur any more.

My questions are:
1.  Is this required in principle, or is it just easier to implement it this
way in PG?
2.  Would anything bad happen if the named portal (for the cursor) were kept
available even after the transaction commits, so that the cursor could still
be fetched after the transaction is committed?
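
For reference, the closest existing behaviour seems to be a holdable
cursor, which does survive commit (at the cost of materializing the
remaining rows at commit time):

begin;
declare cur cursor with hold for select * from t;
fetch next cur;
commit;
fetch next cur;   -- still works after the commit
close cur;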

Thanks