Re: allow_system_table_mods and DROP RULE

2019-12-19 Thread Peter Eisentraut

On 2019-12-18 16:53, Robert Haas wrote:

On Wed, Dec 18, 2019 at 3:56 AM Peter Eisentraut
 wrote:

As a curious omission, DROP RULE does not check allow_system_table_mods.
   Creating and renaming a rule does, and also creating, renaming, and
dropping a trigger does.  The impact of this is probably nil in
practice, but for consistency we should probably add that.  The patch is
pretty simple.


+1. LGTM.


committed

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Disallow cancellation of waiting for synchronous replication

2019-12-19 Thread Marco Slot
On Fri, Dec 20, 2019 at 6:04 AM Andrey Borodin  wrote:
> I think the proper solution here would be to add a GUC to disallow cancellation
> of waiting for synchronous replication. The retry in step 3 will then wait on the
> locks held by the hanging query 1, and the data will be consistent.
> There is still a problem when the backend is not canceled but terminated [2].
> The ideal solution would be to keep the locks on the changed data. Some well-known
> databases treat termination of synchronous replication as a system failure and
> refuse to operate until standbys appear (see Maximum Protection mode). From
> my point of view it's enough to PANIC once so that the HA tool is informed that
> something is going wrong.

Sending a cancellation is currently the only way to resume after
disabling synchronous replication. Some HA solutions (e.g.
pg_auto_failover) rely on this behaviour. Would it be worth checking
whether synchronous replication is still required?

Marco




Re: Hooks for session start and end, take two

2019-12-19 Thread Michael Paquier
On Fri, Dec 20, 2019 at 02:45:26AM +, tsunakawa.ta...@fujitsu.com wrote:
> I've got interested in this.  What's the current status of this
> patch?  The CF entry shows it was committed.
>
> But I understand that is not the case, because the relevant code doesn't appear
> in HEAD, and the Git log shows that it was reverted.  Am I correct?

The patch has been committed once as of e788bd9, then reverted as of
9555cc8 because it had a couple of fundamental issues and many people
were not happy with it.  The latest discussions point to some more
advanced designs based on callbacks at certain points of a session's
lifetime.  You may want to double-check that first.
--
Michael


signature.asc
Description: PGP signature


RE: Implementing Incremental View Maintenance

2019-12-19 Thread tsunakawa.ta...@fujitsu.com
Hello,


I'm starting to take a closer look at this feature.  I've just finished reading 
the discussion, excluding other referenced materials.

The following IVM wiki page returns an error.  Does anybody know what's wrong?

https://wiki.postgresql.org/wiki/Incremental_View_Maintenance

[screen]
--
MediaWiki internal error.

 Exception caught inside exception handler.

 Set $wgShowExceptionDetails = true; at the bottom of LocalSettings.php to show 
detailed debugging information.
--


Could you give some concrete use cases, so that I can get a clearer image of 
the target data?  In the discussion, someone referred to master data with a low 
update frequency, because the proposed IVM implementation adds triggers on the 
source tables, which makes it less suitable for update-heavy tables.


Regards
Takayuki Tsunakawa





Re: [HACKERS] Block level parallel vacuum

2019-12-19 Thread Masahiko Sawada
On Thu, 19 Dec 2019 at 22:48, Robert Haas  wrote:
>
> On Thu, Dec 19, 2019 at 12:41 AM Masahiko Sawada
>  wrote:
> > Attached the updated version patch. This version patch incorporates
> > the above comments and the comments from Mahendra. I also fixed one
> > bug around determining the indexes that are vacuumed in parallel based
> > on their option and size. Please review it.
>
> I'm not enthusiastic about the fact that 0003 calls the fast option
> 'disable_delay' in some places. I think it would be more clear to call
> it 'fast' everywhere.
>

Agreed.

I've attached the updated version of the patch that incorporates all the
review comments I've got so far.

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v38-0003-Add-FAST-option-to-vacuum-command.patch
Description: Binary data


v38-0001-Add-index-AM-field-and-callback-for-parallel-ind.patch
Description: Binary data


v38-0002-Add-parallel-option-to-VACUUM-command.patch
Description: Binary data


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2019-12-19 Thread Masahiko Sawada
On Mon, 2 Dec 2019 at 17:32, Dilip Kumar  wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier  wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set.  I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore.  Could you
> > please send a rebased version?  I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.

Thank you for working on this.

This might have already been discussed, but I have a question about the
changes to the logical replication worker. In the current logical
replication there is a problem that the response time is doubled when
using synchronous replication, because wal senders send changes after
commit. It's especially bad when a transaction makes a lot of
changes. So I expected this feature to reduce the response time by
sending changes even while the transaction is in progress, but that
doesn't seem to be the case. The logical replication worker writes changes to
temporary files and applies these changes when it receives the
commit record (STREAM COMMIT). Since the worker sends the LSN of the
commit record as the flush LSN to the publisher after applying all
changes, the publisher must wait until all changes are applied on the
subscriber.  Another problem would be that the worker doesn't receive
changes while applying the changes of other transactions. These things
make me think it's better to have a new worker dedicated to applying
changes, like we have the wal receiver process and the startup process.
Maybe we can have two workers (receiver and applier) per subscription.
Any thoughts?

Regards,


--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Optimizing TransactionIdIsCurrentTransactionId()

2019-12-19 Thread Simon Riggs
On Thu, 19 Dec 2019 at 19:27, Robert Haas  wrote:

> On Wed, Dec 18, 2019 at 5:07 AM Simon Riggs  wrote:
> > TransactionIdIsCurrentTransactionId() doesn't seem to be well optimized
> > for the case when an xid has not yet been assigned, i.e. for read-only
> > transactions.
> >
> > A patch for this is attached.
>
> It might be an idea to first call TransactionIdIsNormal(xid), then
> GetTopTransactionIdIfAny(), then TransactionIdIsNormal(topxid), so
> that we don't bother with GetTopTransactionIdIfAny() when
> !TransactionIdIsNormal(xid).
>
> But it's also not clear to me whether this is actually a win. You're
> doing an extra TransactionIdIsNormal() test to sometimes avoid a
> GetTopTransactionIdIfAny() test.


That's not the point of the patch.

If the TopTransactionId is not assigned, we can leave the whole function
more quickly, not just avoid a test.

Read-only transactions should have a fast path through this function, since
they frequently read more data than write transactions.
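
To illustrate, a minimal sketch of the early-exit ordering discussed above
(just an illustration, not the attached patch or the committed code):

    /*
     * Cheap TransactionIdIsNormal() checks first, so a transaction with no
     * assigned xid (typically read-only) leaves immediately.
     */
    if (!TransactionIdIsNormal(xid))
        return false;

    topxid = GetTopTransactionIdIfAny();
    if (!TransactionIdIsNormal(topxid))
        return false;

    /* ... otherwise continue with the existing (sub)transaction checks ... */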

-- 
Simon Riggs  http://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise


Disallow cancellation of waiting for synchronous replication

2019-12-19 Thread Andrey Borodin
Hi, hackers!

This is a continuation of thread [0] in pgsql-general with proposed changes. As 
Maksim pointed out, this topic was raised before here [1].

Currently, we can end up with a split brain through the following combination of steps:
0. Set up a cluster with synchronous replication. Isolate the primary from the standbys.
1. Issue an upsert query: INSERT .. ON CONFLICT DO NOTHING
2. Cancel query 1 while it waits for synchronous replication.
3. Retry query 1. The idempotent query will succeed and the client gets confirmation of
the written data, while it is not present on any standby.

Thread [0] contains a reproduction from psql.

In certain situations we cannot avoid cancellation of timed-out queries. Yes, we 
can interpret warnings and treat them as errors, but the warning is emitted on 
step 1, not on step 3.

I think the proper solution here would be to add a GUC to disallow cancellation
of waiting for synchronous replication. The retry in step 3 will then wait on the
locks held by the hanging query 1, and the data will be consistent.
There is still a problem when the backend is not canceled but terminated [2].
The ideal solution would be to keep the locks on the changed data. Some well-known
databases treat termination of synchronous replication as a system failure and
refuse to operate until standbys appear (see Maximum Protection mode). From my
point of view it's enough to PANIC once so that the HA tool is informed that
something is going wrong.
Anyway, the situation with cancellation is more dangerous. We've observed it in
some user cases.

Please find attached draft of proposed change.

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/B70260F9-D0EC-438D-9A59-31CB996B320A%40yandex-team.ru
[1] 
https://www.postgresql.org/message-id/flat/CAEET0ZHG5oFF7iEcbY6TZadh1mosLmfz1HLm311P9VOt7Z%2Bjeg%40mail.gmail.com
[2] 
https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION-HA



0001-Disallow-cancelation-of-syncronous-commit-V1.patch
Description: Binary data


Re: Implementing Incremental View Maintenance

2019-12-19 Thread Yugo Nagata
Hi,

Attached is the latest patch (v10) to add support for Incremental
Materialized View Maintenance (IVM).   

IVM is a way to make materialized views up-to-date in which only
incremental changes are computed and applied on views rather than
recomputing the contents from scratch as REFRESH MATERIALIZED VIEW
does. IVM can update materialized views more efficiently
than recomputation when only a small part of the view needs to be updated.

There are two approaches with regard to timing of view maintenance:
immediate and deferred. In immediate maintenance, views are updated in
the same transaction in which their base tables are modified. In deferred
maintenance, views are updated after the transaction is committed,
for example when the view is accessed, in response to a user command
like REFRESH, periodically in the background, and so on.

This patch implements a kind of immediate maintenance, in which
materialized views are updated immediately in AFTER triggers when a
base table is modified.

This supports views using:
 - inner and outer joins including self-join
 - some built-in aggregate functions (count, sum, avg, min, max)
 - a part of subqueries
   -- simple subqueries in FROM clause
   -- EXISTS subqueries in WHERE clause
 - DISTINCT and views with tuple duplicates

===
Here are major changes we made after the previous submitted patch:

* Aggregate functions are now checked by their OID to determine whether
  they can be used in IVM. Per comments from Alvaro Herrera.

  For this purpose, Gen_fmgrtab.pl was modified so that OIDs of
  aggregate functions are output to fmgroids.h.

* Some bug fixes including:

 - A mistake in psql tab-completion pointed out by nuko-san
 - A bug related to renaming of matviews pointed out by nuko-san
 - spelling errors
 - etc.

* Add documentations for IVM

* The patch is split into eleven parts to make review easier,
  as suggested by Amit Langote:

 - 0001: Add a new syntax:
 CREATE INCREMENTAL MATERIALIZED VIEW
 - 0002: Add a new column relisivm to pg_class
 - 0003: Change trigger.c to allow prolonging the life span of tuplestores
 containing Transition Tables generated via AFTER triggers
 - 0004: Add the basic IVM feature using the counting algorithm:
 This supports inner joins, DISTINCT, and tuple duplicates.
 - 0005: Change Gen_fmgrtab.pl to output aggregate functions' OIDs
 - 0006: Add aggregates support for IVM
 - 0007: Add subqueries support for IVM
 - 0008: Add outer joins support for IVM
 - 0009: Add IVM support to psql command
 - 0010: Add regression tests for IVM
 - 0011: Add documentations for IVM

===
Todo:

Currently, REFRESH and pg_dump/pg_restore are not supported, but
we are working on them.

Also, TRUNCATE is not supported. When a TRUNCATE command is executed
on a base table, nothing happens to the materialized views. We are
now considering better options, like:

- Raise an error or warning when a base table is TRUNCATEed.
- Make the view non-scannable (like WITH NO DATA)
- Update the view in some way. It would be easy for inner joins
  or aggregate views, but there are some difficulties with outer joins.


Regards,
Yugo Nagata

-- 
Yugo Nagata 


IVM_patches_v10.tar.gz
Description: application/gzip


Re: Fetching timeline during recovery

2019-12-19 Thread Kyotaro Horiguchi
At Fri, 20 Dec 2019 00:35:19 +0100, Jehan-Guillaume de Rorthais 
 wrote in 
> On Fri, 13 Dec 2019 16:12:55 +0900
> Michael Paquier  wrote:

The first one;

> > I mentioned a SRF function which takes an input argument, but that
> > makes no sense.  What I would prefer having is just having one
> > function, returning one row (LSN, TLI), using in input one argument to
> > extract the WAL information the caller wants with five possible cases
> > (write, insert, flush, receive, replay).
> 
> It looks odd when we look at other five existing functions of the same family
> but without the tli. And this user interaction with admin function is quite
> different of what we are used to with other admin funcs. But mostly, when I
> think of such function, I keep thinking this parameter should be a WHERE
> clause after a SRF function.
> 
> -1

It is related to the third one; it may be annoying that the case names
cannot get the aid of psql tab-completion.


The second one;

> > Then, what you are referring to is one function which returns all
> > (LSN,TLI) for the five cases (write, insert, etc.), so it would return
> > one row with 10 columns, with NULL mapping to the values which have no
> > meaning (like replay on a primary).
> 
> This would looks like some other pg_stat_* functions, eg. 
> pg_stat_get_archiver.
> I'm OK with this. This could even be turned as a catalog view.
> 
> However, what's the point of gathering all the values eg from a production
> cluster? Is it really useful to compare current/insert/flush LSN from wal
> writer?

There is a period where pg_controldata shows the previous TLI after
promotion. It's useful if we can read the up-to-date TLI from a live
standby. I thought that this project was for that case.

> It's easier to answer from a standby point of view as the lag between received
> and replayed might be interesting to report in various situations.


The third one;

> > And on top of that we have a third possibility: one SRF function
> > returning 5 rows with three attributes (mode, LSN, TLI), where mode
> > corresponds to one value in the set {write, insert, etc.}.
> 
> I prefer the second one. Just select the field(s) you need, no need WHERE
> clause, similar to some other stats function.
> 
> -1

It might be clean in a sense, but I can't come up with a case where
that format is useful.

Anyway, as with the first one, the case names (write, insert,
flush, receive, replay) come from two different machineries, and
showing them all in one row could be confusing.


> As a fourth possibility, as I badly explained my last implementation details, 
> I
> still hope we can keep it in the loop here. Just overload existing functions
> with ones that takes a boolean as parameter and add the TLI as a second field,
> eg.:
> 
> Name   | Result type  | Argument data types
> ---+--+---
> pg_current_wal_lsn | pg_lsn   |
> pg_current_wal_lsn | SETOF record | with_tli bool, OUT lsn pg_lsn, OUT tli int

I prefer this one, in the sense of similarity with existing functions.

> And the fifth one, implementing brand new functions:
> 
>  pg_current_wal_lsn_tli
>  pg_current_wal_insert_lsn_tli
>  pg_current_wal_flush_lsn_tli
>  pg_last_wal_receive_lsn_tli
>  pg_last_wal_replay_lsn_tli

Mmm. Should we remove the existing ones instead? (Of course we don't,
though.)

> > I actually prefer the first one, and you mentioned the second.  But
> > there could be a point in doing the third one.  An advantage of the
> > second and third ones is that you may be able to get a consistent view
> > of all the data, but it means holding locks to look at the values a
> > bit longer.  Let's see what others think.
> 
> I like the fourth one, but I was not able to return only one field if given
> parameter is false or NULL. Giving false as argument to these funcs has no
> meaning compared to the original one without arg. I end up with this solution
> because I was worried about adding five more funcs really close to some
> existing one.

Right. It is a restriction of polymorphic functions. It is the same
relationship as with pg_stop_backup() and pg_stop_backup(true).

> Fifth one is more consistent with what we already have.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Read Uncommitted

2019-12-19 Thread Craig Ringer
On Fri, 20 Dec 2019 at 12:18, Tom Lane  wrote:

> Craig Ringer  writes:
> > My understanding from reading the above is that Simon didn't propose to
> > make aborted txns visible, only in-progress uncommitted txns.
>
> Yeah, but an "in-progress uncommitted txn" can become an "aborted txn"
> at any moment, and there's no interlock that would prevent its generated
> data from being removed out from under you at any moment after that.
> So there's a race condition, and as Robert observed, the window isn't
> necessarily small.
>

Absolutely. Many of the same issues arise in the work on logical decoding
of in-progress xacts for optimistic logical decoding.

Unless such an interlock is added (with all the problems that entails,
again per the in-progress logical decoding thread) that limits this to:

* running in recovery when stopped at a recovery target; or
* peeking at the contents of individual prepared xacts that we can prevent
someone else concurrently aborting/committing

That'd actually cover the only things I'd personally actually want a
feature like this for anyway.

In any case, Simon's yanked the proposal. I'd like to have some way to peek
at the contents of individual uncommitted xacts, but it's clearly not going
to be anything called READ UNCOMMITTED that applies to all uncommitted
xacts at once...


> > I think the suggestions for a SRF based approach might make sense.
>
> Yeah, I'd rather do it that way than via a transaction isolation
> level.  The isolation-level approach seems like people will expect
> stronger semantics than we could actually provide.
>

Yeah. Definitely not an isolation level.

It'll be interesting to see if this sparks any more narrowly scoped and
targeted ideas, anyway. Thanks for taking the time to think about it.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: PATCH: Add uri percent-encoding for binary data

2019-12-19 Thread Arthur Zakirov

Hello,

On 2019/10/07 16:14, Anders Åstrand wrote:

Hello

Attached is a patch for adding uri as an encoding option for
encode/decode. It uses what's called "percent-encoding" in rfc3986
(https://tools.ietf.org/html/rfc3986#section-2.1).


Thank you for the patch. I'm not very familiar with rfc3986. Does it 
insist that the output should use upper-case characters? If not, maybe it 
is better to reuse hextbl[] (which is in encode.c) instead of adding the new 
upper_hex_digits[].
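
For reference, the core of percent-encoding is small. A rough, untested sketch
of the per-byte step, using upper-case hex digits as the patch apparently does
(src, dst and the table name here are just placeholders, not taken from the
patch):

    static const char hex[] = "0123456789ABCDEF";
    unsigned char c = *src;

    /* RFC 3986 "unreserved" bytes pass through unchanged. */
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9') ||
        c == '-' || c == '.' || c == '_' || c == '~')
        *dst++ = c;
    else
    {
        /* Everything else becomes '%' plus two hex digits. */
        *dst++ = '%';
        *dst++ = hex[(c >> 4) & 0xF];
        *dst++ = hex[c & 0xF];
    }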


Also, could you update the documentation? encode() is documented here:
https://www.postgresql.org/docs/current/functions-binarystring.html

--
Arthur




Re: Read Uncommitted

2019-12-19 Thread Tom Lane
Craig Ringer  writes:
> My understanding from reading the above is that Simon didn't propose to
> make aborted txns visible, only in-progress uncommitted txns.

Yeah, but an "in-progress uncommitted txn" can become an "aborted txn"
at any moment, and there's no interlock that would prevent its generated
data from being removed out from under you at any moment after that.
So there's a race condition, and as Robert observed, the window isn't
necessarily small.

> The bigger issue, and the one that IMO makes it impractical to spell this
> as "READ UNCOMMITTED", is that an uncommitted txn might've changed the
> catalogs so there is no one snapshot that is valid for interpreting all
> possible tuples.

In theory that should be okay, because any such tuples would be in
tables you can't access due to the in-progress txn having taken
AccessExclusiveLock on tables it changes the rowtype of.  But we keep
looking for ways to reduce the locking requirements for ALTER TABLE
actions --- and as I said upthread, it feels like this feature might
someday be the sole reason why we can't safely reduce lock strength
for some form of ALTER.  I can't make a concrete argument for that
though ... maybe it's not really any worse than the situation just
after failure of any DDL-performing txn.  But that would bear closer
study than I think it's had.

> I think the suggestions for a SRF based approach might make sense.

Yeah, I'd rather do it that way than via a transaction isolation
level.  The isolation-level approach seems like people will expect
stronger semantics than we could actually provide.

regards, tom lane




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-19 Thread Amit Khandekar
On Thu, 19 Dec 2019 at 11:59, Amit Kapila  wrote:
>
> On Wed, Dec 18, 2019 at 12:34 PM Amit Khandekar  
> wrote:
> >
> > On Tue, 17 Dec 2019 at 17:40, Amit Khandekar  wrote:
> > > By the way, the backport patch is turning out to be simpler. It's
> > > because in pre-12 versions, the file offset is part of the Vfd
> > > structure, so all the offset handling is not required.
> >
> > Please have a look at the attached backport patch for PG 11. branch.
> > Once you are ok with the patch, I will port it on other branches.
> > Note that in the patch, wherever applicable I have renamed the fd
> > variable to vfd to signify that it is a vfd, and not the kernel fd. If
> > we don't do the renaming, the patch would be still smaller, but I
> > think the renaming makes sense.
> >
>
> The other usage of PathNameOpenFile in md.c is already using 'fd' as a
> variable name (also, if you see example in fd.h, that also uses fd as
> variable name), so I don't see any problem with using fd especially if
> that leads to lesser changes.

Ok. I have retained fd name.

> Apart from that, your patch LGTM.
Attached are the patches from master back to the 9.4 branch.

PG 9.4 and 9.5 have a common patch to be applied:
pg94_95_use_vfd_for_logrep.patch
From PG 9.6 onwards, each version has a separate patch.

For PG 9.6, there is no logical decoding perl test file. So I have
made a new file 006_logical_decoding_spill.pl that has only the
specific testcase. Also, for building the test_decoding.so, I had to
add the EXTRA_INSTALL=contrib/test_decoding line in the
src/test/recovery/Makefile, because this is the first time we are
using the plugin in the 9.6 tap test.

From PG 10 onwards, the pgstat_report_*() calls around read() are removed
in the patch, because FileRead() itself reports the wait events.

From PG 12 onwards, the vfd offset handling had to be added, because
the offset is not present in the Vfd structure.

In master, logical_decoding_work_mem is used in the test file.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


use_vfd_for_logrep_patches.tar.gz
Description: GNU Zip compressed data


Re: polymorphic table functions light

2019-12-19 Thread Tom Lane
Vik Fearing  writes:
> On 16/12/2019 22:13, Tom Lane wrote:
>> That being the case, I'm not in favor of using up SQL syntax space for it
>> if we don't have to.

> Do I understand correctly that you are advocating *against* using
> standard SQL syntax for a feature that is defined by the SQL Standard
> and that we have no similar implementation for?

My point is that what Peter is proposing is exactly *not* the standard's
feature.  We generally avoid using up standard syntax for not-standard
semantics, especially if there's any chance that somebody might come along
and build a more-conformant version later.  (Having said that, I had the
impression that what he was proposing wasn't the standard's syntax either,
but just a homegrown CREATE FUNCTION addition.  I don't really see the
point of doing it like that when we can do it below the level of SQL.)

regards, tom lane




Re: Is querying SPITupleTable with SQL possible?

2019-12-19 Thread Thomas Munro
[un-top-postifying]

On Fri, Dec 20, 2019 at 2:53 PM Wu, Fei  wrote:
>> On 02/10/2019 16:11, Tom Lane wrote:
>> > Tom Mercha  writes:
>> >> I am using PostgreSQL's SPI to execute a simple SQL query (SELECT *
>> >> FROM
>> >> ...) via SPI_exec. As a a result, I get an SPITupleTable with the
>> >> results of my query.
>> >> Now that I have the SPITupleTable, I was wondering if it would be
>> >> possible to later query over it further in my SQL statements using
>> >> SPI, for example, something a bit similar to SPI_Exec ("Select * FROM
>> >> :mySPITupleTable", 0);
>> >
>> > It's possible you could use the "transition table" (aka
>> > EphemeralNamedRelation) infrastructure for this, though I'm not sure
>> > if it's really a close fit, or whether it's been built out enough to
>> > support this usage.  From memory, it wants to work with tuplestores,
>> > which are a bit heavier-weight than SPITupleTables.
>>
>> Thanks for this feedback! The EphemeralNamedRelation seems that it could be 
>> a good fit for what I'm looking for.
>>
>> However, I'm not quite so sure how I can query over the 
>> EphemeralNamedRelation using SQL? Could someone indicate where I can find an 
>> example?
>
> I have seen your discussion about the EphemeralNamedRelation node with the 
> community.
> Now I want to use this node in SQL (for testing). I have read the manual but 
> could not understand it;
> can you show me an example of how to use it in SQL?

I missed this thread before.  If you want to expose an ENR to SQL
you'll need to write some C code for now (unless someone has added
support for other languages?).  Try something like this (not tested):

EphemeralNamedRelation enr = palloc0(sizeof(*enr));

enr->md.name = "my_magic_table";
enr->md.reliddesc = InvalidOid;
enr->md.tupdesc = my_magic_table_tuple_descriptor;
enr->md.enrtype = ENR_NAMED_TUPLESTORE;
enr->md.enrtuples = how_many_tuples_to_tell_the_planner_we_have;
enr->reldata = my_tuplestorestate;
rc = SPI_register_relation(enr);
if (rc != SPI_OK_REL_REGISTER)
  explode();

You will need to come up with a TupleDesc that describes the columns
in your magic table, and a Tuplestorestate that holds the tuples.
After that you should be able to plan and execute read-only SQL
queries against that tuplestore using that name, via the usual
SPI_xxx() interfaces.  I'm not sure how you'd really do this though:
you might need to make a function that takes a query as a string, then
does the above setup in a new SPI connection, and then executes the
query.  This would probably be a lot more fun from a PL like Python.
(If you wanted to create ENRs that are available to your top level
connection, I haven't looked into it but I suspect that'd require more
machinery than we have right now, and I'm not sure if it'd be a good
idea.)
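
For the TupleDesc and Tuplestorestate parts, a rough (equally untested)
sketch, with placeholder column names and types:

    /* Describe a two-column "magic table" (id int4, val text). */
    TupleDesc   tupdesc = CreateTemplateTupleDesc(2);

    TupleDescInitEntry(tupdesc, (AttrNumber) 1, "id", INT4OID, -1, 0);
    TupleDescInitEntry(tupdesc, (AttrNumber) 2, "val", TEXTOID, -1, 0);

    /* Fill a tuplestore with one row to expose through the ENR. */
    Tuplestorestate *tstore = tuplestore_begin_heap(false, false, work_mem);
    Datum       values[2] = {Int32GetDatum(1), CStringGetTextDatum("hello")};
    bool        nulls[2] = {false, false};

    tuplestore_putvalues(tstore, tupdesc, values, nulls);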

In theory, there could be a new type ENR_SPI_TUPLE_TABLE that could
work with SPITupleTable instead of Tuplestore.  The idea was that we
might be able to do more clever things like that in the future, which
is why we tried to make it at least a little bit general.  One thing
that could be nice would be SQL Server style table variables; you
could have functions that receive them as parameters, return them,
and be able to insert/update/delete.  That's a bit far fetched, but
gives some idea of the reason we invented QueryEnvironment and passed it
to all the right places in the planner and executor (or perhaps not
enough yet, we do occasionally find places that we forgot to pass
it...).  So yeah, to use an SPITupleTable now you'd need to insert its
contents into a Tuplestorestate.
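
Concretely, that copy step could look something like this (again just a
sketch):

    /* Copy the rows of the current SPITupleTable into a tuplestore; the
     * matching TupleDesc is available as SPI_tuptable->tupdesc. */
    Tuplestorestate *tstore = tuplestore_begin_heap(false, false, work_mem);

    for (uint64 i = 0; i < SPI_processed; i++)
        tuplestore_puttuple(tstore, SPI_tuptable->vals[i]);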

As mentioned, this is the mechanism that is used to support SQL
standard "transition tables" in triggers.  You can see that the
function SPI_register_trigger_data() just does what I showed above to
expose all the transition tables to your trigger's SQL statements.
That part even works in Perl, Python, TCL triggers, but to make your
own tables in higher level languages you'd need to expose more
interfaces and figure out sane ways to get TupleDescriptor and
interact with Tuplestore.  If you want to see examples of SQL queries
inside triggers that access transition tables, check out
src/test/regress/expected/triggers.out and
src/pl/plpython/expected/plpython_trigger.out.




RE: Hooks for session start and end, take two

2019-12-19 Thread tsunakawa.ta...@fujitsu.com
From: Michael Paquier 
> Adding extra custom logging information, or plug that information
> elsewhere than Postgres.  I have use cases for that when it comes to
> store external telemetry data or audit things for events happening
> specifically in Postgres.
> 
> Well, hook authors can do a lot of stupid things..  Anyway it looks
> that the end hook is out of scope as far as the discussion has gone
> based on the lack of facility, and that there is still interest for
> the start hook.

I've got interested in this.  What's the current status of this patch?  The CF 
entry shows it was committed.

https://commitfest.postgresql.org/25/2251/

But I understand that is not the case, because the relevant code doesn't appear 
in HEAD, and the Git log shows that it was reverted.  Am I correct?


I'm thinking of using this feature to address an issue raised by a user who is 
considering migrating from Oracle.  Oracle caches SQL execution plans in shared 
memory and enables any session to use them.  He wants to hide the plan creation 
time from end users after database restart (perhaps the plan creation is slow.) 
 To do that in Oracle, he runs possible SQL statements in a session solely for 
warm-up before resuming service for end users.  Then, later new sessions of end 
users can use the cached execution plans.

Ideally, PostgreSQL should be able to share query plans among sessions without 
user intervention.  But I guess it would be difficult.  Instead, with this 
session start hook and the connection pooling, individual sessions can run 
possible SQL statements using PREPARE and EXECUTE once (forcing generic plans) 
at session start.



BTW, the user interface of the feature is unduly difficult for my use case and 
the use cases you mentioned.  I think users should be able to use the feature 
only with SQL, without creating extensions.  That also allows the use of this 
feature in DBaaS.  How about creating a new event trigger type, a connect 
trigger, modeled after Oracle's logon trigger?

CREATE EVENT TRIGGER mytrigger
AFTER CONNECT [ON mydatabase]
EXECUTE {PROCEDURE | FUNCTION} myproc();

If some hacker wants to do what SQL can't, he can specify a UDF in C.


Regards
Takayuki Tsunakawa




Re: Proposal: Global Index

2019-12-19 Thread Bruce Momjian
On Thu, Dec 19, 2019 at 11:28:55AM -0800, Jeremy Schneider wrote:
> On 12/19/19 08:12, Bruce Momjian wrote:
> > I don't see lossy BRIN indexes helping with the uniqueness use-case, and
> > I am not sure they would help with the rare case either.  They would
> > help for range-based partitions, but I thought our existing facilities
> > worked in that case.
> 
> Correlated data.  The existing facilities work if the filtering column
> is exactly the same as the partition column.  But it's not at all
> uncommon to have other columns with correlated data, perhaps the most
> obvious of which is timeseries tables with many date columns of various
> definitions (row first update, row latest update, invoice date, payment
> date, process date, ship date, etc).
> 
> What if you could use *two* indexes in a single execution plan?  Use the
> global BRIN to narrow down to 2 or 3 out of a hundred or more
> partitions, then use local indexes to find specific rows in the
> partitions of interest?  That might work, without being too overly
> complicated.

No, that is very interesting --- having secondary indexes for
partitioned tables that trim most partitions.  Would index lookups on
each partition index be very slow?  BRIN indexes?  I am assuming global
indexes would only avoid such lookups.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Read Uncommitted

2019-12-19 Thread Craig Ringer
On Thu, 19 Dec 2019 at 23:36, Andres Freund  wrote:

> Hi,
>
> > On the patch as proposed this wouldn't be possible because a toast row
> > can't be vacuumed and then reused while holding back xmin, at least as I
> > understand it.
>
> Vacuum and pruning remove rows where xmin didn't commit, without testing
> against the horizon. Which makes sense, because normally there's so far
> no snapshot including them. Unless we were to weaken that logic -
> which'd have bloat impacts - a snapshot wouldn't guarantee anything
> about the non-removal of such tuples, unless I am missing something.
>

My understanding from reading the above is that Simon didn't propose to
make aborted txns visible, only in-progress uncommitted txns.

Vacuum only removes such rows if the xact is (a) explicitly aborted in clog
or (b) provably not still running. It checks RecentXmin and the running
xids arrays to handle xacts that went away after a crash. Per
TransactionIdIsInProgress() as used by HeapTupleSatisfiesVacuum(). I see
that it's not *quite* as simple as using the RecentXmin threshold, as xacts
newer than RecentXmin may also be seen as not in-progress if they're absent
in the shmem xact arrays and there's no overflow.

But that's OK so long as the only xacts that some sort of read-uncommitted
feature allows to become visible are ones that
satisfy TransactionIdIsInProgress(); they cannot have been vacuumed.

The bigger issue, and the one that IMO makes it impractical to spell this
as "READ UNCOMMITTED", is that an uncommitted txn might've changed the
catalogs so there is no one snapshot that is valid for interpreting all
possible tuples. It'd have to see only txns that have no catalog changes,
or be narrowed to see just *one specific txn* that had catalog changes.
That makes it iffy to spell it as "READ UNCOMMITTED" since we can't
actually make all uncommitted txns visible at once.

I think the suggestions for a SRF based approach might make sense. Perhaps
a few functions:

* a function to list all in-progress xids

* a function to list in-progress xids with/without catalog changes (if
possible, unsure if we know that until the commit record is written)

* a function (or procedure?) to execute a read-only SELECT or WITH query
within a faked-up snapshot for some in-progress xact and return a SETOF
RECORD with results. If the txn has catalog changes this would probably
have to coalesce each result field with non-builtin data types to text, or
do some fancy validation to compare the definition in the txn snapshot with
the latest normal snapshot used by the calling session. Ideally this
function could take an array of xids and would query with them all visible
unless there were catalog changes in any of them, then it'd ERROR.

* a function to generate the SQL text for an alias clause that maps the
RECORD returned by the above function, so you can semi-conveniently query
it. (I don't think we have a way for a C callable function to supply a
dynamic resultset type at plan-time to avoid the need for this, do we?
Perhaps if we use a procedure not a function?)



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


RE: Is querying SPITupleTable with SQL possible?

2019-12-19 Thread Wu, Fei
Hi,
I have seen your discussion about the EphemeralNamedRelation node with the 
community.
Now I want to use this node in SQL (for testing). I have read the manual but could 
not understand it;
can you show me an example of how to use it in SQL?

Thanks a lot~

Regards, wu fei

-Original Message-
From: Tom Mercha [mailto:merch...@hotmail.com] 
Sent: October 3, 2019 2:53
To: Tom Lane 
Cc: pgsql-hack...@postgresql.org
Subject: Re: Is querying SPITupleTable with SQL possible?

On 02/10/2019 16:11, Tom Lane wrote:
> Tom Mercha  writes:
>> I am using PostgreSQL's SPI to execute a simple SQL query (SELECT * 
>> FROM
>> ...) via SPI_exec. As a result, I get an SPITupleTable with the 
>> results of my query.
>> Now that I have the SPITupleTable, I was wondering if it would be 
>> possible to later query over it further in my SQL statements using 
>> SPI, for example, something a bit similar to SPI_Exec ("Select * FROM 
>> :mySPITupleTable", 0);
> 
> It's possible you could use the "transition table" (aka
> EphemeralNamedRelation) infrastructure for this, though I'm not sure 
> if it's really a close fit, or whether it's been built out enough to 
> support this usage.  From memory, it wants to work with tuplestores, 
> which are a bit heavier-weight than SPITupleTables.
> 
>   regards, tom lane
> 

Thanks for this feedback! The EphemeralNamedRelation seems that it could be a 
good fit for what I'm looking for.

However, I'm not quite so sure how I can query over the EphemeralNamedRelation 
using SQL? Could someone indicate where I can find an example?

Regards
Tom






Re: How is this possible "publication does not exist"

2019-12-19 Thread Tomas Vondra

On Thu, Dec 19, 2019 at 07:19:56PM +0100, Peter Eisentraut wrote:

On 2019-12-19 19:15, Dave Cramer wrote:
It seems that if you drop the publication on an existing slot it 
needs to be recreated. Is this expected behaviour?


A publication is not associated with a slot.  Only a subscription is 
associated with a slot.



drop publication dbz_publication ;
DROP PUBLICATION
postgres=# create publication dbz_publication for all tables;
CREATE PUBLICATION
postgres=# SELECT * FROM 
pg_logical_slot_get_binary_changes('debezium', NULL, 
NULL,'proto_version','1','publication_names','dbz_publication');

ERROR:  publication "dbz_publication" does not exist
CONTEXT:  slot "debezium", output plugin "pgoutput", in the change 
callback, associated LSN 0/4324180


This must be something particular to Debezium.



Yeah, I don't see this error message anywhere in our sources on 11 or
12, so perhaps debezium does something funny? It's not clear to me
where, though, as this suggests it uses the pgoutput module.

While trying to reproduce this I however ran into a related issue with
pgoutput/pg_logical_slot_get_binary_changes. If you call the function
repeatedly (~10x) you'll get an error like this:

FATAL:  out of relcache_callback_list slots
CONTEXT:  slot "slot", output plugin "pgoutput", in the startup callback
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

The reason is very simple - each call executes pgoutput_startup, which
does CacheRegisterRelcacheCallback in init_rel_sync_cache. And we do
this on each pg_logical_slot_get_binary_changes() call and never remove
the callbacks, so we simply run out of MAX_RELCACHE_CALLBACKS slots.
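
For context, the registration path is roughly this (paraphrased from inval.c):
the callback array has a fixed size and there is no way to unregister an
entry, so repeated registrations eventually hit the FATAL above.

    void
    CacheRegisterRelcacheCallback(RelcacheCallbackFunction func, Datum arg)
    {
        if (relcache_callback_count >= MAX_RELCACHE_CALLBACKS)
            elog(FATAL, "out of relcache_callback_list slots");

        relcache_callback_list[relcache_callback_count].function = func;
        relcache_callback_list[relcache_callback_count].arg = arg;

        ++relcache_callback_count;
    }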

Not sure if this is a known issue/behavior, but it seems a bit annoying
and possibly related to the issue reported by Dave.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





RE: Proposal: Add more compile-time asserts to expose inconsistencies.

2019-12-19 Thread Smith, Peter
Hello Michael,

> In short, attached is an updated version of your patch which attempts to 
> solve that.  I have tested this with some cplusplus stuff, and GCC for both 
> versions (static_assert is available in GCC >= 6, but a manual change of c.h 
> does the trick).
> I have edited the patch a bit while on it, your assertions did not use 
> project-style grammar, the use of parenthesis was inconsistent (see relpath.c 
> for example), and pgindent has complained a bit.

Thanks for your updates.

~~

Hello Andres,

>> +StaticAssertDecl(lengthof(LockTagTypeNames) == (LOCKTAG_ADVISORY + 1),
>> + "LockTagTypeNames array inconsistency");
>> +
> These error messages strike me as somewhat unhelpful. I'd probably just 
> reword them as "array length mismatch" or something like that.

I updated the most recent patch (_5 from Michael) so it now has your suggested 
error message rewording.

PSA patch _6

Kind Regards

Peter Smith
Fujitsu Australia


ct_asserts_StaticAssertDecl_6.patch
Description: ct_asserts_StaticAssertDecl_6.patch


Re: [HACKERS] kqueue

2019-12-19 Thread Thomas Munro
On Fri, Dec 20, 2019 at 1:26 PM Thomas Munro  wrote:
> On Fri, Dec 20, 2019 at 12:41 PM Rui DeSousa  wrote:
> > PostgreSQL 11

BTW, PostgreSQL 12 has an improvement that may be relevant for your
case: it suppresses a bunch of high frequency reads on the "postmaster
death" pipe in some scenarios, mainly the streaming replica replay
loop (if you build on a system new enough to have PROC_PDEATHSIG_CTL,
namely FreeBSD 11.2+, it doesn't bother reading the pipe unless it's
received a signal).  That pipe is inherited by every process and
included in every poll() set.  The kqueue patch doesn't even bother to
add it to the wait event set, preferring to use an EVFILT_PROC event,
so in theory we could get rid of the death pipe completely on FreeBSD
and rely on EVFILT_PROC (sleeping) and PDEATHSIG (while awake), but I
wouldn't want to make the code diverge from the Linux code too much,
so I figured we should leave the pipe in place but just avoid
accessing it when possible, if that makes sense.
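
For anyone unfamiliar with the mechanism, watching for postmaster death via
kqueue boils down to something like the following (an illustrative sketch
only; kqueue_fd stands in for whatever descriptor the WaitEventSet owns in
the actual patch):

    struct kevent ev;

    /* Have the kernel report the postmaster's exit directly, instead of
     * the backend polling the postmaster-death pipe. */
    EV_SET(&ev, PostmasterPid, EVFILT_PROC, EV_ADD, NOTE_EXIT, 0, NULL);
    if (kevent(kqueue_fd, &ev, 1, NULL, 0, NULL) < 0)
        elog(ERROR, "kevent() failed: %m");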




Re: More issues with expressions always false (no patch)

2019-12-19 Thread Andreas Karlsson

On 12/20/19 1:01 AM, Ranier Vilela wrote:

First case:

src\backend\executor\nodeSubplan.c (line 507)

if (node->hashtable)

node->hashtable is assigned NULL at line 498, so the test will 
always fail.


Second case:
Here the case is similar, but worse.

src\backend\executor\nodeSubplan.c (line 535)
if (node->hashnulls)
   ResetTupleHashTable(node->hashtable);


These two look like likely bugs. It looks like the code will always 
create new hash tables despite commit 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=356687bd825e5ca7230d43c1bffe7a59ad2e77bd 
intending to reset them if they already exist.


Additionally it looks like the code would reset the wrong hash table in 
the second place if the bug was fixed.


I have attached a patch.


Third case:
src\backend\utils\cache\relcache.c (line 5190)
if (relation->rd_pubactions)

It will never be executed, because if relation->rd_pubactions is true, 
the function returns on line 5154.


I have not looked into this one in detail, but the free at line 5192 
looks like potentially dead code.


Andreas

diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 2c364bdd23..840f4033ac 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -495,8 +495,6 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 	 * need to store subplan output rows that contain NULL.
 	 */
 	MemoryContextReset(node->hashtablecxt);
-	node->hashtable = NULL;
-	node->hashnulls = NULL;
 	node->havehashrows = false;
 	node->havenullrows = false;
 
@@ -533,7 +531,7 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 		}
 
 		if (node->hashnulls)
-			ResetTupleHashTable(node->hashtable);
+			ResetTupleHashTable(node->hashnulls);
 		else
 			node->hashnulls = BuildTupleHashTableExt(node->parent,
 	 node->descRight,
@@ -549,6 +547,8 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 	 node->hashtempcxt,
 	 false);
 	}
+	else
+		node->hashnulls = NULL;
 
 	/*
 	 * We are probably in a short-lived expression-evaluation context. Switch


Re: [PATCH] Fix expressions always false

2019-12-19 Thread Ranier Vilela
>1) long is 64 bits on Unix-like platforms
>2) checking a long against LONG_MIN/LONG_MAX is _definitely_ pointless
>3) it's being cast to an int for the from_char_set_int() call below
>Please take your time to read the whole context of the code you're
>changing, and consider other platforms than just Windows.
Thank you for pointing this out to me.

regards,
Ranier Vilela

On Thu, Dec 19, 2019 at 20:58, Dagfinn Ilmari Mannsåker <
ilm...@ilmari.org> wrote:

> Ranier Vilela  writes:
>
> > More about expressions always false.
> > 2. src/backend/utils/adt/formatting.c
> > result is declared long. Comparison with int limits is always false.
> > 3. src/backend/utils/adt/jsonfuncs.c
> > lindex is declared long. . Comparison with int limits is always false.
>
> 1) long is 64 bits on Unix-like platforms
> 2) checking a long against LONG_MIN/LONG_MAX is _definitely_ pointless
> 3) it's being cast to an int for the from_char_set_int() call below
>
> Please take your time to read the whole context of the code you're
> changing, and consider other platforms than just Windows.
>
> - ilmari
> --
> "A disappointingly low fraction of the human race is,
>  at any given time, on fire." - Stig Sandbeck Mathisen
>
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index dbed597816..f0ad9f23e5 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -2736,7 +2736,7 @@ ExecEvalArrayExpr(ExprState *state, ExprEvalStep *op)
 /* Get sub-array details from first member */
 elem_ndims = this_ndims;
 ndims = elem_ndims + 1;
-if (ndims <= 0 || ndims > MAXDIM)
+if (ndims > MAXDIM)
 	ereport(ERROR,
 			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 			 errmsg("number of array dimensions (%d) exceeds " \
diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c
index a6dd8b75aa..bb1e8522dd 100644
--- a/src/backend/utils/adt/network.c
+++ b/src/backend/utils/adt/network.c
@@ -280,8 +280,6 @@ network_send(inet *addr, bool is_cidr)
	pq_sendbyte(&buf, ip_bits(addr));
	pq_sendbyte(&buf, is_cidr);
 	nb = ip_addrsize(addr);
-	if (nb < 0)
-		nb = 0;
	pq_sendbyte(&buf, nb);
 	addrptr = (char *) ip_addr(addr);
 	for (i = 0; i < nb; i++)


RE: Libpq support to connect to standby server as priority

2019-12-19 Thread tsunakawa.ta...@fujitsu.com
From: Greg Nancarrow 
> With the permission of the original patch author, Haribabu Kommi, I’ve
> rationalized the existing 8 patches into 3 patches, merging patches
> 1-5 and 6-7, and tidying up some documentation and code comments. I
> also rebased them to the latest PG12 source code (as of October 1,
> 2019). The patch code itself is the same, except for some version
> checks that I have updated to target the features for PG13 instead of
> PG12.
> I’ve attached the updated patches.

Thank you for taking over this patch.  Your arrangement has made the patches 
much easier to read!

I've finished reviewing, and my comments are below.  Unfortunately, 0003 failed 
to apply (I guess only slight modification is needed to apply to HEAD.)  I'd 
like to proceed to testing when the revised patch becomes available.



(1) 0001
+   /*
+    * Requested type is prefer-read, then record this host index
+    * and try the other before considering it later. If requested
+    * type of connection is read-only, ignore this connection.
+    */
+   if (conn->requested_session_type == SESSION_TYPE_PREFER_READ ||
+       conn->requested_session_type == SESSION_TYPE_READ_ONLY)
{

This if statement seems unnecessary, because the following part at the 
beginning of the CONNECTION_CHECK_TARGET case block precludes entering the if 
block.  Cases other than "any" are handled first here.


if (conn->sversion >= 70400 &&
-   conn->target_session_attrs != NULL &&
-   strcmp(conn->target_session_attrs, "read-write") == 0)
+   conn->requested_session_type != SESSION_TYPE_ANY)
+   {


(2) 0002
-} TargetSessionAttrsType;
+}  TargetSessionAttrsType;

One space after } is replaced with three tabs.  I guess this is an 
unintentional change.
 

(3) 0002
+reject_checked_read_or_write_connection(PGconn *conn)

To follow the naming style of most internal functions in this file, I find it 
better to change the name to rejectCheckedReadOrWriteConnection.


(4) 0003
+reject_checked_recovery_connection(PGconn *conn)

The same as the previous one.


(5) 0003
Don't we have to describe in_recovery in both or either of 
high-availability.sgml and config.sgml?  transaction_read_only is touched in 
the former.


Regards
Takayuki Tsunakawa



Re: polymorphic table functions light

2019-12-19 Thread Vik Fearing
On 16/12/2019 22:13, Tom Lane wrote:
> That being the case, I'm not in favor of using up SQL syntax space for it
> if we don't have to.


Do I understand correctly that you are advocating *against* using
standard SQL syntax for a feature that is defined by the SQL Standard
and that we have no similar implementation for?


If so, I would like to stand up to it.  We are known as (at least one
of) the most conforming implementations and I hope we will continue to
be so.  I would rather we remove from rather than add to this page:
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard

-- 

Vik Fearing





Re: [HACKERS] kqueue

2019-12-19 Thread Thomas Munro
On Fri, Dec 20, 2019 at 12:41 PM Rui DeSousa  wrote:
> I’m interested in the kqueue patch and would like to know its current state 
> and possible timeline for inclusion in the base code.  I have several large 
> FreeBSD systems running PostgreSQL 11 that I believe currently displays this 
> issue.  The system has 88 vCPUs, 512GB Ram, and very active application with 
> over 1000 connections to the database.  The system exhibits high kernel CPU 
> usage servicing poll() for connections that are idle.

Hi Rui,

It's still my intention to get this committed eventually, but I got a
bit frazzled by conflicting reports on several operating systems.  For
FreeBSD, performance was improved in many cases, but there were also
some regressions that seemed to be related to ongoing work in the
kernel that seemed worth waiting for.  I don't have the details
swapped into my brain right now, but there was something about a big
kernel lock for Unix domain sockets which possibly explained some
local pgbench problems, and there was also a problem relating to
wakeup priority with some test parameters, which I'd need to go and
dig up.  If you want to test this and let us know how you get on,
that'd be great!  Here's a rebase against PostgreSQL's master branch,
and since you mentioned PostgreSQL 11, here's a rebased version for
REL_11_STABLE in case that's easier for you to test/build via ports or
whatever and test with your production workload (eg on a throwaway
copy of your production system).  You can see it's working by looking
in top: instead of state "select" (which is how poll() is reported)
you see "kqread", which on its own isn't exciting enough to get this
committed :-)

PS Here's a list of slow burner PostgreSQL/FreeBSD projects:
https://wiki.postgresql.org/wiki/FreeBSD


0001-Add-kqueue-2-support-for-WaitEventSet-v13.patch
Description: Binary data


0001-Add-kqueue-2-support-for-WaitEvent-v13-REL_11_STABLE.patch
Description: Binary data


Re: Optimizing TransactionIdIsCurrentTransactionId()

2019-12-19 Thread Tomas Vondra

On Thu, Dec 19, 2019 at 02:27:01PM -0500, Robert Haas wrote:

On Wed, Dec 18, 2019 at 5:07 AM Simon Riggs  wrote:

TransactionIdIsCurrentTransactionId() doesn't seem to be well optimized for the 
case when an xid has not yet been assigned, i.e. for read-only transactions.

A patch for this is attached.


It might be an idea to first call TransactionIdIsNormal(xid), then
GetTopTransactionIdIfAny(), then TransactionIdIsNormal(topxid), so
that we don't bother with GetTopTransactionIdIfAny() when
!TransactionIdIsNormal(xid).

But it's also not clear to me whether this is actually a win. You're
doing an extra TransactionIdIsNormal() test to sometimes avoid a
GetTopTransactionIdIfAny() test. TransactionIdIsNormal() is pretty
cheap, but GetTopTransactionIdIfAny() isn't all that expensive either,
and adding more branches costs something.



I think "optimization" patches should generally come with some sort of
quantification of the gains - e.g. a benchmark with somewhat realistic
workload (but even synthetic is better than nothing). Or at least some
explanation *why* it's going to be an improvement.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-12-19 Thread Peter Geoghegan
On Thu, Dec 19, 2019 at 12:05 PM Robert Haas  wrote:
> My impression is that this is more of an implementation restriction
> than a design goal. I don't really remember the details, but it seems
> to me that there were locking and/or cache invalidation problems with
> making ALTER OPERATOR CLASS do more substantive things -- and that it
> was because of those problems, not a lack of desire, that we didn't
> support it.

I agree with you. My point was only that this is something that the
operator class author is really expected to get right the first time
around -- just like the behavior of B-Tree support function 1. We're
really only concerned about the upgrade path for external types that
could see a benefit from the optimization planned for nbtree (and
possibly other such optimization). Providing a non-disruptive way to
get that benefit after a pg_upgrade only seems like a nice-to-have to
me, because it's not as if anything will stop working as well as it
once did. Also, there aren't that many external types that will be
made more useful by being able to use optimizations like
deduplication; in practice almost all B-Tree indexes only use a small
handful of operator classes that are shipped in core Postgres. Once
you're using common types like text and integer, a pg_upgrade'd
database is only a REINDEX away from being able to use deduplication
(though I am not sure if even that will be necessary in the final
patch; I hope to be able to avoid even that inconvenience with indexes
using core operator classes).

If the underlying behavior of an operator class actually changes, then
that's a disaster for all the usual reasons. It doesn't make that much
sense to reverse an earlier decision to make an operator class
BITWISE. Better to drop everything, and recreate everything, since
your indexes should be considered corrupt anyway. (Also, I don't think
that it's that hard to get it right, so this will probably never
happen.)

-- 
Peter Geoghegan




Re: [PATCH] Fix expressions always false

2019-12-19 Thread Dagfinn Ilmari Mannsåker
Ranier Vilela  writes:

> More about expressions always false.
> 2. src/backend/utils/adt/formatting.c
> result is declared long. Comparison with int limits is always false.
> 3. src/backend/utils/adt/jsonfuncs.c
> lindex is declared long. . Comparison with int limits is always false.

1) long is 64 bits on Unix-like platforms
2) checking a long against LONG_MIN/LONG_MAX is _definitely_ pointless
3) it's being cast to an int for the from_char_set_int() call below

Please take your time to read the whole context of the code you're
changing, and consider other platforms than just Windows.

- ilmari
-- 
"A disappointingly low fraction of the human race is,
 at any given time, on fire." - Stig Sandbeck Mathisen




Re: [HACKERS] kqueue

2019-12-19 Thread Rui DeSousa


> On Apr 10, 2018, at 9:05 PM, Thomas Munro  
> wrote:
> 
> On Wed, Dec 6, 2017 at 12:53 AM, Thomas Munro
>  wrote:
>> On Thu, Jun 22, 2017 at 7:19 PM, Thomas Munro
>>  wrote:
>>> I don't plan to resubmit this patch myself, but I was doing some
>>> spring cleaning and rebasing today and I figured it might be worth
>>> quietly leaving a working patch here just in case anyone from the
>>> various BSD communities is interested in taking the idea further.
> 
> I heard through the grapevine of some people currently investigating
> performance problems on busy FreeBSD systems, possibly related to the
> postmaster pipe.  I suspect this patch might be a part of the solution
> (other patches probably needed to get maximum value out of this patch:
> reuse WaitEventSet objects in some key places, and get rid of high
> frequency PostmasterIsAlive() read() calls).  The autoconf-fu in the
> last version bit-rotted so it seemed like a good time to post a
> rebased patch.
> 
> -- 
> Thomas Munro
> http://www.enterprisedb.com
> 

Hi, 

I’m interested in the kqueue patch and would like to know its current state and
possible timeline for inclusion in the base code.  I have several large FreeBSD
systems running PostgreSQL 11 that I believe currently display this issue.
The system has 88 vCPUs, 512GB RAM, and a very active application with over
1000 connections to the database.  The system exhibits high kernel CPU usage
servicing poll() for connections that are idle.

I’ve been testing pg_bouncer to reduce the number of connections and thus
system CPU usage; however, not all connections can go through pg_bouncer.

Thanks,
Rui.



Re: Fetching timeline during recovery

2019-12-19 Thread Jehan-Guillaume de Rorthais
On Fri, 13 Dec 2019 16:12:55 +0900
Michael Paquier  wrote:

> On Wed, Dec 11, 2019 at 10:45:25AM -0500, Stephen Frost wrote:
> > I'm confused- wouldn't the above approach be a function that's returning
> > only one row, if you had a bunch of columns and then had NULL values for
> > those cases that didn't apply..?  Or, if you were thinking about the SRF
> > approach that you suggested, you could use a WHERE clause to make it
> > only one row...  Though I can see how it's nicer to just have one row in
> > some cases which is why I was suggesting the "bunch of columns"
> > approach.  
> 
> Oh, sorry.  I see the confusion now and that's my fault.  In
> https://www.postgresql.org/message-id/20191211052002.gk72...@paquier.xyz
> I mentioned a SRF function which takes an input argument, but that
> makes no sense.  What I would prefer having is just having one
> function, returning one row (LSN, TLI), using in input one argument to
> extract the WAL information the caller wants with five possible cases
> (write, insert, flush, receive, replay).

It looks odd next to the other five existing functions of the same family,
which have no TLI. And this kind of user interaction with an admin function is
quite different from what we are used to with other admin funcs. But mostly,
when I think of such a function, I keep thinking this parameter should be a
WHERE clause on a SRF function.

-1

> Then, what you are referring to is one function which returns all
> (LSN,TLI) for the five cases (write, insert, etc.), so it would return
> one row with 10 columns, with NULL mapping to the values which have no
> meaning (like replay on a primary).

This would look like some other pg_stat_* functions, e.g. pg_stat_get_archiver.
I'm OK with this. It could even be turned into a catalog view.

However, what's the point of gathering all the values, e.g. from a production
cluster? Is it really useful to compare the current/insert/flush LSNs from the
WAL writer?

It's easier to answer from a standby point of view as the lag between received
and replayed might be interesting to report in various situations.

> And on top of that we have a third possibility: one SRF function
> returning 5 rows with three attributes (mode, LSN, TLI), where mode
> corresponds to one value in the set {write, insert, etc.}.

I prefer the second one. Just select the field(s) you need, no need for a
WHERE clause, similar to some other stats functions.

-1


As a fourth possibility, since I explained my last implementation's details
badly, I still hope we can keep it in the loop here. Just overload the existing
functions with ones that take a boolean as parameter and add the TLI as a
second field, e.g.:

Name   | Result type  | Argument data types
---+--+---
pg_current_wal_lsn | pg_lsn   |
pg_current_wal_lsn | SETOF record | with_tli bool, OUT lsn pg_lsn, OUT tli int


And the fifth one, implementing brand new functions:

 pg_current_wal_lsn_tli
 pg_current_wal_insert_lsn_tli
 pg_current_wal_flush_lsn_tli
 pg_last_wal_receive_lsn_tli
 pg_last_wal_replay_lsn_tli

> I actually prefer the first one, and you mentioned the second.  But
> there could be a point in doing the third one.  An advantage of the
> second and third ones is that you may be able to get a consistent view
> of all the data, but it means holding locks to look at the values a
> bit longer.  Let's see what others think.

I like the fourth one, but I was not able to return only one field if the
given parameter is false or NULL. Giving false as an argument to these funcs
has no meaning compared to the original ones without an argument. I ended up
with this solution because I was worried about adding five more funcs really
close to some existing ones.

The fifth one is more consistent with what we already have.

Thanks again.

Regards,




[PATCH] Fix expressions always false

2019-12-19 Thread Ranier Vilela
More about expressions that are always false.
1. /src/backend/executor/execExprInterp.c
ndims <= 0 can never be true, because ndims is always elem_ndims + 1.
2. src/backend/utils/adt/formatting.c
result is declared long. Comparison with int limits is always false.
3. src/backend/utils/adt/jsonfuncs.c
lindex is declared long. Comparison with int limits is always false.
4. src/backend/utils/adt/network.c
ip_addrsize is a macro and always returns 4 or 16.

regards,
Ranier Vilela
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index dbed597816..f0ad9f23e5 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -2736,7 +2736,7 @@ ExecEvalArrayExpr(ExprState *state, ExprEvalStep *op)
 /* Get sub-array details from first member */
 elem_ndims = this_ndims;
 ndims = elem_ndims + 1;
-if (ndims <= 0 || ndims > MAXDIM)
+if (ndims > MAXDIM)
 	ereport(ERROR,
 			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 			 errmsg("number of array dimensions (%d) exceeds " \
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 8fcbc2267f..832b3b2ed2 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -2418,13 +2418,13 @@ from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node,
 	 copy, node->key->name),
 			  errdetail("Value must be an integer.";
 
-	if (errno == ERANGE || result < INT_MIN || result > INT_MAX)
+	if (errno == ERANGE || result < LONG_MIN || result > LONG_MAX)
 		RETURN_ERROR(ereport(ERROR,
 			 (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
 			  errmsg("value for \"%s\" in source string is out of range",
 	 node->key->name),
-			  errdetail("Value must be in the range %d to %d.",
-		INT_MIN, INT_MAX;
+			  errdetail("Value must be in the range %ld to %ld.",
+		LONG_MIN, LONG_MAX;
 
 	if (dest != NULL)
 	{
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 1b0fb2afae..ec988fb6b9 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -1406,7 +1406,7 @@ get_jsonb_path_all(FunctionCallInfo fcinfo, bool as_text)
 			errno = 0;
 			lindex = strtol(indextext, , 10);
 			if (endptr == indextext || *endptr != '\0' || errno != 0 ||
-lindex > INT_MAX || lindex < INT_MIN)
+lindex > LONG_MAX || lindex < LONG_MIN)
 PG_RETURN_NULL();
 
 			if (lindex >= 0)
@@ -4786,11 +4786,11 @@ setPathArray(JsonbIterator **it, Datum *path_elems, bool *path_nulls,
 
 		errno = 0;
 		lindex = strtol(c, , 10);
-		if (errno != 0 || badp == c || *badp != '\0' || lindex > INT_MAX ||
-			lindex < INT_MIN)
+		if (errno != 0 || badp == c || *badp != '\0' || lindex > LONG_MAX ||
+			lindex < LONG_MIN)
 			ereport(ERROR,
 	(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
-	 errmsg("path element at position %d is not an integer: \"%s\"",
+	 errmsg("path element at position %d is not an long integer: \"%s\"",
 			level + 1, c)));
 		idx = lindex;
 	}
diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c
index a6dd8b75aa..bb1e8522dd 100644
--- a/src/backend/utils/adt/network.c
+++ b/src/backend/utils/adt/network.c
@@ -280,8 +280,6 @@ network_send(inet *addr, bool is_cidr)
 	pq_sendbyte(, ip_bits(addr));
 	pq_sendbyte(, is_cidr);
 	nb = ip_addrsize(addr);
-	if (nb < 0)
-		nb = 0;
 	pq_sendbyte(, nb);
 	addrptr = (char *) ip_addr(addr);
 	for (i = 0; i < nb; i++)


Condition variables vs interrupts

2019-12-19 Thread Thomas Munro
Hi,

While testing something unrelated, Tomas reported[1] that he could
make a parallel worker ignore a SIGTERM and hang forever in
ConditionVariableSleep().  I looked into this and realised that it's
more likely in master.  Commit 1321509f refactored the latch wait loop
to look a little bit more like other examples* by putting
CHECK_FOR_INTERRUPTS() after ResetLatch(), whereas previously it was
at the top of the loop.  ConditionVariablePrepareToSleep() was
effectively relying on that order when it reset the latch without its
own CFI().

The bug goes back to the introduction of CVs however, because there
was no guarantee that you'd ever reach ConditionVariableSleep().  You
could call ConditionVariablePrepareToSleep(), test your condition,
decide you're done, then call ConditionVariableCancelSleep(), then
reach some other WaitLatch() with no intervening CFI().  It might be
hard to find a code path that actually does that without a
coincidental CFI() to save you, but at least in theory the problem
exists.

I think we should probably just remove the unusual ResetLatch() call,
rather than adding a CFI().  See attached.  Thoughts?

*It can't quite be exactly like the two patterns shown in latch.h,
namely { Reset, Test, Wait } and { Test, Wait, Reset }, because the
real test is external to this function; we have the other possible
rotation { Wait, Reset, Test }, and this code is only run if the
client's test failed.  Really it's a nested loop, with the outer loop
belonging to the caller, so we have { Test', { Wait, Reset, Test } }.
However, CHECK_FOR_INTERRUPTS() also counts as a test of work to do,
and AFAICS it always belongs between Reset and Wait, no matter how far
you rotate the loop.  I wonder if latch.h should mention that.
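To make the placement concrete, here is a schematic sketch of the
{ Wait, Reset, Test } rotation (the caller's real test and the wait-event
argument are placeholders, not code from the patch):

	for (;;)
	{
		if (condition_is_satisfied())	/* caller-supplied test */
			break;
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
						 -1L, 0 /* placeholder wait event */ );
		ResetLatch(MyLatch);
		/* between Reset and the next Wait, so an interrupt cannot be lost */
		CHECK_FOR_INTERRUPTS();
	}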

[1] 
https://www.postgresql.org/message-id/20191217232124.3dtrycatgfm6oxxb%40development


0001-Don-t-call-ResetLatch-in-ConditionVariablePrepareToS.patch
Description: Binary data


Re: Add support for automatically updating Unicode derived files

2019-12-19 Thread John Naylor
On Tue, Oct 29, 2019 at 6:06 AM Peter Eisentraut
 wrote:
>
> Continuing the discussion from [0] and [1], here is a patch that
> automates the process of updating Unicode derived files.  Summary:
>
> - Edit UNICODE_VERSION and/or CLDR_VERSION in src/Makefile.global.in
> - Run make update-unicode
> - Commit

Hi Peter,

I gave "make update-unicode" a try. It's unclear to me what the state
of the build tree should be when a maintainer runs this, so I'll just
report what happens when running naively (on MacOS).

After only running configure, "make update-unicode" gives this error
at normalization-check:

ld: library not found for -lpgcommon
clang: error: linker command failed with exit code 1 (use -v to see invocation)

After commenting that out, the next command "$(MAKE) -C
contrib/unaccent $@" failed, seemingly because $(PYTHON) is empty
unless --with-python was specified at configure time.

> Open questions that are currently not handled consistently:
>
> - Should the downloaded files be listed in .gitignore?

These files are transient byproducts of a build, and we don't want
them committed, so they seem like a normal candidate for .gitignore.

> - Should the downloaded files be cleaned by make clean (or distclean or
> maintainer-clean or none)?

It seems one would want to make clean without removing these files,
and maintainer-clean is for removing things that are preserved in
distribution tarballs. So I would go with distclean.

> - Should the generated files be excluded from pgindent?  Currently, the
> generated files will not pass pgindent unchanged, so that could cause
> annoying whitespace battles when these files are updated and re-indented
> around release time.

I see what you mean in the norm table header. I think generated files
should not be pgindent'd, since creating them is already a consistent,
mechanical process, and their presentation is not as important as
other code.

Other comments:

+print "/* generated by
src/common/unicode/generate-unicode_combining_table.pl, do not edit
*/\n\n";

I would print out the full boilerplate like for other generated headers.

Lastly, src/common/unicode/README is outdated (and possibly no longer
useful at all?).

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Fetching timeline during recovery

2019-12-19 Thread Jehan-Guillaume de Rorthais
On Wed, 11 Dec 2019 14:20:02 +0900
Michael Paquier  wrote:

> On Thu, Sep 26, 2019 at 07:20:46PM +0200, Jehan-Guillaume de Rorthais wrote:
> > If this solution is accepted, some other function of the same family might
> > be good candidates as well, for the sake of homogeneity:
> > 
> > * pg_current_wal_insert_lsn
> > * pg_current_wal_flush_lsn
> > * pg_last_wal_replay_lsn
> > 
> > However, I'm not sure how useful this would be.
> > 
> > Thanks again for your time, suggestions and review!  
> 
> +{ oid => '3435', descr => 'current wal flush location',
> +  proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
> proisstrict => 'f',
> This description is incorrect.

Indeed. And the one for pg_current_wal_lsn(bool) as well.

> And please use OIDs in the range of 8000~ for patches in
> development.  You could just use src/include/catalog/unused_oids which
> would point out a random range.

Thank you for this information, I wasn't aware.

> +   if (recptr == 0) {
> +   nulls[0] = 1;
> +   nulls[1] = 1;
> +   }
> The indentation of the code is incorrect, these should use actual
> booleans and recptr should be InvalidXLogRecPtr (note also the
> existence of the macro XLogRecPtrIsInvalid).  Just for the style.

Fixed on my side. Thanks.

> As said in the last emails exchanged on this thread, I don't see how
> you cannot use multiple functions which have different meaning
> depending on if the cluster is a primary or a standby knowing that we
> have two different concepts of WAL when at recovery: the received
> LSN and the replayed LSN, and three concepts for primaries (insert,
> current, flush).  

As I wrote in my previous email, existing functions could be overloaded
as well for the sake of homogeneity. So the five of them would have similar
behavior/API.

> I agree as well with the point of Fujii-san about
> not returning the TLI and the LSN across different functions as this
> opens the door for a risk of inconsistency for the data received by
> the client.

My last patch fixed that, indeed.

> + * When the first parameter (variable 'with_tli') is true, returns the
> current
> + * timeline as second field. If false, second field is null.
> I don't see much the point of having this input parameter which
> determines the NULL-ness of one of the result columns, and I think
> that you had better use a completely different function name for each
> one of them instead of enforcing the functions.  Let's remember that a
> lot of tools use the existing functions directly in the SELECT clause
> for LSN calculations, which is just a 64-bit integer *without* a
> timeline assigned to it.  However your patch mixes both concepts by
> using pg_current_wal_lsn.

Sorry, I realize I was not clear enough about implementation details.
My latest patch does **not** introduce a regression for existing tools. If you do
not pass any parameter, the behavior is the same, only one column:

  # primary
  $ cat <

> So we could do more with the introduction of five new functions which
> allow to grab the LSN and the TLI in use for replay, received, insert,
> write and flush positions:
> - pg_current_wal_flush_info
> - pg_current_wal_insert_info
> - pg_current_wal_info
> - pg_last_wal_receive_info
> - pg_last_wal_replay_info

I could go this way if you prefer, maybe using _tli as the suffix instead of
_info, as this is the only new info added. I think it feels redundant with the
original funcs, but it might be the simplest solution.

> I would be actually tempted to do the following: one single SRF
> function, say pg_wal_info which takes a text argument in input with
> the following values: flush, write, insert, receive, replay.  Thinking
> more about it that would be rather neat, and more extensible than the
> rest discussed until now.  See for example PostgresNode::lsn.

I'll answer in your other mail that summarizes the other possibilities.

Thanks!




Re: global / super barriers (for checksums)

2019-12-19 Thread Magnus Hagander
On Thu, Dec 19, 2019 at 8:57 PM Robert Haas  wrote:

> On Tue, Dec 17, 2019 at 1:38 PM Robert Haas  wrote:
> > On Thu, Dec 12, 2019 at 2:54 PM Andres Freund 
> wrote:
> > > I'd either add a test (if we have some) or placeholder kind
> > > initially. But I'd also be ok with going for either of the other
> > > versions directly - but it seems harder to tackle the patches together.
> >
> > OK. I have committed 0001-0003 as I had mentioned last week that I
> > intended to do. For 0004, I have replaced "sample" with "placeholder"
> > and added comments explaining that this is intended to be replaced by
> > the first real user of the mechanism. If there are no further
> > comments/objections I'll go ahead with this one as well.
>
> And, done.
>

\o/

//Magnus


Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-12-19 Thread Robert Haas
On Wed, Nov 13, 2019 at 4:25 PM Peter Geoghegan  wrote:
> I think that that's probably not desirable. There should at least be a
> strong practical advantage if we go that way. This would mean ALTER
> OPERATOR CLASS could change the "substance" of an opclass, which is
> fundamentally different from what it can do already (it currently just
> changes the owner, or the schema that it is stored in).

My impression is that this is more of an implementation restriction
than a design goal. I don't really remember the details, but it seems
to me that there were locking and/or cache invalidation problems with
making ALTER OPERATOR CLASS do more substantive things -- and that it
was because of those problems, not a lack of desire, that we didn't
support it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: global / super barriers (for checksums)

2019-12-19 Thread Robert Haas
On Tue, Dec 17, 2019 at 1:38 PM Robert Haas  wrote:
> On Thu, Dec 12, 2019 at 2:54 PM Andres Freund  wrote:
> > I'd either add a test (if we have some) or placeholder kind
> > initially. But I'd also be ok with going for either of the other
> > versions directly - but it seems harder to tackle the patches together.
>
> OK. I have committed 0001-0003 as I had mentioned last week that I
> intended to do. For 0004, I have replaced "sample" with "placeholder"
> and added comments explaining that this is intended to be replaced by
> the first real user of the mechanism. If there are no further
> comments/objections I'll go ahead with this one as well.

And, done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Optimizing TransactionIdIsCurrentTransactionId()

2019-12-19 Thread Robert Haas
On Wed, Dec 18, 2019 at 5:07 AM Simon Riggs  wrote:
> TransactionIdIsCurrentTransactionId() doesn't seem to be well optimized for 
> the case when an xid has not yet been assigned, so for read only transactions.
>
> A patch for this is attached.

It might be an idea to first call TransactionIdIsNormal(xid), then
GetTopTransactionIdIfAny(), then TransactionIdIsNormal(topxid), so
that we don't bother with GetTopTransactionIdIfAny() when
!TransactionIdIsNormal(xid).
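Roughly like this hypothetical fast-path helper (a sketch only; the real
function would continue with its existing subtransaction search when the
helper returns true):

#include "postgres.h"
#include "access/transam.h"
#include "access/xact.h"

static bool
xid_could_be_current(TransactionId xid)
{
	/* Bootstrap/frozen/invalid xids are never "current". */
	if (!TransactionIdIsNormal(xid))
		return false;

	/*
	 * If no top-level xid has been assigned, this backend owns no xids at
	 * all, so a read-only transaction bails out without further work.
	 */
	if (!TransactionIdIsNormal(GetTopTransactionIdIfAny()))
		return false;

	return true;				/* fall back to the existing full search */
}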

But it's also not clear to me whether this is actually a win. You're
doing an extra TransactionIdIsNormal() test to sometimes avoid a
GetTopTransactionIdIfAny() test. TransactionIdIsNormal() is pretty
cheap, but GetTopTransactionIdIfAny() isn't all that expensive either,
and adding more branches costs something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Clean up some old cruft related to Windows

2019-12-19 Thread Juan José Santamaría Flecha
On Thu, Dec 19, 2019 at 5:47 AM Kyotaro Horiguchi 
wrote:

> At Thu, 19 Dec 2019 11:15:26 +0900, Michael Paquier 
> wrote in
> > Hi all,
> >
> > As discussed here, there is in the tree a couple of things related to
> > past versions of Windows:
> >
> https://www.postgresql.org/message-id/201912180219susv254.ge1...@paquier.xyz
> >
> > So I have been looking at that more closely, and found more:
> > - MIN_WINNT can be removed from win32.h thanks to d9dd406 which has
> > added a requirement on C99 with Windows 7 as minimum platform
> > supported.  (The issue mentioned previously.)
> > - pipe_read_line(), used when finding another binary for a given
> > installation via find_other_exec() has some special handling related
> > to Windows 2000 and older versions.
> > - When trying to load getaddrinfo(), we try to load it from
> > wship6.ddl, which was something needed in Windows 2000, but newer
> > Windows versions include it in ws2_32.dll.
> > - A portion of the docs still refer to Windows 98.
> >
> > Thoughts?
>
> I think MIN_WINNT is definitely removable.
>
>
This is probably not an issue for the supported MSVC versions and their SDKs,
but current MinGW defaults to Windows 2003 [1]. So I would suggest logic like:

#define WINNTVER(ver) ((ver) >> 16)
#define NTDDI_VERSION 0x06000100
#define _WIN32_WINNT WINNTVER(NTDDI_VERSION)

[1]
https://github.com/mirror/mingw-w64/blob/master/mingw-w64-headers/include/sdkddkver.h

Regards,

Juan José Santamaría Flecha


TCP option assign hook doesn't work well if option not supported

2019-12-19 Thread Peter Eisentraut
macOS does not support the socket option TCP_USER_TIMEOUT.  Yet, I can 
start a server with postgres -D ... --tcp-user-timeout=100 without a 
diagnostic.  Only when I connect do I get a log entry


LOG:  setsockopt(TCP_USER_TIMEOUT) not supported

Perhaps the logic in pq_settcpusertimeout() should be changed like this:

 int
 pq_settcpusertimeout(int timeout, Port *port)
 {
+#ifdef TCP_USER_TIMEOUT
if (port == NULL || IS_AF_UNIX(port->laddr.addr.ss_family))
return STATUS_OK;

-#ifdef TCP_USER_TIMEOUT
if (timeout == port->tcp_user_timeout)
return STATUS_OK;

So that the #else branch that is supposed to check this will also be run 
in the postmaster (where port == NULL).


Or perhaps there should be a separate GUC check hook that just does

#ifndef TCP_USER_TIMEOUT
if (val != 0)
return false;
#endif
return true;
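Something along these lines, as a sketch only (the hook name and its wiring
into guc.c are hypothetical; TCP_USER_TIMEOUT, where it exists, comes from
<netinet/tcp.h>):

#include "postgres.h"
#include "utils/guc.h"

static bool
check_tcp_user_timeout(int *newval, void **extra, GucSource source)
{
#ifndef TCP_USER_TIMEOUT
	/* Reject nonzero values on platforms without TCP_USER_TIMEOUT. */
	if (*newval != 0)
	{
		GUC_check_errdetail("TCP_USER_TIMEOUT is not supported on this platform.");
		return false;
	}
#endif
	return true;
}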

The same considerations apply to the various TCP keepalive settings, but 
since those are widely supported the unsupported code paths probably 
haven't gotten much attention.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: How is this possible "publication does not exist"

2019-12-19 Thread Peter Eisentraut

On 2019-12-19 19:15, Dave Cramer wrote:
It seems that if you drop the publication on an existing slot it needs 
to be recreated. Is this expected behaviour


A publication is not associated with a slot.  Only a subscription is 
associated with a slot.



drop publication dbz_publication ;
DROP PUBLICATION
postgres=# create publication dbz_publication for all tables;
CREATE PUBLICATION
postgres=# SELECT * FROM pg_logical_slot_get_binary_changes('debezium', 
NULL, NULL,'proto_version','1','publication_names','dbz_publication');

ERROR:  publication "dbz_publication" does not exist
CONTEXT:  slot "debezium", output plugin "pgoutput", in the change 
callback, associated LSN 0/4324180


This must be something particular to Debezium.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: How is this possible "publication does not exist"

2019-12-19 Thread Dave Cramer
On Thu, 19 Dec 2019 at 11:59, Dave Cramer  wrote:

> The publication exists but for some reason the function can't find it
>
> SELECT * FROM pg_logical_slot_get_binary_changes('debezium', NULL,
> NULL,'proto_version','1','publication_names','dbz_publication');
> ERROR:  publication "dbz_publication" does not exist
> CONTEXT:  slot "debezium", output plugin "pgoutput", in the change
> callback, associated LSN 0/307D8E8
> postgres=# select * from pg_publication;
>      pubname     | pubowner | puballtables | pubinsert | pubupdate | pubdelete | pubtruncate
> -----------------+----------+--------------+-----------+-----------+-----------+-------------
>  dbz_publication |       10 | t            | t         | t         | t         | t
> (1 row)
>
> postgres=# SELECT * FROM pg_logical_slot_get_binary_changes('debezium',
> NULL, NULL,'proto_version','1','publication_names','dbz_publication');
> ERROR:  publication "dbz_publication" does not exist
> CONTEXT:  slot "debezium", output plugin "pgoutput", in the change
> callback, associated LSN 0/307D8E8
>

It seems that if you drop the publication on an existing slot it needs to
be recreated. Is this expected behaviour

drop publication dbz_publication ;
DROP PUBLICATION
postgres=# create publication dbz_publication for all tables;
CREATE PUBLICATION
postgres=# SELECT * FROM pg_logical_slot_get_binary_changes('debezium',
NULL, NULL,'proto_version','1','publication_names','dbz_publication');
ERROR:  publication "dbz_publication" does not exist
CONTEXT:  slot "debezium", output plugin "pgoutput", in the change
callback, associated LSN 0/4324180

Dave Cramer


Re: [PATCH] Remove twice assignment with var pageop (nbtree.c).

2019-12-19 Thread Peter Geoghegan
On Thu, Dec 19, 2019 at 7:55 AM Tom Lane  wrote:
> I don't think this is actually a good idea.  What it is is a foot-gun,
> because if anyone adds code there that wants to access the special area
> of that particular page, it'll do the wrong thing, unless they remember
> to put back the assignment of "opaque".  The sequence of BufferGetPage()
> and PageGetSpecialPointer() is a very standard switch-our-attention-
> to-another-page locution in nbtree and other index AMs.

+1

-- 
Peter Geoghegan




Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-19 Thread Robert Haas
On Thu, Dec 19, 2019 at 10:59 AM Tom Lane  wrote:
> Bruce Momjian  writes:
> > Good question.  I am in favor of allowing a larger value if no one
> > objects.  I don't think adding the min/max is helpful.
>
> I think there are pretty obvious performance and memory-consumption
> penalties to very large track_activity_query_size values.  Who exactly
> are we really helping if we let them set it to huge values?

The original poster.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP/PoC for parallel backup

2019-12-19 Thread Robert Haas
On Thu, Dec 12, 2019 at 10:20 AM Asif Rehman  wrote:
> I have updated the patches (v7 attached) and have taken care of all issues 
> pointed by Jeevan, additionally
> ran the pgindent on each patch. Furthermore, Command names have been renamed 
> as suggested and I
> have simplified the SendFiles function. Client can only request the regular 
> files, any other kind such as
> directories or symlinks will be skipped, the client will be responsible for 
> taking care of such.

Hi,

Patch 0001 of this series conflicts with my recent commit
303640199d0436c5e7acdf50b837a027b5726594; that commit was actually
inspired by some previous study of 0001. That being said, I think 0001
has the wrong idea. There's no reason that I can see why it should be
correct to remove the PG_ENSURE_ERROR_CLEANUP calls from
perform_base_backup(). It's true that if we register a long-lived
before_shmem_exit hook, then the backup will get cleaned up even
without the PG_ENSURE_ERROR_CLEANUP block, but there's also the
question of the warning message. I think that our goal should be to
emit the warning message about a backup being stopped too early if the
user uses either pg_start_backup() or the new START_BACKUP command and
does not end the backup with either pg_stop_backup() or the new
STOP_BACKUP command -- but not if a single command that both starts
and ends a backup, like BASE_BACKUP, is interrupted. To accomplish
that goal in the wake of 303640199d0436c5e7acdf50b837a027b5726594, we
need to temporarily register do_pg_abort_backup() as a
before_shmem_exit() handler using PG_ENSURE_ERROR_CLEANUP() during
commands like BASE_BACKUP() -- and for things like pg_start_backup()
or the new START_BACKUP command, we just need to add a single call to
register_persistent_abort_backup_handler().

So I think you can drop 0001, and then in the patch that actually
introduces START_BACKUP, add the call to
register_persistent_abort_backup_handler() before calling
do_pg_start_backup(). Also in that patch, also adjust the warning text
that do_pg_abort_backup() emits to be more generic e.g. "aborting
backup due to backend exiting while a non-exclusive backup is in
progress".

0003 creates three new functions, moving code from
do_pg_start_backup() to a new function collectTablespaces() and from
perform_base_backup() to new functions setup_throttle() and
include_wal_files(). I'm skeptical about all of these changes. One
general nitpick is that the way these function names are capitalized
and punctuated does not seem to have been chosen very consistently;
how about name_like_this() throughout? A bit more substantively:

- collectTablespaces() is factored out of do_pg_start_backup() so that
it can also be used by SendFileList(), but that means that a client is
going to invoke START_BACKUP, indirectly calling collectTablespaces(),
and then immediately afterward the client is probably going to call
SEND_FILE_LIST, which will again call collectTablespaces(). That does
not appear to be super-great. For one thing, it's duplicate work,
although because SendFileList() is going to pass infotbssize as false,
it's not a lot of duplicated work. Also, what happens if the two calls
to collectTablespaces() return different answers due to concurrent
CREATE/DROP TABLESPACE commands? Maybe it would all work out fine, but
it seems like there is at least the possibility of bugs if different
parts of the backup have different notions of what tablespaces exist.

- setup_throttle() is factored out of perform_base_backup() so that it
can be called in StartBackup() and StopBackup() and SendFiles(). This
seems extremely odd. Why does it make any sense to give the user an
option to activate throttling when *ending* a backup? Why does it make
sense to give the user a chance to enable throttling *both* at the
startup of a backup *and also* for each individual file. If we're
going to support throttling here, it seems like it should be either a
backup-level property or a file-level property, not both.

- include_wal_files() is factored out of perform_base_backup() so that
it can be called by StopBackup(). This seems like a poor design
decision. The idea behind the BASE_BACKUP command is that you run that
one command, and the server sends you everything. The idea in this new
way of doing business is that the client requests the individual files
it wants -- except for the WAL files, which are for some reason not
requested individually but sent all together as part of the
STOP_BACKUP response. It seems like it would be more consistent if the
client were to decide which WAL files it needs and request them one by
one, just as we do with other files.

I think there's a common theme to all of these complaints, which is
that you haven't done enough to move things that are the
responsibility of the backend in the BASE_BACKUP model to the frontend
in this model. I started wondering, for example, whether it might not
be better to have the client rather than the server construct the
tablespace_map 

How is this possible "publication does not exist"

2019-12-19 Thread Dave Cramer
The publication exists but for some reason the function can't find it

SELECT * FROM pg_logical_slot_get_binary_changes('debezium', NULL,
NULL,'proto_version','1','publication_names','dbz_publication');
ERROR:  publication "dbz_publication" does not exist
CONTEXT:  slot "debezium", output plugin "pgoutput", in the change
callback, associated LSN 0/307D8E8
postgres=# select * from pg_publication;
     pubname     | pubowner | puballtables | pubinsert | pubupdate | pubdelete | pubtruncate
-----------------+----------+--------------+-----------+-----------+-----------+-------------
 dbz_publication |       10 | t            | t         | t         | t         | t
(1 row)

postgres=# SELECT * FROM pg_logical_slot_get_binary_changes('debezium',
NULL, NULL,'proto_version','1','publication_names','dbz_publication');
ERROR:  publication "dbz_publication" does not exist
CONTEXT:  slot "debezium", output plugin "pgoutput", in the change
callback, associated LSN 0/307D8E8

Dave Cramer


RE: [PATCH] Remove twice assignment with var pageop (nbtree.c).

2019-12-19 Thread Ranier Vilela
From: Bruce Momjian
Sent: Thursday, December 19, 2019 16:19

>Oh, I was not aware that was boilerplate code.  I agree it should be
>consistent, so patch reverted.  Sorry.
I apologize to you, Bruce.
It is difficult to tell where a piece of "boilerplate" exists.
I agree that a decent compiler will remove it; maybe MSVC won't, but that's
another story...

Best regards,
Ranier Vilela



Re: Too rigorous assert in reorderbuffer.c

2019-12-19 Thread Arseny Sher


Arseny Sher  writes:

> I'm sorry to bother you with this again, but due to a new test our
> internal buildfarm revealed that the adjacent assert on cmin is also a lie.
> You see, we can't assume cmin is stable because the same key (relnode, tid)
> might refer to completely different tuples if a tuple was inserted by an
> aborted subxact, immediately reclaimed, and then the space occupied by
> another one. Fix is attached.
>
> Technically this might mean a user-facing bug, because we only pick the
> first cmin, which means we might get visibility wrong, allowing us to see
> some version too early (i.e. the real cmin of the tuple is y, but decoding
> thinks it is x, and x < y). However, I couldn't quickly make up an example
> where this would actually lead to bad consequences. I tried to create
> such an extra visible row in pg_attribute, but that's OK because the loop in
> RelationBuildTupleDesc spins exactly natts times and ignores what is
> left unscanned. It is also ok with pg_class, because apparently
> ScanPgRelation also fishes out the (right) first tuple and doesn't check
> for duplicates appearing later in the scan. Maybe I just haven't tried
> hard enough though.

This issue still exists, it would be nice to fix it...

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Proposal: Global Index

2019-12-19 Thread Bruce Momjian
On Thu, Dec 19, 2019 at 09:48:40AM +0100, Jose Luis Tallon wrote:
> On 19/12/19 4:03, Bruce Momjian wrote:
> > On Mon, Nov 25, 2019 at 03:44:39PM -0800, Jeremy Schneider wrote:
> > > On 11/25/19 15:05, Jeremy Schneider wrote:
> > > > ... the cost of doing the individual index lookups across 180
> > > > partitions (and 180 indexes) was very high, so they stored max and min
> > > > txn id per partition and would generate a query with all the dates that
> > > > a txn id could have been in so that only a small number of partition
> > > > indexes would be accessed.
> > > > 
> > > > .. If we are looking for higher concurrency, we can usually
> > > > add a hack/workaround that filters on a partition key to provide “pretty
> > > > good” pruning.  The net result is that you get 2-3x the IO due to the
> > > > lack of global index (same workaround as first story above).
> > > Is that basically like a global BRIN index with granularity at the
> > > partition level?
> > Exactly!  :-)
> 
> Actually, one "kind of" BRIN index *per partitioned table* mapping (key
> range) -> (partition oid)... and so concurrency doesn't need to be very
> affected.
> 
> (we don't need to do things just like other RDBMS do, ya know... ;)
> 
> 
> IIRC, this precise approach was suggested around 2016 when initially
> discussing the "declarative partitioning" which originated Postgres' current
> partitioning scheme, in order to optimize partition pruning.

Robert Haas identified two needs for global indexes:


https://www.postgresql.org/message-id/ca+tgmob_j2m2+qkwrhg2njqekmewzntfd7a6ubg34fjuzpk...@mail.gmail.com

One of the biggest reasons why people want it is to enforce uniqueness
for secondary keys - e.g. the employees table is partitioned by
employee ID, but SSN should also be unique, at least among employees
for whom it's not NULL.

But people also want it for faster data retrieval: if you're looking
for a commonly-occurring value, an index per partition is fine. But if
you're looking for values that occur only once or a few times across
the whole hierarchy, an index scan per partition is very costly.

I don't see lossy BRIN indexes helping with the uniqueness use-case, and
I am not sure they would help with the rare case either.  They would
help for range-based partitions, but I thought our existing facilities
worked in that case.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [PATCH] Remove twice assignment with var pageop (nbtree.c).

2019-12-19 Thread Bruce Momjian
On Thu, Dec 19, 2019 at 10:55:42AM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Tue, Nov 26, 2019 at 01:45:10PM +, Ranier Vilela wrote:
> >> Same case on nbtpage.c at line 1637, with var opaque.
> >> make check, passed all 195 tests here with all commits.
> 
> > You were right about both of these, so removed in master.  I am
> > surprised no one else saw this before.
> 
> I don't think this is actually a good idea.  What it is is a foot-gun,
> because if anyone adds code there that wants to access the special area
> of that particular page, it'll do the wrong thing, unless they remember
> to put back the assignment of "opaque".  The sequence of BufferGetPage()
> and PageGetSpecialPointer() is a very standard switch-our-attention-
> to-another-page locution in nbtree and other index AMs.

Oh, I was not aware that was boilerplate code.  I agree it should be
consistent, so patch reverted.  Sorry.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Read Uncommitted

2019-12-19 Thread Andres Freund
Hi,

On 2019-12-19 07:08:06 -0800, Mark Dilger wrote:
> > As soon as a transaction aborts, the TOAST rows can be vacuumed
> > away, but the READ UNCOMMITTED transaction might've already seen the
> > main tuple. This is not even a particularly tight race, necessarily,
> > since for example the table might be scanned, feeding tuples into a
> > tuplesort, and then the detoating might happen further up in the query
> > tree after the sort has completed.
> 
> I don't know if this could be fixed without adding overhead to toast
> processing for non-RECOVERY transactions, but perhaps it doesn't need
> to be fixed at all.  Perhaps you just accept that in RECOVERY mode you
> can't see toast data, and instead get NULLs for all such rows.  Now,
> that could have security implications if somebody defines a policy
> where NULL in a toast column means "allow" rather than "deny" for
> some issue, but if this RECOVERY mode is limited to superusers, that
> isn't such a big objection.

I mean, that's just a small part of the issue. You can get *different*
data back for toast columns - incompatible with the datatype, leading to
crashes. You can get *different* data back for the same query, running
it twice, because data that was just inserted can get pruned away if the
inserting transaction aborted.


> There may be a number of other gotchas still to be resolved, but
> abandoning the patch at this stage strikes me as premature.

I think if we'd want this feature, you'd have to actually use a much
larger hammer, and change the snapshot logic to include information
about which aborted transactions are visible, and whose rows cannot be
removed. And then vacuuming/hot pruning need to be changed to respect
that. And note that'll affect *all* sessions, not just the one wanting
to use READ UNCOMMITTED.

Greetings,

Andres Freund




Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-19 Thread Tom Lane
Bruce Momjian  writes:
> Good question.  I am in favor of allowing a larger value if no one
> objects.  I don't think adding the min/max is helpful.

I think there are pretty obvious performance and memory-consumption
penalties to very large track_activity_query_size values.  Who exactly
are we really helping if we let them set it to huge values?

(wanders away wondering if we have suitable integer-overflow checks
in relevant code paths...)

regards, tom lane




Re: [PATCH] Remove twice assignment with var pageop (nbtree.c).

2019-12-19 Thread Tom Lane
Bruce Momjian  writes:
> On Tue, Nov 26, 2019 at 01:45:10PM +, Ranier Vilela wrote:
>> Same case on nbtpage.c at line 1637, with var opaque.
>> make check, passed all 195 tests here with all commits.

> You were right about both of these, so removed in master.  I am
> surprised no one else saw this before.

I don't think this is actually a good idea.  What it is is a foot-gun,
because if anyone adds code there that wants to access the special area
of that particular page, it'll do the wrong thing, unless they remember
to put back the assignment of "opaque".  The sequence of BufferGetPage()
and PageGetSpecialPointer() is a very standard switch-our-attention-
to-another-page locution in nbtree and other index AMs.

Any optimizing compiler will delete the dead store, we do not have
to do it by hand.

Let me put it this way: if we had the BufferGetPage() and
PageGetSpecialPointer() calls wrapped up as an "access new page" macro,
would we undo that in order to make this code change?  We would not.

regards, tom lane




Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-19 Thread Bruce Momjian


Good question.  I am in favor of allowing a larger value if no one
objects.  I don't think adding the min/max is helpful.

---

On Tue, Nov 26, 2019 at 05:59:25PM +0300, v.maka...@postgrespro.ru wrote:
> Hi Hackers,
> 
> Some ORMs may generate queries larger than the maximum possible value of
> track_activity_query_size (100 kB).
> Are there any reasons to limit the maximum value of track_activity_query_size
> to such a small value?
> Increasing the maximum value to 1 MB will help partially solve this problem.
> The patch also notes the maximum value of track_activity_query_size in the
> file postgresql.conf.sample (before, it was not specified).
> 
> --
> Vyacheslav Makarov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

> diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
> index ba4edde71a..0e64dc1dbb 100644
> --- a/src/backend/utils/misc/guc.c
> +++ b/src/backend/utils/misc/guc.c
> @@ -3200,7 +3200,7 @@ static struct config_int ConfigureNamesInt[] =
>   GUC_UNIT_BYTE
>   },
>   _track_activity_query_size,
> - 1024, 100, 102400,
> + 1024, 100, 1048576,
>   NULL, NULL, NULL
>   },
>  
> diff --git a/src/backend/utils/misc/postgresql.conf.sample 
> b/src/backend/utils/misc/postgresql.conf.sample
> index 46a06ffacd..55d3bfbfd0 100644
> --- a/src/backend/utils/misc/postgresql.conf.sample
> +++ b/src/backend/utils/misc/postgresql.conf.sample
> @@ -569,7 +569,7 @@
>  #track_counts = on
>  #track_io_timing = off
>  #track_functions = none  # none, pl, all
> -#track_activity_query_size = 1024# (change requires restart)
> +#track_activity_query_size = 1024# range 100B - 1MB (change requires 
> restart)
>  #stats_temp_directory = 'pg_stat_tmp'
>  
>  


-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Read Uncommitted

2019-12-19 Thread Andres Freund
Hi,

On 2019-12-19 09:50:44 +, Simon Riggs wrote:
> On Thu, 19 Dec 2019 at 02:22, Andres Freund  wrote:
> 
> > Hi,
> >
> > On 2019-12-18 18:06:21 +, Simon Riggs wrote:
> > > On Wed, 18 Dec 2019 at 17:35, Robert Haas  wrote:
> > >
> > > > On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs 
> > > > wrote:
> > > > > This was my first concern when I thought about it, but I realised
> > that
> > > > by taking a snapshot and then calculating xmin normally, this problem
> > would
> > > > go away.
> > > >
> > > > Why? As soon as a transaction aborts...
> > > >
> > >
> > > So this is the same discussion as elsewhere about potentially aborted
> > > transactions...
> > > AFAIK, the worst that happens in that case is that the reading
> > transaction
> > > will end with an ERROR, similar to a serializable error.
> >
> > I don't think that's all that can happen. E.g. the toast identifier
> > might have been reused, and there might be a different datum in
> > there. Which then means we'll end up calling operators on data that's
> > potentially for a different datatype - it's trivial to cause crashes
> > that way. And, albeit harder, possible to do more than that.
> >
> 
> On the patch as proposed this wouldn't be possible because a toast row
> can't be vacuumed and then reused while holding back xmin, at least as I
> understand it.

Vacuum and pruning remove rows where xmin didn't commit, without testing
against the horizon. Which makes sense, because normally there is no
snapshot yet that includes them. Unless we were to weaken that logic -
which'd have bloat impacts - a snapshot wouldn't guarantee anything
about the non-removal of such tuples, unless I am missing something.

Greetings,

Andres Freund




Re: [PATCH] Remove twice assignment with var pageop (nbtree.c).

2019-12-19 Thread Bruce Momjian
On Tue, Nov 26, 2019 at 01:45:10PM +, Ranier Vilela wrote:
> Same case on nbtpage.c at line 1637, with var opaque.
> make check, passed all 195 tests here with all commits.
> 
> Ranier Vilela

You were right about both of these, so removed in master.  I am
surprised no one else saw this before.

---

> diff --git a/src/backend/access/nbtree/nbtpage.c 
> b/src/backend/access/nbtree/nbtpage.c
> index 268f869a36..144fefccad 100644
> --- a/src/backend/access/nbtree/nbtpage.c
> +++ b/src/backend/access/nbtree/nbtpage.c
> @@ -1634,8 +1634,6 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, 
> BTStack stack)
>* delete the following item.
>*/
>   page = BufferGetPage(topparent);
> - opaque = (BTPageOpaque) PageGetSpecialPointer(page);
> -
>   itemid = PageGetItemId(page, topoff);
>   itup = (IndexTuple) PageGetItem(page, itemid);
>   BTreeInnerTupleSetDownLink(itup, rightsib);
> 


-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: client auth docs seem to have devolved

2019-12-19 Thread Alvaro Herrera
On 2019-Dec-19, Tom Lane wrote:

> I wrote:
> > Concretely, I propose the attached.  Anybody want to editorialize on
> > my short descriptions of the auth methods?
> 
> Pushed after a bit more fiddling with the wording.

Looks good, thanks.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Read Uncommitted

2019-12-19 Thread Mark Dilger




On 12/19/19 7:08 AM, Mark Dilger wrote:

and instead get NULLs for all such rows


To clarify, I mean the toasted column is null for rows
where the value was stored in the toast table rather
than stored inline.  I'd prefer some special value
that means "this datum unavailable" so that it could
be distinguished from an actual null, but no such value
comes to mind.

--
Mark Dilger




Re: Building infrastructure for B-Tree deduplication that recognizes when opclass equality is also equivalence

2019-12-19 Thread Antonin Houska
Anastasia Lubennikova  wrote:

> I attached new version with pg_opclass documentation update.
> 
> One more thing I am uncertain about is array_ops. Arrays may contain bitwise
> and non-bitwise element types.
> What is the correct value of opcisbitwise for array_ops itself?

How about setting opcisbitwise to false for the array_ops opclass and checking
opcisbitwise of the element type whenever we need to know whether the array is
"bitwise equal"? When checking array_eq(), I thought whether the existence of
"expanded array" format is a problem but it does not seem to be: the
conversion of "expanded" value to "flat" value and then back to the "expanded"
should not change the array contents.

Anyway, in the current version of the patch I see that array_ops opclasses
have opcisbitwise=true. It should be false even if you don't use the approach
of checking the element type.

Besides that, I think that record_ops is similar to array_ops and therefore it
should not set opcisbitwise to true.

I also remember that, when thinking about the problem in the context of the
aggregate push down patch, I considered some of the geometric types
problematic. For example, box_eq() uses this expression

#define FPeq(A,B)   (fabs((A) - (B)) <= EPSILON)

so equality does not imply bitwise equality here. Maybe you should only set
the flag for btree opclasses for now.
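A tiny standalone illustration of that point (not PostgreSQL code): two
doubles with different bit patterns still compare equal under the
EPSILON-based FPeq().

#include <math.h>
#include <stdio.h>
#include <string.h>

#define EPSILON 1.0E-06
#define FPeq(A,B) (fabs((A) - (B)) <= EPSILON)

int
main(void)
{
	double		a = 1.0;
	double		b = 1.0 + 1.0E-07;	/* within EPSILON of a, but a distinct value */

	/* prints "FPeq: 1, bitwise: 0" */
	printf("FPeq: %d, bitwise: %d\n",
		   FPeq(a, b),
		   memcmp(&a, &b, sizeof(double)) == 0);
	return 0;
}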

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Read Uncommitted

2019-12-19 Thread Mark Dilger




On 12/19/19 1:50 AM, Simon Riggs wrote:
It seems possible that catalog access would be the thing that makes this 
difficult. Cache invalidations wouldn't yet have been fired, so that 
would lead to rather weird errors, and as you say, potential issues from 
data type changes which would not be acceptable in a facility available 
to non-superusers.


We could limit that to xacts that don't do DDL, which is a very small % 
of xacts, but then those xacts are more likely to be ones you'd want to 
recover or investigate.


So I now withdraw this patch as submitted and won't be resubmitting.


Oh, I'm sorry to hear that.  I thought this feature sounded useful, and
we were working out what its limitations were.  What I gathered from
the discussion so far was:

  - It should be called something other than READ UNCOMMITTED
  - It should only be available to superusers, at least for the initial
implementation
  - Extra care might be required to lock catalogs to avoid unsafe
operations that could lead to backends crashing or security
vulnerabilities
  - Toast tables need to be handled with care

For the record, in case we revisit this idea in the future, which were
the obstacles that killed this patch?

Tom's point on that third item:

> But I am quite afraid that we'd introduce security holes by future
> reductions of required lock levels --- or else that this feature would be
> the sole reason why we couldn't reduce the lock level for some DDL
> operation.  I'm doubtful that its use-case is worth that."

Anybody running SET TRANSACTION ISOLATION LEVEL RECOVERY might
have to get ExclusiveLock on most of the catalog tables.  But that
would only be done if somebody starts a transaction using this
isolation level, which is not normal, so it shouldn't be a problem
under normal situations.  If the lock level reduction that Tom
mentions was implemented, there would be no problem, as long as the
lock level you reduce to still blocks against ExclusiveLock, which
surely it must.  If the transaction running in RECOVERY level isolation
can't get the locks, then it blocks and doesn't help the administrator
who is trying to use this feature, but that is no worse than the
present situation where the feature is entirely absent.  When no
catalog changes are in flight, the administrator gets the locks and
can continue inspecting the in-process changes of other transactions.

Robert's point on that fourth item:

> As soon as a transaction aborts, the TOAST rows can be vacuumed
> away, but the READ UNCOMMITTED transaction might've already seen the
> main tuple. This is not even a particularly tight race, necessarily,
> since for example the table might be scanned, feeding tuples into a
> tuplesort, and then the detoating might happen further up in the query
> tree after the sort has completed.

I don't know if this could be fixed without adding overhead to toast
processing for non-RECOVERY transactions, but perhaps it doesn't need
to be fixed at all.  Perhaps you just accept that in RECOVERY mode you
can't see toast data, and instead get NULLs for all such rows.  Now,
that could have security implications if somebody defines a policy
where NULL in a toast column means "allow" rather than "deny" for
some issue, but if this RECOVERY mode is limited to superusers, that
isn't such a big objection.

There may be a number of other gotchas still to be resolved, but
abandoning the patch at this stage strikes me as premature.

--
Mark Dilger




Re: client auth docs seem to have devolved

2019-12-19 Thread Tom Lane
I wrote:
> Concretely, I propose the attached.  Anybody want to editorialize on
> my short descriptions of the auth methods?

Pushed after a bit more fiddling with the wording.

regards, tom lane




Re: obsolete comment in ExecBRUpdateTriggers()

2019-12-19 Thread Robert Haas
On Wed, Dec 18, 2019 at 1:49 AM Amit Langote  wrote:
> It seems that d986d4e87f6 forgot to update a comment upon renaming a variable.
>
> Attached fixes it.

Committed and back-patched to v12.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: non-exclusive backup cleanup is mildly broken

2019-12-19 Thread Robert Haas
On Tue, Dec 17, 2019 at 3:48 PM Tom Lane  wrote:
> Oh, scratch that --- looking closer, I see that the only two use-cases in
> the patched code are via before_shmem_exit and PG_ENSURE_ERROR_CLEANUP,
> and both of those require a function with the signature of an on_exit
> callback.

Yeah, that's why I was surprised that you wanted shim functions.

> So there's no need for a separate wrapper because this isn't
> going to be called any other way.  I still recommend amending the
> comment to explain why it has this signature, though.

Done, and committed.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Block level parallel vacuum

2019-12-19 Thread Robert Haas
On Thu, Dec 19, 2019 at 12:41 AM Masahiko Sawada
 wrote:
> Attached the updated version patch. This version patch incorporates
> the above comments and the comments from Mahendra. I also fixed one
> bug around determining the indexes that are vacuumed in parallel based
> on their option and size. Please review it.

I'm not enthusiastic about the fact that 0003 calls the fast option
'disable_delay' in some places. I think it would be more clear to call
it 'fast' everywhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: segmentation fault when cassert enabled

2019-12-19 Thread Jehan-Guillaume de Rorthais
On Wed, 18 Dec 2019 08:46:01 +0530
Amit Kapila  wrote:

> On Tue, Dec 17, 2019 at 6:01 PM vignesh C  wrote:
> >
> > On Tue, Dec 17, 2019 at 10:09 AM Amit Kapila 
> > wrote:  
> > >
> > > Attached patch with updated commit message based on suggestions.  I am
> > > planning to commit this tomorrow unless there are more comments.
> > >  
> >
> > While testing the patch on back versions, I found that the patch does
> > not apply on PG 11 & PG 10 branch. Attached patch has the changes for
> > PG 11 & PG 10 branch. Only difference in the patch was that table_open
> > needed to be changed to heap_open. I have verified the patch on back
> > branches and found it to be working. For PG 12 & current the previous
> > patch that Amit need to be used, I'm not reattaching here.
> >  
> 
> Pushed.

Thanks!




Re: Read Uncommitted

2019-12-19 Thread Simon Riggs
On Thu, 19 Dec 2019 at 12:42, Bernd Helmle  wrote:

> Am Donnerstag, den 19.12.2019, 00:13 + schrieb Simon Riggs:
> > So the consensus is for a more-specifically named facility.
> >
> > I was aiming for something that would allow general SELECTs to run
> > with a
> > snapshot that can see uncommitted xacts, so making it a SRF wouldn't
> > really
> > allow that.
>
> pg_dirtyread() [1] has been around for a while, implementing a SRF for
> debugging access to data that has disappeared under normal circumstances.
> It's nice not to have to worry about anything when you are faced with such
> problems, but for such use cases I think a SRF serves quite well.
>
> [1] https://github.com/df7cb/pg_dirtyread


As an example of an SRF for debugging purposes, sure, but then we already
had the example of pageinspect, which I wrote BTW, so I wasn't unfamiliar
with the thought.

Note that pg_dirtyread has got nothing to do with the use cases I
described.

-- 
Simon Riggs  http://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise


Re: Read Uncommitted

2019-12-19 Thread Bernd Helmle
On Thursday, 19.12.2019 at 00:13 +, Simon Riggs wrote:
> So the consensus is for a more-specifically named facility.
> 
> I was aiming for something that would allow general SELECTs to run
> with a
> snapshot that can see uncommitted xacts, so making it a SRF wouldn't
> really
> allow that.

pg_dirtyread() [1] has been around for a while, implementing an SRF for
debugging access to data that has disappeared under normal circumstances.
It's nice not to have to worry about anything when faced with such
problems, but for such use cases I think an SRF serves quite well.

[1] https://github.com/df7cb/pg_dirtyread
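
For anyone not familiar with it, usage looks roughly like this (per the
extension's README; the table and column list here are made up):

CREATE EXTENSION pg_dirtyread;

-- Returns all row versions still physically present in the table,
-- including deleted but not-yet-vacuumed ones.
SELECT * FROM pg_dirtyread('some_table') AS t(id int, payload text);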


Bernd






Re: Read Uncommitted

2019-12-19 Thread Peter Eisentraut

On 2019-12-18 16:14, Simon Riggs wrote:
> On Wed, 18 Dec 2019 at 12:11, Konstantin Knizhnik
> <k.knizh...@postgrespro.ru> wrote:
>
>> As far as I understand with "read uncommitted" policy we can see two
>> versions of the same tuple if it was updated by two transactions
>> both of which were started before us and committed during table
>> traversal by transaction with "read uncommitted" policy. Certainly
>> "read uncommitted" means that we are ready to get inconsistent
>> results, but is it really acceptable to see multiple versions of the
>> same tuple?
>
>      "In general, read uncommitted will return inconsistent results and
>      wrong answers. If you look at the changes made by a transaction
>      while it continues to make changes then you may get partial results
>      from queries, or you may miss index entries that haven't yet been
>      written. However, if you are reading transactions that are paused
>      at the end of their execution for whatever reason then you can
>      see a consistent result."
>
> I think I already covered your concerns in my suggested docs for this
> feature.


Independent of the technical concerns, I don't think the SQL standard 
allows the READ UNCOMMITTED level to behave in a way that violates the 
logical requirements of the defined database schema.  So if we wanted to 
add this, we should probably name it something else.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: progress report for ANALYZE

2019-12-19 Thread Tatsuro Yamada

Hi All,


*All* phases are repeated in this case, not just "finalizing
analyze", because ANALYZE repeatedly runs for each partition after the
parent partitioned table's ANALYZE finishes.  ANALYZE's documentation
mentions that analyzing a partitioned table also analyzes all of its
partitions, so users should expect to see the progress information for
each partition.  So, I don't think we need to clarify that in only
one phase's description.  Maybe we can add a note after the phase
description table which mentions this implementation detail about
partitioned tables.  Like this:

   
    
 Note that when ANALYZE is run on a partitioned table,
 all of its partitions are also recursively analyzed as also mentioned on
 .  In that case, ANALYZE
 progress is reported first for the parent table, whereby its inheritance
 statistics are collected, followed by that for each partition.
    
   



Ah, you are right: all phases are repeated, so it shouldn't be fixed in
only one phase's description.



Some more comments on the documentation:

+   Number of computed extended stats.  This counter only advances
when the phase
+   is computing extended stats.

Number of computed extended stats -> Number of extended stats computed



Will fix.



+   Number of analyzed child tables.  This counter only advances
when the phase
+   is computing extended stats.

Regarding, "Number of analyzed child table", note that we don't
"analyze" child tables in this phase, only scan its blocks to collect
samples for parent's ANALYZE.  Also, the 2nd sentence is wrong -- you
meant "when the phase is acquiring inherited sample
rows.  I suggest to write this as follows:

Number of child tables scanned.  This counter only advances when the phase
is acquiring inherited sample rows.



Oops, I will fix it. :)




+ OID of the child table currently being scanned.
+   It might be different from relid when analyzing tables that
have child tables.

I suggest:

OID of the child table currently being scanned.  This field is only valid when
the phase is acquiring inherited sample rows.



Will fix.



+   The command is currently scanning the
current_relid
+   to obtain samples.

I suggest:

The command is currently scanning the table given by
current_relid to obtain sample rows.



Will fix.



+   The command is currently scanning the
current_child_table_relid
+   to obtain samples.

I suggest (based on the pg_stat_progress_create_index phase
descriptions):

The command is currently scanning child tables to obtain sample rows.  Columns
child_tables_total,
child_tables_done, and
current_child_table_relid contain the progress
information for this phase.



Will fix.



+    
+ computing stats

I think the phase name should really be "computing statistics", that
is, use the full word.



Will fix.



+ 
+   The command is computing stats from the samples obtained
during the table scan.
+ 
+    

So I suggest:

The command is computing statistics from the sample rows obtained during
the table scan.



Will fix.



+ computing extended stats
+ 
+   The command is computing extended stats from the samples
obtained in the previous phase.
+ 

I suggest:

The command is computing extended statistics from the sample rows obtained
during the table scan.



Will fix.



I fixed the document based on Amit's comments. :)
Please find attached file.
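
For reference, a rough example of how one could watch the view while ANALYZE
runs on a partitioned table (column names as in the attached patch; they may
of course still change):

SELECT p.pid,
       p.relid::regclass AS table_being_analyzed,
       p.phase,
       p.child_tables_total,
       p.child_tables_done,
       p.current_child_table_relid::regclass AS current_child_table
FROM pg_stat_progress_analyze p;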


Thanks,
Tatsuro Yamada







diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a3c5f86b7e..a5da126b8f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -360,6 +360,14 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
   
  
 
+ 
+  
pg_stat_progress_analyzepg_stat_progress_analyze
+  One row for each backend (including autovacuum worker processes) 
running
+   ANALYZE, showing current progress.
+   See .
+  
+ 
+
  
   
pg_stat_progress_clusterpg_stat_progress_cluster
   One row for each backend running
@@ -3502,7 +3510,7 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
PostgreSQL has the ability to report the 
progress of
certain commands during command execution.  Currently, the only commands
which support progress reporting are CREATE INDEX,
-   VACUUM and
+   VACUUM, ANALYZE and
CLUSTER. This may be expanded in the future.
   
 
@@ -3948,6 +3956,179 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
 
  
 
+ 
+  ANALYZE Progress Reporting
+
+  
+   Whenever ANALYZE is running, the
+   pg_stat_progress_analyze view will contain a
+   row for each backend that is currently running that command.  The tables
+   below describe the information that will be reported and provide
+   information about how to interpret it.
+  
+
+  
+   pg_stat_progress_analyze View
+   
+
+
+  Column
+  Type
+  Description
+ 
+
+
+   
+   

Re: [HACKERS] Block level parallel vacuum

2019-12-19 Thread Amit Kapila
On Thu, Dec 19, 2019 at 11:11 AM Masahiko Sawada
 wrote:
>
> On Wed, 18 Dec 2019 at 19:06, Amit Kapila  wrote:
> >
>
> - /* Cap by the worker we computed at the beginning of parallel lazy vacuum */
> - nworkers = Min(nworkers, lps->pcxt->nworkers);
> + /*
> + * The number of workers required for parallel vacuum phase must be less
> + * than the number of workers with which parallel context is initialized.
> + */
> + Assert(lps->pcxt->nworkers >= nworkers);
>
> Regarding the above change in your patch I think we need to cap the
> number of workers by lps->pcxt->nworkers because the number of workers
> computed based on lps->nindexes_parallel_XXX can be larger
> than the number determined when creating the parallel context, for
> example, when max_parallel_maintenance_workers is smaller than the
> number of indexes that can be vacuumed in parallel at bulkdelete
> phase.
>

Oh, right, but then you can probably add a comment here, as this is not so obvious.
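Something along these lines, perhaps (just a sketch of the comment, not final
wording):

/*
 * Cap by the number of workers the parallel context was initialized with:
 * the value computed from the per-phase index counts can exceed it, e.g.
 * when max_parallel_maintenance_workers is smaller than the number of
 * indexes that can be vacuumed in parallel in the current phase.
 */
nworkers = Min(nworkers, lps->pcxt->nworkers);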

> >
> > Few other comments which I have not fixed:
> >
> > 4.
> > + if (Irel[i]->rd_indam->amusemaintenanceworkmem)
> > + nindexes_mwm++;
> > +
> > + /* Skip indexes that don't participate parallel index vacuum */
> > + if (vacoptions == VACUUM_OPTION_NO_PARALLEL ||
> > + RelationGetNumberOfBlocks(Irel[i]) < min_parallel_index_scan_size)
> > + continue;
> >
> > Won't we need to worry about the number of indexes that uses
> > maintenance_work_mem only for indexes that can participate in a
> > parallel vacuum? If so, the above checks need to be reversed.
>
> You're right. Fixed.
>
> >
> > 5.
> > /*
> > + * Remember indexes that can participate parallel index vacuum and use
> > + * it for index statistics initialization on DSM because the index
> > + * size can get bigger during vacuum.
> > + */
> > + can_parallel_vacuum[i] = true;
> >
> > I am not able to understand the second part of the comment ("because
> > the index size can get bigger during vacuum.").  What is its
> > relevance?
>
> I meant that the indexes can get bigger even during vacuum. So we need
> to check the size of indexes and determine participation in parallel
> index vacuum in one place.
>

Okay, but that doesn't go with the earlier part of the comment.  We
can either remove it or explain it a bit more.
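
To spell out the reordering agreed on for comment 4 above (a sketch only, not
the actual patch text):

/* Skip indexes that don't participate in parallel index vacuum */
if (vacoptions == VACUUM_OPTION_NO_PARALLEL ||
    RelationGetNumberOfBlocks(Irel[i]) < min_parallel_index_scan_size)
    continue;

/* Count maintenance_work_mem users among participating indexes only */
if (Irel[i]->rd_indam->amusemaintenanceworkmem)
    nindexes_mwm++;

can_parallel_vacuum[i] = true;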

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: inherits clause for CREATE TYPE? -

2019-12-19 Thread Pavel Stehule
On Wed, 18 Dec 2019 at 21:12, Merlin Moncure wrote:

> On Wed, Dec 18, 2019 at 12:38 PM Pavel Stehule 
> wrote:
> >
> > Hi
> >
> > I had a talk with a guy about development in plpgsql. He uses table
> > functions. Often he uses return types based on some table type plus a few
> > attributes. Now he uses an ugly hack - he creates a view on the table plus
> > some columns - and then uses the view's row type as the table function's
> > result type. For similar use cases it could be interesting to have the
> > possibility to create types by extending other types. Probably almost all
> > of the functionality is already there - so it should not be hard work.
> >
> > My idea is to implement an INHERITS clause for the CREATE TYPE command.
> >
> > Something like
> >
> > CREATE TYPE fx_rt (xx int) INHERITS(pg_class);
> >
> > What do you think about this idea?
>
> How about using composition style approaches?
>
> create type base as (...)
>
> create type extended as  (b base, ...)
>
> create function foo() returns extended as ...
>

It is a possibility, but it is not practical, because the base type will be
nested and it is hard to access the nested fields.
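
A hypothetical illustration, reusing Merlin's example (the function body and
column names are made up):

CREATE TYPE base AS (id int, name text);
CREATE TYPE extended AS (b base, note text);

CREATE FUNCTION foo() RETURNS SETOF extended
    LANGUAGE sql AS $$ SELECT ROW(1, 'one')::base, 'extra'::text $$;

-- composition: the base fields end up nested and must be dereferenced
SELECT (x.b).name, x.note FROM foo() AS x;

-- with a flattened type (what the INHERITS idea would give) the fields
-- would stay at the top level:
--   SELECT x.name, x.note FROM foo() AS x;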

Currently I can do

CREATE TABLE base (...); -- instead CREATE TYPE
CREATE TABLE extended (...) -- INHERITS (base)

CREATE FUNCTION foo() RETURNS SETOF extended AS ..

This works perfectly - the only disadvantage is the garbage table "extended".



> merlin
>


Re: Read Uncommitted

2019-12-19 Thread Simon Riggs
On Thu, 19 Dec 2019 at 02:22, Andres Freund  wrote:

> Hi,
>
> On 2019-12-18 18:06:21 +, Simon Riggs wrote:
> > On Wed, 18 Dec 2019 at 17:35, Robert Haas  wrote:
> >
> > > On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs 
> > > wrote:
> > > > This was my first concern when I thought about it, but I realised
> that
> > > by taking a snapshot and then calculating xmin normally, this problem
> would
> > > go away.
> > >
> > > Why? As soon as a transaction aborts...
> > >
> >
> > So this is the same discussion as elsewhere about potentially aborted
> > transactions...
> > AFAIK, the worst that happens in that case is that the reading
> transaction
> > will end with an ERROR, similar to a serializable error.
>
> I don't think that's all that can happen. E.g. the toast identifier
> might have been reused, and there might be a different datum in
> there. Which then means we'll end up calling operators on data that's
> potentially for a different datatype - it's trivial to cause crashes
> that way. And, albeit harder, possible to do more than that.
>

On the patch as proposed this wouldn't be possible because a toast row
can't be vacuumed and then reused while holding back xmin, at least as I
understand it.


> I think there's plenty other problems too, not just toast. There's
> e.g. some parts of the system that access catalogs using a normal
> snapshot - which might not actually be consistent, because we have
> various places where we have to increment the command counter multiple
> times as part of a larger catalog manipulation.
>

It seems possible that catalog access would be the thing that makes this
difficult. Cache invalidations wouldn't yet have been fired, so that would
lead to rather weird errors, and as you say, potential issues from data
type changes which would not be acceptable in a facility available to
non-superusers.

We could limit that to xacts that don't do DDL, which is a very small % of
xacts, but then those xacts are more likely to be ones you'd want to
recover or investigate.

So I now withdraw this patch as submitted and won't be resubmitting.

Thanks everyone for your input.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Solutions for the Enterprise


Re: Proposal: Global Index

2019-12-19 Thread Jose Luis Tallon

On 19/12/19 4:03, Bruce Momjian wrote:

On Mon, Nov 25, 2019 at 03:44:39PM -0800, Jeremy Schneider wrote:

On 11/25/19 15:05, Jeremy Schneider wrote:

... the cost of doing the individual index lookups across 180
partitions (and 180 indexes) was very high, so they stored max and min
txn id per partition and would generate a query with all the dates that
a txn id could have been in so that only a small number of partition
indexes would be accessed.

.. If we are looking for higher concurrency, we can usually
add a hack/workaround that filters on a partition key to provide “pretty
good” pruning.  The net result is that you get 2-3x the IO due to the
lack of global index (same workaround as first story above).

Is that basically like a global BRIN index with granularity at the
partition level?

Exactly!  :-)


Actually, one "kind of" BRIN index *per partitioned table* mapping (key 
range) -> (partition oid)... and so concurrency doesn't need to be very 
affected.
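
In other words, roughly what the quoted workaround builds by hand today
(table and column names below are hypothetical):

-- one row per partition, summarizing the key range it contains
CREATE TABLE txn_partition_ranges (
    partition_name text PRIMARY KEY,
    min_txn_id     bigint NOT NULL,
    max_txn_id     bigint NOT NULL
);

-- the application then targets only the partitions whose range could
-- contain the id it is looking for
SELECT partition_name
FROM txn_partition_ranges
WHERE 42 BETWEEN min_txn_id AND max_txn_id;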


(we don't need to do things just like other RDBMS do, ya know... ;)


IIRC, this precise approach was suggested around 2016, during the initial
discussion of the "declarative partitioning" work that led to Postgres'
current partitioning scheme, as a way to optimize partition pruning.



Just my .02€

    / J.L.