Re: New committers: Melanie Plageman, Richard Guo

2024-04-26 Thread Jesper Pedersen

On 4/26/24 07:54, Jonathan S. Katz wrote:
The Core Team would like to extend our congratulations to Melanie 
Plageman and Richard Guo, who have accepted invitations to become our 
newest PostgreSQL committers.


Please join us in wishing them much success and few reverts!



Congrats !

Best regards,
 Jesper





Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-08 Thread Jesper Pedersen

Hi,

On 4/8/24 14:15, Tomas Vondra wrote:

I think we need to
fix & improve that - not to rework/push it at the very end.



This is going to be very extreme...

Either a patch is ready for merge or it isn't - when two or more 
Committers agree on it, then it can be merged - the policy would have to 
be discussed, of course.


However, development happens all throughout the year, so having to wait 
for potential feedback during CommitFest windows can stop development 
during the other months - I'm talking about non-Committers here...


Having patches on -hackers@ is best, but maybe there is a hybrid model 
where they exist in pull requests, are tested through CfBot, and are 
merged when ready - no matter what month it is.


Pull requests could still have labels that tie them to a "CommitFest" 
to highlight them, but it would make it easier to have a much clearer 
cut-off date for the feature freeze.


And, pull request labels are easily changed.

March is seeing a lot of last-minute changes...

Just something to think about.

Best regards,
 Jesper





Re: 2023-11-09 release announcement draft

2023-11-06 Thread Jesper Pedersen

Hi,

On 11/6/23 17:04, Jonathan S. Katz wrote:
Attached is the release announcement draft for the 2023-11-09 release 
(16.1 et al.).


Please review for accuracy and notable omissions. Please have all 
feedback in by 2023-11-09 08:00 UTC at the latest (albeit the sooner 
the better).



s/PostgreSQL 10/PostgreSQL 11/g


Best regards,

 Jesper






Re: Commitfest manager for September

2023-08-28 Thread Jesper Pedersen

Hi Melanie,

On 8/28/23 11:42, Melanie Plageman wrote:

I too had planned to volunteer to help. I volunteer to do a
triage/summary of patch statuses, as has been done occasionally in the
past [1].
Have folks found this helpful in the past?



Having a summary to begin with is very helpful for reviewers.


Best regards,

 Jesper






Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-24 Thread Jesper Pedersen

On 4/20/23 13:40, Tom Lane wrote:

The Core Team would like to extend our congratulations to
Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
accepted invitations to become our newest Postgres committers.

Please join me in wishing them much success and few reverts.



Huge congrats !


Best regards,

 Jesper






Re: GSOC2023

2023-01-04 Thread Jesper Pedersen

Hi,

On 12/28/22 21:10, diaa wrote:

*Hi Sir.*
I’m Computer engineering student from Egypt Interested in Database Management
Systems.
Can I know details about the list of ideas for 2023 projects or how to prepare
myself to be ready with the required knowledge?



Thanks for your interest in GSoC 2023 !


The GSoC timeline is described here [1], and the PostgreSQL projects - 
if approved - will be listed here [2].



So, check back in February and see which projects have been proposed.


[1] https://developers.google.com/open-source/gsoc/timeline

[2] https://wiki.postgresql.org/wiki/GSoC_2023


Best regards,

 Jesper






Re: GSoC: pgagroal: SCRAM-SHA-256-PLUS support (2022)

2022-04-18 Thread Jesper Pedersen

Hi,

On 4/18/22 04:16, wanghaitao0...@zju.edu.cn wrote:

I'm Haitao Wang, interested in participating in GSOC 2022 PostgreSQL projects.
Attached is my proposal. Please check!


Thanks for your proposal to Google Summer of Code 2022 !


We'll follow up off-list to get this finalized.


Best regards,

 Jesper






Re: GsoC: pgexporter: Custom file

2022-04-17 Thread Jesper Pedersen

Hi,

On 4/17/22 04:43, dl x wrote:

My name is Donglin Xie, a MSc. student at Zhejiang University, in China. I
am interested in the project *pgexporter: Custom_file.*

The proposal is attached  to this email. Looking forward to the suggestions!




Thanks for your proposal to Google Summer of Code 2022 !

We'll follow up off-list to get this finalized.

Best regards,
 Jesper






Re: GSoC: pgmoneta: Write-Ahead Log (WAL) infrastructure (2022)

2022-04-16 Thread Jesper Pedersen

Hi,

On 4/16/22 13:06, dl x wrote:

My name is Donglin Xie, a MSc. student at Zhejiang University, in China. I
am interested in the project *pgmoneta: Write-Ahead Log (WAL)
infrastructure.*

The proposal is attached  to this email. Looking forward to the suggestions!



Thanks for your proposal to Google Summer of Code 2022 !

We'll follow up off-list to get this finalized.

Best regards,
 Jesper






Re: GSoC 2022: Proposal of pgmoneta on-disk encryption

2022-04-13 Thread Jesper Pedersen

Hi Jichen,

On 4/13/22 10:40, Solo Kyo wrote:

To whom it may concern:

Hi, I am Jichen Xu. I am a first year master computer science student at
Waseda University.
Here is a link to my proposal:
https://docs.google.com/document/d/1vdgPY5wvhjhrX9aSUw5mTDOnXwEQNcr4cxfzuSHMWlg
Looking forward to working with you in next few months.



Thanks for your proposal to Google Summer of Code 2022 !


We'll follow up off-list to get this finalized.


Best regards,

 Jesper





Re: GSoC: pgmoneta, storage API

2022-04-04 Thread Jesper Pedersen

Hi Mariam,

On 4/4/22 09:16, Mariam Fahmy wrote:

Hope you’re having a good day!
I am Mariam Fahmy, A senior computer and systems engineering student at
faculty of engineering, AinShams university.

I am interested in working with pgmoneta during GSoC 2022.

Here is a link to the draft proposal for implementing storage engine in
pgmoneta:
https://docs.google.com/document/d/1EbRzgfZCDWG6LCD0puil8bUt10aehT4skfZ1YFlbNKk/edit?usp=drivesdk

I would be grateful if you can have a look at it and give me your feedback.



Thanks for your proposal to Google Summer of Code 2022 !


We'll follow up off-list to get this finalized.


Best regards,
 Jesper






Re: GSoC: pgmoneta: Write-Ahead Log (WAL) infrastructure (2022)

2022-04-01 Thread Jesper Pedersen

Hi Jinlang,

On 4/1/22 12:39, Jinlang Wang wrote:

To whom it may concern:

Here is my proposal 
(https://docs.google.com/document/d/1KKKDU6iP0GOkAMSdGRyxFJRgW964JFVROnpKkbzWyNw/edit?usp=sharing)
for GSoC. Please check!



Thanks for your proposal to Google Summer of Code 2022 !


We'll follow up off-list to get this finalized.


Best regards,

 Jesper






Re: Index Skip Scan (new UniqueKeys)

2022-03-24 Thread Jesper Pedersen

On 6/9/20 06:22, Dmitry Dolgov wrote:

Here is a new version of index skip scan patch, based on v8 patch for
UniqueKeys implementation from [1]. I want to start a new thread to
simplify navigation, hopefully I didn't forget anyone who actively
participated in the discussion.



This CommitFest entry has been closed with RwF at [1].

Thanks for all the feedback given !

[1] 
https://www.postgresql.org/message-id/ab8636e7-182f-886a-3a39-f3fc279ca45d%40redhat.com


Best regards,
 Dmitry & Jesper





Re: MDAM techniques and Index Skip Scan patch

2022-03-24 Thread Jesper Pedersen

On 3/23/22 18:22, Dmitry Dolgov wrote:


The CF item could be set to RwF, what would you say, Jesper?



We want to thank the community for the feedback that we have received 
over the years for this feature. Hopefully a future implementation can 
use Tom's suggestions to get closer to a committable solution.


Here is the last CommitFest entry [1] for the archives.

RwF

[1] https://commitfest.postgresql.org/37/1741/

Best regards,
 Dmitry & Jesper





Re: Portability report: ninja

2021-11-01 Thread Jesper Pedersen

Hi,

On 11/1/21 15:25, Tom Lane wrote:

So it's pretty clear that if we go this way, it'll be the end of
the line for support for some very old OS versions.  I can't,
however, argue with the idea that it's reasonable to require
POSIX 2001 support now.  Based on these results, I doubt that
ninja will give us trouble on any platform that isn't old
enough to get its drivers license.



You can also look at it as:


If PostgreSQL chooses a newer build system, then it is up to the companies 
owning the "non-supported" operating systems to add support for the 
build system in question; not the PostgreSQL community.



+1 for POSIX 2001 and meson/ninja.


Best regards,

 Jesper






Re: MDAM techniques and Index Skip Scan patch

2021-10-25 Thread Jesper Pedersen

Hi Peter,

On 10/21/21 22:16, Peter Geoghegan wrote:

I returned to the 1995 paper "Efficient Search of Multidimensional
B-Trees" [1] as part of the process of reviewing v39 of the skip scan
patch, which was posted back in May. It's a great paper, and anybody
involved in the skip scan effort should read it thoroughly (if they
haven't already). It's easy to see why people get excited about skip
scan [2]. But there is a bigger picture here.



Thanks for starting this thread !

The Index Skip Scan patch could affect a lot of areas, so keeping MDAM 
in mind is definitely important.


However, I think the key part to progress on the "core" functionality 
(B-tree related changes) is to get the planner functionality in place 
first. Hopefully we can make progress on that during the November 
CommitFest based on Andy's patch.


Best regards,
 Jesper





Re: Triage on old commitfest entries

2021-10-04 Thread Jesper Pedersen

On 10/3/21 16:18, Peter Geoghegan wrote:

Index Skip Scan 16
 Last substantive discussion 2021-05, currently passing cfbot

 Seems possibly useful, but we're not making progress.

This feature is definitely useful. My pet theory is that it hasn't
made more progress because it requires expertise in two fairly
distinct areas of the system. There is a lot of B-Tree stuff here,
which is clearly my thing. But I know that I personally am much less
likely to work on a patch that requires significant changes to the
planner. Maybe this is a coordination problem.



I still believe that this is an important user-visible improvement.


However, there has been conflicting feedback on the necessary planner 
changes, leading to duplicated work in figuring out the best way 
forward.



Dmitry and Andy are doing a good job of keeping the patches current, but 
maybe there needs to be a firm decision from a committer on what the 
planner changes should look like before these patches can move forward.



So, is RfC the best state for that ?


Best regards,

 Jesper












Increase log level in xlogreader.c ?

2021-09-07 Thread Jesper Pedersen

Hi,

I have a 13.4-based setup (physical streaming replication) where the 
replica produces the attached log upon startup, and when the first message 
is sent from the primary.


There is the FATAL from when the WAL receiver shuts down, but I think it 
would be beneficial to have report_invalid_record() log at ERROR level 
instead, to highlight to the admin that there is a serious problem.


Feel free to contact me off-list for the setup (20Mb).

Thoughts ?

Best regards,
 Jesper
[1412350] [2021-09-07 08:51:27.172 EDT] [] [0] LOG:  starting PostgreSQL 13.4 
on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.2.1 20210728 (Red Hat 
11.2.1-1), 64-bit
[1412350] [2021-09-07 08:51:27.173 EDT] [] [0] LOG:  listening on IPv4 address 
"0.0.0.0", port 5433
[1412350] [2021-09-07 08:51:27.173 EDT] [] [0] LOG:  listening on IPv6 address 
"::", port 5433
[1412350] [2021-09-07 08:51:27.176 EDT] [] [0] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5433"
[1412352] [2021-09-07 08:51:27.186 EDT] [] [0] LOG:  database system was shut 
down at 2020-10-01 07:48:52 EDT
[1412352] [2021-09-07 08:51:27.186 EDT] [] [0] LOG:  entering standby mode
[1412352] [2021-09-07 08:51:27.191 EDT] [] [0] LOG:  consistent recovery state 
reached at 0/5000118
[1412352] [2021-09-07 08:51:27.191 EDT] [] [0] LOG:  invalid record length at 
0/5000118: wanted 24, got 0
[1412350] [2021-09-07 08:51:27.192 EDT] [] [0] LOG:  database system is ready 
to accept read only connections
[1412356] [2021-09-07 08:51:27.224 EDT] [] [0] LOG:  started streaming WAL from 
primary at 0/500 on timeline 1
[1412352] [2021-09-07 08:52:54.091 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412356] [2021-09-07 08:52:54.091 EDT] [] [0] FATAL:  terminating walreceiver 
process due to administrator command
[1412352] [2021-09-07 08:52:54.092 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:52:54.092 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:52:59.096 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:53:04.101 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:53:09.105 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:53:14.109 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412352] [2021-09-07 08:53:19.113 EDT] [] [0] LOG:  record with incorrect 
prev-link 8DE46ADF/30 at 0/5000118
[1412350] [2021-09-07 08:53:22.187 EDT] [] [0] LOG:  received fast shutdown 
request
[1412350] [2021-09-07 08:53:22.194 EDT] [] [0] LOG:  aborting any active 
transactions
[1412353] [2021-09-07 08:53:22.195 EDT] [] [0] LOG:  shutting down
[1412350] [2021-09-07 08:53:22.209 EDT] [] [0] LOG:  database system is shut 
down


Re: Middleware Messages for FE/BE

2021-08-20 Thread Jesper Pedersen

Hi Simon,

On 8/20/21 10:39 AM, Simon Riggs wrote:

Yeah, but it is a change to the protocol, which means that the client
drivers and the middleware ecosystem need to be updated to support that
message.


No, because FE messages don't need to be handled by the client, they
just send them.


Yes, but the middleware needs to parse them in order to send a response.


It is the server that needs to be updated to
understand that these messages might be received and to ignore them,
which is simple enough.

If a client doesn't know about a message it COULD send, but doesn't,
then there is no update required.



Agreed.


So, if the message was added to the protocol we could add another
message with the response to the request and make the protocol stricter, say

FE/M
Int32 - Length
Int16 - Request type (0 == Query capability, 1 == Execute capability)
Int32 - Capability type (0 == Failover, 1 == ..., ...)


This much detail is optional. It is up to the middleware to define if
it supports capability requests, or how it handles requests that it
cannot satisfy.

I'm trying to come up with something generic that people can use for
decades to come, not restrict their choices to a very small subset
based upon our current vision.


BE/?
Int32 - Length
Int32 - Capability type
Byte  - Support (0 == No, 1 == Yes)
Byten - Additional information


If we add a new message from BE, then yes, we need to modify all
drivers, which would be an immediate fail for this proposal.

The message replies you foresee are optional; they are not required by
all middleware.

I've already suggested you use NoticeResponse, which is already
defined to ignore unknown field types, so is suitable and extensible.
We could add a new field type of 'm' to represent a message sent from
middleware to the client.



When doing the PoC, just keep in mind that middleware acting on an 'M' 
message on a user-authenticated connection could lead to a DoS attack.


Best regards,
 Jesper





Re: Middleware Messages for FE/BE

2021-08-20 Thread Jesper Pedersen

Hi Hannu,

On 8/19/21 3:52 PM, Hannu Krosing wrote:

Jesper, please don't confuse my ramblings with Simon's initial proposal.

As I understand, the original proposal was just about adding a new
wire protocol message type, which then could be used for emitting
custom messages by middleware support extension - likely loaded as a
preloaded shared library - and consumed by some middleware, which
could either be some proxy or connection pooler or something compiled
as part of the libpq, JDBC or other client driver. So it was just
about providing extensibility to the protocol (the same way almost
everything else in PostgreSQL is extensible).



Yeah, but it is a change to the protocol, which means that the client 
drivers and the middleware ecosystem need to be updated to support that 
message.


So, if the message was added to the protocol we could add another 
message with the response to the request and make the protocol stricter, say


FE/M
Int32 - Length
Int16 - Request type (0 == Query capability, 1 == Execute capability)
Int32 - Capability type (0 == Failover, 1 == ..., ...)

BE/?
Int32 - Length
Int32 - Capability type
Byte  - Support (0 == No, 1 == Yes)
Byten - Additional information
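To make the layout above concrete, here is a minimal client-side encoder sketch for the proposed FE/M message; the message type byte, field order, and capability codes are all hypothetical - nothing here is an agreed format:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder for the FE/M message sketched above:
 * Byte1('M'), Int32 length (including itself), Int16 request type,
 * Int32 capability type - integers in network byte order, as in the
 * rest of the FE/BE protocol. */

static void
put_u32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t) (v >> 24);
	p[1] = (uint8_t) (v >> 16);
	p[2] = (uint8_t) (v >> 8);
	p[3] = (uint8_t) v;
}

static void
put_u16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t) (v >> 8);
	p[1] = (uint8_t) v;
}

/* Writes the fixed 11-byte message and returns its size. */
static size_t
encode_fe_m(uint8_t *buf, uint16_t request_type, uint32_t capability_type)
{
	buf[0] = 'M';						/* message type byte */
	put_u32(buf + 1, 4 + 2 + 4);		/* length, including the length field */
	put_u16(buf + 5, request_type);		/* 0 == query, 1 == execute */
	put_u32(buf + 7, capability_type);	/* 0 == failover, 1 == ... */
	return 11;
}
```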

None of the client driver interfaces (specification APIs, like JDBC) have 
functionality for this currently, so the client would need to cast the 
interface in all cases in order to get access to the trigger method, say


 org.postgresql.jdbc.PgConnection.failover()

There could be custom capability type codes that target specific 
middleware, like Simon suggested.



But at least initially each middleware would process only it's own
class, so a single if() and not a big switch/case :)



With a defined baseline of capabilities the client drivers and 
middleware wouldn't have to invent their own codes.



The things I talked about were some forward-looking proposals on how
to use this capability and that some of the messages may at some point
become used / usable by more than a single middleware component at
which point they should be standardised .



It would be good to see a PoC of one capability implemented from the 
client to the middleware that deals with coordination at the client level.


Best regards,
 Jesper





Re: Middleware Messages for FE/BE

2021-08-19 Thread Jesper Pedersen

Hi,

On 8/19/21 1:13 PM, Simon Riggs wrote:

We need to be able to send the middleware a message. Replies can be
sent as NoticeResponse, if needed, which already exists.



Yeah, but you need the client drivers - in all languages - to understand 
that notice. And, if different middleware uses different codes, then the 
problem is just pushed to the client;


Java: M|12|0 (Do you support failover ?) MW1: N|6|-|0 (meaning: yes)
Rust: M|12|0 (Do you support load balancing ?) MW2: N|6|-|0 (meaning: no)

Should the middleware guess the client driver based on the startup 
message ? Or are you thinking about having a class identifier group for 
each client driver and then a massive "switch/case" inside each middleware ?



Once you have established what is supported and what isn't, you need the
client drivers (libpq is the very easy part) - so let's include Java, C#,
Rust, Go, ... - to understand their environment and trigger the
correct message in the configured scenario. F.ex. how are multiple
application clusters connecting to the same middleware instance
(connection pool) going to coordinate their 'M' messages ?


Each component can speak to the middleware to discover that, if it
wishes. The middleware can coordinate.



So, application cluster 1 (written in Java) detects that a failover is 
needed, and application clusters 2..N (written in Rust, C#, and Go) 
should detect that mid-transaction - or roll back and send their 'M' for 
the same thing ?



I want to empower all future middleware, not just support one. Perhaps
we can take code or ideas from pgagroal and put it in core? Please
explain the binary protocol some more.



There is nothing fancy there (see management.c) - so the official 
protocol should be the focus.



Could you expand on typical scenarios that you want to see implemented ?


I see this as a generalized interface that can be used for a great many things.
* Failover managers
* Load Balancers
* Routing agents
* Things I haven't thought of yet, but others may have
etc..



Moving control to the client will make some of this harder. Maybe 
PeterE, Andrey and Tatsuo-san have additional feedback based on their 
experience.



We are currently limited by the messages we can send efficiently.



There are a lot of good ideas for the PostgreSQL protocol v4 already.

Best regards,
 Jesper





Re: Middleware Messages for FE/BE

2021-08-19 Thread Jesper Pedersen

Hi Simon,

On 8/19/21 4:33 AM, Simon Riggs wrote:

The current FE/BE protocol assumes that the client is talking to the
server directly. If middleware wants to do something like load
balancing, the only current option is to inspect the incoming commands
and take action from that. That approach performs poorly and has
proven difficult to maintain, limiting the functionality in Postgres
ecosystem middleware.

It would be useful to have a way to speak to middleware within a
session or prior to each command. There are ways to frig this and
obviously it can always be done out-of-core, but there is a clear
requirement for various client tools to agree a standard way for them
to send messages to middleware rather than the database server. If we
get PostgreSQL Project approval for this, then authors of client and
middleware tools will know how to interoperate and can begin adding
features to take advantage of this, allowing the Postgres ecosystem to
improve and extend its middleware.

Byte1('M')
Identifies the message as a middleware message. (Or perhaps use 'U'
for User Message?).

Int32
Length of message contents in bytes, including self.

Int64
Routing/class identifier, allowing middleware to quickly ignore whole
classes of message if not appropriate. We would run some kind of
allocation scheme to ensure unique meaning, probably via the Postgres
Wiki. The first 2 billion values would be reserved for allocation by
the PostgreSQL Project itself, values beyond that open for external
allocation.

Byten
The message itself, where n is the remaining bytes in the message.

The message is intended for middleware only. The server always ignores
these messages, with an optional debug facility that can be enabled to
allow printing NOTICEs to allow testing.

I will supply a patch to any agreed message format, together with a
new libpq call to utilise this.

In summary: the patch is easy, the point is we need agreement to allow
and encourage interoperation between clients and middleware.

Thoughts?



I would say that this is a PostgreSQL protocol v4 thing, as there is a 
bit more to it.



There is a need to trigger middleware functionality, but you need to 
query the middleware stack first about what it supports - failover, load 
balancing, ... And, what type of language is that ? An SQL query ? Not all 
middleware supports general SQL queries.



Once you have established what is supported and what isn't, you need the 
client drivers (libpq is the very easy part) - so let's include Java, C#, 
Rust, Go, ... - to understand their environment and trigger the 
correct message in the configured scenario. F.ex. how are multiple 
application clusters connecting to the same middleware instance 
(connection pool) going to coordinate their 'M' messages ?
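For what it's worth, a rough sketch of how middleware might skim the 'M' message proposed above and ignore whole classes it does not own; the byte layout follows the quoted proposal, while the class identifier value and function names are made up for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical middleware-side skim of the proposed 'M' message:
 * Byte1('M'), Int32 length (network order, including itself),
 * Int64 routing/class identifier, Byten payload.  The class id value
 * below is made up; an allocation scheme would assign real ones. */

#define MY_CLASS_ID		UINT64_C(42)	/* hypothetical allocation */

static uint32_t
get_u32(const uint8_t *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		   ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

static uint64_t
get_u64(const uint8_t *p)
{
	return ((uint64_t) get_u32(p) << 32) | get_u32(p + 4);
}

/*
 * Returns true if the message carries our class id; on success,
 * *payload_len is set to the number of payload bytes following it.
 * Messages for other classes are skipped wholesale, as intended
 * by the routing/class identifier.
 */
static bool
skim_middleware_msg(const uint8_t *buf, size_t *payload_len)
{
	if (buf[0] != 'M')
		return false;
	if (get_u64(buf + 5) != MY_CLASS_ID)
		return false;
	/* length includes itself (4 bytes) and the class id (8 bytes) */
	*payload_len = get_u32(buf + 1) - 4 - 8;
	return true;
}
```

A single comparison per message keeps the fast path cheap, which seems to be the point of the Int64 class identifier.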



If you are looking to configure the middleware instance, then we can just 
use the existing protocol with FE/Q; that said, pgagroal has its own 
binary protocol for its admin payload.



Could you expand on typical scenarios that you want to see implemented ?


Best regards,

 Jesper






Re: New committers: Daniel Gustafsson and John Naylor

2021-07-01 Thread Jesper Pedersen

On 6/30/21 4:43 PM, Tom Lane wrote:

The Core Team would like to extend our congratulations to
Daniel Gustafsson and John Naylor, who have accepted invitations
to become our newest Postgres committers.

Please join me in wishing them much success and few bugs.

regards, tom lane



Congrats to you both !


Best regards,

 Jesper






Re: [GSoC] Metrics and Monitoring for pgagroal

2021-04-09 Thread Jesper Pedersen

Hi Junduo,

On 4/9/21 11:53 AM, Junduo Dong wrote:

I'm Junduo Dong, a 22-year-old studying at China University of Geosciences.

I would like to participate in the project "pgagroal: Metrics and monitoring"
on page "https://wiki.postgresql.org/wiki/GSoC_2021" for GSoC 2021.

Attachment is the proposal, please review.

I hope that my proposal will be successfully accepted and that I will join
the PostgreSQL hacker community in the future.



Thanks for your interest in Google Summer of Code, and the pgagroal 
proposal within the PostgreSQL umbrella.



I'll contact you off-list, and we can get the process going to finalize 
your submission to the GSoC program before the April 13 deadline.



Thanks !


Best regards,

 Jesper






Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-12-07 Thread Jesper Pedersen

Hi,

On 12/5/20 10:38 PM, Andy Fan wrote:

Currently the UniqueKey is defined as a List of Expr, rather than
EquivalenceClasses.
A complete discussion until now can be found at [1] (The messages I replied
to also
care a lot and the information is completed). This patch has stopped at
this place for
a while,  I'm planning to try EquivalenceClasses,  but any suggestion would
be welcome.




Unfortunately, I think we need an RfC-style patch of both versions in 
their minimal implementations.


Hopefully this will make it easier for one or more committers to decide 
on the right direction since they can do a side-by-side comparison of 
the two solutions.


Just my $0.02.

Thanks for working on this Andy !

Best regards,
 Jesper





Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-11-30 Thread Jesper Pedersen

Hi,

On 11/30/20 5:04 AM, Heikki Linnakangas wrote:

On 26/11/2020 16:58, Andy Fan wrote:

This patch has stopped moving for a while,  any suggestion about
how to move on is appreciated.


The question on whether UniqueKey.exprs should be a list of 
EquivalenceClasses or PathKeys is unresolved. I don't have an opinion 
on that, but I'd suggest that you pick one or the other and just go 
with it. If it turns out to be a bad choice, then we'll change it.


In this case I think it is a matter of deciding whether we are going to 
use EquivalenceClasses or Exprs before going further; there has been work 
ongoing in this area for a while, so having a clear direction from a 
committer would be greatly appreciated.


Deciding would also help potential reviewers to give more feedback on 
the features implemented on top of the base.


Should there be a new thread with the minimum requirements in order to 
get closer ?


Best regards,
 Jesper





Re: Index Skip Scan

2020-01-21 Thread Jesper Pedersen

Hi Peter,

Thanks for your feedback; Dmitry has followed-up with some additional 
questions.


On 1/20/20 8:05 PM, Peter Geoghegan wrote:

This is definitely not okay.


I suggest that you find a way to add assertions to code like
_bt_readpage() that verify that we do in fact have the buffer content
lock. 


If you apply the attached patch to master, it will fail the test suite; 
did you mean something else ?


Best regards,
 Jesper
diff --git a/src/backend/access/nbtree/nbtsearch.c 
b/src/backend/access/nbtree/nbtsearch.c
index 4189730f3a..57882f0b8d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1721,6 +1721,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, 
OffsetNumber offnum)
 * used here; this function is what makes it good for currPos.
 */
Assert(BufferIsValid(so->currPos.buf));
+   Assert(IsBufferPinnedAndLocked(so->currPos.buf));
 
page = BufferGetPage(so->currPos.buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
diff --git a/src/backend/storage/buffer/bufmgr.c 
b/src/backend/storage/buffer/bufmgr.c
index aba3960481..f29f40f9b6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3774,6 +3774,23 @@ HoldingBufferPinThatDelaysRecovery(void)
return false;
 }
 
+/*
+ * Assert that the buffer is pinned and locked
+ */
+bool
+IsBufferPinnedAndLocked(Buffer buffer)
+{
+   BufferDesc *bufHdr;
+
+   Assert(BufferIsPinned(buffer));
+
+   bufHdr = GetBufferDescriptor(buffer - 1);
+
+   Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
+
+   return true;
+}
+
 /*
  * ConditionalLockBufferForCleanup - as above, but don't wait to get the lock
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..46a6aa6560 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -217,6 +217,7 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern bool IsBufferPinnedAndLocked(Buffer buffer);
 
 extern void AbortBufferIO(void);
 


Re: Index Skip Scan

2020-01-20 Thread Jesper Pedersen

Hi Floris,

On 1/15/20 8:33 AM, Floris Van Nee wrote:

I reviewed the latest version of the patch. Overall some good improvements I 
think. Please find my feedback below.



Thanks for your review !


- I think I mentioned this before - it's not that big of a deal, but it just 
looks weird and inconsistent to me:
create table t2 as (select a, b, c, 10 d from generate_series(1,5) a, 
generate_series(1,100) b, generate_series(1,1) c); create index on t2 
(a,b,c desc);

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b>=5 and 
b<=5 order by a,b,c desc;
QUERY PLAN
-
  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..216.25 rows=500 
width=12)
Skip scan: true
Index Cond: ((a = 2) AND (b >= 5) AND (b <= 5))
(3 rows)

postgres=# explain select distinct on (a,b) a,b,c from t2 where a=2 and b=5 
order by a,b,c desc;
QUERY PLAN
-
  Unique  (cost=0.43..8361.56 rows=500 width=12)
->  Index Only Scan using t2_a_b_c_idx on t2  (cost=0.43..8361.56 rows=9807 
width=12)
  Index Cond: ((a = 2) AND (b = 5))
(3 rows)

When doing a distinct on (params) and having equality conditions for all 
params, it falls back to the regular index scan even though there's no reason 
not to use the skip scan here. It's much faster to write b between 5 and 5 now 
rather than writing b=5. I understand this was a limitation of the unique-keys 
patch at the moment which could be addressed in the future. I think for the 
sake of consistency it would make sense if this eventually gets addressed.



Agreed, that is an improvement that should be made. I would like 
David's view on this, since it relates to the UniqueKey patch.



- nodeIndexScan.c, line 126
This sets xs_want_itup to true in all cases (even for non skip-scans). I don't 
think this is acceptable given the side-effects this has (page will never be 
unpinned in between returned tuples in _bt_drop_lock_and_maybe_pin)



Correct - fixed.


- nbsearch.c, _bt_skip, line 1440
_bt_update_skip_scankeys(scan, indexRel); This function is called twice now - 
once in the else {} and immediately after that outside of the else. The second 
call can be removed I think.



Yes, removed the "else" call site.


- nbtsearch.c _bt_skip line 1597
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
scan->xs_itup = (IndexTuple) PageGetItem(page, itemid);

This is an UNLOCK followed by a read of the unlocked page. That looks incorrect?



Yes, that needed to be changed.


- nbtsearch.c _bt_skip line 1440
if (BTScanPosIsValid(so->currPos) &&
    _bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir))

Is it allowed to look at the high key / low key of the page without having a
read lock on it?



In case of a split the page will still contain a high key and a low key, 
so this should be ok.



- nbtsearch.c line 1634
if (_bt_readpage(scan, indexdir, offnum))  ...
else
  error()

Is it really guaranteed that a match can be found on the page itself? Isn't it 
possible that an extra index condition, not part of the scan key, makes none of 
the keys on the page match?



The logic for this has been changed.


- nbtsearch.c in general
Most of the code seems to rely quite heavily on the fact that xs_want_itup 
forces _bt_drop_lock_and_maybe_pin to never release the buffer pin. Have you 
considered that compacting of a page may still happen even if you hold the pin? 
[1] I've been trying to come up with cases in which this may break the patch, 
but I haven't able to produce such a scenario - so it may be fine. But it would 
be good to consider again. One thing I was thinking of was a scenario where 
page splits and/or compacting would happen in between returning tuples. Could 
this break the _bt_scankey_within_page check such that it thinks the scan key 
is within the current page, while it actually isn't? Mainly for backward and/or 
cursor scans. Forward scans shouldn't be a problem I think. Apologies for being 
a bit vague as I don't have a clear example ready when it would go wrong. It 
may well be fine, but it was one of the things on my mind.



There is a BT_READ lock in place when finding the correct leaf page, or 
searching within the leaf page itself. _bt_vacuum_one_page deletes only 
LP_DEAD tuples, but those are already ignored in _bt_readpage. Peter, do 
you have some feedback for this ?



Please, find the updated patches attached that Dmitry and I made.

Thanks again !

Best regards,
 Jesper
>From 3c540c93307e6cbe792b31b12d4ecb025cd6b327 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 15 Nov 2019 09:46:05 -0500
Subject: [PATCH 1/2] Unique key

Design by David Rowley.

Re: Index Skip Scan

2019-11-15 Thread Jesper Pedersen

Hi,

On 11/11/19 1:24 PM, Jesper Pedersen wrote:

v29 using UniqueKey attached.



Just a small update to the UniqueKey patch to hopefully keep CFbot happy.

Feedback, especially on the planner changes, would be greatly appreciated.

Best regards,
 Jesper
>From b1a69c2791c8aba6caa85d7f24b9836641150875 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 06:44:57 -0400
Subject: [PATCH] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c   |  14 +++
 src/backend/nodes/print.c  |  39 +++
 src/backend/optimizer/path/Makefile|   3 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/indxpath.c  |  41 +++
 src/backend/optimizer/path/pathkeys.c  |  71 ++--
 src/backend/optimizer/path/uniquekey.c | 147 +
 src/backend/optimizer/plan/planagg.c   |   1 +
 src/backend/optimizer/plan/planmain.c  |   1 +
 src/backend/optimizer/plan/planner.c   |  17 ++-
 src/backend/optimizer/util/pathnode.c  |  12 ++
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/pathnodes.h  |  18 +++
 src/include/nodes/print.h  |   1 +
 src/include/optimizer/pathnode.h   |   1 +
 src/include/optimizer/paths.h  |  11 ++
 16 files changed, 373 insertions(+), 13 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b0dcd02ff6..1ccd68d3aa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1720,6 +1720,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }
 
 /*
@@ -2201,6 +2202,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2210,6 +2212,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2397,6 +2400,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4083,6 +4094,9 @@ outNode(StringInfo str, const void *obj)
 			case T_PathKey:
 _outPathKey(str, obj);
 break;
+			case T_UniqueKey:
+_outUniqueKey(str, obj);
+break;
 			case T_PathTarget:
 _outPathTarget(str, obj);
 break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 4ecde6b421..435b32063c 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
 	printf(")\n");
 }
 
+/*
+ * print_uniquekeys -
+ *	  uniquekeys list of UniqueKeys
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+	ListCell   *l;
+
+	printf("(");
+	foreach(l, uniquekeys)
+	{
+		UniqueKey *unique_key = (UniqueKey *) lfirst(l);
+		EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+		ListCell   *k;
+		bool		first = true;
+
+		/* chase up */
+		while (eclass->ec_merged)
+			eclass = eclass->ec_merged;
+
+		printf("(");
+		foreach(k, eclass->ec_members)
+		{
+			EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+			if (first)
+first = false;
+			else
+printf(", ");
+			print_expr((Node *) mem->em_expr, rtable);
+		}
+		printf(")");
+		if (lnext(uniquekeys, l))
+			printf(", ");
+	}
+	printf(")\n");
+}
+
 /*
  * print_tl
  *	  print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..63cc1505d9 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekey.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a51d..5fc9b81746 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3954,6 +3954,14 @@ print_path(PlannerInfo *root, Path *path, int indent)
 		print_pathkeys(path->pathkeys, root->

Re: Index Skip Scan

2019-11-11 Thread Jesper Pedersen

Hi Tomas,

On 11/10/19 4:18 PM, Tomas Vondra wrote:

I've looked at the patch again - in general it seems in pretty good
shape, all the issues I found are mostly minor.

Firstly, I'd like to point out that not all of the things I complained
about in my 2019/06/23 review got addressed. Those were mostly related
to formatting and code style, and the attached patch fixes some (but
maybe not all) of them.



Sorry about that !


The patch also tweaks wording of some bits in the docs and comments that
I found unclear. Would be nice if a native speaker could take a look.

A couple more comments:


1) pathkey_is_unique

The one additional issue I found is pathkey_is_unique - it's not really
explained what "unique" means and how it's different from "redundant"
(which has quite a long explanation before pathkey_is_redundant).

My understanding is that a pathkey is "unique" when its EC does not match
an EC of another pathkey in the list. But if that's the case, then the
function name is wrong - it does exactly the opposite (i.e. it returns
'true' when the pathkey is *not* unique).



Yeah, you are correct - forgot to move that part from the _uniquekey 
version of the patch.




2) explain

I wonder if we should print the "Skip scan" info always, or similarly to
"Inner Unique" which does this:

/* try not to be too chatty about this in text mode */
if (es->format != EXPLAIN_FORMAT_TEXT ||
    (es->verbose && ((Join *) plan)->inner_unique))
    ExplainPropertyBool("Inner Unique",
                        ((Join *) plan)->inner_unique,
                        es);
break;

I'd do the same thing for skip scan - print it only in verbose mode, or
when using non-text output format.



I think it is of benefit to see if skip scan kicks in, but used your 
"Skip scan" suggestion.




3) There's an annoying limitation that for this to kick in, the order of
expressions in the DISTINCT clause has to match the index, i.e. with
index on (a,b,c) the skip scan will only kick in for queries using

    DISTINCT a
    DISTINCT a,b
    DISTINCT a,b,c

but not e.g. DISTINCT a,c,b. I don't think there's anything forcing us
to sort the result of DISTINCT in any particular order, except that we don't
consider the other orderings "interesting" so we don't really consider
using the index (so no chance of using the skip scan).

That leads to pretty annoying speedups/slowdowns due to seemingly
irrelevant changes:

-- everything great, a,b,c matches an index
test=# explain (analyze, verbose) select distinct a,b,c from t;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25 rows=1330 width=12) (actual time=0.016..10.387 rows=1331 loops=1)
    Output: a, b, c
    Skip scan: true
    Heap Fetches: 1331
  Planning Time: 0.106 ms
  Execution Time: 10.843 ms
(6 rows)

-- slow, mismatch with index
test=# explain (analyze, verbose) select distinct a,c,b from t;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=22906.00..22919.30 rows=1330 width=12) (actual time=802.067..802.612 rows=1331 loops=1)
    Output: a, c, b
    Group Key: t.a, t.c, t.b
   ->  Seq Scan on public.t  (cost=0.00..15406.00 rows=100 width=12) (actual time=0.010..355.361 rows=100 loops=1)
  Output: a, b, c
  Planning Time: 0.076 ms
  Execution Time: 803.078 ms
(7 rows)

-- fast again, the extra ordering allows using the index again
test=# explain (analyze, verbose) select distinct a,c,b from t order by a,b,c;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Only Scan using t_a_b_c_idx on public.t  (cost=0.42..565.25 rows=1330 width=12) (actual time=0.035..12.120 rows=1331 loops=1)
    Output: a, c, b
    Skip scan: true
    Heap Fetches: 1331
  Planning Time: 0.053 ms
  Execution Time: 12.632 ms
(6 rows)


This is a more generic issue, not specific to this patch, of course. I
think we saw it with the incremental sort patch, IIRC. I wonder how
difficult would it be to fix this here (not necessarily in v1).



Yeah, I see it as separate to this patch as well. But definitely 
something that should be revisited.
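The limitation really is only about which orderings the planner treats as "interesting": the distinct groups themselves are identical under any permutation of the DISTINCT expressions, as a quick sanity check (plain Python, not planner code) shows — so in principle DISTINCT a,c,b could be canonicalised to the index order a,b,c.

```python
# Toy data standing in for table t(a, b, c)
rows = [(a, b, c) for a in range(3) for b in range(2) for c in range(4)]

distinct_abc = {(a, b, c) for a, b, c in rows}   # SELECT DISTINCT a,b,c
distinct_acb = {(a, c, b) for a, b, c in rows}   # SELECT DISTINCT a,c,b

# Same number of groups either way, and permuting the columns back
# recovers exactly the same set of groups.
assert len(distinct_abc) == len(distinct_acb)
assert distinct_abc == {(a, b, c) for a, c, b in distinct_acb}
print(len(distinct_abc))  # 24
```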


Thanks for your patch ! v29 using UniqueKey attached.

Best regards,
 Jesper
>From 4e27a04702002d06f60468f8a9033d2ac2e12d8a Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Mon, 11 Nov 2019 08:49:43 -0500
Subject: [PATCH 1/2] Unique key

Design by David Rowley.

Re: Application name for pg_basebackup and friends

2019-11-07 Thread Jesper Pedersen

On 11/7/19 1:51 AM, Michael Paquier wrote:

I don't think we need a new command line switch for it.


+1.


Please note that I have marked this patch as rejected in the CF app,
per the arguments upthread.


Ok, agreed.

Thanks for the feedback !

Best regards,
 Jesper





Application name for pg_basebackup and friends

2019-10-31 Thread Jesper Pedersen

Hi,

The attached patch adds an -a / --appname command line switch to 
pg_basebackup, pg_receivewal and pg_recvlogical.


This is useful when, for example, pg_receivewal needs to connect as a
synchronous client (synchronous_standby_names):


 pg_receivewal -h myhost -p 5432 -S replica1 -a replica1 --synchronous 
-D /wal


I'll add the patch to the CommitFest for discussion, as there is overlap 
with the -d switch.


Best regards,
 Jesper
>From 3aee659423137a547ed178a1dab34fe3caf30702 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Thu, 31 Oct 2019 08:34:41 -0400
Subject: [PATCH] Add an -a / --appname command line switch to control the
 application_name property.

Author: Jesper Pedersen 
---
 doc/src/sgml/ref/pg_basebackup.sgml| 11 +++
 doc/src/sgml/ref/pg_receivewal.sgml| 11 +++
 doc/src/sgml/ref/pg_recvlogical.sgml   | 11 +++
 src/bin/pg_basebackup/pg_basebackup.c  |  7 ++-
 src/bin/pg_basebackup/pg_receivewal.c  |  7 ++-
 src/bin/pg_basebackup/pg_recvlogical.c |  7 ++-
 src/bin/pg_basebackup/streamutil.c |  7 +++
 src/bin/pg_basebackup/streamutil.h |  1 +
 8 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index fc9e222f8d..28b5dee206 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -543,6 +543,17 @@ PostgreSQL documentation
 The following command-line options control the database connection parameters.
 
 
+ 
+  -a application_name
+  --appname=application_name
+  
+   
+Specifies the application name used to connect to the server.
+See  for more information.
+   
+  
+ 
+
  
   -d connstr
   --dbname=connstr
diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 177e9211c0..0957e0a9f5 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -248,6 +248,17 @@ PostgreSQL documentation
 The following command-line options control the database connection parameters.
 
 
+ 
+  -a application_name
+  --appname=application_name
+  
+   
+Specifies the application name used to connect to the server.
+See  for more information.
+   
+  
+ 
+
  
   -d connstr
   --dbname=connstr
diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index 4c79f90414..b2d9b35362 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -272,6 +272,17 @@ PostgreSQL documentation
 The following command-line options control the database connection parameters.
 
 
+  
+   -a application_name
+   --appname=application_name
+   
+
+ Specifies the application name used to connect to the server.
+ See  for more information.
+
+   
+  
+
   
-d database
--dbname=database
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index a9d162a7da..237945f879 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -351,6 +351,7 @@ usage(void)
 			 " do not verify checksums\n"));
 	printf(_("  -?, --help show this help, then exit\n"));
 	printf(_("\nConnection options:\n"));
+	printf(_("  -a, --appname=NAME application name\n"));
 	printf(_("  -d, --dbname=CONNSTR   connection string\n"));
 	printf(_("  -h, --host=HOSTNAMEdatabase server host or socket directory\n"));
 	printf(_("  -p, --port=PORTdatabase server port number\n"));
@@ -2031,6 +2032,7 @@ main(int argc, char **argv)
 		{"label", required_argument, NULL, 'l'},
 		{"no-clean", no_argument, NULL, 'n'},
 		{"no-sync", no_argument, NULL, 'N'},
+		{"appname", required_argument, NULL, 'a'},
 		{"dbname", required_argument, NULL, 'd'},
 		{"host", required_argument, NULL, 'h'},
 		{"port", required_argument, NULL, 'p'},
@@ -2070,7 +2072,7 @@ main(int argc, char **argv)
 
 	atexit(cleanup_directories_atexit);
 
-	while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:d:c:h:p:U:s:wWkvP",
+	while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:a:d:c:h:p:U:s:wWkvP",
			long_options, &option_index)) != -1)
 	{
 		switch (c)
@@ -2176,6 +2178,9 @@ main(int argc, char **argv)
 	exit(1);
 }
 break;
+			case 'a':
+application_name = pg_strdup(optarg);
+break;
 			case 'd':
 connection_string = pg_strdup(optarg);
 break;
diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c
index c0c8747982..863dcdb161 100644
--- a/src/bin/pg_basebackup/pg_receivewal.c
+++ b/src/bin/pg_basebackup/pg_receivewal.c
@@ -

Re: Index Skip Scan

2019-08-02 Thread Jesper Pedersen

Hi,

On 8/2/19 8:14 AM, Dmitry Dolgov wrote:

And this too (slightly rewritten:). We will soon post the new version of patch
with updates about UniqueKey from Jesper.



Yes.

We decided to send this now, although there is still feedback from David 
that needs to be considered/acted on.


The patches can be reviewed independently, but we will send them as a 
set from now on. Development of UniqueKey will be kept separate though [1].


Note that while UniqueKey can form the foundation of optimizations for
GROUP BY queries, it isn't the focus for this patch series. Contributions
are very welcome, of course.


[1] https://github.com/jesperpedersen/postgres/tree/uniquekey

Best regards,
 Jesper
>From 35018a382d792d6ceeb8d0e9d16bc14ea2e3f148 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 2 Aug 2019 07:52:08 -0400
Subject: [PATCH 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c   |  14 +++
 src/backend/nodes/print.c  |  39 +++
 src/backend/optimizer/path/Makefile|   2 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/costsize.c  |   5 +
 src/backend/optimizer/path/indxpath.c  |  41 +++
 src/backend/optimizer/path/pathkeys.c  |  72 ++--
 src/backend/optimizer/path/uniquekey.c | 147 +
 src/backend/optimizer/plan/planner.c   |  18 ++-
 src/backend/optimizer/util/pathnode.c  |  12 ++
 src/backend/optimizer/util/tlist.c |   1 -
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/pathnodes.h  |  18 +++
 src/include/nodes/print.h  |   2 +-
 src/include/optimizer/pathnode.h   |   1 +
 src/include/optimizer/paths.h  |  11 ++
 16 files changed, 377 insertions(+), 15 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e6ce8e2110..9a4f3e8e4b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1720,6 +1720,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }
 
 /*
@@ -2201,6 +2202,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2210,6 +2212,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2397,6 +2400,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4073,6 +4084,9 @@ outNode(StringInfo str, const void *obj)
 			case T_PathKey:
 _outPathKey(str, obj);
 break;
+			case T_UniqueKey:
+_outUniqueKey(str, obj);
+break;
 			case T_PathTarget:
 _outPathTarget(str, obj);
 break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 4ecde6b421..62c9d4ef7a 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
 	printf(")\n");
 }
 
+/*
+ * print_uniquekeys -
+ *	  unique_key an UniqueKey
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+	ListCell   *l;
+
+	printf("(");
+	foreach(l, uniquekeys)
+	{
+		UniqueKey *unique_key = (UniqueKey *) lfirst(l);
+		EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+		ListCell   *k;
+		bool		first = true;
+
+		/* chase up */
+		while (eclass->ec_merged)
+			eclass = eclass->ec_merged;
+
+		printf("(");
+		foreach(k, eclass->ec_members)
+		{
+			EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+			if (first)
+first = false;
+			else
+printf(", ");
+			print_expr((Node *) mem->em_expr, rtable);
+		}
+		printf(")");
+		if (lnext(uniquekeys, l))
+			printf(", ");
+	}
+	printf(")\n");
+}
+
 /*
  * print_tl
  *	  print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 6864a62132..8249a6b6db 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../

Re: pg_receivewal documentation

2019-07-24 Thread Jesper Pedersen

Hi,

On 7/23/19 10:29 PM, Michael Paquier wrote:

Thanks.  Applied down to 9.6 where remote_apply has been introduced,
with tweaks for 9.6 as the tool is named pg_receivexlog there.


Thanks to everybody involved !

Best regards,
 Jesper




Re: pg_receivewal documentation

2019-07-23 Thread Jesper Pedersen

Hi,

On 7/22/19 8:08 PM, Michael Paquier wrote:

+to something other than


Looks fine to me.  Just a tiny nit.  For the second reference to
synchronous_commit, I would change the link to a  markup.


Sure.

Best regards,
 Jesper


>From f6c5e9128e0779f928d94bf9bcc8813bf03150f0 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 13:14:25 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't apply WAL, and as such
 synchronous-commit needs to be remote_write or lower.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Author: Robert Haas
Review-by: Michael Paquier, Álvaro Herrera, Laurenz Albe and Jesper Pedersen
---
 doc/src/sgml/ref/pg_receivewal.sgml | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 0506120c00..a7536bed92 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -52,7 +52,15 @@ PostgreSQL documentation
Unlike the WAL receiver of a PostgreSQL standby server, pg_receivewal
by default flushes WAL data only when a WAL file is closed.
The option --synchronous must be specified to flush WAL data
-   in real time.
+   in real time. Since pg_receivewal does not apply WAL,
+   you should not allow it to become a synchronous standby when
+equals remote_apply.
+   If it does, it will appear to be a standby which never catches up,
+   which may cause commits to block.  To avoid this, you should either
+   configure an appropriate value for , or
+   specify an application_name for
+   pg_receivewal that does not match it, or change the value of
+   synchronous_commit to something other than remote_apply.
   
 
   
-- 
2.21.0



Re: pg_receivewal documentation

2019-07-22 Thread Jesper Pedersen

Hi,

On 7/21/19 9:48 PM, Michael Paquier wrote:

Since pg_receivewal does not apply WAL, you should not allow it to
become a synchronous standby when synchronous_commit = remote_apply.
If it does, it will appear to be a standby which never catches up,
which may cause commits to block.  To avoid this, you should either
configure an appropriate value for synchronous_standby_names, or
specify an application_name for pg_receivewal that does not match it,
or change the value of synchronous_commit to something other than
remote_apply.

I think that'd be a lot more useful than enumerating the total-failure
scenarios.


+1.  Thanks for the suggestions!  Your wording looks good to me.


+1

Here is the patch for it, with Robert as the author.

Best regards,
 Jesper

>From a6512e79e9fd054b188489757ee622dbf98ddf2b Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Mon, 22 Jul 2019 13:19:56 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't apply WAL, and as such
 synchronous-commit needs to be remote_write or lower.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Author: Robert Haas
Review-by: Michael Paquier, Álvaro Herrera, Laurenz Albe and Jesper Pedersen
---
 doc/src/sgml/ref/pg_receivewal.sgml | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index e96d753955..beab6f0391 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -52,7 +52,16 @@ PostgreSQL documentation
Unlike the WAL receiver of a PostgreSQL standby server, pg_receivewal
by default flushes WAL data only when a WAL file is closed.
The option --synchronous must be specified to flush WAL data
-   in real time.
+   in real time. Since pg_receivewal does not apply WAL,
+   you should not allow it to become a synchronous standby when
+equals remote_apply.
+   If it does, it will appear to be a standby which never catches up,
+   which may cause commits to block.  To avoid this, you should either
+   configure an appropriate value for , or
+   specify an application_name for
+   pg_receivewal that does not match it, or change the value of
+to something other than
+   remote_apply.
   
 
   
@@ -207,16 +216,6 @@ PostgreSQL documentation
 server as a synchronous standby, to ensure that timely feedback is
 sent to the server.

-
-   
-Note that while WAL will be flushed with this setting,
-pg_receivewal never applies it,
-so  must not be set
-to remote_apply or on if
-pg_receivewal is a synchronous standby, be it a
-member of a priority-based (FIRST) or a
-quorum-based (ANY) synchronous replication setup.
-   
   
  
 
-- 
2.21.0



Re: Index Skip Scan

2019-07-22 Thread Jesper Pedersen

Hi,

On 7/22/19 1:44 AM, David Rowley wrote:

Here are the comments I noted down during the review:

cost_index:

I know you've not finished here, but I think it'll need to adjust
tuples_fetched somehow to account for estimate_num_groups() on the
Path's unique keys. Any Eclass with an ec_has_const = true does not
need to be part of the estimate there as there can only be at most one
value for these.

For example, in a query such as:

SELECT x,y FROM t WHERE x = 1 GROUP BY x,y;

you only need to perform estimate_num_groups() on "y".

I'm really not quite sure on what exactly will be required from
amcostestimate() here.  The cost of the skip scan is not the same as
the normal scan. So either that API needs to be adjusted to allow the caller
to mention that we want skip scans estimated, or there needs to be
another callback.



I think this part will become more clear once the index skip scan patch 
is rebased, as we got the uniquekeys field on the Path, and the 
indexskipprefixy info on the IndexPath node.
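The estimate_num_groups() point above can be illustrated with a toy cardinality model (hypothetical numbers and function names, not the real selfuncs.c logic): a column pinned by an equality qual — an EquivalenceClass with ec_has_const = true — contributes exactly one group, so only the unconstrained columns multiply the estimate.

```python
def toy_num_groups(ndistinct, group_cols, const_cols):
    """Multiply per-column distinct-value counts, skipping columns that
    an equality qual pins to a single value (ec_has_const = true)."""
    estimate = 1.0
    for col in group_cols:
        if col not in const_cols:
            estimate *= ndistinct[col]
    return estimate

ndistinct = {"x": 100.0, "y": 10.0}
# SELECT x,y FROM t WHERE x = 1 GROUP BY x,y  ->  only y drives the estimate
print(toy_num_groups(ndistinct, ["x", "y"], const_cols={"x"}))  # 10.0
```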




build_index_paths:

I don't quite see where you're checking if the query's unique_keys
match what unique keys can be produced by the index.  This is done for
pathkeys with:

pathkeys_possibly_useful = (scantype != ST_BITMAPSCAN &&
!found_lower_saop_clause &&
has_useful_pathkeys(root, rel));
index_is_ordered = (index->sortopfamily != NULL);
if (index_is_ordered && pathkeys_possibly_useful)
{
index_pathkeys = build_index_pathkeys(root, index,
   ForwardScanDirection);
useful_pathkeys = truncate_useless_pathkeys(root, rel,
index_pathkeys);
orderbyclauses = NIL;
orderbyclausecols = NIL;
}

Here has_useful_pathkeys() checks if the query requires some ordering.
For unique keys you'll want to do the same. You'll have set the
query's unique key requirements in standard_qp_callback().

I think basically build_index_paths() should be building index paths
with unique keys, for all indexes that can support the query's unique
keys. I'm just a bit uncertain if we need to create both a normal
index path and another path for the same index with unique keys.
Perhaps we can figure that out down the line somewhere. Maybe it's
best to build both path types for now, when possible, and we can consider
later if we can skip the non-uniquekey paths.  Likely that would
require a big XXX comment to explain we need to review that before the
code makes it into core.

get_uniquekeys_for_index:

I think this needs to follow more the lead from build_index_pathkeys.
Basically, ask the index what it's pathkeys are.

standard_qp_callback:

build_group_uniquekeys & build_distinct_uniquekeys could likely be one
function that takes a list of SortGroupClause. You just either pass
the groupClause or distinctClause in.  Pretty much the UniqueKey
version of make_pathkeys_for_sortclauses().

Yeah, I'll move this part of the index skip scan patch to the unique key 
patch.


Thanks for your feedback !

Best regards,
 Jesper




Re: pg_receivewal documentation

2019-07-19 Thread Jesper Pedersen

Hi,

On 7/18/19 9:09 PM, Michael Paquier wrote:

pg_receivewal -D /tmp/wal -S replica1 --synchronous -h localhost -p 5432 -U repluser -W
psql -c 'SELECT * FROM pg_stat_replication;' postgres
psql -c 'SELECT * FROM pg_replication_slots;' postgres
psql -c 'CREATE DATABASE test' postgres

In what scenarios do you see 'on' working ?


Because the code says so, "on" is an alias for "remote_flush" (which
is not user-visible by the way):
src/include/access/xact.h:#define SYNCHRONOUS_COMMIT_ON	SYNCHRONOUS_COMMIT_REMOTE_FLUSH

And if you do that it works fine (pg_receivewal --synchronous runs in
the background and I created a dummy table):
=# SELECT application_name, sync_state, flush_lsn, replay_lsn FROM pg_stat_replication;
 application_name | sync_state | flush_lsn | replay_lsn
------------------+------------+-----------+------------
 pg_receivewal    | sync       | 0/15E1F88 | null
(1 row)
=# set synchronous_commit to on ;
SET
=# insert into aa values (2);
INSERT 0 1



I forgot to use pg_receivewal -d with application_name instead of -h -p -U.

Maybe we should have an explicit option for that, but that is a separate 
thread.


Best regards,
 Jesper




Re: pg_receivewal documentation

2019-07-19 Thread Jesper Pedersen

Hi,

On 7/18/19 9:27 PM, Michael Paquier wrote:

The location of the warning is
also harder to catch for the reader, so instead let's move it to the
top where we have an extra description for --synchronous.  I am
finishing with the attached that I would be fine to commit and
back-patch as needed.  Does that sound fine?


LGTM.

Best regards,
 Jesper






Re: pg_receivewal documentation

2019-07-18 Thread Jesper Pedersen

Hi,

On 7/18/19 1:29 AM, Michael Paquier wrote:

Or more simply like that?
"Note that while WAL will be flushed with this setting,
pg_receivewal never applies it, so synchronous_commit must not be set
to remote_apply if pg_receivewal is a synchronous standby, be it a
member of a priority-based (FIRST) or a quorum-based (ANY) synchronous
replication setup."


Yeah, better.

Best regards,
 Jesper


>From 2bcb8d6376d94a265a598f552eed3915b980aa94 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 13:14:25 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't apply WAL, and as such
 synchronous-commit needs to be remote_write or lower.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Authors: Michael Paquier, Álvaro Herrera, Laurenz Albe and Jesper Pedersen
Review-by: Michael Paquier, Álvaro Herrera, Laurenz Albe and Jesper Pedersen
---
 doc/src/sgml/ref/pg_receivewal.sgml | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 0506120c00..e96d753955 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -207,6 +207,16 @@ PostgreSQL documentation
 server as a synchronous standby, to ensure that timely feedback is
 sent to the server.

+
+   
+Note that while WAL will be flushed with this setting,
+pg_receivewal never applies it,
+so  must not be set
+to remote_apply or on if
+pg_receivewal is a synchronous standby, be it a
+member of a priority-based (FIRST) or a
+quorum-based (ANY) synchronous replication setup.
+   
   
  
 
-- 
2.21.0



Re: pg_receivewal documentation

2019-07-18 Thread Jesper Pedersen

Hi Laurenz,

On 7/17/19 5:21 PM, Laurenz Albe wrote:

That's factually wrong.  "on" (wait for WAL flush) works fine with
pg_receivewal, only "remote_apply" doesn't.



Please, try

mkdir /tmp/wal
initdb /tmp/pgsql
pg_ctl -D /tmp/pgsql -l /tmp/logfile start
psql postgres
SELECT pg_create_physical_replication_slot('replica1');
CREATE ROLE repluser WITH LOGIN REPLICATION PASSWORD 'replpass';
\q

synchronous_commit = on
synchronous_standby_names = 'replica1'

pg_ctl -D /tmp/pgsql -l /tmp/logfile restart
pg_receivewal -D /tmp/wal -S replica1 --synchronous -h localhost -p 5432 
-U repluser -W

psql -c 'SELECT * FROM pg_stat_replication;' postgres
psql -c 'SELECT * FROM pg_replication_slots;' postgres
psql -c 'CREATE DATABASE test' postgres

In what scenarios do you see 'on' working ?

Best regards,
 Jesper




Re: pg_receivewal documentation

2019-07-17 Thread Jesper Pedersen

Hi,

On 7/17/19 4:04 AM, Michael Paquier wrote:

How about adding "or priority-based" after "quorum-based"?


I would be fine with that for the first part.  I am not sure of what a
good formulation would be for the second part of the sentence.  Now it
only refers to quorum, but with priority sets that does not apply.
And I am not sure what "won't count towards the quorum" actually
means.


Maybe something like the attached ?  Although it doesn't help that we 
need to include on as well...


Best regards,
 Jesper


>From f1ec4f9e715c798440288bb4e5c97100bf4efacc Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 13:14:25 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't apply WAL, and as such
 synchronous-commit needs to be remote_write or lower.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Authors: Álvaro Herrera, Laurenz Albe and Jesper Pedersen
Review-by: Álvaro Herrera, Laurenz Albe and Jesper Pedersen
---
 doc/src/sgml/ref/pg_receivewal.sgml | 13 +
 1 file changed, 13 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 0506120c00..254315c6bf 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -207,6 +207,19 @@ PostgreSQL documentation
 server as a synchronous standby, to ensure that timely feedback is
 sent to the server.

+
+   
+Note that while WAL will be flushed with this setting,
+pg_receivewal never applies it, so
+ must not be set to
+remote_apply or on
+if pg_receivewal is the only synchronous standby.
+Similarly, if pg_receivewal is part of a
+priority-based synchronous replication setup (FIRST),
+or a quorum-based setup (ANY) it won't count towards
+the policy specified if  is
+set to remote_apply or on.
+   
   
  
 
-- 
2.21.0



Re: pg_receivewal documentation

2019-07-16 Thread Jesper Pedersen

Hi,

On 7/16/19 12:28 PM, Laurenz Albe wrote:

This is not true in all cases as since 9.6 it is possible to specify
multiple synchronous standbys.  So if for example pg_receivewal and
another synchronous standby are set in s_s_names and that the number
of a FIRST (priority-based) or ANY (quorum set) is two, then the same
issue exists, but this documentation is incorrect.  I think that we
should have a more extensive wording  here, like "if pg_receivewal is
part of a quorum-based or priority-based set of synchronous standbys."


I think this would be overly complicated.
The wording above seems to cover the priority-based base sufficiently
in my opinion.
Maybe a second sentence with more detail would be better:

   ... must not be set to remote_apply if
   pg_receivewal is the only synchronous standby.
   Similarly, if pg_receivewal is part of
   a quorum-based set of synchronous standbys, it won't count towards
   the quorum if  is set to
   remote_apply.



Here is the patch for that.

Best regards,
 Jesper


>From 4fb28d6fe08ddea5a9740c9a81e8e00b94283d78 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 13:14:25 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't apply WAL, and as such
 synchronous-commit needs to be remote_write or lower.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Authors: Álvaro Herrera, Laurenz Albe and Jesper Pedersen
Review-by: Álvaro Herrera, Laurenz Albe and Jesper Pedersen
---
 doc/src/sgml/ref/pg_receivewal.sgml | 12 
 1 file changed, 12 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 0506120c00..03b20ecd38 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -207,6 +207,18 @@ PostgreSQL documentation
 server as a synchronous standby, to ensure that timely feedback is
 sent to the server.

+
+   
+Note that while WAL will be flushed with this setting,
+pg_receivewal never applies it, so
+ must not be set to
+remote_apply if pg_receivewal
+is the only synchronous standby. Similarly, if
+pg_receivewal is part of a quorum-based
+set of synchronous standbys, it won't count towards the quorum if
+ is set to
+remote_apply.
+   
   
  
 
-- 
2.21.0



Re: Index Skip Scan

2019-07-16 Thread Jesper Pedersen

Hi David,

On 7/11/19 7:38 AM, David Rowley wrote:

The UniqueKeys idea is quite separate from pathkeys.  Currently, a
Path can have a List of PathKeys which define the order that the
tuples will be read from the Plan node that's created from that Path.
The idea with UniqueKeys is that a Path can also have a non-empty List
of UniqueKeys to define that there will be no more than 1 row with the
same value for the Paths UniqueKey column/exprs.

As of now, if you look at standard_qp_callback() sets
root->query_pathkeys, the idea here would be that the callback would
also set a new List field named "query_uniquekeys" based on the
group_pathkeys when non-empty and !root->query->hasAggs, or by using
the distinct clause if it's non-empty. Then in build_index_paths()
around the call to match_pathkeys_to_index() we'll probably also want
to check if the index can support UniqueKeys that would suit the
query_uniquekeys that were set during standard_qp_callback().

As for setting query_uniquekeys in standard_qp_callback(), this should
be simply a matter of looping over either group_pathkeys or
distinct_pathkeys and grabbing the pk_eclass from each key and making
a canonical UniqueKey from that. To have these canonical you'll need
to have a new field in PlannerInfo named canon_uniquekeys which will
do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
need an equivalent of make_canonical_pathkey() in uniquekey.c

With this implementation, the code that the patch adds in
create_distinct_paths() can mostly disappear. You'd only need to look
for a path that uniquekeys_contained_in() matches the
root->query_uniquekeys and then just leave it to
set_cheapest(distinct_rel); to pick the cheapest path.

It would be wasted effort to create paths with UniqueKeys if there's
multiple non-dead base rels in the query as the final rel in
create_distinct_paths would be a join rel, so it might be worth
checking that before creating such index paths in build_index_paths().
However, down the line, we could carry the uniquekeys forward into the
join paths if the join does not duplicate rows from the other side of
the join... That's future stuff though, not for this patch, I don't
think.

I think a UniqueKey can just be a struct similar to PathKey, e.g be
located in pathnodes.h around where PathKey is defined.  Likely we'll
need a uniquekeys.c file that has the equivalent to
pathkeys_contained_in() ... uniquekeys_contained_in(), which return
true if arg1 is a superset of arg2 rather than check for one set being
a prefix of another. As you mention above: UniqueKeys { x, y } ==
UniqueKeys { y, x }.  That superset check could perhaps be optimized
by sorting UniqueKey lists in memory address order, that'll save
having a nested loop, but likely that's not going to be required for a
first cut version.  This would work since you'd want UniqueKeys to be
canonical the same as PathKeys are (Notice that compare_pathkeys()
only checks memory addresses of pathkeys and not equals()).

I think the UniqueKey struct would only need to contain an
EquivalenceClass field. I think all the other stuff that's in PathKey
is irrelevant to UniqueKey.



Here is a patch more in that direction.

Some questions:

1) Do we really need the UniqueKey struct ?  If it only contains the 
EquivalenceClass field then we could just have a list of those instead. 
That would make the patch simpler.


2) Do we need both canon_uniquekeys and query_uniquekeys ?  Currently 
the patch only uses canon_uniquekeys because we attach the list 
directly on the Path node.


I'll clean the patch up based on your feedback, and then start to rebase 
v21 on it.
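As an aside, a toy sketch (plain Python, purely illustrative -- the real 
code would of course be C in uniquekey.c) of what making the keys 
canonical buys: equal key sets are interned to a single object, so the 
planner can compare them by pointer, the same trick compare_pathkeys() 
relies on for PathKeys.

```python
# Hypothetical sketch: intern unique-key sets so equal sets share one
# object and can be compared by identity instead of by equals().
_canon = {}

def make_canonical_uniquekey(eclass_ids):
    key = frozenset(eclass_ids)
    return _canon.setdefault(key, key)

a = make_canonical_uniquekey(["x", "y"])
b = make_canonical_uniquekey(["y", "x"])  # same set, different order
assert a is b                             # identity comparison suffices
assert make_canonical_uniquekey(["x"]) is not a
```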


Thanks in advance !

Best regards,
 Jesper
From 174a6425036e2d4ca7d3d68c635cd55a58a9b9e6 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 06:44:57 -0400
Subject: [PATCH] UniqueKey

---
 src/backend/nodes/print.c  |  39 +++
 src/backend/optimizer/path/Makefile|   2 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/costsize.c  |   5 +
 src/backend/optimizer/path/indxpath.c  |  39 +++
 src/backend/optimizer/path/uniquekey.c | 149 +
 src/backend/optimizer/plan/planner.c   |  12 +-
 src/backend/optimizer/util/pathnode.c  |  12 ++
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/pathnodes.h  |  18 +++
 src/include/nodes/print.h  |   2 +-
 src/include/optimizer/pathnode.h   |   1 +
 src/include/optimizer/paths.h  |   8 ++
 13 files changed, 293 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 4ecde6b421..ed5684bf19 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
printf(")\n");
 }
 
+/*
+ * print_unique_keys -
+ *   unique_key an UniqueKey
+ */
+void

Re: Index Skip Scan

2019-07-11 Thread Jesper Pedersen

Hi David,

On 7/11/19 7:38 AM, David Rowley wrote:

The UniqueKeys idea is quite separate from pathkeys.  Currently, a
Path can have a List of PathKeys which define the order that the
tuples will be read from the Plan node that's created from that Path.
The idea with UniqueKeys is that a Path can also have a non-empty List
of UniqueKeys to define that there will be no more than 1 row with the
same value for the Paths UniqueKey column/exprs.

As of now, if you look at standard_qp_callback() sets
root->query_pathkeys, the idea here would be that the callback would
also set a new List field named "query_uniquekeys" based on the
group_pathkeys when non-empty and !root->query->hasAggs, or by using
the distinct clause if it's non-empty. Then in build_index_paths()
around the call to match_pathkeys_to_index() we'll probably also want
to check if the index can support UniqueKeys that would suit the
query_uniquekeys that were set during standard_qp_callback().

As for setting query_uniquekeys in standard_qp_callback(), this should
be simply a matter of looping over either group_pathkeys or
distinct_pathkeys and grabbing the pk_eclass from each key and making
a canonical UniqueKey from that. To have these canonical you'll need
to have a new field in PlannerInfo named canon_uniquekeys which will
do for UniqueKeys what canon_pathkeys does for PathKeys. So you'll
need an equivalent of make_canonical_pathkey() in uniquekey.c

With this implementation, the code that the patch adds in
create_distinct_paths() can mostly disappear. You'd only need to look
for a path that uniquekeys_contained_in() matches the
root->query_uniquekeys and then just leave it to
set_cheapest(distinct_rel); to pick the cheapest path.

It would be wasted effort to create paths with UniqueKeys if there's
multiple non-dead base rels in the query as the final rel in
create_distinct_paths would be a join rel, so it might be worth
checking that before creating such index paths in build_index_paths().
However, down the line, we could carry the uniquekeys forward into the
join paths if the join does not duplicate rows from the other side of
the join... That's future stuff though, not for this patch, I don't
think.

I think a UniqueKey can just be a struct similar to PathKey, e.g be
located in pathnodes.h around where PathKey is defined.  Likely we'll
need a uniquekeys.c file that has the equivalent to
pathkeys_contained_in() ... uniquekeys_contained_in(), which return
true if arg1 is a superset of arg2 rather than check for one set being
a prefix of another. As you mention above: UniqueKeys { x, y } ==
UniqueKeys { y, x }.  That superset check could perhaps be optimized
by sorting UniqueKey lists in memory address order, that'll save
having a nested loop, but likely that's not going to be required for a
first cut version.  This would work since you'd want UniqueKeys to be
canonical the same as PathKeys are (Notice that compare_pathkeys()
only checks memory addresses of pathkeys and not equals()).

I think the UniqueKey struct would only need to contain an
EquivalenceClass field. I think all the other stuff that's in PathKey
is irrelevant to UniqueKey.



Thanks for the feedback ! I'll work on these changes for the next 
uniquekey patch.


Best regards,
 Jesper




Re: pg_receivewal documentation

2019-07-10 Thread Jesper Pedersen

Hi,

On 7/10/19 10:24 AM, Alvaro Herrera wrote:

+1 to document this caveat.

How about
 Note that while WAL will be flushed with this setting,
 pg_receivewal never applies it, so
  must not be set to
 remote_apply if 
pg_receivewal
 is the only synchronous standby.
?



Sure.

Best regards,
 Jesper
>From cc51d52b2f67b33cd0faba1a74e05e93fc859011 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 13:14:25 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't apply WAL, and as such
 synchronous-commit needs to be remote_write or lower.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Authors: Álvaro Herrera, Laurenz Albe and Jesper Pedersen
Review-by: Álvaro Herrera, Laurenz Albe and Jesper Pedersen
---
 doc/src/sgml/ref/pg_receivewal.sgml | 8 
 1 file changed, 8 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 0506120c00..16f9c41ff1 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -207,6 +207,14 @@ PostgreSQL documentation
 server as a synchronous standby, to ensure that timely feedback is
 sent to the server.

+
+   
+Note that while WAL will be flushed with this setting,
+pg_receivewal never applies it, so
+ must not be set to
+remote_apply if pg_receivewal
+is the only synchronous standby.
+   
   
  
 
-- 
2.21.0



Re: Index Skip Scan

2019-07-10 Thread Jesper Pedersen

Hi,

On 7/9/19 10:14 PM, Thomas Munro wrote:

Thomas, do you have any ideas for this ? I can see that MySQL
implemented the functionality in two change sets (base and function
support), but like you said we shouldn't paint ourselves into a corner.


I think amskip() could be augmented by later patches to take a
parameter 'skip mode' that can take values SKIP_FIRST, SKIP_LAST and
SKIP_FIRST | SKIP_LAST (meaning please give me both).  What we have in
the current patch is SKIP_FIRST behaviour.  Being able to choose
SKIP_FIRST or SKIP_LAST allows you do handle i, MAX(j) GROUP BY (i)
ORDER BY i (ie forward scan for the order, but wanting the highest key
for each distinct prefix), and being able to choose both allows for i,
MIN(j), MAX(j) GROUP BY i.  Earlier I thought that this scheme that
allows you to ask for both might be incompatible with David's
suggestion of tracking UniqueKeys in paths, but now I see that it
isn't: it's just that SKIP_FIRST | SKIP_LAST would have no UniqueKeys
and therefore require a regular agg on top.  I suspect that's OK: this
min/max transformation stuff is highly specialised and can have
whatever magic path-building logic is required in
preprocess_minmax_aggregates().



Ok, great.
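To make that concrete, here is a rough simulation (plain Python -- not 
the amskip() interface itself, and the flag names are only illustrative) 
of the three skip modes over a sorted (i, j) index:

```python
from itertools import groupby

SKIP_FIRST = 0x01  # first entry of each distinct prefix
SKIP_LAST = 0x02   # last entry of each distinct prefix

def skip_scan(rows, mode):
    """Emit the first and/or last (i, j) entry per distinct prefix i,
    as a stand-in for skipping through a sorted index."""
    out = []
    for _, grp in groupby(rows, key=lambda r: r[0]):
        grp = list(grp)
        if mode & SKIP_FIRST:
            out.append(grp[0])
        if mode & SKIP_LAST and grp[-1] not in out[-1:]:
            out.append(grp[-1])
    return out

index = [(1, 10), (1, 20), (2, 5), (2, 9)]
assert skip_scan(index, SKIP_FIRST) == [(1, 10), (2, 5)]  # per-prefix MIN(j)
assert skip_scan(index, SKIP_LAST) == [(1, 20), (2, 9)]   # per-prefix MAX(j)
# Both at once: a regular agg on top collapses them into one row per i
assert skip_scan(index, SKIP_FIRST | SKIP_LAST) == index
```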

Thanks for your feedback !

Best regards,
 Jesper




Re: pg_receivewal documentation

2019-07-10 Thread Jesper Pedersen

Hi,

On 7/10/19 4:04 AM, Michael Paquier wrote:

On Wed, Jul 10, 2019 at 12:22:02AM +0200, Laurenz Albe wrote:

Works for me.

Marked as "ready for committer".


Hmm.  synchronous_commit is user-settable, which means that it is
possible to enforce a value in the connection string doing the
connection.  Isn't that something we had better enforce directly in
the tool?  In this case what could be fixed is GetConnection() which
builds the connection string parameters.  One thing that we would need
to be careful about is that if the caller has provided a parameter for
"options" (which is plausible as wal_sender_timeout is user-settable
as of 12), then we need to make sure that the original value is
preserved, and that the enforced of synchronous_commit is appended.



I think that the above is out-of-scope for this patch. And ...


Or, as you say, we just adjust the documentation.  However I would
recommend adding at least an example of connection string which uses
"options" if the server sets synchronous_commit to "remote_apply" in
this case.  Still it seems to me that we have ways to reduce the
confusion automatically.



The patch tries to highlight that if you f.ex. have

postgresql.conf
===
synchronous_commit = remote_apply
synchronous_standby_names = '*'

and you _only_ have pg_receivewal connected, then commits are only 
applied locally on the primary instance and any client (psql, ...) won't 
get acknowledged. The replay_lsn for the pg_receivewal connection will 
stay null, since the WAL is never applied, so


env PGOPTIONS="-c synchronous_commit=remote_write" pg_receivewal -D 
/tmp/wal -S replica1 --synchronous


won't help you.

We could add some wording around 'synchronous_standby_names' if it makes 
the case clearer.


Best regards,
 Jesper




Re: pg_receivewal documentation

2019-07-10 Thread Jesper Pedersen

Hi,

On 7/9/19 6:22 PM, Laurenz Albe wrote:

Works for me.

Marked as "ready for committer".



Thank you !

Best regards,
 Jesper





Re: pg_receivewal documentation

2019-07-09 Thread Jesper Pedersen

Hi Laurenz,

On 7/9/19 5:16 AM, Laurenz Albe wrote:

On Thu, 2019-06-27 at 10:06 -0400, Jesper Pedersen wrote:

Here is a patch for the pg_receivewal documentation to highlight that
WAL isn't acknowledged to be applied.


I think it is a good idea to document this, but I have a few quibbles
with the patch as it is:

- I think there shouldn't be commas after the "note" and before the "if".
   Disclaimer: I am not a native speaker, so I am lacking authority.

- The assertion is wrong.  "on" (remote flush) is perfectly fine
   for synchronous_commit, only "remote_apply" is a problem.

- There is already something about "--synchronous" in the "Description"
   section.  It might make sense to add the additional information there.

How about the attached patch?



Thanks for the review, and the changes.

However, I think it belongs in the --synchronous section, so what about 
moving it there as attached ?


Best regards,
 Jesper
>From 6cd525b365f3afcdb63f478c4410d6e20ca2f6e0 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Jul 2019 13:14:25 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't acknowledge that WAL has
 been applied, and as such synchronous-commit needs to be remote_write or
 lower.

Authors: Laurenz Albe and Jesper Pedersen
Review-by: Laurenz Albe
---
 doc/src/sgml/ref/pg_receivewal.sgml | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 0506120c00..46605db662 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -207,6 +207,13 @@ PostgreSQL documentation
 server as a synchronous standby, to ensure that timely feedback is
 sent to the server.

+
+   
+Note that while WAL will be flushed with this setting,
+it will never be applied, so  must
+not be set to remote_apply if pg_receivewal
+is the only synchronous standby.
+   
   
  
 
-- 
2.21.0



Re: Index Skip Scan

2019-07-09 Thread Jesper Pedersen

Hi,

On 7/4/19 6:59 AM, Thomas Munro wrote:

For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
}, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.




David, are you thinking about something like the attached ?

Some questions.

* Do you see UniqueKey as a "complete" planner node ?
  - I didn't update the nodes/*.c files for this yet

* Is a UniqueKey with a list of EquivalenceClass entries best, or a list 
of UniqueKeys, each with a single EquivalenceClass ?


Likely more questions around this coming -- should this be a separate 
thread ?


Based on this I'll start to update the v21 patch to use UniqueKey, and 
post a new version.



While updating the Loose Index Scan wiki page with links to other
products' terminology on this subject, I noticed that MySQL can
skip-scan MIN() and MAX() in the same query.  Hmm.  That seems quite
desirable.  I think it requires a new kind of skipping: I think you
have to be able to skip to the first AND last key that has each
distinct prefix, and then stick a regular agg on top to collapse them
into one row.  Such a path would not be so neatly describable by
UniqueKeys, or indeed by the amskip() interface in the current patch.
I mention all this stuff not because I want us to run before we can
walk, but because to be ready to commit the basic distinct skip scan
feature, I think we should know approximately how it'll handle the
future stuff we'll need.



Thomas, do you have any ideas for this ? I can see that MySQL 
implemented the functionality in two change sets (base and function 
support), but like you said we shouldn't paint ourselves into a corner.


Feedback greatly appreciated.

Best regards,
 Jesper
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 4b9e141404..2e07db2e6e 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,44 @@ print_pathkeys(const List *pathkeys, const List *rtable)
printf(")\n");
 }
 
+/*
+ * print_unique_key -
+ *   unique_key an UniqueKey
+ */
+void
+print_unique_key(const UniqueKey *unique_key, const List *rtable)
+{
+   ListCell   *l;
+
+   printf("(");
+   foreach(l, unique_key->eq_clauses)
+   {
+   EquivalenceClass *eclass = (EquivalenceClass *) lfirst(l);
+   ListCell   *k;
+   boolfirst = true;
+
+   /* chase up */
+   while (eclass->ec_merged)
+   eclass = eclass->ec_merged;
+
+   printf("(");
+   foreach(k, eclass->ec_members)
+   {
+   EquivalenceMember *mem = (EquivalenceMember *) 
lfirst(k);
+
+   if (first)
+   first = false;
+   else
+   printf(", ");
+   print_expr((Node *) mem->em_expr, rtable);
+   }
+   printf(")");
+   if (lnext(l))
+   printf(", ");
+   }
+   printf(")\n");
+}
+
 /*
  * print_tl
  *   print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile 
b/src/backend/optimizer/path/Makefile
index 6864a62132..8249a6b6db 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = allpaths.o clausesel.o costsize.o equivclass.o indxpath.o \
-   joinpath.o joinrels.o pathkeys.o tidpath.o
+   joinpath.o joinrels.o pathkeys.o tidpath.o uniquekey.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c 
b/src/backend/optimizer/path/allpaths.c
index b7723481b0..a8c8fe8a30 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3957,6 +3957,14 @@ print_path(PlannerInfo *root, Path *path, int indent)
print_pathkeys(path->pathkeys, root->parse->rtable);
}
 
+   if (path->unique_key)
+   {
+   for (i = 0; i < indent; i++)
+   printf("\t");
+   printf("  unique_key: ");
+   print_unique_key(path->unique_key, root->parse->rtable);
+   }
+
if (join)
{
JoinPath   *jp = (JoinPath *) path;
diff --git a/src/backend/optimizer/path/costsize.c 
b/src/backend/optimizer/path/costsize.c
index a2a9b1f7be..dbd0bbf3dc 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -705,6 +705,11 @@ cost_index(IndexPath *path, PlannerInfo *root, double 
loop_count,
path->path.parallel_aware = true;
}
 
+   /* Consider cost based on unique key */
+   if (path->path.unique_key)
+   {
+   }
+
/*
 * Now interpolate based on estimated index order correlation to get 
total
 * disk I/O cost for main table accesses.
diff --git a/src/backend/optimizer/path/uniquekey.c 

Re: Index Skip Scan

2019-07-03 Thread Jesper Pedersen

Hi Thomas and David,

Thanks for the feedback !

On 7/2/19 8:27 AM, David Rowley wrote:

On Tue, 2 Jul 2019 at 21:00, Thomas Munro  wrote:

I took this for a quick spin today.  The DISTINCT ON support is nice
and I think it will be very useful.  I've signed up to review it and
will have more to say later.  But today I had a couple of thoughts
after looking into how src/backend/optimizer/plan/planagg.c works and
wondering how to do some more skipping tricks with the existing
machinery.

1.  SELECT COUNT(DISTINCT i) FROM t could benefit from this.  (Or
AVG(DISTINCT ...) or any other aggregate).  Right now you get a seq
scan, with the sort/unique logic inside the Aggregate node.  If you
write SELECT COUNT(*) FROM (SELECT DISTINCT i FROM t) ss then you get
a skip scan that is much faster in good cases.  I suppose you could
have a process_distinct_aggregates() in planagg.c that recognises
queries of the right form and generates extra paths a bit like
build_minmax_path() does.  I think it's probably better to consider
that in the grouping planner proper instead.  I'm not sure.


I think to make that happen we'd need to do a bit of an overhaul in
nodeAgg.c to allow it to make use of presorted results instead of
having the code blindly sort rows for each aggregate that has a
DISTINCT or ORDER BY.  The planner would also then need to start
requesting paths with pathkeys that suit the aggregate and also
probably dictate the order the AggRefs should be evaluated to allow
all AggRefs to be evaluated that can be for each sort order.  Once
that part is done then the aggregates could then also request paths
with certain "UniqueKeys" (a feature I mentioned in [1]), however we'd
need to be pretty careful with that one since there may be other parts
of the query that require that all rows are fed in, not just 1 row per
value of "i", e.g SELECT COUNT(DISTINCT i) FROM t WHERE z > 0; can't
just feed through 1 row for each "i" value, since we need only the
ones that have "z > 0".  Getting the first part of this solved is much
more important than making skip scans work here, I'd say. I think we
need to be able to walk before we can run with DISTINCT  / ORDER BY
aggs.



I agree that the above is outside of scope for the first patch -- I 
think the goal should be the simple use-cases for IndexScan and 
IndexOnlyScan.


Maybe we should expand [1] with possible cases, so we don't lose track 
of the ideas.



2.  SELECT i, MIN(j) FROM t GROUP BY i could benefit from this if
you're allowed to go forwards.  Same for SELECT i, MAX(j) FROM t GROUP
BY i if you're allowed to go backwards.  Those queries are equivalent
to SELECT DISTINCT ON (i) i, j FROM t ORDER BY i [DESC], j [DESC]
(though as Floris noted, the backwards version gives the wrong answers
with v20).  That does seem like a much more specific thing applicable
only to MIN and MAX, and I think preprocess_minmax_aggregates() could
be taught to handle that sort of query, building an index only scan
path with skip scan in build_minmax_path().


For the MIN query you just need a path with Pathkeys: { i ASC, j ASC
}, UniqueKeys: { i, j }, doing the MAX query you just need j DESC.



Ok.
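As a quick sanity check of that equivalence (plain Python standing in 
for the SQL, with a tiny made-up table t):

```python
# t holds (i, j) rows; check that GROUP BY i with MIN(j) matches
# DISTINCT ON (i) ... ORDER BY i, j (i.e. keep the first row per i).
t = [(1, 20), (2, 9), (1, 10), (2, 5)]

group_min = sorted((i, min(j for k, j in t if k == i))
                   for i in {k for k, _ in t})

distinct_on = []
for row in sorted(t):                     # ORDER BY i, j
    if not distinct_on or distinct_on[-1][0] != row[0]:
        distinct_on.append(row)           # first row per distinct i

assert group_min == distinct_on == [(1, 10), (2, 5)]
```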


The more I think about these UniqueKeys, the more I think they need to
be a separate concept to PathKeys. For example, UniqueKeys: { x, y }
should be equivalent to { y, x }, but with PathKeys, that's not the
case, since the order of each key matters. UniqueKeys equivalent
version of pathkeys_contained_in() would not care about the order of
individual keys, it would say things like, { a, b, c } is contained in
{ b, a }, since if the path is unique on columns { b, a } then it must
also be unique on { a, b, c }.



I'm looking at this, and will keep this in mind.
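A minimal sketch of that order-insensitive check (plain Python; the 
real uniquekeys_contained_in() would of course be C in uniquekeys.c, 
likely over canonical keys rather than plain sets):

```python
def uniquekeys_contained_in(keys1, keys2):
    """True when keys1 is a superset of keys2: a path known to be
    unique on keys2 is also unique on any superset such as keys1."""
    return set(keys2).issubset(set(keys1))

# { a, b, c } is contained in { b, a } ...
assert uniquekeys_contained_in({"a", "b", "c"}, {"b", "a"})
# ... but { b, a } is not contained in { a, b, c }
assert not uniquekeys_contained_in({"b", "a"}, {"a", "b", "c"})
```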

Thanks !

[1] https://wiki.postgresql.org/wiki/Loose_indexscan

Best regards,
 Jesper





Re: POC: converting Lists into arrays

2019-07-01 Thread Jesper Pedersen

Hi,

On 7/1/19 2:44 PM, Tom Lane wrote:

Yup, here's a rebase against HEAD (and I also find that check-world shows
no problems).


Thanks - no further comments.


 This is pretty much of a pain to maintain, since it changes
the API for lnext() which is, um, a bit invasive.  I'd like to make a
decision pretty quickly on whether we're going to do this, and either
commit this patch or abandon it.



IMHO it is an improvement over the existing API.


There is some unneeded MemoryContext stuff in async.c's
pg_listening_channels() which should be cleaned up.


Yeah, there's a fair amount of follow-on cleanup that could be undertaken
afterwards, but I've wanted to keep the patch's footprint as small as
possible for the moment.  Assuming we pull the trigger, I'd then go look
at removing the planner's duplicative lists+arrays for RTEs and such as
the first cleanup step.  But thanks for the pointer to async.c, I'll
check that too.



Yeah, I only called out the async.c change, as that function isn't 
likely to change in any of the follow-up patches.


Best regards,
 Jesper




Re: POC: converting Lists into arrays

2019-07-01 Thread Jesper Pedersen

Hi,

On 5/25/19 11:48 AM, Tom Lane wrote:

The cfbot noticed a set-but-not-used variable that my compiler hadn't
whined about.  Here's a v5 to pacify it.  No other changes.



This needs a rebase. After that check-world passes w/ and w/o 
-DDEBUG_LIST_MEMORY_USAGE.


There is some unneeded MemoryContext stuff in async.c's 
pg_listening_channels() which should be cleaned up.


Thanks for working on this, as the API is more explicit now about what 
is going on.


Best regards,
 Jesper




pg_receivewal documentation

2019-06-27 Thread Jesper Pedersen

Hi,

Here is a patch for the pg_receivewal documentation to highlight that 
WAL isn't acknowledged as applied.


I'll add a CF entry for it.

Best regards,
 Jesper
>From 138e09c74db5ea08fd03b4e77853e2ca01742512 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Thu, 27 Jun 2019 09:54:44 -0400
Subject: [PATCH] Highlight that pg_receivewal doesn't acknowledge that WAL has
 been applied, and as such synchronous-commit needs to be remote_write or
 lower.

Author: Jesper Pedersen 
---
 doc/src/sgml/ref/pg_receivewal.sgml | 8 
 1 file changed, 8 insertions(+)

diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml
index 0506120c00..132a599d1b 100644
--- a/doc/src/sgml/ref/pg_receivewal.sgml
+++ b/doc/src/sgml/ref/pg_receivewal.sgml
@@ -207,6 +207,14 @@ PostgreSQL documentation
 server as a synchronous standby, to ensure that timely feedback is
 sent to the server.

+
+   <para>
+    Note, that <application>pg_receivewal</application> doesn't acknowledge
+    that the Write-Ahead Log (<xref linkend="wal"/>) has been applied, so
+    <xref linkend="guc-synchronous-commit"/> needs to have a setting
+    of <literal>remote_write</literal> or lower, if <application>pg_receivewal</application>
+    is the only standby.
+   </para>
   
  
 
-- 
2.21.0



Re: Index Skip Scan

2019-06-20 Thread Jesper Pedersen

Hi,

On 6/19/19 9:57 AM, Dmitry Dolgov wrote:

Here is what I was talking about: a POC for an integration with index scan. As for
the use of create_skipscan_unique_path and the suggested planner improvements, I hope
that together with Jesper we can come up with something soon.



I made some minor changes, but I did move all the code in 
create_distinct_paths() under enable_indexskipscan to limit the overhead 
if skip scan isn't enabled.


Attached is v20, since the last patch should have been v19.

Best regards,
 Jesper
>From 4fd4bd601f510ccce858196c0e93d37aa2d0f20f Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Thu, 20 Jun 2019 07:42:24 -0400
Subject: [PATCH 1/2] Implementation of Index Skip Scan (see Loose Index Scan
 in the wiki [1]) on top of IndexScan and IndexOnlyScan. To make it suitable
 for both situations when there are small number of distinct values and
 significant amount of distinct values the following approach is taken -
 instead of searching from the root for every value we're searching for then
 first on the current page, and then if not found continue searching from the
 root.

Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
 contrib/bloom/blutils.c   |   1 +
 doc/src/sgml/config.sgml  |  16 ++
 doc/src/sgml/indexam.sgml |  10 +
 doc/src/sgml/indices.sgml |  24 ++
 src/backend/access/brin/brin.c|   1 +
 src/backend/access/gin/ginutil.c  |   1 +
 src/backend/access/gist/gist.c|   1 +
 src/backend/access/hash/hash.c|   1 +
 src/backend/access/index/indexam.c|  16 ++
 src/backend/access/nbtree/nbtree.c|  12 +
 src/backend/access/nbtree/nbtsearch.c | 224 +-
 src/backend/access/spgist/spgutils.c  |   1 +
 src/backend/commands/explain.c|  24 ++
 src/backend/executor/nodeIndexonlyscan.c  |  22 ++
 src/backend/executor/nodeIndexscan.c  |  22 ++
 src/backend/nodes/copyfuncs.c |   2 +
 src/backend/nodes/outfuncs.c  |   3 +
 src/backend/nodes/readfuncs.c |   2 +
 src/backend/optimizer/path/costsize.c |   1 +
 src/backend/optimizer/path/pathkeys.c |  84 ++-
 src/backend/optimizer/plan/createplan.c   |  20 +-
 src/backend/optimizer/plan/planagg.c  |   1 +
 src/backend/optimizer/plan/planner.c  |  79 +-
 src/backend/optimizer/util/pathnode.c |  40 
 src/backend/optimizer/util/plancat.c  |   1 +
 src/backend/utils/misc/guc.c  |   9 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/amapi.h|   5 +
 src/include/access/genam.h|   1 +
 src/include/access/nbtree.h   |   5 +
 src/include/nodes/execnodes.h |   7 +
 src/include/nodes/pathnodes.h |   8 +
 src/include/nodes/plannodes.h |   2 +
 src/include/optimizer/cost.h  |   1 +
 src/include/optimizer/pathnode.h  |   5 +
 src/include/optimizer/paths.h |   4 +
 src/test/regress/expected/create_index.out|   1 +
 src/test/regress/expected/select_distinct.out | 209 
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/sql/create_index.sql |   2 +
 src/test/regress/sql/select_distinct.sql  |  68 ++
 41 files changed, 918 insertions(+), 22 deletions(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index ee3bd56274..a88b730f2e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
 	amroutine->amcostestimate = blcostestimate;
 	amroutine->amoptions = bloptions;
 	amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..a1c8a1ea27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,22 @@ ANY num_sync ( 
+  enable_indexskipscan (boolean)
+  
+   enable_indexskipscan configuration parameter
+  
+  
+  
+   
+Enables or disables the query planner's use of index-skip-scan plan
+types (see ). This parameter requires
+that enable_indexonlyscan is on.
+The default is on.
+   
+  
+ 
+
  
   enable_material (boolean)
   
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index dd54c68802..c2eb296306 100644
--- a/doc/src/sgml/indexam

Re: Index Skip Scan

2019-06-14 Thread Jesper Pedersen

Hi David,

On 6/14/19 3:19 AM, David Rowley wrote:

I read over this thread a few weeks ago while travelling back from
PGCon. (I wish I'd read it on the outward trip instead since it would
have been good to talk about it in person.)

First off. I think this is a pretty great feature. It certainly seems
worthwhile working on it.

I've looked over the patch just to get a feel for how the planner part
works and I have a few ideas to share.

The code in create_distinct_paths() I think should work a different
way. I think it would be much better to add a new field to Path and
allow a path to know what keys it is distinct for.  This sort of goes
back to an idea I thought about when developing unique joins
(9c7f5229ad) about an easier way to detect fields that a relation is
unique for. I've been calling these "UniqueKeys" in a few emails [1].
The idea was to tag these onto RelOptInfo to mention which columns or
exprs a relation is unique by so that we didn't continuously need to
look at unique indexes in all the places that call
relation_has_unique_index_for().  The idea there was that unique joins
would know when a join was unable to duplicate rows. If the outer side
of a join didn't duplicate the inner side, then the join RelOptInfo
could keep the UniqueKeys from the inner rel, and vice-versa. If both
didn't duplicate then the join rel would obtain the UniqueKeys from
both sides of the join.  The idea here is that this would be a better
way to detect unique joins, and also when it came to the grouping
planner we'd also know if the distinct or group by should be a no-op.
DISTINCT could be skipped, and GROUP BY could do a group aggregate
without any sort.

I think these UniqueKeys ties into this work, perhaps not adding
UniqueKeys to RelOptInfo, but just to Path so that we create paths
that have UniqueKeys during create_index_paths() based on some
uniquekeys that are stored in PlannerInfo, similar to how we create
index paths in build_index_paths() by checking if the index
has_useful_pathkeys().  Doing it this way would open up more
opportunities to use skip scans. For example, semi-joins and
anti-joins could make use of them if the uniquekeys covered the entire
join condition. With this idea, the code you've added in
create_distinct_paths() can just search for the cheapest path that has
the correct uniquekeys, or if none exist then just do the normal
sort->unique or hash agg.  I'm not entirely certain how we'd instruct
a semi/anti joined relation to build such paths, but that seems like a
problem that could be dealt with when someone does the work to allow
skip scans to be used for those.

Also, I'm not entirely sure that these UniqueKeys should make use of
PathKey since there's no real need to know about pk_opfamily,
pk_strategy, pk_nulls_first as those all just describe how the keys
are ordered. We just need to know if they're distinct or not. All
that's left after removing those fields is pk_eclass, so could
UniqueKeys just be a list of EquivalenceClass? or perhaps even a
Bitmapset with indexes into PlannerInfo->ec_classes (that might be
premature for not since we've not yet got
https://commitfest.postgresql.org/23/1984/ or
https://commitfest.postgresql.org/23/2019/ )  However, if we did use
PathKey, that does allow us to quickly check if the UniqueKeys are
contained within the PathKeys, since pathkeys are canonical which
allows us just to compare their memory address to know if two are
equal, however, if you're storing eclasses we could probably get the
same just by comparing the address of the eclass to the pathkey's
pk_eclass.

Otherwise, I think how you're making use of paths in
create_distinct_paths() and create_skipscan_unique_path() kind of
contradicts how they're meant to be used.



Thank you very much for this feedback ! Will need to revise the patch 
based on this.
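
As a toy illustration of the UniqueKeys idea above (invented names, not the actual planner API): if a path is already known to be unique on a subset of the DISTINCT columns, the distinct step becomes a no-op.

```python
def distinct_is_noop(unique_keys, distinct_cols):
    """True if some unique key is covered by the DISTINCT columns,
    so no Sort->Unique or HashAggregate node is needed."""
    distinct_cols = set(distinct_cols)
    return any(uk <= distinct_cols for uk in unique_keys)

path_unique_keys = [frozenset({"a"})]      # e.g. a unique index on (a)
print(distinct_is_noop(path_unique_keys, {"a", "b"}))  # -> True
print(distinct_is_noop(path_unique_keys, {"b"}))       # -> False
```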



I also agree with James that this should not be limited to Index Only
Scans. From testing the patch, the following seems pretty strange to
me:

# create table abc (a int, b int, c int);
CREATE TABLE
# insert into abc select a,b,1 from generate_Series(1,1000) a,
generate_Series(1,1000) b;
INSERT 0 100
# create index on abc(a,b);
CREATE INDEX
# explain analyze select distinct on (a) a,b from abc order by a,b; --
this is fast.
  QUERY PLAN
-
  Index Only Scan using abc_a_b_idx on abc  (cost=0.42..85.00 rows=200
width=8) (actual time=0.260..20.518 rows=1000 loops=1)
Scan mode: Skip scan
Heap Fetches: 1000
  Planning Time: 5.616 ms
  Execution Time: 21.791 ms
(5 rows)


# explain analyze select distinct on (a) a,b,c from abc order by a,b;
-- Add one more column and it's slow.
 QUERY PLAN

Re: Index Skip Scan

2019-06-13 Thread Jesper Pedersen

Hi,

On 6/5/19 3:39 PM, Floris Van Nee wrote:

Thanks! I've verified that it works now.


Here is a rebased version.


I was wondering if we're not too strict in some cases now though. Consider the 
following queries:


[snip]


This is basically the opposite case - when distinct_pathkeys matches the 
filtered list of index keys, an index skip scan could be considered. Currently, 
the user needs to write 'distinct m,f' explicitly, even though he specifies in 
the WHERE-clause that 'm' can only have one value anyway. Perhaps it's fine 
like this, but it could be a small improvement for consistency.



I think it would be good to get more feedback on the patch in general 
before looking at further optimizations. We should of course fix any 
bugs that show up.


Thanks for your testing and feedback !

Best regards,
 Jesper
>From afebbc8a844b59d1037ac7fe66131cc3eabcb5ae Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Thu, 13 Jun 2019 09:04:14 -0400
Subject: [PATCH] Implementation of Index Skip Scan (see Loose Index Scan in
 the wiki [1]) on top of IndexOnlyScan. To make it suitable for both
 situations when there are small number of distinct values and significant
 amount of distinct values the following approach is taken - instead of
 searching from the root for every value we're searching for then first on the
 current page, and then if not found continue searching from the root.

Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
---
 contrib/bloom/blutils.c   |   1 +
 doc/src/sgml/config.sgml  |  16 ++
 doc/src/sgml/indexam.sgml |  10 +
 doc/src/sgml/indices.sgml |  24 ++
 src/backend/access/brin/brin.c|   1 +
 src/backend/access/gin/ginutil.c  |   1 +
 src/backend/access/gist/gist.c|   1 +
 src/backend/access/hash/hash.c|   1 +
 src/backend/access/index/indexam.c|  16 ++
 src/backend/access/nbtree/nbtree.c|  12 +
 src/backend/access/nbtree/nbtsearch.c | 224 +-
 src/backend/access/spgist/spgutils.c  |   1 +
 src/backend/commands/explain.c|  12 +
 src/backend/executor/nodeIndexonlyscan.c  |  22 ++
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/outfuncs.c  |   2 +
 src/backend/nodes/readfuncs.c |   1 +
 src/backend/optimizer/path/costsize.c |   1 +
 src/backend/optimizer/path/pathkeys.c |  84 ++-
 src/backend/optimizer/plan/createplan.c   |  10 +-
 src/backend/optimizer/plan/planagg.c  |   1 +
 src/backend/optimizer/plan/planner.c  |  67 +-
 src/backend/optimizer/util/pathnode.c |  40 
 src/backend/optimizer/util/plancat.c  |   1 +
 src/backend/utils/misc/guc.c  |   9 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/amapi.h|   5 +
 src/include/access/genam.h|   1 +
 src/include/access/nbtree.h   |   5 +
 src/include/nodes/execnodes.h |   4 +
 src/include/nodes/pathnodes.h |   8 +
 src/include/nodes/plannodes.h |   1 +
 src/include/optimizer/cost.h  |   1 +
 src/include/optimizer/pathnode.h  |   5 +
 src/include/optimizer/paths.h |   4 +
 src/test/regress/expected/create_index.out|   1 +
 src/test/regress/expected/select_distinct.out | 176 ++
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/sql/create_index.sql |   2 +
 src/test/regress/sql/select_distinct.sql  |  55 +
 40 files changed, 812 insertions(+), 19 deletions(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index ee3bd56274..a88b730f2e 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -129,6 +129,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
 	amroutine->amcostestimate = blcostestimate;
 	amroutine->amoptions = bloptions;
 	amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..a1c8a1ea27 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4400,6 +4400,22 @@ ANY num_sync ( 
+  enable_indexskipscan (boolean)
+  
+   enable_indexskipscan configuration parameter
+  
+  
+  
+   
+Enables or disables the query planner's use of index-skip-scan plan
+types (see ). This parameter requires
+t

Re: Index Skip Scan

2019-06-03 Thread Jesper Pedersen

Hi Rafia,

On 6/1/19 6:03 AM, Rafia Sabih wrote:

Here is my repeatable test case,

create table t (market text, feedcode text, updated_at timestamptz,
value float8) ;
create index on t (market, feedcode, updated_at desc);
insert into t values('TEST', 'abcdef', (select timestamp '2019-01-10
20:00:00' + random() * (timestamp '2014-01-20 20:00:00' - timestamp
'2019-01-20 20:00:00') ), generate_series(1,100)*9.88);
insert into t values('TEST', 'jsgfhdfjd', (select timestamp
'2019-01-10 20:00:00' + random() * (timestamp '2014-01-20 20:00:00' -
timestamp '2019-01-20 20:00:00') ), generate_series(1,100)*9.88);

Now, without the patch,
select distinct on (market, feedcode) market, feedcode from t  where
market='TEST';
  market | feedcode
+---
  TEST   | abcdef
  TEST   | jsgfhdfjd
(2 rows)
explain select distinct on (market, feedcode) market, feedcode from t
where market='TEST';
QUERY PLAN

  Unique  (cost=12.20..13.21 rows=2 width=13)
->  Sort  (cost=12.20..12.70 rows=201 width=13)
  Sort Key: feedcode
  ->  Seq Scan on t  (cost=0.00..4.51 rows=201 width=13)
Filter: (market = 'TEST'::text)
(5 rows)

And with the patch,
select distinct on (market, feedcode) market, feedcode from t   where
market='TEST';
  market | feedcode
+--
  TEST   | abcdef
(1 row)

explain select distinct on (market, feedcode) market, feedcode from t
where market='TEST';
QUERY PLAN

  Index Only Scan using t_market_feedcode_updated_at_idx on t
(cost=0.14..0.29 rows=2 width=13)
Scan mode: Skip scan
Index Cond: (market = 'TEST'::text)
(3 rows)

Notice that the explain output shows the correct number of rows
to be skipped.



Thanks for your test case; this is very helpful.

For now, I would like to highlight that

 SET enable_indexskipscan = OFF

can be used for testing with the patch applied.

Dmitry and I will look at the feedback provided.

Best regards,
 Jesper
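
For intuition, here is a rough Python model of the skip scan behaviour shown in the plans above (illustrative only; `bisect` on a sorted list stands in for re-descending the btree from the root to the next distinct leading value):

```python
import bisect

def skip_scan_distinct(index):
    """Return the distinct leading-column values of a sorted
    (leading, trailing) index by jumping past duplicate groups
    instead of visiting every entry."""
    out = []
    i = 0
    while i < len(index):
        key = index[i][0]
        out.append(key)
        # "Skip" to the first entry with a larger leading value.
        i = bisect.bisect_right(index, (key, float("inf")), i)
    return out

index = sorted((m, f) for m in ("TEST", "PROD") for f in range(1000))
print(skip_scan_distinct(index))  # -> ['PROD', 'TEST']
```

With few distinct leading values, each jump covers many entries, which is why the skip scan plan wins over a full index scan plus Unique.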




Re: Index Skip Scan

2019-06-03 Thread Jesper Pedersen

Hi Floris,

On 6/1/19 12:10 AM, Floris Van Nee wrote:

Given a table definition of (market text, feedcode text, updated_at 
timestamptz, value float8) and an index on (market, feedcode, updated_at desc) 
(note that this table slightly deviates from what I described in my previous 
mail) and filling it with data.


The following query uses an index skip scan and returns just 1 row (incorrect!)

select distinct on (market, feedcode) market, feedcode
from streams.base_price
where market='TEST'

The following query still uses the regular index scan and returns many more 
rows (correct)
select distinct on (market, feedcode) *
from streams.base_price
where market='TEST'


It seems that partially filtering on one of the distinct columns triggers 
incorrect behavior where too many rows in the index are skipped.




Thanks for taking a look at the patch, and your feedback on it.

I'll def look into this once I'm back from my travels.

Best regards,
 Jesper




Re: New committer: David Rowley

2019-05-31 Thread Jesper Pedersen

On 5/30/19 11:39 AM, Magnus Hagander wrote:

Congratulations to David, may the buildfarm be gentle to him, and his first
revert far away!



Congrats !

Best regards,
 Jesper





Re: COLLATE: Hash partition vs UPDATE

2019-04-09 Thread Jesper Pedersen

Hi Amit,

On 4/8/19 11:18 PM, Amit Langote wrote:

As of this commit, hashing functions hashtext() and hashtextextended()
require a valid collation to be passed in.  ISTM,
satisfies_hash_partition() that's called by hash partition constraint
checking should have been changed to use FunctionCall2Coll() interface to
account for the requirements of the above commit.  I see that it did that
for compute_partition_hash_value(), which is used by hash partition tuple
routing.  That also seems to be covered by regression tests, but there are
no tests that cover satisfies_hash_partition().

Attached patch is an attempt to fix this.  I've also added Amul Sul who
can maybe comment on the satisfies_hash_partition() changes.



Yeah, that works here - apart from an issue with the test case; fixed in 
the attached.


Best regards,
 Jesper
>From 1902dcf1fd31c95fd95e1231630b3c446857cd59 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 9 Apr 2019 07:35:28 -0400
Subject: [PATCH] 0001

---
 src/backend/partitioning/partbounds.c   | 12 +---
 src/test/regress/expected/hash_part.out | 14 ++
 src/test/regress/sql/hash_part.sql  | 10 ++
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index c8770ccfee..331dab3954 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -2778,6 +2778,7 @@ satisfies_hash_partition(PG_FUNCTION_ARGS)
 		bool		variadic_typbyval;
 		char		variadic_typalign;
 		FmgrInfo	partsupfunc[PARTITION_MAX_KEYS];
+		Oid			partcollid[PARTITION_MAX_KEYS];
 	} ColumnsHashData;
 	Oid			parentId;
 	int			modulus;
@@ -2865,6 +2866,7 @@ satisfies_hash_partition(PG_FUNCTION_ARGS)
fmgr_info_copy(&my_extra->partsupfunc[j],
			   &key->partsupfunc[j],
			   fcinfo->flinfo->fn_mcxt);
+				my_extra->partcollid[j] = key->partcollation[j];
 			}
 
 		}
@@ -2888,6 +2890,7 @@ satisfies_hash_partition(PG_FUNCTION_ARGS)
 
 			/* check argument types */
 			for (j = 0; j < key->partnatts; ++j)
+			{
 if (key->parttypid[j] != my_extra->variadic_type)
 	ereport(ERROR,
 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -2895,6 +2898,8 @@ satisfies_hash_partition(PG_FUNCTION_ARGS)
 	j + 1,
 	format_type_be(key->parttypid[j]),
	format_type_be(my_extra->variadic_type)));
+				my_extra->partcollid[j] = key->partcollation[j];
+			}
 
fmgr_info_copy(&my_extra->partsupfunc[0],
		   &key->partsupfunc[0],
@@ -2928,9 +2933,10 @@ satisfies_hash_partition(PG_FUNCTION_ARGS)
 
 			Assert(OidIsValid(my_extra->partsupfunc[i].fn_oid));
 
-			hash = FunctionCall2(&my_extra->partsupfunc[i],
-								 PG_GETARG_DATUM(argno),
-								 seed);
+			hash = FunctionCall2Coll(&my_extra->partsupfunc[i],
+									 my_extra->partcollid[i],
+									 PG_GETARG_DATUM(argno),
+									 seed);
 
 			/* Form a single 64-bit hash value */
 			rowHash = hash_combine64(rowHash, DatumGetUInt64(hash));
diff --git a/src/test/regress/expected/hash_part.out b/src/test/regress/expected/hash_part.out
index 731d26fc3d..adb4410b4b 100644
--- a/src/test/regress/expected/hash_part.out
+++ b/src/test/regress/expected/hash_part.out
@@ -99,6 +99,20 @@ ERROR:  number of partitioning columns (2) does not match number of partition ke
 SELECT satisfies_hash_partition('mcinthash'::regclass, 4, 0,
 variadic array[now(), now()]);
 ERROR:  column 1 of the partition key has type "integer", but supplied value is of type "timestamp with time zone"
+-- check satisfies_hash_partition passes correct collation
+create table text_hashp (a text) partition by hash (a);
+create table text_hashp0 partition of text_hashp for values with (modulus 2, remainder 0);
+create table text_hashp1 partition of text_hashp for values with (modulus 2, remainder 1);
+-- The result here should always be true, because 'xxx' must belong to at least
+-- one of the two defined partitions
+select satisfies_hash_partition('text_hashp'::regclass, 2, 0, 'xxx'::text) OR
+	   satisfies_hash_partition('text_hashp'::regclass, 2, 1, 'xxx'::text) AS satisfies;
+ satisfies 
+---
+ t
+(1 row)
+
 -- cleanup
 DROP TABLE mchash;
 DROP TABLE mcinthash;
+DROP TABLE text_hashp;
diff --git a/src/test/regress/sql/hash_part.sql b/src/test/regress/sql/hash_part.sql
index f457ac344c..8cd9e8a8a2 100644
--- a/src/test/regress/sql/hash_part.sql
+++ b/src/test/regress/sql/hash_part.sql
@@ -75,6 +75,16 @@ SELECT satisfies_hash_partition('mcinthash'::regclass, 4, 0,
 SELECT satisfies_hash_partition('mcinthash'::regclass, 4, 0,
 variadic array[now(), now()]);
 
+-- check satisfies_hash_partition passes correct collation
+create table text_hashp (a text) partition by hash (a);
+create table text_hashp0 partition of text_hashp for values with (modulus 2, remainder 0);
+create table text_hashp1 partition of text_hashp for values with (modulus 2, remainder 1);
+-- The result here should always be true, because 'xxx' must 

COLLATE: Hash partition vs UPDATE

2019-04-08 Thread Jesper Pedersen

Hi,

The following case

-- test.sql --
CREATE TABLE test (a text PRIMARY KEY, b text) PARTITION BY HASH (a);
CREATE TABLE test_p0 PARTITION OF test FOR VALUES WITH (MODULUS 2, 
REMAINDER 0);
CREATE TABLE test_p1 PARTITION OF test FOR VALUES WITH (MODULUS 2, 
REMAINDER 1);

-- CREATE INDEX idx_test_b ON test USING HASH (b);

INSERT INTO test VALUES ('', '');

-- Regression
UPDATE test SET b = '' WHERE a = '';
-- test.sql --

fails on master, which includes [1], with


psql:test.sql:9: ERROR:  could not determine which collation to use for 
string hashing

HINT:  Use the COLLATE clause to set the collation explicitly.


It passes on 11.x.

I'll add it to the open items list.

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=5e1963fb764e9cc092e0f7b58b28985c311431d9


Best regards,
 Jesper
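
The failing case can be pictured with a toy Python model of satisfies_hash_partition (illustrative only; Python's hash() merely stands in for PostgreSQL's collation-aware hashtext(), whose missing collation is what triggers the error above). The key property is that every value must match exactly one (modulus, remainder) pair.

```python
def partition_remainder(value, modulus):
    # Stand-in for hashtext(); in PostgreSQL this call must receive
    # a valid collation or it fails as in the reported error.
    return hash(value) % modulus

def satisfies_hash_partition(value, modulus, remainder):
    return partition_remainder(value, modulus) == remainder

# A value belongs to exactly one of the two partitions.
matches = [r for r in range(2) if satisfies_hash_partition("xxx", 2, r)]
print(len(matches))  # -> 1
```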




Re: partitioned tables referenced by FKs

2019-04-03 Thread Jesper Pedersen

On 4/3/19 1:52 PM, Alvaro Herrera wrote:

Pushed, many thanks Amit and Jesper for reviewing.



Thank you for working on this feature.

Best regards,
 Jesper





Re: partitioned tables referenced by FKs

2019-04-02 Thread Jesper Pedersen

Hi,

On 4/1/19 4:03 PM, Alvaro Herrera wrote:

I'm satisfied with this patch now, so I intend to push early tomorrow.



Passes check-world, and my tests.

Best regards,
 Jesper




Re: partitioned tables referenced by FKs

2019-03-29 Thread Jesper Pedersen

Hi,

On 3/29/19 12:29 PM, Alvaro Herrera wrote:

On 2019-Mar-29, Jesper Pedersen wrote:

Maybe the "(" / ")" in the CASCADE description should be removed from
ref/drop_table.sgml as part of this patch.


I'm not sure what text you propose to remove?



Just the attached.


Should catalogs.sgml be updated for this case ?


I'm not adding any new dependency type, nor modifying any existing one.
Are you looking at anything that needs changing specifically?



No, I just mentioned it in case it was a requirement that the precise 
use-case for each dependency type be documented.


Best regards,
 Jesper
diff --git a/doc/src/sgml/ref/drop_table.sgml b/doc/src/sgml/ref/drop_table.sgml
index bf8996d198..328157c849 100644
--- a/doc/src/sgml/ref/drop_table.sgml
+++ b/doc/src/sgml/ref/drop_table.sgml
@@ -41,9 +41,9 @@ DROP TABLE [ IF EXISTS ] name [, ..
triggers, and constraints that exist for the target table.
However, to drop a table that is referenced by a view or a foreign-key
constraint of another table, CASCADE must be
-   specified. (CASCADE will remove a dependent view 
entirely,
+   specified. CASCADE will remove a dependent view entirely,
but in the foreign-key case it will only remove the foreign-key
-   constraint, not the other table entirely.)
+   constraint, not the other table entirely.
   
  
 


Re: partitioned tables referenced by FKs

2019-03-29 Thread Jesper Pedersen

Hi,

On 3/29/19 11:22 AM, Alvaro Herrera wrote:

On 2019-Mar-29, Jesper Pedersen wrote:


Could you expand a bit on the change to DEPENDENCY_INTERNAL instead of
DEPENDENCY_PARTITION_PRI / DEPENDENCY_PARTITION_SEC?


The PARTITION dependencies work in a way that doesn't do what we want.
Admittedly, neither does INTERNAL, but at least it's less bad.


If you run "DROP TABLE t2_p32 CASCADE" the foreign key constraint is removed
from all of t1.


Yes.  CASCADE is always a dangerous tool; if you run the DROP partition
without cascade, it explicitly lists that the constraint is going to be
dropped.

If you get in the habit of added CASCADE to all your drops, you're going
to lose data pretty quickly.  In this case, no data is lost, only a
constraint.



Thanks !

Maybe the "(" / ")" in the CASCADE description should be removed from 
ref/drop_table.sgml as part of this patch.


Should catalogs.sgml be updated for this case ?

Best regards,
 Jesper




Re: partitioned tables referenced by FKs

2019-03-29 Thread Jesper Pedersen

Hi Alvaro,

On 3/28/19 2:59 PM, Alvaro Herrera wrote:

I ended up revising the dependencies that we give to the constraint in
the partition -- instead of giving it partition-type dependencies, we
give it an INTERNAL dependency.  Now when you request to drop the
partition, it says this:

create table pk (a int primary key) partition by list (a);
create table fk (a int references pk);
create table pk1 partition of pk for values in (1);

alvherre=# drop table pk1;
ERROR:  cannot drop table pk1 because other objects depend on it
DETAIL:  constraint fk_a_fkey on table fk depends on table pk1
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

If you do say CASCADE, the constraint is dropped.  Not really ideal (I
would prefer that the drop is prevented completely), but at least it's
not completely bogus.  If you do "DROP TABLE pk", it works sanely.
Also, if you DETACH the partition that pg_depend row goes away, so a
subsequent drop of the partition works sanely.

Fixed the psql issue pointed out by Amit L too.



Could you expand a bit on the change to DEPENDENCY_INTERNAL instead of 
DEPENDENCY_PARTITION_PRI / DEPENDENCY_PARTITION_SEC?


If you run "DROP TABLE t2_p32 CASCADE" the foreign key constraint is 
removed from all of t1.


-- ddl.sql --
CREATE TABLE t1 (i1 INT PRIMARY KEY, i2 INT NOT NULL) PARTITION BY HASH 
(i1);
CREATE TABLE t2 (i1 INT PRIMARY KEY, i2 INT NOT NULL) PARTITION BY HASH 
(i1);


\o /dev/null
SELECT 'CREATE TABLE t1_p' || x::text || ' PARTITION OF t1
FOR VALUES WITH (MODULUS 64, REMAINDER ' || x::text || ');'
from generate_series(0,63) x;
\gexec
\o

\o /dev/null
SELECT 'CREATE TABLE t2_p' || x::text || ' PARTITION OF t2
FOR VALUES WITH (MODULUS 64, REMAINDER ' || x::text || ');'
from generate_series(0,63) x;
\gexec
\o

ALTER TABLE t1 ADD CONSTRAINT fk_t1_i2_t2_i1 FOREIGN KEY (i2) REFERENCES 
t2(i1);


INSERT INTO t2 (SELECT i, i FROM generate_series(1, 1000) AS i);
INSERT INTO t1 (SELECT i, i FROM generate_series(1, 1000) AS i);

ANALYZE;
-- ddl.sql --

Detaching the partition for DROP seems safer to me.

Thanks in advance !

Best regards,
 Jesper
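
The behaviour described above can be sketched as a toy pg_depend-style graph (object names borrowed from the test case; the graph itself is an invented simplification): because the foreign-key constraint depends on each referenced partition, dropping a single partition with CASCADE takes the whole constraint with it.

```python
# dependent -> set of referenced objects (toy pg_depend)
depends_on = {
    "constraint fk_t1_i2_t2_i1": {"table t2_p32", "table t2_p33", "table t1"},
}

def drop_cascade(obj):
    """Drop obj plus any dependent whose referenced set contains it."""
    dropped = {obj}
    for dependent, referenced in depends_on.items():
        if obj in referenced:
            dropped.add(dependent)
    return dropped

print(sorted(drop_cascade("table t2_p32")))
# -> ['constraint fk_t1_i2_t2_i1', 'table t2_p32']
```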




Re: partitioned tables referenced by FKs

2019-03-27 Thread Jesper Pedersen

Hi,

On 3/26/19 5:39 PM, Alvaro Herrera wrote:

As I said before, I'm thinking of getting rid of the whole business of
checking partitions on the referenced side of an FK at DROP time, and
instead jut forbid the DROP completely if any FKs reference an ancestor
of that partition.


Will that allow `DROP TABLE parted_pk CASCADE` to succeed even if
partitions still contain referenced data?  I suppose that's the example
you cited upthread as a bug that remains to be solved.


That's the idea, yes, it should do that: only allow a DROP of a
partition referenced by an FK if the topmost constraint is also being
dropped.  Maybe this means I need to get rid of 0002 completely.  But I
haven't got to doing that yet.



I think that is ok, if doc/src/sgml/ref/drop_table.sgml is updated with 
a description of how CASCADE works in this scenario.


Best regards,
 Jesper




Re: partitioned tables referenced by FKs

2019-03-26 Thread Jesper Pedersen

Hi Amit,

On 3/26/19 2:06 AM, Amit Langote wrote:

Wouldn't you get the same numbers on HEAD too?  IOW, I'm not sure how the
patch here, which seems mostly about getting DDL in order to support
foreign keys on partitioned tables, would have affected the result of this
benchmark.  Can you clarify your intention of running this benchmark
against these patches?



Yeah, you are right. As I'm also seeing this issue with the following test case,

-- ddl.sql --
CREATE TABLE t1 (i1 INT PRIMARY KEY, i2 INT NOT NULL) PARTITION BY HASH 
(i1);

CREATE TABLE t2 (i1 INT PRIMARY KEY, i2 INT NOT NULL);

\o /dev/null
SELECT 'CREATE TABLE t1_p' || x::text || ' PARTITION OF t1
FOR VALUES WITH (MODULUS 64, REMAINDER ' || x::text || ');'
from generate_series(0,63) x;
\gexec
\o

ALTER TABLE t1 ADD CONSTRAINT fk_t1_i2_t2_i1 FOREIGN KEY (i2) REFERENCES 
t2(i1);


ANALYZE;

we shouldn't focus on it in this thread.

Best regards,
 Jesper



Re: partitioned tables referenced by FKs

2019-03-22 Thread Jesper Pedersen

Hi Alvaro,

On 3/21/19 6:18 PM, Alvaro Herrera wrote:

On 2019-Mar-21, Jesper Pedersen wrote:

pgbench -M prepared -f select.sql 

I'm seeing 82.64% spent in GetCachedPlan(). plan_cache_mode is auto.


Hmm, I can't reproduce this at all ...  I don't even see GetCachedPlan
in the profile.  Do you maybe have some additional patch in your tree?



No, with 7df159a62 and v7 compiled with "-O0 -fno-omit-frame-pointer" I 
still see it.


plan_cache_mode = auto
  2394 TPS w/ GetCachedPlan() @ 81.79%

plan_cache_mode = force_generic_plan
 10984 TPS w/ GetCachedPlan() @ 23.52%

perf sent off-list.

Best regards,
 Jesper
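
A deliberately crude cost model of what these numbers suggest (invented numbers and logic, not the real GetCachedPlan heuristic): if plan_cache_mode=auto ends up building a fresh custom plan per execution, which is plausible when planning over many partitions is expensive, the per-call planning cost dominates, while force_generic_plan pays it once.

```python
def run_queries(n, plan_cache_mode, plan_cost=5.0, exec_cost=1.0):
    """Toy model: generic plans are built once and reused; assume
    'auto' falls back to a fresh custom plan on every execution."""
    if plan_cache_mode == "force_generic_plan":
        return plan_cost + n * exec_cost
    return n * (plan_cost + exec_cost)

print(run_queries(1000, "force_generic_plan") < run_queries(1000, "auto"))  # -> True
```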



Re: partitioned tables referenced by FKs

2019-03-22 Thread Jesper Pedersen

Hi Alvaro,

On 3/21/19 4:49 PM, Alvaro Herrera wrote:

I think the attached is a better solution, which I'll go push shortly.



I see you pushed this, plus 0002 as well.

Thanks !

Best regards,
 Jesper



Re: speeding up planning with partitions

2019-03-22 Thread Jesper Pedersen

Hi Amit,

On 3/22/19 3:39 AM, Amit Langote wrote:

I took a stab at this.  I think rearranging the code in
make_partitionedrel_pruneinfo() slightly will take care of this.

The problem is that make_partitionedrel_pruneinfo() does some work which
involves allocating arrays containing nparts elements and then looping
over the part_rels array to fill values in those arrays that will be used
by run-time pruning.  It does all that even *before* it has checked that
run-time pruning will be needed at all, which if it is not, it will simply
discard the result of the aforementioned work.

Patch 0008 of 0009 rearranges the code such that we check whether we will
need run-time pruning at all before doing any of the work that's necessary
for run-time to work.



Yeah, this is better :) Increasing the number of partitions now leads to a 
TPS difference within the noise level.


Passes check-world.

Best regards,
 Jesper



Re: partitioned tables referenced by FKs

2019-03-21 Thread Jesper Pedersen

Hi Alvaro,

On 3/18/19 6:02 PM, Alvaro Herrera wrote:

I spent a few hours studying this and my conclusion is the opposite of
yours: we should make addFkRecurseReferencing the recursive one, and
CloneFkReferencing a non-recursive caller of that.  So we end up with
both addFkRecurseReferenced and addFkRecurseReferencing as recursive
routines, and CloneFkReferenced and CloneFkReferencing being
non-recursive callers of those.  With this structure change there is one
more call to CreateConstraintEntry than before, and now there are two
calls of tryAttachPartitionForeignKey instead of one; I think with this
new structure things are much simpler.  I also changed
CloneForeignKeyConstraints's API: instead of returning a list of cloned
constraint giving its caller the responsibility of adding FK checks to
phase 3, we now give CloneForeignKeyConstraints the 'wqueue' list, so
that it can add the FK checks itself.  It seems much cleaner this way.



Using

-- ddl.sql --
CREATE TABLE t1 (i1 INT PRIMARY KEY, i2 INT NOT NULL) PARTITION BY HASH 
(i1);
CREATE TABLE t2 (i1 INT PRIMARY KEY, i2 INT NOT NULL) PARTITION BY HASH 
(i1);


\o /dev/null
SELECT 'CREATE TABLE t1_p' || x::text || ' PARTITION OF t1
FOR VALUES WITH (MODULUS 64, REMAINDER ' || x::text || ');'
from generate_series(0,63) x;
\gexec
\o

\o /dev/null
SELECT 'CREATE TABLE t2_p' || x::text || ' PARTITION OF t2
FOR VALUES WITH (MODULUS 64, REMAINDER ' || x::text || ');'
from generate_series(0,63) x;
\gexec
\o

ALTER TABLE t1 ADD CONSTRAINT fk_t1_i2_t2_i1 FOREIGN KEY (i2) REFERENCES 
t2(i1);


ANALYZE;
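
The hash-partition DDL that the \gexec queries above produce can also be 
generated outside psql; a small sketch (the table name and modulus are 
parameters, output matches the statements above):

```python
def hash_partition_ddl(parent, modulus):
    """Emit one CREATE TABLE ... PARTITION OF statement per remainder,
    mirroring what the \\gexec query above generates."""
    return [f"CREATE TABLE {parent}_p{r} PARTITION OF {parent}\n"
            f"FOR VALUES WITH (MODULUS {modulus}, REMAINDER {r});"
            for r in range(modulus)]

for stmt in hash_partition_ddl("t1", 64)[:2]:
    print(stmt)
```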


with

-- select.sql --
\set a random(1, 10)
SELECT t1.i1 AS t1i1, t1.i2 AS t1i2, t2.i1 AS t2i1, t2.i2 AS t2i2 FROM 
t1, t2 WHERE t1.i1 = :a;


running

pgbench -M prepared -f select.sql 

I'm seeing 82.64% spent in GetCachedPlan(). plan_cache_mode is auto.

Best regards,
 Jesper



Re: speeding up planning with partitions

2019-03-21 Thread Jesper Pedersen

Hi Amit,

On 3/19/19 8:41 PM, Amit Langote wrote:

I have fixed all.  Attached updated patches.



The comments in the thread have been addressed, so I have put the CF 
entry into Ready for Committer.


Best regards,
 Jesper



Re: speeding up planning with partitions

2019-03-20 Thread Jesper Pedersen

Hi,

On 3/19/19 11:15 PM, Imai, Yoshikazu wrote:

Here the details.

[creating partitioned tables (with 1024 partitions)]
drop table if exists rt;
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
  (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x;
\gexec
\o

[select1024.sql]
\set a random (1, 1024)
select * from rt where a = :a;

[pgbench]
pgbench -n -f select1024.sql -T 60




My tests - using hash partitions - show that the extra time is spent in 
make_partition_pruneinfo() on the relid_subplan_map variable.


  64 partitions: make_partition_pruneinfo() 2.52%
8192 partitions: make_partition_pruneinfo() 5.43%

TPS goes down ~8% between the two. The test setup is similar to the above.

Given that Tom is planning to change the List implementation [1] in 13, 
I think using the palloc0 structure is ok for 12 before trying other 
implementation options.
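
For reference, the palloc0 structure discussed here is essentially a 
zero-initialized array indexed directly by relid, turning each subplan 
lookup into O(1); a toy sketch, not the actual planner data structure:

```python
def build_relid_subplan_map(max_relid, subplan_relids):
    # Zero-filled array indexed directly by relid (the palloc0 idea);
    # entry 0 means "no subplan", otherwise the 1-based subplan index.
    m = [0] * (max_relid + 1)
    for i, relid in enumerate(subplan_relids, start=1):
        m[relid] = i
    return m

m = build_relid_subplan_map(10, [3, 7, 9])
print(m[7])  # 2: relid 7 maps to the second subplan
```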


perf sent off-list.

[1] https://www.postgresql.org/message-id/24783.1551568303%40sss.pgh.pa.us

Best regards,
 Jesper



Re: pg_upgrade: Pass -j down to vacuumdb

2019-03-12 Thread Jesper Pedersen

Hi,

Thanks Kirk !

On 3/12/19 2:20 PM, Robert Haas wrote:

The words 'by default' should be removed here, because there is also
no non-default way to get that behavior, either.



Here is v9 based on Kirk's and your input.

Best regards,
 Jesper

>From 5b879f79300412638705e32aa3ed51aff3cbe75c Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Tue, 12 Mar 2019 16:02:27 -0400
Subject: [PATCH] Highlight that the --jobs option isn't passed down to
 vacuumdb.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Author: Jesper Pedersen and Kirk Jamison
Reviewed-by: Peter Eisentraut, Jerry Sievers, Michael Paquier, Tom Lane, Fabrízio de Royes Mello, Álvaro Herrera and Robert Haas
Discussion: https://www.postgresql.org/message-id/flat/11b6b108-6ac0-79a6-785a-f13dfe60a...@redhat.com
---
 doc/src/sgml/ref/pgupgrade.sgml |  6 +-
 src/bin/pg_upgrade/check.c  | 23 +++
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7e1afaf0a5..5f2a94918d 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -395,7 +395,11 @@ NET STOP postgresql-
  in parallel;  a good place to start is the maximum of the number of
  CPU cores and tablespaces.  This option can dramatically reduce the
  time to upgrade a multi-database server running on a multiprocessor
- machine.
+ machine. Note that this option isn't passed to the
+ vacuumdb application.
+ The system environment variable VACUUMDB_OPTS can be
+ used to pass extra options to the vacuumdb
+ application during the script execution.
 
 
 
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index fc5aa7010f..e4356f011c 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -493,17 +493,32 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 
 	fprintf(script, "echo %sIf you would like default statistics as quickly as possible, cancel%s\n",
 			ECHO_QUOTE, ECHO_QUOTE);
-	fprintf(script, "echo %sthis script and run:%s\n",
+	fprintf(script, "echo %sthis script. You may add optional vacuumdb options (e.g. --jobs)%s\n",
 			ECHO_QUOTE, ECHO_QUOTE);
-	fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
+	fprintf(script, "echo %sby editing $VACUUMDB_OPTS, then run the command below.%s\n",
+			ECHO_QUOTE, ECHO_QUOTE);
+#ifndef WIN32
+	fprintf(script, "echo %s\"%s/vacuumdb\" %s $VACUUMDB_OPTS --all %s%s\n", ECHO_QUOTE,
 			new_cluster.bindir, user_specification.data,
-	/* Did we copy the free space files? */
+			/* Did we copy the free space files? */
 			(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
 			"--analyze-only" : "--analyze", ECHO_QUOTE);
+#else
+	fprintf(script, "echo %s\"%s\\vacuumdb.exe\" %s %%VACUUMDB_OPTS%% --all %s%s\n", ECHO_QUOTE,
+			new_cluster.bindir, user_specification.data,
+			/* Did we copy the free space files? */
+			(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+			"--analyze-only" : "--analyze", ECHO_QUOTE);
+#endif
 	fprintf(script, "echo%s\n\n", ECHO_BLANK);
 
-	fprintf(script, "\"%s/vacuumdb\" %s--all --analyze-in-stages\n",
+#ifndef WIN32
+	fprintf(script, "\"%s/vacuumdb\" %s $VACUUMDB_OPTS --all --analyze-in-stages\n",
 			new_cluster.bindir, user_specification.data);
+#else
+	fprintf(script, "\"%s\\vacuumdb.exe\" %s %%VACUUMDB_OPTS%% --all --analyze-in-stages\n",
+			new_cluster.bindir, user_specification.data);
+#endif
 	/* Did we copy the free space files? */
 	if (GET_MAJOR_VERSION(old_cluster.major_version) < 804)
 		fprintf(script, "\"%s/vacuumdb\" %s--all\n", new_cluster.bindir,
-- 
2.17.2



Re: speeding up planning with partitions

2019-03-12 Thread Jesper Pedersen

Hi Amit,

On 3/12/19 4:22 AM, Amit Langote wrote:

I wrestled with this idea a bit and concluded that we don't have to
postpone *all* of preprocess_targetlist() processing to query_planner,
only the part that adds row mark "junk" Vars, because only those matter
for the problem being solved.  To recap, the problem is that delaying
adding inheritance children (and hence their row marks if any) means we
can't add "junk" columns needed to implement row marks, because which ones
to add is not clear until we've seen all the children.

I propose that we delay the addition of "junk" Vars to query_planner() so
that it doesn't stand in the way of deferring inheritance expansion to
query_planner() too.  That means the order of reltarget expressions for
row-marked rels will change, but I assume that's OK.  At least it will be
consistent for both non-inherited baserels and inherited ones.

Attached updated version of the patches with the above proposal
implemented by patch 0002.  To summarize, the patches are as follows:

0001: make building of "other rels" a separate step that runs after
deconstruct_jointree(), implemented by a new subroutine of query_planner
named add_other_rels_to_query()

0002: delay adding "junk" Vars to after add_other_rels_to_query()

0003: delay inheritance expansion to add_other_rels_to_query()

0004, 0005: adjust inheritance_planner() to account for the fact that
inheritance is now expanded by query_planner(), not subquery_planner()

0006: perform partition pruning while inheritance is being expanded,
instead of during set_append_append_rel_size()

0007: add a 'live_parts' field to RelOptInfo to store partition indexes
(not RT indexes) of unpruned partitions, which speeds up looping over
part_rels array of the partitioned parent

0008: avoid copying PartitionBoundInfo struct for planning



After applying 0004 check-world fails with the attached. CFBot agrees [1].

[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/505107884

Best regards,
 Jesper

diff -U3 
/home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/expected/postgres_fdw.out
 
/home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/results/postgres_fdw.out
--- 
/home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/expected/postgres_fdw.out
2019-03-12 07:46:08.430690229 -0400
+++ 
/home/jpedersen/PostgreSQL/postgresql/contrib/postgres_fdw/results/postgres_fdw.out
 2019-03-12 07:59:01.134285159 -0400
@@ -7190,8 +7190,8 @@
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
- QUERY PLAN
--------------------------------------------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------------------------------------------
 Update on public.bar
   Update on public.bar
   Foreign Update on public.bar2
@@ -7214,22 +7214,22 @@
               Output: foo2.f1, foo2.ctid, foo2.*, foo2.tableoid
               Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
   ->  Hash Join
-        Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+        Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*
         Inner Unique: true
         Hash Cond: (bar2.f1 = foo.f1)
         ->  Foreign Scan on public.bar2
              Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
              Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
         ->  Hash
-              Output: foo.f1, foo.ctid, foo.*, foo.tableoid
+              Output: foo.f1, foo.ctid, foo.*
              ->  HashAggregate
-                   Output: foo.f1, foo.ctid, foo.*, foo.tableoid
+                   Output: foo.f1, foo.ctid, foo.*
                    Group Key: foo.f1
                    ->  Append
                          ->  Seq Scan on public.foo
-                               Output: foo.f1, foo.ctid, foo.*, foo.tableoid
+                               Output: foo.f1, foo.ctid, foo.*
                          ->  Foreign Scan on public.foo2
-                               Output: foo2.f1, foo2.ctid, foo2.*, foo2.tableoid
+                               Output: foo2.f1, foo2.ctid, foo2.*
                               Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
 (39 rows)
 


Re: speeding up planning with partitions

2019-03-07 Thread Jesper Pedersen

Hi,

On 3/5/19 5:24 AM, Amit Langote wrote:
Attached an updated version. 


Need a rebase due to 898e5e3290.

Best regards,
 Jesper



Re: speeding up planning with partitions

2019-03-05 Thread Jesper Pedersen

On 3/5/19 5:24 AM, Amit Langote wrote:

Attached an updated version.  This incorporates fixes for both Jesper's
and Imai-san's review.  I haven't been able to pin down the bug (or
whatever) that makes throughput go down as the partition count increases,
when tested with a --enable-cassert build.



Thanks !

I'm seeing the throughput going down as well, but are you sure it isn't 
just the extra calls of MemoryContextCheck you are seeing ? A flamegraph 
diff highlights that area -- sent offline.


A non cassert build shows the same profile for 64 and 1024 partitions.

Best regards,
 Jesper



Re: speeding up planning with partitions

2019-03-04 Thread Jesper Pedersen

Hi Amit,

Passes check-world.

On 3/4/19 5:38 AM, Amit Langote wrote:

See patch 0001.



+* members of any appendrels we find here are added built later when

s/built//


See patch 0002.



+   grouping_planner(root, false, 0.0 /* retrieve all tuples */);

Move comment out of function call.

+   if (root->simple_rte_array[childRTindex])
+   elog(ERROR, "rel %d already exists", childRTindex);
+   root->simple_rte_array[childRTindex] = childrte;
+   if (root->append_rel_array[childRTindex])
+   elog(ERROR, "child relation %d already exists", 
childRTindex);
+   root->append_rel_array[childRTindex] = appinfo;


Could the "if"s be made into Assert's instead ?

+ * the newly added bytes with zero

Extra spaces

+   if (rte->rtekind == RTE_RELATION &&  !root->contains_inherit_children)

s/TAB/space


See patch 0003.



+* because they correspond to flattneed UNION ALL subqueries.  
Especially,

s/flattneed/flatten


See patch 0004.



+* no need to make any new entries, because anything that'd need those

Use "would" explicit

+ * this case, since it needn't be scanned.

, since it doesn't need to be scanned


See patch 0005.

See patch 0006.



I'll run some tests using a hash partitioned setup.

Best regards,
 Jesper



Re: partitioned tables referenced by FKs

2019-03-01 Thread Jesper Pedersen

Hi Alvaro,

On 2/28/19 1:28 PM, Alvaro Herrera wrote:

Rebased to current master.  I'll reply later to handle the other issues
you point out.



Applying with hunks.

With 0003 using

export CFLAGS="-DCOPY_PARSE_PLAN_TREES -O0 -fno-omit-frame-pointer" && 
CC="ccache gcc" ./configure --prefix /usr/local/packages/postgresql-12.x 
--enable-dtrace --with-openssl --with-gssapi --with-libxml --with-llvm 
--enable-debug --enable-depend --enable-tap-tests --enable-cassert


I'm getting a failure in the pg_upgrade test:

 --
+-- Name: pk5 pk5_pkey; Type: CONSTRAINT; Schema: regress_fk; Owner: jpedersen
+--
+
+ALTER TABLE ONLY regress_fk.pk5
+ADD CONSTRAINT pk5_pkey PRIMARY KEY (a);
+
+
+--

when running check-world.

Best regards,
 Jesper



Re: Index Skip Scan

2019-02-01 Thread Jesper Pedersen

Hi,

On 1/31/19 1:31 AM, Kyotaro HORIGUCHI wrote:

At Wed, 30 Jan 2019 18:19:05 +0100, Dmitry Dolgov <9erthali...@gmail.com> wrote in 


A bit of adjustment after nodes/relation -> nodes/pathnodes.


I had a look on this.

The name "index skip scan" is a different feature from the
feature with the same name in other products, which means "index scan
with a postfix key (mainly of a multi-column key) that scans
ignoring the prefixing part". As Thomas suggested, I'd suggest that
we call it "index hop scan". (I can accept Hopscotch, too :p)

Also, as mentioned upthread by Peter Geoghegan, this could easily
give a worse plan through underestimation. So I also suggest that this
has a dynamic fallback function.  From that perspective it is not
suitable as an AM API level feature.

If all leaf pages are in the buffer and the average hopping
distance is less than (expectedly) 4 pages (the average height of
the tree), the skip scan will lose. If almost all leaf pages stay
on disk, we could win even with 2-page steps (skipping over just
one page).

=
As I'm writing the above, I came to think that it would be better
to implement this as a pure executor optimization.

Specifically, let _bt_steppage count the ratio of skipped pages
so far; then, if the ratio exceeds some threshold (maybe around
3/4), it gets into hopscotching mode, where it uses an index scan
to find the next page (rather than traversing). As mentioned above,
I think using skip scan to go beyond the next page is a good
bet. If the success ratio of skip scans goes below some
threshold (I'm not sure which for now), we should fall back to
traversing.

Any opinions?
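
The proposed mode switching can be modeled as a small decision function 
over the running skip statistics; the thresholds below are purely 
illustrative, not from any patch:

```python
ENTER_THRESHOLD = 0.75  # enter "hopscotch" mode above this skip ratio
EXIT_THRESHOLD = 0.25   # fall back to leaf traversal below this

def next_mode(skips_won, skips_tried, in_hopscotch):
    """Pick the scan mode from the running skip statistics."""
    if skips_tried == 0:
        return in_hopscotch
    ratio = skips_won / skips_tried
    if not in_hopscotch and ratio >= ENTER_THRESHOLD:
        return True       # skips are paying off: re-descend the index
    if in_hopscotch and ratio < EXIT_THRESHOLD:
        return False      # skips stopped paying off: walk the leaf pages
    return in_hopscotch

print(next_mode(8, 10, False))  # True
print(next_mode(1, 10, True))   # False
```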



Some comments on the patch below.

+ skip scan approach, which is based on the idea of
+ <ulink url="https://wiki.postgresql.org/wiki/Free_Space_Map_Problems">
+ Loose index scan</ulink>. Rather than scanning all equal values of a key,
+ as soon as a new value is found, it will search for a larger value on the

I'm not sure it is a good thing to put a pointer to a rather
unstable source in the documentation.
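
For background, the skipping behavior the quoted paragraph describes 
(once a value has been returned, jump straight past its duplicates to 
the next larger value) can be sketched over a sorted array; this is a 
toy illustration, not the btree implementation:

```python
from bisect import bisect_right

def distinct_keys(sorted_keys):
    """Collect the distinct values of a sorted sequence by jumping past
    each run of equal keys with a binary search, instead of scanning
    every duplicate."""
    out, i = [], 0
    while i < len(sorted_keys):
        v = sorted_keys[i]
        out.append(v)
        i = bisect_right(sorted_keys, v, i)  # skip to the next larger value
    return out

print(distinct_keys([1, 1, 1, 2, 2, 5, 5, 5, 5, 9]))  # [1, 2, 5, 9]
```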


This adds a new AM method, but it seems available only for ordered
indexes, specifically btree. And it seems that the feature can be
implemented in btgettuple, since btskip apparently does the same
thing. (I agree with Robert on that point in [1].)

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com


Related to the above, it seems better for the path generation of
skip scan to be a part of index scan. Whether to skip scan or not is
a matter of the index scan itself.



Thanks for your valuable feedback ! And yes, calling it "Loose index 
scan" or something similar would be better.


Dmitry and I will look at this and take it into account for the next 
version.


For now, I have switched the CF entry to WoA.

Thanks again !

Best regards,
 Jesper



Re: pg_upgrade: Pass -j down to vacuumdb

2019-02-01 Thread Jesper Pedersen

Hi,

On 2/1/19 4:58 AM, Alvaro Herrera wrote:

On 2019-Feb-01, Jamison, Kirk wrote:


I'm not sure if I misunderstood the purpose of $VACUUMDB_OPTS. I thought what
the other developers suggested is to utilize it so that --jobs for vacuumdb
would be optional and just based on the user-specified option $VACUUMDB_OPTS.
That way it would not reuse the number of jobs used for pg_upgrade.
To address your need for parallelism, the script would then just suggest
a --jobs option if the user wants one. The if/else for the number of jobs is
not needed in that case, maybe only for the ifndef WIN32 / else case.


No Kirk, you got it right -- (some of) those ifdefs are not needed, as
adding the --jobs in the command line is not needed.  I do think some
extra whitespace in the format string is needed.



I'm confused by this.

The point of the patch is to highlight to the user that even though 
he/she does "pg_upgrade -j 8" the "-j 8" option isn't passed down to 
vacuumdb.


The default value is 1, so in that case the echo command doesn't 
highlight that fact; otherwise it does. The user can then cancel the 
script and use the suggested command line.


The script then executes the vacuumdb command with the $VACUUMDB_OPTS 
argument as noted in the documentation.
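
What the generated script effectively runs can be sketched as follows; 
the bindir path is made up, and this only models the shell expansion, 
not the script itself:

```python
import shlex

def analyze_command(bindir, vacuumdb_opts=""):
    """Roughly what the generated analyze script runs: whatever is in
    $VACUUMDB_OPTS is split shell-style and spliced into the command
    line ahead of --all."""
    return [f"{bindir}/vacuumdb", *shlex.split(vacuumdb_opts),
            "--all", "--analyze-in-stages"]

print(analyze_command("/usr/pgsql-12/bin", "--jobs 8"))
```

Setting VACUUMDB_OPTS="--jobs 8" in the environment before running the 
script would therefore add the two extra arguments.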


Sample script attached as well.

I added extra space in the --analyze-in-stages part.

Kirk, can you provide a delta patch to match what you are thinking ?

Best regards,
 Jesper

>From 4b5856725af5d971a26a2ba150c1067ee21bb242 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 21 Dec 2018 05:05:31 -0500
Subject: [PATCH] Highlight that the --jobs option isn't passed down to
 vacuumdb by default.

Author: Jesper Pedersen 
---
 doc/src/sgml/ref/pgupgrade.sgml |  6 +-
 src/bin/pg_upgrade/check.c  | 28 ++--
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7e1afaf0a5..84a7ccb6f1 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -395,7 +395,11 @@ NET STOP postgresql-
  in parallel;  a good place to start is the maximum of the number of
  CPU cores and tablespaces.  This option can dramatically reduce the
  time to upgrade a multi-database server running on a multiprocessor
- machine.
+ machine. Note that this option isn't passed to the
+ vacuumdb application by default.
+ The system environment variable VACUUMDB_OPTS can be
+ used to pass extra options to the vacuumdb
+ application during the script execution.
 
 
 
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index fc5aa7010f..570b7c9ae7 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -495,15 +495,31 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 			ECHO_QUOTE, ECHO_QUOTE);
 	fprintf(script, "echo %sthis script and run:%s\n",
 			ECHO_QUOTE, ECHO_QUOTE);
-	fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
-			new_cluster.bindir, user_specification.data,
-	/* Did we copy the free space files? */
-			(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
-			"--analyze-only" : "--analyze", ECHO_QUOTE);
+	if (user_opts.jobs <= 1)
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
+	else
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--jobs %d --all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data, user_opts.jobs,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
 	fprintf(script, "echo%s\n\n", ECHO_BLANK);
 
-	fprintf(script, "\"%s/vacuumdb\" %s--all --analyze-in-stages\n",
+#ifndef WIN32
+	fprintf(script, "\"%s/vacuumdb\" %s $VACUUMDB_OPTS --all --analyze-in-stages\n",
 			new_cluster.bindir, user_specification.data);
+#else
+	fprintf(script, "\"%s\\vacuumdb.exe\" %s %%VACUUMDB_OPTS%% --all --analyze-in-stages\n",
+			new_cluster.bindir, user_specification.data);
+#endif
 	/* Did we copy the free space files? */
 	if (GET_MAJOR_VERSION(old_cluster.major_version) < 804)
 		fprintf(script, "\"%s/vacuumdb\" %s--all\n", new_cluster.bindir,
-- 
2.17.2



analyze_new_cluster.sh
Description: application/shellscript


Re: pg_upgrade: Pass -j down to vacuumdb

2019-01-31 Thread Jesper Pedersen

Hi,

On 1/30/19 7:59 PM, Jamison, Kirk wrote:

I added most of the documentation back, as requested by Kirk.


Actually, I find it useful to be documented. However, major contributors have 
higher opinions in terms of experience, so I think it's alright with me not to 
include the doc part if they finally say so.



I think we need to leave it to the Committer to decide, as both Peter  
and Michael are committers; provided the patch reaches RfC.



It seems to me that the latest patch sent is incorrect for multiple
reasons:
1) You still enforce -j to use the number of jobs that the caller of
pg_upgrade provides, and we agreed that both things are separate
concepts upthread, no?  What has been suggested by Alvaro is to add a
comment so as one can use VACUUM_OPTS with -j optionally, instead of
suggesting a full-fledged vacuumdb command which depends on what
pg_upgrade uses.  So there is no actual need for the if/else
complication business.



I think it is ok for the echo command to highlight to the user that running 
--analyze-only with the same number of jobs will give a faster result.


I missed that part.
IIUC, using the $VACUUMDB_OPTS variable would simplify and correct the usage of 
jobs for vacuumdb. (pg_upgrade jobs is not equal to vacuumdb jobs) Plus, it 
might simplify and reduce the number of additional lines.
Tom Lane also suggested above to make the script absorb the value from env.
Would that address your concern of getting a faster result, yes?



The actual line in the script executed is using $VACUUMDB_OPTS at the  
moment, so could you expand on the above ? The 'if' could be removed if  
we assume that pg_upgrade would only be used with server > 9.5, but to  
be safer I would leave it in, as it is only console output.


Best regards,
 Jesper



Re: pg_upgrade: Pass -j down to vacuumdb

2019-01-29 Thread Jesper Pedersen

Hi,

On 1/29/19 12:08 AM, Michael Paquier wrote:

On Tue, Jan 29, 2019 at 12:35:30AM +, Jamison, Kirk wrote:

I just checked the patch.
As per advice, you removed the versioning and specified --jobs.
The patch still applies, builds and passed the tests successfully.


I would document the optional VACUUM_OPTS on the page of pg_upgrade.
If Peter thinks it is fine to not do so, that's fine for me as well.



I added most of the documentation back, as requested by Kirk.


It seems to me that the latest patch sent is incorrect for multiple
reasons:
1) You still enforce -j to use the number of jobs that the caller of
pg_upgrade provides, and we agreed that both things are separate
concepts upthread, no?  What has been suggested by Alvaro is to add a
comment so as one can use VACUUM_OPTS with -j optionally, instead of
suggesting a full-fledged vacuumdb command which depends on what
pg_upgrade uses.  So there is no actual need for the if/else
complication business.


I think it is ok for the echo command to highlight to the user that 
running --analyze-only with the same number of jobs will give a faster 
result.



2) Perhaps we need to worry about the second vacuumdb --all command,
which may want custom options, which are not necessarily the same as
the options of the first command?  I don't think we need to care as it
applies only to an upgraded cluster using something older 8.4, just
wondering..


I think that --all --analyze-in-stages is what most people want. And 
with the $VACUUMDB_OPTS variable people have an option to put in more, 
such as -j X.


Best regards,
 Jesper


>From b02e1f414e309361595ab3fe84fb6f66bcf63265 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 21 Dec 2018 05:05:31 -0500
Subject: [PATCH] Highlight that the --jobs option isn't passed down to
 vacuumdb by default.

Author: Jesper Pedersen 
---
 doc/src/sgml/ref/pgupgrade.sgml |  6 +-
 src/bin/pg_upgrade/check.c  | 28 ++--
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7e1afaf0a5..84a7ccb6f1 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -395,7 +395,11 @@ NET STOP postgresql-
  in parallel;  a good place to start is the maximum of the number of
  CPU cores and tablespaces.  This option can dramatically reduce the
  time to upgrade a multi-database server running on a multiprocessor
- machine.
+ machine. Note that this option isn't passed to the
+ vacuumdb application by default.
+ The system environment variable VACUUMDB_OPTS can be
+ used to pass extra options to the vacuumdb
+ application during the script execution.
 
 
 
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index fc5aa7010f..847e604088 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -495,15 +495,31 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 			ECHO_QUOTE, ECHO_QUOTE);
 	fprintf(script, "echo %sthis script and run:%s\n",
 			ECHO_QUOTE, ECHO_QUOTE);
-	fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
-			new_cluster.bindir, user_specification.data,
-	/* Did we copy the free space files? */
-			(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
-			"--analyze-only" : "--analyze", ECHO_QUOTE);
+	if (user_opts.jobs <= 1)
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
+	else
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--jobs %d --all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data, user_opts.jobs,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
 	fprintf(script, "echo%s\n\n", ECHO_BLANK);
 
-	fprintf(script, "\"%s/vacuumdb\" %s--all --analyze-in-stages\n",
+#ifndef WIN32
+	fprintf(script, "\"%s/vacuumdb\" %s$VACUUMDB_OPTS --all --analyze-in-stages\n",
 			new_cluster.bindir, user_specification.data);
+#else
+	fprintf(script, "\"%s\\vacuumdb.exe\" %s%%VACUUMDB_OPTS%% --all --analyze-in-stages\n",
+			new_cluster.bindir, user_specification.data);
+#endif
 	/* Did we copy the free space files? */
 	if (GET_MAJOR_VERSION(old_cluster.major_version) < 804)
 		fprintf(script, "\"%s/vacuumdb\" %s--all\n", new_cluster.bindir,
-- 
2.17.2



Re: pg_upgrade: Pass -j down to vacuumdb

2019-01-28 Thread Jesper Pedersen

On 1/28/19 3:50 PM, Peter Eisentraut wrote:

Done, and v4 attached.


I would drop the changes in pgupgrade.sgml.  We don't need to explain
what doesn't happen, when nobody before now ever thought that it would
happen.

Also, we don't need the versioning stuff.  The new cluster in pg_upgrade
is always of the current version, and we know what that one supports.



Done.

Best regards,
 Jesper
>From 5b51f927dff808b4a8516b424c2ef790dbe96da6 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 21 Dec 2018 05:05:31 -0500
Subject: [PATCH] Highlight that the --jobs option isn't passed down to
 vacuumdb by default.

Author: Jesper Pedersen 
---
 src/bin/pg_upgrade/check.c | 28 ++--
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index fc5aa7010f..847e604088 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -495,15 +495,31 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 			ECHO_QUOTE, ECHO_QUOTE);
 	fprintf(script, "echo %sthis script and run:%s\n",
 			ECHO_QUOTE, ECHO_QUOTE);
-	fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
-			new_cluster.bindir, user_specification.data,
-	/* Did we copy the free space files? */
-			(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
-			"--analyze-only" : "--analyze", ECHO_QUOTE);
+	if (user_opts.jobs <= 1)
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
+	else
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--jobs %d --all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data, user_opts.jobs,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
 	fprintf(script, "echo%s\n\n", ECHO_BLANK);
 
-	fprintf(script, "\"%s/vacuumdb\" %s--all --analyze-in-stages\n",
+#ifndef WIN32
+	fprintf(script, "\"%s/vacuumdb\" %s$VACUUMDB_OPTS --all --analyze-in-stages\n",
 			new_cluster.bindir, user_specification.data);
+#else
+	fprintf(script, "\"%s\\vacuumdb.exe\" %s%%VACUUMDB_OPTS%% --all --analyze-in-stages\n",
+			new_cluster.bindir, user_specification.data);
+#endif
 	/* Did we copy the free space files? */
 	if (GET_MAJOR_VERSION(old_cluster.major_version) < 804)
 		fprintf(script, "\"%s/vacuumdb\" %s--all\n", new_cluster.bindir,
-- 
2.17.2



Re: pg_upgrade: Pass -j down to vacuumdb

2019-01-28 Thread Jesper Pedersen

On 1/28/19 11:30 AM, Fabrízio de Royes Mello wrote:

IMHO you should use long option name '--jobs' like others.



Thanks for your feedback !

Done, and v4 attached.

Best regards,
 Jesper
>From 142b4723f433f39d0eed2ced4beccc3721451103 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 21 Dec 2018 05:05:31 -0500
Subject: [PATCH] Highlight that the --jobs option isn't passed down to
 vacuumdb by default.

Author: Jesper Pedersen 
---
 doc/src/sgml/ref/pgupgrade.sgml |  6 +-
 src/bin/pg_upgrade/check.c  | 28 ++--
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7e1afaf0a5..b1da0cc7a2 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -395,7 +395,11 @@ NET STOP postgresql-
  in parallel;  a good place to start is the maximum of the number of
  CPU cores and tablespaces.  This option can dramatically reduce the
  time to upgrade a multi-database server running on a multiprocessor
- machine.
+ machine. Note, that this option isn't passed to the
+ vacuumdb application by default.
+ The system environment variable VACUUMDB_OPTS can be
+ used to pass extra options to the vacuumdb
+ application during the script execution such as --jobs.
 
 
 
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index fc5aa7010f..3b478ef95f 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -495,15 +495,31 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 			ECHO_QUOTE, ECHO_QUOTE);
 	fprintf(script, "echo %sthis script and run:%s\n",
 			ECHO_QUOTE, ECHO_QUOTE);
-	fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
-			new_cluster.bindir, user_specification.data,
-	/* Did we copy the free space files? */
-			(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
-			"--analyze-only" : "--analyze", ECHO_QUOTE);
+	if (user_opts.jobs <= 1 || GET_MAJOR_VERSION(new_cluster.major_version) < 905)
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
+	else
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--jobs %d --all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data, user_opts.jobs,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
 	fprintf(script, "echo%s\n\n", ECHO_BLANK);
 
-	fprintf(script, "\"%s/vacuumdb\" %s--all --analyze-in-stages\n",
+#ifndef WIN32
+	fprintf(script, "\"%s/vacuumdb\" %s$VACUUMDB_OPTS --all --analyze-in-stages\n",
 			new_cluster.bindir, user_specification.data);
+#else
+	fprintf(script, "\"%s\\vacuumdb.exe\" %s%%VACUUMDB_OPTS%% --all --analyze-in-stages\n",
+			new_cluster.bindir, user_specification.data);
+#endif
 	/* Did we copy the free space files? */
 	if (GET_MAJOR_VERSION(old_cluster.major_version) < 804)
 		fprintf(script, "\"%s/vacuumdb\" %s--all\n", new_cluster.bindir,
-- 
2.17.2



Re: pg_upgrade: Pass -j down to vacuumdb

2019-01-28 Thread Jesper Pedersen

Hi,

On 1/27/19 7:50 PM, Jamison, Kirk wrote:

There were also helpful comments from the developers above,
pointing it to the right direction.
In addition to Peter's comment, quoting "...if you want to do this
as fast as possible, don't use this script.  That comment could be
enhanced to suggest the use of the -j option.", so I think you
should also update the following script in create_script_for_cluster_analyze():
  fprintf(script, "echo %sIf you would like default statistics as quickly as 
possible, cancel%s\n",
  ECHO_QUOTE, ECHO_QUOTE);



Yes, it will echo the -j option if it is passed and the server supports it.


Alvaro Herrera  writes:

So let's have it write with a $VACUUMDB_OPTS variable, which is by
default defined as empty but with a comment suggesting that maybe the
user wants to add the -j option.  This way, if they have to edit it,
they only have to edit the VACUUMDB_OPTS line instead of each of the
two vacuumdb lines.



Tom Lane wrote:

Even better, allow the script to absorb a value from the environment, so that 
it needn't be edited at all.




Done.

Attached is v3.

Best regards,
 Jesper
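The $VACUUMDB_OPTS mechanism suggested by Alvaro and Tom above can be modeled roughly as below (a simplified sketch of the line the patch writes into the generated analyze script — the echo stands in for the real vacuumdb invocation, and no actual cluster is touched):

```shell
# The generated script reads VACUUMDB_OPTS from the environment at run
# time, so the user never has to edit the script to add e.g. --jobs.
# Here "vacuumdb" is echoed rather than executed, to show the expansion.
VACUUMDB_OPTS="--jobs 4"
echo "vacuumdb $VACUUMDB_OPTS --all --analyze-in-stages"
# → vacuumdb --jobs 4 --all --analyze-in-stages
```

At run time a user could then invoke the script as, say, `VACUUMDB_OPTS="--jobs 4" ./analyze_new_cluster.sh` (script name assumed here) without any editing at all.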
>From 1cba8299ae328942a4bf623d841a80b1b6043d3a Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 21 Dec 2018 05:05:31 -0500
Subject: [PATCH] Highlight that the -j option isn't passed down to vacuumdb by
 default.

Author: Jesper Pedersen 
---
 doc/src/sgml/ref/pgupgrade.sgml |  6 +-
 src/bin/pg_upgrade/check.c  | 28 ++--
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7e1afaf0a5..b1da0cc7a2 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -395,7 +395,11 @@ NET STOP postgresql-
  in parallel;  a good place to start is the maximum of the number of
  CPU cores and tablespaces.  This option can dramatically reduce the
  time to upgrade a multi-database server running on a multiprocessor
- machine.
+ machine. Note, that this option isn't passed to the
+ vacuumdb application by default.
+ The system environment variable VACUUMDB_OPTS can be
+ used to pass extra options to the vacuumdb
+ application during the script execution such as --jobs.
 
 
 
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index fc5aa7010f..5114df01b5 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -495,15 +495,31 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 			ECHO_QUOTE, ECHO_QUOTE);
 	fprintf(script, "echo %sthis script and run:%s\n",
 			ECHO_QUOTE, ECHO_QUOTE);
-	fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
-			new_cluster.bindir, user_specification.data,
-	/* Did we copy the free space files? */
-			(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
-			"--analyze-only" : "--analyze", ECHO_QUOTE);
+	if (user_opts.jobs <= 1 || GET_MAJOR_VERSION(new_cluster.major_version) < 905)
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
+	else
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s-j %d --all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data, user_opts.jobs,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
 	fprintf(script, "echo%s\n\n", ECHO_BLANK);
 
-	fprintf(script, "\"%s/vacuumdb\" %s--all --analyze-in-stages\n",
+#ifndef WIN32
+	fprintf(script, "\"%s/vacuumdb\" %s$VACUUMDB_OPTS --all --analyze-in-stages\n",
 			new_cluster.bindir, user_specification.data);
+#else
+	fprintf(script, "\"%s\\vacuumdb.exe\" %s%%VACUUMDB_OPTS%% --all --analyze-in-stages\n",
+			new_cluster.bindir, user_specification.data);
+#endif
 	/* Did we copy the free space files? */
 	if (GET_MAJOR_VERSION(old_cluster.major_version) < 804)
 		fprintf(script, "\"%s/vacuumdb\" %s--all\n", new_cluster.bindir,
-- 
2.17.2



Re: pg_upgrade: Pass -j down to vacuumdb

2019-01-25 Thread Jesper Pedersen

Hi Kirk,

On 1/24/19 9:31 PM, Jamison, Kirk wrote:

According to CF app, this patch needs review so I took a look at it.
Currently, your patch applies and builds cleanly, and all tests are also 
successful
based from the CF bot patch tester.

I'm not sure if I have understood the argument raised by Peter correctly.
Quoting Peter, "it's not clear that pg_upgrade and vacuumdb are bound the same way, 
so it's not a given that the same -j number should be used."
I think it's whether the # jobs for pg_upgrade is used the same way for 
parallel vacuum.

According to the official docs, the recommended setting for pg_upgrade --j 
option is the maximum of the number of CPU cores and tablespaces. [1]
As for vacuumdb, parallel vacuum's (-j) recommended setting is based from your 
max_connections [2], which is the max # of concurrent connections to your db 
server.



Thanks for your feedback !

As per Peter's comments, I have changed the patch (v2) to not pass the 
-j option down to vacuumdb.


Only the documentation and console output are updated, in order to make 
the behavior clearer.


Best regards,
 Jesper



Re: speeding up planning with partitions

2019-01-21 Thread Jesper Pedersen

Hi,

On 1/21/19 4:01 AM, Amit Langote wrote:

Rebased again due to 8d8dcead1295.



Passes check-world on 8d8dcead1295.

I ran a workload on the series, measuring the time to load the data 
plus the run time to complete the workload.


         Load    Run
master   3m39s   144s
1778     3m12s   36s
NP [*]   2m20s   11s

[*] Non partitioned case.

Best regards,
 Jesper



Re: partitioned tables referenced by FKs

2019-01-08 Thread Jesper Pedersen

Hi,

On 1/7/19 1:07 PM, Jesper Pedersen wrote:
While I'm still looking at 0004 - should we have a test that updates one 
of the constraints of fk to another partition in pk?




In addition:

* Document pre_drop_class_check()
* I think index_get_partition() would be more visible in partition.c
* Remove code in #if 0 block

I have tested this with a hash based setup.

Thanks again !

Best regards,
 Jesper



Re: partitioned tables referenced by FKs

2019-01-07 Thread Jesper Pedersen

Hi,

On 11/30/18 2:12 PM, Alvaro Herrera wrote:

Here's a more credible version of this patch series.



The patch series applies with hunks, and passes check-world.

No comments for 0001, 0002, 0003 and 0005.

While I'm still looking at 0004 - should we have a test that updates one 
of the constraints of fk to another partition in pk?


I didn't try this with [1] yet, so I'm sending you a flamegraph of 
INSERTs off-list, since data loading takes a while during an OLTP-based 
workload.


Thanks for working on this !

[1] https://commitfest.postgresql.org/21/1897/

Best regards,
 Jesper



Re: pg_upgrade: Pass -j down to vacuumdb

2019-01-07 Thread Jesper Pedersen

Hi Peter,

On 1/3/19 7:03 AM, Peter Eisentraut wrote:

Perhaps more documentation would be useful here.



Here is v2 that just notes that the -j option isn't passed down.

Best regards,
 Jesper
>From fe0be6c1f5cbcaeb2981ff4dcfceae2e2cb286b7 Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 21 Dec 2018 05:05:31 -0500
Subject: [PATCH] Highlight that the -j option isn't passed down to vacuumdb by
 default.

Author: Jesper Pedersen 
---
 doc/src/sgml/ref/pgupgrade.sgml |  3 ++-
 src/bin/pg_upgrade/check.c  | 21 -
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7e1afaf0a5..937e95688d 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -395,7 +395,8 @@ NET STOP postgresql-
  in parallel;  a good place to start is the maximum of the number of
  CPU cores and tablespaces.  This option can dramatically reduce the
  time to upgrade a multi-database server running on a multiprocessor
- machine.
+ machine. Note, that this option isn't passed to the
+ vacuumdb application by default.
 
 
 
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index fc5aa7010f..7bae43a0b1 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -495,11 +495,22 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 			ECHO_QUOTE, ECHO_QUOTE);
 	fprintf(script, "echo %sthis script and run:%s\n",
 			ECHO_QUOTE, ECHO_QUOTE);
-	fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
-			new_cluster.bindir, user_specification.data,
-	/* Did we copy the free space files? */
-			(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
-			"--analyze-only" : "--analyze", ECHO_QUOTE);
+	if (user_opts.jobs <= 1 || GET_MAJOR_VERSION(new_cluster.major_version) < 905)
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
+	else
+	{
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s-j %d --all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data, user_opts.jobs,
+/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	}
 	fprintf(script, "echo%s\n\n", ECHO_BLANK);
 
 	fprintf(script, "\"%s/vacuumdb\" %s--all --analyze-in-stages\n",
-- 
2.17.2



Re: pg_upgrade: Pass -j down to vacuumdb

2019-01-02 Thread Jesper Pedersen

Hi,

On 12/29/18 10:03 AM, Peter Eisentraut wrote:

On 21/12/2018 11:12, Jesper Pedersen wrote:

Here is a patch that passes the -j option from pg_upgrade down to
vacuumdb if supported.


It's not clear to me that that is what is usually wanted.

The point of the post-upgrade analyze script is specifically to do the
analyze in a gentle fashion.  Mixing in parallelism would seem to defeat
that a bit.



Well, that really depends. The user passed -j to pg_upgrade in order for 
the upgrade to happen faster, so maybe they would expect, as I would, 
that the ANALYZE phase would happen in parallel too.


At least there should be a "notice" about what the script will do in 
this case.


Best regards,
 Jesper



pg_upgrade: Pass -j down to vacuumdb

2018-12-21 Thread Jesper Pedersen

Hi Hackers,

Here is a patch that passes the -j option from pg_upgrade down to 
vacuumdb if supported.


I'll add it to the January 'Fest.

Thanks for considering !

Best regards,
 Jesper
>From ea941f942830589469281e0d5c17740469c6aebc Mon Sep 17 00:00:00 2001
From: jesperpedersen 
Date: Fri, 21 Dec 2018 05:05:31 -0500
Subject: [PATCH] Pass the -j option down to vacuumdb if supported.

Author: Jesper Pedersen 
---
 src/bin/pg_upgrade/check.c | 25 ++---
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index f2215b7095..102221a201 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -495,15 +495,26 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 			ECHO_QUOTE, ECHO_QUOTE);
 	fprintf(script, "echo %sthis script and run:%s\n",
 			ECHO_QUOTE, ECHO_QUOTE);
-	fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
-			new_cluster.bindir, user_specification.data,
-	/* Did we copy the free space files? */
-			(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
-			"--analyze-only" : "--analyze", ECHO_QUOTE);
+	if (user_opts.jobs <= 1 || GET_MAJOR_VERSION(new_cluster.major_version) < 905)
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s--all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data,
+		/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
+	else
+		fprintf(script, "echo %s\"%s/vacuumdb\" %s-j %d --all %s%s\n", ECHO_QUOTE,
+new_cluster.bindir, user_specification.data, user_opts.jobs,
+		/* Did we copy the free space files? */
+(GET_MAJOR_VERSION(old_cluster.major_version) >= 804) ?
+"--analyze-only" : "--analyze", ECHO_QUOTE);
 	fprintf(script, "echo%s\n\n", ECHO_BLANK);
 
-	fprintf(script, "\"%s/vacuumdb\" %s--all --analyze-in-stages\n",
-			new_cluster.bindir, user_specification.data);
+	if (user_opts.jobs <= 1 || GET_MAJOR_VERSION(new_cluster.major_version) < 905)
+		fprintf(script, "\"%s/vacuumdb\" %s--all --analyze-in-stages\n",
+new_cluster.bindir, user_specification.data);
+	else
+		fprintf(script, "\"%s/vacuumdb\" %s-j %d --all --analyze-in-stages\n",
+new_cluster.bindir, user_specification.data, user_opts.jobs);
 	/* Did we copy the free space files? */
 	if (GET_MAJOR_VERSION(old_cluster.major_version) < 804)
 		fprintf(script, "\"%s/vacuumdb\" %s--all\n", new_cluster.bindir,
-- 
2.17.2



Re: Index Skip Scan

2018-11-16 Thread Jesper Pedersen

Hi,

On 11/15/18 6:41 AM, Alexander Kuzmenkov wrote:
But having this logic inside _bt_next means that it will make a 
non-skip index only scan a bit slower, am I right?


Correct.


Well, it depends on how the skip scan is implemented. We don't have to 
make normal scans slower, because skip scan is just a separate thing.


My main point was that current implementation is good as a proof of 
concept, but it is inefficient for some data and needs some unreliable 
planner logic to work around this inefficiency. And now we also have 
execution-time fallback because planning-time fallback isn't good 
enough. This looks overly complicated. Let's try to design an algorithm 
that works well regardless of the particular data and doesn't need these 
heuristics. It should be possible to do so for btree.


Of course, I understand the reluctance to implement an entire new type 
of btree traversal. Anastasia Lubennikova suggested a tweak for the 
current method that may improve the performance for small groups of 
equal values. When we search for the next unique key, first check if it 
is contained on the current btree page using its 'high key'. If it is, 
find it on the current page. If not, search from the root of the tree 
like we do now. This should improve the performance for small equal 
groups, because there are going to be several such groups on the page. 
And this is exactly where we have the regression compared to unique + 
index scan.




Robert suggested something similar in [1]. I'll try and look at that 
when I'm back from my holiday.


By the way, what is the data for which we intend this feature to work? 
Obviously a non-unique btree index, but how wide are the tuples, and how 
big the equal groups? It would be good to have some real-world examples.




Although my primary use case is int, I agree that we should test 
different data types and tuple widths.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobb3uN0xDqTRu7f7WdjGRAXpSFxeAQnvNr%3DOK5_kC_SSg%40mail.gmail.com


Best regards,
 Jesper



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-13 Thread Jesper Pedersen

Hi,

On 11/12/18 6:17 PM, David Rowley wrote:

On 9 November 2018 at 19:18, Amit Langote  wrote:

I have a comment regarding how you chose to make
PartitionTupleRouting private.

Using the v14_to_v15 diff, I could quickly see that there are many diffs
changing PartitionTupleRouting to struct PartitionTupleRouting, but they
would be unnecessary if you had added the following in execPartition.h, as
my upthread had done.

-/* See execPartition.c for the definition. */
+/* See execPartition.c for the definitions. */
  typedef struct PartitionDispatchData *PartitionDispatch;
+typedef struct PartitionTupleRouting PartitionTupleRouting;


Okay, done that way. v16 attached.



Still passes check-world.

Best regards,
 Jesper




Re: speeding up planning with partitions

2018-11-09 Thread Jesper Pedersen

Hi Amit,

On 11/9/18 3:55 AM, Amit Langote wrote:

And here are patches structured that way.  I've addressed some of the
comments posted by Imai-san upthread.  Also, since David's patch to
postpone PlannerInfo.total_pages calculation went into the tree earlier
this week, it's no longer included in this set.



Thanks for doing the split this way. The patch passes check-world.

I ran a SELECT test using hash partitions, and got

        Master   v5
64:     6k       59k
1024:   283      59k

The non-partitioned case gives 77k. The difference in TPS between the 
partitioned and non-partitioned cases comes down to 
set_plain_rel_size() vs. set_append_rel_size() under 
set_base_rel_sizes(); flamegraphs for this were sent off-list.


Best regards,
 Jesper


