[HACKERS] Authentication mechanisms categorization

2017-07-07 Thread Álvaro Hernández Tortosa


There has been some prior discussion, that we recently continued at 
pgday.ru, about what to do if a client wants to use a "strong" 
authentication mechanism but a rogue server forces the client to use a 
weaker authentication mechanism. This is the case if the client expects 
SCRAM to be used but a rogue server just replies with 
AuthenticationCleartextPassword, for example. Client drivers will 
authenticate using this latter mechanism transparently (at least the 
pgjdbc implementation does, and I believe libpq does too). This somewhat 
defeats the purpose of mechanisms like SCRAM.


It was discussed to add a parameter to the driver like 
"SCRAM-only", but I think this may not be ideal. "SCRAM-only" means that 
code needs to be written to explicitly reject every other authentication 
mechanism, which is far from ideal. Much worse, it rules out other auth 
mechanisms that might be acceptable to the user. Also, this 
doesn't consider whether SCRAM is good enough without channel binding.


I think it would be better to make a categorization of 
authentication mechanisms and then have an agreement among libpq and 
drivers to set a minimum level of security based on the user's request. 
Some initial ideas are:


- Three security levels: Basic, Medium, Advanced.
- Prevents MITM / does not.
- Given X possible attacks, a matrix of which mechanisms prevent 
which attacks (something similar to the table comparing the possible 
effects of the different isolation levels).


This is not trivial: for example, SCRAM may be OK without channel 
binding in the presence of SSL, but without SSL channel binding is a 
must to prevent MITM. Similarly, are other auth mechanisms like Kerberos 
(I'm not an expert here) as "safe" as SCRAM with or without channel 
binding?


I believe this should be discussed so that a common agreement can 
be reached and implemented by libpq and all the drivers, including a 
single naming scheme for the parameter and its possible values. Opinions?
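To make this concrete, here is a minimal sketch of what such a client-side check could look like. The tier assignments are purely illustrative (not a proposed categorization), and `check_auth_request` is a hypothetical helper; the numeric codes, however, are the AuthenticationRequest codes from the frontend/backend protocol.

```python
# Hypothetical client-side "minimum auth security" check (sketch only).
# The tier mapping is illustrative; the keys are AuthenticationRequest
# codes from the PostgreSQL frontend/backend protocol.
AUTH_TIER = {
    3: "basic",      # AuthenticationCleartextPassword
    5: "basic",      # AuthenticationMD5Password
    7: "medium",     # AuthenticationGSS
    9: "medium",     # AuthenticationSSPI
    10: "advanced",  # AuthenticationSASL (SCRAM-SHA-256 in PG 10)
}
TIER_RANK = {"basic": 0, "medium": 1, "advanced": 2}

def check_auth_request(auth_code: int, minimum: str) -> None:
    """Refuse to proceed if the server-requested mechanism is too weak."""
    tier = AUTH_TIER.get(auth_code)
    if tier is None or TIER_RANK[tier] < TIER_RANK[minimum]:
        raise ConnectionError(
            f"server requested auth code {auth_code} ({tier}); "
            f"required minimum is {minimum!r}")

check_auth_request(10, "advanced")      # SCRAM: accepted
try:
    check_auth_request(3, "advanced")   # downgrade to cleartext: rejected
except ConnectionError as e:
    print("rejected:", e)
```

With such a check, a downgrade by a rogue server fails before any password material is sent, regardless of which specific weaker mechanism it offered.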



Álvaro

--

Álvaro Hernández Tortosa





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread Mark Kirkwood

On 07/07/17 19:54, Michael Banck wrote:


On Fri, Jul 07, 2017 at 07:40:55PM +1200, Mark Kirkwood wrote:

On 07/07/17 13:29, Amit Langote wrote:

Someone complained about this a while back [1].  And then it came up again
[2], where Noah appeared to take a stance that partitions should be
visible in views / output of commands that list "tables".

Although I too tend to prefer not filling up the \d output space by
listing partitions (pg_class.relispartition = true relations), perhaps
there wasn't enough push for creating a patch for that.  If some
committer is willing to consider such a patch, I can make one.

Yeah, me too (clearly). However, if the consensus is that all these partition
tables *must* be shown in \d output, then I'd be happy if they were
identified as such rather than just as 'table' (e.g. 'partition table').

+1.

Or maybe just 'partition' is enough if 'partition table' would widen the
column output unnecessarily.




Yeah, that is probably better (and 'partition table' is potentially 
confusing as Robert pointed out).


Cheers

Mark





Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-07 Thread Álvaro Hernández Tortosa



On 06/07/17 04:03, Tatsuo Ishii wrote:

Hi PostgreSQL hackers,

I would like to hear ideas how Pgpool-II can deal with SCRAM auth
which will be in PostgreSQL 10.

For those who are not familiar with Pgpool-II[1], it is an external
OSS project to provide some additional features to PostgreSQL,
including load balancing and automatic failover. Pgpool-II works as a
proxy between PostgreSQL client and PostgreSQL server(s).

When a client wants to connect to PostgreSQL and SCRAM auth is
enabled, it sends the user name to the server. Then the server sends
information including a salt to the client. The client computes a
"ClientProof" using the salt and other information, and sends it to
the server[2].

For Pgpool-II, things would go as follows:

1) The client sends the user name to Pgpool-II.
2) Pgpool-II forwards it to the PostgreSQL servers.
3) Each PostgreSQL server sends its own salt to Pgpool-II.
4) Pgpool-II is confused because there are multiple salts, each with a
different value. The client obviously accepts only a single salt.

So my question is: is there any solution or workaround for problem
#4? Someone at PGCon 2017 suggested that the problem could be avoided
if the auth method between the client and Pgpool-II were "trust" (which
means no auth). But this does not seem like the best solution to me
because it would weaken security.

I suspect this problem may not be specific to Pgpool-II. Any
middleware which handles multiple PostgreSQL servers could have a
similar problem.

Any suggestion would be appreciated. Thanks in advance.

[1] https://pgpool.net
[2] https://tools.ietf.org/html/rfc5802#section-3



Hi Tatsuo.

There's definitely an important concern here that should be 
addressed: how poolers/proxies/middleware/etc can deal with SCRAM, 
specifically in the context of channel binding.


If there is to be a single connection from client to PostgreSQL 
server, intercepted by pgpool to perform the magic foo, then channel 
binding is, indeed, designed to defeat this. If, however, pgpool or the 
middleware manages two separate connections (client<->pool and 
pool<->PG) then there is some light here.


One SCRAM feature not implemented as of today is the authzid 
(authorization identity: see 
https://tools.ietf.org/html/rfc5802#page-10, SCRAM attribute "a" and 
https://tools.ietf.org/html/rfc5801). Authzid basically means "I want 
to authenticate as user X and, once authenticated, consider me user 
Y". With authzid, a pool/proxy may have a common user name with its 
own SCRAM credentials to authenticate with the backend PostgreSQL, and 
pass the authzid with the real username (the one provided on the 
client<->pool connection).


This would require:

a) That authzid is implemented in PostgreSQL.
b) A mechanism in PG to name which user(s) Y are allowed to be 
authenticated by user X. This is similar, but not identical, to the 
current SET ROLE.


From a SCRAM protocol perspective, this is very simple, just an 
extra attribute. Part b) may need more discussion.
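From the protocol side, the extra attribute amounts to an "a=" entry in the GS2 header of the client-first-message (RFC 5802). A sketch, with hypothetical user names ("pool" is the pooler's own SCRAM identity, "alice" the real end user it wants to act for):

```python
# Sketch of a SCRAM client-first-message carrying an authzid (RFC 5802).
# "pool" and "alice" are hypothetical names for illustration.
import base64, os

def client_first_message(authcid: str, authzid: str, nonce: str) -> str:
    # gs2-header = gs2-cbind-flag "," [ "a=" authzid ] ","
    # ("n" = client does not support channel binding)
    gs2_header = f"n,a={authzid},"
    # client-first-message-bare = "n=" username ",r=" nonce
    return gs2_header + f"n={authcid},r={nonce}"

nonce = base64.b64encode(os.urandom(18)).decode()
print(client_first_message("pool", "alice", nonce))
# e.g. n,a=alice,n=pool,r=<random nonce>
```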


I believe this could be of help to your case and other middleware.


Álvaro


--

Álvaro Hernández Tortosa







Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-07 Thread Amit Kapila
On Sat, Jul 8, 2017 at 9:08 AM, Robert Haas  wrote:
> On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila  wrote:
>> I think we should do that as a separate patch (I can write the same as
>> well) because that is not new behavior introduced by this patch, ...
>
> True, although maybe hash indexes are the only way that happens today?
>

No, I think it will happen for all other indexes as well.  Basically,
we log a new page for the init forks of other indexes, and then while
restoring that full-page image, it will fall through the same path.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-07 Thread Robert Haas
On Fri, Jul 7, 2017 at 11:36 PM, Amit Kapila  wrote:
> I think we should do that as a separate patch (I can write the same as
> well) because that is not new behavior introduced by this patch, ...

True, although maybe hash indexes are the only way that happens today?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-07 Thread Amit Kapila
On Sat, Jul 8, 2017 at 8:52 AM, Robert Haas  wrote:
> On Thu, Jul 6, 2017 at 5:54 AM, Kyotaro HORIGUCHI
>  wrote:
>> Checks for INIT_FORKNUM appear both with and without an
>> accompanying check for RELPER.._UNLOGGED, so I'm not sure which
>> is the convention here.
>
> Checking only for INIT_FORKNUM seems logical to me.  Also checking for
> RELPERSISTENCE_UNLOGGED just makes the code longer to no benefit.  I
> guess Amit copied the test from ATExecSetTablespace, which does it as
> he did, but it seems unnecessarily long-winded.
>

Okay.  If you and Michael feel that doing the check that way is better,
I will change it and submit the patch.

>> The difference between the two is the init fork of TEMP
>> relations. However, since I believe an init fork is by definition
>> part of an unlogged relation, it seems to me that it ought not to
>> be WAL-logged if we had it. From this viewpoint, the middle line
>> makes sense.
>
> Actually, the init fork of an unlogged relation *must* be WAL-logged.
> All other forks are removed on a crash (and the main fork is copied
> anew from the init fork).  But the init fork itself must survive -
> therefore it, and only it, must be WAL-logged and thus durable.
>
>> By the way, the comment of the function ReadBufferWithoutRelcache
>> has the following sentence:
>>
>> | * NB: At present, this function may only be used on permanent relations,
>> | * which is OK, because we only use it during XLOG replay.  If in the
>> | * future we want to use it on temporary or unlogged relations, we could
>> | * pass additional parameters.
>>
>> and does
>>
>> | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
>> |                          mode, strategy, );
>>
>> This surely works since BufferAlloc recognizes INIT_FORK. But
>> some adjustment may be needed around here.
>
> Yeah, it should probably mention that the init fork of an unlogged
> relation is also OK.
>

I think we should do that as a separate patch (I can write the same as
well) because that is not new behavior introduced by this patch, but
let me know if you think that we should add such a comment in this
patch itself.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-07 Thread Robert Haas
On Thu, Jul 6, 2017 at 5:54 AM, Kyotaro HORIGUCHI
 wrote:
> Checks for INIT_FORKNUM appear both with and without an
> accompanying check for RELPER.._UNLOGGED, so I'm not sure which
> is the convention here.

Checking only for INIT_FORKNUM seems logical to me.  Also checking for
RELPERSISTENCE_UNLOGGED just makes the code longer to no benefit.  I
guess Amit copied the test from ATExecSetTablespace, which does it as
he did, but it seems unnecessarily long-winded.

> The difference between the two is the init fork of TEMP
> relations. However, since I believe an init fork is by definition
> part of an unlogged relation, it seems to me that it ought not to
> be WAL-logged if we had it. From this viewpoint, the middle line
> makes sense.

Actually, the init fork of an unlogged relation *must* be WAL-logged.
All other forks are removed on a crash (and the main fork is copied
anew from the init fork).  But the init fork itself must survive -
therefore it, and only it, must be WAL-logged and thus durable.

> By the way, the comment of the function ReadBufferWithoutRelcache
> has the following sentence:
>
> | * NB: At present, this function may only be used on permanent relations,
> | * which is OK, because we only use it during XLOG replay.  If in the
> | * future we want to use it on temporary or unlogged relations, we could
> | * pass additional parameters.
>
> and does
>
> | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
> |                          mode, strategy, );
>
> This surely works since BufferAlloc recognizes INIT_FORK. But
> some adjustment may be needed around here.

Yeah, it should probably mention that the init fork of an unlogged
relation is also OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-07 Thread Robert Haas
On Wed, Jul 5, 2017 at 11:02 PM, Noah Misch  wrote:
> [Action required within three days.  This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item.  Robert,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.

I don't intend to rush this in before the beta2 wrap, although some
other committer is welcome to do so if they feel confident in the fix.
I will update again by July 17th.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-07 Thread Tatsuo Ishii
>> For more details on what Pgpool-II actually does in md5 auth, please see:
>> 
>> https://pgpool.net/mediawiki/index.php/FAQ#How_does_pgpool-II_handle_md5_authentication.3F
> 
> I see.  In short, that's not going to be able to work with SCRAM.

That's my understanding with SCRAM too.

> We also certainly can't weaken or break SCRAM to address Pgpool's needs.

Right, that's not what I want either.

> I'm afraid an alternative approach will need to be considered for Pgpool
> to be able to do what it does today, as I outlined in my other email.

Thank you for your comment. Probably now is the time to give up
on supporting SCRAM in Pgpool-II.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-07 Thread Stephen Frost
Tatsuo,

* Tatsuo Ishii (is...@sraoss.co.jp) wrote:
> > I vaguely recall Ishii-san mentioning to me at PGCon that pgpool was
> > actually cheating, but my memories are a bit fuzzy on this part.
> 
> What I meant by "cheating" was that Pgpool-II behaves as if it were
> the PostgreSQL server in md5 auth.
> 
> For more details on what Pgpool-II actually does in md5 auth, please see:
> 
> https://pgpool.net/mediawiki/index.php/FAQ#How_does_pgpool-II_handle_md5_authentication.3F

I see.  In short, that's not going to be able to work with SCRAM.

We also certainly can't weaken or break SCRAM to address Pgpool's needs.
I'm afraid an alternative approach will need to be considered for Pgpool
to be able to do what it does today, as I outlined in my other email.

Thanks!

Stephen




Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-07 Thread Stephen Frost
Michael,

* Michael Paquier (michael.paqu...@gmail.com) wrote:
> On Sat, Jul 8, 2017 at 1:24 AM, Robert Haas  wrote:
> > IIUC, things will get even worse once channel binding is committed,
> > presumably for PostgreSQL 11.  The point of channel binding is to
> > guarantee that you are conducting the authentication exchange with the
> > target server, not some intermediate proxy that might be conducting a
> > hostile MITM attack.  pgpool may not be a hostile attack, but it is
> > acting as a MITM, and channel binding is going to figure that out and
> > fail the authentication.  So unless I'm misunderstanding, the solution
> > you are proposing figures to have a very limited shelf life.
> 
> We may be misunderstanding each other then. The solution I am
> proposing, or at least the solution that I imagine should be done
> (though my pgpool-fu is limited), is to keep around a connection
> context and an SSL context per server, to handle connection messages
> properly, and to switch between them. When using channel binding, the
> client needs the server certificate for end-point and the TLS finish
> message, which are both stored in the SSL context after the handshake.
> So you need to cache that for each server.

I'm not sure what you mean here: the whole point of channel binding is
to prevent exactly the kind of MITM that pgPool would essentially be in
this case.  If you're suggesting that things be changed such that the
server provides pgPool with the information necessary to pass along to
the client, so that the client believes it's talking to the server and
that there's no man-in-the-middle, then the whole point of channel
binding goes out the window.  If what you're suggesting doesn't require
any help from either the client or the server to work, then I fear we've
done something wrong in the implementation of channel binding (I've not
looked at it very closely).

I don't have any perfect answers for pgPool here, unfortunately.  One
approach would be to give it a list of usernames/passwords to use,
though that's hardly ideal.  Perhaps it would be possible for pgPool to
impersonate the server to the client and the client to the server if it
was able to query the database as a superuser on some other connection,
but I don't *think* so because of the way SCRAM works and because pgPool
wouldn't have access to the client's cleartext password.

Of course, if pgPool could convince the client to use 'password' auth
and get the cleartext password from the client then that would work to
authenticate against the server and there wouldn't be any channel
binding involved since the client is talking the basic cleartext
password protocol and not SCRAM.

> One problem is that pgpool does not use libpq directly but tries to
> handle things by itself at the protocol level, so it needs to
> replicate a lot of existing APIs. Postgres-XC/XL use a connection
> pooler; presumably XL even uses a background worker for this stuff,
> since it has been integrated with Postgres 9.3. And that stuff uses
> libpq, which makes it much easier to handle context data on a
> connection basis. Those don't need to handle protocol-level messages
> either, which is perhaps different from what pgpool needs.

I don't really think that pgpool using or not using libpq makes any
real difference here.  What'll be an issue for pgpool is when clients
get updated libpq libraries that don't support password auth...

> In short, I would think that pgpool first needs to un-cheat its MD5
> handling, which would likely simplify its code.

This also doesn't seem particularly relevant to me, but then again, I've
never actually looked at the pgPool code myself.

Thanks!

Stephen




Re: [HACKERS] Rust bindings to pgtypes lib

2017-07-07 Thread Andres Freund
On 2017-07-07 12:54:33 +0200, Michael Meskes wrote:
> > Some people use http://libpqtypes.esilo.com/
>
> Never saw this before. It does not seem to have more than the name
> in common, though.

It has binary to text conversion functionality for various types.  I
don't exactly know what Kaare needs, but that seems like a relevant
need.

- Andres




Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-07 Thread Tatsuo Ishii
> I vaguely recall Ishii-san mentioning to me at PGCon that pgpool was
> actually cheating, but my memories are a bit fuzzy on this part.

What I meant by "cheating" was that Pgpool-II behaves as if it were
the PostgreSQL server in md5 auth.

For more details on what Pgpool-II actually does in md5 auth, please see:

https://pgpool.net/mediawiki/index.php/FAQ#How_does_pgpool-II_handle_md5_authentication.3F

>> IIUC, things will get even worse once channel binding is committed,
>> presumably for PostgreSQL 11.  The point of channel binding is to
>> guarantee that you are conducting the authentication exchange with the
>> target server, not some intermediate proxy that might be conducting a
>> hostile MITM attack.  pgpool may not be a hostile attack, but it is
>> acting as a MITM, and channel binding is going to figure that out and
>> fail the authentication.  So unless I'm misunderstanding, the solution
>> you are proposing figures to have a very limited shelf life.
> 
> We may be misunderstanding each other then. The solution I am
> proposing, or at least the solution that I imagine should be done
> (though my pgpool-fu is limited), is to keep around a connection
> context and an SSL context per server, to handle connection messages
> properly, and to switch between them. When using channel binding, the
> client needs the server certificate for end-point and the TLS finish
> message, which are both stored in the SSL context after the handshake.
> So you need to cache that for each server.
> 
> One problem is that pgpool does not use libpq directly but tries to
> handle things by itself at the protocol level, so it needs to
> replicate a lot of existing APIs. Postgres-XC/XL use a connection
> pooler; presumably XL even uses a background worker for this stuff,
> since it has been integrated with Postgres 9.3. And that stuff uses
> libpq, which makes it much easier to handle context data on a
> connection basis. Those don't need to handle protocol-level messages
> either, which is perhaps different from what pgpool needs.

Yeah, I wish I could use libpq in Pgpool-II. Unfortunately, libpq does
not provide the features that Pgpool-II needs.

> In short, I would think that pgpool first needs to un-cheat its MD5
> handling, which would likely simplify its code.

See the link above for why that's not possible.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-07 Thread Tatsuo Ishii
> I think it is.  From a security point of view, the fact that the same
> salt is always used for the same username is a weakness of md5
> authentication which SCRAM corrects.

In my understanding, md5 does not always use the same salt for the
same username. PostgreSQL keeps md5(password+username) in
pg_authid. When md5 auth is required, the backend sends a random
number (i.e. the salt) to the client. The client replies with
md5(salt+md5(password+username)) to the backend. The backend does the
same calculation (md5(salt+md5(password+username))). If the result
matches the value sent by the user, md5 authentication succeeds.

So the salt differs in each session with md5.

The weakness of md5 is, IMO, that each PostgreSQL installation always
keeps the same value (md5(password+username)).
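The exchange described above can be sketched in a few lines (this is just the md5(salt+md5(password+username)) scheme from the explanation, not PostgreSQL's actual code):

```python
# Sketch of PostgreSQL's md5 password exchange as described above.
import hashlib, os

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def stored_value(user: str, password: str) -> str:
    # What pg_authid keeps: constant for a given installation/password
    return "md5" + md5_hex(password.encode() + user.encode())

def client_response(user: str, password: str, salt: bytes) -> str:
    inner = md5_hex(password.encode() + user.encode())
    return "md5" + md5_hex(inner.encode() + salt)

user, password = "alice", "secret"
salt = os.urandom(4)   # fresh random 4-byte salt each session
stored = stored_value(user, password)

# The server verifies using only the stored hash and the salt:
expected = "md5" + md5_hex(stored[3:].encode() + salt)
assert expected == client_response(user, password, salt)
```

Note that verification needs only `stored`, so anyone who obtains pg_authid can answer any challenge, which is exactly the per-installation weakness mentioned above.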

> IIUC, things will get even worse once channel binding is committed,
> presumably for PostgreSQL 11.  The point of channel binding is to
> guarantee that you are conducting the authentication exchange with the
> target server, not some intermediate proxy that might be conducting a
> hostile MITM attack.  pgpool may not be a hostile attack, but it is
> acting as a MITM, and channel binding is going to figure that out and
> fail the authentication.  So unless I'm misunderstanding, the solution
> you are proposing figures to have a very limited shelf life.

Check...

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-07 Thread Michael Paquier
On Sat, Jul 8, 2017 at 1:24 AM, Robert Haas  wrote:
> On Wed, Jul 5, 2017 at 9:32 PM, Michael Paquier
>  wrote:
>> On Thu, Jul 6, 2017 at 10:03 AM, Tatsuo Ishii  wrote:
>>> For Pgpool-II, things would go as follows:
>>>
>>> 1) The client sends the user name to Pgpool-II.
>>> 2) Pgpool-II forwards it to the PostgreSQL servers.
>>> 3) Each PostgreSQL server sends its own salt to Pgpool-II.
>>> 4) Pgpool-II is confused because there are multiple salts, each with a
>>> different value. The client obviously accepts only a single salt.
>>
>> Couldn't you cache one single SASL exchange status for each
>> connection, meaning one PGconn saved for each? As the challenge sent
>> by the server and the response generated by the client are different
>> by design, I am afraid you would need to do that anyway in this
>> context (isn't PG-pool already using the weaknesses of MD5 to make
>> things easier?).
>
> I think it is.  From a security point of view, the fact that the same
> salt is always used for the same username is a weakness of md5
> authentication which SCRAM corrects.

I vaguely recall Ishii-san mentioning to me at PGCon that pgpool was
actually cheating, but my memories are a bit fuzzy on this part.

> IIUC, things will get even worse once channel binding is committed,
> presumably for PostgreSQL 11.  The point of channel binding is to
> guarantee that you are conducting the authentication exchange with the
> target server, not some intermediate proxy that might be conducting a
> hostile MITM attack.  pgpool may not be a hostile attack, but it is
> acting as a MITM, and channel binding is going to figure that out and
> fail the authentication.  So unless I'm misunderstanding, the solution
> you are proposing figures to have a very limited shelf life.

We may be misunderstanding each other then. The solution I am
proposing, or at least the solution that I imagine should be done
(though my pgpool-fu is limited), is to keep around a connection
context and an SSL context per server, to handle connection messages
properly, and to switch between them. When using channel binding, the
client needs the server certificate for end-point and the TLS finish
message, which are both stored in the SSL context after the handshake.
So you need to cache that for each server.
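To illustrate why per-server SSL context matters here: the two channel-binding inputs (RFC 5929) are derived from each TLS leg independently, so a pooler in the middle necessarily sees different values on its two legs. A toy sketch (the certificate bytes are placeholders; tls-server-end-point hashes the server's DER certificate, normally with SHA-256):

```python
# Toy illustration: channel-binding data differs on the two legs of a
# proxied connection. The "certificates" are placeholder bytes.
import hashlib

der_cert_pool = b"placeholder: pool's certificate on the client leg"
der_cert_backend = b"placeholder: backend's certificate on the server leg"

cb_client_leg = hashlib.sha256(der_cert_pool).digest()
cb_server_leg = hashlib.sha256(der_cert_backend).digest()

# The SCRAM proofs bind these values, so relaying one leg's proof to
# the other leg fails verification:
assert cb_client_leg != cb_server_leg
```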

One problem is that pgpool does not use libpq directly but tries to
handle things by itself at the protocol level, so it needs to
replicate a lot of existing APIs. Postgres-XC/XL use a connection
pooler; presumably XL even uses a background worker for this stuff,
since it has been integrated with Postgres 9.3. And that stuff uses
libpq, which makes it much easier to handle context data on a
connection basis. Those don't need to handle protocol-level messages
either, which is perhaps different from what pgpool needs.

In short, I would think that pgpool first needs to un-cheat its MD5
handling, which would likely simplify its code.
-- 
Michael




Re: [HACKERS] More race conditions in logical replication

2017-07-07 Thread Tom Lane
Alvaro Herrera  writes:
> I'll next update this on or before Monday 10th at 19:00 CLT.

Schedule note --- we'll be wrapping beta2 on Monday, a couple hours
before that.  Although it'd be great to have this issue fixed before
beta2, jamming in a patch just a few hours before wrap is probably
not prudent.  If you can't get it done over the weekend, I'd counsel
holding off till after beta2 is tagged.

regards, tom lane




Re: [HACKERS] More race conditions in logical replication

2017-07-07 Thread Alvaro Herrera
Petr Jelinek wrote:

> So the best idea I could come up with is to make use of the new condition
> variable API. That lets us wait on a variable which can be set per slot.
> 
> It's not backportable, however; I am not sure how big a problem that is
> considering the lack of complaints until now (maybe we could backport
> using the ugly timeout version?).
> 
> The attached patch is a prototype of such a solution, and there are some
> race conditions (the variable can get signaled before the waiting process
> starts sleeping on it). I am mainly sending it to get feedback on the
> approach.

This one looks a lot better than the original one.

I wonder if it's possible to do this using lwlocks instead of condition
variables (similar to how we do the "wait for IO" thing both for slots
and buffers).  We could backport that easily ...

I'll next update this on or before Monday 10th at 19:00 CLT.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Peter Geoghegan
On Fri, Jul 7, 2017 at 12:59 PM, Alvaro Herrera
 wrote:
> Hmm, this seems potentially very useful.  Care to upload it to
> https://wiki.postgresql.org/wiki/Category:Snippets ?

Sure. I've added it here, under "index maintenance":

https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index

It would be a nice additional touch if there were an easy way of taking
the on-disk representation of index tuples (in this case that would be
little-endian signed integers from bt_page_items()) and producing actual
typed values from that output. Maybe just for a few select datatypes.
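As a sketch of the kind of decoding meant here, assuming the `data` column holds a plain little-endian pass-by-value integer key (the function and the padding assumption are mine):

```python
# Sketch: turn pageinspect bt_page_items()-style hex "data" into an int.
# Assumes the datum is a little-endian signed integer (int4/int8 key).
import struct

def decode_int_key(data: str) -> int:
    raw = bytes.fromhex(data.replace(" ", ""))
    if len(raw) >= 8:
        return struct.unpack("<q", raw[:8])[0]   # int8 (or padded int4)
    return struct.unpack("<i", raw[:4])[0]       # bare int4

print(decode_int_key("07 00 00 00 00 00 00 00"))  # 7
```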

-- 
Peter Geoghegan




Re: [HACKERS] More race conditions in logical replication

2017-07-07 Thread Petr Jelinek
On 06/07/17 18:20, Petr Jelinek wrote:
> On 06/07/17 17:33, Petr Jelinek wrote:
>> On 03/07/17 01:54, Tom Lane wrote:
>>> I noticed a recent failure that looked suspiciously like a race condition:
>>>
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet=2017-07-02%2018%3A02%3A07
>>>
>>> The critical bit in the log file is
>>>
>>> error running SQL: 'psql::1: ERROR:  could not drop the replication 
>>> slot "tap_sub" on publisher
>>> DETAIL:  The error was: ERROR:  replication slot "tap_sub" is active for 
>>> PID 3866790'
>>> while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R 
>>> dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION 
>>> tap_sub' at 
>>> /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm
>>>  line 1198.
>>>
>>> After poking at it a bit, I found that I can cause several different
>>> failures of this ilk in the subscription tests by injecting delays at
>>> the points where a slot's active_pid is about to be cleared, as in the
>>> attached patch (which also adds some extra printouts for debugging
>>> purposes; none of that is meant for commit).  It seems clear that there
>>> is inadequate interlocking going on when we kill and restart a logical
>>> rep worker: we're trying to start a new one before the old one has
>>> gotten out of the slot.
>>>
>>
>> Thanks for the test case.
>>
>> It's not actually that we start the new worker too fast. It's that we
>> try to drop the slot right after the worker process was killed, but if
>> the code that clears the slot's active_pid takes too long, the slot
>> still looks like it's in use. I am quite sure it's possible to make
>> this happen for physical replication as well when using slots.
>>
>> This is not something that can be solved by locking on the subscriber.
>> ISTM we need to make pg_drop_replication_slot behave more nicely, like
>> making it wait for the slot to become available (either by default or
>> as an option).
>>
>> I'll have to think about how to do it without rewriting half of
>> replication slots or reimplementing a lock queue, though, because
>> replication slots don't use normal catalog access, so there is no
>> object locking with a wait queue. We could use a latch wait with a
>> small timeout, but that seems ugly, as that function can be called by
>> a user without having dropped the slot before, so the wait can be
>> quite long (as in "forever").
>>
> 
> A naive fix would be something like the attached. But as I said, it's
> not exactly pretty.
> 

So the best idea I could come up with is to make use of the new
condition variable API. That lets us wait on a variable that can be set
per slot.

It's not backportable, however; I am not sure how big a problem that is
considering the lack of complaints until now (maybe we could backport
using the ugly timeout version?).

The attached patch is a prototype of such a solution, and there are
still some race conditions (the variable can get signaled before the
waiting process starts sleeping on it). I am mainly sending it to get
feedback on the approach.
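The shape of the proposed wait can be sketched loosely, using Python's
threading.Condition as a stand-in for PostgreSQL's per-slot
ConditionVariable (the names active_pid/release/acquire mirror the patch
only loosely; none of this is PostgreSQL code):

```python
# Illustrative sketch only: dropping a slot sleeps on a per-slot
# condition variable until the owning process clears active_pid and
# wakes waiters.
import threading

class Slot:
    def __init__(self):
        self.active_pid = 0
        self.cv = threading.Condition()

    def release(self):
        # Clearing active_pid and waking waiters must happen together;
        # if a waiter checks the pid and then sleeps without holding the
        # lock, a signal sent in between is lost -- the race mentioned
        # above.
        with self.cv:
            self.active_pid = 0
            self.cv.notify_all()

    def acquire_for_drop(self, my_pid):
        with self.cv:
            # wait_for re-checks the predicate after every wakeup, so a
            # spurious or early signal cannot be lost.
            self.cv.wait_for(lambda: self.active_pid == 0)
            self.active_pid = my_pid

slot = Slot()
slot.active_pid = 1234             # pretend a walsender owns the slot
t = threading.Timer(0.05, slot.release)
t.start()                          # owner exits shortly and releases
slot.acquire_for_drop(5678)        # drop path blocks until then
print(slot.active_pid)             # 5678
t.join()
```

The predicate re-check on wakeup is what distinguishes this from the
naive latch-plus-timeout approach: the waiter cannot wait "forever" on a
signal it already missed.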

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From b72ea52c865b2d7f0d7d29d0834d71e1ec33d54a Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Thu, 6 Jul 2017 18:16:44 +0200
Subject: [PATCH] Wait for slot to become free when dropping it

---
 src/backend/replication/logical/logicalfuncs.c |  2 +-
 src/backend/replication/slot.c | 43 +-
 src/backend/replication/slotfuncs.c|  2 +-
 src/backend/replication/walsender.c|  6 ++--
 src/include/replication/slot.h |  8 +++--
 5 files changed, 46 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 363ca82..a3ba2b1 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -244,7 +244,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	else
 		end_of_wal = GetXLogReplayRecPtr();
 
-	ReplicationSlotAcquire(NameStr(*name));
+	ReplicationSlotAcquire(NameStr(*name), true);
 
 	PG_TRY();
 	{
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index dc7de20..2993bb9 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -46,6 +46,7 @@
 #include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
@@ -157,6 +158,7 @@ ReplicationSlotsShmemInit(void)
 			/* everything else is zeroed by the memset above */
 			SpinLockInit(&slot->mutex);
 			LWLockInitialize(&slot->io_in_progress_lock, LWTRANCHE_REPLICATION_SLOT_IO_IN_PROGRESS);
+			ConditionVariableInit(&slot->active_cv);
 		}
 	}
 }
@@ -323,7 +325,7 @@ ReplicationSlotCreate(const char *name, bool 

[HACKERS] Subscription code improvements

2017-07-07 Thread Petr Jelinek
Hi,

I have done some review of subscription handling (well self-review) and
here is the result of that (It's slightly improved version from another
thread [1]).

I split it into several patches:

0001 - Makes the subscription worker exit cleanly when the subscription
is missing (i.e. it was removed while the worker was starting) and also
makes the "disabled" message look the same as the message emitted when
the disabled state is detected while the worker is running. This is
mostly just nicer behavior for the user (i.e. no unexpected errors in
the log when you have just disabled the subscription).

0002 - This is a bugfix - the sync worker should exit when it is
waiting for the apply worker and the apply worker dies; otherwise there
is a possibility of not being correctly synchronized.

0003 - Splits the schizophrenic SetSubscriptionRelState function into
two functions which don't try to do a broken upsert and which report
proper errors instead.

0004 - Solves the issue which Masahiko Sawada reported [2] about ALTER
SUBSCRIPTION's handling of workers not being transactional - we now
kill workers transactionally (and we do the same when possible in
DROP SUBSCRIPTION).

0005 - Bugfix to allow syscache access to subscriptions without being
connected to a database - this should work since pg_subscription is a
pinned catalog; it wasn't caught because nothing in core is using it (yet).

0006 - Makes the locking of subscriptions more granular (this depends
on 0005). This removes the ugly AccessExclusiveLock on the
pg_subscription catalog from DROP SUBSCRIPTION and also solves some
potential race conditions between the launcher, ALTER SUBSCRIPTION and
process_syncing_tables_for_apply(). It also simplifies memory handling
in the launcher as well as the logicalrep_worker_stop() function. This
basically makes subscriptions behave like every other object in the
database in terms of locking.

Only 0002, 0004 and 0005 are actual bug fixes, but I'd still like to
get it all into PG10, especially since the locking now behaves really
differently from everything else, and that does not seem like a good
idea.

[1]
https://www.postgresql.org/message-id/flat/3e06c16c-f441-c606-985c-6d6239097...@2ndquadrant.com
[2]
https://www.postgresql.org/message-id/flat/cad21aobd4t2rwtiwd8yfxd+q+m9swx50yjqt5ibj2mzebvn...@mail.gmail.com

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
From 4c1ef64493ee930dfde3aa787c454a5d68ac3efd Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Thu, 6 Jul 2017 23:42:56 +0200
Subject: [PATCH 1/6] Improve messaging during logical replication worker
 startup

---
 src/backend/replication/logical/worker.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0d48dfa..2e4099c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1527,24 +1527,31 @@ ApplyWorkerMain(Datum main_arg)
 		 ALLOCSET_DEFAULT_SIZES);
 	StartTransactionCommand();
 	oldctx = MemoryContextSwitchTo(ApplyContext);
-	MySubscription = GetSubscription(MyLogicalRepWorker->subid, false);
+	MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
+	if (!MySubscription)
+	{
+		ereport(LOG,
+(errmsg("logical replication apply worker for subscription %u will "
+		"stop because the subscription was removed",
+		MyLogicalRepWorker->subid)));
+		proc_exit(0);
+	}
 	MySubscriptionValid = true;
 	MemoryContextSwitchTo(oldctx);
 
-	/* Setup synchronous commit according to the user's wishes */
-	SetConfigOption("synchronous_commit", MySubscription->synccommit,
-	PGC_BACKEND, PGC_S_OVERRIDE);
-
 	if (!MySubscription->enabled)
 	{
 		ereport(LOG,
-(errmsg("logical replication apply worker for subscription \"%s\" will not "
-		"start because the subscription was disabled during startup",
+(errmsg("logical replication apply worker for subscription \"%s\" will "
+		"stop because the subscription was disabled",
 		MySubscription->name)));
-
 		proc_exit(0);
 	}
 
+	/* Setup synchronous commit according to the user's wishes */
+	SetConfigOption("synchronous_commit", MySubscription->synccommit,
+	PGC_BACKEND, PGC_S_OVERRIDE);
+
 	/* Keep us informed about subscription changes. */
 	CacheRegisterSyscacheCallback(SUBSCRIPTIONOID,
   subscription_change_cb,
-- 
2.7.4

From b5d4d9a130658859bcf6e21ca3bed131dbdddb57 Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Fri, 7 Jul 2017 00:04:43 +0200
Subject: [PATCH 2/6] Exit in sync worker if relation was removed during
 startup

---
 src/backend/replication/logical/tablesync.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 32abf5b..9fbdd8c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -824,6 +824,20 @@ 

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Alvaro Herrera
Peter Geoghegan wrote:

> Here is the query:
> 
> with recursive index_details as (
>   select
> 'pgbench_accounts_pkey'::text idx
> ), [...]

Hmm, this seems potentially very useful.  Care to upload it to
https://wiki.postgresql.org/wiki/Category:Snippets ?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Peter Geoghegan
On Fri, Jul 7, 2017 at 12:45 AM, Alik Khilazhev
 wrote:
> On scale = 10 (1 million rows) it gives the following results on a
> machine with 144 cores (with synchronous_commit=off):
>
> nclients  tps
> 1         8842.401870
> 2         18358.140869
> 4         45999.378785
> 8         88713.743199
> 16        170166.998212
> 32        290069.221493
> 64        178128.030553
> 128       88712.825602
> 256       38364.937573
> 512       13512.765878
> 1000      6188.136736

Is it possible for you to instrument the number of B-Tree page
accesses using custom instrumentation for pgbench_accounts_pkey?

If that seems like too much work, then it would still be interesting
to see what the B-Tree keyspace looks like before and after varying
the "nclient" count from, say, 32 to 128. Maybe there is a significant
difference in how balanced or skewed it is in each case. Or, the index
could simply be more bloated.

There is a query that I sometimes use, that itself uses pageinspect,
to summarize the keyspace quickly. It shows you the highkey for every
internal page, starting from the root and working down to the lowest
internal page level (the one just before the leaf level -- level 1),
in logical/keyspace order. You can use it to visualize the
distribution of values. It could easily include the leaf level, too,
but that's less interesting and tends to make the query take ages. I
wonder what the query will show here.

Here is the query:

with recursive index_details as (
  select
'pgbench_accounts_pkey'::text idx
),
size_in_pages_index as (
  select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from
index_details
),
page_stats as (
  select
index_details.*,
stats.*
  from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
  select
*
  from
page_stats
  where
type != 'l'),
meta_stats as (
  select
*
  from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
  select
*
  from
internal_page_stats
  order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select
1,
blkno,
btpo
  from
internal_items
  where
btpo_prev = 0
and btpo = (select level from meta_stats)
  union
  select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
  from
internal_items i,
ordered_internal_items o
  where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
  idx,
  btpo as level,
  item as l_item,
  blkno,
  btpo_prev,
  btpo_next,
  btpo_flags,
  type,
  live_items,
  dead_items,
  avg_item_size,
  page_size,
  free_size,
  -- Only non-rightmost pages have high key.
  case when btpo_next != 0 then (select data from bt_page_items(idx,
blkno) where itemoffset = 1) end as highkey
from
  ordered_internal_items o
  join internal_items i on o.blk = i.blkno
order by btpo desc, item;

-- 
Peter Geoghegan




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Peter Geoghegan
On Fri, Jul 7, 2017 at 5:17 AM, Robert Haas  wrote:
> How is that possible?  In a Zipfian distribution, no matter how big
> the table is, almost all of the updates will be concentrated on a
> handful of rows - and updates to any given row are necessarily
> serialized, or so I would think.  Maybe MongoDB can be fast there
> since there are no transactions, so it can just lock the row, slam in
> the new value, and unlock the row, all (I suppose) without writing WAL
> or doing anything hard.

If you're not using the WiredTiger storage engine, then the locking
is at the document level, which means that a Zipfian distribution is
no worse than any other as far as lock contention goes. That's one
possible explanation. Another is that index-organized tables
naturally have much better locality, which matters at every level of
the memory hierarchy.

> I'm more curious about why we're performing badly than I am about a
> general-purpose random_zipfian function.  :-)

I'm interested in both. I think that a random_zipfian function would
be quite helpful for modeling certain kinds of performance problems,
like CPU cache misses incurred at the page level.
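The concentration effect being discussed can be put in rough numbers.
This is an illustrative back-of-envelope calculation, not pgbench code;
the exponent 1.1 and the key count are arbitrary example values:

```python
# With a Zipfian key distribution, P(rank k) is proportional to 1/k^s,
# so a tiny set of hot keys absorbs a large share of all accesses.
def zipf_top_share(n_keys: int, s: float, top: int) -> float:
    """Fraction of accesses that land on the `top` hottest keys."""
    weights = [1.0 / k ** s for k in range(1, n_keys + 1)]
    return sum(weights[:top]) / sum(weights)

# With a million keys and s = 1.1, roughly a third of all accesses hit
# just the 10 hottest keys -- their row locks and buffer pages become
# the serialization point as the client count grows.
print(f"{zipf_top_share(1_000_000, 1.1, 10):.0%}")
```

This is why update throughput over such a distribution tends to collapse
past some client count: the table size stops mattering and the hot rows
dominate.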

-- 
Peter Geoghegan




Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-07 Thread David Fetter
On Thu, Jul 06, 2017 at 10:03:37AM +0900, Tatsuo Ishii wrote:
> Hi PostgreSQL hackers,
> 
> I would like to hear ideas how Pgpool-II can deal with SCRAM auth
> which will be in PostgreSQL 10.
> 
> For those who are not familiar with Pgpool-II[1], it is an external
> OSS project to provide some additional features to PostgreSQL,
> including load balancing and automatic failover. Pgpool-II works as a
> proxy between PostgreSQL client and PostgreSQL server(s).
> 
> When a client wants to connect to PostgreSQL and SCRAM auth is
> enabled, it sends the user name to the server. Then the server sends
> information including a salt to the client. The client computes a
> "ClientProof" using the salt and other information, and sends it to
> the server[2].
> 
> For Pgpool-II, things would go as follows:
> 
> 1) clients sends user name to Pgpool-II.
> 2) Pgpool-II forwards it to PostgreSQL servers.
> 3) Each PostgreSQL server sends their own salt to Pgpool-II.
> 4) Pgpool-II is confused because there are multiple salts, each with
>    different values. The client only accepts a single salt, obviously.
> 
> So my question is, is there any solution or workaround for the problem
> #4. Someone at PGCon 2017 suggested that the problem could be avoided
> if the auth method between the client and Pgpool-II is "trust" (which
> means no auth). But this does not seem to be a best solution for me
> because it would weaken the security.

In the end, what poolers do is indistinguishable, in security terms,
from a man-in-the-middle attack.  To the client, the thing with which
it's negotiating auth and doing queries is Pgpool-II, in a manner
similar to writing to a RAID volume rather than any individual disk in
it.

Are people actually running Pgpool on an untrusted network to the
PostgreSQL nodes?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-07 Thread Robert Haas
On Wed, Jul 5, 2017 at 9:32 PM, Michael Paquier
 wrote:
> On Thu, Jul 6, 2017 at 10:03 AM, Tatsuo Ishii  wrote:
>> For Pgpool-II, things would go as follows:
>>
>> 1) clients sends user name to Pgpool-II.
>> 2) Pgpool-II forwards it to PostgreSQL servers.
>> 3) Each PostgreSQL server sends their own salt to Pgpool-II.
>> 4) Pgpool-II is confused because there are multiple salts, each with
>>    different values. The client only accepts a single salt, obviously.
>
> Couldn't you cache one single SASL exchange status for each
> connection, meaning one PGconn saved for each? As the challenge sent
> by the server and the response generated by the client are different
> by design, I am afraid you would need to do that anyway in this
> context (Isn't PG-pool already using the weaknesses of MD5 to make
> things easier?).

I think it is.  From a security point of view, the fact that the same
salt is always used for the same username is a weakness of md5
authentication which SCRAM corrects.
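That md5 weakness is easy to see in a rough sketch of the pre-SCRAM
exchange: the per-user "inner" hash never varies, so anyone holding it
(like a pooler that has stored or captured it) can answer any server's
random salt without knowing the password. Illustration only, not libpq
code:

```python
# Sketch of PostgreSQL's md5 password exchange.
import hashlib

def pg_md5_stored(password: str, user: str) -> str:
    # what pg_authid stores: 'md5' + md5(password || username)
    return "md5" + hashlib.md5((password + user).encode()).hexdigest()

def pg_md5_response(stored: str, salt: bytes) -> str:
    # what the client sends: 'md5' + md5(inner_hex || salt)
    inner = stored[3:]  # strip the 'md5' tag
    return "md5" + hashlib.md5(inner.encode() + salt).hexdigest()

stored = pg_md5_stored("secret", "alice")
# Two backends send two different salts; a middleman holding only the
# stored hash can satisfy both -- no password needed.
r1 = pg_md5_response(stored, b"\x01\x02\x03\x04")
r2 = pg_md5_response(stored, b"\xaa\xbb\xcc\xdd")
print(r1 != r2, r1.startswith("md5"))  # -> True True
```

SCRAM closes this hole because the stored verifier is no longer
password-equivalent in this way, which is exactly what makes the
pooler's multi-salt problem hard.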

IIUC, things will get even worse once channel binding is committed,
presumably for PostgreSQL 11.  The point of channel binding is to
guarantee that you are conducting the authentication exchange with the
target server, not some intermediate proxy that might be conducting a
hostile MITM attack.  pgpool may not be a hostile attack, but it is
acting as a MITM, and channel binding is going to figure that out and
fail the authentication.  So unless I'm misunderstanding, the solution
you are proposing figures to have a very limited shelf life.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread David Fetter
On Fri, Jul 07, 2017 at 10:29:26AM +0900, Amit Langote wrote:
> Hi Mark,
> 
> On 2017/07/07 9:02, Mark Kirkwood wrote:
> > I've been trying out the new partitioning in version 10. Firstly, I must
> > say this is excellent - so much nicer than the old inheritance based method!
> 
> Thanks. :)
> 
> > My only niggle is the display of partitioned tables via \d etc. e.g:
> >
> > part=# \d
> >                 List of relations
> >  Schema |         Name         | Type  |  Owner
> > --------+----------------------+-------+----------
> >  public | date_fact            | table | postgres
> >  public | date_fact_201705     | table | postgres
> >  public | date_fact_201706     | table | postgres
> >  public | date_fact_20170601   | table | postgres
> >  public | date_fact_2017060100 | table | postgres
> >  public | date_fact_201707     | table | postgres
> >  public | date_fact_rest       | table | postgres
> > (7 rows)

Would showing relispartition=true tables only in \d+ fix this?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Rust bindings to pgtypes lib

2017-07-07 Thread Kaare Rasmussen
I agree with Michael Meskes that duplicating the code is probably not a
good idea. Now I see that other Rust crates just depend on the right
libraries being installed by the developer, so I'll punt on the problem
until it bites me for real.

This confuses me, though

> Indeed. I'm quite strongly against exposing / using pgtypeslib in more
> places. If anything it should be phased out. Because that code is
> definitely not always kept up2date, and it's a lot of work to do so.  If
> anything the code should be rewritten to *not* require so much
> duplication, then we could talk

Is this a widespread opinion?  If so, perhaps it would be an idea to
mention that somewhere in the docs. Just reading them, it would seem to
be a nice interface to a library that can do datetime and
arbitrary-precision numeric calculations for you. Both areas are very
hard to get right.
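One small illustration of why delegating numeric math to such a library
is attractive (shown with Python's decimal module for brevity, standing
in for an arbitrary-precision numeric type like pgtypeslib's):

```python
# Naive binary floats cannot represent most decimal fractions exactly,
# while an arbitrary-precision decimal type can.
from decimal import Decimal

print(0.1 + 0.2 == 0.3)                                    # False
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True
```

Date/time arithmetic (leap years, time zones, DST) has the same flavor
of accumulating edge cases.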

No problem on my part. The target of my small project was to learn some rust 
by throing myself at it. And I learned more than expected, so the outcome is 
already fine. If the resulting project also could be of use to me and anybody 
else, it would just be an extra win.

-- 
Med venlig hilsen
Kaare Rasmussen, Jasonic

Jasonic:  Nordre Fasanvej 12, 2000 Frederiksberg





Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

2017-07-07 Thread Masahiko Sawada
On Fri, Jul 7, 2017 at 3:48 PM, Michael Paquier
 wrote:
> On Wed, Jul 5, 2017 at 4:57 PM, Michael Paquier
>  wrote:
>> Why not refactor do_pg_stop_backup() a bit so that the wait phase
>> happens even if a backup is started in recovery? Now wait_for_archive
>> is ignored because no wait happens and the stop point is directly
>> returned to the caller. For the wait to actually happen, I don't have
>> a better idea than documenting the fact that a segment switch needs
>> to be enforced manually on the primary. That's better than having
>> users include WAL in their base backups but not actually having
>> everything they need. And I think that documenting that properly is
>> better than restricting things that should work.
>
> While looking at that in more details, I got surprised by two things:
> 1) No backup history file is generated on a standby during a base backup.
> 2) Because of 1), those files are not archived even if archive_mode = always.
>
> This sounds to me like a missing optimization of archive_mode =
> always, and the following comment block in xlog.c is at least
> incorrect as an archiver can be invoked in recovery:
>  * XXX currently a backup history file is for informational and debug
>  * purposes only. It's not essential for an online backup. Furthermore,
>  * even if it's created, it will not be archived during recovery because
>  * an archiver is not invoked. So it doesn't seem worthwhile to write a
>  * backup history file during recovery.
>
> So I would suggest the following things to address this issue:
> 1) Generate a backup history file for backups taken at recovery as well.
> 2) Archive it if archive_mode = always.
> 3) Wait for both the segment of the stop point and the backup history
> files to be archived before returning back to the caller of
> pg_stop_backup.
>
> It would be nice to get all that addressed in 10. Thoughts?

Yeah, I agree. The attached patch makes it work and deals with the
history file issue.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3631922..eb47742 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18594,10 +18594,12 @@ postgres=# select pg_start_backup('label_goes_here');
 backup (and not in the data directory).  There is an optional second
 parameter of type boolean.  If false, the pg_stop_backup
 will return immediately after the backup is completed without waiting for
-WAL to be archived.  This behavior is only useful for backup
-software which independently monitors WAL archiving. Otherwise, WAL
-required to make the backup consistent might be missing and make the backup
-useless.
+WAL to be archived.  This behavior is only useful for backup software
+which independently monitors WAL archiving. Otherwise, WAL required to
+make the backup consistent might be missing and make the backup useless.
+If the second parameter is true and the backup is taken on a standby,
+pg_stop_backup waits for WAL to be archived without forcibly switching
+WAL on the standby, so a WAL switch must be enforced manually on the primary.

 

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5b6cec8..8efb174 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10875,10 +10875,10 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 	 * however.
 	 *
 	 * We don't force a switch to new WAL file and wait for all the required
-	 * files to be archived. This is okay if we use the backup to start the
-	 * standby. But, if it's for an archive recovery, to ensure all the
-	 * required files are available, a user should wait for them to be
-	 * archived, or include them into the backup.
+	 * files to be archived if waitforarchive is false. This is okay if we use
+	 * the backup to start the standby. But, if it's for an archive recovery,
+	 * to ensure all the required files are available, a user should set
+	 * waitforarchive true and wait for them to be archived.
 	 *
 	 * We return the current minimum recovery point as the backup end
 	 * location. Note that it can be greater than the exact backup end
@@ -10886,10 +10886,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 	 * pg_control. This is harmless for current uses.
 	 *
 	 * XXX currently a backup history file is for informational and debug
-	 * purposes only. It's not essential for an online backup. Furthermore,
-	 * even if it's created, it will not be archived during recovery because
-	 * an archiver is not invoked. So it doesn't seem worthwhile to write a
-	 * backup history file during recovery.
+	 * purposes only. It's not essential for an online backup.
 	 */
 	if (backup_started_in_recovery)
 	{
@@ -10919,39 +10916,40 

Re: [HACKERS] Revisiting NAMEDATALEN

2017-07-07 Thread Mark Dilger

> On Jul 7, 2017, at 2:53 AM, Emrul  wrote:
> 
> Tom, thank you for that pointer.  I get now that it is not free and
> therefore why it's not something that should be changed by default.
> 
> I guess the problem is 'build your own copy' (i.e. compiling from
> source) is something that sends most DB teams running into the hills.

To make matters worse, if you change NAMEDATALEN, compile, and run
'make check', some of the tests will fail.  The tests are very
sensitive to the exact output of the SQL they execute, and changing
NAMEDATALEN, or indeed any one of many other options, causes some of
the test output to change. Even configure's options, such as
--with-blocksize, cannot be changed from the default value without
potentially breaking the regression tests.

mark



Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread Robert Haas
On Fri, Jul 7, 2017 at 8:20 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> I don't have a strong view on whether partitions should be hidden by
>> default, although I lean slightly against it (say, -0.25).  But if we
>> do decide to hide them by default, then I definitely want an
>> easy-to-use modifier that overrides that behavior, like being able to
>> type \d! or whatever to have them included after all.
>
> AIUI the user is responsible for DDL on partitions, like say creating
> indexes for them?  Seems like hiding them is a bad idea given that.
> Also, we need to be careful about calling them something very separate
> from "table", because that would rouse the need to have duplicate syntax
> for every sort of ALTER TABLE and suchlike command that we want to have
> be usable with partitions.  I think we've largely gone the wrong direction
> in that respect with respect to foreign tables and matviews.

Well, I'm not sure what other direction we could have taken there, and
I don't think it follows that just because it's labeled differently in
\d output it has to have different SQL syntax.

On the core question of whether they should be hidden, I think the
answer is that it depends on the situation.  As Simon says, if people
use partitioning with large numbers of partitions, listing many
nearly-identical partition names clutters up the list to an extent
that makes life quite difficult; I've encountered this as a real
usability problem on actual systems.  On the other hand, people with
more modest numbers of partitions (say, 10) might well find it more
convenient to see those names included, because they're legitimate
targets for commands like COMMENT and DROP TABLE and lots of other
things, and somebody might very well find it convenient to be able to
get that with \d rather than \d+ parent_table_name.

As I say, I don't feel hugely strongly about the default behavior, but
I do feel strongly that the idea that only one of the two proposed
behaviors is useful is entirely incorrect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-07 Thread Tom Lane
I wrote:
> I looked at that again and realized that one of the answers I was missing
> lies in the behavior of analyze.c's compute_scalar_stats().

I updated the patch's comments based on this.  I'll just attach the
selfuncs.c part of the patch, since nothing else changed.

regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index e103f5e..3ec629b 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -163,7 +163,7 @@ static double var_eq_non_const(VariableStatData *vardata, Oid operator,
  bool varonleft, bool negate);
 static double ineq_histogram_selectivity(PlannerInfo *root,
 		   VariableStatData *vardata,
-		   FmgrInfo *opproc, bool isgt,
+		   FmgrInfo *opproc, bool isgt, bool iseq,
 		   Datum constval, Oid consttype);
 static double eqjoinsel_inner(Oid operator,
 VariableStatData *vardata1, VariableStatData *vardata2);
@@ -544,18 +544,21 @@ neqsel(PG_FUNCTION_ARGS)
 /*
  *	scalarineqsel		- Selectivity of "<", "<=", ">", ">=" for scalars.
  *
- * This is the guts of both scalarltsel and scalargtsel.  The caller has
- * commuted the clause, if necessary, so that we can treat the variable as
- * being on the left.  The caller must also make sure that the other side
- * of the clause is a non-null Const, and dissect same into a value and
- * datatype.
+ * This is the guts of scalarltsel/scalarlesel/scalargtsel/scalargesel.
+ * The isgt and iseq flags distinguish which of the four cases apply.
+ *
+ * The caller has commuted the clause, if necessary, so that we can treat
+ * the variable as being on the left.  The caller must also make sure that
+ * the other side of the clause is a non-null Const, and dissect that into
+ * a value and datatype.  (This definition simplifies some callers that
+ * want to estimate against a constructed value instead of a Const node.)
  *
  * This routine works for any datatype (or pair of datatypes) known to
  * convert_to_scalar().  If it is applied to some other datatype,
  * it will return a default estimate.
  */
 static double
-scalarineqsel(PlannerInfo *root, Oid operator, bool isgt,
+scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
 			  VariableStatData *vardata, Datum constval, Oid consttype)
 {
 	Form_pg_statistic stats;
@@ -587,7 +590,8 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt,
 	 * If there is a histogram, determine which bin the constant falls in, and
 	 * compute the resulting contribution to selectivity.
 	 */
-	hist_selec = ineq_histogram_selectivity(root, vardata, &opproc, isgt,
+	hist_selec = ineq_histogram_selectivity(root, vardata,
+											&opproc, isgt, iseq,
 			constval, consttype);
 
 	/*
@@ -757,7 +761,8 @@ histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc,
  *	ineq_histogram_selectivity	- Examine the histogram for scalarineqsel
  *
  * Determine the fraction of the variable's histogram population that
- * satisfies the inequality condition, ie, VAR < CONST or VAR > CONST.
+ * satisfies the inequality condition, ie, VAR < (or <=, >, >=) CONST.
+ * The isgt and iseq flags distinguish which of the four cases apply.
  *
  * Returns -1 if there is no histogram (valid results will always be >= 0).
  *
@@ -768,7 +773,7 @@ histogram_selectivity(VariableStatData *vardata, FmgrInfo *opproc,
 static double
 ineq_histogram_selectivity(PlannerInfo *root,
 		   VariableStatData *vardata,
-		   FmgrInfo *opproc, bool isgt,
+		   FmgrInfo *opproc, bool isgt, bool iseq,
 		   Datum constval, Oid consttype)
 {
 	double		hist_selec;
@@ -795,11 +800,17 @@ ineq_histogram_selectivity(PlannerInfo *root,
 		if (sslot.nvalues > 1)
 		{
 			/*
-			 * Use binary search to find proper location, ie, the first slot
-			 * at which the comparison fails.  (If the given operator isn't
-			 * actually sort-compatible with the histogram, you'll get garbage
-			 * results ... but probably not any more garbage-y than you would
-			 * from the old linear search.)
+			 * Use binary search to find the desired location, namely the
+			 * right end of the histogram bin containing the comparison value,
+			 * which is the leftmost entry for which the comparison operator
+			 * succeeds (if isgt) or fails (if !isgt).  (If the given operator
+			 * isn't actually sort-compatible with the histogram, you'll get
+			 * garbage results ... but probably not any more garbage-y than
+			 * you would have from the old linear search.)
+			 *
+			 * In this loop, we pay no attention to whether the operator iseq
+			 * or not; that detail will be mopped up below.  (We cannot tell,
+			 * anyway, whether the operator thinks the values are equal.)
 			 *
 			 * If the binary search accesses the first or last histogram
 			 * entry, we try to replace that endpoint with the true column min
@@ -864,25 +875,74 @@ ineq_histogram_selectivity(PlannerInfo *root,
 
 			if (lobound <= 0)
 			{
-/* Constant is 
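
The search described in the updated comment — find the leftmost histogram entry for which the comparison operator succeeds (if isgt) or fails (if !isgt) — can be sketched as follows. This is a minimal illustration assuming a plain sorted C array and a boolean comparison callback; `hist_search` and the `opproc` signature are hypothetical, not the actual selfuncs.c code:

```c
#include <stdbool.h>

/*
 * Simplified sketch of the binary search described above: return the
 * leftmost index in values[0..nvalues-1] at which applying the comparison
 * operator yields 'isgt' (i.e. succeeds for isgt, fails for !isgt).
 * Returns nvalues if no such entry exists.  Hypothetical helper, not the
 * actual ineq_histogram_selectivity() implementation.
 */
static int
hist_search(const double *values, int nvalues, double constval,
			bool isgt, bool (*opproc)(double histval, double constval))
{
	int			lobound = 0;		/* leftmost index not yet ruled out */
	int			hibound = nvalues;	/* leftmost index known to pass the test */

	while (lobound < hibound)
	{
		int			probe = (lobound + hibound) / 2;

		if (opproc(values[probe], constval) == isgt)
			hibound = probe;	/* desired entry is here or to the left */
		else
			lobound = probe + 1;
	}
	return lobound;
}
```

With a "<" operator (!isgt), this returns the right end of the bin containing the constant, i.e. the leftmost histogram value that is not less than it.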

Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread Simon Riggs
On 7 July 2017 at 13:20, Tom Lane  wrote:
> Robert Haas  writes:
>> I don't have a strong view on whether partitions should be hidden by
>> default, although I lean slightly against it (say, -0.25).  But if we
>> do decide to hide them by default, then I definitely want an
>> easy-to-use modifier that overrides that behavior, like being able to
>> type \d! or whatever to have them included after all.
>
> AIUI the user is responsible for DDL on partitions, like say creating
> indexes for them?  Seems like hiding them is a bad idea given that.
> Also, we need to be careful about calling them something very separate
> from "table", because that would rouse the need to have duplicate syntax
> for every sort of ALTER TABLE and suchlike command that we want to have
> be usable with partitions.  I think we've largely gone the wrong direction
> in that respect with respect to foreign tables and matviews.

Hmm, "hiding" would not be an accurate description of the proposal. I
would characterize it more as removing extraneous information, since
for a partitioned table seeing 1000 records all with roughly the same
name isn't helpful output from \d.

\d would show tables but not partitions
\d  would show partitions exist and how many
\d+ would show partition details

So the information would be available, just at different levels of
detail, just as we have now for other things.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Revisiting NAMEDATALEN

2017-07-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jul 7, 2017 at 5:53 AM, Emrul  wrote:
>> A solution might be to make NAMEDATALEN configurable without having to
>> recompile source (perhaps a config variable or an initdb parameter). When I
>> have some free time I will investigate whether this is doable.

> Well, it wouldn't be free.  The problem is:

> typedef struct nameData
> {
> chardata[NAMEDATALEN];
> } NameData;

> If it were not a compile-time constant, every bit of code that uses
> NameData (or otherwise depends on NAMEDATALEN being constant) would
> have to be changed.  That would be invasive and likely have at least a
> minor performance cost.

It's a lot worse than just the code that accesses the names; if that were
all then we might hope to hide most of the issue underneath macros like
RelationGetRelationName.  The problem is that if NameData isn't constant
length, that also breaks the struct overlay mechanism for catalog rows,
introducing notational and performance costs for essentially every catalog
field access in the entire backend.  That is, in places like

CATALOG(pg_type,1247) BKI_BOOTSTRAP BKI_ROWTYPE_OID(71) BKI_SCHEMA_MACRO
{
NameDatatypname;/* type name */
Oid typnamespace;   /* OID of namespace containing this type */
Oid typowner;   /* type owner */
...

typnamespace and following fields are no longer easily accessible
to C code.
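
The struct-overlay dependence on a fixed-length NameData can be illustrated with a minimal sketch. Only the NameData shape mirrors the real definition; `FormData_example` and its fields are illustrative, not an actual catalog declaration:

```c
#include <stddef.h>

#define NAMEDATALEN 64			/* compile-time constant (pg_config_manual.h) */

typedef struct NameData
{
	char		data[NAMEDATALEN];
} NameData;

/*
 * Illustrative (not actual) catalog row layout: because NameData has a
 * fixed size, every field after the name sits at a compile-time-constant
 * offset, so C code can overlay this struct directly on a catalog tuple.
 */
typedef struct FormData_example
{
	NameData	typname;		/* always exactly NAMEDATALEN bytes */
	unsigned int typnamespace;	/* offset known at compile time */
	unsigned int typowner;
} FormData_example;
```

If NAMEDATALEN became a run-time setting, typnamespace's offset would no longer be a compile-time constant, and every such field access would need per-row length decoding instead of a direct struct member reference.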

You could partly get around that by relocating name columns to the ends
of their catalog rows --- but only partly, and it would be a darn odd
looking result.

It's interesting to speculate about NameData becoming some sort of short
fixed-length pointer to a variable-length string stored somewhere else
(like the end of the row).  But TBH I cannot see a scheme like that ever
getting out of the realm of speculation --- it would break too many
*other* assumptions, many of them in performance-critical places like
tuple assembly and disassembly.

In the end I just don't think this is worth the work it would take to
improve matters significantly over the current situation.  It's too much
work benefitting too few people.

regards, tom lane




Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2017-07-07 Thread Alexander Korotkov
On Thu, Jun 15, 2017 at 10:16 PM, Andres Freund  wrote:

> On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:
> > Advanced options:
> > - mostly for testing built-in scripts: you can set the default
> transaction
> > isolation level by the appropriate benchmarking option (-I);
>
> I'm less convinced of the need for that; you can already set arbitrary
> connection options with
> PGOPTIONS='-c default_transaction_isolation=serializable' pgbench
>

Right, there is already a way to specify the default isolation level using
environment variables.
However, once we make pgbench work with various isolation levels, users may
want to run pgbench multiple times in a row with different isolation
levels.  A command line option would be very convenient in this case.
In addition, the isolation level is a vital parameter for interpreting
benchmark results correctly.  Often, graphs with pgbench results are titled
with the pgbench command line.  Having the isolation level specified in the
command line would naturally fit into this titling scheme.
Of course, this is solely a usability question and it's fair enough to live
without such a command line option.  But I'm +1 to add this option.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-07 Thread Amit Kapila
On Fri, Jul 7, 2017 at 8:22 AM, AP  wrote:
> On Thu, Jul 06, 2017 at 05:19:59PM +0530, Amit Kapila wrote:
>> I think if you are under development, it is always advisable to create
>> indexes after initial bulk load.  That way it will be faster and will
>> take lesser space atleast in case of hash index.
>
> This is a bit of a pickle, actually:
> * if I do have a hash index I'll wind up with a bloated one at some stage
>   that refused to allow more inserts until the index is re-created
> * if I don't have an index then I'll wind up with a table where I cannot
>   create a hash index because it has too many rows for it to handle
>
> I'm at a bit of a loss as to how to deal with this.
>

I can understand your concerns.  To address the first concern we need to
work on one or more of the following work items: (a) work on vacuums that
can be triggered on insert-only workloads (it should perform index
vacuum as well) (b) a separate utility statement/function to squeeze a
hash index (c) the db internally does squeezing, e.g., after each split, so
that the chances of such a problem can be reduced, but that will be at the
cost of a performance reduction in other workloads, so not sure if it is
advisable.  Among these, (b) is the simplest to do but may not be
convenient for the user.

To address your second concern, we need to speed up the creation of
hash indexes, which is a relatively big project.  Having said that, I
think in your case this is a one-time operation, so spending more time
once might be okay.

>
>> >> As mentioned above REINDEX might be a better option.  I think for such
>> >> situation we should have some provision to allow squeeze functionality
>> >> of hash exposed to the user, this will be less costly than REINDEX and
>> >> might serve the purpose for the user.  Hey, can you try some hack in
>> >
>> > Assuming it does help, would this be something one would need to guess
>> > at? "I did a whole bunch of concurrent INSERT heavy transactions so I
>> > guess I should do a squeeze now"?
>> >
>> > Or could it be figured out programmatically?
>>
>> I think one can refer free_percent and number of overflow pages to
>> perform such a command.  It won't be 100% correct, but we can make a
>> guess.  We can even check free space in overflow pages with page
>> inspect to make it more accurate.
>
> Does this take much time? Main reason I am asking is that this looks like
> something that the db ought to handle underneath (say as part of an autovac
> run) and so if there are stats that the index code can maintain that can
> then be used by the autovac (or something) code to trigger a cleanup this
> I think would be of benefit.
>

Sure, I agree that the database should automatically handle bloat, but as
said above this will be a separate project and may not be
straightforward.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread Tom Lane
Robert Haas  writes:
> I don't have a strong view on whether partitions should be hidden by
> default, although I lean slightly against it (say, -0.25).  But if we
> do decide to hide them by default, then I definitely want an
> easy-to-use modifier that overrides that behavior, like being able to
> type \d! or whatever to have them included after all.

AIUI the user is responsible for DDL on partitions, like say creating
indexes for them?  Seems like hiding them is a bad idea given that.
Also, we need to be careful about calling them something very separate
from "table", because that would rouse the need to have duplicate syntax
for every sort of ALTER TABLE and suchlike command that we want to have
be usable with partitions.  I think we've largely gone the wrong direction
in that respect with respect to foreign tables and matviews.

regards, tom lane




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Robert Haas
On Fri, Jul 7, 2017 at 3:45 AM, Alik Khilazhev
 wrote:
> PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50% 
> UPDATE of random row by PK) on benchmarking with big number of clients using 
> Zipfian distribution. MySQL also has decline but it is not significant as it 
> is in PostgreSQL. MongoDB does not have decline at all.

How is that possible?  In a Zipfian distribution, no matter how big
the table is, almost all of the updates will be concentrated on a
handful of rows - and updates to any given row are necessarily
serialized, or so I would think.  Maybe MongoDB can be fast there
since there are no transactions, so it can just lock the row, slam in
the new value, and unlock the row, all (I suppose) without writing WAL
or doing anything hard.  But MySQL is going to have to hold the row
lock until transaction commit just like we do, or so I would think.
Is it just that their row locking is way faster than ours?

I'm more curious about why we're performing badly than I am about a
general-purpose random_zipfian function.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread Robert Haas
On Fri, Jul 7, 2017 at 3:54 AM, Michael Banck  wrote:
> +1.
>
> Or maybe just 'partition' is enough if 'partition table' would widen the
> column output unnecessarily.

Internally to the source code, the parent is called a "partitioned
table" and the child is called a "partition".  I think we should not
use the term "partition table" because I think it could create
confusion as to which of those two things we're talking about.  It
would be reasonable to write "partition" rather than "table" for
partitions, though.  We'd probably also need "partition index" (for
indexes on partition) and "foreign partition" (for foreign tables that
are partitions).

I don't have a strong view on whether partitions should be hidden by
default, although I lean slightly against it (say, -0.25).  But if we
do decide to hide them by default, then I definitely want an
easy-to-use modifier that overrides that behavior, like being able to
type \d! or whatever to have them included after all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fix header comment of streamutil.c

2017-07-07 Thread Magnus Hagander
On Fri, Jul 7, 2017 at 8:31 AM, Masahiko Sawada 
wrote:

> Hi,
>
> While reading source code, I found that the header comment of
> streamutil.c is not correct. I guess pg_receivelog is a mistake of
> pg_receivexlog and it's an oversight when changing xlog to wal.
>
> Attached patch fixes this.
>

Yeah, both a typo and a missing user :)

Applied, thanks.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread Simon Riggs
On 7 July 2017 at 08:54, Michael Banck  wrote:
> On Fri, Jul 07, 2017 at 07:40:55PM +1200, Mark Kirkwood wrote:
>> On 07/07/17 13:29, Amit Langote wrote:
>> >Someone complained about this awhile back [1].  And then it came up again
>> >[2], where Noah appeared to take a stance that partitions should be
>> >visible in views / output of commands that list "tables".
>> >
>> >Although I too tend to prefer not filling up the \d output space by
>> >listing partitions (pg_class.relispartition = true relations), there
>> >wasn't perhaps enough push for creating a patch for that.  If some
>> >committer is willing to consider such a patch, I can make one.
>>
>> Yeah, me too (clearly). However if the consensus is that all these partition
>> tables *must* be shown in \d output, then I'd be happy if they were
>> identified as such rather than just 'table' (e.g 'partition table').
>
> +1.

+1 to remove partitions from \d display

With 1000 partitions that would just be annoying

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Fabien COELHO


Hello Alik,

PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50%
UPDATE of a random row by PK) when benchmarking with a big number of
clients using a Zipfian distribution.  MySQL also shows a decline but it
is not as significant as it is in PostgreSQL.  MongoDB does not show a
decline at all.  And if pgbench had a Zipfian-distribution random number
generator, everyone would be able to do research on this topic without
using YCSB.


Your description is not very precise. What version of Postgres is used? If 
there is a decline, compared to which version? Is there a link to these 
results?



This is the reason why I am currently working on random_zipfian function.


The bottleneck of the algorithm that I use is that it calculates the zeta
function (which has linear complexity -
https://en.wikipedia.org/wiki/Riemann_zeta_function). It may cause
problems when generating a huge amount of big numbers.


Indeed, the function computation is overly expensive, and the numerical
precision of the implementation is doubtful.


If there is no better way to compute this function, ISTM that it should be
summed in reverse order to accumulate small values first, from (1/n)^s +
... + (1/2)^s.  As 1/1 == 1, the corresponding term is 1, so there is no
point in calling pow for this one, and it could be:


    double ans = 0.0;
    for (i = n; i >= 2; i--)
        ans += pow(1. / i, theta);
    return 1.0 + ans;

That’s why I added caching for the zeta value. And it works well for cases
when random_zipfian is called with the same parameters in a script. For example:


That’s why I have a question: should I implement support for caching zeta
values for calls with different parameters, or not?


I do not envision the random_zipfian function to be used widely, but if it 
is useful to someone this is fine with me. Could an inexpensive 
exponential distribution be used instead for the same benchmarking 
purpose?


If the function, when actually used, is likely to be called with different
parameters, then some caching beyond the last value would seem in order.
Maybe a small fixed-size array?


However, it should be somehow thread safe, which does not seem to be the 
case with the current implementation. Maybe a per-thread cache? Or use a 
lock only to update a shared cache? At least it should avoid locking to 
read values...



P.S. I attaching patch and script - analogue of YCSB Workload A.
Run benchmark with command:
$ pgbench -f  ycsb_read_zipf.sql -f  ycsb_update_zipf.sql


Given the explanations, the random draw mostly hits values at the
beginning of the interval, so when the number of clients goes higher one
just gets locking contention on the updated row?


ISTM that also having the tps achieved with a flat distribution would
allow checking this hypothesis.


On scale = 10 (1 million rows) it gives the following results on a machine
with 144 cores (with synchronous_commit=off):



nclients  tps
1         8842.401870
2         18358.140869
4         45999.378785
8         88713.743199
16        170166.998212
32        290069.221493
64        178128.030553
128       88712.825602
256       38364.937573
512       13512.765878
1000      6188.136736


--
Fabien.


Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

2017-07-07 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hi Craig,

The scenario is locking and unlocking the system 30 times. During this
scenario, 5 sh-QUIT cores are generated. GDB of the 5 cores points to
different locations.
I have attached the output for 2 such instances.

Regards,
Sandhya

From: Craig Ringer [mailto:cr...@2ndquadrant.com]
Sent: Friday, July 07, 2017 12:55 PM
To: K S, Sandhya (Nokia - IN/Bangalore) 
Cc: pgsql-bugs ; PostgreSQL Hackers 
; T, Rasna (Nokia - IN/Bangalore) 
; Itnal, Prakash (Nokia - IN/Bangalore) 

Subject: Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

On 7 July 2017 at 15:10, K S, Sandhya (Nokia - IN/Bangalore) 
> wrote:
Hi Craig,

You were right about the restore_command.

This all makes sense then.

PostgreSQL sends SIGQUIT for immediate shutdown to its children. So the 
restore_command would get signalled too.

Can't immediately explain the exit code, and SIGQUIT should _not_ generate a 
core file. Can you show the result of attaching 'gdb' to the core file and 
running 'bt full' ?

--
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
GDB of first instance of corefile.
# gdb /bin/bash CFPU-1-7919-595e59a9-sh-QUIT.core
GNU gdb (Wind River Linux G++ 4.4a-470) 7.2.50.20100908-cvs
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "mips64-wrs-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /bin/bash...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 7919]
Reading symbols from /lib64/libreadline.so.5...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libreadline.so.5
Reading symbols from /lib64/libhistory.so.5...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libhistory.so.5
Reading symbols from /lib64/libncurses.so.5...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libncurses.so.5
Reading symbols from /lib64/libdl.so.2...Reading symbols from 
/mnt/sysimg/usr/lib/debug/lib64/libdl-2.11.1.so.debug...(no debugging symbols 
found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libc.so.6...Reading symbols from 
/mnt/sysimg/usr/lib/debug/lib64/libc-2.11.1.so.debug...done.
done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libtinfo.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib64/libtinfo.so.5
Reading symbols from /lib64/ld.so.1...Reading symbols from 
/mnt/sysimg/usr/lib/debug/lib64/ld-2.11.1.so.debug...(no debugging symbols 
found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/ld.so.1
Core was generated by `sh -c exit 1'.
Program terminated with signal 3, Quit.
#0  0x005558246a80 in _dl_lookup_symbol_x () from /lib64/ld.so.1
(gdb) bt full
#0  0x005558246a80 in _dl_lookup_symbol_x () from /lib64/ld.so.1
No symbol table info available.
#1  0x00555824816c in _dl_relocate_object () from /lib64/ld.so.1
No symbol table info available.
#2  0x00555823fb6c in dl_main () from /lib64/ld.so.1
No symbol table info available.
#3  0x005558254214 in _dl_sysdep_start () from /lib64/ld.so.1
No symbol table info available.
#4  0x00555823d1b0 in _dl_start_final () from /lib64/ld.so.1
No symbol table info available.
#5  0x00555823d3f0 in _dl_start () from /lib64/ld.so.1
No symbol table info available.
#6  0x00555823cc10 in __start () from /lib64/ld.so.1
No symbol table info available.
Backtrace stopped: frame did not save the PC






GDB of second instance of corefile.
# gdb /bin/bash CFPU-1-15638-595e5efb-sh-QUIT.core   
GNU gdb (Wind River Linux G++ 4.4a-470) 7.2.50.20100908-cvs
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "mips64-wrs-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /bin/bash...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 15638]
Reading symbols from /lib64/libreadline.so.5...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libreadline.so.5
Reading symbols from /lib64/libhistory.so.5...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libhistory.so.5
Reading 

Re: [HACKERS] Revisiting NAMEDATALEN

2017-07-07 Thread Robert Haas
On Fri, Jul 7, 2017 at 5:53 AM, Emrul  wrote:
> Tom, thank you for that pointer.  I get now that it is not free and therefore
> why its not something that should be changed by default.
>
> I guess the problem is 'build your own copy' (i.e. compiling from source) is
> something that sends most DB teams running into the hills.
>
> A solution might be to make NAMEDATALEN configurable without having to
> recompile source (perhaps a config variable or an initdb parameter). When I
> have some free time I will investigate whether this is doable.

Well, it wouldn't be free.  The problem is:

typedef struct nameData
{
chardata[NAMEDATALEN];
} NameData;

If it were not a compile-time constant, every bit of code that uses
NameData (or otherwise depends on NAMEDATALEN being constant) would
have to be changed.  That would be invasive and likely have at least a
minor performance cost.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Rust bindings to pgtypes lib

2017-07-07 Thread Michael Meskes
> Indeed. I'm quite strongly against exposing / using pgtypeslib in more
> places. If anything it should be phased out. Because that code is

Feel free to come up with a replacement. :)

> definitely not always kept up2date, and it's a lot of work to do so.  If
> anything the code should be rewritten to *not* require so much
> duplication, then we could talk.

Agreed. However, it could be argued that the code is not duplication
(anymore) but a separate implementation of these datatypes.

>> I'm not aware of any other language binding for pgtypeslib.
> 
> Some people use http://libpqtypes.esilo.com/

I never saw this before. It does not seem to have more in common with it
than the name, though.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: [HACKERS] Revisiting NAMEDATALEN

2017-07-07 Thread Emrul
Tom, thank you for that pointer.  I get now that it is not free and therefore
why it's not something that should be changed by default.

I guess the problem is 'build your own copy' (i.e. compiling from source) is
something that sends most DB teams running into the hills.

A solution might be to make NAMEDATALEN configurable without having to
recompile source (perhaps a config variable or an initdb parameter). When I
have some free time I will investigate whether this is doable.




--
View this message in context: 
http://www.postgresql-archive.org/Revisiting-NAMEDATALEN-tp5969858p5970351.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.




Re: [HACKERS] Oddity in error handling of constraint violation in ExecConstraints for partitioned tables

2017-07-07 Thread Amit Langote
Fujita-san,

On 2017/07/06 16:06, Etsuro Fujita wrote:
> Here is an example:
> 
> postgres=# create table col_desc (a int, b int) partition by list (a);
> postgres=# create table col_desc_1 partition of col_desc for values in (1);
> postgres=# alter table col_desc_1 add check (b > 0);
> postgres=# create role col_desc_user;
> postgres=# grant insert on col_desc to col_desc_user;
> postgres=# revoke select on col_desc from col_desc_user;
> postgres=# set role col_desc_user;
> postgres=> insert into col_desc values (1, -1) returning 1;
> ERROR:  new row for relation "col_desc_1" violates check constraint
> "col_desc_1_b_check"
> DETAIL:  Failing row contains (a, b) = (1, -1).
> 
> Looks good, but
> 
> postgres=> reset role;
> postgres=# create table result (f1 text default 'foo', f2 text default
> 'bar', f3 int);
> postgres=# grant insert on result to col_desc_user;
> postgres=# set role col_desc_user;
> postgres=> with t as (insert into col_desc values (1, -1) returning 1)
> insert into result (f3) select * from t;
> ERROR:  new row for relation "col_desc_1" violates check constraint
> "col_desc_1_b_check"
> 
> Looks odd to me because the error message doesn't show any DETAIL info;
> since the CTE query, which produces the message, is the same as the above
> query, the message should also be the same as the one for the above
> query.

I agree that the DETAIL should be shown.

> I think the reason for that is: ExecConstraints looks at an
> incorrect resultRelInfo to build the message for a violating tuple that
> has been routed *in the case where the partitioned table isn't the primary
> ModifyTable node*, which leads to deriving an incorrect modifiedCols and
> then invoking ExecBuildSlotValueDescription with the wrong bitmap.

That's true.  Choosing the appropriate ResultRelInfo will lead us to
looking at the correct RangeTblEntry (via ri_RangeTableIndex) and hence
get the desired modifiedCols bitmap.  But...

> I think this should be fixed.  Attached is a patch for that.

Looking up the ResultRelInfo using the proposed method in
ExecFindResultRelInfo() can be unreliable sometimes, that is, the method
of using the root table OID to pin down the RangeTblEntry to get the
modified columns set.  If the same table is also a target in the main
query (with a different set of modified columns), the corresponding RTE
will appear earlier in the es_range_table list and will get returned.  For
example (a slightly altered version of the example in your email):

alter table col_desc add c int;
set session authorization col_desc_user ;
with t (a) as (insert into col_desc values (1, -1) returning 1)
  insert into col_desc (c) select * from t;
ERROR:  new row for relation "col_desc_1" violates check constraint
"col_desc_1_b_check"
DETAIL:  Failing row contains (c) = (null).

Note that the patched code ends up matching the main query RTE of col_desc
instead of that in the CTE query.  And the columns shown in the DETAIL
part reflect which RTE got used to compute the modifiedCols set.

How about setting ri_RangeTableIndex of the partition ResultRelInfo
correctly in the first place?  If you look at the way InitResultRelInfo()
is called in ExecSetupPartitionTupleRouting(), a dummy RT index of 1 is
passed.  We could instead pass the correct one.  I think
ModifyTable.nominalRelation is that (at least in the INSERT case).

The attached patch implements that.  It modifies
ExecSetupPartitionTupleRouting to accept one more argument resultRTindex
and passes along the same to InitResultRelInfo().  Later when
ExecConstraints() wants to build the modifiedCols set, it looks up the
correct RTE using the partition ResultRelInfo, which has the appropriate
ri_RangeTableIndex set and uses the same.  With the patch:

with t (a) as (insert into col_desc values (1, -1) returning 1)
   insert into col_desc (c) select * from t;
ERROR:  new row for relation "col_desc_1" violates check constraint
"col_desc_1_b_check"
DETAIL:  Failing row contains (a, b) = (1, -1).

The patch keeps the tests that you added in your patch.  I've added this
to the open items list.

Thoughts?

Thanks,
Amit
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f391828e74..51693791cc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1433,6 +1433,7 @@ BeginCopy(ParseState *pstate,
num_partitions;
 
ExecSetupPartitionTupleRouting(rel,
+   
   1,

   _dispatch_info,

   ,

   _tupconv_maps,
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0f08283f81..df9302896c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ 

Re: [HACKERS] Multi column range partition table

2017-07-07 Thread Dean Rasheed
On 7 July 2017 at 03:21, Amit Langote  wrote:
> The patch looks generally good, although I found and fixed some minor
> issues (typos and such).  Please find attached the updated patch.
>

Thanks for the review. Those changes all look good. I also see that I
missed an example in the docs at the bottom of the CREATE TABLE page,
so I'll go update that.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

2017-07-07 Thread Craig Ringer
On 7 July 2017 at 15:41, K S, Sandhya (Nokia - IN/Bangalore) <
sandhya@nokia.com> wrote:

> Hi Craig,
>
>
>
> The scenario is locking and unlocking the system 30 times. During this
> scenario, 5 sh-QUIT cores are generated. GDB on the 5 cores points to
> different locations.
>
> I have attached the output for 2 such instances.
>
>

You seem to be missing debug symbols. Install appropriate debuginfo
packages.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

2017-07-07 Thread Masahiko Sawada
On Fri, Jul 7, 2017 at 4:21 PM, Michael Paquier
 wrote:
> On Fri, Jul 7, 2017 at 4:06 PM, Masahiko Sawada  wrote:
>> I agree with this idea. I've tried to make it wait for archiving, but
>> it seems to me that there are two other issues we need to deal with:
>> the timeline ID on the standby server is always reset after a restart
>> point is created, and the ThisTimeLineID variable of a backend process
>> is not set when initializing on the standby node. The former would be
>> easy to fix because resetting it is only for debugging purposes; for
>> the latter, however, I'm still considering the influence on other things.
>
> ThisTimeLineID does not need to be touched. It seems to me that you
> should just use the wait timeline as minRecoveryPointTLI, and wait for
> segments using this value. The segment and backup history files to
> wait for should be defined with minRecoveryPoint.

Thank you for the advice. I'll post the patch later.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread Michael Banck
On Fri, Jul 07, 2017 at 07:40:55PM +1200, Mark Kirkwood wrote:
> On 07/07/17 13:29, Amit Langote wrote:
> >Someone complained about this awhile back [1].  And then it came up again
> >[2], where Noah appeared to take a stance that partitions should be
> >visible in views / output of commands that list "tables".
> >
> >Although I too tend to prefer not filling up the \d output space by
> >listing partitions (pg_class.relispartition = true relations), there
> >wasn't perhaps enough push for creating a patch for that.  If some
> >committer is willing to consider such a patch, I can make one.
> 
> Yeah, me too (clearly). However if the consensus is that all these partition
> tables *must* be shown in \d output, then I'd be happy if they were
> identified as such rather than just 'table' (e.g 'partition table').

+1.

Or maybe just 'partition' is enough if 'partition table' would widen the
column output unnecessarily.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer




[HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-07 Thread Alik Khilazhev
Hello!

PostgreSQL shows very bad results in YCSB Workload A (50% SELECTs and 50%
UPDATEs of a random row by PK) when benchmarking with a large number of
clients using a Zipfian distribution. MySQL also shows a decline, but not
as significant as PostgreSQL's, and MongoDB shows no decline at all. If
pgbench had a Zipfian-distribution random number generator, everyone would
be able to do research on this topic without using YCSB.

This is the reason why I am currently working on a random_zipfian function.

The bottleneck of the algorithm I use is that it calculates the zeta
function, which has linear complexity
(https://en.wikipedia.org/wiki/Riemann_zeta_function). This may cause
problems when generating a huge amount of big numbers.

That's why I added caching for the zeta value. This works well when
random_zipfian is called with the same parameters throughout the script.
For example:

… 
\set a random_zipfian(1, 100, 1.2)
\set b random_zipfian(1, 100, 1.2)
…

Otherwise, the second call overwrites the cache left by the first, and the
caching does not help at all:
…
\set a random_zipfian(1, 100, 1.2)
\set b random_zipfian(1, 200, 1.4)
… 

That's why I have a question: should I implement caching of zeta values
for calls with different parameters, or not?

P.S. I am attaching the patch and scripts - an analogue of YCSB Workload A.
Run the benchmark with:
$ pgbench -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql

On scale = 10 (1 million rows) it gives the following results on a machine
with 144 cores (with synchronous_commit=off):

nclients   tps
1          8842.401870
2          18358.140869
4          45999.378785
8          88713.743199
16         170166.998212
32         290069.221493
64         178128.030553
128        88712.825602
256        38364.937573
512        13512.765878
1000       6188.136736


ycsb_read_zipf.sql
Description: application/sql


pgbench-zipf-01v.patch
Description: Binary data


ycsb_update_zipf.sql
Description: application/sql

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] New partitioning - some feedback

2017-07-07 Thread Mark Kirkwood

On 07/07/17 13:29, Amit Langote wrote:



Someone complained about this awhile back [1].  And then it came up again
[2], where Noah appeared to take a stance that partitions should be
visible in views / output of commands that list "tables".

Although I too tend to prefer not filling up the \d output space by
listing partitions (pg_class.relispartition = true relations), there
wasn't perhaps enough push for creating a patch for that.  If some
committer is willing to consider such a patch, I can make one.




Yeah, me too (clearly). However if the consensus is that all these 
partition tables *must* be shown in \d output, then I'd be happy if they 
were identified as such rather than just 'table' (e.g 'partition table').


regards

Mark




Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-07 Thread Michael Paquier
On Fri, Jul 7, 2017 at 4:31 PM, Kuntal Ghosh  wrote:
> On Fri, Jul 7, 2017 at 7:49 AM, Michael Paquier
>  wrote:
> I don't have any more inputs on this patch and it looks good to me.
> So, I'm moving the status to ready for committer.

Thanks!

>> At some point it would really make sense to group all things under the
>> same banner (64-b LO, pg_basebackup, and now pg_rewind).
>>
> +1. Implementation-wise, I prefer pg_recvint64 to fe_recvint64.

So do I. That's a matter of taste I guess.
-- 
Michael




Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-07 Thread Kuntal Ghosh
On Fri, Jul 7, 2017 at 7:49 AM, Michael Paquier
 wrote:
> On Thu, Jul 6, 2017 at 8:51 PM, Dilip Kumar  wrote:
>> On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghosh  
>> wrote:
>>> But I'm a little concerned about the following part of the code.
>>> +/*
>>> + * Converts an int64 from network byte order to native format.
>>> + */
>>> +static int64
>>> +pg_recvint64(int64 value)
>>> +{
>>> +   union
>>> +   {
>>> +   int64   i64;
>>> +   uint32  i32[2];
>>> +   } swap;
>>> +   int64   result;
>>> +
>>> +   swap.i64 = value;
>>> +
>>> +   result = (uint32) ntohl(swap.i32[0]);
>>> +   result <<= 32;
>>> +   result |= (uint32) ntohl(swap.i32[1]);
>>> +
>>> +   return result;
>>> +}
>>> Does this always work correctly irrespective of the endianess of the
>>> underlying system?
>>
>> I think this will have a problem; we may need to do something like
>>
>> and reverse the complete array if the byte order is changed
>
> This comes from the the implementation of 64b-large objects, which was
> first broken on big-endian systems, until Tom fixed it with the
> following commit, and this looks fine to me:
> commit: 26fe56481c0f7baa705f0b3265b5a0676f894a94
> author: Tom Lane 
> date: Mon, 8 Oct 2012 18:24:32 -0400
> Code review for 64-bit-large-object patch.
>
Great. Besides, the logic of pg_recvint64 looks the same as that of
fe_recvint64() defined in pg_basebackup.

I don't have any more inputs on this patch and it looks good to me.
So, I'm moving the status to ready for committer.

> At some point it would really make sense to group all things under the
> same banner (64-b LO, pg_basebackup, and now pg_rewind).
>
+1. Implementation-wise, I prefer pg_recvint64 to fe_recvint64.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

2017-07-07 Thread Craig Ringer
On 7 July 2017 at 15:10, K S, Sandhya (Nokia - IN/Bangalore) <
sandhya@nokia.com> wrote:

> Hi Craig,
>
>
>
> You were right about the restore_command.
>

This all makes sense then.

PostgreSQL sends SIGQUIT for immediate shutdown to its children. So the
restore_command would get signalled too.

Can't immediately explain the exit code, and SIGQUIT should _not_ generate
a core file. Can you show the result of attaching 'gdb' to the core file
and running 'bt full' ?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

2017-07-07 Thread Michael Paquier
On Fri, Jul 7, 2017 at 4:06 PM, Masahiko Sawada  wrote:
> I agree with this idea. I've tried to make it wait for archiving, but
> it seems to me that there are two other issues we need to deal with:
> the timeline ID on the standby server is always reset after a restart
> point is created, and the ThisTimeLineID variable of a backend process
> is not set when initializing on the standby node. The former would be
> easy to fix because resetting it is only for debugging purposes; for
> the latter, however, I'm still considering the influence on other things.

ThisTimeLineID does not need to be touched. It seems to me that you
should just use the wait timeline as minRecoveryPointTLI, and wait for
segments using this value. The segment and backup history files to
wait for should be defined with minRecoveryPoint.
-- 
Michael




Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

2017-07-07 Thread Masahiko Sawada
On Wed, Jul 5, 2017 at 4:57 PM, Michael Paquier
 wrote:
> On Wed, Jul 5, 2017 at 10:19 AM, Masahiko Sawada  
> wrote:
>> I feel that since we cannot force a WAL switch on the standby
>> server, we need to find a new way to do so. I'm not sure, but it might
>> be hard work and too late for PG10. Or did you mean that you have an
>> idea for this?
>
> Why not refactoring a bit do_pg_stop_backup() so as the wait phase
> happens even if a backup is started in recovery? Now wait_for_archive
> is ignored because no wait is happening and the stop point is directly
> returned back to the caller. For the wait actually happening, I don't
> have a better idea than documenting the fact that enforcing manually a
> segment switch on the primary needs to happen. That's better than
> having users including WAL in their base backups but not actually
> having everything they need. And I think that documenting that
> properly is better than restricting things that should work.

I agree with this idea. I've tried to make it wait for archiving, but
it seems to me that there are two other issues we need to deal with:
the timeline ID on the standby server is always reset after a restart
point is created, and the ThisTimeLineID variable of a backend process
is not set when initializing on the standby node. The former would be
easy to fix because resetting it is only for debugging purposes; for
the latter, however, I'm still considering the influence on other things.

> In most workloads, multiple WAL segments can be generated per second,
> and in even more of them a new segment generated would happen in less
> than a minute, so waiting for a segment switch on the primary should
> not be a problem for most users. The log letting user know about the
> wait should be more informative when things happen on a standby, like
> "waiting for segment to be finished or switched on the primary".
>
> If the restriction approach is preferred, I think that the check
> should happen in do_pg_stop_backup as well, and not in
> pg_stop_backup_v2 as your patch suggests. pg_basebackup is not able to
> do non-exclusive backups but this may happen some day, who knows..

Yeah, I understood.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server

2017-07-07 Thread Michael Paquier
On Wed, Jul 5, 2017 at 4:57 PM, Michael Paquier
 wrote:
> Why not refactoring a bit do_pg_stop_backup() so as the wait phase
> happens even if a backup is started in recovery? Now wait_for_archive
> is ignored because no wait is happening and the stop point is directly
> returned back to the caller. For the wait actually happening, I don't
> have a better idea than documenting the fact that enforcing manually a
> segment switch on the primary needs to happen. That's better than
> having users including WAL in their base backups but not actually
> having everything they need. And I think that documenting that
> properly is better than restricting things that should work.

While looking at that in more details, I got surprised by two things:
1) No backup history file is generated on a standby during a base backup.
2) Because of 1), those files are not archived even if archive_mode = always.

This sounds to me like a missing optimization of archive_mode =
always, and the following comment block in xlog.c is at least
incorrect as an archiver can be invoked in recovery:
 * XXX currently a backup history file is for informational and debug
 * purposes only. It's not essential for an online backup. Furthermore,
 * even if it's created, it will not be archived during recovery because
 * an archiver is not invoked. So it doesn't seem worthwhile to write a
 * backup history file during recovery.

So I would suggest the following things to address this issue:
1) Generate a backup history file for backups taken at recovery as well.
2) Archive it if archive_mode = always.
3) Wait for both the segment of the stop point and the backup history
files to be archived before returning back to the caller of
pg_stop_backup.

It would be nice to get all that addressed in 10. Thoughts?
-- 
Michael




Re: [HACKERS] Race conditions with WAL sender PID lookups

2017-07-07 Thread Michael Paquier
On Tue, Jul 4, 2017 at 1:34 PM, Noah Misch  wrote:
> Bundling code cleanup into commits that also do something else is strictly
> worse than bundling whitespace cleanup, which is itself bad:
> https://postgr.es/m/flat/20160113144826.gb3379...@tornado.leadboat.com

FWIW, I agree with that. I also favor separate commits for cleanups
and for fixes, so that each commit has its own goal and protects it.

(The cleanups discussed on this thread have been partially done in
commit 572d6ee where a bug has been fixed, not by me on the patches I
submitted ;) )
-- 
Michael

